Time-Varying Weights in Multi-Reward Architecture for Deep Reinforcement Learning

Meng Xu, Xinhong Chen, Yechao She, Yang Jin, Jianping Wang*

*Corresponding author for this work

Research output: Journal Publications and Reviews › RGC 21 - Publication in refereed journal › peer-review

10 Citations (Scopus)

Abstract

Deep Reinforcement Learning (DRL) research has recently focused on extracting more knowledge from the reward signal to improve sample efficiency. The Multi-Reward Architecture (MRA) achieves this by decomposing the original reward function into multiple sub-reward branches and training a source-specific policy branch for each one. However, existing MRAs treat all source-specific policy branches as equally important, or assign a constant level of importance based on current task conditions, which prevents DRL agents from prioritizing the most important branch at different task stages. It also necessitates a manual, time-consuming reset of the weights for each branch whenever task conditions change. It is therefore crucial to automate the time-varying importance assignment for branches. We propose a generic MRA approach to achieve this goal, which can be applied to improve state-of-the-art (SOTA) MRA methods. First, we add a policy branch corresponding to the original reward function, allowing the MRA to learn from each sub-reward branch without losing the experience provided by the original reward. Then, we apply the Asynchronous Advantage Actor Critic (A3C) algorithm to learn time-varying weights for all policy branches. These weights are shaped into a one-hot vector to select the suitable policy branch for producing an action. Extensive experiments demonstrate that the proposed method effectively improves three SOTA MRA methods across four tasks in terms of episode reward, success rate, score difference, and episode duration. © 2024 IEEE.
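To make the selection mechanism described in the abstract concrete, the following is a minimal sketch of the branch-selection step only, not the authors' implementation: several source-specific policy branches plus one branch for the original reward, a stand-in linear "weight network" for the A3C-learned time-varying weights, and the one-hot shaping that picks a single branch to produce the action. The branch count, dimensions, and linear policies are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N_BRANCHES = 3   # number of sub-reward branches (hypothetical)
STATE_DIM = 4    # illustrative state dimension
N_ACTIONS = 2    # illustrative discrete action space

# One linear policy per sub-reward branch, plus one extra branch
# corresponding to the original (undecomposed) reward function.
branch_policies = [rng.normal(size=(STATE_DIM, N_ACTIONS))
                   for _ in range(N_BRANCHES + 1)]

# Stand-in for the weights learned by A3C over the policy branches;
# in the paper these are time-varying, here a fixed linear map suffices
# to illustrate the selection mechanics.
weight_net = rng.normal(size=(STATE_DIM, N_BRANCHES + 1))

def select_action(state):
    # Compute per-branch importance weights for this state (softmax).
    logits = state @ weight_net
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Shape the weights into a one-hot vector: only the currently
    # most important branch is kept.
    one_hot = np.zeros_like(weights)
    one_hot[np.argmax(weights)] = 1.0
    branch = int(np.argmax(one_hot))
    # The selected branch's policy produces the action.
    action_scores = state @ branch_policies[branch]
    return branch, int(np.argmax(action_scores))

state = rng.normal(size=STATE_DIM)
branch, action = select_action(state)
```

Because the one-hot vector changes with the state, the agent can hand control to different branches at different task stages without any manual re-weighting.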
Original language: English
Pages (from-to): 1865-1881
Number of pages: 17
Journal: IEEE Transactions on Emerging Topics in Computational Intelligence
Volume: 8
Issue number: 2
Online published: 6 Feb 2024
DOIs
Publication status: Published - Apr 2024

Bibliographical note

Information for this record is supplemented by the author(s) concerned.

Funding

This work was supported in part by the Science and Technology Innovation Committee Foundation of Shenzhen under Grant JCYJ20200109143223052 and in part by Hong Kong Research Grant Council under GRF Grant 11218621.

Research Keywords

  • Deep reinforcement learning
  • multi-reward architecture
  • time-varying importance
  • A3C algorithm

RGC Funding Information

  • RGC-funded
