Time-Varying Weights in Multi-Reward Architecture for Deep Reinforcement Learning
Research output: Journal Publications and Reviews › RGC 21 - Publication in refereed journal › peer-review
Detail(s)
Original language | English |
---|---|
Pages (from-to) | 1865-1881 |
Number of pages | 17 |
Journal / Publication | IEEE Transactions on Emerging Topics in Computational Intelligence |
Volume | 8 |
Issue number | 2 |
Online published | 6 Feb 2024 |
Publication status | Published - Apr 2024 |
Link(s)
Link to Scopus | https://www.scopus.com/record/display.uri?eid=2-s2.0-85184797269&origin=recordpage |
---|---|
Permanent Link | https://scholars.cityu.edu.hk/en/publications/publication(8ccafc3a-9b11-42bc-afde-e30192601d00).html |
Abstract
Deep Reinforcement Learning (DRL) research has recently focused on extracting more knowledge from the reward signal to improve sample efficiency. The Multi-Reward Architecture (MRA) achieves this by decomposing the original reward function into multiple sub-reward branches and training a source-specific policy branch for each one. However, existing MRAs treat all source-specific policy branches as equally important or assign a constant level of importance based on current task conditions, which prevents DRL agents from prioritizing the most important branch at different task stages. It also necessitates a manual, time-consuming reset of the branch weights whenever task conditions change. It is therefore crucial to automate the time-varying importance assignment for branches. We propose a generic MRA approach to achieve this goal, which can be applied to improve state-of-the-art (SOTA) MRA methods. First, we add a policy branch corresponding to the original reward function, allowing the MRA to learn from each sub-reward branch without losing the experience provided by the original reward. Then, we apply the Asynchronous Advantage Actor-Critic (A3C) algorithm to learn time-varying weights for all policy branches. These weights are shaped into a one-hot vector that selects the suitable policy branch for producing an action. Extensive experiments demonstrate that the proposed method effectively improves three SOTA MRA methods across four tasks in terms of episode reward, success rate, score difference, and episode duration. © 2024 IEEE.
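The abstract describes the branch-selection mechanism only at a high level. The following is a minimal, illustrative PyTorch sketch of that idea, not the authors' implementation: the class name `MultiRewardPolicy`, the MLP sizes, and the plain softmax-plus-argmax selector are assumptions, and the A3C training loop the paper uses to learn the branch weights is omitted.

```python
import torch
import torch.nn as nn


class MultiRewardPolicy(nn.Module):
    """Sketch of a multi-reward architecture with time-varying branch weights."""

    def __init__(self, obs_dim: int, act_dim: int, num_sub_rewards: int, hidden: int = 64):
        super().__init__()
        num_branches = num_sub_rewards + 1  # extra branch for the original (undecomposed) reward
        # One small policy head per reward branch.
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, act_dim))
            for _ in range(num_branches)
        ])
        # Weight network: scores the importance of each branch for the current state.
        # In the paper these weights are learned with A3C; here it is a plain MLP stand-in.
        self.weight_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_branches)
        )

    def forward(self, obs: torch.Tensor):
        # Action logits from every branch: (batch, num_branches, act_dim).
        branch_logits = torch.stack([branch(obs) for branch in self.branches], dim=1)
        # Time-varying branch weights, hardened into a one-hot selector.
        weights = torch.softmax(self.weight_net(obs), dim=-1)  # (batch, num_branches)
        one_hot = torch.nn.functional.one_hot(
            weights.argmax(dim=-1), num_classes=branch_logits.size(1)
        ).float()
        # The one-hot vector picks a single branch's logits to produce the action.
        selected_logits = (one_hot.unsqueeze(-1) * branch_logits).sum(dim=1)
        action = torch.distributions.Categorical(logits=selected_logits).sample()
        return action, weights


if __name__ == "__main__":
    policy = MultiRewardPolicy(obs_dim=8, act_dim=4, num_sub_rewards=3)
    obs = torch.randn(2, 8)
    action, weights = policy(obs)
    print(action.shape, weights.shape)  # torch.Size([2]) torch.Size([2, 4])
```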
Research Area(s)
- Deep reinforcement learning, multi-reward architecture, time-varying importance, A3C algorithm
Bibliographic Note
Information for this record is supplemented by the author(s) concerned.
Citation Format(s)
Time-Varying Weights in Multi-Reward Architecture for Deep Reinforcement Learning. / Xu, Meng; Chen, Xinhong; She, Yechao et al.
In: IEEE Transactions on Emerging Topics in Computational Intelligence, Vol. 8, No. 2, 04.2024, p. 1865-1881.
Research output: Journal Publications and Reviews › RGC 21 - Publication in refereed journal › peer-review