Time-Varying Weights in Multi-Reward Architecture for Deep Reinforcement Learning

Research output: Journal Publications and Reviews › Publication in refereed journal › peer-review


Detail(s)

Original language: English
Number of pages: 17
Journal / Publication: IEEE Transactions on Emerging Topics in Computational Intelligence
Online published: 6 Feb 2024
Publication status: Online published - 6 Feb 2024

Abstract

Recent work in Deep Reinforcement Learning (DRL) has focused on extracting more knowledge from the reward signal to improve sample efficiency. The Multi-Reward Architecture (MRA) achieves this by decomposing the original reward function into multiple sub-reward branches and training a source-specific policy branch for each one. However, existing MRAs treat all source-specific policy branches as equally important, or assign a constant level of importance based on current task conditions, which prevents DRL agents from prioritizing the most important branch at different task stages. It also necessitates a manual, time-consuming reset of branch weights whenever task conditions change. It is therefore crucial to automate the time-varying importance assignment for branches. We propose a generic MRA approach to achieve this goal, which can be applied to improve state-of-the-art (SOTA) MRA methods. First, we add a policy branch corresponding to the original reward function, allowing MRA to learn from each sub-reward branch without losing the experience provided by the original reward. We then apply the Asynchronous Advantage Actor-Critic (A3C) algorithm to learn time-varying weights for all policy branches. These weights are shaped into a one-hot vector that selects the policy branch used to produce an action. Extensive experiments demonstrate that the proposed method improves three SOTA MRA methods across four tasks in terms of episode reward, success rate, score difference, and episode duration. © 2024 IEEE.
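The branch-selection step described above can be sketched minimally as follows. This is an illustrative reconstruction, not the authors' implementation: the weight values, the number of branches, and the `select_branch` helper are all hypothetical, and the learned time-varying weights would in practice come from an A3C-trained network rather than a fixed array.

```python
import numpy as np

def select_branch(weights):
    """Shape time-varying branch weights into a one-hot vector.

    `weights` holds one importance score per policy branch (the
    sub-reward branches plus the added original-reward branch).
    The branch with the highest weight is selected to produce
    the action at the current task stage.
    """
    one_hot = np.zeros_like(weights)
    idx = int(np.argmax(weights))  # index of the most important branch
    one_hot[idx] = 1.0
    return one_hot, idx

# Hypothetical weights for three sub-reward branches plus the
# original-reward branch at one time step.
w = np.array([0.1, 0.5, 0.2, 0.2])
one_hot, branch = select_branch(w)
```

Because only one branch is active per step, the remaining branches still train on their own sub-rewards while the selected branch drives behavior.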

Research Area(s)

  • Deep reinforcement learning, multi-reward architecture, time-varying importance, A3C algorithm

Bibliographic Note

Information for this record is supplemented by the author(s) concerned.