Dynamic Weights and Prior Reward in Policy Fusion for Compound Agent Learning

Research output: Journal Publications and Reviews; RGC 21 - Publication in refereed journal; peer-review

4 Scopus Citations

Detail(s)

Original language: English
Article number: 107
Number of pages: 28
Journal / Publication: ACM Transactions on Intelligent Systems and Technology
Volume: 14
Issue number: 6
Online published: 14 Nov 2023
Publication status: Published - Dec 2023

Abstract

In the Deep Reinforcement Learning (DRL) domain, a compound learning task is often decomposed into several sub-tasks in a divide-and-conquer manner, each trained separately and then fused concurrently to achieve the original task, a process referred to as policy fusion. However, state-of-the-art (SOTA) policy fusion methods treat all sub-tasks as equally important throughout the task process, eliminating the possibility of the agent relying on different sub-tasks at different stages. To address this limitation, we propose a generic policy fusion approach, referred to as Policy Fusion Learning with Dynamic Weights and Prior Reward (PFLDWPR), to automate the time-varying selection of sub-tasks. Specifically, PFLDWPR produces a time-varying one-hot vector over the sub-tasks to dynamically select a suitable sub-task and mask the rest throughout the entire task process, enabling the fused strategy to optimally guide the agent in executing the compound task. The sub-tasks, weighted by the dynamic one-hot vector, are then aggregated to obtain the action policy for the original task. Moreover, we collect the sub-tasks' rewards at the pre-training stage as a prior reward, which, along with the current reward, is used to train the policy fusion network. This approach thus reduces fusion bias by leveraging prior experience. Experimental results on three popular learning tasks demonstrate that the proposed method significantly improves three SOTA policy fusion methods in terms of task duration, episode reward, and score difference. © 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
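
The selection-and-aggregation idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the one-hot vector is produced or how the prior and current rewards are combined. The argmax over hypothetical per-sub-task scores, the additive reward mix, the coefficient `beta`, and all function names below are assumptions made purely for illustration.

```python
from typing import List, Sequence


def one_hot(scores: Sequence[float]) -> List[int]:
    """Return a one-hot vector that selects the highest-scoring sub-task.

    Assumption: selection is a plain argmax over per-sub-task scores.
    """
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1 if i == best else 0 for i in range(len(scores))]


def fuse_policies(selection: Sequence[int],
                  sub_policies: Sequence[Sequence[float]]) -> List[float]:
    """Aggregate sub-task action distributions weighted by the one-hot vector.

    With a one-hot selection this keeps one sub-policy and masks the rest.
    """
    n_actions = len(sub_policies[0])
    return [sum(w * p[a] for w, p in zip(selection, sub_policies))
            for a in range(n_actions)]


def training_reward(current: float, prior: float, beta: float = 0.5) -> float:
    """Hypothetical additive mix of the current reward and the prior reward.

    Both the additive form and beta are assumptions, not taken from the paper.
    """
    return current + beta * prior


# Two hypothetical sub-tasks, each with a distribution over three actions.
sub_policies = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]

# The scores change over time, so the selected sub-task changes too.
fused_t0 = fuse_policies(one_hot([2.0, 0.5]), sub_policies)  # picks sub-task 0
fused_t1 = fuse_policies(one_hot([0.3, 1.5]), sub_policies)  # picks sub-task 1
```

Because the weight vector is one-hot, the fused policy at each step equals exactly one sub-policy. Equal weighting, by contrast, would average the two distributions at every step.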

Research Area(s)

  • Compound agent learning, deep reinforcement learning, policy fusion, dynamic weights, prior reward

Bibliographic Note

Research Unit(s) information for this publication is provided by the author(s) concerned.