Provable Policy Gradient for Robust Average-Reward MDPs Beyond Rectangularity

Qiuhao Wang (Co-first Author), Yuqi Zha (Co-first Author), Chin Pang Ho*, Marek Petrik*

*Corresponding author for this work

Research output: Conference Papers › Poster › peer-review

Abstract

Robust Markov Decision Processes (MDPs) offer a promising framework for computing reliable policies under model uncertainty. While policy gradient methods have gained increasing popularity in robust discounted MDPs, their application to the average-reward criterion remains largely unexplored. This paper proposes Robust Projected Policy Gradient (RP2G), the first generic policy gradient method for robust average-reward MDPs (RAMDPs) that is applicable beyond the typical rectangularity assumption on transition ambiguity. In contrast to existing robust policy gradient algorithms, RP2G incorporates an adaptive decreasing tolerance mechanism for efficient policy updates at each iteration. We also present a comprehensive convergence analysis of RP2G for solving ergodic tabular RAMDPs. Furthermore, we establish the first study of the inner worst-case transition evaluation problem in RAMDPs, proposing two gradient-based algorithms tailored for rectangular and general ambiguity sets, each with provable convergence guarantees. Numerical experiments confirm the global convergence of our new algorithm and demonstrate its superior performance.

Copyright 2025 by the author(s)
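The RP2G algorithm itself is not reproduced in this record. As a generic illustration only (not the authors' code), the abstract's notion of a projected policy gradient update for a tabular policy can be sketched as follows; `project_simplex` is the standard sorting-based Euclidean projection onto the probability simplex, and all function names and the toy problem sizes here are illustrative assumptions.

```python
import numpy as np

def project_simplex(v):
    """Standard sorting-based Euclidean projection of a vector v onto the
    probability simplex {p : p >= 0, sum(p) = 1}. Illustrative helper, not
    taken from the paper."""
    n = len(v)
    u = np.sort(v)[::-1]                      # sort in descending order
    css = np.cumsum(u)                        # cumulative sums of sorted entries
    # largest index rho with u[rho] * (rho + 1) > css[rho] - 1
    rho = np.nonzero(u * np.arange(1, n + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)      # shift that normalizes the sum
    return np.maximum(v - theta, 0.0)

def projected_pg_step(policy, grad, step):
    """One projected policy gradient ascent step: move each state's action
    distribution along its gradient row, then project back onto the simplex."""
    return np.array([project_simplex(p + step * g) for p, g in zip(policy, grad)])

# Toy usage: 2 states, 3 actions, random gradient direction (hypothetical data).
rng = np.random.default_rng(0)
pi = np.full((2, 3), 1.0 / 3.0)               # uniform initial policy
g = rng.normal(size=(2, 3))                   # stand-in for a policy gradient
pi_next = projected_pg_step(pi, g, step=0.5)  # rows remain valid distributions
```

This conveys only the outer projected update; the paper's contributions (the adaptive decreasing tolerance mechanism and the inner worst-case transition evaluation for rectangular and general ambiguity sets) are not captured by this sketch.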
Original language: English
Number of pages: 32
Publication status: Published - Jul 2025
Event: 42nd International Conference on Machine Learning, ICML 2025 - Vancouver Convention Center, Vancouver, Canada
Duration: 13 Jul 2025 – 19 Jul 2025
https://icml.cc/Conferences/2025

Conference

Conference: 42nd International Conference on Machine Learning, ICML 2025
Abbreviated title: ICML 2025
Place: Canada
City: Vancouver
Period: 13/07/25 – 19/07/25
Internet address: https://icml.cc/Conferences/2025

Bibliographical note

Information for this record is supplemented by the author(s) concerned.

Funding

This work was supported, in part, by CityUHK Start-Up Grant (Project No. 9610481) and the Research Grants Council of Hong Kong (General Research Fund, Project No. 11508623). This work was also supported, in part, by NSF grants 2144601 and 2218063.

Research Keywords

  • Markov decision process (MDP)
  • Average-reward Markov decision process
  • Policy gradient
  • Robust optimization

RGC Funding Information

  • RGC-funded
