Beating Adversarial Low-Rank MDPs with Unknown Transition and Bandit Feedback

Abstract

We study regret minimization in low-rank Markov Decision Processes (MDPs) with fixed but unknown transitions and adversarial losses. In the full-information loss feedback setting, we improve the previous best regret bound from poly(d,A,H)T^{5/6} to poly(d,A,H)T^{2/3}, and we initiate the study of the bandit loss feedback setting with unknown transitions. For the bandit case, we show that assuming a linear structure on the loss is necessary, and we propose both model-based and model-free algorithms achieving poly(d,A,H)T^{2/3} regret, as well as oracle-efficient model-free algorithms achieving poly(d,A,H)T^{4/5} regret.

Publication
NeurIPS 2024
Zak Mhammedi
Research Scientist

I work on the theoretical foundations of Reinforcement Learning, Controls, and Optimization.