Beating Adversarial Low-Rank MDPs with Unknown Transition and Bandit Feedback

Abstract

We study regret minimization in low-rank Markov Decision Processes (MDPs) with fixed but unknown transitions and adversarial losses. In the full-information loss feedback setting, we improve the previous best regret bound from poly(d,A,H)T^{5/6} to poly(d,A,H)T^{2/3}, and we initiate the study of the bandit loss feedback setting with unknown transitions. For the bandit case, we show that assuming a linear structure on the loss is necessary, and we propose both model-based and model-free algorithms achieving poly(d,A,H)T^{2/3} regret, as well as oracle-efficient model-free algorithms achieving poly(d,A,H)T^{4/5} regret.

Publication
NeurIPS 2024
Zak Mhammedi
Research Scientist

I work on the theoretical foundations of Reinforcement Learning, Controls, and Optimization.