Scaling Up Q-Learning via Exploiting State–Action Equivalence
Publication: Contribution to journal › Journal article › Research › peer-reviewed
Documents
- Fulltext
Publisher's published version, 874 KB, PDF document
Recent success stories in reinforcement learning have demonstrated that leveraging structural properties of the underlying environment is key to devising viable methods capable of solving complex tasks. We study off-policy learning in discounted reinforcement learning, where some equivalence relation in the environment exists. We introduce a new model-free algorithm, called QL-ES (Q-learning with equivalence structure), which is a variant of (asynchronous) Q-learning tailored to exploit the equivalence structure in the MDP. We report a non-asymptotic PAC-type sample complexity bound for QL-ES, thereby establishing its sample efficiency. This bound also allows us to quantify analytically the superiority of QL-ES over Q-learning, showing that the theoretical gain in some domains can be massive. We report extensive numerical experiments demonstrating that QL-ES converges significantly faster than (structure-oblivious) Q-learning. These results indicate that the empirical performance gain obtained by exploiting the equivalence structure could be massive, even in simple domains. To the best of our knowledge, QL-ES is the first provably efficient model-free algorithm to exploit the equivalence structure in finite MDPs.
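To illustrate the idea of sharing updates across equivalent state–action pairs, here is a minimal sketch, not the authors' exact QL-ES algorithm: tabular Q-learning in which a single observed transition drives a TD update for every pair in a known equivalence class. It assumes the strongest form of equivalence (pairs in a class have identical transition and reward profiles); all names (`ql_es_sketch`, `env_step`, `classes`) and the toy MDP are hypothetical.

```python
import random
from collections import defaultdict

def ql_es_sketch(env_step, classes, n_states, n_actions,
                 episodes=500, horizon=50, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning that shares each TD update across a known
    equivalence class of state-action pairs (illustrative sketch only)."""
    Q = defaultdict(float)
    # Map each (state, action) pair to the list of pairs equivalent to it.
    class_of = {pair: cls for cls in classes for pair in cls}
    for _ in range(episodes):
        s = 0  # fixed start state for this toy setup
        for _ in range(horizon):
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda b: Q[(s, b)])
            s2, r, done = env_step(s, a)
            target = r + gamma * max(Q[(s2, b)] for b in range(n_actions))
            # Apply the same TD step to every pair equivalent to (s, a);
            # this is valid only because equivalent pairs are assumed to
            # share identical transition and reward profiles.
            for pair in class_of.get((s, a), [(s, a)]):
                Q[pair] += alpha * (target - Q[pair])
            s = s2
            if done:
                break
    return Q

# Toy 3-state MDP: from states 0 and 1, action 1 reaches the goal (state 2)
# with reward 1; action 0 toggles between states 0 and 1 with reward 0.
# States 0 and 1 are equivalent under both actions.
def env_step(s, a):
    if a == 1:
        return 2, 1.0, True
    return 1 - s, 0.0, False

random.seed(0)
classes = [[(0, 1), (1, 1)], [(0, 0), (1, 0)]]
Q = ql_es_sketch(env_step, classes, n_states=3, n_actions=2)
```

Because equivalent pairs start at zero and always receive identical updates, their Q-values stay exactly equal throughout learning, which is the mechanism behind the sample-efficiency gain: experience gathered at one pair is reused at every pair in its class.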
Original language | English |
---|---|
Article number | 584 |
Journal | Entropy |
Volume | 25 |
Issue number | 4 |
ISSN | 1099-4300 |
DOI | |
Status | Published - 2023 |
Bibliographic note
Funding Information:
The authors would like to thank the reviewers for their constructive comments. Yunlian Lyu was supported by the China Scholarship Council (CSC) and the Department of Computer Science at the University of Copenhagen. Aymeric Côme was supported by the French government through the Program “Investissement d’avenir” (I-SITE ULNE/ANR-16-IDEX-0004 ULNE) managed by the National Research Agency. This work was partially done while Aymeric Côme was completing an internship at the Department of Computer Science at the University of Copenhagen.
Publisher Copyright:
© 2023 by the authors.