Learn to make decisions in an environment by maximizing long-term rewards.
3 papers
Written by Junkun Yuan.
Table of contents:
Papers are displayed in reverse chronological order. High-impact or inspiring works are highlighted in red.
Thompson Sampling (NeurIPS 2011)
ε-greedy & UCB (Machine Learning 2002)
RL Introduction (Cambridge 1998)
An Empirical Evaluation of Thompson Sampling
Olivier Chapelle, Lihong Li
Yahoo! Research
Advances in Neural Information Processing Systems (NeurIPS), 2011
Dec 12, 2011 | Thompson Sampling
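It presents a large-scale empirical evaluation showing that Thompson sampling is competitive with, and often better than, state-of-the-art methods on real-world bandit problems such as display advertising and news article recommendation.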
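As a rough illustration of the idea behind Thompson sampling, here is a minimal sketch for a Bernoulli bandit. It assumes binary rewards and a uniform Beta(1, 1) prior on each arm; the function name `thompson_sampling` and the `true_means` values are hypothetical and are not taken from the paper's experiments.

```python
import random


def thompson_sampling(true_means, horizon, seed=0):
    """Run Beta-Bernoulli Thompson sampling and return pull counts per arm.

    true_means: hypothetical success probabilities, one per arm (assumption).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms

    for _ in range(horizon):
        # Draw one sample from each arm's Beta posterior (Beta(1, 1) prior).
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_arms)
        ]
        # Play the arm whose sampled mean is largest.
        arm = max(range(n_arms), key=lambda a: samples[a])

        # Simulate a Bernoulli reward from the (unknown to the agent) true mean.
        reward = 1 if rng.random() < true_means[arm] else 0
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
        pulls[arm] += 1

    return pulls


if __name__ == "__main__":
    print(thompson_sampling([0.2, 0.5, 0.8], horizon=2000))
```

Sampling from the posterior, rather than acting on the posterior mean, is what produces exploration here: arms with few observations have wide posteriors and are occasionally sampled high enough to be tried again.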
Finite-time Analysis of the Multiarmed Bandit Problem
Peter Auer, Nicolò Cesa-Bianchi, Paul Fischer
University of Technology Graz, University of Milan, University Dortmund
Machine Learning, 2002
May 01, 2002 | ε-greedy & UCB
It reshaped bandit research by showing that logarithmic regret can be achieved uniformly over time, not only asymptotically, for any reward distributions with bounded support. These finite-time guarantees gave practical, anytime performance bounds and sparked a wave of refined algorithms and analyses. It has over 9,000 citations (as of Aug 2025).
It proposes index-based policies (such as UCB1) and an ε-greedy policy with a decaying exploration rate that achieve finite-time logarithmic regret bounds for multi-armed bandit problems with bounded rewards.
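The index-based idea can be sketched with UCB1, which plays the arm maximizing the empirical mean plus the exploration bonus sqrt(2 ln n / n_j), where n is the total number of plays so far and n_j the number of plays of arm j. The sketch below assumes Bernoulli rewards in {0, 1}; the function name `ucb1` and the `true_means` values are hypothetical.

```python
import math
import random


def ucb1(true_means, horizon, seed=0):
    """Run UCB1 on a simulated Bernoulli bandit and return pull counts per arm.

    true_means: hypothetical success probabilities, one per arm (assumption).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms      # n_j: number of times arm j was played
    sums = [0.0] * n_arms      # cumulative reward of arm j

    for _ in range(horizon):
        total = sum(counts)    # n: total number of plays so far
        if total < n_arms:
            # Initialization: play each arm once.
            arm = total
        else:
            # Index = empirical mean + confidence bonus.
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(total) / counts[a]),
            )

        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward

    return counts


if __name__ == "__main__":
    print(ucb1([0.2, 0.5, 0.8], horizon=5000))
```

The bonus shrinks as an arm is played more often and grows slowly with the total number of plays, so every arm keeps being revisited, but suboptimal arms are revisited less and less frequently.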
Reinforcement Learning: An Introduction
Richard S. Sutton, Andrew G. Barto
University of Massachusetts Amherst, Carnegie Mellon University
MIT Press, Cambridge, 1998
Jan 01, 1998 | RL Introduction
It systematizes the foundations of RL by unifying dynamic programming, Monte Carlo methods, and temporal-difference learning into a coherent framework, establishing the theoretical and algorithmic basis for modern RL research. It has over 80,000 citations (as of Aug 2025).
It formalizes the core concepts, algorithms, and theoretical foundations of RL, establishing it as a coherent and accessible discipline.
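To make temporal-difference learning concrete, here is a minimal sketch of tabular TD(0) value prediction on a five-state random walk, in the spirit of the small prediction examples used in the book. The environment details (start in the center, reward 1 only when exiting on the right), the step size, and the function name `td0_random_walk` are assumptions chosen for illustration.

```python
import random


def td0_random_walk(episodes=5000, alpha=0.01, gamma=1.0, seed=0):
    """Estimate state values of a 5-state random walk with tabular TD(0).

    The walk starts in the center state and moves left or right with equal
    probability. Exiting on the right gives reward 1, exiting on the left
    gives reward 0, and all other transitions give reward 0.
    """
    rng = random.Random(seed)
    n_states = 5
    values = [0.5] * n_states  # arbitrary initial guess

    for _ in range(episodes):
        state = n_states // 2
        while True:
            next_state = state + rng.choice((-1, 1))
            if next_state < 0:
                reward, next_value, done = 0.0, 0.0, True   # left terminal
            elif next_state >= n_states:
                reward, next_value, done = 1.0, 0.0, True   # right terminal
            else:
                reward, next_value, done = 0.0, values[next_state], False

            # TD(0) update: move V(s) toward the bootstrapped target.
            td_error = reward + gamma * next_value - values[state]
            values[state] += alpha * td_error

            if done:
                break
            state = next_state

    return values


if __name__ == "__main__":
    print([round(v, 2) for v in td0_random_walk()])
```

For this walk the true values are 1/6, 2/6, 3/6, 4/6, and 5/6 from left to right, so the estimates can be checked against them. Unlike a Monte Carlo method, the update uses the current estimate of the next state rather than waiting for the episode's final return.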
Last updated on May 18, 2026 at 10:47 (UTC-7).