Reinforcement Learning (RL)

Learn to make decisions in an environment by maximizing long-term rewards.

3 papers

Written by Junkun Yuan.



Table of contents:

Papers are displayed in reverse chronological order. High-impact or inspiring works are highlighted in red.

Foundation Algorithms & Models

An Empirical Evaluation of Thompson Sampling

Olivier Chapelle, Lihong Li

Yahoo! Research

Advances in Neural Information Processing Systems (NeurIPS), 2011

Dec 12, 2011   |   Thompson Sampling


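It presents one of the first large-scale empirical demonstrations that Thompson sampling is competitive with, and often better than, state-of-the-art methods such as UCB on real-world bandit problems, including display advertising and news article recommendation.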

(see notes in jupyter)
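
To make the idea concrete, here is a minimal sketch of Thompson sampling for a Bernoulli bandit with independent Beta(1, 1) priors. It is an illustration only, not the experimental setup of the paper: the function name `thompson_sampling`, the arm probabilities, and the simulated rewards are all assumptions made for this note.

```python
import random


def thompson_sampling(true_probs, n_rounds, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling (illustrative sketch).

    true_probs: hypothetical success probability of each arm (unknown to the agent).
    Returns per-arm pull counts, success counts, and failure counts.
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms

    for _ in range(n_rounds):
        # Draw one sample per arm from its posterior Beta(successes + 1, failures + 1).
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_arms)
        ]
        # Play the arm whose sampled mean is largest.
        arm = max(range(n_arms), key=lambda a: samples[a])

        # Simulated Bernoulli reward from the (assumed) environment.
        reward = 1 if rng.random() < true_probs[arm] else 0

        # Posterior update: a Beta prior is conjugate to a Bernoulli likelihood.
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1

    return pulls, successes, failures


if __name__ == "__main__":
    pulls, successes, failures = thompson_sampling([0.1, 0.5, 0.9], n_rounds=2000)
    print("pulls per arm:", pulls)
```

Exploration comes from posterior randomness alone: an arm with few observations has a wide posterior and is occasionally sampled high, so no explicit exploration bonus is needed.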

Finite-time Analysis of the Multiarmed Bandit Problem

Peter Auer, Nicolò Cesa-Bianchi, Paul Fischer

Graz University of Technology, University of Milan, University of Dortmund

Machine Learning, 2002

May 01, 2002   |   ε-greedy & UCB

It shifted bandit research from asymptotic to finite-time analysis by showing that logarithmic regret is achievable uniformly over time, not just in the limit, for all reward distributions with bounded support. These anytime guarantees sparked a wave of refined algorithms and analyses. It has over 9,000 citations (as of Aug 2025).


It proposes index-based (UCB) and ε-greedy policies that achieve finite-time logarithmic regret bounds for the multi-armed bandit problem with bounded rewards.

(see notes in jupyter)
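
As a rough illustration of the index-based policy, the sketch below implements the UCB1 rule on a simulated Bernoulli bandit: play each arm once, then always play the arm with the largest empirical mean plus the bonus sqrt(2 ln n / n_j), where n is the total number of plays so far and n_j the number of plays of arm j. The function name `ucb1`, the arm probabilities, and the simulated environment are assumptions made for this note.

```python
import math
import random


def ucb1(true_probs, n_rounds, seed=0):
    """Simulate the UCB1 index policy on a Bernoulli bandit (illustrative sketch).

    true_probs: hypothetical success probability of each arm (unknown to the agent).
    Returns per-arm play counts and empirical mean rewards.
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    counts = [0] * n_arms
    means = [0.0] * n_arms

    for t in range(n_rounds):
        if t < n_arms:
            # Initialization: play each arm once.
            arm = t
        else:
            # t equals the total number of plays so far (t >= n_arms >= 1).
            arm = max(
                range(n_arms),
                key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
            )

        # Simulated reward in [0, 1] from the (assumed) environment.
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0

        # Incremental update of the empirical mean of the played arm.
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]

    return counts, means


if __name__ == "__main__":
    counts, means = ucb1([0.2, 0.5, 0.8], n_rounds=5000)
    print("plays per arm:", counts)
```

The bonus term shrinks as an arm is played more often and grows slowly with the total number of plays, so every arm keeps being revisited, but suboptimal arms only about logarithmically often.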

Reinforcement Learning: An Introduction

Richard S. Sutton, Andrew G. Barto

University of Massachusetts Amherst, Carnegie Mellon University

MIT Press, Cambridge, MA, 1998

Jan 01, 1998   |   RL Introduction

It systematizes the foundations of RL by unifying dynamic programming, Monte Carlo methods, and temporal-difference learning into a coherent framework, establishing the theoretical and algorithmic basis for modern RL research. It has over 80,000 citations (as of Aug 2025).


It formalizes the core concepts, algorithms, and theoretical foundations of RL, establishing it as a coherent and accessible discipline.

(see notes in jupyter)
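
As one small example of the temporal-difference learning covered in the book, the sketch below runs tabular TD(0) prediction on a five-state random walk in the spirit of the book's random-walk example: episodes start in the center, move left or right with equal probability, and give reward 1 only when terminating on the right. The function name `td0_random_walk` and the default step size are assumptions made for this note.

```python
import random


def td0_random_walk(n_episodes, alpha=0.1, gamma=1.0, n_states=5, seed=0):
    """Estimate state values of a random walk with tabular TD(0) (illustrative sketch).

    States 1..n_states are non-terminal; 0 and n_states + 1 are terminal.
    Returns the estimated values of the non-terminal states, left to right.
    """
    rng = random.Random(seed)
    # Terminal states keep value 0; non-terminal states start at 0.5.
    values = [0.0] + [0.5] * n_states + [0.0]

    for _ in range(n_episodes):
        state = (n_states + 1) // 2  # start in the center
        while 0 < state < n_states + 1:
            next_state = state + rng.choice((-1, 1))
            reward = 1.0 if next_state == n_states + 1 else 0.0
            # TD(0) update: move V(s) toward the bootstrapped target r + gamma * V(s').
            td_error = reward + gamma * values[next_state] - values[state]
            values[state] += alpha * td_error
            state = next_state

    return values[1:-1]


if __name__ == "__main__":
    estimates = td0_random_walk(n_episodes=1000)
    print([round(v, 2) for v in estimates])
```

For this walk with gamma = 1 the true values are 1/6, 2/6, ..., 5/6, so the estimates should hover around those numbers, with noise that depends on the constant step size.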

Last updated on May 18, 2026 at 10:47 (UTC-7).