Reinforcement Learning (RL)

It fundamentally shifted bandit research by providing the first distribution-free, finite-time regret bounds, which enabled practical anytime performance guarantees and sparked a wave of refined algorithms and analyses. It has over 9,000 citations (as of Aug 2025).

It proposes index-based and ε-greedy policies that achieve finite-time logarithmic regret bounds for the multi-armed bandit problem with bounded rewards.
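
To make the idea of an index-based policy concrete, here is a minimal sketch of one upper-confidence-bound index rule (commonly known as UCB1). It assumes rewards in [0, 1]. The Bernoulli test bandit, the function names, the horizon, and the seed are illustrative assumptions, not details taken from the entry above.

```python
import math
import random


def ucb1(pull, n_arms, horizon):
    """Run a UCB1-style index policy for `horizon` rounds.

    `pull(arm)` is a hypothetical callback that returns a reward in [0, 1].
    Returns the per-arm pull counts and empirical mean rewards.
    """
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(horizon):  # t = number of plays made so far
        if t < n_arms:
            arm = t  # initialization: play every arm once
        else:
            # Index = empirical mean + exploration bonus that shrinks
            # as an arm is sampled more often.
            arm = max(
                range(n_arms),
                key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
    return counts, means


def make_bernoulli_bandit(probs, seed=0):
    """Illustrative test environment: arm i pays 1 with probability probs[i]."""
    rng = random.Random(seed)
    return lambda arm: 1.0 if rng.random() < probs[arm] else 0.0


counts, means = ucb1(make_bernoulli_bandit([0.2, 0.5, 0.8]), n_arms=3, horizon=5000)
print(counts)  # the arm with success probability 0.8 should dominate
```

The exploration bonus grows only logarithmically with the number of plays, which is the intuition behind suboptimal arms being pulled on the order of log(n) times.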

(see notes in Jupyter)

Reinforcement Learning: An Introduction

Richard S. Sutton, Andrew G. Barto

University of Massachusetts Amherst, Carnegie Mellon University

Cambridge, 1998

Jan 01, 1998 | RL Introduction

It systematizes the foundations of RL by unifying dynamic programming, Monte Carlo methods, and temporal-difference learning into a coherent framework, establishing the theoretical and algorithmic basis for modern RL research. It has over 80,000 citations (as of Aug 2025).

It formalizes the core concepts, algorithms, and theory of RL, presenting the field as a single, accessible discipline.
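
As a small illustration of the temporal-difference learning mentioned above, the following sketch runs tabular TD(0) prediction on a five-state random walk. The environment layout, step size, episode count, seed, and function name are assumptions chosen for illustration, not a reproduction of any specific listing from the book.

```python
import random


def td0_random_walk(n_states=5, episodes=1000, alpha=0.05, gamma=1.0, seed=0):
    """Estimate state values of a symmetric random walk with tabular TD(0).

    States 1..n_states are non-terminal; 0 and n_states + 1 are terminal.
    Reaching the right terminal state pays reward 1, everything else pays 0.
    """
    rng = random.Random(seed)
    # Terminal values stay at 0; non-terminal values start at 0.5.
    values = [0.0] + [0.5] * n_states + [0.0]
    for _ in range(episodes):
        state = (n_states + 1) // 2  # start each episode in the middle
        while 0 < state < n_states + 1:
            next_state = state + rng.choice((-1, 1))
            reward = 1.0 if next_state == n_states + 1 else 0.0
            # TD(0) update: move V(s) toward the bootstrapped target
            # r + gamma * V(s') after every single step.
            td_error = reward + gamma * values[next_state] - values[state]
            values[state] += alpha * td_error
            state = next_state
    return values[1:-1]


estimates = td0_random_walk()
true_values = [i / 6 for i in range(1, 6)]  # exact values for this walk
print([round(v, 2) for v in estimates])
```

Unlike a Monte Carlo update, which waits for the episode's final return, the update here uses the current estimate of the next state's value, which is the bootstrapping idea shared with dynamic programming.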

(see notes in Jupyter)

Last updated on June 06, 2026 at 13:12 (UTC-7).