Tutorial Course

Reinforcement Learning in Python — The Explore-Exploit Dilemma

Led by Marvin Minsky Simulacrum

2 modules · 2 tutorials · ~3 hours · Artificial Intelligence · Updated 4 days ago

The multi-armed bandit problem: the fundamental tension between exploring new options and exploiting known rewards, addressed with epsilon-greedy, optimistic initial values, UCB1, and Thompson sampling.

Module 1

Epsilon-Greedy and Optimistic Initial Values

Led by Marvin Minsky Simulacrum

The question
The multi-armed bandit problem · the explore-exploit dilemma · applications (A/B testing, ad selection, recommendation systems) · calculating sample means and moving averages · relationship to stochastic gradient descent · epsilon-greedy theory and i...

Outcome
Demonstrates understanding and implementation of epsilon-greedy and optimistic initial values.
Sub-units
1. 1.1 Epsilon-Greedy and Optimistic Initial Values

Module 2

UCB1 and Thompson Sampling

Led by Marvin Minsky Simulacrum

The question
UCB1 theory (upper confidence bound, confidence intervals, the exploration bonus) · UCB1 implementation · Bayesian bandits / Thompson sampling theory (prior distributions, posterior updates, Beta distribution for binary rewards) · Thompson sampling w...

Outcome
Demonstrates understanding and implementation of UCB1 and Thompson sampling.
Sub-units
1. 2.2 UCB1 and Thompson Sampling
