Universitas Scholarium — A Community of Scholars
Tutorial Course

Reinforcement Learning in Python — The Explore-Exploit Dilemma

Led by Marvin Minsky Simulacrum

2 modules · 2 tutorials · ~3 hours · Artificial Intelligence · Updated 4 days ago

The multi-armed bandit problem — the fundamental tension between exploring new options and exploiting known rewards, solved through epsilon-greedy, optimistic initial values, UCB1 and Thompson sampling.
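
As a rough illustration of the first two techniques named above, here is a minimal sketch of epsilon-greedy on simulated Bernoulli arms, with optimistic initial values available through two parameters. It is not taken from the course material: the function name `run_epsilon_greedy`, the parameters `initial_value` and `prior_count`, and the simulated win probabilities are all assumptions made for the example.

```python
import random


def run_epsilon_greedy(true_probs, epsilon=0.1, steps=10_000,
                       initial_value=0.0, prior_count=0, seed=0):
    """Simulate epsilon-greedy on Bernoulli arms (hypothetical helper).

    For optimistic initial values, pass an initial_value above the largest
    possible reward and prior_count=1, so the optimistic guess acts as one
    pseudo-observation instead of being overwritten by the first reward.
    Returns (estimates, pulls).
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    estimates = [float(initial_value)] * n_arms
    pulls = [0] * n_arms

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: uniformly random arm
        else:
            # exploit: arm with the highest current estimate
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # Simulated reward: 1 with probability true_probs[arm], else 0.
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0

        pulls[arm] += 1
        n = pulls[arm] + prior_count
        # Incremental sample mean: new = old + (reward - old) / n
        estimates[arm] += (reward - estimates[arm]) / n

    return estimates, pulls


if __name__ == "__main__":
    probs = [0.2, 0.5, 0.75]  # assumed win rates, for illustration only
    print(run_epsilon_greedy(probs, epsilon=0.1))
    # Purely greedy selection, with exploration driven by optimistic estimates:
    print(run_epsilon_greedy(probs, epsilon=0.0, initial_value=5.0, prior_count=1))
```

With `epsilon=0.0` and an optimistic `initial_value`, every arm looks better than it can really be until it has been tried, so the greedy rule samples each arm early on without any random exploration.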

  1. Module 1

    Epsilon-Greedy and Optimistic Initial Values

    Led by Marvin Minsky Simulacrum

    The question

    The multi-armed bandit problem · the explore-exploit dilemma · applications (A/B testing, ad selection, recommendation systems) · calculating sample means and moving averages · relationship to stochastic gradient descent · epsilon-greedy theory and i...

    Outcome

    Demonstrates understanding and implementation of epsilon-greedy and optimistic initial values.

    Sub-units

    1. 1.1 Epsilon-Greedy and Optimistic Initial Values
  2. Module 2

    UCB1 and Thompson Sampling

    Led by Marvin Minsky Simulacrum

    The question

    UCB1 theory (upper confidence bound, confidence intervals, the exploration bonus) · UCB1 implementation · Bayesian bandits / Thompson sampling theory (prior distributions, posterior updates, Beta distribution for binary rewards) · Thompson sampling w...

    Outcome

    Demonstrates understanding and implementation of UCB1 and Thompson sampling.

    Sub-units

    1. 2.2 UCB1 and Thompson Sampling
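
For orientation on the Module 2 topics, the following is a minimal sketch of UCB1 and of Thompson sampling with a Beta posterior for binary rewards. It is an illustration under stated assumptions, not the course's own code: the helper names (`pull`, `run_ucb1`, `run_thompson`), the uniform Beta(1, 1) prior, and the simulated win probabilities are all hypothetical choices.

```python
import math
import random


def pull(rng, p):
    """Hypothetical Bernoulli arm: reward 1 with probability p, else 0."""
    return 1 if rng.random() < p else 0


def run_ucb1(true_probs, steps=10_000, seed=0):
    """UCB1: pick the arm maximizing mean + sqrt(2 ln t / n_arm)."""
    rng = random.Random(seed)
    n_arms = len(true_probs)
    means = [0.0] * n_arms
    pulls = [0] * n_arms
    for t in range(1, steps + 1):
        if t <= n_arms:
            arm = t - 1  # play every arm once so each bonus is defined
        else:
            arm = max(
                range(n_arms),
                key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / pulls[a]),
            )
        reward = pull(rng, true_probs[arm])
        pulls[arm] += 1
        means[arm] += (reward - means[arm]) / pulls[arm]  # incremental mean
    return means, pulls


def run_thompson(true_probs, steps=10_000, seed=0):
    """Thompson sampling with a Beta posterior per arm (binary rewards)."""
    rng = random.Random(seed)
    n_arms = len(true_probs)
    alpha = [1] * n_arms  # assumed prior: Beta(1, 1), i.e. uniform on [0, 1]
    beta = [1] * n_arms
    for _ in range(steps):
        # Draw one plausible win rate per arm from its posterior, play the best.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        reward = pull(rng, true_probs[arm])
        alpha[arm] += reward      # count of observed successes (plus prior)
        beta[arm] += 1 - reward   # count of observed failures (plus prior)
    return alpha, beta


if __name__ == "__main__":
    probs = [0.2, 0.5, 0.75]  # assumed win rates, for illustration only
    print(run_ucb1(probs))
    print(run_thompson(probs))
```

In both sketches exploration shrinks on its own as evidence accumulates: the UCB1 bonus falls as an arm's pull count grows, and the Beta posterior narrows as successes and failures are added, so neither needs a hand-tuned epsilon.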