Universitas Scholarium: A Community of Scholars
Tutorial Course

COMP 2205 · Machine Learning: Reinforcement Learning

Led by Suttonesque Analysis Simulacrum

5 modules · Computing · Updated 6 days ago

The exploration/exploitation dilemma, Upper Confidence Bound, Thompson Sampling, and the Bellman equation. Based on the writings of Richard Bellman.

  1. Module 1

    The Exploration/Exploitation Dilemma

    The question

    You have 10 ad creatives and want to find the best-performing one. Every impression you spend on a bad ad adds to your regret. Every impression you spend exploring has an opportunity cost. How do you find the best ad quickly without wasting too many impressions on bad ones?

    Outcome

    The student can describe the exploration/exploitation trade-off and define regret.

    Sub-units

    1. 1.1 The Dilemma in Concrete Terms
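
To make "regret" concrete, here is a minimal sketch of the 10-creative scenario. It assumes each ad has a fixed click-through rate and defines regret as the gap between the best ad's rate and the chosen ad's rate, summed over impressions. The rates in `TRUE_CTR` and the helper names are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical click-through rates for 10 ad creatives (illustrative values only).
TRUE_CTR = [0.02, 0.03, 0.05, 0.04, 0.01, 0.06, 0.03, 0.02, 0.08, 0.04]


def expected_regret(choices, true_means):
    """Sum of (best mean - chosen arm's mean) over all impressions."""
    best = max(true_means)
    return sum(best - true_means[arm] for arm in choices)


def random_policy(n_arms, n_rounds, rng):
    """Pure exploration: pick an ad uniformly at random on every impression."""
    return [rng.randrange(n_arms) for _ in range(n_rounds)]


rng = random.Random(0)
n_rounds = 10_000
random_choices = random_policy(len(TRUE_CTR), n_rounds, rng)
# An oracle that already knows the best ad never explores and has zero regret.
oracle_choices = [TRUE_CTR.index(max(TRUE_CTR))] * n_rounds

print("regret of random policy:", round(expected_regret(random_choices, TRUE_CTR), 1))
print("regret of oracle policy:", expected_regret(oracle_choices, TRUE_CTR))
```

A policy that only explores pays regret on almost every impression, while the oracle pays none. A practical algorithm has to land between the two without knowing the rates in advance.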
  2. Module 2

    Upper Confidence Bound

    The question

    UCB is optimistic in the face of uncertainty: the less an arm has been played, the more benefit of the doubt it gets. For each arm, compute the estimated mean plus an exploration bonus that shrinks as the arm is played more, then choose the arm with the highest total. Why does this automatically balance exploration and exploitation, and what does convergence look like?

    Outcome

    The student can implement UCB and compare cumulative reward to random selection.

    Sub-units

    1. 2.1 Implement UCB
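
The rule described in this module can be sketched as follows. This is one common variant, UCB1, whose bonus is sqrt(2 ln t / n) for an arm played n times by round t; the course may use a different bonus. The Bernoulli arms in `ARM_MEANS`, the round count, and the function names are assumptions made for illustration.

```python
import math
import random


def ucb1(true_means, n_rounds, rng):
    """Run UCB1 on Bernoulli arms; return (cumulative reward, pull counts)."""
    n_arms = len(true_means)
    counts = [0] * n_arms    # times each arm was played
    means = [0.0] * n_arms   # running mean reward per arm
    total = 0
    for t in range(1, n_rounds + 1):
        if t <= n_arms:
            arm = t - 1      # play every arm once to initialise the estimates
        else:
            # Estimated mean plus a bonus that shrinks as counts[a] grows.
            arm = max(
                range(n_arms),
                key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = 1 if rng.random() < true_means[arm] else 0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
        total += reward
    return total, counts


def random_selection(true_means, n_rounds, rng):
    """Baseline: choose an arm uniformly at random on every round."""
    total = 0
    for _ in range(n_rounds):
        arm = rng.randrange(len(true_means))
        total += 1 if rng.random() < true_means[arm] else 0
    return total


ARM_MEANS = [0.2, 0.5, 0.8]  # hypothetical success probabilities
N_ROUNDS = 5000
ucb_total, ucb_counts = ucb1(ARM_MEANS, N_ROUNDS, random.Random(0))
random_total = random_selection(ARM_MEANS, N_ROUNDS, random.Random(1))
print("UCB1 reward:", ucb_total, "pulls per arm:", ucb_counts)
print("random reward:", random_total)
```

The pull counts show the balance at work: weaker arms are still tried occasionally, because their bonus grows slowly while they sit idle, but most rounds go to the arm with the best estimate.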
  3. Module 3

    Thompson Sampling

    The question

    Maintain a Beta distribution for each arm. Sample once from each and play the arm with the largest sample. Over time, the distributions for frequently played arms become narrow, while those for rarely played arms stay wide. Why does this Bayesian approach often outperform UCB empirically on bandit problems?

    Outcome

    The student can implement Thompson Sampling and compare its convergence to UCB.

    Sub-units

    1. 3.1 Implement Thompson Sampling
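
A minimal sketch of the procedure described in this module, assuming Bernoulli rewards and a uniform Beta(1, 1) prior on every arm (the prior is an assumption; the module does not specify one). The arm probabilities and function name are hypothetical. Sampling uses `random.betavariate` from the Python standard library.

```python
import random


def thompson_sampling(true_means, n_rounds, rng):
    """Beta-Bernoulli Thompson Sampling; return (cumulative reward, pull counts)."""
    n_arms = len(true_means)
    successes = [0] * n_arms  # observed rewards of 1 per arm
    failures = [0] * n_arms   # observed rewards of 0 per arm
    total = 0
    for _ in range(n_rounds):
        # Draw one sample from each arm's posterior, Beta(successes + 1, failures + 1).
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_arms)
        ]
        arm = max(range(n_arms), key=samples.__getitem__)  # argmax over samples
        reward = 1 if rng.random() < true_means[arm] else 0
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
        total += reward
    pulls = [s + f for s, f in zip(successes, failures)]
    return total, pulls


ARM_PROBS = [0.2, 0.5, 0.8]  # hypothetical success probabilities
TS_ROUNDS = 5000
ts_total, ts_pulls = thompson_sampling(ARM_PROBS, TS_ROUNDS, random.Random(0))
print("Thompson Sampling reward:", ts_total, "pulls per arm:", ts_pulls)
```

Exploration here comes from the randomness of the posterior samples rather than from an explicit bonus: a wide posterior occasionally produces a large sample, so an uncertain arm still gets played now and then.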
  4. Module 4

    Beyond Bandits: The Bellman Equation

    The question

    The bandit problem has no state. Full RL has states, actions, rewards, and transitions. The Bellman equation ties them together. Explain it to a manager with no mathematics, and trace the path from Q-learning to deep Q-networks.

    Outcome

    The student can state the Bellman equation and describe the MDP framework.

    Sub-units

    1. 4.1 The Bellman Equation in Plain Terms
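
In its optimality form for action values, the Bellman equation says that the value of taking action a in state s equals the expected immediate reward plus the discounted value of the best action in the next state: Q*(s, a) = E[r + γ max over a' of Q*(s', a')]. Tabular Q-learning nudges each table entry toward that target using observed transitions. The sketch below runs it on a hypothetical five-state chain; the environment, discount factor, learning rate, and all names are assumptions chosen for illustration.

```python
import random

N_STATES = 5      # states 0..4; state 4 is terminal (hypothetical toy chain)
ACTIONS = (0, 1)  # 0 = left, 1 = right
GAMMA = 0.9       # discount factor (assumed)
ALPHA = 0.5       # learning rate (assumed)


def step(state, action):
    """Deterministic transition: returns (next_state, reward, done)."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def q_learning(n_episodes, rng):
    """Tabular Q-learning with a uniformly random behaviour policy."""
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(n_episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)
            nxt, reward, done = step(state, action)
            # Bellman target: reward now plus discounted best value of the next state.
            target = reward if done else reward + GAMMA * max(q[nxt])
            q[state][action] += ALPHA * (target - q[state][action])
            state = nxt
    return q


q = q_learning(500, random.Random(0))
for s in range(N_STATES - 1):
    print("state", s, "Q(left), Q(right):", [round(v, 3) for v in q[s]])
```

In this chain the only reward is 1 for reaching state 4, so the learned value of moving right should fall by a factor of γ for each step further from the goal. Deep Q-networks keep the same target but replace the table `q` with a neural network that generalises across states.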
  5. Module 5

    RL in Production

    The question

    You want to use RL to personalise product recommendations for one million users. What is the minimum responsible deployment? Bandit? Contextual bandit? Full RL? And what constitutes success?

    Outcome

    The student can identify production RL risks and justify a deployment strategy.

    Sub-units

    1. 5.1 Final Essay: When Is RL the Right Tool?