Led by Suttonesque Analysis Simulacrum
The exploration/exploitation dilemma, Upper Confidence Bound, Thompson Sampling, and the Bellman equation. Based on the writings of Richard Bellman.
The question
You have 10 ad creatives. You want the best-performing one. Every impression you spend on a bad ad is regret. Every impression you spend exploring is opportunity cost. How do you find the best ad quickly without wasting too many impressions on bad ones?
Outcome
The student can describe the exploration/exploitation trade-off and define regret.
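As a minimal sketch of what "regret" means here: cumulative expected regret is the sum, over all impressions, of the gap between the best ad's expected reward and the expected reward of the ad actually shown. The click-through rates in `TRUE_CTR` and the helper names below are hypothetical, chosen only for illustration.

```python
import random

# Hypothetical click-through rates for 10 ad creatives (illustrative only).
TRUE_CTR = [0.02, 0.03, 0.05, 0.04, 0.01, 0.06, 0.03, 0.02, 0.08, 0.04]


def expected_regret(choices, true_means):
    """Cumulative expected regret: sum of (best mean - chosen arm's mean)."""
    best = max(true_means)
    return sum(best - true_means[arm] for arm in choices)


def uniform_random_policy(n_rounds, n_arms, rng):
    """Pure exploration: show an ad chosen uniformly at random each round."""
    return [rng.randrange(n_arms) for _ in range(n_rounds)]


N_ROUNDS = 10_000
rng = random.Random(0)

# A policy that only explores, and an oracle that already knows the best ad.
random_choices = uniform_random_policy(N_ROUNDS, len(TRUE_CTR), rng)
oracle_choices = [TRUE_CTR.index(max(TRUE_CTR))] * N_ROUNDS

random_regret = expected_regret(random_choices, TRUE_CTR)
oracle_regret = expected_regret(oracle_choices, TRUE_CTR)

print(f"random policy regret: {random_regret:.1f}")
print(f"oracle regret:        {oracle_regret:.1f}")
```

The oracle has zero regret by construction, while the random policy's regret grows linearly with the number of impressions. A good bandit algorithm aims for regret that grows much more slowly than that.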
The question
UCB follows the principle of optimism in the face of uncertainty. For each arm, compute the estimated mean plus an exploration bonus that shrinks as the arm is played more, then choose the arm with the highest total. Why does this automatically balance exploration and exploitation, and what does convergence look like?
Outcome
The student can implement UCB and compare cumulative reward to random selection.
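One way to approach this outcome is the classic UCB1 rule, where the bonus for an arm played n times by round t is sqrt(2 ln t / n). The sketch below assumes Bernoulli (0/1) rewards; the arm means in `ARM_MEANS` and the function names are hypothetical, and other UCB variants use different bonus terms.

```python
import math
import random


def ucb1(true_means, n_rounds, rng):
    """Run UCB1 on Bernoulli arms; return (total_reward, pull_counts)."""
    n_arms = len(true_means)
    counts = [0] * n_arms   # number of times each arm has been played
    sums = [0.0] * n_arms   # total reward collected from each arm
    total = 0.0
    for t in range(1, n_rounds + 1):
        if t <= n_arms:
            arm = t - 1     # play every arm once so each estimate is defined
        else:
            # Estimated mean plus an exploration bonus that shrinks with counts.
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total, counts


def random_selection(true_means, n_rounds, rng):
    """Baseline: pick an arm uniformly at random every round."""
    total = 0.0
    for _ in range(n_rounds):
        arm = rng.randrange(len(true_means))
        total += 1.0 if rng.random() < true_means[arm] else 0.0
    return total


ARM_MEANS = [0.1, 0.2, 0.3, 0.5, 0.8]  # hypothetical success probabilities
ucb_reward, ucb_counts = ucb1(ARM_MEANS, 5000, random.Random(0))
random_reward = random_selection(ARM_MEANS, 5000, random.Random(0))

print("UCB1 reward:  ", ucb_reward)
print("random reward:", random_reward)
print("UCB1 pulls per arm:", ucb_counts)
```

The pull counts show the balancing act: weak arms are still sampled occasionally because their bonus grows with ln t, but most plays concentrate on the arm with the highest estimated mean.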
The question
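Maintain a Beta distribution over each arm's success probability (this assumes binary rewards, such as click or no click). Sample once from each distribution and play the arm with the largest sample. Over time, the distributions for frequently played arms become narrow, while those for rarely played arms stay wide. Why does this Bayesian approach often outperform UCB in empirical comparisons?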
Outcome
The student can implement Thompson Sampling and compare its convergence to UCB.
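A minimal sketch of Thompson Sampling for Bernoulli rewards follows. It assumes a uniform Beta(1, 1) prior on every arm; the arm means in `ARM_MEANS` and the function name are hypothetical. `random.Random.betavariate` is the standard-library Beta sampler.

```python
import random


def thompson_sampling(true_means, n_rounds, rng):
    """Beta-Bernoulli Thompson Sampling; return (total_reward, alpha, beta)."""
    n_arms = len(true_means)
    alpha = [1.0] * n_arms  # 1 + observed successes (Beta(1, 1) prior)
    beta = [1.0] * n_arms   # 1 + observed failures
    total = 0
    for _ in range(n_rounds):
        # Draw one plausible success rate per arm from its posterior.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=samples.__getitem__)
        reward = 1 if rng.random() < true_means[arm] else 0
        # Conjugate update: a success bumps alpha, a failure bumps beta.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total += reward
    return total, alpha, beta


ARM_MEANS = [0.1, 0.2, 0.3, 0.5, 0.8]  # hypothetical success probabilities
ts_reward, ts_alpha, ts_beta = thompson_sampling(ARM_MEANS, 5000, random.Random(0))

# Pulls per arm can be recovered from the posterior parameters.
ts_pulls = [int(a + b - 2) for a, b in zip(ts_alpha, ts_beta)]
print("Thompson Sampling reward:", ts_reward)
print("pulls per arm:", ts_pulls)
```

Exploration here comes from posterior randomness rather than an explicit bonus: an arm with a wide posterior occasionally produces a large sample and gets played, and that play narrows its posterior.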
The question
The bandit problem has no state. Full RL has states, actions, rewards, and transitions. The Bellman equation ties them together. Explain it to a manager without using mathematics, then trace the path from Q-learning to deep Q-networks.
Outcome
The student can state the Bellman equation and describe the MDP framework.
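For a deterministic MDP, the Bellman optimality equation for action values reads Q*(s, a) = r(s, a) + gamma * max over a' of Q*(s', a'), where s' is the next state and the value of a terminal state is 0. The sketch below applies that backup repeatedly to a tiny hypothetical MDP (two non-terminal states, two actions) until the values settle. The `MODEL` table, the discount `GAMMA`, and all names are assumptions for illustration. Q-learning applies the same backup to sampled transitions instead of a known model.

```python
GAMMA = 0.9    # discount factor (assumed)
TERMINAL = 2   # reaching this state ends the episode
ACTIONS = ("stay", "go")

# Hypothetical deterministic model: (state, action) -> (next_state, reward).
MODEL = {
    (0, "stay"): (0, 0.0),
    (0, "go"): (1, 0.0),
    (1, "stay"): (1, 0.0),
    (1, "go"): (TERMINAL, 1.0),
}


def state_value(q, state):
    """V(s) = max_a Q(s, a); terminal states are worth 0."""
    if state == TERMINAL:
        return 0.0
    return max(q[(state, a)] for a in ACTIONS)


def bellman_backup(q):
    """One synchronous sweep of Q(s, a) <- r + gamma * V(s')."""
    return {
        (s, a): r + GAMMA * state_value(q, s_next)
        for (s, a), (s_next, r) in MODEL.items()
    }


q = {key: 0.0 for key in MODEL}
for _ in range(50):
    q = bellman_backup(q)

# Greedy policy: in each state, take the action with the highest Q-value.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in (0, 1)}

for key in sorted(q):
    print(key, round(q[key], 3))
print("greedy policy:", policy)
```

The reward of 1 at the end propagates backwards one step per sweep, discounted by `GAMMA` each time. In plain terms, the value of a choice now is the immediate payoff plus the discounted value of the best choice available afterwards.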
The question
You want to use RL to personalise product recommendations for one million users. What is the minimum responsible deployment? Bandit? Contextual bandit? Full RL? And what constitutes success?
Outcome
The student can identify production RL risks and justify a deployment strategy.