Led by Geoffrey Hinton Simulacrum
Learning from incomplete episodes — TD(0) prediction, SARSA (on-policy control) and Q-Learning (off-policy control), the algorithms behind modern RL.
The question
Temporal difference introduction · TD(0) prediction (one-step bootstrapping) · the TD update rule and comparison with MC · TD(0) prediction in code · bias-variance trade-off (TD vs MC) · SARSA (State-Action-Reward-State-Action) · on-policy TD control...
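As a rough illustration of the TD(0) prediction and SARSA topics listed above, here is a minimal tabular sketch. The function names (`td0_update`, `sarsa_update`, `random_walk_td0`), the seven-state random walk, and the default step sizes are assumptions chosen for this example, not part of the course material.

```python
import random
from collections import defaultdict


def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=1.0):
    """One TD(0) step: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    # Bootstrapping: the target uses the current estimate of the next state,
    # so the update can be applied before the episode has finished.
    target = r + (0.0 if done else gamma * V[s_next])
    V[s] += alpha * (target - V[s])
    return V[s]


def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=1.0):
    """One SARSA step using the action a_next actually chosen in s_next."""
    # On-policy: the target follows the behaviour policy's own next action.
    target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


def random_walk_td0(episodes=100, alpha=0.1, seed=0):
    """TD(0) prediction on an assumed random walk with states 0..6.

    States 0 and 6 are terminal; reaching state 6 pays reward 1, everything
    else pays 0. The policy being evaluated moves left or right uniformly.
    """
    rng = random.Random(seed)
    V = defaultdict(float)
    for s in range(1, 6):
        V[s] = 0.5  # neutral initial guess for non-terminal states
    for _ in range(episodes):
        s = 3  # every episode starts in the middle
        while s not in (0, 6):
            s_next = s + rng.choice((-1, 1))
            done = s_next in (0, 6)
            r = 1.0 if s_next == 6 else 0.0
            td0_update(V, s, r, s_next, done, alpha, gamma=1.0)
            s = s_next
    return V
```

A Monte Carlo version would wait for the episode's full return before updating; the TD(0) target above replaces that return with a one-step estimate, which is where the bias-variance trade-off between the two methods comes from.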
Outcome
Demonstrates understanding and implementation of TD(0) prediction and SARSA.
Sub-units
Led by Geoffrey Hinton Simulacrum
The question
Q-Learning (off-policy TD control) · the Q-Learning update rule: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)] · why Q-Learning is off-policy (learns about the greedy policy while following an exploratory policy) · Q-Learning in code · comparison ...
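The update rule and the off-policy idea above can be sketched in a few lines of tabular Python. The five-state corridor environment, the helper names (`step`, `q_learning_update`, `train_corridor`, `greedy_action`), and the hyperparameters are all assumptions made for illustration. The behaviour policy here is uniformly random, the most extreme exploratory case, while the update target takes the maximum over next actions.

```python
import random
from collections import defaultdict

ACTIONS = (-1, 1)  # hypothetical corridor: step left or step right
GOAL = 4           # states 0..4; reaching state 4 ends the episode


def step(s, a):
    """Assumed toy dynamics: move within [0, GOAL], reward 1 at the goal."""
    s_next = min(max(s + a, 0), GOAL)
    done = s_next == GOAL
    return s_next, (1.0 if done else 0.0), done


def q_learning_update(Q, s, a, r, s_next, actions, done, alpha=0.1, gamma=0.99):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    # Off-policy: the target uses the greedy value of s_next, regardless of
    # which action the behaviour policy will actually take there.
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]


def train_corridor(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn Q for the corridor while acting uniformly at random."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.choice(ACTIONS)  # behaviour policy: pure exploration
            s_next, r, done = step(s, a)
            q_learning_update(Q, s, a, r, s_next, ACTIONS, done, alpha, gamma)
            s = s_next
    return Q


def greedy_action(Q, s):
    """Target policy: pick the action with the highest learned value."""
    return max(ACTIONS, key=lambda a: Q[(s, a)])
```

Even though the agent never acts greedily during training, `greedy_action` on the learned table should point toward the goal in every non-terminal state. Replacing the `max` in the target with the value of the action actually taken next would turn this into SARSA.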
Outcome
Demonstrates understanding and implementation of Q-Learning.
Sub-units