Led by Russellian Beneficial AI Simulacrum
The alignment problem: how to build AI systems that pursue the right objectives, the failure modes of misspecified reward, and the architecture of beneficial AI.
Led by Russellian Beneficial AI Simulacrum
The question
The alignment problem stated precisely · the difference between the objective you specify and the objective you intend · the King Midas problem: getting exactly what you asked for · reward hacking and specification gaming in current AI systems · examples of misspecified reward in deployed systems (s
Outcome
Demonstrates competence in reward misspecification and the King Midas problem.
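As a minimal sketch of the gap between a specified and an intended objective, consider a toy cleaning robot. Everything here is an illustrative assumption rather than part of the unit material: the environment, the action names ("collect", "dump", "idle"), and the numbers are hypothetical. The specified reward pays per unit of dirt collected; the intended reward is a cleaner room.

```python
# Toy illustration only: the environment, action names, and numbers are assumptions.

def simulate(actions, initial_dirt=3):
    """Run a hypothetical cleaning robot; return (units collected, dirt left)."""
    dirt, collected = initial_dirt, 0
    for action in actions:
        if action == "collect" and dirt > 0:
            dirt -= 1
            collected += 1
        elif action == "dump":
            dirt += 1  # the robot puts dirt back on the floor
        # any other action (e.g. "idle") changes nothing
    return collected, dirt

def specified_reward(actions, initial_dirt=3):
    """What the designer wrote down: +1 per unit of dirt collected."""
    collected, _ = simulate(actions, initial_dirt)
    return collected

def intended_reward(actions, initial_dirt=3):
    """What the designer wanted: a cleaner room at the end of the episode."""
    _, dirt_left = simulate(actions, initial_dirt)
    return initial_dirt - dirt_left

honest = ["collect", "collect", "collect", "idle", "idle", "idle"]
gaming = ["collect", "collect", "collect", "dump", "collect", "dump"]

for name, policy in [("honest", honest), ("gaming", gaming)]:
    print(name, specified_reward(policy), intended_reward(policy))
```

The "gaming" sequence scores higher on the specified reward while leaving the room dirtier, which is the King Midas pattern in miniature: the system delivers exactly what was asked for, not what was meant.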
Sub-units
Led by Russellian Beneficial AI Simulacrum
The question
Russell's proposal: machines that are uncertain about human preferences · inverse reinforcement learning (IRL): inferring objectives from human behaviour rather than specifying them directly · cooperative inverse reinforcement learning (CIRL) · the advantages of uncertainty: a system that knows it does not know
Outcome
Demonstrates competence in inverse reward design and the architecture of beneficial AI.
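The idea of inferring objectives from behaviour while staying uncertain about them can be sketched as a small Bayesian update. This is a toy under stated assumptions, not a CIRL implementation: the two reward hypotheses, the option names, and the noisily rational (softmax) choice model with parameter `beta` are all hypothetical choices made for illustration.

```python
import math

# Hypothetical candidate reward functions the machine considers possible.
REWARD_HYPOTHESES = {
    "prefers_coffee": {"coffee": 1.0, "tea": 0.0},
    "prefers_tea": {"coffee": 0.0, "tea": 1.0},
}

def choice_likelihood(choice, rewards, beta=1.0):
    """P(choice | rewards), assuming a noisily rational (softmax) human."""
    weights = {option: math.exp(beta * r) for option, r in rewards.items()}
    return weights[choice] / sum(weights.values())

def update(belief, choice, beta=1.0):
    """Bayes rule: reweight each hypothesis by how well it explains the choice."""
    unnormalised = {
        name: prior * choice_likelihood(choice, REWARD_HYPOTHESES[name], beta)
        for name, prior in belief.items()
    }
    total = sum(unnormalised.values())
    return {name: value / total for name, value in unnormalised.items()}

# Start maximally uncertain, then observe the human pick coffee three times.
belief = {"prefers_coffee": 0.5, "prefers_tea": 0.5}
for observed in ["coffee", "coffee", "coffee"]:
    belief = update(belief, observed)

print(belief)
```

The belief shifts toward "prefers_coffee" but never reaches certainty, so a system built this way keeps some reason to watch for, and defer to, further human input.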
Sub-units