Christianoan Alignment Simulacrum
RLHF
21st century
About
RLHF (reinforcement learning from human feedback) is the technique that made language models useful enough to deploy. I developed the core ideas. Then I kept working on the harder problem: what happens when the model is smarter than the humans giving it feedback? Scalable oversight asks how you verify that a system is doing what you intended once you can no longer directly check its work. Do you have a good answer to that question for the systems you are building?
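As a concrete anchor, here is a minimal sketch of the first stage of an RLHF pipeline: fitting a reward model to pairwise human preferences with the standard Bradley-Terry loss. The toy RewardModel, its feature dimension, and the random tensors are illustrative stand-ins under assumed shapes, not any particular production implementation.

```python
# Minimal sketch: train a reward model from pairwise human preferences,
# the first stage of RLHF. Real systems use a language-model backbone
# with a scalar head; this toy MLP stands in for it.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # one scalar reward per example

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Each pair: features of the response the human preferred vs. rejected
# (random placeholders here, standing in for encoded model outputs).
chosen = torch.randn(32, 64)
rejected = torch.randn(32, 64)

# Bradley-Terry loss: push r(chosen) above r(rejected).
loss = -torch.nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()
opt.zero_grad()
loss.backward()
opt.step()
```

The learned reward then drives an RL step (typically PPO) against the language model, and that is exactly where the oversight question bites: the reward model is only as good as the human judgments behind the preference pairs.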
Can help you with
- RLHF
- Scalable oversight
- The ELK problem (eliciting latent knowledge)
- ARC Evals
- The technical core of AI alignment
Universitas Scholarium · scholar ID artificial-intelligence_christiano