Reinforcement learning (RL)
In one line: A training technique where the AI improves by trial and error, getting rewards for good outputs. The 'RL' in RLHF.
Reinforcement learning (RL) is a machine-learning technique where an agent learns by trial and error — getting rewards for good actions and penalties for bad ones.
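The trial-and-error loop can be sketched in a few lines. This is a toy two-action bandit, not any real RLHF pipeline: the payoff probabilities and the epsilon-greedy strategy are illustrative assumptions.

```python
import random

# Toy trial-and-error agent: two actions, unknown payoffs.
# The agent learns from reward alone which action is better.
random.seed(0)
values = [0.0, 0.0]  # running estimate of each action's average reward
counts = [0, 0]      # how often each action was tried

def reward(action):
    # Hypothetical environment: action 1 pays off 80% of the time, action 0 only 20%.
    p = 0.8 if action == 1 else 0.2
    return 1.0 if random.random() < p else 0.0

for step in range(1000):
    # Explore a random action 10% of the time, otherwise exploit the best-known one.
    if random.random() < 0.1:
        a = random.randrange(2)
    else:
        a = values.index(max(values))
    r = reward(a)
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]  # incremental mean update

print(values.index(max(values)))  # the agent settles on the higher-paying action
```

After enough trials the reward estimates separate and the agent consistently picks the better action — the same learn-from-reward principle that RLHF applies to language models, at a vastly larger scale.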
For LLMs, RL is used in RLHF (Reinforcement Learning from Human Feedback). After pre-training, humans rank pairs of model outputs; a reward model learns those preferences, and the model is then trained (typically with PPO) to produce outputs that score higher. This is how raw, unhinged base models get turned into well-behaved chatbots.
Newer techniques: RLAIF (RL from AI feedback — used in Constitutional AI), DPO (Direct Preference Optimisation — a simpler alternative that skips the reward model and trains directly on preference pairs).
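For a feel of how DPO works, here is its per-pair loss computed by hand. The log-probabilities below are made-up numbers standing in for what a trained policy and a frozen reference model would assign to the preferred ("winner") and rejected ("loser") responses:

```python
import math

beta = 0.1  # strength of the KL-style constraint toward the reference model

# Illustrative log-probabilities (not from a real model):
logp_w, logp_l = -12.0, -15.0          # policy's log-probs for winner / loser
ref_logp_w, ref_logp_l = -13.0, -14.0  # reference model's log-probs

# DPO margin: how much more the policy prefers the winner over the loser,
# relative to the reference model
margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))

# DPO loss: -log sigmoid(margin); smaller when the policy's preference
# for the winner grows relative to the reference
loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
print(round(loss, 4))
```

Minimising this loss pushes the policy toward the preferred response directly from ranked pairs — no separate reward model, no PPO loop.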
Modern reasoning models like DeepSeek R1 use RL with verifiable rewards (e.g. correct maths or code answers) to learn effective chain-of-thought reasoning.
See it in action — ask any AI about reinforcement learning (RL) on AskAI.free.
Try it free →