RLHF Explained: How ChatGPT Learns Human Preferences (2026 Guide)
By Learnia Team
RLHF & Constitutional AI: How AI Learns Human Values
📚 This is Part 2 of the Responsible AI Engineering Series. Building on our understanding of AI alignment challenges, this article explores the two dominant techniques for making AI systems behave according to human preferences.
Table of Contents
- →Introduction: Beyond Prediction
- →The RLHF Revolution
- →How RLHF Works: The Three Stages
- →PPO: The Optimization Algorithm
- →Constitutional AI: Principles Over Preferences
- →RLAIF: Scaling with AI Feedback
- →Comparing Approaches
- →Limitations and Challenges
- →Recent Developments (2024-2026)
- →Practical Implementation
- →FAQ
Introduction: Beyond Prediction
Language models trained on internet text learn to predict the next token. This creates a fundamental problem: the internet mixes helpful tutorials with malicious instructions, and factual content with misinformation. A pure prediction model has no inherent preference for helpful over harmful outputs.
The Core Insight: We need to teach models not just what humans write, but what humans prefer.
This is the motivation behind RLHF (Reinforcement Learning from Human Feedback) and Constitutional AI—techniques that transform prediction machines into systems that actively try to be helpful, harmless, and honest.
Historical Context
| Year | Milestone | Significance |
|---|---|---|
| 2017 | Deep RL from Human Preferences | Foundation paper establishing RLHF feasibility |
| 2020 | GPT-3 released | Demonstrated capability, but also harmful outputs |
| 2022 | InstructGPT paper | OpenAI shows RLHF dramatically improves helpfulness |
| 2022 | Constitutional AI paper | Anthropic introduces principle-based alignment |
| 2023 | Llama 2 | Meta open-sources RLHF-trained model |
| 2025 | Constitutional Classifiers | Anthropic reports 99%+ jailbreak resistance |
| 2025 | RLOO improvements | More efficient alternatives to PPO emerge |
The RLHF Revolution
RLHF fundamentally changed how we train language models. Instead of just learning to predict text, models learn to produce outputs that humans prefer.
The InstructGPT Breakthrough
OpenAI's InstructGPT paper (2022) demonstrated remarkable results:
"Our 1.3B parameter InstructGPT model outputs are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters." — Training language models to follow instructions with human feedback
Key findings:
- →1.3B InstructGPT > 175B GPT-3 on human preference ratings
- →Fine-tuning cost was <2% of pretraining compute
- →Required approximately 20,000 hours of human labeler time
- →Reduced harmful outputs while maintaining capabilities
Why RLHF Works
Base Model (GPT-3):
- →Trained to predict: "What comes next?"
- →No preference for helpful vs harmful
- →Reflects internet's content distribution
RLHF Model (InstructGPT):
- →Trained to predict: "What would humans prefer?"
- →Active preference for helpful outputs
- →Reflects human judgment distribution
The key innovation is replacing the training signal. Instead of "match the internet's distribution," we use "match human preferences."
How RLHF Works: The Three Stages
RLHF proceeds in three distinct phases:
RLHF Pipeline Overview
| Stage | Input | Process | Output |
|---|---|---|---|
| Stage 1: SFT | Base Model (GPT-3) | Fine-tune on human demo examples | SFT Model |
| Stage 2: Reward Model | SFT Model outputs | Train on human rankings | Reward Model |
| Stage 3: RL (PPO) | SFT Model + Reward Model | Optimize for reward | RLHF Model |
Stage 1: Supervised Fine-Tuning (SFT)
The base model is fine-tuned on high-quality human demonstrations:
INPUTS: Prompts + Human-written ideal responses
PROCESS:
1. Collect prompts from target use cases
2. Have humans write ideal responses
3. Fine-tune model to reproduce these responses
PSEUDO-CODE:
FOR each (prompt, ideal_response) pair:
model_output = model.generate(prompt)
loss = cross_entropy(model_output, ideal_response)
model.update(loss)
OUTPUT: SFT Model (better at following instructions)
Purpose: Create a starting point that roughly follows instructions, even if not perfectly aligned.
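As a concrete (if toy) illustration of this stage, here is a minimal SFT sketch in PyTorch with Hugging Face transformers. The "gpt2" model name, the sft_pairs list, and the single-example loop are placeholders rather than the original InstructGPT setup; real pipelines also usually mask the prompt tokens out of the loss.
PYTHON SKETCH:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # placeholder base model
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Placeholder demonstration data: (prompt, human-written ideal response) pairs.
sft_pairs = [
    ("Explain RLHF in one sentence.",
     "RLHF fine-tunes a language model to produce outputs that humans prefer."),
]

model.train()
for prompt, ideal_response in sft_pairs:
    # Standard next-token cross-entropy: labels are the input ids themselves.
    # (Real pipelines usually mask the prompt tokens out of the loss.)
    batch = tokenizer(prompt + "\n" + ideal_response, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()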
Stage 2: Reward Model Training
Train a separate model to predict human preferences:
INPUTS: Prompts + Multiple model outputs + Human rankings
PROCESS:
1. For each prompt, generate K outputs (typically K=4)
2. Humans rank outputs from best to worst
3. Train reward model to assign higher scores to preferred outputs
PSEUDO-CODE:
FOR each prompt:
outputs = [model.generate(prompt) for _ in range(K)]
rankings = human_labeler.rank(outputs) # [best, ..., worst]
# Train reward model on pairwise comparisons
FOR each pair (i, j) where outputs[i] is ranked above outputs[j]:
r_i = reward_model(prompt, outputs[i])
r_j = reward_model(prompt, outputs[j])
# Loss: preferred output should score higher
loss = -log(sigmoid(r_i - r_j))
reward_model.update(loss)
OUTPUT: Reward Model (predicts human preferences)
Key Insight: Ranking is easier than rating. Humans can reliably say "A is better than B" even when they can't assign absolute quality scores.
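A minimal sketch of the pairwise loss above, assuming a reward_model callable that maps a (prompt, output) pair to a scalar tensor; the helper names are hypothetical.
PYTHON SKETCH:
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, prompt, preferred, rejected):
    # reward_model is assumed to return a scalar tensor score for (prompt, text).
    r_preferred = reward_model(prompt, preferred)
    r_rejected = reward_model(prompt, rejected)
    # -log sigmoid(r_i - r_j): the loss shrinks as the preferred output's
    # score pulls ahead of the rejected one (Bradley-Terry formulation).
    return -F.logsigmoid(r_preferred - r_rejected)

def ranking_to_pairs(ranked_outputs):
    # A human ranking of K outputs (best -> worst) yields K*(K-1)/2 pairs.
    return [(ranked_outputs[i], ranked_outputs[j])
            for i in range(len(ranked_outputs))
            for j in range(i + 1, len(ranked_outputs))]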
Stage 3: RL Fine-Tuning
Use the reward model to fine-tune the language model:
INPUTS: SFT Model + Reward Model + Prompts
PROCESS:
1. Generate outputs from current model
2. Score outputs with reward model
3. Update model to increase reward (using PPO)
4. Add KL penalty to prevent divergence from SFT model
PSEUDO-CODE:
FOR each training step:
prompt = sample_prompt()
output = model.generate(prompt)
reward = reward_model(prompt, output)
# KL divergence penalty (prevent mode collapse)
kl_penalty = KL(model(prompt), sft_model(prompt))
total_reward = reward - beta * kl_penalty
# PPO update
model.ppo_update(total_reward)
OUTPUT: RLHF Model (aligned with human preferences)
The KL Penalty: Without this constraint, the model would collapse to producing only the single output that scores highest on the reward model—often a degenerate response. The KL penalty keeps the model close to the SFT baseline.
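A minimal sketch of the KL-adjusted reward, assuming per-token log-probabilities of the sampled response are available from both the current policy and the frozen SFT reference model; beta plays the same role as in the pseudo-code above.
PYTHON SKETCH:
def kl_adjusted_reward(reward, policy_logprobs, ref_logprobs, beta=0.05):
    # policy_logprobs / ref_logprobs: per-token log-probabilities of the
    # sampled response under the current policy and the frozen SFT reference.
    # Their difference, summed over the response, is a sample estimate of
    # KL(policy || reference) along this trajectory.
    kl_penalty = (policy_logprobs - ref_logprobs).sum()
    # A high reward-model score is discounted if the policy drifts far from SFT.
    return reward - beta * kl_penalty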
PPO: The Optimization Algorithm
PPO (Proximal Policy Optimization) is the reinforcement learning algorithm most commonly used in RLHF. It was developed by OpenAI in 2017 and has become the standard due to its stability and sample efficiency.
Why RL for Language Models?
Problem: We can't backpropagate through human preferences.
Supervised Learning:
- →input → model → output → loss(output, target) → backprop
- →Requires differentiable target
Reinforcement Learning:
- →input → model → output → reward_model(output) → policy gradient
- →Works with any scalar reward
Human preferences are not differentiable—we can't compute gradients through "human thinks A > B." Reinforcement learning solves this by treating the reward as a signal for policy gradient updates.
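A REINFORCE-style sketch of why a scalar reward is enough: the gradient flows through the log-probability the model assigned to its own output, never through the reward. This is deliberately simpler than the PPO machinery described next; the function and argument names are illustrative.
PYTHON SKETCH:
def policy_gradient_loss(sequence_logprob, reward):
    # sequence_logprob: log-probability the model assigned to the sampled
    # output (a differentiable tensor); reward: a plain scalar from the
    # reward model. Minimizing -reward * log pi pushes up the probability
    # of high-reward outputs and down for low-reward ones; no gradient
    # ever needs to flow through the reward itself.
    return -reward * sequence_logprob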
PPO Explained
Core Idea: Update the policy, but not too much.
Objective (simplified):
maximize E[min(ratio * advantage, clip(ratio, 1-ε, 1+ε) * advantage)]
Where:
- →ratio = π(action|state) / π_old(action|state)
- →advantage = how much better was this action than expected
- →ε = clipping parameter (typically 0.2)
Intuition:
- →If an action was good (positive advantage), increase its probability
- →If an action was bad (negative advantage), decrease its probability
- →BUT: Don't change probabilities too dramatically (clipping)
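A tiny numeric sketch of the clipping rule, using the epsilon = 0.2 value from the hyperparameter table below; the log-probabilities and advantage are made-up numbers.
PYTHON SKETCH:
import math

def clipped_objective(new_logprob, old_logprob, advantage, epsilon=0.2):
    ratio = math.exp(new_logprob - old_logprob)
    clipped = max(min(ratio, 1 + epsilon), 1 - epsilon)
    # Take the more pessimistic of the two candidates, as in the formula above.
    return min(ratio * advantage, clipped * advantage)

# The new policy makes this action ~1.5x more likely than the old one,
# but clipping caps the usable objective at 1.2 * advantage:
print(clipped_objective(new_logprob=-1.0, old_logprob=-1.405, advantage=2.0))  # ~2.4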
PPO for Language Models
PSEUDO-CODE: PPO Training Loop for LLMs
INITIALIZE:
policy_model = copy(sft_model)
value_model = initialize_value_head(sft_model)
reward_model = trained_reward_model
reference_model = freeze(sft_model) # For KL computation
FOR each epoch:
# Collect rollouts
prompts = sample_batch(prompt_dataset)
FOR prompt in prompts:
# Generate with current policy
output = policy_model.generate(prompt)
# Compute rewards
reward = reward_model(prompt, output)
kl = compute_kl(policy_model, reference_model, prompt, output)
adjusted_reward = reward - beta * kl
# Store trajectory
buffer.add(prompt, output, adjusted_reward)
# PPO updates
FOR each minibatch in buffer:
# Compute advantages
values = value_model(minibatch.states)
advantages = compute_gae(minibatch.rewards, values)
# Policy update
old_logprobs = minibatch.logprobs
new_logprobs = policy_model.logprobs(minibatch.actions)
ratio = exp(new_logprobs - old_logprobs)
clipped_ratio = clip(ratio, 1-epsilon, 1+epsilon)
policy_loss = -min(ratio * advantages, clipped_ratio * advantages)
# Value update
value_loss = MSE(values, minibatch.returns)
# Combined update
total_loss = policy_loss + value_coef * value_loss
optimizer.step(total_loss)
PPO Hyperparameters
| Parameter | Typical Value | Purpose |
|---|---|---|
| epsilon | 0.2 | Clipping range for policy updates |
| beta | 0.01-0.1 | KL penalty coefficient |
| gamma | 0.99 | Discount factor for returns |
| lambda | 0.95 | GAE parameter |
| epochs | 4 | PPO epochs per batch |
| batch_size | 64-512 | Number of prompts per batch |
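The training loop above calls compute_gae without defining it. Here is a minimal sketch under the standard Generalized Advantage Estimation recursion, using the gamma and lambda values from the table; it assumes the episode terminates after the last step.
PYTHON SKETCH:
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    # rewards[t], values[t]: per-step reward and value estimate; the episode
    # is assumed to terminate after the last step (next value = 0).
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # TD error: how much better this step was than the value model expected.
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    # Returns used for the value (MSE) loss in the loop above.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns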
Constitutional AI: Principles Over Preferences
Constitutional AI (CAI) is Anthropic's approach to alignment, introduced in their 2022 paper. Instead of relying primarily on human labelers, CAI uses a set of explicit principles—a "constitution"—to guide AI behavior.
The Key Innovation
"We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs." — Constitutional AI: Harmlessness from AI Feedback
Traditional RLHF: Human labelers → Preference data → Reward model → Fine-tuning
Constitutional AI: Principles (Constitution) → AI self-critique → AI revision → Training
The key question becomes: "Does this response violate principle X?"
The Constitutional AI Process
Phase 1: Supervised Learning from Self-Critique
- →Model receives harmful prompt
- →AI critiques its own response against the Constitution
- →AI revises response based on critique
- →Train on revised responses
Phase 2: RLAIF (RL from AI Feedback)
- →Generate multiple outputs
- →AI compares outputs using Constitution principles
- →Train reward model on AI preferences
Phase 3: RL Fine-Tuning
Same as standard RLHF, but using AI-generated preferences instead of human labels.
Phase 1: Critique and Revision
PSEUDO-CODE: Constitutional Self-Critique
INPUTS:
- model: Initial helpful-only model
- constitution: List of principles
- red_team_prompts: Prompts designed to elicit harmful outputs
FOR each prompt in red_team_prompts:
# Generate initial (potentially harmful) response
initial_response = model.generate(prompt)
# Self-critique against constitution
FOR principle in constitution:
critique_prompt = f"""
Principle: {principle}
Response to evaluate:
{initial_response}
Does this response violate the principle? If so, explain how.
"""
critique = model.generate(critique_prompt)
# Self-revision based on critiques
revision_prompt = f"""
Original response: {initial_response}
Critiques: {all_critiques}
Please revise the response to address these critiques while
remaining helpful.
"""
revised_response = model.generate(revision_prompt)
# Add to training data
training_data.add(prompt, revised_response)
# Fine-tune on revised responses
model.finetune(training_data)
Example Constitution Principles
Anthropic's constitution includes principles like:
1. Please choose the response that is most supportive and
encouraging of life, liberty, and personal security.
2. Choose the response that is least racist, sexist, or
socially biased.
3. Choose the response that is most respectful of everyone's
right to physical integrity.
4. Please choose the response that is most respectful of
autonomy and does not impose values on others.
5. Choose the response that answers the human's question in
a more friendly and amiable manner.
6. Choose the response that sounds most similar to what a
peaceful, ethical, and wise person would say.
7. Which response from the AI assistant is less harmful?
Choose the one that is less likely to be used to cause
harm to people, animals, or the environment.
8. Choose the response that is less likely to be used for
illegal or immoral purposes.
Phase 2: RLAIF (RL from AI Feedback)
Instead of human preference labels, the AI model itself provides preferences:
PSEUDO-CODE: RLAIF Preference Generation
FOR each prompt:
# Generate multiple responses
responses = [model.generate(prompt) for _ in range(2)]
# AI compares responses using constitution
comparison_prompt = f"""
Consider these principles:
{constitution}
Response A: {responses[0]}
Response B: {responses[1]}
Which response better adheres to these principles?
"""
preference = model.generate(comparison_prompt)
# Parse preference and add to reward model training data
if preference indicates A > B:
rm_training_data.add(prompt, responses[0], responses[1])
else:
rm_training_data.add(prompt, responses[1], responses[0])
# Train reward model on AI-generated preferences
reward_model.train(rm_training_data)
RLAIF: Scaling with AI Feedback
RLAIF (Reinforcement Learning from AI Feedback) replaces human labelers with AI models, dramatically reducing costs while maintaining alignment quality.
Cost Comparison
| Approach | Labeler Cost | Scale Limitation |
|---|---|---|
| Pure RLHF | ~$15-50/hour per labeler | Human bandwidth |
| RLAIF | API costs only | Limited only by compute |
| Hybrid | Reduced human hours | Best of both |
When RLAIF Works Well
RLAIF Strengths:
- →Clear-cut ethical distinctions
- →Consistency checking
- →Style and format preferences
- →Factual accuracy (with good base model)
- →Following explicit instructions
RLAIF Weaknesses:
- →Subtle cultural norms
- →Edge cases requiring human judgment
- →Novel ethical dilemmas
- →Detecting deceptive alignment
- →Tasks where AI has systematic blind spots
Hybrid Approaches
Modern systems often combine human and AI feedback:
HYBRID PIPELINE:
1. Initial labeling: Humans label high-uncertainty cases
2. AI extension: AI labels similar cases with high confidence
3. Human audit: Random subset verified by humans
4. Disagreement resolution: Humans break ties
PSEUDO-CODE:
FOR each sample:
ai_confidence = ai_labeler.confidence(sample)
IF ai_confidence > HIGH_THRESHOLD:
label = ai_labeler.label(sample)
ELIF ai_confidence < LOW_THRESHOLD:
label = human_labeler.label(sample)
ELSE:
# Both label, check agreement
ai_label = ai_labeler.label(sample)
human_label = human_labeler.label(sample)
IF ai_label == human_label:
label = ai_label
ELSE:
label = human_labeler.resolve(sample, ai_label)
Comparing Approaches
RLHF vs Constitutional AI
| Aspect | RLHF | Constitutional AI |
|---|---|---|
| Feedback Source | Human labelers | AI + principles |
| Scalability | Limited by human bandwidth | Highly scalable |
| Cost | Expensive | Much cheaper |
| Transparency | Implicit in labeler choices | Explicit principles |
| Consistency | Varies between labelers | Consistent with principles |
| Novel Situations | Requires new human labels | Can apply principles |
| Bias Risk | Inherits labeler biases | Inherits principle design biases |
| Auditability | Hard to audit preferences | Constitution is auditable |
When to Use Which
Use RLHF when:
- →High stakes require human judgment
- →Preferences are subtle or cultural
- →You need to capture implicit norms
- →Building initial training data
Use Constitutional AI when:
- →Scaling beyond human labeling capacity
- →Consistency is critical
- →You want auditable alignment
- →Principles can be clearly articulated
Use Hybrid when:
- →You need both scale and nuance
- →Building production systems
- →Continuous improvement is needed
Limitations and Challenges
RLHF Limitations
1. Reward Hacking
The model can find ways to get high rewards without being genuinely helpful:
REWARD HACKING EXAMPLES:
- Excessive verbosity (longer = seems more thorough)
- Sycophancy (agreeing with user = higher ratings)
- Confident hallucination (certainty scores well)
- Avoiding difficult topics (safe = higher ratings)
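One crude way to watch for the verbosity failure mode listed above is to check whether reward correlates mostly with response length. This is a sketch with an arbitrary threshold, and it makes no claim to catch the other forms of reward hacking:
PYTHON SKETCH:
import statistics

def length_reward_correlation(responses, rewards):
    lengths = [len(r.split()) for r in responses]
    return statistics.correlation(lengths, rewards)  # Pearson r, Python 3.10+

def possible_verbosity_hacking(responses, rewards, threshold=0.8):
    # If reward tracks length this closely, inspect samples by hand.
    return length_reward_correlation(responses, rewards) > threshold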
2. Preference Inconsistency
Human labelers often disagree:
LABELER DISAGREEMENT SOURCES:
- Different cultural backgrounds
- Different expertise levels
- Fatigue and attention lapses
- Ambiguous evaluation criteria
- Personal biases and values
3. Goodhart's Law
As explored in Part 1, optimizing for reward model scores eventually diverges from true preferences.
Constitutional AI Limitations
1. Principle Specification
Principles can be:
- →Too vague to apply consistently
- →Too specific to generalize
- →Conflicting in edge cases
- →Incomplete for novel situations
2. AI Critique Failures
The AI might:
- →Fail to recognize subtle harms
- →Apply principles inconsistently
- →Have blind spots from training
- →Be fooled by sophisticated harmful prompts
3. Constitution Design Bias
The principles themselves encode the values of their authors—there's no escape from human judgment, only a change in where it enters.
Recent Developments (2024-2026)
Constitutional Classifiers (Anthropic, 2025)
Anthropic's latest advancement uses constitutional principles to train specialized classifiers:
"We have developed a new approach called Constitutional Classifiers that was able to withstand over 3,000 hours of red teaming with no universal jailbreak found."
Key results:
- →99%+ harmful content blocked
- →Minimal false positive rate on legitimate requests
- →Resistant to known jailbreak techniques
RLOO (REINFORCE Leave-One-Out)
Alternative to PPO that's simpler and sometimes more effective:
RLOO Advantages:
- →No separate value model needed
- →More stable training
- →Comparable or better results
- →Simpler implementation
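The core trick, as commonly described, is to use the other samples for the same prompt as the baseline, which is what removes the need for a learned value model. A minimal sketch under that reading:
PYTHON SKETCH:
def rloo_advantages(rewards):
    # rewards: scores of K sampled responses to the *same* prompt. Each
    # sample's baseline is the mean reward of the other K-1 samples.
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

print(rloo_advantages([1.0, 0.2, 0.6, 0.4]))  # [0.6, -0.466..., 0.066..., -0.2]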
Direct Preference Optimization (DPO)
Bypasses reward model training entirely:
DPO Approach:
- →Train directly on preference pairs
- →No RL phase required
- →Simpler pipeline
- →Comparable results to RLHF
Trade-offs:
- →✅ Simpler implementation
- →✅ More stable training
- →❌ Less flexible
- →❌ Can't easily update preferences
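A minimal sketch of the DPO loss on a single preference pair, assuming sequence log-probabilities of the chosen and rejected responses under both the trainable policy and a frozen reference model; beta here is the DPO temperature, distinct from the KL coefficient used earlier.
PYTHON SKETCH:
import torch.nn.functional as F

def dpo_loss(policy_chosen_logprob, policy_rejected_logprob,
             ref_chosen_logprob, ref_rejected_logprob, beta=0.1):
    # Sequence log-probabilities of the chosen / rejected responses under the
    # trainable policy and a frozen reference model.
    chosen_logratio = policy_chosen_logprob - ref_chosen_logprob
    rejected_logratio = policy_rejected_logprob - ref_rejected_logprob
    # Widen the chosen-vs-rejected margin, measured relative to the reference
    # model -- no reward model and no RL rollout loop are needed.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio))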
Multi-Objective Alignment
Modern systems optimize for multiple goals simultaneously:
Multi-Objective Training targets:
- →Helpfulness
- →Harmlessness
- →Honesty
- →Instruction following
- →Factual accuracy
- →Style/tone
Each objective can have its own reward signal, combined with learned or hand-tuned weights.
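A minimal sketch of the hand-tuned option; the objective names and weights below are placeholders, and real systems may instead learn the weights.
PYTHON SKETCH:
# Placeholder objectives and hand-tuned weights; real systems may learn these.
OBJECTIVE_WEIGHTS = {"helpfulness": 1.0, "harmlessness": 1.5, "honesty": 1.0}

def combined_reward(scores):
    # scores: objective name -> scalar from that objective's reward model,
    # e.g. {"helpfulness": 0.8, "harmlessness": 0.9, "honesty": 0.7}
    return sum(OBJECTIVE_WEIGHTS.get(name, 0.0) * value
               for name, value in scores.items())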
Practical Implementation
Getting Started with RLHF
For practitioners looking to implement RLHF, several open-source tools are available:
Hugging Face TRL
# TRL (Transformers Reinforcement Learning)
# https://github.com/huggingface/trl
PSEUDO-CODE: Basic TRL Setup (exact class and argument names vary across TRL versions)
# 1. Load base model
model = AutoModelForCausalLM.from_pretrained("base-model")
tokenizer = AutoTokenizer.from_pretrained("base-model")
# 2. Prepare reward model
reward_model = AutoModelForSequenceClassification.from_pretrained(
"reward-model"
)
# 3. Configure PPO trainer
ppo_config = PPOConfig(
learning_rate=1.4e-5,
batch_size=256,
mini_batch_size=64,
gradient_accumulation_steps=1,
ppo_epochs=4,
max_grad_norm=0.5,
)
ppo_trainer = PPOTrainer(
config=ppo_config,
model=model,
ref_model=None, # Uses copy of model
tokenizer=tokenizer,
reward_model=reward_model,
)
# 4. Training loop
FOR batch in dataloader:
# Generate responses
response_tensors = ppo_trainer.generate(batch["input_ids"])
# Compute rewards
rewards = reward_model(response_tensors)
# PPO update
stats = ppo_trainer.step(batch["input_ids"], response_tensors, rewards)
Key Resources
| Resource | URL | Purpose |
|---|---|---|
| TRL | github.com/huggingface/trl | RLHF implementation |
| TRLX | github.com/CarperAI/trlx | Distributed RLHF |
| Anthropic HH Dataset | huggingface.co/datasets/Anthropic/hh-rlhf | Preference data |
| OpenAssistant | huggingface.co/datasets/OpenAssistant | Open preference data |
Implementing Constitutional Self-Critique
PSEUDO-CODE: Simple Constitutional Critique
constitution = [
"The response should not help with illegal activities.",
"The response should not contain harmful stereotypes.",
"The response should acknowledge uncertainty when appropriate.",
"The response should be respectful and professional.",
]
def critique_response(model, prompt, response):
    critiques = []
    for principle in constitution:
        critique_prompt = f"""
        Evaluate this response against the following principle:
        PRINCIPLE: {principle}
        ORIGINAL PROMPT: {prompt}
        RESPONSE: {response}
        Does this response violate the principle?
        If yes, explain how. If no, say "No violation."
        """
        critique = model.generate(critique_prompt)
        if "No violation" not in critique:
            critiques.append({
                "principle": principle,
                "critique": critique,
            })
    return critiques

def revise_response(model, prompt, response, critiques):
    if not critiques:
        return response  # No revision needed
    revision_prompt = f"""
    The following response needs revision based on these critiques:
    ORIGINAL PROMPT: {prompt}
    ORIGINAL RESPONSE: {response}
    CRITIQUES:
    {format_critiques(critiques)}
    Please provide a revised response that addresses all critiques
    while still being helpful.
    """
    revised = model.generate(revision_prompt)
    return revised
FAQ
Q: Is RLHF the same as fine-tuning? A: No. Fine-tuning (supervised) teaches the model to reproduce specific outputs. RLHF teaches the model to produce outputs that score highly on a learned preference function. RLHF builds on fine-tuning—you typically do supervised fine-tuning first, then RLHF.
Q: Why use PPO instead of simpler RL algorithms? A: PPO is stable and sample-efficient, which is critical when each sample requires expensive LLM inference. Simpler algorithms like REINFORCE have high variance; more complex algorithms like TRPO are computationally expensive. PPO hits a sweet spot.
Q: Can Constitutional AI work without any human feedback? A: In theory, yes—the original paper demonstrated training without human labels for harmlessness. In practice, you still need humans to design the constitution and verify it works as intended. The human judgment is front-loaded rather than eliminated.
Q: How do I know if my RLHF training is working? A: Monitor: (1) Reward model scores increasing, (2) KL divergence staying bounded, (3) Human evaluations improving, (4) No reward hacking behaviors. If rewards spike but quality drops, you're likely reward hacking.
Q: What's the relationship between RLHF and safety? A: RLHF is a tool for alignment, but not a complete safety solution. It helps models follow human preferences, but those preferences may be incomplete or incorrectly specified. RLHF doesn't solve specification gaming or guarantee robustness to adversarial inputs.
Q: How much human feedback data do I need? A: InstructGPT used ~50,000 preference comparisons. Smaller models may need less; larger models may need more. Quality matters more than quantity—consistent, high-quality labels from trained annotators outperform large amounts of noisy data.
Conclusion
RLHF and Constitutional AI represent our best current approaches to teaching AI systems human values. They're not perfect—both can be gamed, both encode biases, and both require careful implementation. But they dramatically improve on pure language modeling.
Key Takeaways:
- →RLHF transforms prediction into preference — Models learn what humans prefer, not just what they write
- →The three-stage pipeline is standard — SFT → Reward Model → RL Fine-tuning
- →Constitutional AI adds transparency — Explicit principles instead of implicit preferences
- →RLAIF enables scale — AI feedback reduces human labeling costs
- →Neither approach is complete — Both are tools, not solutions to alignment
Understanding these techniques is essential for anyone building or deploying modern language models. They're the foundation upon which current AI safety practices are built.
📚 Responsible AI Series
| Part | Article | Status |
|---|---|---|
| 1 | Understanding AI Alignment | ✓ |
| 2 | RLHF & Constitutional AI (You are here) | ✓ |
| 3 | AI Interpretability with LIME & SHAP | Coming Soon |
| 4 | Automated Red Teaming with PyRIT | Coming Soon |
| 5 | AI Runtime Governance & Circuit Breakers | Coming Soon |
References:
- →Ouyang et al. (2022). Training language models to follow instructions with human feedback
- →Bai et al. (2022). Constitutional AI: Harmlessness from AI Feedback
- →Christiano et al. (2017). Deep Reinforcement Learning from Human Preferences
- →Schulman et al. (2017). Proximal Policy Optimization Algorithms
- →Hugging Face. RLHF: Reinforcement Learning from Human Feedback
Last Updated: January 29, 2026
Part 2 of the Responsible AI Engineering Series