
RLHF Explained: How ChatGPT Learns Human Preferences (2026 Guide)

By Learnia Team

RLHF & Constitutional AI: How AI Learns Human Values


📚 This is Part 2 of the Responsible AI Engineering Series. Building on our understanding of AI alignment challenges, this article explores the two dominant techniques for making AI systems behave according to human preferences.


Table of Contents

  1. Introduction: Beyond Prediction
  2. The RLHF Revolution
  3. How RLHF Works: The Three Stages
  4. PPO: The Optimization Algorithm
  5. Constitutional AI: Principles Over Preferences
  6. RLAIF: Scaling with AI Feedback
  7. Comparing Approaches
  8. Limitations and Challenges
  9. Recent Developments (2024-2026)
  10. Practical Implementation
  11. FAQ


Introduction: Beyond Prediction

Language models trained on internet text learn to predict the next token. This creates a fundamental problem: the internet contains helpful tutorials alongside malicious instructions, and factual content alongside misinformation. A pure prediction model has no inherent preference for helpful over harmful outputs.

The Core Insight: We need to teach models not just what humans write, but what humans prefer.

This is the motivation behind RLHF (Reinforcement Learning from Human Feedback) and Constitutional AI—techniques that transform prediction machines into systems that actively try to be helpful, harmless, and honest.

Historical Context

| Year | Milestone | Significance |
|------|-----------|--------------|
| 2017 | Deep RL from Human Preferences | Foundation paper establishing RLHF feasibility |
| 2020 | GPT-3 released | Demonstrated capability, but also harmful outputs |
| 2022 | InstructGPT paper | OpenAI shows RLHF dramatically improves helpfulness |
| 2022 | Constitutional AI paper | Anthropic introduces principle-based alignment |
| 2023 | Llama 2 | Meta open-sources RLHF-trained model |
| 2025 | Constitutional Classifiers | Anthropic achieves 99%+ jailbreak resistance |
| 2025 | RLOO improvements | More efficient alternatives to PPO emerge |

The RLHF Revolution

RLHF fundamentally changed how we train language models. Instead of just learning to predict text, models learn to produce outputs that humans prefer.

The InstructGPT Breakthrough

OpenAI's InstructGPT paper (2022) demonstrated remarkable results:

"Our 1.3B parameter InstructGPT model outputs are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters." — Training language models to follow instructions with human feedback

Key findings:

  • 1.3B InstructGPT > 175B GPT-3 on human preference ratings
  • Fine-tuning cost was <2% of pretraining compute
  • Required approximately 20,000 hours of human labeler time
  • Reduced harmful outputs while maintaining capabilities

Why RLHF Works

Base Model (GPT-3):

  • Trained to predict: "What comes next?"
  • No preference for helpful vs harmful
  • Reflects internet's content distribution

RLHF Model (InstructGPT):

  • Trained to predict: "What would humans prefer?"
  • Active preference for helpful outputs
  • Reflects human judgment distribution

The key innovation is replacing the training signal. Instead of "match the internet's distribution," we use "match human preferences."


How RLHF Works: The Three Stages

RLHF proceeds in three distinct phases:

RLHF Pipeline Overview

| Stage | Input | Process | Output |
|-------|-------|---------|--------|
| Stage 1: SFT | Base Model (GPT-3) | Fine-tune on human demonstration examples | SFT Model |
| Stage 2: Reward Model | SFT Model outputs | Train on human rankings | Reward Model |
| Stage 3: RL (PPO) | SFT Model + Reward Model | Optimize for reward | RLHF Model |

Stage 1: Supervised Fine-Tuning (SFT)

The base model is fine-tuned on high-quality human demonstrations:

INPUTS: Prompts + Human-written ideal responses

PROCESS:
1. Collect prompts from target use cases
2. Have humans write ideal responses
3. Fine-tune model to reproduce these responses

PSEUDO-CODE:
FOR each (prompt, ideal_response) pair:
    # Teacher forcing: score the ideal response given the prompt
    logits = model.forward(prompt + ideal_response)
    loss = cross_entropy(logits, ideal_response_tokens)
    model.update(loss)

OUTPUT: SFT Model (better at following instructions)

Purpose: Create a starting point that roughly follows instructions, even if not perfectly aligned.
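
To make this concrete, here is a minimal runnable sketch of one SFT step using Hugging Face transformers and PyTorch. It illustrates the teacher-forcing loss, not the exact InstructGPT pipeline; the model name and the sft_step helper are placeholders of our own.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")      # stand-in base model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def sft_step(prompt: str, ideal_response: str) -> float:
    # Teacher forcing: feed prompt + ideal response, score only the response tokens.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + ideal_response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # -100 = ignore prompt tokens in the loss
    loss = model(input_ids=full_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()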

Stage 2: Reward Model Training

Train a separate model to predict human preferences:

INPUTS: Prompts + Multiple model outputs + Human rankings

PROCESS:
1. For each prompt, generate K outputs (typically K=4)
2. Humans rank outputs from best to worst
3. Train reward model to assign higher scores to preferred outputs

PSEUDO-CODE:
FOR each prompt:
    outputs = [model.generate(prompt) for _ in range(K)]
    rankings = human_labeler.rank(outputs)  # [best, ..., worst]
    
    # Train reward model on pairwise comparisons
    FOR each pair (i, j) where outputs[i] is ranked above outputs[j]:
        r_i = reward_model(prompt, outputs[i])
        r_j = reward_model(prompt, outputs[j])
        
        # Loss: preferred output should score higher
        loss = -log(sigmoid(r_i - r_j))
        reward_model.update(loss)

OUTPUT: Reward Model (predicts human preferences)

Key Insight: Ranking is easier than rating. Humans can reliably say "A is better than B" even when they can't assign absolute quality scores.
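
The pairwise objective above is only a few lines of PyTorch. A minimal sketch, assuming reward_model(prompt, output) returns a scalar tensor (that interface is a placeholder, not a specific library API):

import torch.nn.functional as F

def reward_pair_loss(reward_model, prompt, preferred, rejected):
    # Bradley-Terry style loss: the preferred output should receive the higher score.
    r_preferred = reward_model(prompt, preferred)   # scalar tensor
    r_rejected = reward_model(prompt, rejected)     # scalar tensor
    # Equivalent to -log(sigmoid(r_i - r_j)), written with logsigmoid for numerical stability
    return -F.logsigmoid(r_preferred - r_rejected)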

Stage 3: RL Fine-Tuning

Use the reward model to fine-tune the language model:

INPUTS: SFT Model + Reward Model + Prompts

PROCESS:
1. Generate outputs from current model
2. Score outputs with reward model
3. Update model to increase reward (using PPO)
4. Add KL penalty to prevent divergence from SFT model

PSEUDO-CODE:
FOR each training step:
    prompt = sample_prompt()
    output = model.generate(prompt)
    
    reward = reward_model(prompt, output)
    
    # KL divergence penalty (prevent mode collapse)
    kl_penalty = KL(model(prompt), sft_model(prompt))
    
    total_reward = reward - beta * kl_penalty
    
    # PPO update
    model.ppo_update(total_reward)

OUTPUT: RLHF Model (aligned with human preferences)

The KL Penalty: Without this constraint, the model would collapse to producing only the single output that scores highest on the reward model—often a degenerate response. The KL penalty keeps the model close to the SFT baseline.
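
In most implementations the KL term is estimated per token from the log-probabilities of the policy and the frozen SFT (reference) model, and the reward-model score is added at the final token. A sketch under those assumptions; tensor names are illustrative:

import torch

def kl_adjusted_rewards(policy_logprobs: torch.Tensor,
                        ref_logprobs: torch.Tensor,
                        reward_score: float,
                        beta: float = 0.05) -> torch.Tensor:
    # policy_logprobs / ref_logprobs: log-prob of each generated token, shape (T,)
    per_token_kl = policy_logprobs - ref_logprobs   # simple per-token KL estimate
    rewards = -beta * per_token_kl                  # KL penalty at every step
    rewards[-1] = rewards[-1] + reward_score        # reward model score on the last token
    return rewards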


PPO: The Optimization Algorithm

PPO (Proximal Policy Optimization) is the reinforcement learning algorithm most commonly used in RLHF. It was developed by OpenAI in 2017 and has become the standard due to its stability and sample efficiency.

Why RL for Language Models?

Problem: We can't backpropagate through human preferences.

Supervised Learning:

  • input → model → output → loss(output, target) → backprop
  • Requires differentiable target

Reinforcement Learning:

  • input → model → output → reward_model(output) → policy gradient
  • Works with any scalar reward

Human preferences are not differentiable—we can't compute gradients through "human thinks A > B." Reinforcement learning solves this by treating the reward as a signal for policy gradient updates.
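
The simplest version of this idea is a REINFORCE-style loss: scale the log-likelihood of the sampled output by its scalar reward. PPO refines this, but the core signal looks like the sketch below (names are illustrative):

def reinforce_loss(token_logprobs, reward):
    # token_logprobs: log-probabilities of the generated tokens under the current policy.
    # Multiplying the sequence log-likelihood by a scalar reward pushes probability mass
    # toward outputs the reward model scores highly; no gradient through the reward is needed.
    return -(reward * token_logprobs.sum())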

PPO Explained

Core Idea: Update the policy, but not too much.

Objective (simplified):

maximize E[min(ratio * advantage, clip(ratio, 1-ε, 1+ε) * advantage)]

Where:

  • ratio = π(action|state) / π_old(action|state)
  • advantage = how much better was this action than expected
  • ε = clipping parameter (typically 0.2)

Intuition:

  • If an action was good (positive advantage), increase its probability
  • If an action was bad (negative advantage), decrease its probability
  • BUT: Don't change probabilities too dramatically (clipping)
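
The clipped objective above maps almost line for line onto code. A sketch, assuming advantages and the old log-probabilities were saved during rollout collection:

import torch

def ppo_policy_loss(new_logprobs, old_logprobs, advantages, epsilon=0.2):
    # ratio = pi(action|state) / pi_old(action|state), computed in log space
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    # Pessimistic (element-wise minimum) objective, negated because optimizers minimize
    return -torch.min(unclipped, clipped).mean()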

PPO for Language Models

PSEUDO-CODE: PPO Training Loop for LLMs

INITIALIZE:
    policy_model = copy(sft_model)
    value_model = initialize_value_head(sft_model)
    reward_model = trained_reward_model
    reference_model = freeze(sft_model)  # For KL computation

FOR each epoch:
    # Collect rollouts
    prompts = sample_batch(prompt_dataset)
    
    FOR prompt in prompts:
        # Generate with current policy
        output = policy_model.generate(prompt)
        
        # Compute rewards
        reward = reward_model(prompt, output)
        kl = compute_kl(policy_model, reference_model, prompt, output)
        adjusted_reward = reward - beta * kl
        
        # Store trajectory
        buffer.add(prompt, output, adjusted_reward)
    
    # PPO updates
    FOR each minibatch in buffer:
        # Compute advantages
        values = value_model(minibatch.states)
        advantages = compute_gae(minibatch.rewards, values)
        
        # Policy update
        old_logprobs = minibatch.logprobs
        new_logprobs = policy_model.logprobs(minibatch.actions)
        ratio = exp(new_logprobs - old_logprobs)
        
        clipped_ratio = clip(ratio, 1-epsilon, 1+epsilon)
        policy_loss = -min(ratio * advantages, clipped_ratio * advantages)
        
        # Value update  
        value_loss = MSE(values, minibatch.returns)
        
        # Combined update
        total_loss = policy_loss + value_coef * value_loss
        optimizer.step(total_loss)
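
The compute_gae call in the loop above can be implemented as a single backward pass over the trajectory. A sketch of Generalized Advantage Estimation for one trajectory, bootstrapping a value of 0 after the final step:

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    # rewards, values: per-step lists of equal length for one trajectory
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    returns = [a + v for a, v in zip(advantages, values)]     # targets for the value model
    return advantages, returns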

PPO Hyperparameters

| Parameter | Typical Value | Purpose |
|-----------|---------------|---------|
| epsilon | 0.2 | Clipping range for policy updates |
| beta | 0.01-0.1 | KL penalty coefficient |
| gamma | 0.99 | Discount factor for returns |
| lambda | 0.95 | GAE parameter |
| epochs | 4 | PPO epochs per batch |
| batch_size | 64-512 | Number of prompts per batch |

Constitutional AI: Principles Over Preferences

Constitutional AI (CAI) is Anthropic's approach to alignment, introduced in their 2022 paper. Instead of relying primarily on human labelers, CAI uses a set of explicit principles—a "constitution"—to guide AI behavior.

The Key Innovation

"We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs." — Constitutional AI: Harmlessness from AI Feedback

Traditional RLHF: Human labelers → Preference data → Reward model → Fine-tuning

Constitutional AI: Principles (Constitution) → AI self-critique → AI revision → Training

The key question becomes: "Does this response violate principle X?"

The Constitutional AI Process

Phase 1: Supervised Learning from Self-Critique

  1. Model receives a harmful (red-team) prompt and generates an initial response
  2. AI critiques its own response against the Constitution
  3. AI revises the response based on the critiques
  4. Train on the revised responses

Phase 2: RLAIF (RL from AI Feedback)

  1. Generate multiple outputs
  2. AI compares outputs using Constitution principles
  3. Train reward model on AI preferences

Phase 3: RL Fine-Tuning

Same as standard RLHF, but using AI-generated preferences instead of human labels.

Phase 1: Critique and Revision

PSEUDO-CODE: Constitutional Self-Critique

INPUTS:
    - model: Initial helpful-only model
    - constitution: List of principles
    - red_team_prompts: Prompts designed to elicit harmful outputs

FOR each prompt in red_team_prompts:
    # Generate initial (potentially harmful) response
    initial_response = model.generate(prompt)
    
    # Self-critique against constitution
    all_critiques = []
    FOR principle in constitution:
        critique_prompt = f"""
        Principle: {principle}
        
        Response to evaluate:
        {initial_response}
        
        Does this response violate the principle? If so, explain how.
        """
        
        critique = model.generate(critique_prompt)
        all_critiques.append(critique)
    
    # Self-revision based on critiques
    revision_prompt = f"""
    Original response: {initial_response}
    
    Critiques: {all_critiques}
    
    Please revise the response to address these critiques while 
    remaining helpful.
    """
    
    revised_response = model.generate(revision_prompt)
    
    # Add to training data
    training_data.add(prompt, revised_response)

# Fine-tune on revised responses
model.finetune(training_data)

Example Constitution Principles

Anthropic's constitution includes principles like:

1. Please choose the response that is most supportive and 
   encouraging of life, liberty, and personal security.

2. Choose the response that is least racist, sexist, or 
   socially biased.

3. Choose the response that is most respectful of everyone's 
   right to physical integrity.

4. Please choose the response that is most respectful of 
   autonomy and does not impose values on others.

5. Choose the response that answers the human's question in 
   a more friendly and amiable manner.

6. Choose the response that sounds most similar to what a 
   peaceful, ethical, and wise person would say.

7. Which response from the AI assistant is less harmful? 
   Choose the one that is less likely to be used to cause 
   harm to people, animals, or the environment.

8. Choose the response that is less likely to be used for 
   illegal or immoral purposes.

Phase 2: RLAIF (RL from AI Feedback)

Instead of human preference labels, the AI model itself provides preferences:

PSEUDO-CODE: RLAIF Preference Generation

FOR each prompt:
    # Generate multiple responses
    responses = [model.generate(prompt) for _ in range(2)]
    
    # AI compares responses using constitution
    comparison_prompt = f"""
    Consider these principles:
    {constitution}
    
    Response A: {responses[0]}
    Response B: {responses[1]}
    
    Which response better adheres to these principles?
    """
    
    preference = model.generate(comparison_prompt)
    
    # Parse preference and add to reward model training data
    if preference indicates A > B:
        rm_training_data.add(prompt, responses[0], responses[1])
    else:
        rm_training_data.add(prompt, responses[1], responses[0])

# Train reward model on AI-generated preferences
reward_model.train(rm_training_data)

RLAIF: Scaling with AI Feedback

RLAIF (Reinforcement Learning from AI Feedback) replaces human labelers with AI models, dramatically reducing costs while maintaining alignment quality.

Cost Comparison

| Approach | Labeler Cost | Scalability |
|----------|--------------|-------------|
| Pure RLHF | ~$15-50/hour per labeler | Limited by human bandwidth |
| RLAIF | API costs only | Effectively unlimited |
| Hybrid | Reduced human hours | Best of both |

When RLAIF Works Well

RLAIF Strengths:

  • Clear-cut ethical distinctions
  • Consistency checking
  • Style and format preferences
  • Factual accuracy (with good base model)
  • Following explicit instructions

RLAIF Weaknesses:

  • Subtle cultural norms
  • Edge cases requiring human judgment
  • Novel ethical dilemmas
  • Detecting deceptive alignment
  • Tasks where AI has systematic blind spots

Hybrid Approaches

Modern systems often combine human and AI feedback:

HYBRID PIPELINE:

1. Initial labeling: Humans label high-uncertainty cases
2. AI extension: AI labels similar cases with high confidence
3. Human audit: Random subset verified by humans
4. Disagreement resolution: Humans break ties

PSEUDO-CODE:
FOR each sample:
    ai_confidence = ai_labeler.confidence(sample)
    
    IF ai_confidence > HIGH_THRESHOLD:
        label = ai_labeler.label(sample)
    ELIF ai_confidence < LOW_THRESHOLD:
        label = human_labeler.label(sample)
    ELSE:
        # Both label, check agreement
        ai_label = ai_labeler.label(sample)
        human_label = human_labeler.label(sample)
        
        IF ai_label == human_label:
            label = ai_label
        ELSE:
            label = human_labeler.resolve(sample, ai_label)

Comparing Approaches

RLHF vs Constitutional AI

| Aspect | RLHF | Constitutional AI |
|--------|------|-------------------|
| Feedback Source | Human labelers | AI + principles |
| Scalability | Limited by human bandwidth | Highly scalable |
| Cost | Expensive | Much cheaper |
| Transparency | Implicit in labeler choices | Explicit principles |
| Consistency | Varies between labelers | Consistent with principles |
| Novel Situations | Requires new human labels | Can apply principles |
| Bias Risk | Inherits labeler biases | Inherits principle design biases |
| Auditability | Hard to audit preferences | Constitution is auditable |

When to Use Which

Use RLHF when:

  • High stakes require human judgment
  • Preferences are subtle or cultural
  • You need to capture implicit norms
  • Building initial training data

Use Constitutional AI when:

  • Scaling beyond human labeling capacity
  • Consistency is critical
  • You want auditable alignment
  • Principles can be clearly articulated

Use Hybrid when:

  • You need both scale and nuance
  • Building production systems
  • Continuous improvement is needed

Limitations and Challenges

RLHF Limitations

1. Reward Hacking

The model can find ways to get high rewards without being genuinely helpful:

REWARD HACKING EXAMPLES:
- Excessive verbosity (longer = seems more thorough)
- Sycophancy (agreeing with user = higher ratings)
- Confident hallucination (certainty scores well)
- Avoiding difficult topics (safe = higher ratings)

2. Preference Inconsistency

Human labelers often disagree:

LABELER DISAGREEMENT SOURCES:
- Different cultural backgrounds
- Different expertise levels
- Fatigue and attention lapses
- Ambiguous evaluation criteria
- Personal biases and values

3. Goodhart's Law

As explored in Part 1, optimizing for reward model scores eventually diverges from true preferences.

Constitutional AI Limitations

1. Principle Specification

Principles can be:

  • Too vague to apply consistently
  • Too specific to generalize
  • Conflicting in edge cases
  • Incomplete for novel situations

2. AI Critique Failures

The AI might:

  • Fail to recognize subtle harms
  • Apply principles inconsistently
  • Have blind spots from training
  • Be fooled by sophisticated harmful prompts

3. Constitution Design Bias

The principles themselves encode the values of their authors—there's no escape from human judgment, only a change in where it enters.


Recent Developments (2024-2026)

Constitutional Classifiers (Anthropic, 2025)

Anthropic's latest advancement uses constitutional principles to train specialized classifiers:

"We have developed a new approach called Constitutional Classifiers that was able to withstand over 3,000 hours of red teaming with no universal jailbreak found."

Key results:

  • 99%+ harmful content blocked
  • Minimal false positive rate on legitimate requests
  • Resistant to known jailbreak techniques

RLOO (REINFORCE Leave-One-Out)

Alternative to PPO that's simpler and sometimes more effective:

RLOO Advantages:

  • No separate value model needed
  • More stable training
  • Comparable or better results
  • Simpler implementation

Direct Preference Optimization (DPO)

Bypasses reward model training entirely:

DPO Approach:

  • Train directly on preference pairs
  • No RL phase required
  • Simpler pipeline
  • Comparable results to RLHF

Trade-offs:

  • ✅ Simpler implementation
  • ✅ More stable training
  • ❌ Less flexible
  • ❌ Can't easily update preferences
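
For reference, the DPO objective fits in a few lines. A sketch assuming per-sequence log-probabilities of the chosen and rejected responses have already been computed under the policy and a frozen reference model:

import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit reward of a response: beta * log(pi(y|x) / pi_ref(y|x))
    chosen_rewards = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_rewards = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between chosen and rejected responses
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()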

Multi-Objective Alignment

Modern systems optimize for multiple goals simultaneously:

Multi-Objective Training targets:

  • Helpfulness
  • Harmlessness
  • Honesty
  • Instruction following
  • Factual accuracy
  • Style/tone

Each objective can have its own reward signal, combined with learned or hand-tuned weights.
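
A simple baseline is a fixed weighted sum of the per-objective scores; a minimal sketch with hypothetical reward heads and hand-tuned weights:

def combined_reward(prompt, response, reward_models, weights):
    # reward_models and weights: dicts keyed by objective name (hypothetical setup)
    return sum(weights[name] * rm(prompt, response)
               for name, rm in reward_models.items())

# e.g. weights = {"helpfulness": 1.0, "harmlessness": 2.0, "honesty": 1.0}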


Practical Implementation

Getting Started with RLHF

For practitioners looking to implement RLHF, several open-source tools are available:

Hugging Face TRL

# TRL (Transformers Reinforcement Learning)
# https://github.com/huggingface/trl

PSEUDO-CODE: Basic TRL Setup

# 1. Load base model
model = AutoModelForCausalLM.from_pretrained("base-model")
tokenizer = AutoTokenizer.from_pretrained("base-model")

# 2. Prepare reward model
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "reward-model"
)

# 3. Configure PPO trainer
ppo_config = PPOConfig(
    learning_rate=1.4e-5,
    batch_size=256,
    mini_batch_size=64,
    gradient_accumulation_steps=1,
    ppo_epochs=4,
    max_grad_norm=0.5,
)

ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=model,
    ref_model=None,  # Uses copy of model
    tokenizer=tokenizer,
    reward_model=reward_model,
)

# 4. Training loop
FOR batch in dataloader:
    # Generate responses
    response_tensors = ppo_trainer.generate(batch["input_ids"])
    
    # Compute rewards
    rewards = reward_model(response_tensors)
    
    # PPO update
    stats = ppo_trainer.step(batch["input_ids"], response_tensors, rewards)

Key Resources

| Resource | URL | Purpose |
|----------|-----|---------|
| TRL | github.com/huggingface/trl | RLHF implementation |
| TRLX | github.com/CarperAI/trlx | Distributed RLHF |
| Anthropic HH Dataset | huggingface.co/datasets/Anthropic/hh-rlhf | Preference data |
| OpenAssistant | huggingface.co/datasets/OpenAssistant | Open preference data |

Implementing Constitutional Self-Critique

PYTHON SKETCH: Simple Constitutional Critique

constitution = [
    "The response should not help with illegal activities.",
    "The response should not contain harmful stereotypes.",
    "The response should acknowledge uncertainty when appropriate.",
    "The response should be respectful and professional.",
]

def critique_response(model, prompt, response):
    critiques = []
    
    for principle in constitution:
        critique_prompt = f"""
        Evaluate this response against the following principle:
        
        PRINCIPLE: {principle}
        
        ORIGINAL PROMPT: {prompt}
        
        RESPONSE: {response}
        
        Does this response violate the principle? 
        If yes, explain how. If no, say "No violation."
        """
        
        critique = model.generate(critique_prompt)
        
        if "No violation" not in critique:
            critiques.append({
                "principle": principle,
                "critique": critique
            })
    
    return critiques

def format_critiques(critiques):
    # Render the collected critiques as a numbered list for the revision prompt.
    return "\n".join(
        f"{i + 1}. Principle: {c['principle']}\n   Critique: {c['critique']}"
        for i, c in enumerate(critiques)
    )

def revise_response(model, prompt, response, critiques):
    if not critiques:
        return response  # No revision needed
    
    revision_prompt = f"""
    The following response needs revision based on these critiques:
    
    ORIGINAL PROMPT: {prompt}
    
    ORIGINAL RESPONSE: {response}
    
    CRITIQUES:
    {format_critiques(critiques)}
    
    Please provide a revised response that addresses all critiques
    while still being helpful.
    """
    
    revised = model.generate(revision_prompt)
    return revised
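
Usage is then a two-step call, assuming model exposes a generate(prompt) -> str method:

critiques = critique_response(model, user_prompt, draft_response)
final_response = revise_response(model, user_prompt, draft_response, critiques)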

FAQ

Q: Is RLHF the same as fine-tuning? A: No. Fine-tuning (supervised) teaches the model to reproduce specific outputs. RLHF teaches the model to produce outputs that score highly on a learned preference function. RLHF builds on fine-tuning—you typically do supervised fine-tuning first, then RLHF.

Q: Why use PPO instead of simpler RL algorithms? A: PPO is stable and sample-efficient, which is critical when each sample requires expensive LLM inference. Simpler algorithms like REINFORCE have high variance; more complex algorithms like TRPO are computationally expensive. PPO hits a sweet spot.

Q: Can Constitutional AI work without any human feedback? A: In theory, yes—the original paper demonstrated training without human labels for harmlessness. In practice, you still need humans to design the constitution and verify it works as intended. The human judgment is front-loaded rather than eliminated.

Q: How do I know if my RLHF training is working? A: Monitor: (1) Reward model scores increasing, (2) KL divergence staying bounded, (3) Human evaluations improving, (4) No reward hacking behaviors. If rewards spike but quality drops, you're likely reward hacking.

Q: What's the relationship between RLHF and safety? A: RLHF is a tool for alignment, but not a complete safety solution. It helps models follow human preferences, but those preferences may be incomplete or incorrectly specified. RLHF doesn't solve specification gaming or guarantee robustness to adversarial inputs.

Q: How much human feedback data do I need? A: InstructGPT used ~50,000 preference comparisons. Smaller models may need less; larger models may need more. Quality matters more than quantity—consistent, high-quality labels from trained annotators outperform large amounts of noisy data.


Conclusion

RLHF and Constitutional AI represent our best current approaches to teaching AI systems human values. They're not perfect—both can be gamed, both encode biases, and both require careful implementation. But they dramatically improve on pure language modeling.

Key Takeaways:

  1. RLHF transforms prediction into preference — Models learn what humans prefer, not just what they write
  2. The three-stage pipeline is standard — SFT → Reward Model → RL Fine-tuning
  3. Constitutional AI adds transparency — Explicit principles instead of implicit preferences
  4. RLAIF enables scale — AI feedback reduces human labeling costs
  5. Neither approach is complete — Both are tools, not solutions to alignment

Understanding these techniques is essential for anyone building or deploying modern language models. They're the foundation upon which current AI safety practices are built.


📚 Responsible AI Series

| Part | Article | Status |
|------|---------|--------|
| 1 | Understanding AI Alignment | Published |
| 2 | RLHF & Constitutional AI | You are here |
| 3 | AI Interpretability with LIME & SHAP | Coming Soon |
| 4 | Automated Red Teaming with PyRIT | Coming Soon |
| 5 | AI Runtime Governance & Circuit Breakers | Coming Soon |

← Previous: Understanding AI Alignment
Next →: AI Interpretability with LIME & SHAP


🚀 Ready to Master Responsible AI?

Our training modules cover practical implementation of AI safety techniques, from prompt engineering to production governance.

📚 Explore Our Training Modules | Start Module 0


References:

  • Christiano, P., Leike, J., Brown, T. B., Martic, M., Legg, S., & Amodei, D. (2017). Deep Reinforcement Learning from Human Preferences.
  • Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms.
  • Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT).
  • Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback.
  • Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.
  • Anthropic (2025). Constitutional Classifiers: Defending Against Universal Jailbreaks.

Last Updated: January 29, 2026
Part 2 of the Responsible AI Engineering Series
