Glossary
Technical terms defined without the bullshit.
- GRPO
- Group Relative Policy Optimization. An RL algorithm that skips the value model entirely — instead of estimating absolute rewards, it generates a group of outputs and ranks them relative to each other. Outputs scoring above the group average get reinforced; those below it get penalized. Simpler than PPO, cheaper to train, and surprisingly effective.
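The group-relative trick is small enough to show inline. A minimal sketch (not GRPO's full loss, just the advantage computation it substitutes for a value model): normalize each sampled output's reward against its own group.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: each output's reward minus the group
    mean, divided by the group std. No value network involved."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Four sampled outputs for one prompt, scored by some reward function:
advs = grpo_advantages([1.0, 0.0, 0.5, 0.5])
# The 1.0-reward output gets a positive advantage, the 0.0 one negative,
# the two average outputs get zero.
```

Outputs with positive advantage have their log-probabilities pushed up during the policy update; negative ones are pushed down.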
- KL Divergence
- Kullback-Leibler Divergence. A measure of how much one probability distribution differs from another. In RL fine-tuning, it's used as a penalty to stop the model from drifting too far from its original behavior. Too little KL penalty = model collapses into degenerate outputs. Too much = model barely learns anything.
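For discrete distributions the formula is a one-liner. A sketch of the penalty term (the 0.1 coefficient is an arbitrary illustration, not a recommended value):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions over the same support.
    Zero when p == q, grows as p drifts away from q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

policy    = [0.7, 0.2, 0.1]  # current model's distribution over 3 tokens
reference = [0.5, 0.3, 0.2]  # frozen reference (pre-fine-tuning) model
penalty = 0.1 * kl_divergence(policy, reference)  # beta * KL, subtracted from reward
```

Note KL is asymmetric: `kl_divergence(p, q)` generally differs from `kl_divergence(q, p)`, which is why which model goes first matters.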
- MoE
- Mixture of Experts. An architecture where the model has many "expert" sub-networks but only activates a few for each input. This lets you have a massive parameter count (making the model capable) while only using a fraction of the compute per token (making it cheap). GLM 4.5 Air uses MoE — big model, small per-token cost.
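A toy sketch of top-k routing — the core MoE move, not GLM 4.5 Air's actual router (experts here are plain functions standing in for sub-networks):

```python
import math

def moe_forward(x, experts, router_logits, k=2):
    """Run only the top-k experts by router score; combine their outputs
    weighted by a softmax over the selected logits. The other experts
    cost nothing for this input."""
    top = sorted(range(len(experts)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    weights = [math.exp(router_logits[i]) for i in top]
    total = sum(weights)
    return sum((w / total) * experts[i](x) for i, w in zip(top, weights))

# Four "experts", but only two ever run per input:
experts = [lambda x: x * 2, lambda x: x + 1, lambda x: x - 3, lambda x: 0.0]
out = moe_forward(1.0, experts, router_logits=[2.0, 1.0, -1.0, 0.0], k=2)
```

Total parameters scale with the number of experts; per-token compute scales only with `k`.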
- Nugget-Based Evaluation
- An evaluation method for information retrieval where key facts ("nuggets") are identified in a reference answer, then an LLM judge checks how many of those nuggets appear in the model's output. More granular than binary correct/incorrect — captures partial credit and information completeness.
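The scoring itself reduces to nugget recall. A sketch with a toy substring judge standing in for the LLM judge (real setups call a model per nugget):

```python
def nugget_recall(nuggets, judge):
    """Fraction of reference nuggets the judge finds in the answer.
    `judge` is any callable returning True/False per nugget."""
    return sum(1 for n in nuggets if judge(n)) / len(nuggets)

answer = "Paris is the capital of France, population about 2 million."
judge = lambda nugget: nugget.lower() in answer.lower()  # stand-in for an LLM judge

score = nugget_recall(
    ["capital of France", "population about 2 million", "founded by the Parisii"],
    judge,
)
# 2 of 3 nuggets present -> partial credit instead of a binary fail
```

That graded score is the whole point: an answer covering two of three key facts scores 0.67, not 0.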
- Off-Policy RL
- Reinforcement learning where the agent learns from data generated by a different policy (or an older version of itself). Contrast with on-policy RL, where the agent can only learn from its own most recent outputs. Off-policy is more sample-efficient but can be less stable. KARL uses importance sampling to do off-policy training with GRPO.
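The importance-sampling correction is a probability ratio. A minimal sketch of the idea — the clipping constant and the exact way KARL folds this into GRPO are assumptions here, not its published recipe:

```python
import math

def off_policy_advantage(logp_new, logp_old, advantage, clip=5.0):
    """Reweight an advantage computed under an old policy by the ratio
    p_new / p_old (in log space for stability), clipped so a single
    stale sample can't dominate the update."""
    ratio = math.exp(logp_new - logp_old)
    return min(ratio, clip) * advantage

# Old policy found this output likely; new policy finds it less likely,
# so its (stale) advantage is down-weighted:
corrected = off_policy_advantage(logp_new=-2.0, logp_old=-1.0, advantage=1.0)
```

When the two policies agree (`logp_new == logp_old`), the ratio is 1 and the advantage passes through unchanged — on-policy is the special case.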
- Pareto Optimal
- A state where you can't improve one objective without making another worse. In KARL's context: the best possible trade-off between search quality and computational cost. Being "Pareto optimal" means there's no free lunch left — you're on the efficiency frontier.
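The frontier is easy to compute for a small set of configurations. A sketch, using made-up (quality, cost) pairs where quality is maximized and cost minimized:

```python
def pareto_frontier(points):
    """Keep the points no other point dominates — i.e. nothing is at
    least as good on both objectives and strictly better on one.
    Points are (quality, cost): higher quality, lower cost is better."""
    frontier = []
    for q, c in points:
        dominated = any(q2 >= q and c2 <= c and (q2 > q or c2 < c)
                        for q2, c2 in points)
        if not dominated:
            frontier.append((q, c))
    return frontier

configs = [(0.9, 10), (0.8, 4), (0.8, 6), (0.5, 1)]
# (0.8, 6) is dominated by (0.8, 4): same quality, higher cost.
# The other three are genuine trade-offs — all on the frontier.
```

Picking between frontier points is then a judgment call about how much quality a unit of compute is worth; the frontier just rules out the strictly wasteful options.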
- Test-Time Compute
- Compute spent during inference (when the model is actually answering questions), as opposed to training compute. Agentic search is a form of test-time compute scaling — the model does more work at inference time (multiple searches, reasoning steps) to get better answers. The key insight: sometimes it's cheaper to scale test-time compute than to train a bigger model.
- Value Model
- In traditional RL (like PPO), a separate neural network that estimates the expected future reward from a given state. Training a value model roughly doubles your trainable parameter count and adds complexity. GRPO eliminates this by using group-relative comparisons instead of absolute value estimates.
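The contrast is concrete in code. A sketch of what a value model does in PPO-style training versus the GRPO replacement (both toy stand-ins, not real training code):

```python
import statistics

# PPO-style: a learned function (a whole second network in practice)
# maps each state to an estimated expected reward, and the advantage
# is reward minus that estimate.
value_model = lambda state: 0.5  # stand-in for a trained value network
ppo_advantage = lambda state, reward: reward - value_model(state)

# GRPO: no learned estimator; the group of sampled outputs IS the baseline.
def grpo_advantage(reward, group_rewards):
    return reward - statistics.mean(group_rewards)

a_ppo  = ppo_advantage("some state", 0.8)          # needs the extra network
a_grpo = grpo_advantage(0.8, [0.8, 0.2, 0.5, 0.5]) # needs only the group
```

Same role — a baseline to subtract from the reward — but one version is a second model you must train and the other is a `mean()` over samples you already generated.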