Glossary
Technical terms defined without the bullshit.
- GRPO
- Group Relative Policy Optimization. An RL algorithm that skips the value model entirely — instead of estimating absolute rewards, it generates a group of outputs and ranks them relative to each other. Outputs scoring above the group average get reinforced; those below it get penalized. Simpler than PPO, cheaper to train, and surprisingly effective.
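The group-relative trick is small enough to show inline. A minimal sketch (not GRPO's full loss, just the advantage computation it substitutes for a value model): normalize each sampled output's reward against its own group.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: each output's reward minus the group
    mean, divided by the group std. No value network involved."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Four sampled outputs for one prompt, scored by some reward function:
advs = grpo_advantages([1.0, 0.0, 0.5, 0.5])
# The 1.0-reward output gets a positive advantage, the 0.0 one negative,
# the two average outputs get zero.
```

Outputs with positive advantage have their log-probabilities pushed up during the policy update; negative ones are pushed down.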
- KL Divergence
- Kullback-Leibler Divergence. A measure of how much one probability distribution differs from another. In RL fine-tuning, it's used as a penalty to stop the model from drifting too far from its original behavior. Too little KL penalty = model collapses into degenerate outputs. Too much = model barely learns anything.
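For discrete distributions the formula is a one-liner. A sketch of the penalty term (the 0.1 coefficient is an arbitrary illustration, not a recommended value):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions over the same support.
    Zero when p == q, grows as p drifts away from q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

policy    = [0.7, 0.2, 0.1]  # current model's distribution over 3 tokens
reference = [0.5, 0.3, 0.2]  # frozen reference (pre-fine-tuning) model
penalty = 0.1 * kl_divergence(policy, reference)  # beta * KL, subtracted from reward
```

Note KL is asymmetric: `kl_divergence(p, q)` generally differs from `kl_divergence(q, p)`, which is why which model goes first matters.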
- MoE
- Mixture of Experts. An architecture where the model has many "expert" sub-networks but only activates a few for each input. This lets you have a massive parameter count (making the model capable) while only using a fraction of the compute per token (making it cheap). GLM 4.5 Air uses MoE — big model, small per-token cost.
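A toy sketch of top-k routing — the core MoE move, not GLM 4.5 Air's actual router (experts here are plain functions standing in for sub-networks):

```python
import math

def moe_forward(x, experts, router_logits, k=2):
    """Run only the top-k experts by router score; combine their outputs
    weighted by a softmax over the selected logits. The other experts
    cost nothing for this input."""
    top = sorted(range(len(experts)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    weights = [math.exp(router_logits[i]) for i in top]
    total = sum(weights)
    return sum((w / total) * experts[i](x) for i, w in zip(top, weights))

# Four "experts", but only two ever run per input:
experts = [lambda x: x * 2, lambda x: x + 1, lambda x: x - 3, lambda x: 0.0]
out = moe_forward(1.0, experts, router_logits=[2.0, 1.0, -1.0, 0.0], k=2)
```

Total parameters scale with the number of experts; per-token compute scales only with `k`.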
- Nugget-Based Evaluation
- An evaluation method for information retrieval where key facts ("nuggets") are identified in a reference answer, then an LLM judge checks how many of those nuggets appear in the model's output. More granular than binary correct/incorrect — captures partial credit and information completeness.
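The scoring itself reduces to nugget recall. A sketch with a toy substring judge standing in for the LLM judge (real setups call a model per nugget):

```python
def nugget_recall(nuggets, judge):
    """Fraction of reference nuggets the judge finds in the answer.
    `judge` is any callable returning True/False per nugget."""
    return sum(1 for n in nuggets if judge(n)) / len(nuggets)

answer = "Paris is the capital of France, population about 2 million."
judge = lambda nugget: nugget.lower() in answer.lower()  # stand-in for an LLM judge

score = nugget_recall(
    ["capital of France", "population about 2 million", "founded by the Parisii"],
    judge,
)
# 2 of 3 nuggets present -> partial credit instead of a binary fail
```

That graded score is the whole point: an answer covering two of three key facts scores 0.67, not 0.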
- Off-Policy RL
- Reinforcement learning where the agent learns from data generated by a different policy (or an older version of itself). Contrast with on-policy RL, where the agent can only learn from its own most recent outputs. Off-policy is more sample-efficient but can be less stable. KARL uses importance sampling to do off-policy training with GRPO.
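The importance-sampling correction is a probability ratio. A minimal sketch of the idea — the clipping constant and the exact way KARL folds this into GRPO are assumptions here, not its published recipe:

```python
import math

def off_policy_advantage(logp_new, logp_old, advantage, clip=5.0):
    """Reweight an advantage computed under an old policy by the ratio
    p_new / p_old (in log space for stability), clipped so a single
    stale sample can't dominate the update."""
    ratio = math.exp(logp_new - logp_old)
    return min(ratio, clip) * advantage

# Old policy found this output likely; new policy finds it less likely,
# so its (stale) advantage is down-weighted:
corrected = off_policy_advantage(logp_new=-2.0, logp_old=-1.0, advantage=1.0)
```

When the two policies agree (`logp_new == logp_old`), the ratio is 1 and the advantage passes through unchanged — on-policy is the special case.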
- Pareto Optimal
- A state where you can't improve one objective without making another worse. In KARL's context: the best possible trade-off between search quality and computational cost. Being "Pareto optimal" means there's no free lunch left — you're on the efficiency frontier.
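The frontier is easy to compute for a small set of configurations. A sketch, using made-up (quality, cost) pairs where quality is maximized and cost minimized:

```python
def pareto_frontier(points):
    """Keep the points no other point dominates — i.e. nothing is at
    least as good on both objectives and strictly better on one.
    Points are (quality, cost): higher quality, lower cost is better."""
    frontier = []
    for q, c in points:
        dominated = any(q2 >= q and c2 <= c and (q2 > q or c2 < c)
                        for q2, c2 in points)
        if not dominated:
            frontier.append((q, c))
    return frontier

configs = [(0.9, 10), (0.8, 4), (0.8, 6), (0.5, 1)]
# (0.8, 6) is dominated by (0.8, 4): same quality, higher cost.
# The other three are genuine trade-offs — all on the frontier.
```

Picking between frontier points is then a judgment call about how much quality a unit of compute is worth; the frontier just rules out the strictly wasteful options.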
- Test-Time Compute
- Compute spent during inference (when the model is actually answering questions), as opposed to training compute. Agentic search is a form of test-time compute scaling — the model does more work at inference time (multiple searches, reasoning steps) to get better answers. The key insight: sometimes it's cheaper to scale test-time compute than to train a bigger model.
- Value Model
- In traditional RL (like PPO), a separate neural network that estimates the expected future reward from a given state. Training a value model roughly doubles your trainable parameter count and adds complexity. GRPO eliminates this by using group-relative comparisons instead of absolute value estimates.
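The contrast is concrete in code. A sketch of what a value model does in PPO-style training versus the GRPO replacement (both toy stand-ins, not real training code):

```python
import statistics

# PPO-style: a learned function (a whole second network in practice)
# maps each state to an estimated expected reward, and the advantage
# is reward minus that estimate.
value_model = lambda state: 0.5  # stand-in for a trained value network
ppo_advantage = lambda state, reward: reward - value_model(state)

# GRPO: no learned estimator; the group of sampled outputs IS the baseline.
def grpo_advantage(reward, group_rewards):
    return reward - statistics.mean(group_rewards)

a_ppo  = ppo_advantage("some state", 0.8)          # needs the extra network
a_grpo = grpo_advantage(0.8, [0.8, 0.2, 0.5, 0.5]) # needs only the group
```

Same role — a baseline to subtract from the reward — but one version is a second model you must train and the other is a `mean()` over samples you already generated.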