KARL: Knowledge Agents via Reinforcement Learning
Databricks trained an RL-based search agent on GLM 4.5 Air that beats Claude 4.6 and GPT 5.2 on enterprise knowledge retrieval — at a fraction of the cost.
The One-Sentence Version
Databricks took a cheap MoE (Mixture of Experts: an architecture with many expert sub-networks, only a few activated per input; big model, small per-token cost) model (GLM 4.5 Air), trained it with GRPO (Group Relative Policy Optimization: generates a group of outputs, ranks them relative to each other, reinforces the best; no value model needed) to do multi-step document search, and it now outperforms Claude 4.6 and GPT 5.2 on enterprise knowledge retrieval benchmarks — while costing dramatically less per query.
What: An RL-trained search agent that learns when to search, what to search for, and when to stop. Why it matters: Proves you don't need frontier models for knowledge retrieval — a well-trained smaller model with good search behavior beats them. The trick: GRPO + off-policy training + MoE base model = high capability at low cost.
What Problem Are They Solving?
RAG (Retrieval-Augmented Generation) is the industry standard for knowledge retrieval, but it's fundamentally limited. Traditional RAG does a single retrieval step — it takes the user's query, searches a corpus, and stuffs the results into the context window. This works for simple questions but falls apart when:
- The answer spans multiple documents
- The initial query doesn't surface the right information
- The model needs to refine its understanding iteratively
- Complex reasoning is required to connect retrieved facts
Agentic search fixes this by letting the model decide how to search. Instead of one-shot retrieval, the model can issue multiple queries, read results, reformulate, and search again — like a human researcher would.
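That loop can be sketched in a few lines. Everything here is an illustrative assumption, not KARL's actual API: the `next_action` interface and the toy policy are made up, and the `read` action is omitted for brevity.

```python
def agentic_search(question, next_action, search_fn, max_steps=8):
    """Run the search/answer loop, letting the policy decide each step.

    `next_action(question, context)` returns an (action, argument) pair;
    this interface is a hypothetical sketch, not KARL's real API.
    """
    context = []  # evidence accumulated across search steps
    for _ in range(max_steps):
        action, arg = next_action(question, context)
        if action == "search":
            context.append(search_fn(arg))   # add retrieved results to evidence
        elif action == "answer":
            return arg                       # the policy decided it knows enough
    return None                              # step budget exhausted without an answer

# Toy scripted policy: search once, then answer from what was found.
def toy_policy(question, context):
    if not context:
        return ("search", question)
    return ("answer", f"answered after {len(context)} search(es)")

def toy_search(query):
    return [f"doc matching: {query}"]
```

The point of the sketch is the control flow: the model, not a fixed pipeline, chooses between searching again and committing to an answer.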
The problem? Running GPT 5.2 or Claude 4.6 as your search agent is expensive. You're paying frontier-model prices for every search step, every reformulation, every "let me try a different query." For enterprise use cases with millions of daily queries, this doesn't scale economically.
The Architecture
KARL's approach is straightforward once you see it:
1. Start with a cheap, capable base model. GLM 4.5 Air is a MoE model: massive parameter count (for capability) but low per-token cost (because only a fraction of parameters activate per token).

2. Define the action space. The agent can do three things: `search(query)`, `read(document)`, and `answer(response)`. That's it. No complex tool-use frameworks, no multi-modal pipelines. Just search, read, answer.

3. Train with GRPO. GRPO generates multiple search trajectories for each question, evaluates them with a nugget-based reward, and reinforces the better trajectories.

4. Use off-policy training for efficiency. Off-policy RL (learning from data generated by a different or older policy; more sample-efficient but can be less stable) lets them reuse trajectories from older versions of the model, dramatically reducing the number of environment interactions needed.
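The group-relative scoring at the heart of GRPO can be sketched as follows (`grpo_advantages` is a hypothetical helper for illustration, not the paper's code):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages for one question's group of trajectories.

    Each trajectory is scored against its own group's mean and std, which is
    what lets GRPO drop the separate value model that PPO would need.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against an all-equal group
    return [(r - mean) / std for r in rewards]
```

Trajectories above the group mean get positive advantages and are reinforced; those below are pushed down, with no learned critic anywhere in the loop.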
The Reward Signal
This is where it gets clever. The reward function uses nugget-based evaluation:
- Take the ground-truth answer and extract key facts (nuggets)
- Have an LLM judge check how many nuggets appear in the agent's response
- Score = proportion of nuggets captured
This is more informative than binary correct/incorrect because the agent gets partial credit. Found 7 out of 10 key facts? That's a 0.7 reward, not a 0. This denser reward signal makes RL training much more stable.
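A minimal sketch of that reward, assuming the judge is any callable that checks one nugget against the response (the paper uses an LLM judge; the function shape here is an assumption):

```python
def nugget_reward(response, nuggets, judge):
    """Reward = fraction of ground-truth nuggets the judge finds in the response.

    `judge(nugget, response) -> bool` stands in for the paper's LLM judge;
    any callable with that shape works for this sketch.
    """
    if not nuggets:
        return 0.0
    hits = sum(judge(nugget, response) for nugget in nuggets)
    return hits / len(nuggets)
```

Swapping the toy judge for an LLM call is the only change needed to go from sketch to the dense, partial-credit signal described above.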
The Off-Policy Trick
Standard GRPO is on-policy — you generate trajectories with your current model, compute rewards, update, repeat. This is expensive because every training step requires fresh rollouts.
KARL adds importance sampling to make GRPO work off-policy. They can reuse trajectories from previous iterations, with importance weights correcting for the distribution shift. In practice, they reuse trajectories for up to 3 iterations before the importance weights become too extreme.
This roughly 3x reduction in environment interactions matters enormously when each "environment interaction" means running multiple search queries against a document corpus.
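The correction can be sketched as a per-trajectory importance weight computed from summed token log-probs (the clipping bound and the trajectory-level granularity are assumptions for illustration, not details from the paper):

```python
import math

def importance_weight(logp_current, logp_behavior, clip=5.0):
    """Importance weight pi_current(tau) / pi_behavior(tau) for a reused trajectory.

    Inputs are total log-probabilities of the trajectory under the current
    policy and under the (older) policy that generated it. The clip keeps
    stale trajectories with extreme ratios from dominating the update.
    """
    ratio = math.exp(logp_current - logp_behavior)
    return min(max(ratio, 1.0 / clip), clip)
```

As the policy drifts away from the one that generated a trajectory, the ratio saturates at the clip bound, which is why reuse stops paying off after a few iterations.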
The Results
The headline numbers [Databricks 2026]:
| Model | FRAMES Acc | BROWNIE F1 | Cost/Query |
|---|---|---|---|
| GPT 5.2 (agentic) | 72.1% | 0.68 | $$$$ |
| Claude 4.6 (agentic) | 70.8% | 0.71 | $$$$ |
| KARL (GLM 4.5 Air) | 74.3% | 0.73 | $ |
| GLM 4.5 Air (vanilla RAG) | 58.2% | 0.52 | $ |
The important comparisons:
- KARL vs vanilla RAG on the same base model: +16.1% accuracy, +0.21 F1. The RL training is doing real work — it's not just prompt engineering.
- KARL vs frontier models: Competitive or better, at a fraction of the cost. The MoE architecture means KARL's per-token cost is ~5-8x lower than GPT 5.2.
The most interesting result isn't that KARL beats frontier models — it's that the gap between "cheap model + good search behavior" and "expensive model + good search behavior" is so small. Search behavior matters more than raw model capability for this task class.
What Search Behavior Did It Learn?
The paper includes a nice analysis of emergent search behaviors:
- Query decomposition: KARL learns to break complex questions into sub-queries without being explicitly trained to do so.
- Adaptive depth: For simple questions, KARL often answers after 1-2 searches. For complex multi-hop questions, it'll do 5-8 search iterations. It learned when to stop.
- Query refinement: When initial results are poor, KARL reformulates queries using information from partial results — a form of iterative hypothesis refinement.
- Redundancy avoidance: KARL rarely re-searches for information it's already found. It maintains an implicit "working memory" of what it knows.
The KL Divergence Question
One interesting detail: they use a relatively low KL penalty (β = 0.02) compared to typical RLHF work. (KL divergence measures how much one probability distribution differs from another; the penalty keeps an RL-trained model from drifting too far from its original behavior.) The paper argues the low value works because the search task has a more constrained action space than open-ended chat: there are fewer ways to go pathologically wrong when your actions are limited to search/read/answer.
They also show ablations where higher KL penalties (β = 0.1, 0.5) significantly hurt performance — the model doesn't deviate enough from the base policy to learn good search strategies. This suggests the base model's default search behavior is quite poor and needs substantial modification.
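A sketch of how such a penalty enters the reward, using the simple log-ratio KL estimator (the estimator choice is an assumption; the paper's exact formulation may differ):

```python
def kl_penalized_reward(task_reward, logp_policy, logp_ref, beta=0.02):
    """Shape the task reward with a KL penalty toward the reference policy.

    Uses the estimator KL ~= log pi_theta(tau) - log pi_ref(tau), with
    log-probs summed over the trajectory. beta=0.02 matches the low penalty
    reported; raising beta punishes any deviation from the base policy harder.
    """
    kl_estimate = logp_policy - logp_ref
    return task_reward - beta * kl_estimate
```

With β = 0.02 the penalty barely dents the nugget reward, which is exactly why the trained policy is free to deviate far from the base model's weak default search behavior.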
Test-Time Compute Scaling
KARL demonstrates clean test-time compute scaling: spending more inference compute buys better answers. Allowing more search steps monotonically improves performance up to about 8 steps, then plateaus. The paper frames this as a Pareto-optimal trade-off between cost and accuracy (a point where you can't improve one objective without worsening the other):
- 1-2 steps: Fast but inaccurate (~62% on FRAMES)
- 4-5 steps: Good sweet spot (~71%)
- 8+ steps: Diminishing returns (~74%)
This means you can tune the cost-accuracy trade-off at deployment time by limiting search depth. Need cheap and fast? Cap at 2 steps. Need maximum accuracy? Let it run.
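That deployment knob can be as simple as a profile table. The profile names and tiering below are an illustrative sketch; the accuracy numbers are the ones reported above.

```python
# FRAMES accuracies per depth tier come from the reported scaling curve;
# the profile names and this tiering scheme are hypothetical.
DEPTH_PROFILES = {
    "fast": 2,       # ~62% FRAMES accuracy, lowest cost per query
    "balanced": 5,   # ~71%, the sweet spot
    "max": 8,        # ~74%, returns diminish beyond this
}

def max_search_steps(profile="balanced"):
    """Pick the search-depth cap for a given cost/accuracy profile."""
    return DEPTH_PROFILES[profile]
```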
What's Actually New Here?
Let's be honest about what's novel and what isn't:
Genuinely new:
- Off-policy GRPO with importance sampling for search agents
- Nugget-based reward for RL training (vs binary rewards)
- Systematic analysis of learned search behaviors
Not new but well-executed:
- Using RL to train search agents (see WebGPT [Nakano et al. 2022], Search-Agent [Various 2025])
- MoE models for cost efficiency
- GRPO itself (from DeepSeek [DeepSeek 2024])
The real contribution is the engineering: showing that you can combine these pieces into a system that's both cheaper and better than throwing frontier models at the problem. That's a practical contribution, not a theoretical one.
Open question: How well does this transfer to domains with very different document structures? The benchmarks (FRAMES, BROWNIE) test on web-style documents. Enterprise corpora often have tables, forms, structured data — would KARL's learned search behaviors generalize?
The Bigger Picture
KARL is part of a broader trend: the "small model + learned behavior" approach outperforming "big model + prompted behavior" for specific task classes. We've seen this in coding (DeepSeek Coder), math (various), and now knowledge retrieval.
The implication for enterprise AI is significant. Instead of paying for frontier model API calls, you can:
- Take a cheap MoE model
- Train task-specific behaviors with RL
- Deploy at a fraction of the cost
- Get equal or better performance on your specific use case
The catch? You need RL infrastructure, evaluation pipelines, and ML engineering talent. This isn't something you do with a prompt and an API key. But for organizations already invested in ML infrastructure (like, say, Databricks' customers), the economics are compelling.
Value model elimination via GRPO is particularly important here: in traditional RL (like PPO), a separate value network estimates expected future reward, and GRPO replaces it with group-relative comparisons. That halves the parameter overhead of RL training and removes a major source of instability.
Bottom Line
KARL is a well-executed engineering paper that demonstrates a practical path to cheaper, better knowledge retrieval. The core insight — that search behavior matters more than raw model capability — has implications beyond this specific application.
Read the paper if you're interested in RL for agents, cost-efficient LLM deployment, or the future of enterprise search. Skip it if you're looking for fundamental algorithmic breakthroughs — this is applied ML at its best, not theoretical innovation.
Paper: KARL: Knowledge Agents via Reinforcement Learning (Databricks, March 2026, 77 pages)