Anthropic · MLE / Research

Debug GRPO / RL Training Code

Debugging · ↑ Trending · 3× last 3mo · 4× total · Last reported March 2026
Roles: Research Eng · MLE
Teams: Research · Frontier
Interviewer style: presentation-based, debugging-focused, collaborative, project presentation (15-20 min monologue)
By AceOffer · First reported October 2025 · 4× across 190+ candidate reports

The Round

Time budget
60 min, live coding
Environment
Paired notebook (Colab or shared editor)
What they show you
A complete-but-buggy GRPO training script (~100–200 lines): rollout sampling, group-relative advantage normalization, PPO-style ratio + clip loss, and a small RL environment loop. The trap is that the script looks finished and can appear to run: depending on the environment and initialization, it may go for a while before anything visibly breaks. Two bugs are numerical (you find them by running the script and watching for a sampling error or NaN advantages); the third is algorithmic (you find it by reading the ratio formula). Reports indicate the rough shape is: a `Policy` module → a `rollout(env, n_steps)` function that calls `multinomial` → a `compute_advantage(returns)` helper → a PPO/GRPO clip-objective loss → a training loop with mini-epochs over each batch. (The component breakdown is inferred from the canonical bug list, not taken verbatim from a single report.)
What you can use
PyTorch · NumPy · any standard library
Reported Outcomes (4 candidates)
1 fail
3 unknown

What You Walk Out With

  • All three bugs identified and fixed
  • Code that runs without NaN errors and produces a non-trivial training signal
  • Verbal explanation of each bug's root cause + fix
  • Answers to RL-theory follow-ups about ratio clipping and on-policy/off-policy drift
  • A diagnostic — typically `print(ratio.mean())` — that surfaces the on-policy follow-up cleanly

Problem Statement

Given a simplified but complete GRPO (Group Relative Policy Optimization) training script, identify and fix all bugs in the implementation. Reports describe at least three: (1) raw logits are passed directly to multinomial sampling without first applying softmax, which raises a RuntimeError on any negative logit or silently samples from the wrong distribution; (2) the normalized-advantage calculation divides by the standard deviation without adding a numerical-stability epsilon, so a zero-variance group produces NaN advantages; (3) a third GRPO-specific bug discoverable by anyone familiar with the algorithm (exact nature not disclosed in reports, but likely an incorrect ratio computation, e.g., computing the ratio as model_logprob − old_logprob instead of exp(model_logprob − old_logprob)). After the bugs are fixed, the interviewer asks a series of follow-up questions about training dynamics and RL theory.

Prerequisites

Brush up on these before sitting the round.
Math
softmax · log-likelihood · policy-gradient theorem (sketch) · importance-sampling ratio · KL divergence (for the clip-vs-KL follow-up if it lands)
Libraries
PyTorch (multinomial, log_softmax, gather) · torch.distributions.Categorical (cleaner than raw multinomial)
Concepts
PPO clip objective · GRPO group-relative advantage · on-policy vs. off-policy · rollout vs. update batch · mini-epoch (PPO-style multi-step update on one batch) · numerical stability (epsilon, log-sum-exp) · trust region (clip as a soft trust-region constraint)

Canonical Solution

Synthesized from candidate reports — the approach interviewers expect.
Read the code top-to-bottom and trace the data flow: (1) locate the multinomial call and prepend a softmax over logits; (2) find the advantage normalization and change `std` to `std + eps`; (3) verify the importance-sampling ratio is computed as `exp(new_logprob − old_logprob)` (not as a raw log-difference). Then reason about when the ratio can drift from 1 even in nominally on-policy GRPO: if multiple gradient steps are taken on the same rollout batch (PPO-style mini-epochs), the current policy diverges from the sampling policy within the batch, making the ratio ≠ 1.
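For reference, the corrected loss that step (3) points at can be sketched as below. This is a minimal illustration, not the interview script: the function name `clipped_surrogate_loss`, the `clip_eps=0.2` default, and the toy tensors are all assumptions made for the example.

```python
import torch

def clipped_surrogate_loss(new_logprob, old_logprob, advantage, clip_eps=0.2):
    """PPO/GRPO-style clipped surrogate; returns a scalar loss to minimize.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    # Importance-sampling ratio: exponentiate the log-prob difference.
    ratio = torch.exp(new_logprob - old_logprob)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Pessimistic (elementwise min) objective, negated for gradient descent.
    return -torch.min(unclipped, clipped).mean()

# On-policy check: identical log-probs give ratio == 1 for every sample,
# so the loss reduces to -mean(advantage).
logp = torch.log(torch.tensor([0.5, 0.25, 0.25]))
adv = torch.tensor([1.0, -0.5, 0.25])
loss = clipped_surrogate_loss(logp, logp.clone(), adv)
print(loss.item(), -adv.mean().item())
```

The on-policy check at the bottom is a cheap way to confirm the `exp` is present: with the raw log-difference the ratio would be 0 instead of 1 and the two printed numbers would disagree.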

Process Playbook

The process the interviewer rewards — distilled from candidate reports.
01
Skim the whole file once before fixing anything
Why: GRPO has multiple inter-dependent components (rollout, advantage, ratio, loss). Fixing in isolation can mask other bugs.
02
Run the script on a tiny env and watch what surfaces
Why: The missing-epsilon bug surfaces as NaN advantages whenever a group of rollouts has equal returns (common early in training and on sparse-reward envs; timing depends on the env). The missing-softmax bug raises a RuntimeError at sampling the moment any logit is negative; if a row happens to be all non-negative and finite with a non-zero sum, multinomial silently samples in proportion to the raw logits, which is wrong but does not crash. The third bug doesn't crash either: it only shows when you print the ratio.
03
Fix in this order: numerical stability first, then logic
Why: NaN advantages and mis-sampled actions contaminate every downstream value, including the log-probs that feed the ratio. Fixing them first lets you diagnose the ratio formula on clean numbers.
04
Print intermediate values: logits.shape, advantages.std(), ratio.mean()
Why: Interviewers comment positively when candidates instrument. `ratio.mean()` is the diagnostic that surfaces the on-policy follow-up: on the first mini-epoch step over a rollout batch it should print ≈1.0, and on subsequent mini-epoch steps the *empirical* mean and the ratio *distribution* shift even though E_old[π_new/π_old] is still 1 in expectation. (Empirical KL and clipped-fraction are the more sensitive diagnostics; ratio.mean() is what the interviewer's prompt explicitly references.)
05
Verbalize the ratio formula: ratio = exp(new_logprob − old_logprob)
Why: The third bug is computing ratio as the raw log-difference. Saying the formula out loud prevents this slip and gives the interviewer the verbal signal that you understand the surrogate objective.
06
Stage the on-policy follow-up answer before you're asked
Why: After the third bug is fixed, the interviewer asks why `ratio.mean()` still isn't exactly 1 on a nominally on-policy step. The expected answer is the mini-epoch loop: the policy updates between gradient steps within the same rollout batch, so by step k > 1 the sampling policy ≠ the current policy. Having this answer staged turns a stretch follow-up into a planted close.
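The mini-epoch effect described in steps 04 and 06 is easy to reproduce on a toy problem. The sketch below is an assumption-laden stand-in for the interview script: the 4-feature, 3-action linear policy, the random "advantages", the learning rate, and the 0.8/1.2 clip band are all made up for illustration. It freezes `old_logprob` at rollout time, then takes several gradient steps on the same batch while logging `ratio.mean()` and the clipped fraction.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy setup: a linear policy over 4 features and 3 actions.
policy = torch.nn.Linear(4, 3)
optimizer = torch.optim.SGD(policy.parameters(), lr=0.5)

states = torch.randn(64, 4)
with torch.no_grad():
    old_dist = torch.distributions.Categorical(logits=policy(states))
    actions = old_dist.sample()
    old_logprob = old_dist.log_prob(actions)   # frozen at rollout time
advantage = torch.randn(64)                    # stand-in for group-relative advantages

ratio_means, clip_fracs = [], []
for step in range(4):                          # mini-epochs over ONE rollout batch
    new_dist = torch.distributions.Categorical(logits=policy(states))
    new_logprob = new_dist.log_prob(actions)
    ratio = torch.exp(new_logprob - old_logprob)

    # Diagnostics: logged before the update that this step performs.
    ratio_means.append(ratio.mean().item())
    clip_fracs.append(((ratio - 1.0).abs() > 0.2).float().mean().item())

    loss = -torch.min(ratio * advantage,
                      torch.clamp(ratio, 0.8, 1.2) * advantage).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: ratio.mean()={ratio_means[-1]:.4f}  clipped={clip_fracs[-1]:.2%}")
```

On step 0 the current policy is still the sampling policy, so the ratio is 1; from step 1 onward the weights have moved while `old_logprob` has not, which is the whole answer to the "why isn't it strictly on-policy" follow-up.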

Bug Catalogue

The bugs Anthropic plants are drawn from a small recurring catalogue. Recognize them by signature, not by reading line-by-line.
#1 · Numerical stability
Signature: Sampling raises a RuntimeError (invalid multinomial distribution) the moment any logit is negative; if all logits happen to be non-negative it silently samples from the raw unnormalized logits instead of the softmax policy — wrong, but no crash. (It does NOT surface as NaN.)
Root cause: torch.multinomial expects non-negative finite weights, NOT arbitrary logits — and the weights need not sum to 1. Logits can be negative and are not probabilities, so passing them in is either an immediate error (a negative entry) or a silently-wrong distribution (all non-negative).
Fix: Sample from F.softmax(logits, dim=-1), or more cleanly torch.distributions.Categorical(logits=logits).sample().
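import torch
import torch.nn.functional as F

# Before (buggy): raw logits passed in as sampling weights
action = torch.multinomial(logits, 1)

# After: normalize to probabilities first
probs = F.softmax(logits, dim=-1)
action = torch.multinomial(probs, 1)

# Cleaner: Categorical normalizes the logits itself
# (note: .sample() returns shape [...], not [..., 1] like multinomial(probs, 1))
action = torch.distributions.Categorical(logits=logits).sample()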
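Both failure modes in the signature above can be reproduced in a few lines. This is a small sketch on CPU with made-up logits; on a GPU the negative-weight case may surface as a device-side assertion rather than a clean Python exception, so treat the `try/except` as CPU-oriented.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Case 1: a negative logit makes torch.multinomial reject the input outright.
logits_with_negative = torch.tensor([[2.0, -1.0, 0.5]])
try:
    torch.multinomial(logits_with_negative, 1)
    raised = False
except RuntimeError:
    raised = True

# Case 2: all-non-negative logits are accepted but treated as unnormalized
# weights, so sampling follows logits / logits.sum(), not softmax(logits).
logits = torch.tensor([1.0, 2.0, 3.0])
n = 20_000
raw_draws = torch.multinomial(logits, n, replacement=True)
soft_draws = torch.multinomial(F.softmax(logits, dim=-1), n, replacement=True)
raw_freq = torch.bincount(raw_draws, minlength=3) / n
soft_freq = torch.bincount(soft_draws, minlength=3) / n

print("negative logit raised RuntimeError:", raised)
print("buggy call samples   :", raw_freq, "target", logits / logits.sum())
print("intended policy      :", soft_freq, "target", F.softmax(logits, dim=-1))
```

The second case is the one worth remembering in the interview: nothing crashes, but the sampled action frequencies track the raw weights rather than the softmax policy.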
#2 · Numerical stability
Signature: NaN advantage values when all rollouts in a group return identical rewards (a 0/0 division)
Root cause: Group-relative advantage normalization divides by std without an epsilon. When all rewards in a group are equal (early training, sparse-reward envs), std=0 and division produces NaN. (Note: GRPO groups must be size > 1 — PyTorch's default unbiased std of a size-1 group is itself NaN, which epsilon won't save; use a population std / correction=0 or guarantee group size > 1.)
Fix: Add a small epsilon to the std for numerical stability. If you're using a size-1 group anywhere, also switch to `correction=0`.
# Before:
advantage = (returns - returns.mean()) / returns.std()

# After:
advantage = (returns - returns.mean()) / (returns.std(correction=0) + 1e-8)
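A per-group version makes the edge cases concrete. The sketch below assumes returns are arranged as a `(n_groups, group_size)` tensor, which is an illustrative layout and not something the reports specify; `group_advantage` is a hypothetical helper name.

```python
import torch

def group_advantage(returns, eps=1e-8):
    """Normalize returns within each group.

    Assumes `returns` has shape (n_groups, group_size); hypothetical helper.
    """
    mean = returns.mean(dim=-1, keepdim=True)
    # Population std (correction=0) stays finite even for a size-1 group.
    std = returns.std(dim=-1, keepdim=True, correction=0)
    return (returns - mean) / (std + eps)

returns = torch.tensor([[1.0, 1.0, 1.0, 1.0],    # degenerate group: every rollout tied
                        [0.0, 1.0, 0.0, 1.0]])   # informative group

# Buggy version: default (unbiased) std and no epsilon.
buggy = (returns - returns.mean(dim=-1, keepdim=True)) / returns.std(dim=-1, keepdim=True)
fixed = group_advantage(returns)
single = group_advantage(torch.tensor([[3.0]]))  # size-1 group

print("buggy:", buggy)
print("fixed:", fixed)
print("size-1 group:", single)
```

In the tied group the buggy version divides 0 by 0 and yields NaN, while the fixed version yields zero advantages, which is the sensible outcome: a group where every rollout scored the same carries no learning signal.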
#3 · Algorithm logic
Signature: Loss decreases but reward never improves; ratio.mean() prints as a value near zero (the log-difference instead of the exponentiated ratio) instead of ≈1.0; depending on the clamp implementation, clipping may trigger constantly (because the raw log-difference ≈ 0 is far below the 1−ε=0.8 lower bound) — not the absence-of-clipping symptom you might intuit.
Root cause: Importance-sampling ratio is computed as a raw log-difference (new_logprob − old_logprob) instead of exp(new_logprob − old_logprob). Without exponentiation the surrogate objective is no longer a proper importance-weighted expectation; the clip band [1−ε, 1+ε] is interpreted relative to a value that's centered near zero rather than near 1, so the objective is biased. Whether the gradient is zeroed for any particular sample depends on the advantage sign and which branch of the PPO `min()` is active — but in aggregate the update direction is wrong.
Fix: Always compute ratio as `exp` of the log-prob difference. Verify by printing `ratio.mean()`: on the first mini-epoch step it should print 1.0 up to floating-point noise (since π_new == π_old at that point). On later mini-epoch steps the empirical mean may drift, but E_old[ratio] = 1 still holds in expectation; the more sensitive drift diagnostic is the empirical KL or clipped-fraction.
# Before:
ratio = new_logprob - old_logprob   # WRONG: this is the log-ratio, not the ratio

# After:
ratio = torch.exp(new_logprob - old_logprob)
# Sanity check: on the first mini-epoch step π_new == π_old, so ratio should be 1.
# (`step` is assumed to be the mini-epoch index within the current rollout batch.)
if step == 0:
    assert torch.allclose(ratio, torch.ones_like(ratio))
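The "mean near zero, everything outside the clip band" signature can be checked numerically. The log-probs below are synthetic (a policy that has drifted by a small random amount), and `outside_band` is a hypothetical helper, so read this as an illustration of the symptom and not as output from the real script.

```python
import torch

torch.manual_seed(0)
clip_eps = 0.2

# Synthetic log-probs from a policy that has drifted only slightly.
old_logprob = torch.log(torch.rand(1000) * 0.8 + 0.1)
new_logprob = old_logprob + 0.02 * torch.randn(1000)

buggy_ratio = new_logprob - old_logprob             # log-ratio, centered near 0
fixed_ratio = torch.exp(new_logprob - old_logprob)  # true ratio, centered near 1

def outside_band(r):
    """Fraction of samples outside the clip band [1 - eps, 1 + eps]."""
    return ((r < 1 - clip_eps) | (r > 1 + clip_eps)).float().mean().item()

buggy_frac = outside_band(buggy_ratio)
fixed_frac = outside_band(fixed_ratio)

print(f"buggy: mean={buggy_ratio.mean().item():.3f}  outside band={buggy_frac:.0%}")
print(f"fixed: mean={fixed_ratio.mean().item():.3f}  outside band={fixed_frac:.0%}")
```

With the buggy formula every sample sits below the 0.8 lower bound even though the policy has barely moved, which is why clipping appears to fire constantly.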

Follow-up Arc

Probes the interviewer escalates with.
#1 · Warmup · After bugs are fixed
If the ratio were computed as `model_logprob − old_logprob` (an additive log-difference rather than the exponentiated ratio), could the model still train? Why or why not?
#2 · Core · After ratio is discussed
Why do we clip the importance-sampling ratio in PPO/GRPO? What does clipping accomplish, and when is a ratio actually clipped?
#3 · Core · After clipping is explained
In the provided code the ratio should theoretically always be 1 (on-policy). But when you print it at runtime it is not 1. Debug why.
#4 · Stretch · After on-policy vs. off-policy discussion
What are the practical training-stability consequences when the ratio is frequently clipped versus rarely clipped?
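For the "when is a ratio actually clipped" probe, it helps to see that the answer depends on the advantage sign as well as the ratio. The sketch below uses autograd to read off the per-sample gradient of the standard clipped surrogate with respect to the ratio; `surrogate_grad` is a hypothetical helper and ε = 0.2 is an assumed default.

```python
import torch

def surrogate_grad(ratio_value, advantage, clip_eps=0.2):
    """Gradient of the clipped surrogate objective w.r.t. the ratio, one sample."""
    ratio = torch.tensor(ratio_value, requires_grad=True)
    objective = torch.min(ratio * advantage,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage)
    objective.backward()
    return ratio.grad.item()

cases = {
    "high_pos": surrogate_grad(1.5, +1.0),  # ratio above band, A > 0: incentive removed
    "high_neg": surrogate_grad(1.5, -1.0),  # ratio above band, A < 0: gradient survives
    "low_pos":  surrogate_grad(0.5, +1.0),  # ratio below band, A > 0: gradient survives
    "low_neg":  surrogate_grad(0.5, -1.0),  # ratio below band, A < 0: incentive removed
    "in_band":  surrogate_grad(1.1, +1.0),  # inside the band: plain policy gradient
}
for name, grad in cases.items():
    print(f"{name:9s} d(objective)/d(ratio) = {grad:+.1f}")
```

The gradient is zeroed only when the ratio has already moved in the direction the advantage favors (too high with a positive advantage, too low with a negative one). When the ratio has moved the wrong way, the unclipped branch of the `min()` stays active and the update can still pull it back.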

Alternative Approaches

Other paths candidates have tried — and the trade-offs that came with them.
Approach: Unit-test each component independently
Trade-offs: Faster isolation of NaN sources via shape/value assertions on logits, probabilities, advantages, and ratios, but requires writing additional test harness code during the interview.
Approach: Run and inspect printed values first
Trade-offs: Quickly reveals symptoms (e.g., ratio printed ≠ 1) but may miss root-cause logic errors without careful code reading.
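If you go the unit-test route, the harness can stay very small. The three checks below are hypothetical helpers (names and messages are invented for this sketch), one per planted bug; the `fails` wrapper exists only so the demo can show which inputs trip which check.

```python
import torch

def check_probs(probs):
    """Sampling weights must be finite, non-negative, and sum to 1 per row."""
    assert torch.isfinite(probs).all(), "non-finite probability"
    assert (probs >= 0).all(), "negative weight (raw logits passed in?)"
    row_sums = probs.sum(dim=-1)
    assert torch.allclose(row_sums, torch.ones_like(row_sums)), "rows do not sum to 1"

def check_advantages(adv):
    assert torch.isfinite(adv).all(), "NaN/inf advantage (missing epsilon?)"

def check_first_step_ratio(ratio):
    assert torch.allclose(ratio, torch.ones_like(ratio)), "ratio != 1 on first step (missing exp?)"

def fails(check, tensor):
    """Return True if the check raises AssertionError on this tensor."""
    try:
        check(tensor)
        return False
    except AssertionError:
        return True

logits = torch.tensor([[2.0, -1.0, 0.5]])
logp = torch.log_softmax(logits, dim=-1)
results = {
    "raw logits as probs": fails(check_probs, logits),
    "softmax probs":       fails(check_probs, torch.softmax(logits, dim=-1)),
    "0/0 advantage":       fails(check_advantages, torch.zeros(4) / torch.zeros(4)),
    "log-diff ratio":      fails(check_first_step_ratio, logp - logp),
    "exp ratio":           fails(check_first_step_ratio, torch.exp(logp - logp)),
}
print(results)
```

Dropping these calls right after the sampling, advantage, and ratio lines costs a couple of minutes and turns silent failures into labeled ones.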

Observed Variants

The same question shows up in different shapes — base, with deep dives, with extra constraints. Be ready for any of these.
  • Base: debug NaN errors in GRPO code (softmax + epsilon fixes)
  • With three explicit bugs to find
  • With deep follow-up on ratio clipping theory
  • With follow-up: why ratio is not strictly 1 at runtime (not strictly on-policy)

Pitfalls

  • Forgetting to apply softmax before multinomial sampling — `torch.multinomial` raises a RuntimeError on any negative logit, and silently samples the WRONG distribution if logits are all non-negative (it does NOT NaN-explode). Easy to miss if your test env happens to produce non-negative logits early.
  • Computing advantage as `(returns - baseline) / std` without an epsilon: this is a 0/0 division, and therefore NaN, whenever all returns in a group are identical (sparse-reward envs, early training). Note: in GRPO, the group must have size > 1. PyTorch's default unbiased std (`correction=1`) of a size-1 group is itself NaN, which epsilon won't save; use `correction=0` or guarantee group size > 1.
  • Computing the importance-sampling ratio as a log-difference instead of `exp(log-difference)` — the model still 'trains' but the surrogate objective is wrong, the clip becomes meaningless, and updates are biased. The interviewer's first follow-up question is specifically about this: 'if ratio = model_logprob − old_logprob, could the model still train? Why not?'
  • Assuming the script is on-policy and never printing the ratio — misses the entire mini-epoch follow-up. The whole point of the third follow-up is that 'on-policy' GRPO with multiple gradient steps per rollout batch ISN'T strictly on-policy after step 1.
  • Spending too much time reading top-to-bottom before running — the first run reveals 2/3 bugs immediately. Read for ~5 minutes for context, then run.
  • Answering 'why do we clip?' with only 'to prevent large updates' — the great answer goes further: clipping is a trust-region HEURISTIC that removes the policy-improvement incentive when the per-action ratio leaves `[1−ε, 1+ε]`. It does NOT analytically bound KL(new || old) — that's why PPO has a separate adaptive-KL-penalty variant and why GRPO/DeepSeek-Math add an explicit KL term to a reference policy on top of the clip. Whether the gradient drops to zero depends on the advantage sign and which branch of the `min()` is active per sample, not just on whether the ratio is in the clip band — so 'clipping zeros the gradient' is too strong a generalization.
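The drift diagnostics mentioned in these pitfalls (ratio mean, clipped fraction, an empirical KL) fit in one small function. This is a sketch under assumptions: `drift_diagnostics` is a hypothetical name, the inputs are per-action log-probs for actions sampled from the old policy, and the KL term is the common sample-based estimator E_old[(r − 1) − log r], which estimates KL(old || new). That is the opposite direction from KL(new || old), although the two agree to second order when the drift is small.

```python
import torch

def drift_diagnostics(new_logprob, old_logprob, clip_eps=0.2):
    """Per-batch drift metrics; assumes actions were sampled from the old policy."""
    log_ratio = new_logprob - old_logprob
    ratio = torch.exp(log_ratio)
    return {
        "ratio_mean": ratio.mean().item(),
        # Share of samples whose ratio lies outside [1 - eps, 1 + eps].
        "clip_frac": ((ratio - 1.0).abs() > clip_eps).float().mean().item(),
        # Non-negative sample estimate of KL(old || new): E_old[(r - 1) - log r].
        "approx_kl": ((ratio - 1.0) - log_ratio).mean().item(),
    }

old = torch.log(torch.tensor([0.4, 0.3, 0.2, 0.1]))
on_policy = drift_diagnostics(old, old)
drifted = drift_diagnostics(old + torch.tensor([0.5, -0.5, 0.0, 0.1]), old)

print("on-policy:", on_policy)
print("drifted  :", drifted)
```

Logging all three per mini-epoch step makes the on-policy follow-up visible at a glance: the first step reports no drift, and later steps show the clipped fraction and KL estimate climbing before the ratio mean moves much.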

What Each Level Looks Like

Aim for the Senior tier; be ready to push toward Staff+ when probed.
IC4 / Mid-level
Finds the two numerical bugs (softmax, epsilon) within 30 minutes. Can explain that softmax converts logits to probabilities and that division-by-zero produces NaN, but doesn't yet connect to the surrogate objective. Misses the ratio bug or finds it only after a strong hint. Answers the clipping follow-up correctly at the textbook level (ε=0.2, bounds policy change) but doesn't ground it in the trust-region intuition.
IC5 / Senior · Target
Finds all three bugs, including the ratio formula, within the time budget. Articulates the importance-sampling ratio formula correctly the first time. Answers ratio-clipping and on-policy/off-policy follow-ups with the right intuition — knows that the ratio is clipped when `new_logprob − old_logprob` falls outside `[log(1−ε), log(1+ε)]` (asymmetric bounds in log-space), and that clipping removes the incentive to update outside that band rather than imposing a hard constraint. Reasons correctly about training-stability consequences. Gets the on-policy-drift follow-up via the mini-epoch path. Instruments the code with `print(ratio.mean())` proactively while noting that under the old policy E[ratio] = 1 holds in expectation regardless of mini-epoch updates — what actually shifts is the empirical KL and clipped-fraction.
IC6 / Staff+
All of IC5 plus: derives why the policy drifts during mini-epochs without being asked, discusses the bias-variance trade-off introduced by clipping, and is precise that clipping is a trust-region HEURISTIC — it does not analytically bound KL(new || old). Notes that PPO ships a separate adaptive-KL-penalty variant precisely because clipping alone does not constrain KL, and that GRPO/DeepSeek-Math add an explicit KL term to a reference policy on top of the clip. Connects to advanced topics: why GRPO uses group-relative baselines instead of a learned value function (eliminates the critic, reduces variance), the bias-variance trade-off vs. GAE. Suggests concrete diagnostics for production training runs: monitor clipped-fraction per batch, watch empirical KL(new || old) and KL(new || ref), alert on either above a threshold.
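The log-space clip bounds mentioned in the Senior tier are quick to verify by hand. A minimal check, assuming the textbook ε = 0.2 (`is_clipped` is a hypothetical helper):

```python
import math

clip_eps = 0.2
log_lower = math.log(1 - clip_eps)   # lower bound on new_logprob - old_logprob
log_upper = math.log(1 + clip_eps)   # upper bound on new_logprob - old_logprob

def is_clipped(log_diff):
    """True if exp(log_diff) falls outside [1 - eps, 1 + eps]."""
    return not (log_lower <= log_diff <= log_upper)

print(f"log-space band: [{log_lower:.3f}, {log_upper:.3f}]")
print("log-diff of -0.2 clipped?", is_clipped(-0.2))
print("log-diff of +0.2 clipped?", is_clipped(+0.2))
```

The band is roughly [−0.223, +0.182], so it is wider on the negative side: the same magnitude of log-prob change can be inside the band when the probability drops and outside it when the probability rises.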

Insider Notes

This question is one of the most reliable RL-fluency signals Anthropic has. It has shown up in the technical deep-dive round since October 2025, with 3 of the 4 reports landing in the last three months (most recently March 2026), so if you're interviewing for an MLE or Research Engineer role on a frontier or research team right now, be ready for it. The structure of the bug list is intentional: the first two bugs are 'do you know PyTorch,' the third bug is 'do you understand GRPO,' and the on-policy mini-epoch follow-up is 'do you understand training systems.' One candidate's report (translated from Chinese) makes the rhythm explicit: 'there are three bugs in total, all straightforward; anyone familiar with GRPO spots them at a glance,' followed by 'the only hard part is figuring out why it's not strictly on-policy.' Failing to find the third bug is recoverable if you still nail the theory follow-ups, but failing to articulate the ratio formula correctly is not: that's the round-defining slip. Candidates consistently report that printing `ratio.mean()` during training is what unblocked the on-policy follow-up; practice this debugging instinct.