Transformer Debugging

Question

Given a PyTorch implementation of a GPT-like causal transformer language model (nanoGPT-style), including a model class and a training loop, identify and fix all deliberately introduced bugs (typically 4, occasionally 5). The model is trained to overfit on a single sentence using a simple gradient descent loop. Success is verified by running the code: training loss must decrease and the model must produce correct output. Bug locations are sometimes hinted at in code comments. The four canonical bugs are: (1) incorrect initialization of learnable positional embeddings; (2) causal attention mask not set to -inf before the softmax (masking applied incorrectly); (3) wrong dimensions in the output projection nn.Linear layer; (4) a training-loop error — either missing loss.backward() call or incorrect loss label shifting for next-token prediction (off-by-one for causal LM). A fifth bug, when present, is typically a subtle typo. The candidate must run and verify tests/output after fixes.

AceOffer · Accepted Answer


Reported follow-ups:
1. Implement KV cache for autoregressive decoding. A KV cache class is provided (with methods to return length and stored (K, V) tensors). Modify the attention computation to append/update the cache, apply the causal mask only when the cache is absent or empty, adjust the positional encoding offset based on cache length, and implement a custom generate() function. Verify that the output matches the non-cached version. (When: after all bugs are fixed with time remaining; the most common follow-up.)
2. Convert the language model to a sequence classifier: modify the final layer so the transformer outputs a class prediction instead of next-token logits. The interviewer prefers taking the mean of token representations before computing logits. A test harness is provided; just make the tests pass. (When: after bugs are fixed; an alternative follow-up seen in multiple onsite reports.)
3. What is the time complexity of autoregressive generation without KV cache, and how does KV cache improve it? (When: conceptual discussion before or after the KV cache implementation.)
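
For follow-up 2, a minimal sketch of the mean-pooling head might look like the following. It assumes the transformer body returns hidden states of shape (batch, seq_len, d_model). `MeanPoolClassifier`, `body`, `d_model` and `num_classes` are hypothetical names, and an `nn.Embedding` stands in for the real transformer blocks so the snippet runs on its own.

```python
import torch
import torch.nn as nn


class MeanPoolClassifier(nn.Module):
    """Hypothetical wrapper: mean-pool token representations, then classify."""

    def __init__(self, body: nn.Module, d_model: int, num_classes: int):
        super().__init__()
        self.body = body                      # assumed to map (B, T) ids -> (B, T, d_model)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        h = self.body(idx)                    # (B, T, d_model)
        pooled = h.mean(dim=1)                # average over the token dimension -> (B, d_model)
        return self.head(pooled)              # (B, num_classes)


# Stand-in body for illustration only; the real one would be the transformer blocks.
body = nn.Embedding(50, 16)
clf = MeanPoolClassifier(body, d_model=16, num_classes=3)
logits = clf(torch.randint(0, 50, (4, 7)))
print(logits.shape)
```

Pooling before the linear head is what changes the output shape from (batch, seq_len, vocab) to (batch, num_classes), which is the shape the harness is reported to check.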

**Common mistakes:**
- Conflating causal LM (next-token, autoregressive) with masked LM (BERT-style). Multiple candidates with CV/BERT backgrounds missed the loss-shifting bug entirely because they framed the task as MLM rather than next-token prediction (tid 1122034, translated: 'my head was full of BERT-style MLM').
- Trying to replace the learnable positional embeddings with sinusoidal or rotary encodings when the intended bug is only the initialization scheme. The question gives you a learnable PE and expects you to fix the `nn.Parameter` initialization (e.g., `nn.init.normal_` with a sensible std), not swap to sin/cos (tid 1122034).
- Applying the causal mask multiplicatively (zeroing the logits at masked positions before the softmax) instead of filling masked positions with `-inf`, either by adding a float mask of `-inf` or via `masked_fill(mask, float('-inf'))` with a boolean mask. Zeroed logits still receive nonzero softmax weight because `exp(0)=1`, so the model still attends to 'masked' positions. The fix is `-inf` substitution pre-softmax; the mask's dtype (bool vs float) is irrelevant when it is applied correctly.
- Wiring `loss.backward()` correctly but forgetting to call `optimizer.step()`, or vice versa, so the loss looks noisy or flat.
- Misaligning the next-token prediction targets: predict from `logits[:, :-1, :]` against targets `labels[:, 1:]` (or equivalently feed `inputs[:, :-1]` and label with `inputs[:, 1:]`). Both sides must be shifted together; shifting only one side, or shifting along the wrong axis, silently changes the objective. The cleanest verification is sampling: a correctly shifted model can reproduce the target sentence, while a mis-shifted one may reduce loss without overfitting.
- Running training after every single fix on CoderPad. Execution is slow on the OpenAI environment and burns the entire 60-min budget if you re-run more than 3-4 times; batch the fixes and run once at the end (tid 1145072, translated: 'CoderPad runs code outrageously slowly; minimize the number of runs').
- Spending too long auditing the driver/training loop when the bug locations are explicitly marked in comments. The comments are reliable; trust them (tid 1145072: the candidate wasted time re-reading driver code that 'actually wasn't useful').
- On the KV cache follow-up: applying the same triangular causal mask after the cache is populated, as if the full prefix were being re-scored. When only one new token is scored per step, there are no future positions to mask. For single-token decode the safe path is to skip the mask; for chunked decode (more than one new token at a time over a non-empty cache), build an offset-aware mask sized to the new chunk so the new tokens cannot attend to each other's future positions. Getting either case wrong breaks the parity check against the no-cache reference.
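
To see why a zeroed logit is not actually masked, the arithmetic can be checked with a plain-Python softmax. This is a small sketch with made-up scores; `softmax`, `scores` and `future` are illustrative names, not part of the interview code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats (may contain -inf)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Attention scores for one query over 4 keys; the last two keys are "future" positions.
scores = [2.0, 1.0, 3.0, 0.5]
future = [False, False, True, True]

# Buggy: multiplying by a 0/1 mask turns masked logits into 0.0, and exp(0) = 1.
zeroed = [0.0 if f else s for s, f in zip(scores, future)]
# Fixed: replace masked logits with -inf, so exp(-inf) = 0.
neg_inf = [float("-inf") if f else s for s, f in zip(scores, future)]

weights_buggy = softmax(zeroed)
weights_fixed = softmax(neg_inf)

print("buggy weight on future keys:", sum(weights_buggy[2:]))
print("fixed weight on future keys:", sum(weights_fixed[2:]))
```

In PyTorch the same substitution is what `masked_fill(mask, float('-inf'))` performs on the score tensor before the softmax.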

**Interviewer hints:**
- When bug locations are marked in code comments, a recurring variant per tid 1145072 (translated: 'the comments mark which section of code has the bug') and tid 1160043 (translated: 'the rough locations are all marked for you in the comments'), the interviewer typically signals 'you only need to look at the marked regions.' Not all variants include this markup; ask if the code is unmarked.
- On the 5-bug variant: the 5th bug is described as 'a subtle typo', and the interviewer typically points to the general region without naming the exact line.
- Per tid 1134115, the interviewer knows where every bug is but explicitly noted that 'because each bug has multiple valid fixes, if you fix it wrong I may not be able to guide you back.' This is one reported interviewer style, not a universal rule.
- The time-complexity question for KV cache is often delivered before implementation: 'Before you code it, walk me through why KV cache helps.' This is a soft gate; candidates who can derive the per-step improvement from O(T²d) to O(Td) get more time to implement.
- On the classifier follow-up: 'Take the mean over the token dimension.' This is the interviewer's stated preference, not merely one valid option.

**What passers do:**
- When the interviewer has marked bug regions in code comments, a common variant per several reports (tid 1145072, tid 1160043), passers trust the markup and don't re-audit unmarked code under time pressure. The markup isn't always present; ask if it isn't obvious.
- One passing candidate (tid 1165799) prepped by hand-writing a nanoGPT-style transformer end-to-end (model class + training loop) and watching the Karpathy 'Let's build GPT' YouTube series. This is anecdotal, but the prep recipe matches the 'familiarity with nanoGPT was sufficient' framing in multiple other reports.
- The same candidate (tid 1165799) generated practice variants by asking an LLM to inject bugs into a clean nanoGPT reference, then debugged them against a timer. Again anecdotal, but it illustrates the level of repetition that seems to correlate with passing this round.
- Explained each fix in one sentence while applying it. Interviewers know the bugs but specifically reward candidates who can articulate WHY: 'I'm initializing PE with normal(std=0.02) because uninitialized nn.Parameter values inject arbitrary noise into the inputs'; 'I'm filling masked positions with -inf before the softmax because zeroed logits still get weight exp(0)=1 in the softmax'.
- Verified the fixes by overfitting on the single target sentence: ran the training loop once at the end and confirmed both that the loss decreases AND that the model's sampled output reproduces the target sentence. These are the two criteria the interviewer explicitly uses.
- On the KV cache follow-up: saved the cache length BEFORE appending the new K/V (so it can serve as the positional-encoding offset for the current step), then appended to the cache, then ran attention with the cache present. Ordering matters because the offset is `cache_length_at_step_start`, not `cache_length_after_append`.
- On the classifier follow-up: defaulted to mean-pooling token representations before the linear head, matching the interviewer's explicit preference.
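
The KV cache ordering described above (read the offset, append, then attend) can be sketched with a toy single-head attention layer and a parity check. Everything here is an assumption for illustration: `TinyKVCache` and `TinyAttention` are hypothetical stand-ins, and the interface of the cache class actually provided in the interview is not known. The sketch uses one offset-aware mask for all cases, which masks nothing when a single token is decoded.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)


class TinyKVCache:
    """Hypothetical stand-in for the provided cache class."""

    def __init__(self):
        self.k = None
        self.v = None

    def length(self) -> int:
        return 0 if self.k is None else self.k.size(1)

    def append(self, k, v):
        self.k = k if self.k is None else torch.cat([self.k, k], dim=1)
        self.v = v if self.v is None else torch.cat([self.v, v], dim=1)
        return self.k, self.v


class TinyAttention(nn.Module):
    """Single-head causal attention with learned positions (illustrative only)."""

    def __init__(self, vocab=20, d=8, max_len=16):
        super().__init__()
        self.tok = nn.Embedding(vocab, d)
        self.pos = nn.Embedding(max_len, d)
        self.qkv = nn.Linear(d, 3 * d)

    def forward(self, idx, cache=None):
        B, T = idx.shape
        # 1) Read the offset BEFORE appending: it is the position of the first new token.
        offset = cache.length() if cache is not None else 0
        positions = torch.arange(offset, offset + T)
        x = self.tok(idx) + self.pos(positions)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # 2) Append the new keys/values, then attend over the whole prefix.
        if cache is not None:
            k, v = cache.append(k, v)
        att = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # (B, T, S)
        S = k.size(1)
        # 3) Offset-aware causal mask: new token i (absolute position offset + i)
        #    may attend to keys 0..offset + i. For T == 1 this masks nothing.
        allowed = torch.arange(S)[None, :] <= (offset + torch.arange(T))[:, None]
        att = att.masked_fill(~allowed, float("-inf"))
        return att.softmax(dim=-1) @ v


model = TinyAttention().eval()
idx = torch.randint(0, 20, (1, 6))

with torch.no_grad():
    full = model(idx)                                    # no-cache reference
    cache = TinyKVCache()
    steps = [model(idx[:, t:t + 1], cache) for t in range(idx.size(1))]
    cached = torch.cat(steps, dim=1)

print(torch.allclose(full, cached, atol=1e-5))
```

Swapping steps 1 and 2 shifts every position by one, which is the kind of error the parity check against the no-cache reference is meant to surface.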

**Why people fail:**
- Wrote all four fixes correctly but ran out of time before running training to verify. The explicit success criterion is loss decrease plus correct generated output, and at least one report (tid 1134650) attributes the failure to exactly this (translated: 'Failed at the very end: I had written everything, but there was no time left to test').
- Background mismatch reported in tid 1122034: the candidate had CV experience but no autoregressive-LM background, and missed the loss-shifting bug because the next-token-prediction framing was unfamiliar. A single report, but informative about the prep gap that bites.
- Got 3 of the 4 bugs (the architecture-related ones) but lost the round on the loss bug, because it required understanding causal LM mechanics, not just transformer architecture (tid 1122034).
- Wasted ~20 min on bug #1 (positional embedding initialization) trying to switch to sin/cos when the intended fix was a 1-line `nn.init.normal_(self.pos_emb, std=0.02)`, and ran out of clock for the harder bugs (tid 1122034).
- On the KV cache follow-up: didn't get to the parity check with the no-cache reference because the cache implementation was buggy in subtle ways (forgot the offset adjustment, applied the mask when the cache was present). The interviewer wants the parity check explicitly.
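
The loss-shifting bug mentioned above can be illustrated with a small sketch. It assumes the model returns logits of shape (batch, seq_len, vocab) and that the labels are the input tokens themselves. The `logits` tensor below is a hand-built stand-in for a model that has memorized the sentence, not real model output, and the token ids are made up.

```python
import torch
import torch.nn.functional as F

V = 10                                                   # illustrative vocabulary size
tokens = torch.tensor([[3, 1, 4, 1, 5, 9, 2, 6]])        # (B, T) made-up token ids

# Stand-in for a model that has memorized the sentence: position t puts
# almost all probability on token t + 1 (the last position is arbitrary).
next_tokens = torch.roll(tokens, shifts=-1, dims=1)
logits = 20.0 * F.one_hot(next_tokens, V).float()        # (B, T, V)

# Correct: position t is scored against token t + 1.
shifted = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, V),                    # drop the last prediction
    tokens[:, 1:].reshape(-1),                           # drop the first label
)

# Buggy: position t is scored against token t, a different objective.
unshifted = F.cross_entropy(logits.reshape(-1, V), tokens.reshape(-1))

print("shifted loss:  ", shifted.item())
print("unshifted loss:", unshifted.item())
```

With the shift in place the memorized predictions score near zero loss, while the unshifted comparison penalizes the very same predictions heavily.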

**Edge cases probed:**
- KV cache + positional encoding offset: the positional encoding for the new token must use position `cache_length`, not position 0. Getting this wrong silently breaks generation parity.
- KV cache + causal mask: for single-token decode with a non-empty cache, there are no future positions to mask, so the mask is unnecessary; for chunked decode (more than one new token at a time over a non-empty cache), an offset-aware mask sized to the new chunk is still required to prevent new tokens from attending to each other's future positions.
- 5-bug variant with a subtle typo (e.g., wrong variable name, transposed argument): there is no canonical recipe, and only careful reading catches it.
- Multi-head variant with bugs in the head-dimension reshape (the B, T, n_head, head_dim axes swapped into the wrong order).
- Classifier conversion test harness: the tests check a specific tensor shape (batch, num_classes), not just convergence. Candidates who change the loss but forget to change the output shape fail the harness.
- Time-complexity discussion for both training and inference. The interviewer typically asks about inference (autoregressive generation): without KV cache, each step at context length T costs O(T²·d) in attention (O(T³·d) for generating T tokens); with the cache, the per-step cost drops to O(T·d) (O(T²·d) total). For training or teacher-forced decoding, attention is O(T²·d) per example because all positions are computed in parallel.
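
The inference totals quoted above follow from summing the per-step attention cost over the T generated tokens. A sketch of that arithmetic, counting only the attention score and value products for one layer and ignoring the projections and MLP (a simplifying assumption made here), with t denoting the context length at step t:

```latex
\begin{aligned}
\text{no cache:}\quad & \sum_{t=1}^{T} O(t^{2} d) = O\!\left(d \cdot \tfrac{T(T+1)(2T+1)}{6}\right) = O(T^{3} d) \\
\text{with cache:}\quad & \sum_{t=1}^{T} O(t\, d) = O\!\left(d \cdot \tfrac{T(T+1)}{2}\right) = O(T^{2} d)
\end{aligned}
```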

**Alternative approaches:**
- Test-driven incremental fixing: run training after each fix to isolate its impact. Slower overall, but confirms individual bugs. Risky under time pressure given slow CoderPad execution.
- Diff against a known-good reference: mentally compare against a memorized nanoGPT reference implementation. Fast if the architecture is well known, but can miss novel or subtle bugs that are not in the canonical reference.
