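    # Before: torch.multinomial expects non-negative weights, not raw logits
    action = torch.multinomial(logits, 1)

    # After:
    probs = F.softmax(logits, dim=-1)
    action = torch.multinomial(probs, 1)

    # Cleaner (note: returns shape (batch,) rather than (batch, 1)):
    action = torch.distributions.Categorical(logits=logits).sample()
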
    # Before: NaN when the batch has one element or all returns are equal
    advantage = (returns - returns.mean()) / returns.std()

    # After:
    advantage = (returns - returns.mean()) / (returns.std(correction=0) + 1e-8)

    # Before (wrong: this is the log-ratio, not the ratio):
    ratio = new_logprob - old_logprob

    # After:
    ratio = torch.exp(new_logprob - old_logprob)

    # Sanity check: on the first mini-epoch step π_new == π_old, so ratio
    # should be 1 (up to floating-point error):
    # if step == 0:
    #     assert torch.allclose(ratio, torch.ones_like(ratio))

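The three fixes above can be collected into small helpers and exercised on toy tensors. This is a minimal sketch, not a full PPO loop: the function names `sample_action`, `normalize_advantages`, and `ppo_ratio` are hypothetical, and the toy inputs are made up purely for illustration.

```python
import torch
from torch.distributions import Categorical


def sample_action(logits: torch.Tensor) -> torch.Tensor:
    """Sample one action per row directly from unnormalized logits."""
    return Categorical(logits=logits).sample()


def normalize_advantages(returns: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Center and scale; population std plus eps stays finite for tiny batches."""
    return (returns - returns.mean()) / (returns.std(correction=0) + eps)


def ppo_ratio(new_logprob: torch.Tensor, old_logprob: torch.Tensor) -> torch.Tensor:
    """Probability ratio pi_new / pi_old, computed from log-probabilities."""
    return torch.exp(new_logprob - old_logprob)


torch.manual_seed(0)
logits = torch.tensor([[-1.0, 0.5, 2.0], [0.0, 0.0, 0.0]])  # toy batch of 2

actions = sample_action(logits)                      # shape (2,), values in {0, 1, 2}
single = normalize_advantages(torch.tensor([3.0]))   # one-element batch stays finite
logp = Categorical(logits=logits).log_prob(actions)
ratio = ppo_ratio(logp, logp.detach())               # identical policies: all ones
```

Wrapping each fix in its own function makes it straightforward to assert on its output in isolation, which is the first approach in the table below.
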
| Approach | Trade-offs |
|---|---|
| Unit-test each component independently | Faster isolation of NaN sources via shape/value assertions on logits, probabilities, advantages, and ratios, but requires writing additional test harness code during the interview. |
| Run and inspect printed values first | Quickly reveals symptoms (e.g., ratio printed ≠ 1) but may miss root-cause logic errors without careful code reading. |
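
As a rough illustration of the first row, the component checks could look something like the following. The helper names `check_finite`, `check_probs`, and `check_first_step_ratio` are hypothetical, and the inputs are toy tensors rather than outputs of a real policy.

```python
import torch
import torch.nn.functional as F


def check_finite(name: str, t: torch.Tensor) -> None:
    """Fail fast, naming the tensor that contains NaN or inf."""
    assert torch.isfinite(t).all(), f"{name} contains NaN or inf"


def check_probs(probs: torch.Tensor) -> None:
    """Probabilities must be finite, non-negative, and sum to 1 per row."""
    check_finite("probs", probs)
    assert (probs >= 0).all(), "probs has negative entries"
    row_sums = probs.sum(dim=-1)
    assert torch.allclose(row_sums, torch.ones_like(row_sums)), "rows do not sum to 1"


def check_first_step_ratio(ratio: torch.Tensor) -> None:
    """Before any update the two policies match, so the ratio should be close to 1."""
    check_finite("ratio", ratio)
    assert torch.allclose(ratio, torch.ones_like(ratio)), "ratio != 1 at step 0"


logits = torch.tensor([[-1.0, 0.5, 2.0]])
check_probs(F.softmax(logits, dim=-1))               # passes
check_first_step_ratio(torch.exp(torch.zeros(4)))    # passes

# Raw logits are not probabilities, so the same check flags them.
try:
    check_probs(logits)
    caught = False
except AssertionError:
    caught = True
```

Each check names the tensor it failed on, so the first assertion to fire points at the component to read more closely.
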