OpenAI Common Problems
System Design: CI/CD Pipeline
System Design · Hard · Last reported March 2026
By AceOffer · Updated March 2026 · Reported 45× across 190+ candidate reports
Insider Notes
**Common mistakes:**
- Stacking modules without specifying component interactions, APIs, or DB schemas
- Omitting the real-time log streaming component entirely
- Designing a stateful scheduler that becomes a bottleneck and a single point of failure
- Assuming at-least-once delivery is sufficient without addressing idempotency or exactly-once semantics
- Not handling jobs stuck in RUNNING (no reconciler or heartbeat mechanism)
- Not clarifying scope (K8s vs. shell scripts, DAG vs. linear) before designing
- Designing a DAG-based system when the interviewer explicitly says linear steps only
- Missing multi-tenant isolation and resource fairness
- Using a simplistic polling-based scheduler instead of a CDC/event-driven approach
- Not preparing for container-specific corner cases (minimal image, log retrieval)

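The stuck-RUNNING mistake above is usually addressed with worker heartbeats plus a periodic reconciler. The following is a minimal sketch of that idea, not a reference design: the `Job` fields, the 30-second timeout, and the retry limit of 3 are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

HEARTBEAT_TIMEOUT_S = 30.0  # assumed: how long a worker may stay silent
MAX_ATTEMPTS = 3            # assumed: recoveries allowed before giving up


@dataclass
class Job:
    job_id: str
    status: str                # "QUEUED", "RUNNING", "SUCCEEDED" or "FAILED"
    worker_id: Optional[str]
    last_heartbeat: float      # unix seconds of the last worker heartbeat
    attempts: int = 0


def reconcile(jobs: List[Job], now: float) -> List[str]:
    """Recover RUNNING jobs whose heartbeat is stale.

    A stale job is requeued, or marked FAILED once it has used up
    MAX_ATTEMPTS. Returns the ids of the jobs that were touched.
    """
    recovered = []
    for job in jobs:
        if job.status != "RUNNING":
            continue
        if now - job.last_heartbeat <= HEARTBEAT_TIMEOUT_S:
            continue  # worker is still alive as far as we can tell
        job.attempts += 1
        job.worker_id = None
        job.status = "FAILED" if job.attempts >= MAX_ATTEMPTS else "QUEUED"
        recovered.append(job.job_id)
    return recovered
```

In a real system the reconciler would run on a timer against the job table. Requeuing a job whose worker is merely partitioned, not dead, can cause double execution, which is why this is normally paired with an idempotency check.
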
**Interviewer hints:**
- "Think about what happens if the worker dies." (when the candidate missed fault tolerance)
- "How do users actually see what's happening with their job?" (prompting a log streaming discussion)
- "Let's say the container has a minimal image: no curl, no HTTP client. How does the system know the job is done?" (container corner case)
- "Think about at-least-once vs. exactly-once: which does your design give you?" (prompting an idempotency discussion)
- "You can treat each job as a single task, no need for DAG." (simplifying scope when the candidate over-engineered)
- "What API protocol would you use here, and why not just curl?" (prompting a protocol discussion)
- "Think about the container observability problem." (after the candidate missed log streaming)

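One possible answer to the minimal-image hint is that nothing inside the container needs to report anything: the worker that launched the job observes its exit code and captures its stdout/stderr from outside. The sketch below illustrates this with a plain child process standing in for a container runtime, which is an assumption made to keep the example runnable. `run_step` and `on_log_line` are hypothetical names.

```python
import subprocess
import sys
from typing import Callable, List


def run_step(argv: List[str], on_log_line: Callable[[str], None]) -> int:
    """Run one step as a child process and supervise it from outside.

    Output is forwarded line by line as it is produced, and the exit
    code is returned so the caller can derive the job status.
    """
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same log stream
        text=True,
    )
    assert proc.stdout is not None
    for line in proc.stdout:
        on_log_line(line.rstrip("\n"))
    return proc.wait()


# Usage: the "job" is a tiny Python command that logs one line and exits 3.
lines: List[str] = []
exit_code = run_step(
    [sys.executable, "-c", "print('building'); raise SystemExit(3)"],
    lines.append,
)
status = "SUCCEEDED" if exit_code == 0 else "FAILED"
```

The same shape covers log retrieval: because the supervisor owns the output pipe, it can forward lines to a log store without any log client in the image.
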
**What passers do:**
- Explicitly frame the problem as a job scheduler from the start
- Propose a stateless scheduler that uses CDC for status-driven job enqueuing
- Proactively address exactly-once semantics with an idempotency key plus a DB check
- Cover real-time log streaming without being prompted
- Propose a reconciler for stuck-state detection and recovery
- Clarify scope upfront (linear vs. DAG, containers vs. shell scripts)
- Detail the DB schema, including created_at, a status enum, worker_id, and step_index
- Address multi-tenant isolation (queue partitioning, quotas)
- Describe every layer in detail when probed, with no hand-waving

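The schema and the "idempotency key + DB check" items above can be sketched together. This is an illustrative sketch using SQLite, not a schema any candidate actually reported: only `created_at`, `status`, `worker_id`, and `step_index` come from the notes; the other columns, the status values, and the function names are assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE jobs (
    job_id          TEXT PRIMARY KEY,
    workflow_id     TEXT NOT NULL,
    step_index      INTEGER NOT NULL,
    status          TEXT NOT NULL CHECK (status IN
                        ('PENDING', 'QUEUED', 'RUNNING', 'SUCCEEDED', 'FAILED')),
    worker_id       TEXT,
    idempotency_key TEXT NOT NULL UNIQUE,
    created_at      TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def submit_job(conn, job_id, workflow_id, step_index, idempotency_key):
    """Insert a job once. A duplicate idempotency key is silently ignored.

    Returns True if a new row was created.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO jobs "
        "(job_id, workflow_id, step_index, status, idempotency_key) "
        "VALUES (?, ?, ?, 'QUEUED', ?)",
        (job_id, workflow_id, step_index, idempotency_key),
    )
    conn.commit()
    return cur.rowcount == 1


def claim_job(conn, job_id, worker_id):
    """Move a job from QUEUED to RUNNING for exactly one worker.

    The status condition in the WHERE clause is the DB check: a second
    delivery of the same queue message updates zero rows.
    """
    cur = conn.execute(
        "UPDATE jobs SET status = 'RUNNING', worker_id = ? "
        "WHERE job_id = ? AND status = 'QUEUED'",
        (worker_id, job_id),
    )
    conn.commit()
    return cur.rowcount == 1
```

With at-least-once delivery from the queue, the conditional update ensures only one worker starts a given job. It does not by itself make the job's side effects exactly-once, so steps that touch external systems still need to be idempotent.
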
**Why people fail:**
- Describe modules at a high level without component detail, then run out of time
- Miss log streaming entirely and mention only job status
- Cannot answer container corner-case questions (minimal image, no HTTP client)
- Treat it as a generic web system design without job-scheduler specifics
- Do not address stuck RUNNING states at all
- Use a stateful scheduler without explaining how it scales or recovers
- Do not clarify scope and design an overly complex DAG engine
- Cannot respond to the exactly-once semantics follow-up

**Edge cases probed:**
- Minimal container image with no HTTP/RPC/curl: how to signal job completion
- Container log retrieval and debugging when there is no log client inside the container
- Jobs stuck in RUNNING state (worker crash, network partition)
- Concurrent double execution of the same job after a queue retry
- A mid-workflow step fails: do subsequent steps run?
- Very large log output streaming (backpressure handling)
- Multi-tenant resource starvation (noisy neighbor)
- Git push events arriving out of order, or duplicate pushes
- YAML config file missing or malformed at workflow trigger time

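For the very-large-log edge case, one simple way to bound memory is a fixed-size buffer that sheds the oldest lines and counts what it dropped. Note that this is load shedding, not true backpressure (which would block or slow the producer instead). The `LogBuffer` class and its sequence-number scheme are hypothetical, shown only to make the trade-off concrete.

```python
from collections import deque
from typing import List, Tuple


class LogBuffer:
    """Keeps at most max_lines recent log lines, each tagged with a sequence number."""

    def __init__(self, max_lines: int) -> None:
        self._lines = deque(maxlen=max_lines)
        self.next_seq = 0   # sequence number the next appended line will get
        self.dropped = 0    # how many old lines were evicted

    def append(self, line: str) -> None:
        if len(self._lines) == self._lines.maxlen:
            self.dropped += 1  # deque is full, so the oldest line is evicted
        self._lines.append((self.next_seq, line))
        self.next_seq += 1

    def read_from(self, seq: int) -> List[Tuple[int, str]]:
        """Return buffered lines with sequence number >= seq.

        A client that reconnects passes the last sequence number it saw
        plus one; a gap in the returned numbers means lines were dropped.
        """
        return [(s, text) for s, text in self._lines if s >= seq]
```

Sequence numbers let a streaming client resume after a disconnect and detect gaps, which is usually easier to explain in an interview than silent truncation.
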
**Alternative approaches:**
- **DAG-based workflow engine (e.g., Airflow-style):** More general than linear steps and supports parallel job execution within a workflow, but significantly increases scheduler complexity (dependency graph evaluation, fan-out/fan-in logic). Overkill if workflows are confirmed to be linear sequences; interviewers may explicitly say a DAG is not needed.
- **Push-based worker notification (vs. pull-based queue):** The scheduler calls workers directly (RPC) instead of workers pulling from a queue. Lower latency, but tighter coupling, harder to scale workers independently, and the scheduler must track worker availability.
- **Polling-based scheduler (vs. CDC):** The scheduler periodically polls the DB for jobs to enqueue. Simpler to implement, but adds latency, increases DB load, and can miss rapid status transitions. CDC is preferred for responsiveness.
- **K8s-native approach (CRDs + controllers):** Uses Kubernetes CRDs for workflow/job resources and a custom controller as the scheduler, leveraging K8s primitives (pods, resource limits). Appropriate when K8s is explicitly in scope, but it adds infrastructure complexity and is out of scope when the interviewer says to ignore container/K8s concerns.

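The polling-versus-CDC trade-off above can be made concrete with a small sketch. Both functions and the event shape (`old_status` / `new_status`) are hypothetical; real CDC tools emit their own change-record formats, and the in-memory lists here stand in for a job table and a queue.

```python
from typing import Dict, List, Optional


def poll_once(jobs: List[Dict[str, str]], queue: List[str]) -> None:
    """Polling: scan every job row and enqueue the ones still PENDING.

    Cost grows with table size on every tick, and a job waits up to one
    poll interval before it is noticed.
    """
    for job in jobs:
        if job["status"] == "PENDING":
            job["status"] = "QUEUED"  # mark so the next poll skips it
            queue.append(job["job_id"])


def on_change(event: Dict[str, Optional[str]], queue: List[str]) -> None:
    """Event-driven: react to a single row-change event.

    The handler keeps no state between calls, so any scheduler replica
    can process any event.
    """
    became_pending = (
        event["new_status"] == "PENDING" and event["old_status"] != "PENDING"
    )
    if became_pending:
        queue.append(event["job_id"])
```

Because change streams are typically delivered at least once, the event-driven handler can enqueue the same job twice; the consumer side still needs an idempotent claim step.
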