OpenAI Common Problems

System Design: CI/CD Pipeline

System Design · Hard · Last reported March 2026
By AceOffer · Updated March 2026 · Reported 45× across 190+ candidate reports

Understanding the Problem

Design a multi-tenant CI/CD system that schedules and executes user-defined workflows in response to git push events. The system receives push notifications via internal API calls containing a repository ID and the current commit hash. Workflows are defined as a sequence of jobs in a single YAML file stored at a static location within each repository. Users must be able to view the output and status of jobs in real time as they execute. The system must be highly scalable (supporting many tenants/repositories), fault-tolerant, and must implement exactly-once execution semantics.

Full canonical prompt (verbatim from multiple reports): 'Design a multi-tenant CI/CD system which schedules and executes user-defined workflows in response to git pushes. The system receives information about pushes via API calls from an internal service which contain the repository id and the current state of the repository (commit hash). Workflows are a sequence of jobs which are defined within a single YAML file in a static location for each repository. Users should be able to view the output and status of jobs as they are running.'

Variant framing: Also asked as 'Design a GitHub Actions-like system' or 'Design a CI/CD job scheduler using K8s and Docker.' Some interviewers explicitly simplify scope: jobs are shell scripts (no K8s/containers needed), and workflows are linear sequences (not DAGs). Clarify scope before designing.

Functional Requirements

Structured requirements coming soon. For now, see the full problem statement above and the deep-dive prompts below.

Non-Functional Requirements

Latency, throughput, availability, consistency targets — being authored.

The Set Up

Defining the Core Entities

Core entities (Request, Batch, Worker, Cache, etc.) — being authored.

The API

POST /endpoint → describe request shape GET /endpoint → describe response shape (API spec being authored)

High-Level Design

Component diagram + walkthrough mapping each functional requirement to a system flow — being authored.

Potential Deep Dives

These are the directions the interviewer is likely to push you. Each one has multiple valid solutions at different quality tiers.

1) What happens if a worker dies mid-execution? How do you ensure the job is not lost or double-executed? (when: Candidate does not address worker failure or job recovery)

Bad
Naive approach with serious trade-off — being authored.
Good
Solid baseline with reasonable trade-offs — being authored.
Great
Production-grade approach with explicit trade-off rationale — being authored.
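
One way to reason about this question is a lease: a worker owns a job only while it keeps renewing a lease, and another worker may take over once the lease expires. The sketch below is a minimal illustration under stated assumptions, not the reference answer. The `jobs` table, its columns, the 30-second lease, and the function names are all hypothetical, and SQLite stands in for whatever database the design uses.

```python
import sqlite3

LEASE_SECONDS = 30.0  # hypothetical lease length


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs ("
        " job_id TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'PENDING',"
        " worker_id TEXT,"
        " lease_expires_at REAL,"
        " attempt INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def add_job(conn, job_id):
    conn.execute("INSERT INTO jobs (job_id) VALUES (?)", (job_id,))
    conn.commit()


def claim(conn, job_id, worker_id, now):
    """Claim a job that is PENDING, or RUNNING with an expired lease."""
    cur = conn.execute(
        "UPDATE jobs SET status='RUNNING', worker_id=?, lease_expires_at=?,"
        " attempt=attempt+1"
        " WHERE job_id=? AND (status='PENDING'"
        " OR (status='RUNNING' AND lease_expires_at < ?))",
        (worker_id, now + LEASE_SECONDS, job_id, now),
    )
    conn.commit()
    return cur.rowcount == 1  # exactly one claimant can win the UPDATE


def heartbeat(conn, job_id, worker_id, now):
    """Extend the lease; fails if this worker no longer owns the job."""
    cur = conn.execute(
        "UPDATE jobs SET lease_expires_at=?"
        " WHERE job_id=? AND worker_id=? AND status='RUNNING'",
        (now + LEASE_SECONDS, job_id, worker_id),
    )
    conn.commit()
    return cur.rowcount == 1


def complete(conn, job_id, worker_id, status):
    """Record a result only if the caller still holds the job."""
    cur = conn.execute(
        "UPDATE jobs SET status=?"
        " WHERE job_id=? AND worker_id=? AND status='RUNNING'",
        (status, job_id, worker_id),
    )
    conn.commit()
    return cur.rowcount == 1
```

The job is not lost because an expired lease makes it claimable again, and a worker that was presumed dead cannot overwrite the result because its `worker_id` no longer matches. Note the limit of this sketch: it fences the recorded result, not the side effects of a slow worker that is still running, so jobs would still need to be safe to retry.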

2) How do you ensure exactly-once execution? What if the same job is delivered twice from the queue? (when: Candidate addresses only at-least-once delivery without exactly-once semantics)

Bad
Naive approach with serious trade-off — being authored.
Good
Solid baseline with reasonable trade-offs — being authored.
Great
Production-grade approach with explicit trade-off rationale — being authored.
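
The Insider Notes describe passers using an idempotency key plus a DB check. A minimal sketch of that idea follows, assuming the queue is at-least-once and the key is derived from a run ID and a step index. The `executions` table, the key format, and the function names are hypothetical, and SQLite is only a stand-in.

```python
import sqlite3


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE executions ("
        " idempotency_key TEXT PRIMARY KEY,"
        " status TEXT NOT NULL)"
    )
    return conn


def idempotency_key(run_id, step_index):
    # Hypothetical key format: one key per (workflow run, step).
    return f"{run_id}:{step_index}"


def handle_delivery(conn, run_id, step_index, run_job):
    """Run the job only if this is the first delivery for its key.

    Returns True if the job ran, False if the delivery was a duplicate.
    """
    key = idempotency_key(run_id, step_index)
    # The primary key makes the insert a single atomic check-and-set:
    # a second delivery hits the conflict and inserts nothing.
    cur = conn.execute(
        "INSERT OR IGNORE INTO executions (idempotency_key, status)"
        " VALUES (?, 'RUNNING')",
        (key,),
    )
    conn.commit()
    if cur.rowcount == 0:
        return False  # duplicate delivery, drop it
    run_job()
    conn.execute(
        "UPDATE executions SET status='SUCCEEDED' WHERE idempotency_key=?",
        (key,),
    )
    conn.commit()
    return True
```

This turns duplicate deliveries into no-ops. It does not by itself cover a worker that crashes after the insert but before finishing, so in an interview it would be paired with some form of lease or reconciler.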

3) How do users see job output while the job is running? How do you stream logs in real time? (when: Candidate does not mention real-time log streaming or user visibility)

Bad
Naive approach with serious trade-off — being authored.
Good
Solid baseline with reasonable trade-offs — being authored.
Great
Production-grade approach with explicit trade-off rationale — being authored.
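
One common shape for this answer is an append-only log per job that clients read from a cursor, so a viewer can reconnect and resume without missing or repeating lines. The sketch below is an in-memory illustration only. The `LogStore` class, its method names, and the `limit` parameter are hypothetical, and a real design would put durable storage and a push transport behind the same interface.

```python
class LogStore:
    """Append-only per-job log that readers follow by offset."""

    def __init__(self):
        self._lines = {}      # job_id -> list of log lines
        self._closed = set()  # job_ids whose output is complete

    def append(self, job_id, line):
        """Append one line and return its offset."""
        lines = self._lines.setdefault(job_id, [])
        lines.append(line)
        return len(lines) - 1

    def close(self, job_id):
        """Mark the log as complete once the job has finished."""
        self._lines.setdefault(job_id, [])
        self._closed.add(job_id)

    def read_from(self, job_id, offset, limit=100):
        """Return (lines, next_offset, done) starting at `offset`.

        `limit` bounds each response so a very chatty job cannot
        overwhelm a slow reader; the reader just asks again.
        """
        lines = self._lines.get(job_id, [])
        batch = lines[offset:offset + limit]
        next_offset = offset + len(batch)
        done = job_id in self._closed and next_offset >= len(lines)
        return batch, next_offset, done
```

A client loop would call `read_from` with the last `next_offset` it saw until `done` is true. Whether that loop runs over long polling, server-sent events, or WebSockets is a separate choice that this sketch leaves open.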

4) What happens if a job is in RUNNING state forever? How do you detect and recover from this? (when: Candidate misses stuck-state handling)

Bad
Naive approach with serious trade-off — being authored.
Good
Solid baseline with reasonable trade-offs — being authored.
Great
Production-grade approach with explicit trade-off rationale — being authored.
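
The Insider Notes mention a reconciler and a heartbeat mechanism for stuck RUNNING jobs. A minimal sketch of a periodic sweep is shown below, assuming each job records when it started and when its worker last sent a heartbeat. The `Job` fields, the thresholds, and the `TIMED_OUT` status are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    status: str
    started_at: float
    last_heartbeat_at: float
    attempt: int = 1


def reconcile(jobs, now, heartbeat_timeout=60.0, max_runtime=3600.0,
              max_attempts=3):
    """Sweep RUNNING jobs and repair the ones that look stuck.

    Returns the IDs of the jobs whose status was changed.
    """
    changed = []
    for job in jobs:
        if job.status != "RUNNING":
            continue
        if now - job.started_at > max_runtime:
            # The worker is alive but the job itself never finishes.
            job.status = "TIMED_OUT"
        elif now - job.last_heartbeat_at > heartbeat_timeout:
            # The worker stopped reporting: retry, or give up after
            # too many attempts.
            job.status = "PENDING" if job.attempt < max_attempts else "FAILED"
        else:
            continue
        changed.append(job.job_id)
    return changed
```

The two checks cover different failures: a missing heartbeat points at a dead or partitioned worker, while the runtime cap catches a job that hangs on a healthy worker.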

5) If the worker container only has a minimal image with no HTTP/RPC client, how does it notify the system of job completion? How do you retrieve logs? (when: Candidate assumes containers have full network access)

Bad
Naive approach with serious trade-off — being authored.
Good
Solid baseline with reasonable trade-offs — being authored.
Great
Production-grade approach with explicit trade-off rationale — being authored.
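
One possible answer is that the job never reports anything itself: a supervisor process outside the job launches it, reads its stdout/stderr, and observes its exit code, then reports on the job's behalf. The sketch below illustrates that with a plain subprocess, which matches the simplified scope where jobs are shell scripts. The `supervise` function and its callbacks are hypothetical names.

```python
import subprocess


def supervise(argv, on_line, on_exit):
    """Run a job as a child process and report for it.

    The job needs no HTTP or RPC client: the supervisor captures its
    output line by line and learns completion from the exit code.
    """
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
    )
    for line in proc.stdout:
        on_line(line.rstrip("\n"))  # e.g. forward to the log store
    exit_code = proc.wait()
    on_exit(exit_code)  # e.g. mark the job SUCCEEDED or FAILED
    return exit_code
```

For example, `supervise(["sh", "-c", "echo building"], print, print)` would forward the job's output and then its exit code. With containers in scope, the same role would be played by whatever launches the container, since it can see the container's output streams and exit status from the outside.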

6) How do you prevent one tenant's high load from starving other tenants' jobs? (when: Candidate doesn't address multi-tenant isolation)

Bad
Naive approach with serious trade-off — being authored.
Good
Solid baseline with reasonable trade-offs — being authored.
Great
Production-grade approach with explicit trade-off rationale — being authored.
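
The Insider Notes mention queue partitioning and quotas as the isolation tools. The sketch below combines the two in the simplest form: one queue per tenant, served round-robin, with a cap on how many jobs a tenant may have in flight. The `FairDispatcher` class and its method names are hypothetical, and everything is in memory for illustration.

```python
from collections import OrderedDict, deque


class FairDispatcher:
    """Round-robin over per-tenant queues with a per-tenant quota."""

    def __init__(self, max_in_flight_per_tenant):
        self._cap = max_in_flight_per_tenant
        self._queues = OrderedDict()  # tenant -> deque of job IDs
        self._in_flight = {}          # tenant -> running job count

    def submit(self, tenant, job_id):
        self._queues.setdefault(tenant, deque()).append(job_id)

    def next_job(self):
        """Return (tenant, job_id) for the next job to run, or None."""
        for _ in range(len(self._queues)):
            tenant, queue = next(iter(self._queues.items()))
            # Rotate so the next call starts with a different tenant.
            self._queues.move_to_end(tenant)
            running = self._in_flight.get(tenant, 0)
            if queue and running < self._cap:
                self._in_flight[tenant] = running + 1
                return tenant, queue.popleft()
        return None  # nothing queued, or every tenant is at its quota

    def finished(self, tenant):
        """Release one quota slot when a tenant's job completes."""
        self._in_flight[tenant] -= 1
```

A tenant that submits thousands of jobs only lengthens its own queue: other tenants still get a turn on every rotation, and the quota keeps the busy tenant from occupying every worker.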

7) How does your scheduler scale if you have millions of workflows? What happens if it crashes? (when: Candidate uses stateful scheduler)

Bad
Naive approach with serious trade-off — being authored.
Good
Solid baseline with reasonable trade-offs — being authored.
Great
Production-grade approach with explicit trade-off rationale — being authored.
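
The Insider Notes describe passers proposing a stateless scheduler driven by CDC on job status. One way to picture "stateless" is a pure function from a change event to an enqueue command, with all durable state living in the database and the queue. The sketch below assumes a hypothetical event shape (`JobChange`), a hypothetical command (`Enqueue`), and a queue that drops commands it has already seen.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JobChange:
    """Hypothetical change event emitted when a job row is updated."""
    run_id: str
    step_index: int
    new_status: str
    total_steps: int


@dataclass(frozen=True)
class Enqueue:
    run_id: str
    step_index: int

    @property
    def dedupe_key(self):
        return f"{self.run_id}:{self.step_index}"


def on_job_change(change: JobChange) -> Optional[Enqueue]:
    """Decide what to enqueue using only the event itself."""
    if change.new_status != "SUCCEEDED":
        return None
    next_step = change.step_index + 1
    if next_step >= change.total_steps:
        return None  # the workflow run is complete
    return Enqueue(change.run_id, next_step)


class DedupingQueue:
    """Stand-in for a queue that ignores repeated commands."""

    def __init__(self):
        self.items = []
        self._seen = set()

    def put(self, command):
        if command.dedupe_key in self._seen:
            return False
        self._seen.add(command.dedupe_key)
        self.items.append(command)
        return True
```

Because the handler keeps nothing in memory, any number of scheduler instances can consume the change stream, and a crashed instance is replaced by replaying events from its last checkpoint. Replays produce the same commands, which the dedupe key absorbs.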

8) Let's simplify: assume each workflow is just a linear sequence of jobs, one after another. How does that change your design? (when: Candidate uses DAG when interviewer wants linear)

Bad
Naive approach with serious trade-off — being authored.
Good
Solid baseline with reasonable trade-offs — being authored.
Great
Production-grade approach with explicit trade-off rationale — being authored.
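
With linear workflows, "what runs next" reduces to a single ordered lookup instead of a dependency-graph evaluation. The sketch below uses the columns the Insider Notes mention (created_at, a status enum, worker_id, step_index). The exact schema, the `SKIPPED` status, and the policy of skipping later steps after a failure are assumptions made for illustration; the failure policy is something to confirm with the interviewer.

```python
import sqlite3

SCHEMA = """
CREATE TABLE jobs (
    run_id     TEXT NOT NULL,
    step_index INTEGER NOT NULL,
    status     TEXT NOT NULL DEFAULT 'PENDING'
               CHECK (status IN
                      ('PENDING','RUNNING','SUCCEEDED','FAILED','SKIPPED')),
    worker_id  TEXT,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (run_id, step_index)
)
"""


def next_runnable(conn, run_id):
    """Return the step_index that may start now, or None.

    A step is runnable only if every earlier step has SUCCEEDED.
    """
    rows = conn.execute(
        "SELECT step_index, status FROM jobs"
        " WHERE run_id=? ORDER BY step_index",
        (run_id,),
    ).fetchall()
    for step_index, status in rows:
        if status == "SUCCEEDED":
            continue
        # The first unfinished step decides: start it if PENDING,
        # otherwise the run is busy, failed, or skipped.
        return step_index if status == "PENDING" else None
    return None  # every step has succeeded


def fail_step(conn, run_id, step_index):
    """Mark a step FAILED and skip the steps after it (assumed policy)."""
    conn.execute(
        "UPDATE jobs SET status='FAILED' WHERE run_id=? AND step_index=?",
        (run_id, step_index),
    )
    conn.execute(
        "UPDATE jobs SET status='SKIPPED'"
        " WHERE run_id=? AND step_index>? AND status='PENDING'",
        (run_id, step_index),
    )
    conn.commit()
```

There is no fan-out or fan-in to track, so the scheduler needs no graph state at all: the ordered rows for a run are the whole execution plan.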

What is Expected at Each Level?

L4 / Mid-level
Cover happy path. Clarify scope. Identify the obvious bottleneck. Pick a reasonable storage and reasonable scaling approach.
L5 / Senior (Target)
All of the above plus: explicit failure handling, durability vs latency trade-offs, choose the right batching/caching strategy, articulate why.
L6 / Staff+
All of the above plus: organizational concerns (rollout, migration, on-call), quantitative analysis, multi-region considerations, what could go wrong with the proposed solution at 10x scale.

Insider Notes

**Common mistakes:**
- Stacking modules without specifying component interactions, APIs, or DB schemas
- Missing the real-time log streaming component entirely
- Designing a stateful scheduler that becomes a bottleneck and single point of failure
- Assuming at-least-once delivery is sufficient without addressing idempotency/exactly-once
- Not handling stuck RUNNING jobs (no reconciler or heartbeat mechanism)
- Not clarifying scope (K8s vs shell scripts, DAG vs linear) before designing
- Designing a DAG-based system when the interviewer explicitly says linear steps only
- Missing multi-tenant isolation and resource fairness
- Using a simplistic polling-based scheduler instead of a CDC/event-driven approach
- Not preparing for container-specific corner cases (minimal image, log retrieval)

**Interviewer hints:**
- 'Think about what happens if the worker dies.' (when candidate missed fault tolerance)
- 'How do users actually see what's happening with their job?' (prompting log streaming discussion)
- 'Let's say the container has a minimal image: no curl, no HTTP client. How does the system know the job is done?' (container corner case)
- 'Think about at-least-once vs exactly-once: which does your design give you?' (prompting idempotency discussion)
- 'You can treat each job as a single task, no need for DAG.' (simplifying scope when candidate over-engineered)
- 'What API protocol would you use here, and why not just curl?' (prompting protocol discussion)
- 'Think about the container observability problem.' (after candidate missed log streaming)

**What passers do:**
- Explicitly framed the problem as a job scheduler from the start
- Proposed a stateless scheduler using CDC for status-driven job enqueuing
- Proactively addressed exactly-once semantics with idempotency key + DB check
- Covered real-time log streaming without being prompted
- Proposed a reconciler for stuck-state detection and recovery
- Clarified scope upfront (linear vs DAG, containers vs shell scripts)
- Detailed the DB schema with created_at, status enum, worker_id, step_index
- Addressed multi-tenant isolation (queue partitioning, quotas)
- Described every layer in detail when probed, with no hand-waving

**Why people fail:**
- Described modules at a high level without component detail, then ran out of time
- Missed log streaming entirely, only mentioned job status
- Could not answer container corner case questions (minimal image, no HTTP client)
- Treated it as generic web system design without job-scheduler specifics
- Did not address stuck RUNNING states at all
- Used a stateful scheduler without explaining how it scales or recovers
- Did not clarify scope and designed an overly complex DAG engine
- Could not respond to the exactly-once semantics follow-up

**Edge cases probed:**
- Minimal container image with no HTTP/RPC/curl: how to notify job completion
- Container log retrieval and debugging when there is no log client inside the container
- Stuck jobs in RUNNING state (worker crash, network partition)
- Concurrent double execution of the same job from a queue retry
- Workflow where a mid-step job fails: do subsequent steps run?
- Very large log output streaming (backpressure handling)
- Multi-tenant resource starvation (noisy neighbor)
- Git push events arriving out of order or duplicate pushes
- YAML config file missing or malformed at workflow trigger time

**Alternative approaches:**
- DAG-based workflow engine (e.g., Airflow-style): More general than linear steps; supports parallel job execution within a workflow. Increases scheduler complexity significantly (dependency graph evaluation, fan-out/fan-in logic). Overkill if workflows are confirmed to be linear sequences. Interviewers may explicitly say DAG is not needed.
- Push-based worker notification (vs pull-based queue): Scheduler directly calls workers (RPC) instead of workers pulling from a queue. Lower latency but tighter coupling, harder to scale workers independently, and requires the scheduler to track worker availability.
- Polling-based scheduler (vs CDC): Scheduler periodically polls the DB for jobs to enqueue. Simpler to implement but adds latency, increases DB load, and can miss rapid status transitions. CDC is preferred for responsiveness.
- K8s-native approach (CRDs + controllers): Use Kubernetes CRDs for workflow/job resources and a custom controller as the scheduler. Leverages K8s primitives (pods, resource limits). Appropriate when K8s is explicitly in scope, but adds infrastructure complexity and is out of scope when the interviewer says to ignore container/K8s concerns.