System Design: CI/CD Pipeline

Question

Design a multi-tenant CI/CD system that schedules and executes user-defined workflows in response to git push events. The system receives push notifications via internal API calls containing a repository ID and the current commit hash. Workflows are defined as a sequence of jobs in a single YAML file stored at a static location within each repository. Users must be able to view the output and status of jobs in real time as they execute. The system must be highly scalable (supporting many tenants/repositories), fault-tolerant, and must implement exactly-once execution semantics.

Full canonical prompt (verbatim from multiple reports): 'Design a multi-tenant CI/CD system which schedules and executes user-defined workflows in response to git pushes. The system receives information about pushes via API calls from an internal service which contain the repository id and the current state of the repository (commit hash). Workflows are a sequence of jobs which are defined within a single YAML file in a static location for each repository. Users should be able to view the output and status of jobs as they are running.'

Variant framing: Also asked as 'Design a GitHub Actions-like system' or 'Design a CI/CD job scheduler using K8s and Docker.' Some interviewers explicitly simplify scope: jobs are shell scripts (no K8s/containers needed), and workflows are linear sequences (not DAGs). Clarify scope before designing.

AceOffer · Accepted Answer

Reported follow-ups:
1. What happens if a worker dies mid-execution? How do you ensure the job is not lost or double-executed? (when: Candidate does not address worker failure or job recovery)
2. How do you ensure exactly-once execution? What if the same job is delivered twice from the queue? (when: Candidate addresses only at-least-once delivery without exactly-once semantics)
3. How do users see job output while the job is running? How do you stream logs in real time? (when: Candidate does not mention real-time log streaming or user visibility)
4. What happens if a job is in RUNNING state forever? How do you detect and recover from this? (when: Candidate misses stuck-state handling)
5. If the worker container only has a minimal image with no HTTP/RPC client, how does it notify the system of job completion? How do you retrieve logs? (when: Candidate assumes containers have full network access)
6. How do you prevent one tenant's high load from starving other tenants' jobs? (when: Candidate doesn't address multi-tenant isolation)
7. How does your scheduler scale if you have millions of workflows? What happens if it crashes? (when: Candidate uses stateful scheduler)
8. Let's simplify: assume each workflow is just a linear sequence of jobs, one after another. How does that change your design? (when: Candidate uses DAG when interviewer wants linear)

**Common mistakes:**
- Stacking modules without specifying component interactions, APIs, or DB schemas
- Missing the real-time log streaming component entirely
- Designing a stateful scheduler that becomes a bottleneck and single point of failure
- Assuming at-least-once delivery is sufficient without addressing idempotency/exactly-once
- Not handling stuck RUNNING jobs (no reconciler or heartbeat mechanism)
- Not clarifying scope (K8s vs shell scripts, DAG vs linear) before designing
- Designing a DAG-based system when the interviewer explicitly says linear steps only
- Missing multi-tenant isolation and resource fairness
- Using a simplistic polling-based scheduler instead of a CDC/event-driven approach
- Not preparing for container-specific corner cases (minimal image, log retrieval)
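To make the multi-tenant fairness point concrete, here is a minimal in-memory sketch of one possible approach: a FIFO queue per tenant, round-robin dispatch across tenants, and a cap on concurrently running jobs per tenant. The class name `FairTenantQueue`, its methods, and the `max_running_per_tenant` quota are hypothetical names chosen for illustration; a production system would back this with a durable, partitioned queue rather than process memory.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple


class FairTenantQueue:
    """Round-robin over per-tenant FIFO queues with a per-tenant running cap.

    Hypothetical sketch: all state lives in memory and nothing is durable.
    """

    def __init__(self, max_running_per_tenant: int) -> None:
        self.max_running = max_running_per_tenant
        self.queues: Dict[str, Deque[str]] = {}
        # Rotation order; contains exactly the tenants with pending jobs.
        self.order: Deque[str] = deque()
        self.running: Dict[str, int] = {}

    def enqueue(self, tenant_id: str, job_id: str) -> None:
        if tenant_id not in self.queues:
            self.queues[tenant_id] = deque()
            self.order.append(tenant_id)
        self.queues[tenant_id].append(job_id)

    def dequeue(self) -> Optional[Tuple[str, str]]:
        """Return (tenant_id, job_id) for the next runnable job, or None."""
        # Visit each tenant with pending work at most once per call.
        for _ in range(len(self.order)):
            tenant_id = self.order.popleft()
            queue = self.queues[tenant_id]
            if self.running.get(tenant_id, 0) < self.max_running:
                job_id = queue.popleft()
                self.running[tenant_id] = self.running.get(tenant_id, 0) + 1
                if queue:
                    self.order.append(tenant_id)  # back of the rotation
                else:
                    del self.queues[tenant_id]
                return tenant_id, job_id
            # Tenant is at its quota: keep its jobs queued and move on.
            self.order.append(tenant_id)
        return None

    def complete(self, tenant_id: str) -> None:
        """Release one running slot when a tenant's job finishes."""
        self.running[tenant_id] -= 1
```

With this shape, a tenant that enqueues thousands of jobs still gets only one dispatch per rotation and never exceeds its quota, so a small tenant's single job waits for at most one job from each other tenant.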

**Interviewer hints:**
- 'Think about what happens if the worker dies.' (when candidate missed fault tolerance)
- 'How do users actually see what's happening with their job?' (prompting log streaming discussion)
- 'Let's say the container has a minimal image: no curl, no HTTP client. How does the system know the job is done?' (container corner case)
- 'Think about at-least-once vs exactly-once: which does your design give you?' (prompting idempotency discussion)
- 'You can treat each job as a single task, no need for DAG.' (simplifying scope when candidate over-engineered)
- 'What API protocol would you use here, and why not just curl?' (prompting protocol discussion)
- 'Think about the container observability problem.' (after candidate missed log streaming)
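One common answer to the minimal-image hint is that the job itself never needs to call back: a worker agent outside the job process launches it, reads its stdout/stderr, and observes its exit code. The sketch below illustrates that idea with a plain child process standing in for the container; `run_job` and `emit_log` are hypothetical names, and in a real deployment the container runtime would expose the same two signals (output stream and exit status) to the host.

```python
import subprocess
import sys
from typing import Callable, List


def run_job(command: List[str], emit_log: Callable[[str], None]) -> int:
    """Run a job as a child process, stream its output, return its exit code.

    The job needs no HTTP/RPC client: the supervising agent sees everything
    it writes and learns about completion from the process exit status.
    """
    proc = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
        bufsize=1,  # line-buffered on the reading side
    )
    assert proc.stdout is not None
    for line in proc.stdout:
        # In a full system this would publish to a log stream for live viewers.
        emit_log(line.rstrip("\n"))
    return proc.wait()


if __name__ == "__main__":
    code = run_job([sys.executable, "-c", "print('building...')"], print)
    print("exit code:", code)
```

The agent, not the job, is then responsible for writing the final status to the database and forwarding log lines, which keeps job images free of any platform-specific client.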

**What passers do:**
- Explicitly framed the problem as a job scheduler from the start
- Proposed a stateless scheduler using CDC for status-driven job enqueuing
- Proactively addressed exactly-once semantics with idempotency key + DB check
- Covered real-time log streaming without being prompted
- Proposed a reconciler for stuck-state detection and recovery
- Clarified scope upfront (linear vs DAG, containers vs shell scripts)
- Detailed the DB schema with created_at, status enum, worker_id, step_index
- Addressed multi-tenant isolation (queue partitioning, quotas)
- Described every layer in detail when probed, with no hand-waving
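The schema and the "idempotency key + DB check" mentioned above can be sketched as follows, using SQLite only so the example is runnable. The table layout, the status values, and the `repo:commit:step` key format are assumptions for illustration; only the column names `created_at`, `status`, `worker_id`, and `step_index` come from the notes above. The key idea is a conditional update: a worker may start a job only if it wins the transition from QUEUED to RUNNING, so a duplicate queue delivery becomes a no-op.

```python
import sqlite3

# Hypothetical schema; job_id doubles as the idempotency key.
SCHEMA = """
CREATE TABLE jobs (
    job_id      TEXT PRIMARY KEY,      -- e.g. "<repo_id>:<commit_hash>:<step_index>"
    workflow_id TEXT NOT NULL,
    step_index  INTEGER NOT NULL,
    status      TEXT NOT NULL CHECK (
        status IN ('PENDING', 'QUEUED', 'RUNNING', 'SUCCEEDED', 'FAILED')
    ),
    worker_id   TEXT,
    created_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def claim_job(conn: sqlite3.Connection, job_id: str, worker_id: str) -> bool:
    """Atomically move a job from QUEUED to RUNNING for one worker.

    Returns True only for the worker whose UPDATE matched the row; any
    redelivery of the same message finds status != 'QUEUED' and gets False.
    """
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE jobs SET status = 'RUNNING', worker_id = ? "
            "WHERE job_id = ? AND status = 'QUEUED'",
            (worker_id, job_id),
        )
    return cur.rowcount == 1
```

Note that this gives at-most-one claim per job; combined with at-least-once delivery from the queue and a retry path for crashed workers, the result is effectively-once execution rather than a guarantee from the queue itself.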

**Why people fail:**
- Described modules at a high level without component detail, then ran out of time
- Missed log streaming entirely, only mentioned job status
- Could not answer container corner case questions (minimal image, no HTTP client)
- Treated it as generic web system design without job-scheduler specifics
- Did not address stuck RUNNING states at all
- Used a stateful scheduler without explaining how it scales or recovers
- Did not clarify scope and designed an overly complex DAG engine
- Could not respond to the exactly-once semantics follow-up

**Edge cases probed:**
- Minimal container image with no HTTP/RPC/curl: how to notify job completion
- Container log retrieval and debugging when there is no log client inside the container
- Stuck jobs in RUNNING state (worker crash, network partition)
- Concurrent double execution of the same job from a queue retry
- Workflow where a mid-step job fails: do subsequent steps run?
- Very large log output streaming (backpressure handling)
- Multi-tenant resource starvation (noisy neighbor)
- Git push events arriving out of order or duplicate pushes
- YAML config file missing or malformed at workflow trigger time
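For the stuck-RUNNING edge case, one possible shape for a reconciler is a periodic sweep that compares each running job's last heartbeat against a timeout. The sketch below is a pure function over in-memory records so the logic is easy to follow; the `Job` fields, `heartbeat_timeout`, and `max_attempts` are hypothetical names, and the sketch assumes `attempts` is incremented elsewhere each time a worker claims the job.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    job_id: str
    status: str            # 'QUEUED', 'RUNNING', 'SUCCEEDED' or 'FAILED'
    last_heartbeat: float  # epoch seconds of the worker's latest heartbeat
    attempts: int = 0      # number of times the job has been claimed so far


def reconcile(
    jobs: List[Job], now: float, heartbeat_timeout: float, max_attempts: int
) -> List[str]:
    """Requeue or fail RUNNING jobs whose worker has stopped heartbeating.

    Returns the ids of jobs that were put back in the QUEUED state.
    """
    requeued: List[str] = []
    for job in jobs:
        if job.status != "RUNNING":
            continue
        if now - job.last_heartbeat <= heartbeat_timeout:
            continue  # worker is still alive as far as we can tell
        if job.attempts >= max_attempts:
            job.status = "FAILED"  # give up instead of retrying forever
        else:
            job.status = "QUEUED"
            requeued.append(job.job_id)
    return requeued
```

A network partition can make a healthy worker look dead, so a requeue like this only stays safe if the original worker's late status write is rejected, for example by the same conditional-update check used when claiming a job.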

**Alternative approaches:**
- DAG-based workflow engine (e.g., Airflow-style): More general than linear steps; supports parallel job execution within a workflow. Increases scheduler complexity significantly (dependency graph evaluation, fan-out/fan-in logic). Overkill if workflows are confirmed to be linear sequences. Interviewers may explicitly say DAG is not needed.
- Push-based worker notification (vs pull-based queue): Scheduler directly calls workers (RPC) instead of workers pulling from the queue. Lower latency but tighter coupling, harder to scale workers independently, and requires the scheduler to track worker availability.
- Polling-based scheduler (vs CDC): Scheduler periodically polls the DB for jobs to enqueue. Simpler to implement but adds latency, increases DB load, and can miss rapid status transitions. CDC is preferred for responsiveness.
- K8s-native approach (CRDs + controllers): Use Kubernetes CRDs for workflow/job resources and a custom controller as the scheduler. Leverages K8s primitives (pods, resource limits). Appropriate when K8s is explicitly in scope, but adds infrastructure complexity and is out of scope when the interviewer says to ignore container/K8s concerns.
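To illustrate why the linear case is so much simpler than a DAG engine, the scheduling decision for a linear workflow can be reduced to a scan over the ordered step statuses, with no dependency graph to evaluate. This is a minimal sketch under two assumptions that should be confirmed with the interviewer: a failed step halts the workflow, and at most one step is in flight at a time. The function name `next_step` and the status strings are hypothetical.

```python
from typing import List, Optional


def next_step(statuses: List[str]) -> Optional[int]:
    """Return the index of the next step to enqueue, or None if there is none.

    `statuses` holds the status of each step in workflow order. Assumed
    policy: run steps strictly in order and stop at the first failure.
    """
    for index, status in enumerate(statuses):
        if status == "SUCCEEDED":
            continue  # already done, look at the following step
        if status == "PENDING":
            return index  # every earlier step succeeded, so this one is next
        # FAILED halts the workflow; QUEUED/RUNNING means a step is in flight.
        return None
    return None  # all steps succeeded (or the workflow is empty)
```

A stateless scheduler can call a function like this each time a job's status changes, which is the event a CDC stream would deliver.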

Understanding the Problem

Functional Requirements

Non-Functional Requirements

The Set Up

Defining the Core Entities

The API

High-Level Design

Potential Deep Dives

1) What happens if a worker dies mid-execution? How do you ensure the job is not lost or double-executed? (when: Candidate does not address worker failure or job recovery)

2) How do you ensure exactly-once execution? What if the same job is delivered twice from the queue? (when: Candidate addresses only at-least-once delivery without exactly-once semantics)

3) How do users see job output while the job is running? How do you stream logs in real time? (when: Candidate does not mention real-time log streaming or user visibility)

4) What happens if a job is in RUNNING state forever? How do you detect and recover from this? (when: Candidate misses stuck-state handling)

5) If the worker container only has a minimal image with no HTTP/RPC client, how does it notify the system of job completion? How do you retrieve logs? (when: Candidate assumes containers have full network access)

6) How do you prevent one tenant's high load from starving other tenants' jobs? (when: Candidate doesn't address multi-tenant isolation)

7) How does your scheduler scale if you have millions of workflows? What happens if it crashes? (when: Candidate uses stateful scheduler)

8) Let's simplify: assume each workflow is just a linear sequence of jobs, one after another. How does that change your design? (when: Candidate uses DAG when interviewer wants linear)

What is Expected at Each Level?

Insider Notes
