AO
AceOffer
·
Back

Web Crawler

Phone ScreenPhone ScreenSoftware Engineer, Machine Learning EngineerLast reported April 2026High Frequency

Problem Overview

Given a seed URL and a provided helper function (variously named getURLs, getLinks, getPageLinks, htmlParser.getUrls, or a jsoup wrapper) that fetches all hyperlinks found on a page, implement a web crawler that: (1) starts from the seed URL, (2) only visits pages within the same hostname/domain as the seed, (3) avoids re-visiting URLs (deduplication via a visited set), and (4) strips URL fragment identifiers (the '#...' portion) from URLs before deduplication — URL normalization beyond fragment stripping is explicitly not required.
You're given
function getURLs(url: string): string[];
function getLinks(url: string): string[];
function getPageLinks(url: string): string[];
function htmlParser.getUrls(url: string): string[];
Helper variants seen across reports. Don't implement HTTP fetching or HTML parsing — the helper handles both.
What you must implement
  1. starts from the seed URL
  2. only visits pages within the same hostname/domain as the seed
  3. avoids re-visiting URLs (deduplication via a visited set)
  4. strips URL fragment identifiers (the '#...' portion) from URLs before deduplication — URL normalization beyond fragment stripping is explicitly not required.
Where you'll code it
ReplitCodeSignal
Live shared coding environment. You may need to create test files / inputs yourself. A correct run produces 100 unique URLs.
Full problem statement
Given a seed URL and a provided helper function (variously named getURLs, getLinks, getPageLinks, htmlParser.getUrls, or a jsoup wrapper) that fetches all hyperlinks found on a page, implement a web crawler that: (1) starts from the seed URL, (2) only visits pages within the same hostname/domain as the seed, (3) avoids re-visiting URLs (deduplication via a visited set), and (4) strips URL fragment identifiers (the '#...' portion) from URLs before deduplication — URL normalization beyond fragment stripping is explicitly not required. The crawler is run against a real small website (via Replit or CodeSignal); a correct result produces fewer than ~100 unique URLs. The helper function handles HTTP fetching and HTML parsing; candidates are not required to implement those. Start with a single-threaded solution, then extend to a concurrent version. Interview confirmation emails note: 'The interviewer will be checking if you're coding productively, reviewing concurrency primitives, etc.'

Follow-up Arc

Interviewers escalate through these phases. The order varies, but most candidates see at least one from each bucket.
Concurrency · 6Distributed scale · 1Edge cases · 5Trade-off discussion · 1
Concurrency
How would you optimize this to run faster? Implement a concurrent/multi-threaded version.
Probes for: After single-threaded solution is working
How would you implement this so that a task is processed immediately once it is ready (no polling delay)?
Probes for: Discussing or after implementing multi-threading
How would you benchmark/compare the runtime of the single-threaded vs concurrent version?
Probes for: After complete solution
How do you handle in-flight / pending requests to avoid fetching the same URL twice concurrently?
Probes for: After basic crawler is working
How would you improve concurrency further / reduce lock contention?
Probes for: If candidate uses locks for shared state
Why did you choose ThreadPoolExecutor (or asyncio)? What are the trade-offs?
Probes for: Choice of concurrency model
Distributed scale
How would you design this to run on multiple servers (distributed crawling for millions of seed URLs)?
Probes for: After concurrent implementation
Edge cases
Our crawler is too aggressive and overloads the servers we crawl. How would you implement a politeness policy?
Probes for: During distributed system discussion
Many different URLs point to the same or very similar content. How would you detect and handle this?
Probes for: Distributed or large-scale variant
How would you handle relative URLs (e.g., /about vs https://example.com/about)?
Probes for: After basic crawler is working
How would you detect and prevent redirect loops?
Probes for: After basic crawler is working
How would you implement rate limiting or throttling (e.g., respect robots.txt or avoid hammering a single domain)?
Probes for: After basic crawler is working
Trade-off discussion
What is the difference between threads and processes? When would you use one over the other?
Probes for: Discussing concurrency mechanisms

Approach Trade-offs

Approaches actually attempted in reports — including ones that lost candidates time. Pick deliberately.
ApproachNotes
asyncio with asyncio.Queue + worker coroutinesElegant for I/O-bound work and avoids GIL concerns; requires making the fetch/parse step async (e.g., wrapping blocking calls with asyncio.to_thread or switching to aiohttp). Risk: if the provided helper is a blocking call, the event loop will be blocked unless explicitly offloaded. Several candidates who tried writing a custom async HTML parser hit encoding errors or ran out of time.
DFS (recursive or stack-based) single-threaded baselineSimpler to implement first; some candidates started here and then refactored. Risk of stack overflow for deep sites; BFS is generally preferred for breadth-first discovery and easier to parallelize.
Distributed multi-server design (follow-up only, no code required)Use a central queue (e.g., Redis) with one coordinator assigning URL batches to worker servers; adds fault tolerance and horizontal scale but significant operational complexity.

Practice

Open the editor to write a solution against test cases, then return here to compare against the follow-ups.
Open Editor →
Anthropic · Phone Screen · Reported 57× across candidate reports