Web Crawler

Question

Given a seed URL and a provided helper function (variously named getURLs, getLinks, getPageLinks, htmlParser.getUrls, or a jsoup wrapper) that returns all hyperlinks found on a page, implement a web crawler that:
1. starts from the seed URL;
2. only visits pages within the same hostname/domain as the seed;
3. avoids re-visiting URLs (deduplication via a visited set);
4. strips URL fragment identifiers (the '#...' portion) from URLs before deduplication. URL normalization beyond fragment stripping is explicitly not required.

The crawler is run against a real small website (via Replit or CodeSignal); a correct result produces fewer than ~100 unique URLs. The helper function handles HTTP fetching and HTML parsing; candidates are not required to implement those. Start with a single-threaded solution, then extend it to a concurrent version. Interview confirmation emails note: 'The interviewer will be checking if you're coding productively, reviewing concurrency primitives, etc.'
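
The single-threaded baseline described above can be sketched roughly as follows. This is a minimal illustration, not the expected answer: `crawl` and `get_links` are hypothetical names, with `get_links` standing in for whichever helper the interviewer provides, and the same-host check assumes that comparing hostnames is what "same domain" means here.

```python
from collections import deque
from urllib.parse import urldefrag, urlsplit


def crawl(seed, get_links):
    """Breadth-first crawl restricted to the seed's hostname.

    get_links is a placeholder for the provided helper: it takes a URL
    and returns the hyperlinks found on that page.
    """
    seed = urldefrag(seed).url              # drop any '#...' fragment
    host = urlsplit(seed).hostname
    visited = {seed}                        # URLs already discovered
    queue = deque([seed])                   # URLs waiting to be fetched

    while queue:
        url = queue.popleft()
        for link in get_links(url):
            link = urldefrag(link).url      # strip fragment before dedup
            if urlsplit(link).hostname != host:
                continue                    # different host, or no host at all
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return visited
```

Marking a URL as visited when it is enqueued, rather than when it is fetched, keeps the same URL from being queued twice.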

AceOffer · Accepted Answer

Reported follow-ups:
1. How would you optimize this to run faster? Implement a concurrent/multi-threaded version. (when: After single-threaded solution is working)
2. How would you implement this so that a task is processed immediately once it is ready (no polling delay)? (when: Discussing or after implementing multi-threading)
3. What is the difference between threads and processes? When would you use one over the other? (when: Discussing concurrency mechanisms)
4. How would you design this to run on multiple servers (distributed crawling for millions of seed URLs)? (when: After concurrent implementation)
5. Our crawler is too aggressive and overloads the servers we crawl. How would you implement a politeness policy? (when: During distributed system discussion)
6. Many different URLs point to the same or very similar content. How would you detect and handle this? (when: Distributed or large-scale variant)
7. How would you benchmark/compare the runtime of the single-threaded vs concurrent version? (when: After complete solution)
8. How would you improve the concurrency performance and throughput of your solution? (when: After initial implementation is working)
9. How would you handle relative URL resolution, redirect loops, rate limiting, and hung/stalled requests? (when: If implementing a crawler)
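
For follow-ups 1 and 2, one possible shape for a multi-threaded version is sketched below. It assumes the provided helper is a blocking call that is safe to invoke from several threads; `crawl_concurrent`, `get_links`, and `max_workers` are hypothetical names. `concurrent.futures.wait` with `FIRST_COMPLETED` blocks until some fetch finishes, so results are handled as soon as they are ready rather than on a polling interval.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from urllib.parse import urldefrag, urlsplit


def crawl_concurrent(seed, get_links, max_workers=8):
    """Crawl the seed's host, fetching pages on a thread pool."""
    seed = urldefrag(seed).url
    host = urlsplit(seed).hostname
    visited = {seed}        # only the calling thread touches this set

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = {pool.submit(get_links, seed)}
        while pending:
            # Blocks until at least one fetch completes; no sleep/poll loop.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                try:
                    links = future.result()
                except Exception:
                    continue    # a failed page should not stop the crawl
                for link in links:
                    link = urldefrag(link).url
                    if urlsplit(link).hostname == host and link not in visited:
                        visited.add(link)
                        pending.add(pool.submit(get_links, link))
    return visited
```

Because worker threads only run the helper and all bookkeeping stays on the calling thread, the visited set needs no lock in this sketch. For follow-up 7, timing both versions against the same site with `time.perf_counter()` gives a simple comparison.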

**Alternative approaches:**
1. asyncio with asyncio.Queue + worker coroutines: Elegant for I/O-bound work and avoids GIL concerns, but the fetch/parse step must be made async (e.g., by wrapping blocking calls with asyncio.to_thread or switching to aiohttp). Risk: if the provided helper is a blocking call, it blocks the event loop unless explicitly offloaded. Several candidates who tried writing a custom async HTML parser hit encoding errors or ran out of time.
2. DFS (recursive or stack-based) single-threaded baseline: Simpler to implement first; some candidates started here and then refactored. Recursive DFS risks stack overflow on deep sites; BFS is generally preferred because it discovers pages level by level and is easier to parallelize.
3. Distributed multi-server design (follow-up only, no code required): Use a central queue (e.g., Redis) with one coordinator assigning URL batches to worker servers; this adds fault tolerance and horizontal scale at the cost of significant operational complexity.
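
The asyncio approach mentioned above might look something like the sketch below. It assumes Python 3.9+ (for `asyncio.to_thread`) and a blocking helper; `crawl_async`, `get_links`, and `num_workers` are hypothetical names. The blocking call is offloaded to a thread so that it does not stall the event loop.

```python
import asyncio
from urllib.parse import urldefrag, urlsplit


async def crawl_async(seed, get_links, num_workers=8):
    """Crawl the seed's host using a queue and worker coroutines."""
    seed = urldefrag(seed).url
    host = urlsplit(seed).hostname
    visited = {seed}
    queue = asyncio.Queue()
    queue.put_nowait(seed)

    async def worker():
        while True:
            url = await queue.get()
            try:
                # get_links is assumed to block, so run it in a thread.
                links = await asyncio.to_thread(get_links, url)
                for link in links:
                    link = urldefrag(link).url
                    if urlsplit(link).hostname == host and link not in visited:
                        visited.add(link)
                        queue.put_nowait(link)
            except Exception:
                pass                # skip pages whose fetch fails
            finally:
                queue.task_done()   # always balance the get() above

    workers = [asyncio.create_task(worker()) for _ in range(num_workers)]
    await queue.join()              # returns once every queued URL is processed
    for task in workers:
        task.cancel()               # workers are idle in queue.get() by now
    await asyncio.gather(*workers, return_exceptions=True)
    return visited
```

It would be driven with `asyncio.run(crawl_async(seed, get_links))`. All coroutines run on one thread, so the visited set is only modified between `await` points and needs no lock.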
