Develop a High-Performance Concurrent Web Crawler in C++
Build an advanced mini-project that involves designing and implementing a multithreaded web crawler in C++ capable of efficiently scanning and indexing web pages with rate limiting and duplicate URL detection.
Challenge prompt
Create a C++ program that takes a seed URL and crawls the web pages connected through hyperlinks up to a specified depth. Your crawler must support concurrency to speed up retrieval, avoid revisiting the same URL multiple times, and respect a rate limit of requests per domain to prevent server overload. Collected URLs and their page titles should be stored in a thread-safe data structure and printed out once crawling finishes. Requirements: 1. Accept a seed URL and crawl depth as input. 2. Use multiple threads to fetch pages concurrently. 3. Implement a mechanism to detect and skip duplicate URLs. 4. Enforce per-domain rate limiting (e.g. max 1 request per second per domain). 5. Extract the page title (from the <title> HTML tag) for indexing with the URL. 6. Ensure all shared data structures are safely accessed in a concurrent environment. 7. Output all collected URLs with their titles in any order after crawling completes.
Guidance
- • Use a thread-safe queue or work-stealing queue to manage URLs to visit across threads.
- • Implement a hash set or map with mutex protection or concurrent containers to track visited URLs and avoid duplicates.
- • Use a timer or timestamp map to enforce rate limiting per domain before making HTTP requests.
- • Leverage an HTTP client library such as libcurl or boost::asio with async calls for fetching page content.
Hints
- • Parsing HTML can be simplified by searching for <title> tags with string manipulation or a lightweight HTML parser like Gumbo or pugixml.
- • Domain extraction from URLs can be performed by parsing the URL string or using existing URL parsing utilities.
- • Synchronization primitives like std::mutex, std::shared_mutex, or atomic flags will help manage concurrent access.
Starter code
#include <iostream>
#include <string>
#include <thread>
#include <queue>
#include <unordered_set>
#include <mutex>
#include <condition_variable>
// You may include additional libraries such as libcurl, regex, or parsers
int main() {
std::cout << "Enter seed URL and crawl depth:\n";
std::string seedUrl;
int maxDepth;
std::cin >> seedUrl >> maxDepth;
// TODO: Initialize shared structures for URLs, visited set, synchronization primitives
// TODO: Start worker threads to process URLs concurrently
// TODO: Crawl pages, apply rate limiting, extract titles, and collect results
// TODO: Print all collected URLs with their titles
return 0;
}
Expected output
URL: http://example.com Title: Example Domain URL: http://example.com/about Title: About Example ... (all crawled URLs with extracted titles)
Core concepts
Challenge a Friend
Send this duel to someone else and see if they can solve it.