cppadvanced120 minutes

Develop a High-Performance Concurrent Web Crawler in C++

Build an advanced mini-project that involves designing and implementing a multithreaded web crawler in C++ capable of efficiently scanning and indexing web pages with rate limiting and duplicate URL detection.

Challenge prompt

Create a C++ program that takes a seed URL and crawls the web pages connected through hyperlinks up to a specified depth. Your crawler must support concurrency to speed up retrieval, avoid revisiting the same URL multiple times, and respect a rate limit of requests per domain to prevent server overload. Collected URLs and their page titles should be stored in a thread-safe data structure and printed out once crawling finishes. Requirements: 1. Accept a seed URL and crawl depth as input. 2. Use multiple threads to fetch pages concurrently. 3. Implement a mechanism to detect and skip duplicate URLs. 4. Enforce per-domain rate limiting (e.g. max 1 request per second per domain). 5. Extract the page title (from the <title> HTML tag) for indexing with the URL. 6. Ensure all shared data structures are safely accessed in a concurrent environment. 7. Output all collected URLs with their titles in any order after crawling completes.

Guidance

  • Use a thread-safe queue or work-stealing queue to manage URLs to visit across threads.
  • Implement a hash set or map with mutex protection or concurrent containers to track visited URLs and avoid duplicates.
  • Use a timer or timestamp map to enforce rate limiting per domain before making HTTP requests.
  • Leverage an HTTP client library such as libcurl or boost::asio with async calls for fetching page content.

Hints

  • Parsing HTML can be simplified by searching for <title> tags with string manipulation or a lightweight HTML parser like Gumbo or pugixml.
  • Domain extraction from URLs can be performed by parsing the URL string or using existing URL parsing utilities.
  • Synchronization primitives like std::mutex, std::shared_mutex, or atomic flags will help manage concurrent access.

Starter code

#include <iostream>
#include <string>
#include <thread>
#include <queue>
#include <unordered_set>
#include <mutex>
#include <condition_variable>
// You may include additional libraries such as libcurl, regex, or parsers

int main() {
    std::cout << "Enter seed URL and crawl depth:\n";
    std::string seedUrl;
    int maxDepth;
    std::cin >> seedUrl >> maxDepth;

    // TODO: Initialize shared structures for URLs, visited set, synchronization primitives
    // TODO: Start worker threads to process URLs concurrently
    // TODO: Crawl pages, apply rate limiting, extract titles, and collect results

    // TODO: Print all collected URLs with their titles
    return 0;
}

Expected output

URL: http://example.com Title: Example Domain URL: http://example.com/about Title: About Example ... (all crawled URLs with extracted titles)

Core concepts

multithreadingconcurrencynetwork programmingdata structuresrate limiting

Challenge a Friend

Send this duel to someone else and see if they can solve it.