cpp · advanced · 90 minutes

Advanced Multithreaded File Processing and Data Aggregation in C++

Create a performant C++ application that reads multiple large text files concurrently, processes the extracted data to compute aggregate statistics, and outputs a sorted summary report.

Challenge prompt

Build a C++ program that accepts a list of file paths, reads each file in parallel using multithreading, extracts integer values from each line, and calculates the total sum, average, maximum, and minimum values across all files. Finally, output a summary report sorted by file name that includes these statistics for each file and a combined aggregate for all files.

Guidance

  • Use the thread support library available since C++11 (e.g., std::thread, std::mutex) for concurrent file reading. The starter code also uses structured bindings, so compile it as C++17 or later.
  • Design thread-safe data structures or use synchronization primitives to aggregate data safely.
  • Optimize file reading and parsing to handle large files without excessive memory usage.
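One way to keep memory use flat is to fold each line into running statistics as soon as it is read, so no file is ever held in memory as a whole. The sketch below assumes a line may contain several whitespace-separated integers and that any token which is not entirely a valid int is skipped; both are assumptions, since the prompt does not pin down the input format. RunningStats and accumulateLine are hypothetical names, and std::from_chars requires C++17.

```cpp
#include <charconv>
#include <limits>
#include <string_view>
#include <system_error>

// Hypothetical running-statistics type mirroring the starter code's fields.
struct RunningStats {
    long long sum = 0;
    long long count = 0;
    int max = std::numeric_limits<int>::min();
    int min = std::numeric_limits<int>::max();
};

// Returns true for the separators this sketch treats as token boundaries.
inline bool isSeparator(char c) {
    return c == ' ' || c == '\t' || c == '\r';
}

// Folds every separator-delimited integer token in `line` into `stats`.
// Tokens that are not entirely a valid int ("abc", "7x", out-of-range values)
// are skipped. Note that std::from_chars accepts a leading '-' but not '+'.
void accumulateLine(std::string_view line, RunningStats& stats) {
    const char* p = line.data();
    const char* end = p + line.size();
    while (p != end) {
        while (p != end && isSeparator(*p)) ++p;            // skip separators
        const char* tokEnd = p;
        while (tokEnd != end && !isSeparator(*tokEnd)) ++tokEnd;
        if (p == tokEnd) break;                             // only separators were left

        int value = 0;
        auto [ptr, ec] = std::from_chars(p, tokEnd, value);
        if (ec == std::errc{} && ptr == tokEnd) {           // whole token parsed
            stats.sum += value;
            ++stats.count;
            if (value > stats.max) stats.max = value;
            if (value < stats.min) stats.min = value;
        }
        p = tokEnd;
    }
}
```

A worker would call accumulateLine on each string produced by std::getline, which keeps per-file memory proportional to the longest line rather than to the file size.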

Hints

  • Consider having each thread process its file and store statistics locally before merging results.
  • Take a lock or use atomic operations only while updating shared aggregate data; holding a lock during file reading or parsing serializes the threads and becomes a bottleneck.
  • Use standard algorithms from <algorithm> for computing min, max, and sorting results.
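The hints above can be sketched as follows: each worker summarizes its own input with standard algorithms and writes the result into a slot that only it touches, so no mutex is needed, and the combined totals are merged on the calling thread after all workers have joined. To stay self-contained, the sketch uses in-memory vectors of integers as stand-ins for files, which is an assumption made for illustration; a real solution would stream each file instead. Summary, summarize, merge, and summarizeAll are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical per-input summary, shaped like the starter code's Statistics.
struct Summary {
    long long sum = 0;
    long long count = 0;
    int max = std::numeric_limits<int>::min();
    int min = std::numeric_limits<int>::max();
};

// Summarizes one batch with <numeric> and <algorithm>.
// An empty batch keeps the default (identity) values.
Summary summarize(const std::vector<int>& values) {
    Summary s;
    if (values.empty()) return s;
    s.sum = std::accumulate(values.begin(), values.end(), 0LL);  // accumulate as long long
    s.count = static_cast<long long>(values.size());
    auto [lo, hi] = std::minmax_element(values.begin(), values.end());
    s.min = *lo;
    s.max = *hi;
    return s;
}

// Combines two summaries; the defaults act as the identity element.
Summary merge(const Summary& a, const Summary& b) {
    Summary out;
    out.sum = a.sum + b.sum;
    out.count = a.count + b.count;
    out.max = std::max(a.max, b.max);
    out.min = std::min(a.min, b.min);
    return out;
}

// Starts one worker per batch. Each worker writes only to results[i],
// so the workers share no mutable state and need no lock.
std::vector<Summary> summarizeAll(const std::vector<std::vector<int>>& batches) {
    std::vector<Summary> results(batches.size());
    std::vector<std::thread> workers;
    workers.reserve(batches.size());
    for (std::size_t i = 0; i < batches.size(); ++i) {
        workers.emplace_back([&batches, &results, i] {
            results[i] = summarize(batches[i]);
        });
    }
    for (auto& w : workers) w.join();
    return results;
}
```

After summarizeAll returns, folding the results with merge on a single thread yields the combined aggregate without any synchronization.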

Starter code

#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <thread>
#include <mutex>
#include <map>
#include <numeric>
#include <limits>

struct Statistics {
    long long sum = 0;
    long long count = 0; // long long: very large inputs can exceed the range of int
    int max = std::numeric_limits<int>::min();
    int min = std::numeric_limits<int>::max();
};

std::mutex mtx;
std::map<std::string, Statistics> fileStats;
Statistics totalStats;

void processFile(const std::string& filename) {
    std::ifstream file(filename);
    if (!file.is_open()) {
        std::cerr << "Failed to open " << filename << std::endl;
        return;
    }
    Statistics localStats;
    std::string line;
    while (std::getline(file, line)) {
        try {
            int number = std::stoi(line);
            localStats.sum += number;
            localStats.count++;
            if (number > localStats.max) localStats.max = number;
            if (number < localStats.min) localStats.min = number;
        } catch (...) {
            continue; // std::stoi threw: the line has no leading integer or the value does not fit in int
        }
    }
    std::lock_guard<std::mutex> lock(mtx);
    fileStats[filename] = localStats;
    totalStats.sum += localStats.sum;
    totalStats.count += localStats.count;
    if (localStats.max > totalStats.max) totalStats.max = localStats.max;
    if (localStats.min < totalStats.min) totalStats.min = localStats.min;
}

int main(int argc, char* argv[]) {
    if (argc < 2) {
        std::cerr << "Usage: " << argv[0] << " <file1> [file2 ...]" << std::endl;
        return 1;
    }
    std::vector<std::thread> threads;
    for (int i = 1; i < argc; ++i) {
        threads.emplace_back(processFile, argv[i]);
    }
    for (auto& t : threads) {
        t.join();
    }

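    // Output summary report. std::map iterates in key order, so the entries
    // are already sorted by file name. Structured bindings require C++17.
    std::cout << "File Stats (sorted by file name):" << std::endl;
    for (const auto& [filename, stats] : fileStats) {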
        double avg = stats.count ? static_cast<double>(stats.sum) / stats.count : 0;
        std::cout << filename << ": sum=" << stats.sum << ", avg=" << avg
                  << ", max=" << stats.max << ", min=" << stats.min << std::endl;
    }
    double totalAvg = totalStats.count ? static_cast<double>(totalStats.sum) / totalStats.count : 0;
    std::cout << "Combined: sum=" << totalStats.sum << ", avg=" << totalAvg
              << ", max=" << totalStats.max << ", min=" << totalStats.min << std::endl;
    return 0;
}
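
The starter code launches one thread per file, which can oversubscribe the machine when the file list is long. A possible refinement, sketched here as an assumption rather than as part of the challenge's reference solution, is a fixed number of workers that claim file indices from a shared atomic counter. runBounded is a hypothetical helper; the task callback stands in for a call such as processFile on the i-th path, and it must not throw, because an exception escaping a std::thread terminates the program.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs task(i) once for every i in [0, itemCount) on at most maxWorkers threads.
// Passing maxWorkers == 0 falls back to the hardware concurrency hint.
void runBounded(std::size_t itemCount, unsigned maxWorkers,
                const std::function<void(std::size_t)>& task) {
    if (itemCount == 0) return;
    if (maxWorkers == 0) {
        unsigned hw = std::thread::hardware_concurrency();  // may be 0 if unknown
        maxWorkers = hw ? hw : 2;
    }
    const std::size_t workerCount = std::min<std::size_t>(maxWorkers, itemCount);

    std::atomic<std::size_t> next{0};  // next unclaimed index
    std::vector<std::thread> workers;
    workers.reserve(workerCount);
    for (std::size_t w = 0; w < workerCount; ++w) {
        workers.emplace_back([&next, &task, itemCount] {
            for (;;) {
                const std::size_t i = next.fetch_add(1);  // claim one index atomically
                if (i >= itemCount) break;
                task(i);
            }
        });
    }
    for (auto& t : workers) t.join();
}
```

Because each index is claimed exactly once, the only shared write during the run is the atomic counter; per-file results can still be collected under the starter code's mutex or in per-index slots.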

Expected output

File Stats (sorted by file name):
file1.txt: sum=123456, avg=123.45, max=999, min=1
file2.txt: sum=234567, avg=234.56, max=999, min=2
...
Combined: sum=358023, avg=179.01, max=999, min=1
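
The sample output shows averages with two decimal places, whereas the starter code prints them with the stream's default formatting. A minimal sketch of one way to produce that layout, assuming two decimals are what is wanted, uses std::fixed and std::setprecision from <iomanip>. formatReportLine is a hypothetical helper, not part of the starter code.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Builds one report line in the layout of the sample output.
// The average is printed with exactly two decimals; an empty input yields avg=0.00.
std::string formatReportLine(const std::string& label, long long sum,
                             long long count, int max, int min) {
    const double avg = count ? static_cast<double>(sum) / static_cast<double>(count) : 0.0;
    std::ostringstream out;
    out << label << ": sum=" << sum
        << ", avg=" << std::fixed << std::setprecision(2) << avg
        << ", max=" << max << ", min=" << min;  // integers are unaffected by std::fixed
    return out.str();
}
```

For example, formatReportLine("a.txt", 10, 4, 5, 1) returns "a.txt: sum=10, avg=2.50, max=5, min=1".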

Core concepts

Multithreading, File I/O, Data Aggregation, Synchronization, Sorting
