Topic: Optimization of Chunking Strategies in High-Performance Computing
Date: October 26, 2023
Rechunking is an "embarrassingly parallel" task if configured correctly.
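A minimal sketch of why the task parallelizes, assuming uncompressed data held in memory: each output chunk depends only on its own byte range of the input, so workers share no state. The names `plan_ranges` and `rechunk_parallel` are hypothetical, and Python is used only for brevity. In CPython, threads give a real speedup only where the per-chunk work releases the GIL (file I/O, zlib), so a process pool may be the better fit for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_ranges(total_size, target_chunk):
    """Return (start, end) byte ranges covering [0, total_size) in target_chunk steps."""
    return [(start, min(start + target_chunk, total_size))
            for start in range(0, total_size, target_chunk)]


def rechunk_parallel(data, target_chunk, workers=4):
    """Cut `data` into target-sized chunks, one independent task per output chunk."""
    ranges = plan_ranges(len(data), target_chunk)
    view = memoryview(data)  # slicing a memoryview avoids intermediate copies

    def build(rng):
        start, end = rng
        # Each task reads only its own range, so no locking is needed.
        return bytes(view[start:end])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so chunks come back in file order.
        return list(pool.map(build, ranges))
```

Joining the returned chunks in order reproduces the input, which makes a convenient correctness check for any chunk plan.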
Compression is the enemy of rechunking speed because data must be decompressed to move it and recompressed to store it.
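The round trip can be sketched as follows, assuming each chunk is an independent zlib stream (the actual PAK compression scheme is not specified here). The function name `rechunk_compressed` and the default `level=1` are illustrative choices, not part of any PAK tool.

```python
import zlib


def rechunk_compressed(chunks, target_size, level=1):
    """Re-cut zlib-compressed chunks to target_size bytes of uncompressed payload each."""
    # Step 1: every input chunk has to be inflated before its bytes can move.
    raw = b"".join(zlib.decompress(c) for c in chunks)
    # Step 2: every output chunk has to be deflated again before it is stored.
    return [zlib.compress(raw[i:i + target_size], level)
            for i in range(0, len(raw), target_size)]
```

Both steps touch every byte of the payload, which is why a low compression level, or no compression at all, is worth considering when rechunking speed matters more than output size.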
Split work into:
Use a thread pool (e.g., in Rust or Go). Avoid the Python GIL unless you use multiprocessing. Chunk Plan: Ensure the input chunks match the
Choosing the right target chunk size is critical for performance.
For large PAK files (> 4 GB), mmap() (or CreateFileMapping on Windows) avoids loading the whole file into RAM.
Use MAP_PRIVATE to read chunks safely: the mapping is copy-on-write, so stray writes never reach the source file.
Better: memory-map both source and output, using a shared mapping (MAP_SHARED) for the output so that writes reach the file.
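A minimal sketch of mapping both files with Python's `mmap` module, shown as a plain chunk-by-chunk copy rather than a full rechunker. The function name `copy_via_mmap` and the 1 MiB default chunk size are assumptions for illustration. `ACCESS_COPY` is Python's portable copy-on-write mode (MAP_PRIVATE on POSIX), and `ACCESS_WRITE` gives a shared, write-through mapping for the output.

```python
import mmap
import os


def copy_via_mmap(src_path, dst_path, chunk_size=1 << 20):
    """Copy src to dst chunk by chunk through two memory mappings."""
    size = os.path.getsize(src_path)
    if size == 0:
        raise ValueError("cannot memory-map an empty file")
    with open(src_path, "rb") as src, open(dst_path, "w+b") as dst:
        # The output must already have its final size before it is mapped.
        dst.truncate(size)
        # Source: copy-on-write mapping, so the file on disk is never modified.
        # Output: write-through mapping, so stores land in the file.
        with mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_COPY) as src_map, \
                mmap.mmap(dst.fileno(), 0, access=mmap.ACCESS_WRITE) as dst_map:
            for start in range(0, size, chunk_size):
                end = min(start + chunk_size, size)
                dst_map[start:end] = src_map[start:end]
            dst_map.flush()  # push dirty pages to disk before unmapping
    return size
```

Only the pages actually touched are brought into memory, so peak RAM use stays near the chunk size rather than the file size. A real rechunker would write each range at its planned output offset instead of the same offset.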