Performance Statistics
A rough idea of the file sizes and decompression times to expect when using this library.
Wikipedia Data
As of Oct 2020:
Compressed bz2 sizes:
Single multistream file size: 17.5 GB
Smaller multistream files size: total, max, min
Compressed lz4 sizes:
Smaller multistream files size: total, max, min
Do you need the latest Wikipedia dump? Here are some basic stats on Wikipedia's size and rate of growth.
Compression speedups
Working with just the first file (232 MB in bz2), I got the following timings on my i7-7820X CPU @ 3.60GHz (reading from SSD):
bzcat entire file to /dev/null: 26s
decompress entire bz2 file (as text) in Python: 31s
decompress entire bz2 and recompress w/ lz4 in Python: 49s
decompress entire lz4 file (as text) in Python: 4s
read entire raw XML file from disk in Python: 2s
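Timings like the ones above can be reproduced with a small script. The following is only a minimal sketch, not this library's own benchmark code: the helper names `time_read` and `recompress_bz2_to_lz4` are hypothetical, and the lz4 step assumes the third-party `lz4` package (python-lz4) is installed.

```python
import bz2
import shutil
import time


def time_read(opener, path, chunk_chars=1 << 20):
    """Read `path` to the end as UTF-8 text using `opener`.

    `opener` is any open()-compatible callable, e.g. the builtin `open`,
    `bz2.open`, or `lz4.frame.open`. Returns (elapsed_seconds, characters_read).
    """
    start = time.perf_counter()
    total = 0
    with opener(path, "rt", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_chars)
            if not chunk:
                break
            total += len(chunk)
    return time.perf_counter() - start, total


def recompress_bz2_to_lz4(src, dst, chunk_bytes=1 << 20):
    """Stream-decompress a bz2 file and rewrite it in lz4 frame format.

    Assumes the third-party `lz4` package is available; it is imported
    lazily so that `time_read` works without it. Returns elapsed seconds.
    """
    import lz4.frame  # third-party: pip install lz4

    start = time.perf_counter()
    with bz2.open(src, "rb") as fin, lz4.frame.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_bytes)
    return time.perf_counter() - start
```

For example, `time_read(bz2.open, "dump.xml.bz2")` times the bz2 case and `time_read(open, "dump.xml")` the raw XML case, where the file names are placeholders for your own files. Expect different absolute numbers on different hardware; the ratio between bz2 and lz4 read times is the more portable figure.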