Downloading Wikipedia
Just some consolidated notes to save you time figuring it out yourself.
XML/SQL Wikipedia content dumps are produced twice a month (1st/20th)
The second dump of each month is considered ‘partial’ and only contains information about current revisions
In practice it seems complete to me
The main servers only keep the last 7 dumps
Wikipedia offers the page contents either as a single file or as multiple smaller files
This library defaults to using the multiple smaller files, as this allows parsing the early files while the rest are still downloading
Each smaller file's name ends in a suffix that indicates the range of page IDs stored in that file
e.g. enwiki-20201001-pages-articles-multistream1.xml-p1p41242.bz2 holds all pages with IDs from 1 to 41,242
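The page ID range can be read straight out of the filename suffix. The following is a minimal sketch, not this library's own code: the function name `page_id_range` is hypothetical, and it assumes every split file ends in `-p<first>p<last>.bz2` as in the example above.

```python
import re
from typing import Tuple

# Matches the trailing "-p<first>p<last>.bz2" suffix of a split dump file.
# Assumption: all split files follow the pattern shown in the example above.
_PAGE_RANGE = re.compile(r"-p(\d+)p(\d+)\.bz2$")


def page_id_range(filename: str) -> Tuple[int, int]:
    """Return the (first, last) page IDs encoded in a split dump filename."""
    match = _PAGE_RANGE.search(filename)
    if match is None:
        raise ValueError(f"no page ID range in filename: {filename!r}")
    return int(match.group(1)), int(match.group(2))


first, last = page_id_range(
    "enwiki-20201001-pages-articles-multistream1.xml-p1p41242.bz2"
)
print(first, last)
```

Since page IDs are not contiguous, the range gives only the bounds of the IDs in the file, not the number of pages it holds.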
Wikipedia has indexed where pages are in the compressed data
The bzip2 file format allows the data to be compressed as separate parts (called streams) that are all stored back to back in one file
Wikipedia publishes the address of the “stream” containing each page inside the compressed data
This means that you don’t have to decompress an entire file to get a single page, just the stream that contains it
Each stream contains 100 pages
This is what is meant by multistream in a filename
The indices are labeled as multistream-index in their filename
Each index line has the form byte_offset:page_id:page_name (e.g. 617:10:AccessibleComputing)
Page ID != page count since pages can be deleted
Splitting the data into streams costs roughly 10% in bz2 compression size
There are several hundred streams in each file
BZ2 files expand to approximately 3.5x their compressed size when decompressed
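The index format and the single-stream lookup described above can be sketched with the standard library alone. This is an illustrative sketch rather than this library's implementation: `IndexEntry`, `parse_index_line`, and `read_stream` are hypothetical names, and the demo uses a tiny in-memory stand-in instead of a real dump file.

```python
import bz2
import io
from typing import BinaryIO, NamedTuple


class IndexEntry(NamedTuple):
    byte_offset: int  # where the page's stream starts in the compressed file
    page_id: int
    page_name: str


def parse_index_line(line: str) -> IndexEntry:
    """Parse one 'byte_offset:page_id:page_name' index line."""
    # Page names may themselves contain ':', so split at most twice.
    offset, page_id, name = line.rstrip("\n").split(":", 2)
    return IndexEntry(int(offset), int(page_id), name)


def read_stream(dump: BinaryIO, byte_offset: int, chunk_size: int = 64 * 1024) -> bytes:
    """Decompress only the bz2 stream that starts at byte_offset."""
    dump.seek(byte_offset)
    decompressor = bz2.BZ2Decompressor()
    parts = []
    # The decompressor sets .eof at the end of the first stream it sees,
    # so bytes belonging to later streams are never decompressed.
    while not decompressor.eof:
        chunk = dump.read(chunk_size)
        if not chunk:
            raise EOFError("file ended before the bz2 stream was complete")
        parts.append(decompressor.decompress(chunk))
    return b"".join(parts)


# Demo on a stand-in "multistream" file: two bz2 streams stored back to back.
streams = [bz2.compress(b"<page>first</page>"), bz2.compress(b"<page>second</page>")]
fake_dump = io.BytesIO(b"".join(streams))

entry = parse_index_line("617:10:AccessibleComputing")
print(entry)
print(read_stream(fake_dump, len(streams[0])))
```

With a real dump, the offset would come from the matching multistream-index file, and the decompressed bytes would then need to be searched for the wanted page, since each stream holds many pages.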
Details on the lib
We don’t actually pull from the ‘latest’ directory
The RSS XML files imply that the files in latest are copies of the most recent dump date
Files in ‘latest/’ don’t have their date in the name (which makes it hard to tell which dump a file came from 6 months later)
The HTML page listing the latest files has a different layout, so it requires custom page parsing, and it includes occasional files from earlier dumps (older copies of the same data)
While these things are easy enough to overcome, there seems to be no gain in adding extra code to do so