Downloading Wikipedia

Just some consolidated notes to save you time figuring it out yourself.

XML/SQL Wikipedia content dumps are produced twice a month (1st/20th)

The second dump of each month (the 20th) is considered ‘partial’ and only contains information about current revisions

Seems complete to me

The main servers only keep the last 7 dumps
Wikipedia offers the page contents either as a single file or as multiple smaller files

This library defaults to using the multiple smaller files, as this allows parsing the early files while the rest are still downloading

The name of each smaller file ends in a suffix that indicates the range of page IDs stored in that file

e.g. enwiki-20201001-pages-articles-multistream1.xml-p1p41242.bz2 holds all pages with IDs from 1 to 41,242
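As a rough illustration of the naming scheme above, the page ID range can be pulled out of a filename with a regular expression. This is only a sketch: `page_id_range` is a hypothetical helper name, not necessarily what this library uses, and it assumes the suffix always has the form `-p<first>p<last>.bz2` as in the example filename.

```python
import re
from typing import Tuple

# Matches the trailing "-p<first>p<last>.bz2" suffix of a split dump file.
# Assumption: every split file follows the pattern seen in the example name.
_PAGE_RANGE = re.compile(r"-p(\d+)p(\d+)\.bz2$")


def page_id_range(filename: str) -> Tuple[int, int]:
    """Return (first_page_id, last_page_id) encoded in a split dump filename."""
    match = _PAGE_RANGE.search(filename)
    if match is None:
        raise ValueError(f"no page ID range found in {filename!r}")
    return int(match.group(1)), int(match.group(2))
```

Called on the example filename above, this returns the pair `(1, 41242)`. A name without the suffix (such as the single-file dump) raises `ValueError`.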

Wikipedia has indexed where pages are in the compressed data

The bzip2 file format allows the data to be compressed as separate parts (called streams) that are then concatenated into a single file

Wikipedia publishes the address of the “stream” containing each page inside the compressed data

This means that you don’t have to decompress an entire file to get a single page, just the stream that contains it

Each stream contains 100 pages

This is what is meant by multistream in a filename
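To make the idea concrete, here is a minimal sketch of decompressing one stream without touching the rest of the file, using only Python's standard `bz2` module. `read_stream` and its parameters are hypothetical names chosen for this example; the byte offset is assumed to come from the published index.

```python
import bz2
from typing import BinaryIO


def read_stream(dump: BinaryIO, byte_offset: int, chunk_size: int = 64 * 1024) -> bytes:
    """Decompress the single bz2 stream that starts at byte_offset."""
    dump.seek(byte_offset)
    decompressor = bz2.BZ2Decompressor()
    parts = []
    # A BZ2Decompressor handles exactly one stream and sets .eof when that
    # stream ends, so later streams in the file are never decompressed.
    while not decompressor.eof:
        chunk = dump.read(chunk_size)
        if not chunk:
            raise EOFError("file ended before the bz2 stream was complete")
        parts.append(decompressor.decompress(chunk))
    return b"".join(parts)
```

The result is the raw XML for the pages in that one stream. It is a fragment rather than a complete XML document, so it would likely need a wrapping root element before being handed to an XML parser.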

The indices are labeled as multistream-index in their filename

Indices are composed of byte_offset:page_id:page_name (e.g. 617:10:AccessibleComputing)
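A small sketch of parsing one index line in that format. `IndexEntry` and `parse_index_line` are hypothetical names for illustration. The one subtlety is that page names can themselves contain colons (namespaced titles, for instance), so the line is split on the first two colons only.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    byte_offset: int  # where the containing stream starts in the .bz2 file
    page_id: int
    page_name: str


def parse_index_line(line: str) -> IndexEntry:
    """Parse one 'byte_offset:page_id:page_name' line from a multistream index."""
    # maxsplit=2 keeps any colons inside the page name intact.
    offset, page_id, name = line.rstrip("\n").split(":", 2)
    return IndexEntry(int(offset), int(page_id), name)
```

Because every page in a stream shares the same byte offset, grouping entries by `byte_offset` gives the set of pages that a single stream holds.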

Page ID != page count since pages can be deleted

Splitting the data into streams makes the bz2 files roughly 10% larger than a single-stream file would be

There are several hundred streams in each file

BZ2 files expand to approximately 3.5x their compressed size when decompressed

Details on the lib

We don’t actually pull from the ‘latest’ directory

The RSS XML files imply that the files in latest are copies of the most recent dump date

Files in ‘latest/’ don’t have their date in the name (which makes it hard to tell which dump a file came from 6 months later)

The HTML page listing the latest files:

* has a different layout, so it requires custom page parsing
* includes occasional files from earlier dumps (older copies of the same data)

While these things are easy enough to overcome, there seems to be no gain in adding extra code to do so