=====================
Downloading Wikipedia
=====================

Just some consolidated notes to save you time figuring it out yourself.

* XML/SQL Wikipedia content dumps are produced twice a month (1st/20th)

 * The second dump of each month is considered 'partial' and only contains information about current revisions

   * Seems complete to me

* The main servers only keep the last 7 dumps

* Wikipedia offers the page contents either as a single file or as multiple smaller files

 * This library defaults to using the multiple smaller files, since early files can be parsed while the rest are still downloading
 * The name of each smaller file ends in a suffix that gives the range of page IDs stored in that file

   * e.g. enwiki-20201001-pages-articles-multistream1.xml-p1p41242.bz2 holds all pages with IDs from 1 to 41,242 (the p1p41242 suffix)

* Wikipedia has indexed where pages are in the compressed data

 * The bzip2 format allows data to be compressed as separate parts (called streams) that are then concatenated into a single file
 * Wikipedia publishes the address of the stream containing each page inside the compressed data

   * This means that you don't have to decompress an entire file to get a single page, just the stream that contains it
   * Each stream contains 100 pages
   * This is what is meant by multistream in a filename
   * The indices are labeled as multistream-index in their filename

     * Each index line has the form byte_offset:page_id:page_name (e.g. 617:10:AccessibleComputing), where byte_offset is the position in the compressed file at which the page's stream starts
     * Page ID != page count since pages can be deleted

   * Splitting the data into streams makes the bz2 file roughly 10% larger than a single-stream file

 * There are several hundred streams in each file
 * BZ2 files expand to approximately 3.5x their compressed size when decompressed
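
The index format and stream layout described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not this library's actual code: the helper names ``parse_index_line`` and ``read_stream`` are hypothetical, and the demo builds a tiny two-stream file in memory instead of touching a real dump.

.. code-block:: python

    import bz2
    import io


    def parse_index_line(line):
        """Split 'byte_offset:page_id:page_name' into (int, int, str)."""
        # Page names may contain colons themselves, so split at most twice.
        offset, page_id, name = line.rstrip("\n").split(":", 2)
        return int(offset), int(page_id), name


    def read_stream(fileobj, offset):
        """Decompress only the bz2 stream that starts at byte `offset`."""
        fileobj.seek(offset)
        decompressor = bz2.BZ2Decompressor()
        chunks = []
        # A BZ2Decompressor stops at the end of its own stream (eof becomes
        # True), so the streams that follow are never decompressed.
        while not decompressor.eof:
            block = fileobj.read(64 * 1024)
            if not block:
                break  # file ended before the stream did
            chunks.append(decompressor.decompress(block))
        return b"".join(chunks)


    # Demo: two independently compressed streams concatenated into one "file".
    first = bz2.compress(b"<page>first</page>")
    second = bz2.compress(b"<page>second</page>")
    dump = io.BytesIO(first + second)

    # A made-up index line pointing at the second stream.
    offset, page_id, name = parse_index_line(f"{len(first)}:10:AccessibleComputing")
    xml_fragment = read_stream(dump, offset)  # only the second stream is decompressed

In a real dump the decompressed stream is an XML fragment holding a batch of pages, so the requested page still has to be located inside it by its page ID or title.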


Details on the lib
------------------
* We don't actually pull from the 'latest' directory

 * The RSS XML files imply that the files in latest are copies of the most recent dump date
 * Files in 'latest/' don't have their date in the name (which means trouble figuring things out 6 months later)
 * The HTML page listing the latest files:

   * has a different layout, so it requires custom page parsing
   * includes occasional files from earlier dumps (older copies of the same data)

 * While these things are easy enough to overcome, there seems to be no gain in adding extra code to do so
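
One practical benefit of the dated filenames is that the dump date and the page-ID range can both be recovered from the name alone. The sketch below shows one way to do that. The regular expression is an assumption generalized from the single example filename in these notes, and ``parse_dump_name`` is a hypothetical helper, not part of this library.

.. code-block:: python

    import re

    # Assumed pattern, based on names such as
    # enwiki-20201001-pages-articles-multistream1.xml-p1p41242.bz2
    DUMP_NAME = re.compile(
        r"^(?P<wiki>[a-z_]+)-(?P<date>\d{8})-pages-articles-multistream"
        r"(?P<part>\d+)\.xml-p(?P<first>\d+)p(?P<last>\d+)\.bz2$"
    )


    def parse_dump_name(filename):
        """Return the fields encoded in a dated dump filename, or None."""
        match = DUMP_NAME.match(filename)
        if match is None:
            return None  # e.g. an undated name from 'latest/'
        return {
            "wiki": match["wiki"],
            "date": match["date"],
            "part": int(match["part"]),
            "first_page_id": int(match["first"]),
            "last_page_id": int(match["last"]),
        }


    info = parse_dump_name(
        "enwiki-20201001-pages-articles-multistream1.xml-p1p41242.bz2"
    )

An undated name (illustrated here as ``enwiki-latest-pages-articles-multistream.xml.bz2``) does not match the pattern and yields ``None``, which is the ambiguity the notes above are trying to avoid.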