===================== Downloading Wikipedia ===================== Just some consolidated notes to save you time figuring it out yourself. * XML/SQL Wikipedia content dumps are produced twice a month (1st/20th) * Second dump is considered 'partial' and only contains information about current revisions * Seems complete to me * The main servers only keep the last 7 dumps * Wikipedia offers the page contents either as a single file or as multiple smaller files * This library defaults to use multiple smaller files as it allows parsing early files while downloading the rest * In the smaller files, their names each end in a suffix that indicates the range of page IDs stored in a given file * e.g. enwiki-20201001-pages-articles-multistream1.xml-p1p41242.bz2 holds all pages with IDs from 1 - 41,242 * Wikipedia has indexed where pages are in the compressed data * The Bzip file design allows for the data to be broken into parts (called streams) yet still contain all these compressed streams in one file * Wikipedia publishes the addresses the "stream" containing each page inside the compressed data * This means that you don't have to decompress an entire file to get a single page, just the stream that contains it * Each stream contains 100 pages * This is what is meant by multistream in a filename * The indices are labeled as multistream-index in their filename * Indices are composed of byte_offset:page_id:page_name (e.g. 617:10:AccessibleComputing) * Page ID != page count since pages can be deleted * The loss in bz2 compression size caused by this is roughly 10% * There are several hundred streams in each file * BZ2 files expand to approximately 3.5x Details on the lib ------------------ * We don't actually pull from the 'latest' directory * The RSS XML files imply that the files in latest are copies of the most recent dump date * Files in 'latest/' don't have their date in the name (which means trouble figuring things out 6 months later) * The HTML page listing the latest files: * has a different layout so require custom page parsing * includes occasional files from earlier dumps (older copies of the same data) * While these things are easy enough to overcome, there seems to be no gain in adding extra code to do so