Corpus Wrangler¶
This is an active project (Fall 2020) but not yet ready for tire kicking. If this project sounds useful to you, reach out and let me know what you’d like to see.
A simple utility library for downloading and handling large corpora (like Wikipedia).
General Motivation¶
There is a lifecycle to processing large text corpora:
Download a sample (if available), figure out how you want to parse it (this is a bunch of experimenting)
Download the entire corpus and process
Find an issue that wasn’t present in the sample, figure out fix, reprocess
Six months later find you need to change something and you want to pull the latest data anyways
It is tempting to treat working with a corpus as a one time event that only warrants a throw away script. The list above reminds us that it just isn’t true. This library gives some simple tooling to help speed this work along and add conveniences you might not bother with if you think of corpus wrangling as a one time event. Beyond some simple parallelization, it also aims to add some conveniences:
Large data == a slow download
Once you know how you want to parse things, there is no reason not to parallelize downloads with parsing
Large data should mean parallel parsing
A lots of data can be easily parsed on a single machine, but that doesn’t mean it should be parsed serially too!
Every hour lost to simple work is one less NLP experiment you get to run
Slow decompression algorithms
Hosts (like Wikimedia) typically chose compression algorithms for maximum compression size
Fast algorithms can achieve 70% of the same compression but achieve near RAM access speeds in decompression
Never presume you will only need to visit the raw data once, you’ll make your future self waste time waiting for decompression
This uses lz4 as it is super fast (and can allow multicore decompression on the command line)
Knowing whether you are working on the latest data
Some text collections are periodically updated, might as well let a script check if I’m using the latest
Code as documentation
Figuring out what to download
It isn’t hard but I don’t want to re-learn a host’s file naming system when I update in 6 months
Reproducibility means you should store the code used to make current and past datasets
Using a convenience library can keep the code simple to read and clean
Checksums
We should check this, but rarely do, might as well make it trivial to do so
Specific motivation¶
The original use of a library is helpful to understand its design. In two words: word vectors. If this isn’t your need, I’ve tried to make the library generic enough that it is useful for you as well.
Yes, pre-made vectors do exist, but when they lack key vocabulary you need, making your own is relatively simple with GloVe and word2vec. Quite often folks like to start with Wikipedia as source for their word vectors. As anyone who has done this from scratch can attest, that isn’t as simple as one would like. Parse the XML files naively and you need to load the entire file into memory (a waste). Once you have your page contents, they are still filled with Wikitext markup as well as unresolved Wiki Template tags. Even after you find something like mwparserfromhell to help with that headache, you are now faced with messes like fractional sentences, because the second half of the sentence was a math equation.
This isn’t to say that folks haven’t posted their own solutions. I’m adding this one specifically because what I found elsewhere wasn’t as customizable or transparent as I’d like. I want to run with the defaults as a first pass to get a rough cut to start with for my NLP pipeline, and while a later, when the algo is taking forever, I can come back and experiment with small changes. For example, maybe I want to normalize all numbers in the corpus (either always numerals, always written numbers, or a token like <NUMBER>). Making the data for experiments should fast and convenient.
Free software: MIT license
Documentation: https://corpus-wrangler.readthedocs.io.
Features¶
TODO
Credits¶
This package was started with Cookiecutter and the audreyr/cookiecutter-pypackage project template.