Making the Wikipedia Corpus

Downloading Wikipedia

Wikipedia periodically publishes its entire content as a compressed XML ‘dump’ file that anyone can download. Further information is available here:

https://meta.wikimedia.org/wiki/Data_dumps

I downloaded this version of Wikipedia at this link:

ftp://ftpmirror.your.org/pub/wikimedia/dumps/enwiki/20170701/enwiki-20170701-pages-articles.xml.bz2

It includes a bit more than 9 million articles (9,019,957, according to my notes).

I also downloaded this partial version of Wikipedia for small-scale testing:

ftp://ftpmirror.your.org/pub/wikimedia/dumps/enwiki/20170701/enwiki-20170701-pages-articles1.xml-p10p30302.bz2

This smaller version includes only 15,151 articles.

The Goal

The goal is to create a Gensim corpus and dictionary based on the Wikipedia articles so that we can use them to train a Latent Dirichlet Allocation (LDA) model. We can then use that model to classify the topics in the Peanuts comic strips.

Parsing Wikipedia

We want to parse the Wikipedia XML file so that we retain only the article pages and discard the template and redirect pages. Fortunately, Jeff Heaton provided a model of how to do this. For our purposes, I’ve modified and refactored his example into several functions.
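
As a rough, minimal sketch of that kind of streaming parse (the function names here are illustrative, not Heaton’s or the full code’s; it assumes the standard dump schema, where redirects carry a redirect tag and ordinary articles have namespace 0):

import bz2
import xml.etree.ElementTree as ET

def strip_ns(tag):
    # drop the XML namespace prefix, e.g. '{http://...}page' -> 'page'
    return tag.rsplit('}', 1)[-1]

def iterate_articles(dump_path):
    # Yield (title, text) for ordinary article pages, skipping redirects
    # and non-article namespaces (templates, talk pages, and so on).
    with bz2.open(dump_path, 'rb') as dump_file:
        title = text = namespace = None
        is_redirect = False
        for _, elem in ET.iterparse(dump_file, events=('end',)):
            tag = strip_ns(elem.tag)
            if tag == 'title':
                title = elem.text
            elif tag == 'ns':
                namespace = elem.text
            elif tag == 'redirect':
                is_redirect = True
            elif tag == 'text':
                text = elem.text
            elif tag == 'page':
                if namespace == '0' and not is_redirect and text:
                    yield title, text
                title = text = namespace = None      # reset per-page state
                is_redirect = False
                elem.clear()                         # free memory as we stream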

Processing Wikipedia Articles

Creating a corpus and dictionary directly from the raw Wikipedia pages would be rather messy. Instead we want to clean the pages so that we retain only the most useful information. Discarding parts of the pages has the added benefit of speeding up later processing steps, since the remaining documents are smaller.

Specifically, we’re removing HTML tags, URLs, newline indicators, punctuation, stop words, and numbers. Stop words are very common words like the, and, and is that generally provide little useful information for natural language analyses. Additionally, we’re removing the references from the Wikipedia pages. While parts of the references, like book titles, contain useful information about the article’s topic(s), much of the reference information (e.g., publication cities, web links, and usually the author names) isn’t so useful. Instead of taking the time to separate the occasional useful bit from the rest, it’s easier to discard all of it. We still have plenty of useful information remaining in the article text itself.

Furthermore, we’re converting all the words to lower case and stemming them. Both steps provide more uniformity, so that Bridge and bridge are counted as the same word and talked and talking are both counted as talk.

Here is an excerpt to show how we’re processing the text:

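# Assumed imports/aliases (not shown in this excerpt; the exact names are my guess):
#   from nltk.tokenize.toktok import ToktokTokenizer as nltk_tok_tok
#   from nltk.corpus import stopwords as nltk_stopwords
#   from nltk.stem.lancaster import LancasterStemmer as nltk_lancaster
#   'html_2_text' presumably wraps an HTML-to-text converter (e.g., the html2text package)
import re
import string
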
a_string = a_string.split('=References=')[0]                # remove references and everything afterwards
a_string = html_2_text(a_string).lower()                    # remove HTML tags, convert to lowercase
a_string = re.sub(r'https?:\/\/.*?[\s]', '', a_string)      # remove URLs

# 'ToktokTokenizer' does divide by '|' and '\n', but retaining this
#   statement seems to improve its speed a little
a_string = a_string.replace('|', ' ').replace('\n', ' ')

tokenizer = nltk_tok_tok()                                  # tokenizes faster than 'word_tokenize'
tokens = tokenizer.tokenize(a_string)

stop_words = nltk_stopwords.words('english')
string_punctuation = list(string.punctuation)
remove_items_list = stop_words + string_punctuation
tokens = [w for w in tokens if w not in remove_items_list]

tokens = [w for w in tokens if '=' not in w]                        # remove remaining tags and the like
tokens = [w for w in tokens if not                                  # remove tokens that are all digits or punctuation
            all(x.isdigit() or x in string_punctuation for x in w)]
tokens = [w.strip(string.punctuation) for w in tokens]              # remove stray punctuation attached to words
tokens = [w for w in tokens if len(w) > 1]                          # remove single characters
tokens = [w for w in tokens if not any(x.isdigit() for x in w)]     # remove everything with a digit in it

stemmer = nltk_lancaster()                                          # fastest stemmer; results seem okay
stemmed = [stemmer.stem(w) for w in tokens]

As noted in the comments, the Lancaster stemmer was chosen because it was the fastest available stemmer (details here), and it provided adequate results. Likewise, the sequence of list comprehensions for the tokens variable was determined to be the fastest of several tested alternatives.

Storing Processed Documents in Database

We’re storing the processed Wikipedia documents in a SQLite database. The primary key of the documents is randomized to provide conveniently randomized access to the articles. This is handy if you want only a (randomized) sample of the articles, but there’s a more important reason, which appears when we use Gensim to train a Latent Dirichlet Allocation (LDA) model. Wikipedia is such a large corpus that we don’t have to make multiple passes through the entire corpus (i.e., process each document many times) to train our model. Instead we can make a single pass and update the model on small chunks of documents from the corpus. For this to work well (i.e., for the model to converge), the topics need to be approximately evenly distributed across the chunks (i.e., little or no topic drift). We don’t know beforehand which documents will be assigned to which topics, but we can randomize the order of the documents to minimize topic drift.
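
Here’s a minimal sketch of that storage scheme, assuming one table keyed by a randomly assigned integer (the table and column names are placeholders, not the ones in the full code):

import random
import sqlite3

def store_documents(db_path, documents):
    # Store (title, processed_text) pairs under randomized primary keys, so
    # reading them back with ORDER BY id yields a shuffled document order.
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE IF NOT EXISTS articles '
                 '(id INTEGER PRIMARY KEY, title TEXT, body TEXT)')
    for title, body in documents:
        while True:
            key = random.randrange(1, 2**62)        # retry in the rare case of a key collision
            try:
                conn.execute('INSERT INTO articles (id, title, body) VALUES (?, ?, ?)',
                             (key, title, body))
                break
            except sqlite3.IntegrityError:
                pass
    conn.commit()
    conn.close()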

More Speed: Parallelization and Cython

Python’s multiprocessing module is used to parallelize the Wikipedia parsing, article processing, and document storing steps. The Python code is also compiled with Cython for a modest additional speed gain; static typing for Cython isn’t implemented.
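
The parallelization might look roughly like the sketch below (the worker function is a placeholder for the parse/clean/store pipeline; the chunk size and worker count are illustrative):

from multiprocessing import Pool

def process_article(raw_page):
    # placeholder for the parse / clean / tokenize / stem pipeline shown above
    ...

def process_in_parallel(raw_pages, n_workers=4):
    # 'imap' streams results back as they finish instead of building one big list
    with Pool(processes=n_workers) as pool:
        for processed in pool.imap(process_article, raw_pages, chunksize=100):
            yield processed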

Partial Processing

All of Wikipedia is rather large and can take a while to process. Do we have to wait for all 9 million articles to be processed before we start subsequent analyses? No, we can use only a part of Wikipedia. But there’s a problem: the Wikipedia XML file is ordered so that the A articles come first, then the B articles, and so on. Wouldn’t processing, say, articles from only the first 10% of the alphabet be a problem? Well, maybe. The topics that we want to identify with LDA might not be evenly distributed across the alphabet.

Our solution is to introduce a sampling interval. Instead of processing every single article in order, we can set our sampling interval so that, say, every 10th article is processed. When the processing run is finished, we have 10% of the articles sampled evenly across the entire alphabetic sequence of articles. For our next processing run, we can set the sampling interval to, say, every 5th article. That will add another 10% of the articles to our database, giving us a total of 20%. This way, we can start working on subsequent analysis steps with valid samples.
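
A short sketch of that sampling logic (tracking already-processed articles by title is my own simplification of whatever bookkeeping the full code does):

def sample_articles(articles, interval, already_done):
    # Yield every 'interval'-th article that hasn't been processed yet.
    # For example, a first run with interval=10 covers 10% of the articles;
    # a second run with interval=5 revisits every 5th article but only yields
    # the ones missed by the first run, adding roughly another 10%.
    for index, (title, text) in enumerate(articles):
        if index % interval == 0 and title not in already_done:
            yield title, text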

Gensim Corpus and Dictionary

Once a processing run is complete, a Gensim corpus and dictionary are created from the existing database.
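
A minimal sketch of that step, assuming the processed documents come back from the database as lists of tokens (the file names and filtering thresholds are placeholders):

from gensim import corpora

def build_corpus_and_dictionary(tokenized_docs, dictionary_path, corpus_path):
    # map each token to an integer id
    dictionary = corpora.Dictionary(tokenized_docs)
    dictionary.filter_extremes(no_below=5, no_above=0.5)   # drop very rare and very common tokens
    dictionary.save(dictionary_path)
    # convert each document to a bag-of-words vector and serialize the corpus
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    corpora.MmCorpus.serialize(corpus_path, bow_corpus)
    return dictionary, bow_corpus

For the full Wikipedia-scale corpus, the documents would be streamed from the database rather than held in one list, but the Gensim calls are the same.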

Full Code

The full code is here.