Training the LDA Model
With more than 9 million documents, the entire Wikipedia corpus takes a long time to process (or a larger compute budget!). The currently posted topic-modeling results are based on about 5.5 million documents, a bit over 60% of Wikipedia (to be exact: 5578331 / 9019957 = 0.618).
On this corpus, I’ve trained five models. Here are the results for those models:
model_ID | num_topics | alpha | eta | chunksize | passes | coh_topn | lda_runtime | c_v_coherence
---|---|---|---|---|---|---|---|---
100 | 10 | asymmetric | | 100000 | 1 | 10 | 12499.4541299343 | 0.5247517679
101 | 10 | auto | | 100000 | 1 | 10 | 15684.6514217854 | 0.5933546515
102 | 20 | auto | | 100000 | 1 | 10 | 8.4785871506 | 0.6304638499
103 | 25 | auto | | 100000 | 1 | 10 | 19074.6811187267 | 0.6301311693
104 | 30 | auto | | 100000 | 1 | 10 | 19897.7017846107 | 0.6059053268
The first model, Model 100, was a quick run to get some initial results. Its ‘alpha’ is set to ‘asymmetric’, which we’ve already seen runs faster but yields lower coherence than ‘auto’; comparing Model 100 to Model 101 above shows the same pattern on this much larger corpus.
Of all the models, Model 102, with 20 topics, achieves the highest Cv coherence, though Model 103, with 25 topics, scores practically the same. Even so, we’ll focus on Model 102.
Evaluating the LDA Model
How well did our model train? For an answer, we’ll evaluate two metrics: the proportion of documents that converged and the topic differences.
Document Convergence
Here is a plot of the proportion of documents that converged in each successive training iteration (i.e., for each chunk):
Ideally, every document would converge so that the curve would rise to 1 by the end of training. For our model it rises to about 0.90 at the last iteration. We would like it to be higher, but that’s fairly close to 1. The curve is still gradually rising at the end of training; if we ran another pass or two through the corpus, it might rise more, but we would have to tolerate a much longer running time.
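The per-chunk convergence proportions plotted above can be pulled out of the training log. gensim's LDA emits a DEBUG line per chunk of the form "n/m documents converged within k iterations"; the exact format is a version-dependent assumption, and this parser is a sketch rather than the actual extraction script linked below.

```python
import re

# Matches gensim's per-chunk convergence DEBUG line, e.g.
# "1500/2000 documents converged within 50 iterations"
# (assumed log format; it may differ across gensim versions).
CONVERGED_RE = re.compile(r"(\d+)/(\d+) documents converged within \d+ iterations")

def convergence_fractions(log_lines):
    """Return the fraction of documents that converged in each chunk."""
    fractions = []
    for line in log_lines:
        m = CONVERGED_RE.search(line)
        if m:
            converged, total = int(m.group(1)), int(m.group(2))
            fractions.append(converged / total)
    return fractions

sample = [
    "2021-05-01 12:00:00 DEBUG 1500/2000 documents converged within 50 iterations",
    "2021-05-01 12:05:00 DEBUG 1800/2000 documents converged within 50 iterations",
]
print(convergence_fractions(sample))  # → [0.75, 0.9]
```

The resulting list of fractions, one per chunk, is what gets plotted against the training iteration index.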
Topic Differences
Here is a plot of the topic differences between successive training iterations (i.e., between chunks):
Ideally, we want this curve to fall to zero, meaning no difference between topics from one iteration to the next; the topics would have converged. In our case, the curve falls to about 0.07 (before mysteriously jumping up to 0.10 at the last iteration; perhaps the last chunk contained few documents). As with the document convergence, we’d like it to be somewhat better, but it’s good enough to use the model.
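The topic differences can be harvested from the same training log. gensim reports a line like "topic diff=0.435467, rho=0.707107" after each model update; as before, the exact format is an assumption that may vary by version, and this is a sketch rather than the linked extraction script.

```python
import re

# Matches gensim's per-update topic-difference INFO line, e.g.
# "topic diff=0.435467, rho=0.707107"
# (assumed log format; it may differ across gensim versions).
DIFF_RE = re.compile(r"topic diff=([0-9.]+)")

def topic_diffs(log_lines):
    """Return the topic difference reported after each update."""
    return [float(m.group(1)) for line in log_lines
            if (m := DIFF_RE.search(line))]

sample = [
    "2021-05-01 12:00:00 INFO topic diff=0.435467, rho=0.707107",
    "2021-05-01 12:05:00 INFO topic diff=0.071234, rho=0.577350",
]
print(topic_diffs(sample))  # → [0.435467, 0.071234]
```

Plotting this sequence against the chunk index gives the curve above, where a value near zero means successive updates barely moved the topics.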
Code
The code for training the LDA models is here.
The code for extracting the document convergences and topic differences from the training logs is here.
The code for plotting the document convergences and topic differences is here.