LDA Parameters Testing
Before training a Latent Dirichlet Allocation (LDA) model on the entire, large Wikipedia corpus, it would be helpful to do some small-scale testing of the parameter settings available in Gensim’s LDA implementation.
I did this small-scale testing on a portion of Wikipedia with only 15,151 documents. I downloaded it from here:
ftp://ftpmirror.your.org/pub/wikimedia/dumps/enwiki/20170701/enwiki-20170701-pages-articles1.xml-p10p30302.bz2
Coherence
First, we need to determine how to measure how good a model is. A good topic model will produce a set of topics that we humans would perceive as sensible. More concretely, it would make sense to us for the topic “Fantasy” to include Tolkien’s The Lord of the Rings and J. K. Rowling’s Harry Potter series, but not Lee’s To Kill a Mockingbird.
Fortunately, the hard work has already been done for us. Gensim provides several measurements of the coherence of topics, and Röder, Both, and Hinneburg (2015) determined that the Cv measurement correlates best with human judgments of good (and bad) topics. The Cv measurement also takes the longest to calculate, but we’ll endure that for the sake of high-quality results in the end.
LDA Hyperparameters
The most important parameters are the Bayesian prior hyperparameters alpha and beta (called ‘eta’ in Gensim). Both hyperparameters represent our prior beliefs about the corpus before we train our model. A high alpha generally makes the combination of topics associated with each document more uniform across documents; that is, the documents become more similar to each other regarding their topics. A high beta generally makes each topic contain a more uniform mix of words.
In Gensim, both alpha and beta can be set to ‘symmetric’, ‘asymmetric’, or ‘auto’. The default for both is ‘symmetric’: a distribution in which every value equals 1 divided by the number of topics. For ‘auto’, the model learns the best values for the hyperparameters as it is trained on more and more data. The trade-off is that ‘auto’ takes longer to compute than ‘symmetric’ or ‘asymmetric’.
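To make the difference concrete, here is a small pure-Python sketch of the prior vectors these settings produce. The asymmetric formula follows my reading of `init_dir_prior` in the Gensim source, so treat it as an approximation:

```python
import math

def symmetric_prior(num_topics):
    # Gensim's default: every topic gets the same prior weight,
    # 1 divided by the number of topics.
    return [1.0 / num_topics] * num_topics

def asymmetric_prior(num_topics):
    # Gensim's 'asymmetric' option (per init_dir_prior in the
    # Gensim source): weight 1 / (topic_index + sqrt(num_topics)),
    # normalized to sum to 1, so earlier topics get larger priors.
    raw = [1.0 / (i + math.sqrt(num_topics)) for i in range(num_topics)]
    total = sum(raw)
    return [w / total for w in raw]

print(symmetric_prior(4))   # → [0.25, 0.25, 0.25, 0.25]
print(asymmetric_prior(4))  # decreasing weights
```

‘auto’ starts from one of these shapes and then updates the vector during training, which is why it costs extra compute.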
Beta
Below are some excerpts from a table showing the results of training multiple LDA models with different parameters on our small portion of Wikipedia. Because the corpus is small, we won’t run just one pass through it; we’ll run many (15, in this case). That’s probably far more than enough.
model_ID | num_topics | alpha | eta | chunksize | passes | coh_topn | lda_runtime | c_v_coherence |
---|---|---|---|---|---|---|---|---|
100 | 20 | asymmetric | auto | 1000 | 15 | 10 | 3.5661740303 | 0.508885978 |
101 | 20 | asymmetric | symmetric | 1000 | 15 | 10 | 3.6440546513 | 0.5221392219 |
108 | 20 | symmetric | auto | 1000 | 15 | 10 | 3454.8225626946 | 0.4936805796 |
109 | 20 | symmetric | symmetric | 1000 | 15 | 10 | 3157.2487235069 | 0.5302252269 |
110 | 20 | auto | auto | 1000 | 15 | 10 | 6919.0864150524 | 0.574127957 |
111 | 20 | auto | symmetric | 1000 | 15 | 10 | 5254.5730302334 | 0.5749012486 |
Each pair of models compares running beta (labeled as ‘eta’ in the table) as ‘auto’ versus running it as the default, which is ‘symmetric’. The last column is our Cv coherence measurement. For each pair of models, setting beta to ‘symmetric’ improves the coherence. And as a bonus, it generally takes less time to run, so we’ll keep beta on its default.
Alpha
Now let’s test alpha:
model_ID | num_topics | alpha | eta | chunksize | passes | coh_topn | lda_runtime | c_v_coherence |
---|---|---|---|---|---|---|---|---|
101 | 20 | asymmetric | symmetric | 1000 | 15 | 10 | 3.6440546513 | 0.5221392219 |
111 | 20 | auto | symmetric | 1000 | 15 | 10 | 5254.5730302334 | 0.5749012486 |
103 | 40 | asymmetric | symmetric | 1000 | 15 | 10 | 6.2108206749 | 0.5633506395 |
113 | 40 | auto | symmetric | 1000 | 15 | 10 | 7317.1304666996 | 0.6084816755 |
For both topic counts, setting alpha to ‘auto’ improves the coherence, though at the cost of a much longer runtime.
Other Parameters
I looked at parameters other than alpha and beta, too. The number of topics was the most important remaining parameter, but testing it on such a small corpus probably wouldn’t reveal much about what its value should be on the entire Wikipedia corpus. Testing it was more useful for gaining a feel for how the models and their coherences behave as the topic number changes.
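One way to get that feel is to read the num_topics sweep out of the table below. Using a handful of the alpha=‘auto’ rows (coherences rounded to four decimals), the peak falls somewhere in the 45-60 topic range:

```python
# (num_topics, c_v coherence) pairs taken from the table below,
# restricted to some of the models with alpha='auto' (rounded).
results = {
    111: (20, 0.5749), 112: (35, 0.6040), 113: (40, 0.6085),
    114: (45, 0.6365), 115: (50, 0.6235), 116: (47, 0.6245),
    117: (53, 0.6322), 124: (60, 0.6421), 126: (60, 0.6374),
    127: (65, 0.6177), 128: (70, 0.5926),
}
best_id, (best_k, best_cv) = max(results.items(), key=lambda kv: kv[1][1])
print(best_id, best_k, best_cv)  # → 124 60 0.6421
```

Model 124 (60 topics) edges out model 114 (45 topics), but the differences inside that range are small; coherence drops off clearly by 70 topics.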
Here are all the models, in case you’d like to see how they behave when other parameters are changed:
model_ID | num_topics | alpha | eta | chunksize | passes | coh_topn | lda_runtime | c_v_coherence |
---|---|---|---|---|---|---|---|---|
100 | 20 | asymmetric | auto | 1000 | 15 | 10 | 3.5661740303 | 0.508885978 |
101 | 20 | asymmetric | symmetric | 1000 | 15 | 10 | 3.6440546513 | 0.5221392219 |
102 | 40 | asymmetric | auto | 1000 | 15 | 10 | 6.146630764 | 0.5279288833 |
103 | 40 | asymmetric | symmetric | 1000 | 15 | 10 | 6.2108206749 | 0.5633506395 |
104 | 20 | asymmetric | symmetric | 1000 | 15 | 20 | 3042.1319074631 | 0.4684317634 |
105 | 40 | asymmetric | symmetric | 1000 | 15 | 20 | 4943.2342066765 | 0.5080142432 |
106 | 40 | asymmetric | symmetric | 1000 | 25 | 10 | 8143.2007772923 | 0.5669173986 |
107 | 50 | asymmetric | symmetric | 1000 | 25 | 10 | 15599.5442836285 | 0.565301955 |
108 | 20 | symmetric | auto | 1000 | 15 | 10 | 3454.8225626946 | 0.4936805796 |
109 | 20 | symmetric | symmetric | 1000 | 15 | 10 | 3157.2487235069 | 0.5302252269 |
110 | 20 | auto | auto | 1000 | 15 | 10 | 6919.0864150524 | 0.574127957 |
111 | 20 | auto | symmetric | 1000 | 15 | 10 | 5254.5730302334 | 0.5749012486 |
112 | 35 | auto | symmetric | 1000 | 15 | 10 | 6847.2765920162 | 0.6040312474 |
113 | 40 | auto | symmetric | 1000 | 15 | 10 | 7317.1304666996 | 0.6084816755 |
114 | 45 | auto | symmetric | 1000 | 15 | 10 | 7952.9970645905 | 0.6364677107 |
115 | 50 | auto | symmetric | 1000 | 15 | 10 | 11759.0735104084 | 0.6235108619 |
116 | 47 | auto | symmetric | 1000 | 15 | 10 | 10507.5518200397 | 0.6244982872 |
117 | 53 | auto | symmetric | 1000 | 15 | 10 | 11931.2115678787 | 0.6321550319 |
118 | 45 | auto | symmetric | 1000 | 5 | 10 | 3232.03733778 | 0.6236311279 |
119 | 45 | auto | symmetric | 1000 | 8 | 10 | 4889.1224434376 | 0.633669544 |
120 | 45 | auto | symmetric | 1000 | 10 | 10 | 7466.1797115803 | 0.6357059721 |
121 | 45 | auto | symmetric | 1000 | 12 | 10 | 10473.717195034 | 0.6330950488 |
122 | 55 | auto | symmetric | 1000 | 8 | 10 | 5423.5688388348 | 0.622883028 |
123 | 58 | auto | symmetric | 1000 | 8 | 10 | 5597.1084225178 | 0.6259745748 |
124 | 60 | auto | symmetric | 1000 | 8 | 10 | 5541.1032752991 | 0.6420843413 |
125 | 45 | auto | symmetric | 2000 | 10 | 10 | 4166.3788282871 | 0.6320565375 |
126 | 60 | auto | symmetric | 1000 | 4 | 10 | 3266.8249993324 | 0.6374277764 |
127 | 65 | auto | symmetric | 1000 | 4 | 10 | 3547.5680966377 | 0.6177338613 |
128 | 70 | auto | symmetric | 1000 | 4 | 10 | 3603.2955093384 | 0.5926260675 |
129 | 70 | auto | symmetric | 1000 | 8 | 10 | 13077.4700155258 | 0.6041583718 |
130 | 60 | auto | symmetric | 1000 | 8 | 20 | 8.5191428661 | 0.5952184031 |
131 | 60 | auto | symmetric | 1000 | 8 | 30 | 0.9584441185 | 0.5684720246 |
132 | 60 | auto | symmetric | 1000 | 8 | 35 | 0.9461779594 | 0.5589818702 |
133 | 10 | asymmetric | symmetric | 1000 | 2 | 10 | 423.7141816616 | 0.4152896112 |
134 | 10 | asymmetric | symmetric | 1000 | 2 | 10 | 484.8685014248 | 0.4073819038 |
135 | 4 | asymmetric | symmetric | 1000 | 1 | 4 | 199.7571542263 | 0.599919112 |