Parameter Testing

LDA Parameters Testing

Before training a Latent Dirichlet Allocation (LDA) model on the entire, large Wikipedia corpus, it's worth doing some small-scale testing of the parameter settings available in Gensim's LDA implementation.

I did this small-scale testing on a portion of Wikipedia with only 15,151 documents. I downloaded it from here:

ftp://ftpmirror.your.org/pub/wikimedia/dumps/enwiki/20170701/enwiki-20170701-pages-articles1.xml-p10p30302.bz2

Coherence

First, we need to determine how to measure how good a model is. A good topic model will produce a set of topics that we humans would perceive as sensible. More concretely, it would make sense to us for the topic “Fantasy” to include Tolkien’s The Lord of the Rings and J. K. Rowling’s Harry Potter series, but not Harper Lee’s To Kill a Mockingbird.

Fortunately, the hard work has already been done for us. Gensim provides several measures of topic coherence, and Röder, Both, and Hinneburg (2015) determined that the Cv measure correlates most strongly with human judgments of good (and bad) topics. Cv also takes the longest to calculate, but we’ll endure that for the sake of high-quality results in the end.
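Gensim's CoherenceModel computes Cv for us, but to build some intuition, here is a toy NPMI-style pairwise score of the kind Cv is built from. This is a simplified sketch, not Gensim's actual Cv pipeline, which additionally uses a sliding window and indirect cosine confirmation:

```python
import math
from itertools import combinations

def npmi_coherence(topic_words, documents):
    """Average NPMI over all pairs of a topic's top words.

    topic_words: list of top words for one topic
    documents: list of token lists; each document counts as one "window"
    """
    n_docs = len(documents)
    doc_sets = [set(d) for d in documents]

    def p(*words):
        # fraction of documents containing all the given words
        return sum(all(w in s for w in words) for s in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = p(w1), p(w2), p(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # words never co-occur: worst score
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))  # normalize to [-1, 1]
    return sum(scores) / len(scores)

docs = [["hobbit", "ring", "wizard"],
        ["ring", "wizard", "quest"],
        ["court", "trial", "lawyer"],
        ["trial", "lawyer", "verdict"]]

coherent = npmi_coherence(["ring", "wizard"], docs)    # words co-occur
incoherent = npmi_coherence(["ring", "lawyer"], docs)  # words never co-occur
```

The point is simply that a topic whose top words show up together in documents scores high, and one whose top words never co-occur scores low, matching the human intuition described above.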

LDA Hyperparameters

The most important parameters are the Dirichlet prior hyperparameters alpha and beta (called ‘eta’ in Gensim). Both represent our prior beliefs about the corpus before we train the model. A high alpha makes each document’s mix of topics more uniform, so documents look more similar to each other in their topic distributions. A high beta makes each topic contain a more uniform mix of words.

In Gensim, both alpha and beta can be set to ‘symmetric’, ‘asymmetric’, or ‘auto’. The default for both is ‘symmetric’: a uniform prior of 1 divided by the number of topics. With ‘auto’, the model learns the best hyperparameter values as it is trained on more and more data. The trade-off is that ‘auto’ takes longer to compute than ‘symmetric’ or ‘asymmetric’.
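To make the difference concrete, here is a sketch of the two prior vectors for a 20-topic model. The symmetric value matches the 1/num_topics default described above; the asymmetric formula, 1/(topic_index + sqrt(num_topics)), is an assumption based on Gensim's source and may differ between versions:

```python
from math import sqrt

def symmetric_prior(num_topics):
    # every topic gets the same prior weight: 1 / num_topics
    return [1.0 / num_topics] * num_topics

def asymmetric_prior(num_topics):
    # earlier topics get more prior weight; formula assumed from
    # Gensim's source, treat it as illustrative
    return [1.0 / (i + sqrt(num_topics)) for i in range(num_topics)]

sym = symmetric_prior(20)    # flat: [0.05, 0.05, ..., 0.05]
asym = asymmetric_prior(20)  # decreasing: topic 0 gets the most weight
```

A symmetric prior says every topic is equally likely a priori; an asymmetric one biases the model toward a few dominant topics plus a long tail.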

Beta

Below are some excerpts from a table showing the results of training multiple LDA models with different parameters on our small portion of Wikipedia. Because we’re training on a small corpus, we won’t run just one pass through it; we’ll run many (15, in this case), which is probably far more than enough.

model_ID  num_topics  alpha       eta        chunksize  passes  coh_topn  lda_runtime      c_v_coherence
100       20          asymmetric  auto       1000       15      10        3.5661740303     0.508885978
101       20          asymmetric  symmetric  1000       15      10        3.6440546513     0.5221392219
108       20          symmetric   auto       1000       15      10        3454.8225626946  0.4936805796
109       20          symmetric   symmetric  1000       15      10        3157.2487235069  0.5302252269
110       20          auto        auto       1000       15      10        6919.0864150524  0.574127957
111       20          auto        symmetric  1000       15      10        5254.5730302334  0.5749012486

Each pair of models compares beta (labeled ‘eta’ in the table) set to ‘auto’ against its default, ‘symmetric’. The last column is our Cv coherence measure. In every pair, leaving beta at ‘symmetric’ gives higher coherence, and as a bonus it generally runs faster, so we’ll keep beta at its default.

Alpha

Now let’s test alpha:

model_ID  num_topics  alpha       eta        chunksize  passes  coh_topn  lda_runtime      c_v_coherence
101       20          asymmetric  symmetric  1000       15      10        3.6440546513     0.5221392219
111       20          auto        symmetric  1000       15      10        5254.5730302334  0.5749012486
103       40          asymmetric  symmetric  1000       15      10        6.2108206749     0.5633506395
113       40          auto        symmetric  1000       15      10        7317.1304666996  0.6084816755

Setting alpha to ‘auto’ improves coherence at both topic counts.

Other Parameters

I looked at parameters other than alpha and beta, too. The number of topics was the most important remaining parameter, but testing it on such a small corpus probably wouldn’t reveal much about what its value should be on the entire Wikipedia corpus. Testing it was more useful for gaining a feel for how the models and their coherences behave as the topic number changes.
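The sweeps behind these tables amount to a simple loop over parameter settings with timing around each run. Here is a sketch with a hypothetical train_and_score callable standing in for training an LdaModel and scoring it with Cv:

```python
import time

def sweep(param_grid, train_and_score):
    """Run one model per parameter dict, recording runtime and coherence.

    param_grid: list of dicts of LDA parameters
    train_and_score: callable(params) -> coherence; hypothetical stand-in
        for training an LdaModel and scoring it with CoherenceModel
    """
    results = []
    for params in param_grid:
        start = time.perf_counter()
        coherence = train_and_score(params)
        runtime = time.perf_counter() - start
        results.append({**params, "runtime": runtime, "coherence": coherence})
    # highest-coherence configuration first
    return sorted(results, key=lambda r: r["coherence"], reverse=True)

grid = [{"num_topics": k, "alpha": "auto"} for k in (20, 40, 60)]
```

Sorting by coherence while keeping the measured runtime alongside makes the time-versus-quality trade-offs in the table easy to read off.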

Here are all the models, in case you’d like to see how they behave when other parameters are changed:

model_ID  num_topics  alpha       eta        chunksize  passes  coh_topn  lda_runtime       c_v_coherence
100       20          asymmetric  auto       1000       15      10        3.5661740303      0.508885978
101       20          asymmetric  symmetric  1000       15      10        3.6440546513      0.5221392219
102       40          asymmetric  auto       1000       15      10        6.146630764       0.5279288833
103       40          asymmetric  symmetric  1000       15      10        6.2108206749      0.5633506395
104       20          asymmetric  symmetric  1000       15      20        3042.1319074631   0.4684317634
105       40          asymmetric  symmetric  1000       15      20        4943.2342066765   0.5080142432
106       40          asymmetric  symmetric  1000       25      10        8143.2007772923   0.5669173986
107       50          asymmetric  symmetric  1000       25      10        15599.5442836285  0.565301955
108       20          symmetric   auto       1000       15      10        3454.8225626946   0.4936805796
109       20          symmetric   symmetric  1000       15      10        3157.2487235069   0.5302252269
110       20          auto        auto       1000       15      10        6919.0864150524   0.574127957
111       20          auto        symmetric  1000       15      10        5254.5730302334   0.5749012486
112       35          auto        symmetric  1000       15      10        6847.2765920162   0.6040312474
113       40          auto        symmetric  1000       15      10        7317.1304666996   0.6084816755
114       45          auto        symmetric  1000       15      10        7952.9970645905   0.6364677107
115       50          auto        symmetric  1000       15      10        11759.0735104084  0.6235108619
116       47          auto        symmetric  1000       15      10        10507.5518200397  0.6244982872
117       53          auto        symmetric  1000       15      10        11931.2115678787  0.6321550319
118       45          auto        symmetric  1000       5       10        3232.03733778     0.6236311279
119       45          auto        symmetric  1000       8       10        4889.1224434376   0.633669544
120       45          auto        symmetric  1000       10      10        7466.1797115803   0.6357059721
121       45          auto        symmetric  1000       12      10        10473.717195034   0.6330950488
122       55          auto        symmetric  1000       8       10        5423.5688388348   0.622883028
123       58          auto        symmetric  1000       8       10        5597.1084225178   0.6259745748
124       60          auto        symmetric  1000       8       10        5541.1032752991   0.6420843413
125       45          auto        symmetric  2000       10      10        4166.3788282871   0.6320565375
126       60          auto        symmetric  1000       4       10        3266.8249993324   0.6374277764
127       65          auto        symmetric  1000       4       10        3547.5680966377   0.6177338613
128       70          auto        symmetric  1000       4       10        3603.2955093384   0.5926260675
129       70          auto        symmetric  1000       8       10        13077.4700155258  0.6041583718
130       60          auto        symmetric  1000       8       20        8.5191428661      0.5952184031
131       60          auto        symmetric  1000       8       30        0.9584441185      0.5684720246
132       60          auto        symmetric  1000       8       35        0.9461779594      0.5589818702
133       10          asymmetric  symmetric  1000       2       10        423.7141816616    0.4152896112
134       10          asymmetric  symmetric  1000       2       10        484.8685014248    0.4073819038
135       4           asymmetric  symmetric  1000       1       4         199.7571542263    0.599919112