LDA Parameters Testing
Before training a Latent Dirichlet Allocation (LDA) model on the entire, large Wikipedia corpus, it would be helpful to do some small-scale testing of the parameter settings available in Gensim’s LDA implementation.
I did this small-scale testing on a portion of Wikipedia with only 15,151 documents. I downloaded it from here:
ftp://ftpmirror.your.org/pub/wikimedia/dumps/enwiki/20170701/enwiki-20170701-pages-articles1.xml-p10p30302.bz2
Coherence
First, we need to determine how to measure how good a model is. A good topic model will produce a set of topics that we humans would perceive as sensible. More concretely, it would make sense to us for the topic “Fantasy” to include Tolkien’s The Lord of the Rings and J. K. Rowling’s Harry Potter series, but not Lee’s To Kill a Mockingbird.
Fortunately, the hard work has already been done for us. Gensim provides several measurements of the coherence of topics, and Röder, Both, and Hinneburg (2015) determined that the Cv measurement correlates best with human judgments of good (and bad) topics. The Cv measurement also takes the longest to calculate, but we’ll endure that for the sake of high-quality results in the end.
LDA Hyperparameters
The most important parameters are the Bayesian prior hyperparameters alpha and beta (called ‘eta’ in Gensim). Both hyperparameters represent our prior beliefs about the corpus before we train our model. A high alpha generally makes the combination of topics associated with each document more uniform across documents; that is, the documents become more similar to each other regarding their topics. A high beta generally makes each topic contain a more uniform mix of words.
In Gensim, both alpha and beta can be set to ‘symmetric’, ‘asymmetric’, or ‘auto’. The default for both is ‘symmetric’: a distribution in which every value equals 1 divided by the number of topics. For ‘auto’, the model learns the best values for the hyperparameters as it is trained on more and more data. The trade-off is that ‘auto’ takes longer to compute than ‘symmetric’ or ‘asymmetric’.
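To make the difference concrete, here is a small pure-Python sketch of the prior vectors these settings produce. The asymmetric formula follows my reading of `init_dir_prior` in the Gensim source, so treat it as an approximation:

```python
import math

def symmetric_prior(num_topics):
    # Gensim's default: every topic gets the same prior weight,
    # 1 divided by the number of topics.
    return [1.0 / num_topics] * num_topics

def asymmetric_prior(num_topics):
    # Gensim's 'asymmetric' option (per init_dir_prior in the
    # Gensim source): weight 1 / (topic_index + sqrt(num_topics)),
    # normalized to sum to 1, so earlier topics get larger priors.
    raw = [1.0 / (i + math.sqrt(num_topics)) for i in range(num_topics)]
    total = sum(raw)
    return [w / total for w in raw]

print(symmetric_prior(4))   # → [0.25, 0.25, 0.25, 0.25]
print(asymmetric_prior(4))  # decreasing weights
```

‘auto’ starts from one of these shapes and then updates the vector during training, which is why it costs extra compute.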
Beta
Below are some excerpts from a table showing the results of training multiple LDA models with different parameters on our small portion of Wikipedia. Because the corpus is small, we won’t run just one pass through it; we’ll run many (15, in this case). That’s probably far more than enough.
model_ID | num_topics | alpha | eta | chunksize | passes | coh_topn | lda_runtime | c_v_coherence |
---|---|---|---|---|---|---|---|---|
100 | 20 | asymmetric | auto | 1000 | 15 | 10 | 3.5661740303 | 0.508885978 |
101 | 20 | asymmetric | symmetric | 1000 | 15 | 10 | 3.6440546513 | 0.5221392219 |
108 | 20 | symmetric | auto | 1000 | 15 | 10 | 3454.8225626946 | 0.4936805796 |
109 | 20 | symmetric | symmetric | 1000 | 15 | 10 | 3157.2487235069 | 0.5302252269 |
110 | 20 | auto | auto | 1000 | 15 | 10 | 6919.0864150524 | 0.574127957 |
111 | 20 | auto | symmetric | 1000 | 15 | 10 | 5254.5730302334 | 0.5749012486 |
Each pair of models compares running beta (labeled as ‘eta’ in the table) as ‘auto’ versus running it as the default, which is ‘symmetric’. The last column is our Cv coherence measurement. For each pair of models, setting beta to ‘symmetric’ improves the coherence. And as a bonus, it generally takes less time to run, so we’ll keep beta on its default.
Alpha
Now let’s test alpha:
model_ID | num_topics | alpha | eta | chunksize | passes | coh_topn | lda_runtime | c_v_coherence |
---|---|---|---|---|---|---|---|---|
101 | 20 | asymmetric | symmetric | 1000 | 15 | 10 | 3.6440546513 | 0.5221392219 |
111 | 20 | auto | symmetric | 1000 | 15 | 10 | 5254.5730302334 | 0.5749012486 |
103 | 40 | asymmetric | symmetric | 1000 | 15 | 10 | 6.2108206749 | 0.5633506395 |
113 | 40 | auto | symmetric | 1000 | 15 | 10 | 7317.1304666996 | 0.6084816755 |
For both topic counts, setting alpha to ‘auto’ improves the coherence, though at the cost of a much longer runtime.
Other Parameters
I looked at parameters other than alpha and beta, too. The number of topics was the most important remaining parameter, but testing it on such a small corpus probably wouldn’t reveal much about what its value should be on the entire Wikipedia corpus. Testing it was more useful for gaining a feel for how the models and their coherences behave as the topic number changes.
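One way to get that feel is to read the num_topics sweep out of the table below. Using a handful of the alpha=‘auto’ rows (coherences rounded to four decimals), the peak falls somewhere in the 45-60 topic range:

```python
# (num_topics, c_v coherence) pairs taken from the table below,
# restricted to some of the models with alpha='auto' (rounded).
results = {
    111: (20, 0.5749), 112: (35, 0.6040), 113: (40, 0.6085),
    114: (45, 0.6365), 115: (50, 0.6235), 116: (47, 0.6245),
    117: (53, 0.6322), 124: (60, 0.6421), 126: (60, 0.6374),
    127: (65, 0.6177), 128: (70, 0.5926),
}
best_id, (best_k, best_cv) = max(results.items(), key=lambda kv: kv[1][1])
print(best_id, best_k, best_cv)  # → 124 60 0.6421
```

Model 124 (60 topics) edges out model 114 (45 topics), but the differences inside that range are small; coherence drops off clearly by 70 topics.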
Here are all the models, in case you’d like to see how they behave when other parameters are changed:
model_ID | num_topics | alpha | eta | chunksize | passes | coh_topn | lda_runtime | c_v_coherence |
---|---|---|---|---|---|---|---|---|
100 | 20 | asymmetric | auto | 1000 | 15 | 10 | 3.5661740303 | 0.508885978 |
101 | 20 | asymmetric | symmetric | 1000 | 15 | 10 | 3.6440546513 | 0.5221392219 |
102 | 40 | asymmetric | auto | 1000 | 15 | 10 | 6.146630764 | 0.5279288833 |
103 | 40 | asymmetric | symmetric | 1000 | 15 | 10 | 6.2108206749 | 0.5633506395 |
104 | 20 | asymmetric | symmetric | 1000 | 15 | 20 | 3042.1319074631 | 0.4684317634 |
105 | 40 | asymmetric | symmetric | 1000 | 15 | 20 | 4943.2342066765 | 0.5080142432 |
106 | 40 | asymmetric | symmetric | 1000 | 25 | 10 | 8143.2007772923 | 0.5669173986 |
107 | 50 | asymmetric | symmetric | 1000 | 25 | 10 | 15599.5442836285 | 0.565301955 |
108 | 20 | symmetric | auto | 1000 | 15 | 10 | 3454.8225626946 | 0.4936805796 |
109 | 20 | symmetric | symmetric | 1000 | 15 | 10 | 3157.2487235069 | 0.5302252269 |
110 | 20 | auto | auto | 1000 | 15 | 10 | 6919.0864150524 | 0.574127957 |
111 | 20 | auto | symmetric | 1000 | 15 | 10 | 5254.5730302334 | 0.5749012486 |
112 | 35 | auto | symmetric | 1000 | 15 | 10 | 6847.2765920162 | 0.6040312474 |
113 | 40 | auto | symmetric | 1000 | 15 | 10 | 7317.1304666996 | 0.6084816755 |
114 | 45 | auto | symmetric | 1000 | 15 | 10 | 7952.9970645905 | 0.6364677107 |
115 | 50 | auto | symmetric | 1000 | 15 | 10 | 11759.0735104084 | 0.6235108619 |
116 | 47 | auto | symmetric | 1000 | 15 | 10 | 10507.5518200397 | 0.6244982872 |
117 | 53 | auto | symmetric | 1000 | 15 | 10 | 11931.2115678787 | 0.6321550319 |
118 | 45 | auto | symmetric | 1000 | 5 | 10 | 3232.03733778 | 0.6236311279 |
119 | 45 | auto | symmetric | 1000 | 8 | 10 | 4889.1224434376 | 0.633669544 |
120 | 45 | auto | symmetric | 1000 | 10 | 10 | 7466.1797115803 | 0.6357059721 |
121 | 45 | auto | symmetric | 1000 | 12 | 10 | 10473.717195034 | 0.6330950488 |
122 | 55 | auto | symmetric | 1000 | 8 | 10 | 5423.5688388348 | 0.622883028 |
123 | 58 | auto | symmetric | 1000 | 8 | 10 | 5597.1084225178 | 0.6259745748 |
124 | 60 | auto | symmetric | 1000 | 8 | 10 | 5541.1032752991 | 0.6420843413 |
125 | 45 | auto | symmetric | 2000 | 10 | 10 | 4166.3788282871 | 0.6320565375 |
126 | 60 | auto | symmetric | 1000 | 4 | 10 | 3266.8249993324 | 0.6374277764 |
127 | 65 | auto | symmetric | 1000 | 4 | 10 | 3547.5680966377 | 0.6177338613 |
128 | 70 | auto | symmetric | 1000 | 4 | 10 | 3603.2955093384 | 0.5926260675 |
129 | 70 | auto | symmetric | 1000 | 8 | 10 | 13077.4700155258 | 0.6041583718 |
130 | 60 | auto | symmetric | 1000 | 8 | 20 | 8.5191428661 | 0.5952184031 |
131 | 60 | auto | symmetric | 1000 | 8 | 30 | 0.9584441185 | 0.5684720246 |
132 | 60 | auto | symmetric | 1000 | 8 | 35 | 0.9461779594 | 0.5589818702 |
133 | 10 | asymmetric | symmetric | 1000 | 2 | 10 | 423.7141816616 | 0.4152896112 |
134 | 10 | asymmetric | symmetric | 1000 | 2 | 10 | 484.8685014248 | 0.4073819038 |
135 | 4 | asymmetric | symmetric | 1000 | 1 | 4 | 199.7571542263 | 0.599919112 |