The function that tokenized, stemmed, and otherwise processed the Wikipedia articles was consuming most of the overall computational time for the program. If it could be sped up, it would make the overall program much faster.
I placed the first, slow version of the function modify_text into a small, independent testing program (optimize_modify_text1.py) that excluded the parsing of the Wikipedia XML file and the storage into the SQLite database. That way, I could test it without concerning myself with those other parts of the overall program. Once the function was sped up, I could copy the faster version back into the overall program.
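For context, the profiles below come from the line_profiler package: the target function is decorated with @profile and the script is run through kernprof (that's what writes the .lprof files you'll see mentioned). The sketch below shows roughly what such a harness looks like; the sample file name and article delimiter are hypothetical stand-ins for however the 11 sample articles are loaded.

# optimize_modify_text1.py -- sketch of the standalone test harness.
# The sample file name and delimiter are hypothetical; only the
# kernprof/@profile workflow is the point.

try:
    profile                      # injected as a builtin by 'kernprof -l'
except NameError:                # fallback so the script also runs on its own
    def profile(func):
        return func

@profile
def modify_text(a_string):
    ...                          # the function under test (shown in full in the profile below)

if __name__ == '__main__':
    with open('sample_articles.txt', encoding='utf-8') as f:    # hypothetical sample file
        articles = f.read().split('\n===ARTICLE===\n')           # the 11 test articles
    for article in articles:
        modify_text(article)

# Profile with:  kernprof -l -v optimize_modify_text1.py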
Below is the line-by-line profile of the first, slow version of modify_text, run on a small sample of 11 Wikipedia articles:
Wrote profile results to optimize_modify_text1.py.lprof
Timer unit: 1e-06 s

Total time: 12.3793 s
File: optimize_modify_text1.py
Function: modify_text at line 39

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    39                                           @profile
    40                                           def modify_text(a_string):
    41                                               '''
    42                                               Processes Wikipedia text for analysis: removes HTML tags, removes newline
    43                                               indicators, converts to lowercase, removes references, removes URLs,
    44                                               tokenizes, removes punctuations, removes stop words, removes numbers,
    45                                               stems
    46                                               Wikipedia text is input as a single string; each string is an article
    47                                               Returns a list of processed tokens
    48                                               '''
    49        11       155927  14175.2      1.3      import html2text
    50        11      1143751 103977.4      9.2      import nltk
    51        11           36      3.3      0.0      import string
    52        11           35      3.2      0.0      import re
    53                                               #nltk.download('punkt')
    54                                               #nltk.download('all')
    55        11      1364314 124028.5     11.0      a_string = html2text.html2text(a_string).lower() # remove HTML tags, convert to lowercase
    56        11          438     39.8      0.0      a_string = a_string.split('=references=')[0] # remove references and everything afterwards
    57        11         2110    191.8      0.0      a_string = re.sub(r'https?:\/\/.*?[\s]', '', a_string) # remove URLs
    58        11          669     60.8      0.0      a_string = a_string.replace('|', ' ').replace('\n', ' ') # tokenizer doesn't divide by '|' and '\n'
    59        11      1650571 150051.9     13.3      tokens = nltk.tokenize.word_tokenize(a_string)
    60                                               #tokens = nltk.tokenize.RegexpTokenizer(r'\w+').tokenize(a_string)
    61        11         6442    585.6      0.1      stop_words = nltk.corpus.stopwords.words('english')
    62        11           91      8.3      0.0      string_punctuation = set(string.punctuation)
    63        11          109      9.9      0.0      punctuation = [p for p in string_punctuation]
    64                                               #miscellaneous = ['url']
    65        11           44      4.0      0.0      remove_items_list = stop_words + punctuation #+ miscellaneous
    66        11       419303  38118.5      3.4      tokens = [w for w in tokens if w not in remove_items_list]
    67        11        15809   1437.2      0.1      tokens = [w for w in tokens if '=' not in w] # remove remaining tags and the like
    68        11       147484  13407.6      1.2      tokens = [w for w in tokens if not # remove tokens that are all digits or punctuation
    69                                                         all(x.isdigit() or x in string_punctuation for x in w)]
    70        11        24743   2249.4      0.2      tokens = [w.strip(string.punctuation) for w in tokens] # remove stray punctuation attached to words
    71        11        15698   1427.1      0.1      tokens = [w for w in tokens if len(w) > 1] # remove single characters
    72        11       202115  18374.1      1.6      tokens = [w for w in tokens if not any(x.isdigit() for x in w)] # remove everything with a digit in it
    73        11          586     53.3      0.0      stemmer = nltk.stem.PorterStemmer()
    74        11      7228979 657179.9     58.4      stemmed = [stemmer.stem(w) for w in tokens]
    75        11           45      4.1      0.0      return(stemmed)
The overall processing time for the testing program was over 12 seconds. Importing NLTK was taking a lot of time (line 50). So were removing the HTML tags (line 55), tokenizing the string (line 59), and stemming the tokens (line 74).
The stemmer was the biggest time hog, accounting for about 58% of the total, so it was a good target for improvement. Instead of the Porter stemmer, I tried the Snowball stemmer. Here are the relevant excerpts from the resulting line profile (notice that the unused stemmer lines are commented out):
Wrote profile results to optimize_modify_text3.py.lprof
Timer unit: 1e-06 s

Total time: 22.3184 s
File: optimize_modify_text3.py
Function: modify_text at line 39

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    77                                               #stemmer = nltk.stem.PorterStemmer()
    78        11          246     22.4      0.0      stemmer = nltk.stem.SnowballStemmer('english')
    79                                               #stemmer = nltk.stem.lancaster.LancasterStemmer()
    80        11     17174672 1561333.8    77.0      stemmed = [stemmer.stem(w) for w in tokens]
    81        11           44      4.0      0.0      return(stemmed)
Ouch! We went from 12 seconds to 22 seconds! That’s definitely not the improvement we were looking for. Let’s try another stemmer:
Wrote profile results to optimize_modify_text4.py.lprof
Timer unit: 1e-06 s

Total time: 9.28702 s
File: optimize_modify_text4.py
Function: modify_text at line 39

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    77                                               #stemmer = nltk.stem.PorterStemmer()
    78                                               #stemmer = nltk.stem.SnowballStemmer('english')
    79        11          103      9.4      0.0      stemmer = nltk.stem.lancaster.LancasterStemmer()
    80        11      4174730 379520.9     45.0      stemmed = [stemmer.stem(w) for w in tokens]
    81        11           39      3.5      0.0      return(stemmed)
The Lancaster stemmer is much better: the overall time is now about 9 seconds, compared to over 12 seconds with the Porter stemmer.
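One caveat: the Lancaster stemmer is generally the most aggressive of the three, so its stems are shorter and occasionally harder to recognize than Porter's or Snowball's. A quick spot check like the sketch below (the word list is arbitrary) helps confirm the output is still acceptable for this kind of analysis:

# Spot-check how aggressively each stemmer cuts a few sample words,
# to make sure Lancaster's output is still usable for the analysis.
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer

words = ['running', 'generously', 'maximum', 'references', 'tokenization']
for stemmer in (PorterStemmer(), SnowballStemmer('english'), LancasterStemmer()):
    print(type(stemmer).__name__, [stemmer.stem(w) for w in words])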
The tokenizing step was also taking a lot of time. Maybe we can improve it, too. Let’s try a different tokenizer:
Wrote profile results to optimize_modify_text5.py.lprof
Timer unit: 1e-06 s

Total time: 8.3118 s
File: optimize_modify_text5.py
Function: modify_text at line 39

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    60                                               #tokens = nltk.tokenize.RegexpTokenizer(r'\w+').tokenize(a_string)
    61                                               #tokens = nltk.tokenize.word_tokenize(a_string)
    62        11           56      5.1      0.0      tokenizer = nltk.tokenize.ToktokTokenizer()
    63        11       449286  40844.2      5.4      tokens = tokenizer.tokenize(a_string)
We’ve now replaced NLTK’s word_tokenize with its ToktokTokenizer, and that improved our overall time from about 9 seconds to about 8 seconds.
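The two tokenizers don’t split text identically: word_tokenize relies on NLTK’s pretrained Punkt sentence data, while ToktokTokenizer is a simpler, rule-based tokenizer. Before committing to the swap, it’s worth eyeballing their output on a sample string; a minimal sketch:

# Compare the two tokenizers on a small sample string before committing to the swap.
import nltk
from nltk.tokenize import ToktokTokenizer

sample = "The quick brown fox (born 1987) wasn't hurt; see https://example.com for details."
print(nltk.tokenize.word_tokenize(sample))   # needs the 'punkt' data to be downloaded
print(ToktokTokenizer().tokenize(sample))    # rule-based, no model download required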
Finally, we can improve on those import times. The imports live inside the function, and the function is called over and over, so the import statements execute on every call. (Python caches modules after the first import, so later calls only pay a cheap lookup, but the first call still absorbs the full cost of importing NLTK, and the profiler charges all of it to this function.) It's smarter to import everything once at module level and simply bind those already-imported names inside the function. We'll do the same for the html2text import. Here are the global imports:
# moved these from function 'modify_text' to global variables for speed
from html2text import html2text
from nltk.tokenize import ToktokTokenizer
from nltk.corpus import stopwords
from nltk.stem.lancaster import LancasterStemmer
And here are the resulting excerpts from the line profile:
Wrote profile results to optimize_modify_text6.py.lprof
Timer unit: 1e-06 s

Total time: 6.79855 s
File: optimize_modify_text6.py
Function: modify_text at line 46

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    57                                               #import html2text
    58                                               #import nltk
    59                                               # moved imports from 'html2text' and 'nltk' to global variables for speed
    60        11           36      3.3      0.0      html_2_text = html2text
    61        11           30      2.7      0.0      nltk_tok_tok = ToktokTokenizer
    62        11           28      2.5      0.0      nltk_stopwords = stopwords
    63        11           27      2.5      0.0      nltk_lancaster = LancasterStemmer
    64        11           46      4.2      0.0      import string
    65        11           34      3.1      0.0      import re
That brought the overall time down to a bit under 7 seconds, much better than the 12-plus seconds we started with. Now we can enjoy much faster processing of the Wikipedia articles.
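As a final sanity check, it can be worth timing the function outside the profiler, since line_profiler adds some per-line overhead of its own. A rough sketch, assuming the 11 sample article strings are already loaded into a list called articles:

# Rough wall-clock timing of the optimized modify_text, outside line_profiler.
import time

def time_modify_text(articles):
    start = time.perf_counter()
    for article in articles:       # 'articles' is the assumed list of 11 sample strings
        modify_text(article)
    return time.perf_counter() - start

# Example: print(f"Processed {len(articles)} articles in {time_modify_text(articles):.2f} s")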
I also made some smaller changes to the function. Here’s the final full line profile, so you can see all the changes compared to the first one:
Wrote profile results to optimize_modify_text6.py.lprof
Timer unit: 1e-06 s

Total time: 6.79855 s
File: optimize_modify_text6.py
Function: modify_text at line 46

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    46                                           @profile
    47                                           def modify_text(a_string):
    48                                               '''
    49                                               Processes Wikipedia text for analysis: removes HTML tags, removes newline
    50                                               indicators, converts to lowercase, removes references, removes URLs,
    51                                               tokenizes, removes punctuations, removes stop words, removes numbers,
    52                                               stems
    53                                               Wikipedia text is input as a single string; each string is an article
    54                                               Returns a list of processed tokens
    55                                               '''
    56
    57                                               #import html2text
    58                                               #import nltk
    59                                               # moved imports from 'html2text' and 'nltk' to global variables for speed
    60        11           36      3.3      0.0      html_2_text = html2text
    61        11           30      2.7      0.0      nltk_tok_tok = ToktokTokenizer
    62        11           28      2.5      0.0      nltk_stopwords = stopwords
    63        11           27      2.5      0.0      nltk_lancaster = LancasterStemmer
    64        11           46      4.2      0.0      import string
    65        11           34      3.1      0.0      import re
    66
    67                                               #nltk.download('punkt')
    68                                               #nltk.download('all')
    69
    70        11          509     46.3      0.0      a_string = a_string.split('=References=')[0] # remove references and everything afterwards
    71                                               #a_string = html2text.html2text(a_string).lower() # remove HTML tags, convert to lowercase
    72        11      1279744 116340.4     18.8      a_string = html_2_text(a_string).lower() # remove HTML tags, convert to lowercase
    73        11         2171    197.4      0.0      a_string = re.sub(r'https?:\/\/.*?[\s]', '', a_string) # remove URLs
    74        11          693     63.0      0.0      a_string = a_string.replace('|', ' ').replace('\n', ' ') # 'word_tokenize' doesn't divide by '|' and '\n'
    75
    76                                               #tokens = nltk.tokenize.RegexpTokenizer(r'\w+').tokenize(a_string)
    77                                               #tokens = nltk.tokenize.word_tokenize(a_string)
    78                                               #tokenizer = nltk.tokenize.ToktokTokenizer()
    79        11           39      3.5      0.0      tokenizer = nltk_tok_tok() # tokenizes faster than 'word_tokenize'
    80        11       430930  39175.5      6.3      tokens = tokenizer.tokenize(a_string)
    81
    82                                               #stop_words = nltk.corpus.stopwords.words('english')
    83        11         7940    721.8      0.1      stop_words = nltk_stopwords.words('english')
    84        11           67      6.1      0.0      string_punctuation = list(string.punctuation)
    85        11           44      4.0      0.0      remove_items_list = stop_words + string_punctuation
    86        11       445611  40510.1      6.6      tokens = [w for w in tokens if w not in remove_items_list]
    87
    88        11        14418   1310.7      0.2      tokens = [w for w in tokens if '=' not in w] # remove remaining tags and the like
    89        11       192927  17538.8      2.8      tokens = [w for w in tokens if not # remove tokens that are all digits or punctuation
    90                                                         all(x.isdigit() or x in string_punctuation for x in w)]
    91        11        24811   2255.5      0.4      tokens = [w.strip(string.punctuation) for w in tokens] # remove stray punctuation attached to words
    92        11        15883   1443.9      0.2      tokens = [w for w in tokens if len(w) > 1] # remove single characters
    93        11       201439  18312.6      3.0      tokens = [w for w in tokens if not any(x.isdigit() for x in w)] # remove everything with a digit in it
    94
    95                                               #stemmer = nltk.stem.PorterStemmer()
    96                                               #stemmer = nltk.stem.SnowballStemmer('english')
    97                                               #stemmer = nltk.stem.lancaster.LancasterStemmer() # fastest stemmer; results seem okay
    98        11           99      9.0      0.0      stemmer = nltk_lancaster() # fastest stemmer; results seem okay
    99        11      4180987 380089.7     61.5      stemmed = [stemmer.stem(w) for w in tokens]
   100
   101        11           38      3.5      0.0      return(stemmed)