Speeding Up the Code

The function that tokenized, stemmed, and otherwise processed the Wikipedia articles was consuming most of the program's overall computational time, so speeding it up would make the whole program much faster.

I placed the first, slow version of the function modify_text into a small, independent testing program (optimize_modify_text1.py) that excluded the parsing of the Wikipedia XML file and the storage into the SQLite database. That way, I could test it without concerning myself with those other parts of the program. Once the function was sped up, I could copy the faster version back into the full program.
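
For context, here is a rough sketch of what that kind of stand-alone test harness can look like; the sample file name and the loading loop are illustrative rather than the actual code, and the @profile decorator is injected by the line_profiler package when the script is run through kernprof:

# optimize_modify_text1.py -- minimal profiling harness (illustrative sketch)
# Run with:  kernprof -l -v optimize_modify_text1.py
# kernprof injects the 'profile' decorator into builtins at runtime.

@profile
def modify_text(a_string):
    # ...the full text-processing function shown in the profiles below...
    return a_string.lower().split()

if __name__ == '__main__':
    # 'articles_sample.txt' is a hypothetical file holding the 11-article sample,
    # one article per line
    with open('articles_sample.txt', encoding='utf-8') as f:
        for article in f:
            modify_text(article)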

Below is the line-by-line profile for the first, slow version of modify_text, run on a small test sample of 11 Wikipedia articles:

 
Wrote profile results to optimize_modify_text1.py.lprof
Timer unit: 1e-06 s

Total time: 12.3793 s
File: optimize_modify_text1.py
Function: modify_text at line 39

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    39                                           @profile
    40                                           def modify_text(a_string):
    41                                               '''
    42                                               Processes Wikipedia text for analysis:  removes HTML tags, removes  newline
    43                                                   indicators, converts to lowercase, removes references, removes URLs,
    44                                                   tokenizes, removes punctuations, removes stop words, removes numbers,
    45                                                   stems
    46                                               Wikipedia text is input as a single string; each string is an article
    47                                               Returns a list of processed tokens
    48                                               '''
    49        11       155927  14175.2      1.3      import html2text
    50        11      1143751 103977.4      9.2      import nltk
    51        11           36      3.3      0.0      import string
    52        11           35      3.2      0.0      import re
    53                                               #nltk.download('punkt')
    54                                               #nltk.download('all')
    55        11      1364314 124028.5     11.0      a_string = html2text.html2text(a_string).lower()            # remove HTML tags, convert to lowercase
    56        11          438     39.8      0.0      a_string = a_string.split('=references=')[0]                # remove references and everything afterwards
    57        11         2110    191.8      0.0      a_string = re.sub(r'https?:\/\/.*?[\s]', '', a_string)      # remove URLs
    58        11          669     60.8      0.0      a_string = a_string.replace('|', ' ').replace('\n', ' ')    # tokenizer doesn't divide by '|' and '\n'
    59        11      1650571 150051.9     13.3      tokens = nltk.tokenize.word_tokenize(a_string)
    60                                               #tokens = nltk.tokenize.RegexpTokenizer(r'\w+').tokenize(a_string)
    61        11         6442    585.6      0.1      stop_words = nltk.corpus.stopwords.words('english')
    62        11           91      8.3      0.0      string_punctuation = set(string.punctuation)
    63        11          109      9.9      0.0      punctuation = [p for p in string_punctuation]
    64                                               #miscellaneous = ['url']
    65        11           44      4.0      0.0      remove_items_list = stop_words + punctuation #+ miscellaneous
    66        11       419303  38118.5      3.4      tokens = [w for w in tokens if w not in remove_items_list]
    67        11        15809   1437.2      0.1      tokens = [w for w in tokens if '=' not in w]                        # remove remaining tags and the like
    68        11       147484  13407.6      1.2      tokens = [w for w in tokens if not                                  # remove tokens that are all digits or punctuation
    69                                                         all(x.isdigit() or x in string_punctuation for x in w)]
    70        11        24743   2249.4      0.2      tokens = [w.strip(string.punctuation) for w in tokens]              # remove stray punctuation attached to words
    71        11        15698   1427.1      0.1      tokens = [w for w in tokens if len(w) > 1]                          # remove single characters
    72        11       202115  18374.1      1.6      tokens = [w for w in tokens if not any(x.isdigit() for x in w)]     # remove everything with a digit in it
    73        11          586     53.3      0.0      stemmer = nltk.stem.PorterStemmer()
    74        11      7228979 657179.9     58.4      stemmed = [stemmer.stem(w) for w in tokens]
    75        11           45      4.1      0.0      return(stemmed)

The overall processing time for the testing program was over 12 seconds. Importing NLTK was taking a lot of time (line 50, about 9% of the total). So were removing the HTML tags (line 55, about 11%), tokenizing the string (line 59, about 13%), and stemming the tokens (line 74, about 58%).

The stemmer was the biggest time hog, so it was a good target for improvement. Instead of the Porter stemmer, I tried the Snowball stemmer. Here are some relevant excerpts from the resulting line profile (notice that unused lines are commented out):


Wrote profile results to optimize_modify_text3.py.lprof
Timer unit: 1e-06 s

Total time: 22.3184 s
File: optimize_modify_text3.py
Function: modify_text at line 39

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================

    77                                               #stemmer = nltk.stem.PorterStemmer()
    78        11          246     22.4      0.0      stemmer = nltk.stem.SnowballStemmer('english')
    79                                               #stemmer = nltk.stem.lancaster.LancasterStemmer()
    80        11     17174672 1561333.8     77.0      stemmed = [stemmer.stem(w) for w in tokens]
    81        11           44      4.0      0.0      return(stemmed)

Ouch! We went from 12 seconds to 22 seconds! That’s definitely not the improvement we were looking for. Let’s try another stemmer:


Wrote profile results to optimize_modify_text4.py.lprof
Timer unit: 1e-06 s

Total time: 9.28702 s
File: optimize_modify_text4.py
Function: modify_text at line 39

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================

    77                                               #stemmer = nltk.stem.PorterStemmer()
    78                                               #stemmer = nltk.stem.SnowballStemmer('english')
    79        11          103      9.4      0.0      stemmer = nltk.stem.lancaster.LancasterStemmer()
    80        11      4174730 379520.9     45.0      stemmed = [stemmer.stem(w) for w in tokens]
    81        11           39      3.5      0.0      return(stemmed)

The Lancaster stemmer is much better: our overall time is now about 9.3 seconds, a clear improvement over the roughly 12.4 seconds we got with the Porter stemmer.
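
The Lancaster stemmer is also generally considered the most aggressive of the three, so before committing to it, it's worth spot-checking that its output is still acceptable. A quick sanity check might look something like this (the word list is arbitrary):

import nltk

porter = nltk.stem.PorterStemmer()
snowball = nltk.stem.SnowballStemmer('english')
lancaster = nltk.stem.lancaster.LancasterStemmer()

# print each stemmer's output side by side to eyeball the differences
for word in ['running', 'presumably', 'organization', 'maximum', 'references']:
    print(word, porter.stem(word), snowball.stem(word), lancaster.stem(word))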

The tokenizing step was also taking a lot of time. Maybe we can improve it, too. Let’s try a different tokenizer:


Wrote profile results to optimize_modify_text5.py.lprof
Timer unit: 1e-06 s

Total time: 8.3118 s
File: optimize_modify_text5.py
Function: modify_text at line 39

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================

    60                                               #tokens = nltk.tokenize.RegexpTokenizer(r'\w+').tokenize(a_string)
    61                                               #tokens = nltk.tokenize.word_tokenize(a_string)
    62        11           56      5.1      0.0      tokenizer = nltk.tokenize.ToktokTokenizer()
    63        11       449286  40844.2      5.4      tokens = tokenizer.tokenize(a_string)

We’ve now replaced NLTK’s word_tokenize with its Toktok tokenizer (ToktokTokenizer), which improved our overall time from about 9 seconds to about 8 seconds.
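
The swap itself is a one-line change. Here is a hedged sketch of the two options side by side; note that word_tokenize relies on the 'punkt' tokenizer data, while the rule-based Toktok tokenizer does not, and the two can split punctuation slightly differently:

import nltk

text = 'An example sentence, with some punctuation and a trailing clause.'

# original approach: the Punkt-based word tokenizer (needs nltk.download('punkt'))
slow_tokens = nltk.tokenize.word_tokenize(text)

# faster approach: the rule-based Toktok tokenizer
tokenizer = nltk.tokenize.ToktokTokenizer()
fast_tokens = tokenizer.tokenize(text)

print(slow_tokens)
print(fast_tokens)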

Finally, we can improve on those importing times. Importing NLTK is slow, and because modify_text is called for every article, the import statements are executed inside the function over and over again (Python caches modules after the first import, but the first call still pays the full import cost inside the function, and every later call still repeats the lookup). For speed, it's much smarter to import everything once, at module level, and simply refer to those names from within the function. We'll do the same for the html2text import. Here are the global imports:


# moved these from function 'modify_text' to global variables for speed
from html2text import html2text
from nltk.tokenize import ToktokTokenizer
from nltk.corpus import stopwords
from nltk.stem.lancaster import LancasterStemmer
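
With those imports at module level, the import cost is paid once, when the module is loaded, and the function simply refers to the imported names (the local aliases you'll see in the profile below are just renamings of these globals). A minimal sketch of the pattern, with a pared-down function body for illustration:

# module-level imports run once, when the module is first loaded
from html2text import html2text
from nltk.tokenize import ToktokTokenizer

def modify_text(a_string):
    # the function uses the already-imported names; no per-call import statements
    a_string = html2text(a_string).lower()
    return ToktokTokenizer().tokenize(a_string)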

And here are the resulting excerpts from the line profile:


Wrote profile results to optimize_modify_text6.py.lprof
Timer unit: 1e-06 s

Total time: 6.79855 s
File: optimize_modify_text6.py
Function: modify_text at line 46

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================

    57                                               #import html2text
    58                                               #import nltk
    59                                               # moved imports from 'html2text' and 'nltk' to global variables for speed
    60        11           36      3.3      0.0      html_2_text = html2text
    61        11           30      2.7      0.0      nltk_tok_tok = ToktokTokenizer
    62        11           28      2.5      0.0      nltk_stopwords = stopwords
    63        11           27      2.5      0.0      nltk_lancaster = LancasterStemmer
    64        11           46      4.2      0.0      import string
    65        11           34      3.1      0.0      import re

That improved the overall speed to a bit under 7 seconds. That’s much better than the 12 seconds we started out with. Now we can enjoy much faster processing of the Wikipedia articles.
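
It's worth noting that line_profiler adds some per-line overhead of its own, so a plain end-to-end timing run is a good way to confirm the gain on the real sample. A small sketch, where modify_text_old and modify_text_new are hypothetical names for the before and after versions:

import time

def time_version(process_func, articles):
    # times one version of the text-processing function over the whole sample
    start = time.perf_counter()
    for article in articles:
        process_func(article)
    return time.perf_counter() - start

# 'articles' would hold the same 11-article sample used in the profiles above
# print(time_version(modify_text_old, articles))
# print(time_version(modify_text_new, articles))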

I also made some smaller changes to the function. Here’s the final full line profile, so you can see all the changes compared to the first one:

 
Wrote profile results to optimize_modify_text6.py.lprof
Timer unit: 1e-06 s

Total time: 6.79855 s
File: optimize_modify_text6.py
Function: modify_text at line 46

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    46                                           @profile
    47                                           def modify_text(a_string):
    48                                               '''
    49                                               Processes Wikipedia text for analysis:  removes HTML tags, removes newline
    50                                                   indicators, converts to lowercase, removes references, removes URLs,
    51                                                   tokenizes, removes punctuations, removes stop words, removes numbers,
    52                                                   stems
    53                                               Wikipedia text is input as a single string; each string is an article
    54                                               Returns a list of processed tokens
    55                                               '''
    56                                           
    57                                               #import html2text
    58                                               #import nltk
    59                                               # moved imports from 'html2text' and 'nltk' to global variables for speed
    60        11           36      3.3      0.0      html_2_text = html2text
    61        11           30      2.7      0.0      nltk_tok_tok = ToktokTokenizer
    62        11           28      2.5      0.0      nltk_stopwords = stopwords
    63        11           27      2.5      0.0      nltk_lancaster = LancasterStemmer
    64        11           46      4.2      0.0      import string
    65        11           34      3.1      0.0      import re
    66                                           
    67                                               #nltk.download('punkt')
    68                                               #nltk.download('all')
    69                                           
    70        11          509     46.3      0.0      a_string = a_string.split('=References=')[0]                # remove references and everything afterwards
    71                                               #a_string = html2text.html2text(a_string).lower()            # remove HTML tags, convert to lowercase
    72        11      1279744 116340.4     18.8      a_string = html_2_text(a_string).lower()                    # remove HTML tags, convert to lowercase
    73        11         2171    197.4      0.0      a_string = re.sub(r'https?:\/\/.*?[\s]', '', a_string)      # remove URLs
    74        11          693     63.0      0.0      a_string = a_string.replace('|', ' ').replace('\n', ' ')    # 'word_tokenize' doesn't divide by '|' and '\n'
    75                                           
    76                                               #tokens = nltk.tokenize.RegexpTokenizer(r'\w+').tokenize(a_string)
    77                                               #tokens = nltk.tokenize.word_tokenize(a_string)
    78                                               #tokenizer = nltk.tokenize.ToktokTokenizer()
    79        11           39      3.5      0.0      tokenizer = nltk_tok_tok()                                  # tokenizes faster than 'word_tokenize'
    80        11       430930  39175.5      6.3      tokens = tokenizer.tokenize(a_string)
    81                                           
    82                                               #stop_words = nltk.corpus.stopwords.words('english')
    83        11         7940    721.8      0.1      stop_words = nltk_stopwords.words('english')
    84        11           67      6.1      0.0      string_punctuation = list(string.punctuation)
    85        11           44      4.0      0.0      remove_items_list = stop_words + string_punctuation
    86        11       445611  40510.1      6.6      tokens = [w for w in tokens if w not in remove_items_list]
    87                                           
    88        11        14418   1310.7      0.2      tokens = [w for w in tokens if '=' not in w]                        # remove remaining tags and the like
    89        11       192927  17538.8      2.8      tokens = [w for w in tokens if not                                  # remove tokens that are all digits or punctuation
    90                                                         all(x.isdigit() or x in string_punctuation for x in w)]
    91        11        24811   2255.5      0.4      tokens = [w.strip(string.punctuation) for w in tokens]              # remove stray punctuation attached to words
    92        11        15883   1443.9      0.2      tokens = [w for w in tokens if len(w) > 1]                          # remove single characters
    93        11       201439  18312.6      3.0      tokens = [w for w in tokens if not any(x.isdigit() for x in w)]     # remove everything with a digit in it
    94                                           
    95                                               #stemmer = nltk.stem.PorterStemmer()
    96                                               #stemmer = nltk.stem.SnowballStemmer('english')
    97                                               #stemmer = nltk.stem.lancaster.LancasterStemmer()            # fastest stemmer; results seem okay
    98        11           99      9.0      0.0      stemmer = nltk_lancaster()                                  # fastest stemmer; results seem okay
    99        11      4180987 380089.7     61.5      stemmed = [stemmer.stem(w) for w in tokens]
   100                                           
   101        11           38      3.5      0.0      return(stemmed)