Creating Features, Part 2 – Andrew Fairless, Ph.D.

Previously, we read the data set of Yelp business reviews into R, identified which reviews were about conventional and alternative medical establishments, and removed reviews that couldn’t be validly analyzed (e.g., non-English reviews). Then we started adding features that we can use to predict whether a review is about an establishment of conventional or alternative medicine. The first set of features that we added were statistics about the reviews, including the number of words, the polarity/sentiment, and the formality. Now we’re going to add another set of features: the usage/occurrence rates of individual words for each review.

For example, if a review said simply, “I like this doctor,” then the usage rate for ‘I’ is 25%, for ‘like’ is 25%, and likewise for ‘this’ and ‘doctor’.
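
To make the arithmetic concrete, here is a minimal base-R sketch of the usage-rate calculation for that one-sentence review. The variable names (`review`, `words`, `usage`) are illustrative only and are not part of the analysis code that follows.

```r
# Usage rate of each word in a single toy review (base R only).
review <- "I like this doctor"
words <- strsplit(tolower(review), "\\s+")[[1]]   # split on whitespace
counts <- table(words)                            # occurrences of each word
# usage rate = occurrences of the word / total number of words in the review
usage <- setNames(as.numeric(counts), names(counts)) / length(words)
print(usage)
```

Each of the four words occurs once, so each usage rate is 1/4.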

We saved our data thus far to an external file that we can read back into R:

revsall <- readRDS("revsall.rds")

We’ll need the R package ‘tm’:

library(tm)

First, let’s look at what words are commonly used in our data set of reviews. Here’s a function that will count all the words used in the reviews:

corpfreqprep <- function(vectoroftexts, removepunct = TRUE) {
     corpus <- Corpus(VectorSource(vectoroftexts))
     corpus <- tm_map(corpus, content_transformer(removeNumbers))
     corpus <- tm_map(corpus, content_transformer(stripWhitespace))
     if (removepunct) corpus <- tm_map(corpus, content_transformer(removePunctuation))
     corpfreq <- colSums(as.matrix(DocumentTermMatrix(corpus)))
     corpfreq <- sort(corpfreq, decreasing = TRUE)
     return(corpfreq)
}
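
To show roughly what this function does without needing 'tm', here is a base-R approximation run on two toy sentences. It is only a sketch: `countwords_sketch` and its `minchars` argument are hypothetical names, and the steps (lowercasing, dropping words shorter than three characters) are assumptions based on the documented defaults of `DocumentTermMatrix`, which are consistent with the absence of one- and two-letter words in the counts shown later.

```r
# Hypothetical base-R approximation of corpfreqprep(); not the tm implementation.
countwords_sketch <- function(vectoroftexts, minchars = 3) {
  txt <- tolower(vectoroftexts)
  txt <- gsub("[[:digit:]]+", "", txt)       # drop numbers
  txt <- gsub("[[:punct:]]+", "", txt)       # drop punctuation
  words <- unlist(strsplit(txt, "[[:space:]]+"))
  words <- words[nchar(words) >= minchars]   # also discards empty strings
  sort(table(words), decreasing = TRUE)      # most frequent words first
}

toyfreq <- countwords_sketch(c("The staff is great!", "Great doctor, great staff."))
print(toyfreq)
```

In this toy input, 'great' is counted three times and 'staff' twice, while the two-letter word 'is' is dropped.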

Let’s call the function to count the words:

alltextsfreq <- corpfreqprep(revsall$text, removepunct = TRUE)

How many words did we get?

length(alltextsfreq)

## [1] 35565

What were the most common words? Here are the top 50:

alltextsfreq[1:50]

##         the         and         was         for        that        they 
##       78494       74903       32991       25508       21851       18487 
##        have        with         you        this         had         not 
##       18274       17136       16665       14959       13598       12349 
##         she         but        very         are      office        time 
##       10882       10668       10127       10007        9042        8424 
##       staff        been         all       about        when         her 
##        7807        7656        7581        7287        7167        7117 
##         get         out       would       there        back         his 
##        7042        6797        6627        6554        6373        6214 
##        were        from        just       great      doctor        what 
##        6040        5889        5817        5791        5715        5595 
##         one        like       after        your     dentist        here 
##        5491        5462        5327        5218        5111        5098 
##        care appointment        will        them         has       their 
##        5075        5012        4988        4761        4626        4520 
##     because       place 
##        4393        4360

Some of the words are proper nouns that are specific to our particular data set and won’t generalize well to other data sets. These include ‘Vegas’, ‘Arizona’, ‘Mayo’, and ‘Summerlin’. So let’s eliminate those:

# positions in the sorted 'alltextsfreq' of the words to remove:
# 331 vegas, 393 yelp, 543 phoenix, 561 las, 660 scottsdale, 753 groupon, 1047 arizona
# 1051 mayo, 1209 chandler, 1217 summerlin, 1317 north, 1331 john, 1333 nevada
# 1385 miller, 1407 west, 1426 gilbert
wordstoremove <- c(331, 393, 543, 561, 660, 753, 1047, 1051, 1209, 1217, 1317,
                   1331, 1333, 1385, 1407, 1426)
alltextsfreq <- alltextsfreq[-wordstoremove]
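
Removing words by position works only as long as the sort order of the counts never changes. As a hedged alternative, the sketch below removes words by name instead. The toy vector and the short `namestoremove` list are made up for illustration and stand in for 'alltextsfreq' and the full list of sixteen words.

```r
# Toy stand-in for 'alltextsfreq': a named vector of counts, sorted descending.
toyfreq <- c(the = 50, office = 20, vegas = 9, doctor = 8, yelp = 5, arizona = 2)
# Hypothetical removal list; the real one would name all sixteen words.
namestoremove <- c("vegas", "yelp", "arizona")
# keep only the entries whose names are not in the removal list
toyfreq <- toyfreq[!(names(toyfreq) %in% namestoremove)]
print(names(toyfreq))
```

Because the selection is by name, it gives the same result even if the data set is updated and the positions shift.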

That leaves us with 35549 words.

Next, most of the words in our data set seldom occur and may not be very useful predictors. At the very least, processing them will require a lot of computational resources for probably little benefit. So let’s limit our analysis to the most common words. We’ll retain the words that appear at least 100 times across all our reviews. The ‘100’ is a fairly arbitrary cut-off.

Let’s add columns to our data frame for the top words:

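# count the words that occur at least 100 times; 'alltextsfreq' is sorted in
#    decreasing order, so these are the first 'commonwordsnum' entries
commonwordsnum <- sum(alltextsfreq >= 100)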
commonwords <- names(alltextsfreq)[1:commonwordsnum]

revsallcolnum <- dim(revsall)[2]
revsall[(revsallcolnum + 1):(revsallcolnum + commonwordsnum)] <- NA
colnames(revsall)[(revsallcolnum + 1):(revsallcolnum + commonwordsnum)] <- commonwords

That gives us 1462 top words.

For each of the top words in each review, we’ll calculate the word’s usage rate, i.e., the number of times the word appears divided by the total number of words in that review, and then we’ll add that statistic to our data frame:

# for each row in 'revsall', i.e., for each review
for (iter in 1:dim(revsall)[1]) {
     # if (iter %% 100 == 0) {print(iter)}
     tempfreq <- corpfreqprep(revsall$text[iter], removepunct = TRUE)
     # if 'corpfreqprep' returns 'logical(0)' to 'tempfreq', assign a placeholder
     #    count of 0 with the name "" to 'tempfreq' to avoid a subsequent error
     # this error occurs 8 times in the data set at these row indices in 'revsall':
     #    1711, 1954, 6760, 8069, 8417, 8486, 9783, 10902
     # the review texts at these row indices are:  "Ok|", "|", ".", "|", "c|", "A|", "NT|", ":)|"
     # these errors do not affect the totals for the most common words, so the
     #    correction below is adequate
     if (length(tempfreq) == 0) {
          tempfreq <- 0
          names(tempfreq) <- ""
     }
     # for each word counted in a review
     for (iter2 in 1:length(tempfreq)) {
          # if the word counted in the review is in the common words
          if (names(tempfreq)[iter2] %in% commonwords) {
               wordcoltemp <- which(colnames(revsall)[(revsallcolnum + 1):dim(revsall)[2]] ==
                                    names(tempfreq)[iter2])
               wordcol <- wordcoltemp + revsallcolnum
               # save the frequency of the word as a proportion of the number of words in the review
               revsall[iter, wordcol] <- tempfreq[iter2] / revsall$n.words[iter]
          }
     }
}
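
For readers who want to try the same idea on a small example, here is a simplified base-R sketch of the per-review calculation. It is not the code used above: `usagerates_sketch` is a hypothetical helper, it approximates the 'tm' tokenization with regular expressions, it divides by its own word count rather than by `revsall$n.words`, and it records 0 (not `NA`) for words that a review doesn't contain.

```r
# Hypothetical helper: usage rates of 'vocab' words for each text (base R only).
usagerates_sketch <- function(texts, vocab) {
  # one row per text, one column per vocabulary word, initialized to 0
  rates <- matrix(0, nrow = length(texts), ncol = length(vocab),
                  dimnames = list(NULL, vocab))
  for (i in seq_along(texts)) {
    txt <- gsub("[[:digit:][:punct:]]+", "", tolower(texts[i]))
    words <- unlist(strsplit(txt, "[[:space:]]+"))
    words <- words[nchar(words) > 0]
    if (length(words) == 0) next            # e.g., a review that is only "."
    counts <- table(words)
    hits <- intersect(names(counts), vocab) # vocabulary words used in this text
    rates[i, hits] <- as.numeric(counts[hits]) / length(words)
  }
  rates
}

toyrates <- usagerates_sketch(c("I like this doctor", "Great great staff", "."),
                              vocab = c("like", "doctor", "great", "staff"))
print(toyrates)
```

The third toy review contains no words after cleaning, which mirrors the eight empty reviews handled by the placeholder in the loop above.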

Finally, let’s save those results to an external file:

saveRDS(revsall, file = "revsallwordfreqs.rds")