Creating Features, Part 2
Previously, we read the data set of Yelp business reviews into R, identified which reviews were about conventional and alternative medical establishments, and removed reviews that couldn’t be validly analyzed (e.g., non-English reviews). Then we started adding features that we can use to predict whether a review is about an establishment of conventional or alternative medicine. The first features that we added were statistics about the reviews, including the number of words, the polarity/sentiment, and the formality. Now we’re going to add another set of features: the usage/occurrence rates of individual words in each review.
For example, if a review said simply, “I like this doctor,” then the usage rate for ‘I’ is 25%, for ‘like’ is 25%, and likewise for ‘this’ and ‘doctor’.
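In base R, that toy calculation looks like this (illustrative only, not part of the pipeline):

toywords <- tolower(unlist(strsplit("I like this doctor", " ")))
table(toywords) / length(toywords)   # each word accounts for 0.25 of the review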
We saved our data thus far to an external file that we can read back into R:
revsall <- readRDS("revsall.rds")
We’ll need the R package ‘tm’:
library(tm)
First, let’s look at what words are commonly used in our data set of reviews. Here’s a function that will count all the words used in the reviews:
corpfreqprep <- function(vectoroftexts, removepunct = TRUE) {
    corpus <- Corpus(VectorSource(vectoroftexts))
    # remove numbers and extra whitespace; optionally remove punctuation
    corpus <- tm_map(corpus, content_transformer(removeNumbers))
    corpus <- tm_map(corpus, content_transformer(stripWhitespace))
    if (removepunct) corpus <- tm_map(corpus, content_transformer(removePunctuation))
    # tally each term's total count across all documents, most frequent first
    corpfreq <- colSums(as.matrix(DocumentTermMatrix(corpus)))
    corpfreq <- sort(corpfreq, decreasing = TRUE)
    return(corpfreq)
}
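As a quick illustration (a toy example, not part of the pipeline), we can run the function on a couple of short strings. Note that ‘DocumentTermMatrix’ by default lowercases the text and drops words shorter than three characters, which is also why very common short words like ‘I’ and ‘to’ won’t show up in the counts below:

# toy check of 'corpfreqprep' on two short reviews (illustrative only)
corpfreqprep(c("I like this doctor", "this doctor is great"))
# expect 'this' and 'doctor' counted twice each; 'I' and 'is' are dropped
# because DocumentTermMatrix's default minimum word length is three characters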
Let’s call the function to count the words:
alltextsfreq <- corpfreqprep(revsall$text, removepunct = TRUE)
How many words did we get?
length(alltextsfreq)
## [1] 35565
What were the most common words? Here are the top 50:
alltextsfreq[1:50]
## the and was for that they
## 78494 74903 32991 25508 21851 18487
## have with you this had not
## 18274 17136 16665 14959 13598 12349
## she but very are office time
## 10882 10668 10127 10007 9042 8424
## staff been all about when her
## 7807 7656 7581 7287 7167 7117
## get out would there back his
## 7042 6797 6627 6554 6373 6214
## were from just great doctor what
## 6040 5889 5817 5791 5715 5595
## one like after your dentist here
## 5491 5462 5327 5218 5111 5098
## care appointment will them has their
## 5075 5012 4988 4761 4626 4520
## because place
## 4393 4360
Some of the words are proper nouns that are specific to our particular data set and won’t generalize well to other data sets. These include ‘Vegas’, ‘Arizona’, ‘Mayo’, and ‘Summerlin’. So let’s eliminate those:
# 331 vegas, 393 yelp, 543 phoenix, 561 las, 660 scottsdale, 753 groupon, 1047 arizona
# 1051 mayo, 1209 chandler, 1217 summerlin, 1331 john, 1333 nevada, 1385 miller
# 1426 gilbert, 1317 north, 1407 west
wordstoremove <- c(331, 393, 543, 561, 660, 753, 1047, 1051, 1209, 1217, 1317,
                   1331, 1333, 1385, 1407, 1426)
alltextsfreq <- alltextsfreq[-wordstoremove]
That leaves us with 35549 words.
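Incidentally, those numeric positions came from inspecting the sorted frequency vector by hand. A programmatic equivalent (a sketch, which would have to run before the removal above) might look like this:

# look up the positions of the data-set-specific words in the sorted vector
localwords <- c("vegas", "yelp", "phoenix", "las", "scottsdale", "groupon",
                "arizona", "mayo", "chandler", "summerlin", "north", "john",
                "nevada", "miller", "west", "gilbert")
which(names(alltextsfreq) %in% localwords)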
Next, most of the words in our data set seldom occur and may not be very useful predictors. At the very least, processing them will require a lot of computational resources for probably little benefit. So let’s limit our analysis to the most common words. We’ll retain the words that appear at least 100 times across all our reviews. The ‘100’ is a fairly arbitrary cut-off.
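To see just how skewed the frequency distribution is, here’s an illustrative check (not part of the original pipeline):

# most words are rare: compare typical counts with the top of the distribution
quantile(alltextsfreq, probs = c(0.50, 0.90, 0.99))
sum(alltextsfreq < 100) / length(alltextsfreq)   # share of words below the cut-off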
Let’s add columns to our data frame for the top words:
# count the words that occur at least 100 times; 'alltextsfreq' is sorted in
# decreasing order, so these are simply its first 'commonwordsnum' entries
commonwordsnum <- sum(alltextsfreq >= 100)
commonwords <- names(alltextsfreq)[1:commonwordsnum]
# append one new (empty) column to the data frame for each common word
revsallcolnum <- dim(revsall)[2]
revsall[(revsallcolnum + 1):(revsallcolnum + commonwordsnum)] <- NA
colnames(revsall)[(revsallcolnum + 1):(revsallcolnum + commonwordsnum)] <- commonwords
That gives us 1462 top words.
For each of the top words in each review, we’ll calculate how many times the word is used as a proportion of the total number of words in that review, and then we’ll add that statistic to our data frame:
# for each row in 'revsall', i.e., for each review
for (iter in 1:dim(revsall)[1]) {
    # if (iter %% 100 == 0) {print(iter)}  # uncomment to track progress
    tempfreq <- corpfreqprep(revsall$text[iter], removepunct = TRUE)
    # if 'corpfreqprep' returns an empty vector, assign a dummy entry to
    # 'tempfreq' to avoid a subsequent error
    # this occurs 8 times in the data set, at these row indices in 'revsall':
    # 1711, 1954, 6760, 8069, 8417, 8486, 9783, 10902
    # the review texts at these row indices are: "Ok|", "|", ".", "|", "c|", "A|", "NT|", ":)|"
    # none of these contain any of the most common words, so the correction
    # below is adequate
    if (length(tempfreq) == 0) {
        tempfreq <- 0
        names(tempfreq) <- ""
    }
    # for each word counted in the review
    for (iter2 in 1:length(tempfreq)) {
        # if the word counted in the review is one of the common words
        if (names(tempfreq)[iter2] %in% commonwords) {
            # find the data-frame column that corresponds to the word
            wordcoltemp <- which(colnames(revsall)[(revsallcolnum + 1):dim(revsall)[2]] ==
                                     names(tempfreq)[iter2])
            wordcol <- wordcoltemp + revsallcolnum
            # save the word's frequency as a proportion of the number of words in the review
            revsall[iter, wordcol] <- tempfreq[iter2] / revsall$n.words[iter]
        }
    }
}
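As an aside, the same proportions could be computed without an explicit loop. The sketch below is an alternative, not the approach used above: it builds a single document-term matrix restricted to the common words via the ‘dictionary’ control option. Note two differences from the loop: words absent from a review come out as 0 here rather than NA, and tm’s tokenizer may count a review’s total words slightly differently than the ‘n.words’ feature computed earlier.

# vectorized sketch: one DocumentTermMatrix over all reviews at once,
# restricted to the common words
dtm <- DocumentTermMatrix(Corpus(VectorSource(revsall$text)),
                          control = list(removeNumbers = TRUE,
                                         removePunctuation = TRUE,
                                         dictionary = commonwords))
# divide each review's word counts by that review's total word count
wordprops <- as.matrix(dtm) / revsall$n.words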
Finally, let’s save those results to an external file:
saveRDS(revsall, file = "revsallwordfreqs.rds")