Creating Features, Part 1
Previously, we read the data set of Yelp business reviews into R, identified which reviews were about conventional and alternative medical establishments, and removed reviews that couldn’t be validly analyzed (e.g., non-English reviews).
We saved those reviews to an external file that we can read back into R:
revs <- readRDS("revsculledprepped.rds")
We’ll need the R package ‘qdap’ again:
library(qdap)
Some of the reviews contain repeated text; that is, the entire review appears twice verbatim. We’ll include a function that deletes the second copy while we’re processing each review:
# This function detects whether a sequence of a vector's elements repeats.
# If it does, the function eliminates the duplicates/repetitions so that
# only one copy of the sequence remains.
# The sequence can extend over any number of vector elements, and any
# number of repetitions will be eliminated.
# The limitation of this function is that it explicitly checks only the
# first repetition. If that repetition matches and the overall length of
# the vector is a multiple of the length of the sequence, it assumes that
# all elements following the first repetition are also repeated sequences.
removerepeats <- function(vectora) {
  # vectors must have at least 2 elements to continue through the function
  if (length(vectora) < 2) {return(vectora)}
  # checks whether the vector's 1st element recurs later in the vector;
  # 'nomatch = 0' distinguishes "no recurrence" from a genuine match at
  # the 2nd element (with 'nomatch = 1', the two cases are conflated)
  firstrepeat <- match(vectora[1], vectora[-1], nomatch = 0)
  # if there is no recurrence, the function exits
  if (firstrepeat == 0) {
    return(vectora)
  # if there is, 'firstrepeat' is adjusted to index the recurrence within
  # 'vectora' itself; thus, 'firstrepeat - 1' is the length of the possible
  # duplicated sequence
  } else {
    firstrepeat <- firstrepeat + 1
  }
  # vector length must be a multiple of the sequence length to continue
  # through function
  lengthremainder <- length(vectora) %% (firstrepeat - 1)
  if (lengthremainder != 0) {
    return(vectora)
  }
  # to be a duplicate, each element of the 2nd sequence must match the
  # corresponding element of the 1st sequence
  repeattest <- rep(FALSE, (firstrepeat - 1))
  for (iter in 1:(firstrepeat - 1)) {
    repeattest[iter] <- (vectora[iter] == vectora[firstrepeat + iter - 1])
  }
  # if any element between the 2 sequences mismatched, exit function
  if (FALSE %in% repeattest) {
    return(vectora)
  # otherwise, the 2nd sequence is a duplicate, and the vector is modified
  # so that only the 1st sequence is retained
  } else {
    vectora <- vectora[1:(firstrepeat - 1)]
    return(vectora)
  }
}
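As a quick sanity check of the function (on made-up vectors, not review text):

```r
# a fully duplicated three-element sequence collapses to one copy:
removerepeats(c("a", "b", "c", "a", "b", "c"))   # returns c("a", "b", "c")
# a vector with no repetition is returned unchanged:
removerepeats(c("a", "b", "c"))                  # returns c("a", "b", "c")
```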
Our next major task is to start creating features that we can use to predict whether a review is about an establishment of conventional or alternative medicine. The package ‘qdap’ lets us calculate a variety of statistics about the reviews, such as the number of words, the polarity/sentiment, and the formality. The qdap documentation includes descriptions of what each qdap function calculates.
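For a quick preview of the kinds of statistics qdap computes, we can run a couple of its functions on a toy, made-up review (not one from the data set):

```r
library(qdap)
# a hypothetical two-sentence review, split into sentences the same way
# we'll split the real reviews below
toy <- data.frame(text = "The doctor was wonderful. I would happily go back!")
toysent <- sentSplit(toy, "text")
polarity(toysent$text)$group   # average sentiment across the sentences
word_stats(toysent$text)$gts   # word/sentence counts and related statistics
```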
First, we’ll define the column names that we’ll use for our features:
wordstatscolnames = c(
# from word_stats function
"all", "n.sent", "n.words", "n.char", "n.syl", "n.poly",
"wps", "cps", "sps", "psps", "cpw", "spw", "pspw",
"n.state", "n.quest", "n.exclm", "n.imper", "n.incom",
"p.state", "p.quest", "p.exclm", "p.imper", "p.incom",
"n.hapax", "n.dis", "grow.rate", "prop.dis",
# from 'pronoun_type' function
"pronoun.word.count", "I", "we", "you", "he", "she",
"they", "it", "me", "us", "him", "her", "them", "their",
# from 'polarity' function
"total.sentences", "total.words", "ave.polarity",
"sd.polarity", "stan.mean.polarity",
# from 'automated_readability_index' function
"read.word.count", "sentence.count", "character.count",
"Automated_Readability_Index",
# from 'diversity' function
"wc", "simpson", "shannon", "collision",
"berger_parker", "brillouin",
# from 'lexical_classification' function
"lexical.word.count", "ave.content.rate", "SE", "n.content",
"n.functional", # "content", "functional",
# from 'formality' function
"formality.word.count", "formality")
Now we’ll initialize a data frame in which to store our results:
# 'prop.dis' is the last column produced by the 'word_stats' function
wstatsendcol = which(wordstatscolnames == "prop.dis")
# number of columns following 'prop.dis'
afterwstatscolnum = length(wordstatscolnames) - wstatsendcol
temparray = rep(NA, length(wordstatscolnames))
temparray = as.data.frame(t(temparray))
colnames(temparray) = wordstatscolnames
Next, we’ll calculate the statistics for each review and store them in our initialized data frame. During the process, we’ll also call our function to remove the extra copies of the repeated reviews.
for (iter in 1:dim(revs)[1]) {
  # initialize 'otherstats', which holds results from functions other than 'word_stats'
  otherstats = rep(NA, afterwstatscolnum)
  otherstats = as.data.frame(t(otherstats))
  # if there is no text (i.e., no alphabetic characters) in the review,
  # create 'wordstats' with 'NA's for all values
  if (grepl("[a-zA-Z]", revs$text[iter]) == FALSE) {
    wordstats = rep(NA, length(wordstatscolnames))
    wordstats = as.data.frame(t(wordstats))
    colnames(wordstats) = wordstatscolnames
  # else fill 'wordstats' with the 'word_stats' function's calculations
  } else {
    tempsplit = sentSplit(revs[iter, ], "text")
    adjrownum = length(removerepeats(tempsplit$text))
    tempsplit = tempsplit[1:adjrownum, ]
    otherstats[1:14] = pronoun_type(tempsplit$text)$prop[ , -1]
    otherstats[15:19] = polarity(tempsplit$text)$group[ , -1]
    otherstats[20:23] = automated_readability_index(tempsplit$text)$Readability[ , -1]
    otherstats[24:29] = diversity(tempsplit$text)[ , -1]
    # 'lexical_classification' and 'formality' produce errors if there's
    # only one word of text in the review, so they are skipped in this case
    if (otherstats[1] > 1) {
      otherstats[30:34] = lexical_classification(tempsplit$text)$lexical_classification[ , 2:6]
      # 'capture.output' is a hacky way of preventing 'formality' from printing its output
      dummy <- capture.output(otherstats[35:36] <- formality(tempsplit$text)$formality[ , -1])
    } else {
      otherstats[30:36] = NA
    }
    wordstats = word_stats(tempsplit$text)$gts
    # if the last column of 'wordstats' is not "prop.dis" (i.e., the last
    # 'word_stats' element of 'wordstatscolnames'), the condition in the
    # 'for' loop below will produce an error; this 'if' statement avoids that
    if (colnames(wordstats)[dim(wordstats)[2]] != wordstatscolnames[wstatsendcol]) {
      wordstats[dim(wordstats)[2] + 1] = NA
      colnames(wordstats)[dim(wordstats)[2]] = wordstatscolnames[wstatsendcol]
    }
    # if a column in 'wordstats' is not present, insert a column with 'NA' as its value
    for (iter2 in 1:wstatsendcol) {
      if (colnames(wordstats)[iter2] != wordstatscolnames[iter2]) {
        wordstats = data.frame(wordstats[1:(iter2 - 1)], NA, wordstats[iter2:dim(wordstats)[2]])
      }
    }
  }
  wordstats[28:63] = otherstats
  colnames(wordstats) = wordstatscolnames
  # convert 'factor' to 'character' so 'review_id' can be added (below)
  wordstats$all = as.character(wordstats$all)
  wordstats$all = revs$review_id[iter]
  temparray = rbind.data.frame(temparray, wordstats)
  # if (iter %% 100 == 0) {print(iter)}
}
Let’s take a look at some results:
## all n.sent n.words n.char n.syl n.poly wps cps
## 1 <NA> NA NA NA NA NA NA NA
## 2 --gIJ5IhuAOJJs-76fklOQ 13 103 418 140 8 7.923 32.154
## 3 --ji515P_ulxMXjK9aw30g 5 84 352 111 8 16.800 70.400
## 4 -_h8TiFplNZAQsOHTG0SZA 10 131 548 179 14 13.100 54.800
## 5 -025RRfiUofqbZsU_Fk7Ow 4 84 362 119 9 21.000 90.500
## 6 -0ujS8TVzMUn08nxyp2Deg 6 65 283 85 2 10.833 47.167
## sps psps cpw spw pspw n.state n.quest n.exclm n.imper n.incom
## 1 NA NA NA NA NA NA NA NA NA NA
## 2 10.769 0.615 4.058 1.359 0.078 2 NA 10 NA 1
## 3 22.200 1.600 4.190 1.321 0.095 4 NA 1 NA NA
## 4 17.900 1.400 4.183 1.366 0.107 10 NA NA NA NA
## 5 29.750 2.250 4.310 1.417 0.107 4 NA NA NA NA
## 6 14.167 0.333 4.354 1.308 0.031 3 NA 3 NA NA
## p.state p.quest p.exclm p.imper p.incom n.hapax n.dis grow.rate prop.dis
## 1 NA NA NA NA NA NA NA NA NA
## 2 0.154 NA 0.769 NA 0.077 54 11 0.529 0.108
## 3 0.800 NA 0.200 NA NA 49 9 0.583 0.107
## 4 1.000 NA NA NA NA 62 18 0.473 0.137
## 5 1.000 NA NA NA NA 53 7 0.639 0.084
## 6 0.500 NA 0.500 NA NA 41 9 0.631 0.138
## pronoun.word.count I we you he she
## 1 NA NA NA NA NA NA
## 2 102 4.901961 0.9803922 0.0000000 0.000000 0.000000
## 3 84 2.380952 0.0000000 2.3809524 0.000000 2.380952
## 4 131 3.053435 0.0000000 0.7633588 3.053435 2.290076
## 5 83 4.819277 0.0000000 0.0000000 0.000000 0.000000
## 6 65 0.000000 0.0000000 3.0769231 0.000000 0.000000
## they it me us him her them their
## 1 NA NA NA NA NA NA NA NA
## 2 3.9215686 0.0000000 0.000000 0 0 0.0000000 0.0000000 0
## 3 0.0000000 0.0000000 1.190476 0 0 1.1904762 0.0000000 0
## 4 0.7633588 0.7633588 2.290076 0 0 0.7633588 0.7633588 0
## 5 1.2048193 0.0000000 2.409639 0 0 0.0000000 0.0000000 0
## 6 1.5384615 0.0000000 0.000000 0 0 0.0000000 0.0000000 0
## total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 NA NA NA NA NA
## 2 13 102 -0.2353054 0.2562831 -0.9181461
## 3 5 84 0.2092823 0.2207562 0.9480247
## 4 10 131 -0.1516716 0.3076372 -0.4930209
## 5 4 83 0.1630107 0.3417488 0.4769897
## 6 6 65 -0.2300099 0.4921409 -0.4673659
## read.word.count sentence.count character.count
## 1 NA NA NA
## 2 102 13 417
## 3 84 5 352
## 4 131 10 548
## 5 83 4 361
## 6 65 6 283
## Automated_Readability_Index wc simpson shannon collision
## 1 NA NA NA NA NA
## 2 1.748665 102 0.9904873 4.134069 3.951628
## 3 6.707143 84 0.9911073 4.029541 3.878027
## 4 4.822901 131 0.9923664 4.351402 4.185874
## 5 9.430663 83 0.9861887 3.967877 3.661531
## 6 4.493282 65 0.9927885 3.881028 3.794898
## berger_parker brillouin lexical.word.count ave.content.rate SE
## 1 NA NA NA NA NA
## 2 0.04901961 3.386615 102 36.27451 4.784061
## 3 0.05952381 3.251561 84 47.61905 5.481987
## 4 0.03816794 3.615584 131 35.87786 4.206739
## 5 0.09638554 3.193740 83 43.37349 5.472870
## 6 0.04615385 3.069550 65 47.69231 6.243340
## n.content n.functional formality.word.count formality
## 1 NA NA NA NA
## 2 37 65 102 46.07843
## 3 40 44 84 51.78571
## 4 47 84 138 44.92754
## 5 36 47 83 53.61446
## 6 31 34 66 56.81818
They look good, except we need to get rid of that first row of NAs and make sure our row names are correct:
temparray = temparray[-1, ]
rownames(temparray) = seq(from = 1, to = dim(temparray)[1])
Now we want to combine our newly calculated statistics with their corresponding reviews and other associated data. Before doing so, we need to check that the review IDs in the two data frames match. They should match, but we want to be sure we didn’t make a mistake, because mismatching our reviews and our predictors would completely invalidate our analysis.
Just to be clear, we want to make sure that these review IDs in our original data frame of reviews:
## [1] "--gIJ5IhuAOJJs-76fklOQ" "--ji515P_ulxMXjK9aw30g"
## [3] "-_h8TiFplNZAQsOHTG0SZA" "-025RRfiUofqbZsU_Fk7Ow"
## [5] "-0ujS8TVzMUn08nxyp2Deg" "-0UPrtlZvYUIZciIzCYZYQ"
match these in our new data frame with the statistics:
## [1] "--gIJ5IhuAOJJs-76fklOQ" "--ji515P_ulxMXjK9aw30g"
## [3] "-_h8TiFplNZAQsOHTG0SZA" "-025RRfiUofqbZsU_Fk7Ow"
## [5] "-0ujS8TVzMUn08nxyp2Deg" "-0UPrtlZvYUIZciIzCYZYQ"
So we’ll create a table of the comparisons, and all the results should be ‘TRUE’:
table(revs$review_id == temparray$all)
##
## TRUE
## 15761
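As an alternative to inspecting the table by eye, we could have the script halt itself on any mismatch (a sketch; the `as.character` calls guard against one column being stored as a factor and the other as character):

```r
# stop with an error if any review ID in 'temparray' fails to match
# the corresponding ID in 'revs'
stopifnot(all(as.character(revs$review_id) == as.character(temparray$all)))
```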
All the review IDs match, so we’ll combine the two data frames and save the result to an external file:
revsall = cbind(revs, temparray)
saveRDS(revsall, file = "revsall.rds")