Creating Features, Part 1
Previously, we read the data set of Yelp business reviews into R, identified which reviews were about conventional and alternative medical establishments, and removed reviews that couldn’t be validly analyzed (e.g., non-English reviews).
We saved those reviews to an external file that we can read back into R:
revs <- readRDS("revsculledprepped.rds")
We’ll need the R package ‘qdap’ again:
library(qdap)
Some of the reviews contain repeated text; that is, the entire review appears twice verbatim. We’ll include a function that deletes the second copy while we’re processing each review:
# This function detects whether a sequence of a vector's elements repeats.
# If it does, the function eliminates the duplicates/repetitions so that
# only one copy of the sequence remains.
# The sequence can extend over any number of vector elements, and any
# number of repetitions will be eliminated.
# The limitation of this function is that it explicitly checks only the
# first repetition. If that repetition matches and the overall length of
# the vector is a multiple of the length of the sequence, it assumes that
# all elements following the first repetition are also repeated sequences.
removerepeats <- function(vectora) {
  # vectors must have at least 2 elements to continue through the function
  if (length(vectora) < 2) {return(vectora)}
  # checks whether the vector's 1st element recurs later in the vector;
  # 'nomatch = 0' distinguishes "no recurrence" from a genuine match at
  # the 2nd element (with 'nomatch = 1', the two cases are conflated)
  firstrepeat <- match(vectora[1], vectora[-1], nomatch = 0)
  # if there is no recurrence, the function exits
  if (firstrepeat == 0) {
    return(vectora)
  # if there is, 'firstrepeat' is adjusted to index the recurrence within
  # 'vectora' itself; thus, 'firstrepeat - 1' is the length of the possible
  # duplicated sequence
  } else {
    firstrepeat <- firstrepeat + 1
  }
  # vector length must be a multiple of the sequence length to continue
  # through function
  lengthremainder <- length(vectora) %% (firstrepeat - 1)
  if (lengthremainder != 0) {
    return(vectora)
  }
  # to be a duplicate, each element of the 2nd sequence must match the
  # corresponding element of the 1st sequence
  repeattest <- rep(FALSE, (firstrepeat - 1))
  for (iter in 1:(firstrepeat - 1)) {
    repeattest[iter] <- (vectora[iter] == vectora[firstrepeat + iter - 1])
  }
  # if any element between the 2 sequences mismatched, exit function
  if (FALSE %in% repeattest) {
    return(vectora)
  # otherwise, the 2nd sequence is a duplicate, and the vector is modified
  # so that only the 1st sequence is retained
  } else {
    vectora <- vectora[1:(firstrepeat - 1)]
    return(vectora)
  }
}
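As a quick sanity check of the function (on made-up vectors, not review text):

```r
# a fully duplicated three-element sequence collapses to one copy:
removerepeats(c("a", "b", "c", "a", "b", "c"))   # returns c("a", "b", "c")
# a vector with no repetition is returned unchanged:
removerepeats(c("a", "b", "c"))                  # returns c("a", "b", "c")
```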
Our next major task is to start creating features that we can use to predict whether a review is about an establishment of conventional or alternative medicine. The package ‘qdap’ lets us calculate a variety of statistics about the reviews, such as the number of words, the polarity/sentiment, and the formality. The qdap documentation includes descriptions of what each qdap function calculates.
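For a quick preview of the kinds of statistics qdap computes, we can run a couple of its functions on a toy, made-up review (not one from the data set):

```r
library(qdap)
# a hypothetical two-sentence review, split into sentences the same way
# we'll split the real reviews below
toy <- data.frame(text = "The doctor was wonderful. I would happily go back!")
toysent <- sentSplit(toy, "text")
polarity(toysent$text)$group   # average sentiment across the sentences
word_stats(toysent$text)$gts   # word/sentence counts and related statistics
```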
First, we’ll define the column names that we’ll use for our features:
wordstatscolnames = c(
# from word_stats function
"all", "n.sent", "n.words", "n.char", "n.syl", "n.poly",
"wps", "cps", "sps", "psps", "cpw", "spw", "pspw",
"n.state", "n.quest", "n.exclm", "n.imper", "n.incom",
"p.state", "p.quest", "p.exclm", "p.imper", "p.incom",
"n.hapax", "n.dis", "grow.rate", "prop.dis",
# from 'pronoun_type' function
"pronoun.word.count", "I", "we", "you", "he", "she",
"they", "it", "me", "us", "him", "her", "them", "their",
# from 'polarity' function
"total.sentences", "total.words", "ave.polarity",
"sd.polarity", "stan.mean.polarity",
# from 'automated_readability_index' function
"read.word.count", "sentence.count", "character.count",
"Automated_Readability_Index",
# from 'diversity' function
"wc", "simpson", "shannon", "collision",
"berger_parker", "brillouin",
# from 'lexical_classification' function
"lexical.word.count", "ave.content.rate", "SE", "n.content",
"n.functional", # "content", "functional",
# from 'formality' function
"formality.word.count", "formality")
Now we’ll initialize a data frame in which to store our results:
# 'prop.dis' is the last column produced by the 'word_stats' function
wstatsendcol = which(wordstatscolnames == "prop.dis")
# number of columns following 'prop.dis'
afterwstatscolnum = length(wordstatscolnames) - wstatsendcol
temparray = rep(NA, length(wordstatscolnames))
temparray = as.data.frame(t(temparray))
colnames(temparray) = wordstatscolnames
Next, we’ll calculate the statistics for each review and store them in our initialized data frame. During the process, we’ll also call our function to remove the extra copies of the repeated reviews.
for (iter in 1:dim(revs)[1]) {
  # initialize 'otherstats', which holds results from functions other than 'word_stats'
  otherstats = rep(NA, afterwstatscolnum)
  otherstats = as.data.frame(t(otherstats))
  # if there is no text (i.e., no alphabetic characters) in the review,
  # create 'wordstats' with 'NA's for all values
  if (grepl("[a-zA-Z]", revs$text[iter]) == FALSE) {
    wordstats = rep(NA, length(wordstatscolnames))
    wordstats = as.data.frame(t(wordstats))
    colnames(wordstats) = wordstatscolnames
  # else fill 'wordstats' with the 'word_stats' function's calculations
  } else {
    tempsplit = sentSplit(revs[iter, ], "text")
    adjrownum = length(removerepeats(tempsplit$text))
    tempsplit = tempsplit[1:adjrownum, ]
    otherstats[1:14] = pronoun_type(tempsplit$text)$prop[ , -1]
    otherstats[15:19] = polarity(tempsplit$text)$group[ , -1]
    otherstats[20:23] = automated_readability_index(tempsplit$text)$Readability[ , -1]
    otherstats[24:29] = diversity(tempsplit$text)[ , -1]
    # 'lexical_classification' and 'formality' produce errors if there's
    # only one word of text in the review, so they are skipped in this case
    if (otherstats[1] > 1) {
      otherstats[30:34] = lexical_classification(tempsplit$text)$lexical_classification[ , 2:6]
      # 'capture.output' is a hacky way of preventing 'formality' from printing its output
      dummy <- capture.output(otherstats[35:36] <- formality(tempsplit$text)$formality[ , -1])
    } else {
      otherstats[30:36] = NA
    }
    wordstats = word_stats(tempsplit$text)$gts
    # if the last column of 'wordstats' is not "prop.dis" (i.e., the last
    # 'word_stats' element of 'wordstatscolnames'), the condition in the
    # 'for' loop below will produce an error; this 'if' statement avoids that
    if (colnames(wordstats)[dim(wordstats)[2]] != wordstatscolnames[wstatsendcol]) {
      wordstats[dim(wordstats)[2] + 1] = NA
      colnames(wordstats)[dim(wordstats)[2]] = wordstatscolnames[wstatsendcol]
    }
    # if a column in 'wordstats' is not present, insert a column with 'NA' as its value
    for (iter2 in 1:wstatsendcol) {
      if (colnames(wordstats)[iter2] != wordstatscolnames[iter2]) {
        wordstats = data.frame(wordstats[1:(iter2 - 1)], NA, wordstats[iter2:dim(wordstats)[2]])
      }
    }
  }
  wordstats[28:63] = otherstats
  colnames(wordstats) = wordstatscolnames
  # convert 'factor' to 'character' so 'review_id' can be added (below)
  wordstats$all = as.character(wordstats$all)
  wordstats$all = revs$review_id[iter]
  temparray = rbind.data.frame(temparray, wordstats)
  # if (iter %% 100 == 0) {print(iter)}
}
Let’s take a look at some results:
## all n.sent n.words n.char n.syl n.poly wps cps
## 1 <NA> NA NA NA NA NA NA NA
## 2 --gIJ5IhuAOJJs-76fklOQ 13 103 418 140 8 7.923 32.154
## 3 --ji515P_ulxMXjK9aw30g 5 84 352 111 8 16.800 70.400
## 4 -_h8TiFplNZAQsOHTG0SZA 10 131 548 179 14 13.100 54.800
## 5 -025RRfiUofqbZsU_Fk7Ow 4 84 362 119 9 21.000 90.500
## 6 -0ujS8TVzMUn08nxyp2Deg 6 65 283 85 2 10.833 47.167
## sps psps cpw spw pspw n.state n.quest n.exclm n.imper n.incom
## 1 NA NA NA NA NA NA NA NA NA NA
## 2 10.769 0.615 4.058 1.359 0.078 2 NA 10 NA 1
## 3 22.200 1.600 4.190 1.321 0.095 4 NA 1 NA NA
## 4 17.900 1.400 4.183 1.366 0.107 10 NA NA NA NA
## 5 29.750 2.250 4.310 1.417 0.107 4 NA NA NA NA
## 6 14.167 0.333 4.354 1.308 0.031 3 NA 3 NA NA
## p.state p.quest p.exclm p.imper p.incom n.hapax n.dis grow.rate prop.dis
## 1 NA NA NA NA NA NA NA NA NA
## 2 0.154 NA 0.769 NA 0.077 54 11 0.529 0.108
## 3 0.800 NA 0.200 NA NA 49 9 0.583 0.107
## 4 1.000 NA NA NA NA 62 18 0.473 0.137
## 5 1.000 NA NA NA NA 53 7 0.639 0.084
## 6 0.500 NA 0.500 NA NA 41 9 0.631 0.138
## pronoun.word.count I we you he she
## 1 NA NA NA NA NA NA
## 2 102 4.901961 0.9803922 0.0000000 0.000000 0.000000
## 3 84 2.380952 0.0000000 2.3809524 0.000000 2.380952
## 4 131 3.053435 0.0000000 0.7633588 3.053435 2.290076
## 5 83 4.819277 0.0000000 0.0000000 0.000000 0.000000
## 6 65 0.000000 0.0000000 3.0769231 0.000000 0.000000
## they it me us him her them their
## 1 NA NA NA NA NA NA NA NA
## 2 3.9215686 0.0000000 0.000000 0 0 0.0000000 0.0000000 0
## 3 0.0000000 0.0000000 1.190476 0 0 1.1904762 0.0000000 0
## 4 0.7633588 0.7633588 2.290076 0 0 0.7633588 0.7633588 0
## 5 1.2048193 0.0000000 2.409639 0 0 0.0000000 0.0000000 0
## 6 1.5384615 0.0000000 0.000000 0 0 0.0000000 0.0000000 0
## total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 NA NA NA NA NA
## 2 13 102 -0.2353054 0.2562831 -0.9181461
## 3 5 84 0.2092823 0.2207562 0.9480247
## 4 10 131 -0.1516716 0.3076372 -0.4930209
## 5 4 83 0.1630107 0.3417488 0.4769897
## 6 6 65 -0.2300099 0.4921409 -0.4673659
## read.word.count sentence.count character.count
## 1 NA NA NA
## 2 102 13 417
## 3 84 5 352
## 4 131 10 548
## 5 83 4 361
## 6 65 6 283
## Automated_Readability_Index wc simpson shannon collision
## 1 NA NA NA NA NA
## 2 1.748665 102 0.9904873 4.134069 3.951628
## 3 6.707143 84 0.9911073 4.029541 3.878027
## 4 4.822901 131 0.9923664 4.351402 4.185874
## 5 9.430663 83 0.9861887 3.967877 3.661531
## 6 4.493282 65 0.9927885 3.881028 3.794898
## berger_parker brillouin lexical.word.count ave.content.rate SE
## 1 NA NA NA NA NA
## 2 0.04901961 3.386615 102 36.27451 4.784061
## 3 0.05952381 3.251561 84 47.61905 5.481987
## 4 0.03816794 3.615584 131 35.87786 4.206739
## 5 0.09638554 3.193740 83 43.37349 5.472870
## 6 0.04615385 3.069550 65 47.69231 6.243340
## n.content n.functional formality.word.count formality
## 1 NA NA NA NA
## 2 37 65 102 46.07843
## 3 40 44 84 51.78571
## 4 47 84 138 44.92754
## 5 36 47 83 53.61446
## 6 31 34 66 56.81818
They look good, except we need to get rid of that first row of NAs and make sure our row names are correct:
temparray = temparray[-1, ]
rownames(temparray) = seq(from = 1, to = dim(temparray)[1])
Now we want to combine our newly calculated statistics with their corresponding reviews and other associated data. Before doing so, we need to check that the review IDs in the two data frames match. They should match, but we want to be sure we didn’t make a mistake, because mismatching our reviews and our predictors would completely invalidate our analysis.
Just to be clear, we want to make sure that these review IDs in our original data frame of reviews:
## [1] "--gIJ5IhuAOJJs-76fklOQ" "--ji515P_ulxMXjK9aw30g"
## [3] "-_h8TiFplNZAQsOHTG0SZA" "-025RRfiUofqbZsU_Fk7Ow"
## [5] "-0ujS8TVzMUn08nxyp2Deg" "-0UPrtlZvYUIZciIzCYZYQ"
match these in our new data frame with the statistics:
## [1] "--gIJ5IhuAOJJs-76fklOQ" "--ji515P_ulxMXjK9aw30g"
## [3] "-_h8TiFplNZAQsOHTG0SZA" "-025RRfiUofqbZsU_Fk7Ow"
## [5] "-0ujS8TVzMUn08nxyp2Deg" "-0UPrtlZvYUIZciIzCYZYQ"
So we’ll create a table of the comparisons, and all the results should be ‘TRUE’:
table(revs$review_id == temparray$all)
##
## TRUE
## 15761
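As an alternative to inspecting the table by eye, we could have the script halt itself on any mismatch (a sketch; the `as.character` calls guard against one column being stored as a factor and the other as character):

```r
# stop with an error if any review ID in 'temparray' fails to match
# the corresponding ID in 'revs'
stopifnot(all(as.character(revs$review_id) == as.character(temparray$all)))
```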
All the review IDs match, so we’ll combine the two data frames and save the result to an external file:
revsall = cbind(revs, temparray)
saveRDS(revsall, file = "revsall.rds")