Modifying or Removing Features
Previously, we read the data set of Yelp business reviews into R, identified which reviews were about conventional and alternative medical establishments, and removed reviews that couldn’t be validly analyzed (e.g., non-English reviews). Then we added features that we can use to predict whether a review is about an establishment of conventional or alternative medicine.
Some of our features are redundant or will need modification before we use them as predictors in our analysis.
We saved our data thus far to an external file that we can read back into R:
expdata <- readRDS("revsallwordfreqs.rds")
As is common in popular internet forums, the reviewers of the medical establishments on the Yelp website often use informal language and grammar. One consequence is that statistics based on the numbers of sentences tend to be unreliable. Look at these two reviews as examples:
## [1] "More then Comeptent. Trustworthy. Affordable. Looks to the best interest of the patients first. All in all, your best choice in the area."
## [2] "I have been coming to this practice for years. It is a true family practice, however, they are progressive and they do everything in-house, they have their own x-rays, labs, chiropractor and physical therapy. I have used almost all of their services without any complaints. They are thorough and take the time to discuss all your needs. I personal use Dr. Becky, the PA, but I have also seen Bruce."
The first review includes some sentence fragments – even one-word sentence fragments – that are counted as sentences by our text analysis R packages. The second review includes a run-on sentence that is counted as a single sentence.
Since the sentence counts are unreliable, we’ll eliminate statistics based on such counts from our features.
First, we’ll adjust our polarity statistic. The function from the R package ‘qdap’ that we used to calculate polarity provided us the average polarity per sentence in each review. Since sentence counts for our reviews are unreliable, we’ll re-calculate the average polarity per word instead of per sentence:
# the per-sentence average times the sentence count recovers each review's
# total polarity, which we then divide by its word count
expdata$ave.polarity <- (expdata$ave.polarity * expdata$total.sentences) / expdata$total.words
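To see why this recovers a per-word average, consider a hypothetical review (the numbers here are made up for illustration):

```r
# a hypothetical review with 2 sentences and 20 words
ave.per.sentence <- 0.4   # average polarity per sentence, as qdap reports it
total.sentences  <- 2
total.words      <- 20

# multiplying back by the sentence count gives the review's total polarity
# (0.8); dividing by the word count re-averages it per word
(ave.per.sentence * total.sentences) / total.words   # 0.04
```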
We’ll also replace any NAs in our data frame with zeroes:
expdata[is.na(expdata)] <- 0
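This indexing trick works because `is.na()` on a data frame returns a logical matrix; a minimal illustration on a toy data frame:

```r
# replace every NA in a data frame with 0
df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))
df[is.na(df)] <- 0   # df$a is now 1 0 3; df$b is now 0 5 6
```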
Now let’s figure out which features we can eliminate from our data set.
We have several features that all calculate the number of words in each review. However, they don’t all completely agree with one another. Let’s see how well they correlate with each other:
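A matrix like the one below can be produced with `cor()` on the word-count columns. Since the call itself isn't shown, here is a self-contained sketch on synthetic counts; the real column names, taken from the printed output, appear in the comment:

```r
# with the real data this would be, e.g.:
#   cor(expdata[ , c("n.words", "pronoun.word.count", "total.words",
#                    "read.word.count", "wc", "lexical.word.count")])
set.seed(42)
n.words     <- rpois(200, lambda = 90)        # synthetic review word counts
total.words <- n.words + rbinom(200, 3, 0.2)  # a near-duplicate count
cor(cbind(n.words, total.words))              # off-diagonal close to 1
```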
## n.words pronoun.word.count total.words
## n.words 1.0000000 0.9999027 0.9998567
## pronoun.word.count 0.9999027 1.0000000 0.9999565
## total.words 0.9998567 0.9999565 1.0000000
## read.word.count 0.9999027 1.0000000 0.9999565
## wc 0.9999027 1.0000000 0.9999565
## lexical.word.count 0.9999027 1.0000000 0.9999565
## read.word.count wc lexical.word.count
## n.words 0.9999027 0.9999027 0.9999027
## pronoun.word.count 1.0000000 1.0000000 1.0000000
## total.words 0.9999565 0.9999565 0.9999565
## read.word.count 1.0000000 1.0000000 1.0000000
## wc 1.0000000 1.0000000 1.0000000
## lexical.word.count 1.0000000 1.0000000 1.0000000
Even though the correlations are not all perfect (i.e., equal to 1), they’re high enough that these word counts are effectively interchangeable, so we can retain just one and delete the others.
Similarly, some of the diversity/entropy measurements are highly correlated with one another:
## simpson shannon collision berger_parker brillouin
## simpson 1.0000000 0.1784831 0.2205959 -0.5255727 0.1269855
## shannon 0.1784831 1.0000000 0.9822904 -0.6191851 0.9936292
## collision 0.2205959 0.9822904 1.0000000 -0.6964112 0.9592996
## berger_parker -0.5255727 -0.6191851 -0.6964112 1.0000000 -0.5595345
## brillouin 0.1269855 0.9936292 0.9592996 -0.5595345 1.0000000
Specifically, the Shannon, Collision, and Brillouin indices correlate highly, so we’ll retain the Shannon index and delete the other two.
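For reference, the Shannon index of a review is the entropy of its word-frequency distribution. The five indices above were likely computed with qdap’s `diversity()`; the helper below is a minimal sketch of the idea, not the package’s implementation:

```r
# Shannon diversity: entropy of the relative word frequencies
shannon <- function(words) {
  p <- table(words) / length(words)   # relative frequency of each distinct word
  -sum(p * log(p))                    # natural-log entropy
}
shannon(c("good", "doctor", "good", "visit"))   # ~1.04: three distinct words
```

The more evenly a review spreads its words over a larger vocabulary, the higher the index.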
Below is the compilation of features that we can eliminate, including the unreliable sentence counts, extra word counts, extra diversity indices, and a few other redundancies.
Before we eliminate those features, we have 1534 columns in our data frame.
# delete 37, 52, 56, 60, 66: redundant columns of word counts
# delete 9, review text
# delete column 10, duplicate of 'review_id'
# delete 11 'n.sent': number of sentences is often incorrect
# delete 16:19: all 'per sentence' stats, which are often incorrect
# delete 28:32: proportions of all sentences, which are often incorrect
# delete 33:36: unsure how unique words would be useful; deleting
# delete 54:55: polarity standard deviation is by sentence count, which is often incorrect
# delete 56:59: readability relies on sentence counts, which are often incorrect; can I calculate something useful from this?
# delete 63, 65: shannon, collision, brillouin correlate very highly; keep shannon
# delete 66, 68:70: 'ave.content.rate' (col 67) is good (expressed as %); other cols unnecessary
# delete 71: 'formality' (col 72) is good (expressed as %); 71 is unnecessary
cullcols = c(9:11, 16:19, 28:32, 33:37, 51:52, 54:60, 63, 65:66, 68:71)
expdata = expdata[ , -cullcols]
After eliminating those features, we have 1501 columns in our data frame.
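As a quick sanity check, the number of unique flagged columns should equal the drop in `ncol()` (the `cullcols` definition is repeated here so the check is self-contained):

```r
cullcols <- c(9:11, 16:19, 28:32, 33:37, 51:52, 54:60, 63, 65:66, 68:71)
length(unique(cullcols))   # 33, matching 1534 - 1501
```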
Let’s save our data to an external file:
saveRDS(expdata, file = "revsfeaturesculled.rds")