Feature Importance – Andrew Fairless, Ph.D.

Previously, we read the data set of Yelp business reviews into R, identified which reviews were about conventional and alternative medical establishments, and removed reviews that couldn’t be validly analyzed (e.g., non-English reviews). Then we added features, split our data into training and testing sets, and trained a random forest model to predict whether a review is about an establishment of conventional or alternative medicine.

Most recently, we found that the model is performing rather well. It’s certainly classifying the reviews correctly at a level far higher than chance. Its features, or predictor variables, must contain information about how patients’ experiences with conventional and alternative medicine differ.

Let’s explore those differences.

We’ll need the R package ‘randomForest’.

library(randomForest)

We saved our random forest model to an external file that we can read back into R:

rfmodel <- readRDS('random_forest_model.rds')

Feature importance: Top features

A handy advantage of using random forest as our classification algorithm is that we can easily find out how important each feature is in making classifications.

Let’s plot the top features:
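The plotting call itself isn’t reproduced here. As a minimal sketch of how such a plot can be produced with the ‘randomForest’ package, the example below trains a small stand-in model on R’s built-in iris data (an assumption made only so that the example runs on its own; `rf_demo` is a hypothetical name) and then draws both importance measures.

```r
library(randomForest)

set.seed(1)
# Stand-in model: iris replaces the Yelp review features used in this post.
# importance = TRUE is needed so that the permutation measure is computed.
rf_demo <- randomForest(Species ~ ., data = iris, ntree = 200, importance = TRUE)

# Dot chart of both measures: Mean Decrease in Accuracy on the left,
# Mean Decrease in Gini on the right; n.var caps how many features are shown
varImpPlot(rf_demo, n.var = 4, main = "Feature importance (iris stand-in)")

# The numbers behind the plot, one row per feature
imp <- importance(rf_demo)
print(round(imp[, c("MeanDecreaseAccuracy", "MeanDecreaseGini")], 2))

# Rank of each feature by Mean Decrease in Gini (1 = most important)
gini_rank <- rank(-imp[, "MeanDecreaseGini"], ties.method = "first")
print(gini_rank)
```

With our saved model, the equivalent call would presumably be `varImpPlot(rfmodel)`, provided the model was trained with `importance = TRUE`.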

The plot above shows our features ranked by two different methods. On the left is the Mean Decrease in Accuracy, or the Permutation Importance. For this method, the values of one feature at a time are randomly reassigned, or permuted, across cases. The permuted feature and all the other, non-permuted features are then used to make predictions, and the classification error is measured. If the error rises a lot, that is, if the classification accuracy falls a lot, then the permuted variable must have been really important for accurate predictions.
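To make the permutation idea concrete, here is a rough sketch in base R. It is a simplification, not what ‘randomForest’ does internally: it uses made-up data, a logistic regression as a stand-in classifier, and the training data rather than each tree’s out-of-bag cases. The names `toy`, `signal`, `noise`, and `accuracy` are all hypothetical.

```r
set.seed(42)
n <- 500
# Toy data: 'signal' drives the class label, 'noise' is unrelated to it
toy <- data.frame(signal = rnorm(n), noise = rnorm(n))
toy$label <- as.integer(toy$signal + rnorm(n, sd = 0.5) > 0)

# Stand-in classifier (logistic regression rather than a random forest)
fit <- glm(label ~ signal + noise, data = toy, family = binomial)

# Proportion of cases classified correctly
accuracy <- function(model, data) {
  predicted <- as.integer(predict(model, newdata = data, type = "response") > 0.5)
  mean(predicted == data$label)
}

baseline <- accuracy(fit, toy)

# Permute one feature at a time, leave the others untouched,
# and record how far accuracy falls below the baseline
accuracy_drop <- sapply(c("signal", "noise"), function(feature) {
  permuted <- toy
  permuted[[feature]] <- sample(permuted[[feature]])
  baseline - accuracy(fit, permuted)
})

print(round(accuracy_drop, 3))
```

Permuting ‘signal’ should cost a lot of accuracy, while permuting ‘noise’ should cost almost none, which is the pattern that separates important features from unimportant ones.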

On the right is the Mean Decrease in the Gini Index, which measures the heterogeneity of a group. As the random forest algorithm builds a tree, it uses different features to split heterogeneous groups into more homogeneous nodes. Features that produce the largest increases in homogeneity, that is, in the ‘purity’ of a group, must be the most important features for making good predictions.
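As a small worked example of the Gini Index, the sketch below computes it for a made-up group of ten reviews before and after a hypothetical split. The function name `gini` and the labels are illustrative assumptions; the random forest algorithm accumulates decreases like this one over all the splits that use a feature.

```r
# Gini index of a group of class labels: 0 when pure, larger when mixed
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Ten made-up reviews, evenly mixed between the two classes
parent <- c(rep("alternative", 5), rep("conventional", 5))

# A hypothetical split on some feature that separates the classes fairly well
left  <- c(rep("alternative", 4), "conventional")
right <- c("alternative", rep("conventional", 4))

# Size-weighted Gini index of the two child nodes
weighted_children <- (length(left) * gini(left) +
                      length(right) * gini(right)) / length(parent)

gini_decrease <- gini(parent) - weighted_children
print(c(parent = gini(parent), children = weighted_children, decrease = gini_decrease))
```

Here the Gini Index falls from 0.5 in the mixed parent group to 0.32 across the two child nodes, a decrease of 0.18.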

Feature importance: Sensible but obvious

The features in the plot above are ranked in descending order, so that the most important features are at the top. The three most important features for both methods were how often the words ‘massage’, ‘dentist’, and ‘chiropractor’ were used in the review. While the answer is probably obvious, we should check to see whether they were used more frequently in reviews for ‘conventional’ or ‘alternative’ medicine. Let’s grab that information from our training data set:

library(dplyr)

traindata <- readRDS('traindata.rds')

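# Mean of every feature within each type of medicine, after dropping the
# first three columns; across() replaces the deprecated summarise_each(funs())
con_alt_means_train <- traindata[ , -c(1:3)] %>%
     group_by(medicine) %>%
     summarise(across(everything(), mean))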

con_alt_means_train <- as.data.frame(t(con_alt_means_train[ , -1]))
colnames(con_alt_means_train) <- c('alternative', 'conventional')
con_alt_means_train$alt_con_ratio <- (con_alt_means_train$alternative / 
                                      con_alt_means_train$conventional)
con_alt_means_train$con_alt_ratio <- (con_alt_means_train$conventional / 
                                      con_alt_means_train$alternative)
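To see what this reshaping produces, here is a sketch of the same steps on made-up toy data, using base R’s `aggregate` in place of dplyr so that it runs without extra packages. The values and the names `toy`, `means`, and `toy_means` are invented for illustration and are not the Yelp data.

```r
# Made-up usage rates for four toy reviews (not the Yelp data)
toy <- data.frame(
  medicine = c("alternative", "alternative", "conventional", "conventional"),
  dentist  = c(0.000, 0.001, 0.004, 0.006),
  massage  = c(0.008, 0.006, 0.000, 0.001)
)

# Mean usage rate per type of medicine (base R analogue of group_by + summarise)
means <- aggregate(. ~ medicine, data = toy, FUN = mean)

# Transpose so that each row is a word and each column a type of medicine;
# naming the columns from the group labels avoids relying on their order
toy_means <- as.data.frame(t(means[, -1]))
colnames(toy_means) <- as.character(means$medicine)
toy_means$alt_con_ratio <- toy_means$alternative / toy_means$conventional
toy_means$con_alt_ratio <- toy_means$conventional / toy_means$alternative

# Individual words can then be looked up by row name
print(round(toy_means[c("dentist", "massage"), ], 4))
```

Because the words end up as row names, a lookup such as `con_alt_means_train[c('dentist', 'massage', 'chiropractor'), ]` is one plausible way to pull out rows for specific features.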

Now let’s look at our top three most important features:

##              alternative conventional alt_con_ratio con_alt_ratio
## dentist           0.0000       0.0039        0.0007     1404.7346
## massage           0.0077       0.0000      228.9037        0.0044
## chiropractor      0.0023       0.0000      399.8521        0.0025

The word ‘dentist’ is used far more often in reviews about conventional medical establishments than in reviews about alternative medical establishments, whereas the words ‘massage’ and ‘chiropractor’ are used far more often about alternative medicine.

That’s good to know – the model makes sense! – but it doesn’t really tell us anything new. We already knew that. And it looks like most of our top features fall into this category of sensible-but-obvious: ‘acupuncture’, ‘spa’, ‘surgery’, ‘hospital’, and ‘adjustment’, where the last one probably refers to chiropractors’ work (though I haven’t checked this rigorously to be sure).

Feature importance: Body parts

In another category are words that refer to the body: ‘body’, ‘foot’, ‘back’, and ‘neck’, where the latter two, again, probably refer to chiropractors’ work.

Feature importance: Pain

Perhaps more interesting is the use of the word ‘pain’:

##      alternative conventional alt_con_ratio con_alt_ratio
## pain      0.0036        0.001        3.5781        0.2795

It shows up about 3.6 times more often in reviews about alternative medicine than about conventional medicine. Let’s do some quick, crude correlations to get some hints about what might be happening here. Restricting ourselves only to alternative medicine, let’s see how well usage rates for ‘pain’ correlate with the patients’ ratings (called ‘stars’) for their overall treatment experience:

##            pain
## stars 0.0915272

It’s a small correlation, but it is positive, suggesting that the more often patients of alternative medical establishments mention ‘pain’, the higher they rate their treatment experience at that establishment.
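The code behind these correlations isn’t shown, so here is one plausible way to compute them, sketched on made-up data. The column names `medicine`, `stars`, `ave.polarity`, and `pain` follow this post, but the layout of the real data frame is an assumption, and `cor_within` is a hypothetical helper.

```r
set.seed(7)
# Made-up stand-in for the review data; the real data frame may differ
toy_reviews <- data.frame(
  medicine     = rep(c("alternative", "conventional"), each = 50),
  stars        = sample(1:5, 100, replace = TRUE),
  ave.polarity = rnorm(100),
  pain         = runif(100, min = 0, max = 0.01)
)

# Hypothetical helper: correlate one word's usage rate with ratings and
# polarity, using only the reviews for one type of medicine
cor_within <- function(data, type, word) {
  rows <- data[data$medicine == type, ]
  cor(rows[, c("stars", "ave.polarity")], rows[[word]])
}

alt_cor <- cor_within(toy_reviews, "alternative", "pain")
print(alt_cor)
```

Calling `cor_within(toy_reviews, "conventional", "pain")` gives the matching pair of correlations for the other type of medicine.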

How do mentions of ‘pain’ correlate with polarity, or sentiment?

##                     pain
## ave.polarity -0.05654968

It’s a very small correlation, and it’s negative: more mentions of pain are weakly associated with more negative emotional content of the patients’ reviews about alternative medicine.

Well, those correlations might make some sense: alternative medicine patients might write negatively about their pain but still write positively about their treatment experiences – either because they viewed their treatments as successful or because they gave credit to their alternative medical practitioners for trying to help.

What happens when we do the same correlations for conventional medicine?

##               pain
## stars -0.005997637

##                     pain
## ave.polarity -0.03483849

Patients of conventional medicine show virtually no relationship between how often they mention ‘pain’ and their overall ratings of their treatment experiences. Likewise, there’s little relationship with the emotional content of their reviews.

So, based only on the correlations, pain seems to have little to do with how conventional medical patients rate or write about their treatment experiences. It doesn’t seem to matter much to alternative medical patients, either, but they write about it more often than conventional medical patients do, and when they do, they tend to give slightly higher ratings even though the wording of their reviews is slightly more negative.

But wait. Didn’t our random forest model tell us that this is an important variable? Shouldn’t we see higher correlations? Well, yes, it is important, relative to 1496 other variables. But a variable can be relatively important (to classification) and still have small absolute relationships (with ratings and polarity/sentiment), as indicated by the correlations. Perhaps more importantly, a big purpose of machine learning, including our random forest model, is to find complex relationships among lots of variables that we humans can’t easily discern. The model might find that a variable is important in the context of other variables, even though looking at that single variable in isolation, as with a simple correlation, finds only weak relationships. So, the real purpose of looking at the correlations is just to get some quick hints about what patients might be writing about. It isn’t meant to be a statistically rigorous exploration with firm conclusions. But we might find some hints that we should explore more deeply and rigorously later.
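The point about context can be illustrated with a deliberately extreme toy example, sketched below in base R. The features `word_a` and `word_b` and the helper `majority_accuracy` are made up; the label depends only on the combination of the two features, so each one looks useless in isolation.

```r
set.seed(123)
n <- 1000
# Two made-up binary features and a label that depends on their combination
# (an XOR pattern), not on either feature alone
word_a <- rbinom(n, 1, 0.5)
word_b <- rbinom(n, 1, 0.5)
label  <- as.integer(xor(word_a == 1, word_b == 1))

# Looked at in isolation, word_a is almost uncorrelated with the label
cor_alone <- cor(word_a, label)

# Accuracy when predicting the majority label within the groups
# defined by the supplied features
majority_accuracy <- function(...) {
  predicted <- ave(label, ..., FUN = function(v) round(mean(v)))
  mean(predicted == label)
}

acc_a_only <- majority_accuracy(word_a)          # close to chance
acc_both   <- majority_accuracy(word_a, word_b)  # perfect

print(c(cor_alone = cor_alone, acc_a_only = acc_a_only, acc_both = acc_both))
```

Real review data is unlikely to be this clean, but the sketch shows how a feature can matter a great deal to a model that considers features jointly while showing almost no simple correlation on its own.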

Let’s look for a few more hints.

Feature importance: Waiting

The Mean Decrease in the Gini Index identifies mentions of ‘wait’ and ‘waiting’ as fairly important variables. For which type of medicine are they being mentioned more often?

##         alternative conventional alt_con_ratio con_alt_ratio
## wait          6e-04       0.0019        0.3025        3.3059
## waiting       3e-04       0.0015        0.1990        5.0257

‘Wait’ and ‘waiting’ are mentioned 3 – 5 times more often for conventional medicine than for alternative medicine. How do the correlations look?

##              wait
## stars -0.06981581

##                    wait
## ave.polarity -0.0555242

##         waiting
## stars -0.127547

##                 waiting
## ave.polarity -0.0646195

They’re weak, but they’re all in the negative direction: more mentions of ‘wait’ and ‘waiting’ are weakly associated with lower ratings and more negative emotional content in reviews. Maybe patients aren’t happy about waiting at conventional medical establishments.

Feature importance: Polarity, or sentiment

Average polarity, that is, the sentiment or emotional content of reviews, shows up as an important variable. How does it differ between the two types of medicine?

##              alternative conventional alt_con_ratio con_alt_ratio
## ave.polarity      0.0283       0.0179        1.5851        0.6309

It’s higher for alternative medicine than for conventional medicine; patients write about alternative medicine more positively than they do about conventional medicine. That’s not surprising to us by now, because we earlier found that patients rate their overall experiences with alternative medicine higher than their experiences with conventional medicine. As we might expect, the polarity/sentiment and the overall ratings have a moderately high correlation (okay, ‘high’ compared to the other ones that we’ve seen so far):

## [1] 0.2724732

Feature importance: Relaxing

The word ‘relaxing’ shows up as an important variable under the Mean Decrease in Accuracy.

##          alternative conventional alt_con_ratio con_alt_ratio
## relaxing      0.0011        1e-04       18.2227        0.0549

It shows up about 18 times more often for alternative medicine than for conventional medicine.

##        relaxing
## stars 0.0120782

##                relaxing
## ave.polarity 0.01599536

And mentions of ‘relaxing’ correlate very weakly and positively with ratings and polarity/sentiment.

Feature importance: Price

Delving deeper, we find that mentioning ‘price’ is ranked 143 out of 1497 variables in importance by the Mean Decrease in the Gini Index. Given ongoing political controversies about medical prices, it might be worth looking at.

##       alternative conventional alt_con_ratio con_alt_ratio
## price      0.0011        2e-04        5.0432        0.1983

‘Price’ is mentioned about 5 times more often for alternative medicine than for conventional medicine.

##            price
## stars 0.01935443

##                   price
## ave.polarity 0.03880303

Mentions of ‘price’ in reviews about alternative medicine correlate very weakly and positively with ratings and polarity/sentiment.

##             price
## stars -0.00785593

##                     price
## ave.polarity -0.001072382

Mentions of ‘price’ for conventional medicine show almost no relationship with ratings and polarity/sentiment.

Feature importance: Summary and conclusions

In summary, we found that most of the important variables didn’t tell us anything new about how patients experience conventional and alternative medicine differently, but a few gave us some hints. Patients mention pain more often for alternative medicine than for conventional, and they rate their alternative medical providers slightly higher when they mention pain. Patients write about waiting more for conventional medicine than for alternative, and they apparently don’t like waiting. Patients of alternative medicine also mention ‘relaxing’ and ‘price’ more, and they associate their treatment experiences with more positive emotions, compared with conventional medicine.

Our measurements of predictor variable importance in our model aren’t the final word on how reviews for conventional and alternative medicine may differ. For one thing, a lot of the features are probably correlated with each other. That is, if a word shows up in a review, it’s likely that there are other words that also tend to show up in that review. For example, if ‘oncologist’ appears in a review, there’s a good chance that ‘cancer’ or ‘tumor’ would appear in it, too. These correlated predictor variables can hide each other’s importance to the model. The usage rates of the word ‘cancer’ might be really important to our model. But if we remove ‘cancer’ from our predictors, the model might still classify just as well because ‘tumor’ or ‘oncologist’ can take over the predicting job of ‘cancer’. We might have to remove most or all of the words related to ‘cancer’ to realize how important it is to the model. This feature correlation makes measuring the importance of our predictor variables harder to do reliably.
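Here is a rough sketch of that masking effect on made-up data. It uses a logistic regression as a stand-in for the random forest, and the variables `cancer`, `tumor`, and `filler` are simulated rather than taken from the reviews; `fit_accuracy` is a hypothetical helper.

```r
set.seed(2024)
n <- 1000
# Made-up usage rates: 'tumor' closely tracks 'cancer'; 'filler' is unrelated
cancer <- rnorm(n)
tumor  <- cancer + rnorm(n, sd = 0.1)
filler <- rnorm(n)
label  <- as.integer(cancer + rnorm(n, sd = 0.5) > 0)
toy <- data.frame(label, cancer, tumor, filler)

# Training accuracy of a logistic regression fitted with the given predictors
fit_accuracy <- function(formula) {
  fit <- glm(formula, data = toy, family = binomial)
  mean(as.integer(fitted(fit) > 0.5) == toy$label)
}

acc_full      <- fit_accuracy(label ~ cancer + tumor + filler)
acc_no_cancer <- fit_accuracy(label ~ tumor + filler)  # 'tumor' takes over
acc_neither   <- fit_accuracy(label ~ filler)          # both related words removed

print(round(c(full = acc_full, no_cancer = acc_no_cancer, neither = acc_neither), 3))
```

Dropping ‘cancer’ alone should barely change the accuracy, because ‘tumor’ carries nearly the same information. Only when both related words are removed does the accuracy fall toward chance.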

These feature correlations suggest two improvements we could make to our current analysis. First, there are many methods available for reducing our set of features to a smaller, more reliable set, which could improve our interpretations above. Second, instead of looking at each feature in isolation, we could take advantage of the correlations by better understanding how the features relate to each other. Topic modeling could give us a way to automatically detect that ‘cancer’, ‘tumor’, and ‘oncologist’ are all part of one topic, instead of manually looking at each variable in isolation. Either approach could further enhance our understanding of patients’ experiences with conventional and alternative medicine.
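As a very crude stand-in for that idea (this is hierarchical clustering on feature correlations, not topic modeling), the sketch below groups made-up word-usage variables by how strongly they correlate. All of the data and names are simulated assumptions.

```r
set.seed(99)
n <- 300
# Two made-up latent themes drive the usage of related words
oncology <- rnorm(n)
bodywork <- rnorm(n)
noisy <- function(theme) theme + rnorm(n, sd = 0.5)

words <- data.frame(
  cancer     = noisy(oncology),
  tumor      = noisy(oncology),
  oncologist = noisy(oncology),
  massage    = noisy(bodywork),
  spa        = noisy(bodywork)
)

# Treat highly correlated words as close together, then cluster them
distance <- as.dist(1 - cor(words))
tree     <- hclust(distance, method = "average")
groups   <- cutree(tree, k = 2)

print(groups)
```

On this toy data, the three oncology words should land in one group and the two bodywork words in the other. A proper topic model would work from the word counts in each review rather than from pairwise correlations.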