Training the Classification Model

Previously, we read the data set of Yelp business reviews into R, identified which reviews were about conventional and alternative medical establishments, and removed reviews that couldn’t be validly analyzed (e.g., non-English reviews). We then added and modified features that we can use to predict whether a review is about a conventional or an alternative medical establishment. Finally, we split our data into training and testing sets.

Now we’ll train a random forest model on our data to see how well it can classify reviews as referring to conventional or alternative medical establishments. If the model works well enough, we can examine which features are most important for its classifications, which may give us some insight into how patients’ experiences with conventional and alternative medicine differ.

We’ll need the R package ‘randomForest’:

library(randomForest)

We saved our training and testing data to external files that we can read back into R:

traindata <- readRDS("traindata.rds")
testdata <- readRDS("testdata.rds")

Here are the first few rows of the first several columns of our training data:

##                    business_id                user_id
## 615673  fue1nMtFUkYMUPIl7K_Cfw lFgg4xXvMJ8zCaGBGVyQfA
## 1180491 CMdNwkfXQanQ3asGkNTFIA 3t6BtCftpqWL4AjIJpO-BQ
## 561052  AmrJkhuLdS_3_GQKj01WAg Wga_siGuLi3dQwCATtD48g
## 216994  g-mzslTyKp4ZWkwQGstOug fsBMDuyJpnReuTsNBnp1Ow
## 1013724 EpLyvAlqd5kC2hSB1pQPVg F04PZuHKAm_JVFuHpq_TOQ
## 1242619 OGBBv1G_3hyqjNZdUfPLaA P6D9nJkBrx7rhyGtpHxHUg
##                      review_id     medicine stars funny useful cool
## 615673  --gIJ5IhuAOJJs-76fklOQ conventional     1     1      2    1
## 1180491 --ji515P_ulxMXjK9aw30g  alternative     5     0      0    0
## 561052  -025RRfiUofqbZsU_Fk7Ow conventional     3     0      0    0
## 216994  -18bfIm-BnDm8rbqF1M0pg conventional     1     0      0    0
## 1013724 -2DwtIP_kEiwpWDcts3R_g conventional     5     0      1    0
## 1242619 -2OsVfLqukXzq8LWE_fRmw  alternative     5     0      3    1
##         n.words n.char n.syl n.poly   cpw   spw  pspw
## 615673      103    418   140      8 4.058 1.359 0.078
## 1180491      84    352   111      8 4.190 1.321 0.095
## 561052       84    362   119      9 4.310 1.417 0.107
## 216994      240   1021   335     26 4.254 1.396 0.108
## 1013724      20    104    31      4 5.200 1.550 0.200
## 1242619      53    215    66      3 4.057 1.245 0.057

We won’t need the identification columns to train our model. Let’s specify the column for our response variable, which denotes whether a review refers to a conventional or an alternative medical establishment, and the columns for our predictor variables:

response_col <- 4
predictor_cols <- 5:ncol(traindata)

We have many more reviews for conventional than for alternative medical establishments, which means that our classes are not balanced:

## 
##  alternative conventional 
##         1147         5158
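Those counts come from tabulating the response column, presumably with a call like `table(traindata[ , response_col])` (the call itself isn’t shown above, so that’s an assumption). A minimal sketch with a toy factor shows the same call and output format:

```r
# table() counts the cases at each level of a factor; a toy response
# with the same two levels illustrates the output format.
toy_response <- factor(c(rep("alternative", 3), rep("conventional", 7)))
table(toy_response)
# The real counts above would come from a call like:
# table(traindata[ , response_col])
```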

Like many classification algorithms, random forest may not work well when classes are imbalanced, especially if the imbalance is extreme. In such cases, the classifier can mistakenly learn to predict the majority class for every case instead of learning to distinguish between the classes. For example, if 95 of 100 cases belong to the majority class and 5 to the minority class, a classifier that simply predicts the majority class for every case will be correct 95% of the time.
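To make that trap concrete, here is a minimal sketch of the 100-case example: a degenerate classifier that always predicts the majority class scores 95% accuracy while identifying no minority case at all.

```r
# Ground truth: 95 majority-class cases and 5 minority-class cases
truth <- c(rep("majority", 95), rep("minority", 5))
# A degenerate classifier that predicts the majority class for every case
predictions <- rep("majority", 100)
# Accuracy is 95% even though the classifier learned nothing useful
accuracy <- mean(predictions == truth)
accuracy   # 0.95
```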

There are several ways to correct this problem. One method is called ‘downsampling’, in which only some of the cases in the majority class are used. In our example above, instead of using all 95 majority-class cases we could use only 5, which would exactly balance them against the 5 cases in the minority class. For our random forest model, a kind of downsampling called ‘balanced random forest’ works well when compared to other methods.

An obvious disadvantage of downsampling is that we throw out a lot of data from the majority class. But we randomize which cases are discarded when building each decision tree, and a random forest can contain a great many trees. So if we grow enough trees, we can ensure that (nearly) every majority-class case is used in at least one tree.
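A back-of-envelope check of that claim, using our class counts and a 5,000-tree forest (this sketch assumes each tree draws the per-class sample with replacement, as the balanced sampling described below does):

```r
# Probability that one particular majority-class case is never drawn.
# Each tree draws s cases with replacement from the N majority cases.
N <- 5158   # majority-class ('conventional') training cases
s <- 1147   # per-tree sample size (the minority-class count)
B <- 5000   # number of trees
p_miss_one_tree  <- (1 - 1/N)^s       # missed by a single tree: about 0.80
p_miss_all_trees <- p_miss_one_tree^B # missed by every tree in the forest
p_miss_all_trees   # numerically 0: each case appears in some tree
```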

In the R package ‘randomForest’, the per-class sample size is specified by the ‘sampsize’ parameter. We’ll set it to the size of our minority class, which is ‘alternative’ medicine. That doesn’t mean every minority-class case will be used exactly once in each tree: because we sample with replacement, about 63% of the minority cases will be used in any given tree, and some will be used more than once. Here’s our variable for the ‘sampsize’ parameter:

sampsize <- min(table(traindata[ , response_col]))
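The 63% figure is the usual bootstrap coverage rate: when we draw n cases with replacement from a class of n, the expected fraction of distinct cases drawn is 1 − (1 − 1/n)^n, which approaches 1 − 1/e ≈ 0.632 as n grows. A quick check with our minority-class size:

```r
# Expected fraction of distinct minority-class cases drawn per tree
# when sampling n cases with replacement from a class of size n:
n <- 1147   # minority-class ('alternative') training cases
frac_used <- 1 - (1 - 1/n)^n
frac_used   # about 0.632, i.e. roughly 63%
```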

The number of trees in the forest is controlled by the ‘ntree’ parameter (called ‘ntreeTry’ in the tuning function ‘tuneRF’). We want it to be large:

n_trees <- 5000

At each node during the creation of a tree, the algorithm tests a random subset of our predictor variables to see which one best separates the cases into the purest, or most homogeneous, classes according to our response variable (‘conventional’ or ‘alternative’ medicine, in our case). How many predictors should it test at each node? That’s what the ‘mtry’ parameter specifies, and it’s important because it can exert a strong influence on our final results. Therefore, we’re going to test several possible values of ‘mtry’ using the ‘tuneRF’ function and then select the best one to train our final model.

Let’s search for the best value of ‘mtry’ and save our results:

print_interval <- ceiling(n_trees / 100)
set.seed(22640171)
mtry_results <- tuneRF(y=traindata[ , response_col],
                       x=traindata[ , predictor_cols],
                       sampsize=c(sampsize, sampsize),
                       strata=traindata[ , response_col],
                       ntreeTry=n_trees,
                       doBest=FALSE, importance=TRUE, plot=FALSE, do.trace=print_interval)

write.csv(mtry_results, 'mtry_oob_error.csv')

Let’s look at our results:

##        mtry   OOBError
## 19.OOB   19 0.06969120
## 38.OOB   38 0.06588409
## 76.OOB   76 0.06873942

The out-of-bag (OOB) error rate was lowest when ‘mtry’ was 38. That’s the value of ‘mtry’ that we want to use when training our final model, so let’s save that result:

mtry <- mtry_results[mtry_results[ , 2] == min(mtry_results[ , 2]), 1]
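A note on this line: `which.min` returns the row index of the smallest OOB error, so `mtry_results[which.min(mtry_results[ , 2]), 1]` is an equivalent (and arguably clearer) form; the two agree whenever the minimum is unique. A toy matrix mimicking the ‘tuneRF’ output above confirms it:

```r
# Toy results matrix shaped like the tuneRF output shown above
toy_results <- cbind(mtry = c(19, 38, 76),
                     OOBError = c(0.0697, 0.0659, 0.0687))
# Logical-comparison form used in the text
best_a <- toy_results[toy_results[ , 2] == min(toy_results[ , 2]), 1]
# which.min form
best_b <- toy_results[which.min(toy_results[ , 2]), 1]
c(best_a, best_b)   # both 38
```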

Now we can train our final random forest model and save it to an external file:

set.seed(83875501)
rfmodel <- randomForest(y=traindata[ , response_col],
                        x=traindata[ , predictor_cols],
                        sampsize=c(sampsize, sampsize),
                        strata=traindata[ , response_col],
                        ntree=n_trees, mtry=mtry,
                        importance=TRUE, do.trace=print_interval)

saveRDS(rfmodel, file = "random_forest_model.rds")