Evaluating the Model

Previously, we read the data set of Yelp business reviews into R, identified which reviews were about conventional and alternative medical establishments, and removed reviews that couldn’t be validly analyzed (e.g., non-English reviews). Then we added features, split our data into training and testing sets, and trained a random forest model to predict whether a review is about an establishment of conventional or alternative medicine.

Now we need to evaluate whether the model is any good. If so, we can look at which features are most important for its classifications, and that may give us some insights into how patients’ experiences with conventional and alternative medicine differ.

We’ll need the R packages ‘randomForest’, ‘caret’, and ‘ROCR’:

library(randomForest)
library(caret)
library(ROCR)

We saved our training data set, testing data set, and random forest model to external files that we can read back into R:

traindata <- readRDS('traindata.rds')
testdata <- readRDS('testdata.rds')
rfmodel <- readRDS('random_forest_model.rds')

We’ll also need to know which column in our training and testing data contains our response variable:

response_col <- 4

Evaluate model: Metrics functions

We’ll need metrics by which to evaluate our model.

Let’s first write a function to extract some information from a confusion matrix:

get_confusion_elements <- function(caret_confusion_matrix) {
     tp <- as.numeric(caret_confusion_matrix$table[1])   # true positives
     fn <- as.numeric(caret_confusion_matrix$table[2])   # false negatives
     fp <- as.numeric(caret_confusion_matrix$table[3])   # false positives
     tn <- as.numeric(caret_confusion_matrix$table[4])   # true negatives
     return( c(tp, fp, tn, fn) )
}

Here are some functions to calculate informedness, which can be thought of as a non-biased version of recall (also called sensitivity) and is particularly helpful when evaluating classifiers on imbalanced classes. It measures how consistently the predictor predicts the response variable, compared to a random predictor. It ranges from -1 to 1 and is expected to be zero for a random, uninformative classifier. For dichotomous variables, it’s also known as Youden’s J and as DeltaP’:

calculate_informedness <- function(sensitivity_recall_tpr, specificity_tnr) {
     # for dichotomous outcomes, informedness is also known as Youden's J and
     #    as Delta P'
     informedness <- sensitivity_recall_tpr + specificity_tnr - 1
     return(informedness)
}

get_informedness <- function(caret_confusion_matrix) {
     # calculates informedness from confusion matrix
     sensitivity_recall_tpr <- caret_confusion_matrix$byClass[1]
     specificity_tnr <- caret_confusion_matrix$byClass[2]
     inform <- unname(calculate_informedness(sensitivity_recall_tpr,
                                             specificity_tnr))
     return(inform)
}

Here are some functions to calculate markedness, which can be thought of as a non-biased version of precision (also called positive predictive value) and, again, is particularly helpful when evaluating classifiers on imbalanced classes. It measures how consistently a response is associated with a predictor, compared to a random association. Like informedness, it ranges from -1 to 1 and is expected to be zero for a random, uninformative classifier. For dichotomous variables, it’s also known as DeltaP:

calculate_markedness <- function(ppv_precision, npv) {
     # for dichotomous outcomes, markedness is also known as Delta P
     markedness <- ppv_precision + npv - 1
     return(markedness)
}

get_markedness <- function(caret_confusion_matrix) {
     # calculates markedness from confusion matrix
     ppv_precision <- caret_confusion_matrix$byClass[3]
     npv <- caret_confusion_matrix$byClass[4]
     markedness <- unname(calculate_markedness(ppv_precision, npv))
     return(markedness)
}

Here are some functions to calculate the Matthews correlation coefficient, which can be thought of as a version of Pearson’s correlation coefficient for dichotomous variables, though it is calculated from a confusion matrix and may not range from -1 to 1 for non-dichotomous variables. It is the geometric mean of informedness and markedness, and thus represents an unbiased correlation that balances the importance of classifying both classes correctly. It is also known as the phi coefficient:

calculate_mcc <- function(tp, fp, tn, fn) {
     # calculates Matthews correlation coefficient
     # tp - true positives
     # fp - false positives
     # tn - true negatives
     # fn - false negatives
     mcc <- ((tp * tn) - (fp * fn)) /
            sqrt( (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn) )
     return(mcc)
}

get_mcc <- function(caret_confusion_matrix) {
     # calculates Matthews correlation coefficient from confusion matrix
     ce <- get_confusion_elements(caret_confusion_matrix)
     mcc <- calculate_mcc(ce[1], ce[2], ce[3], ce[4])
     return(mcc)
}

Here are some functions to calculate the commonly used F1 score, which is the harmonic mean of recall (sensitivity) and precision (positive predictive value). It focuses on cases and predictions of whichever class is designated as the positive class – usually the minority, or rarer, class – rather than balancing the importance of both classes. When classes are imbalanced, it is biased, but we’ll include it here because it is so commonly used:

calculate_f1_score <- function(precision, recall, beta=1) {
     # calculates the F-beta score (F1 when beta = 1) from precision and recall
     f1_score <- unname( ( (1 + beta^2) * recall * precision ) /
                         ( (beta^2 * precision) + recall ) )
     return(f1_score)
}
     
get_f1_score <- function(caret_confusion_matrix) {
     # calculates F1 score from confusion matrix
     sensitivity_recall <- caret_confusion_matrix$byClass[1]
     ppv_precision <- caret_confusion_matrix$byClass[3]
     f1_score <- calculate_f1_score(ppv_precision, sensitivity_recall)
     return(f1_score)
}

Here’s a function to calculate the area under the curve (AUC) for the receiver operating characteristic (ROC) curve for a single classifier. The AUC is “equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance”, where the ‘positive instance’ is a case in the positive class, which is usually chosen to be the minority class, and the ‘negative instance’ is usually in the majority class:

calculate_auc_one_classifier <- function(recall_sensitivity_tpr,
                                         specificity_tnr) {
     a <- recall_sensitivity_tpr
     b <- 1 - specificity_tnr
     # a*b/2 + a*(1-b) + (1-a)*(1-b)/2 reduces to:
     auc <- a/2 - b/2 + 1/2
     return(auc)
}

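As a quick check (the numbers here are purely illustrative), this single-classifier AUC works out to the balanced accuracy, i.e., the average of sensitivity and specificity:

calculate_auc_one_classifier(0.86, 0.95)   # 0.905 = (0.86 + 0.95) / 2
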
And finally, a function to compile several of our metrics:

get_metrics <- function(confusion_matrix) {
     inform <- get_informedness(confusion_matrix)
     mark <- get_markedness(confusion_matrix)
     mcc <- get_mcc(confusion_matrix)
     f1 <- get_f1_score(confusion_matrix)
     return( c(inform, mark, mcc, f1) )
}

Evaluate model: Set up classifiers

Now that we have our metric-calculating functions ready, let’s use our random forest model to predict the classes in our testing data set:

rf_classify <- predict(rfmodel, testdata)

Let’s also include the out-of-bag (OOB) classifications that were made during training. In general, the OOB classifications should provide a good estimate of the error rates on new data. However, there are some circumstances where they evidently do not. In our case, we did do some tuning on the parameter ‘mtry’, which might affect the OOB estimates. We’ll take a look and see:

rf_oob_classify <- rfmodel$predicted

Next we’ll define some useless classifiers to compare with our model.

But first, it will be convenient to have the numbers and names from our response variable in the training data:

class_table <- table(traindata[ , response_col])
class_table
## 
##  alternative conventional 
##         1719         7737
class_table_prop <- class_table / sum(class_table)
class_table_prop
## 
##  alternative conventional 
##    0.1817893    0.8182107

Now we’ll set up our first useless classifier. It will randomly assign a classification as either alternative or conventional to each case with a 50%-50% probability:

set.seed(12247081)
random_classify_5050 <- sample(names(class_table), dim(testdata)[1], replace=T)

Did we approximately get our 50%-50% classification?

table(random_classify_5050)
## random_classify_5050
##  alternative conventional 
##         3149         3156
table(random_classify_5050) / sum(table(random_classify_5050))
## random_classify_5050
##  alternative conventional 
##    0.4994449    0.5005551

Yep, that looks about right.

Now for our second useless classifier. It’ll randomly assign classifications according to the prevalences of the classes in our training data:

set.seed(32571012)
random_classify_skew <- sample(names(class_table), dim(testdata)[1], replace=T,
                               prob=class_table_prop)

Did we get the same prevalences? Here again is what they were for the training data:

class_table_prop
## 
##  alternative conventional 
##    0.1817893    0.8182107

And here’s what we got for the randomly predicted testing data:

table(random_classify_skew)
## random_classify_skew
##  alternative conventional 
##         1195         5110
table(random_classify_skew) / sum(table(random_classify_skew))
## random_classify_skew
##  alternative conventional 
##    0.1895321    0.8104679

Looks good.

Now let’s add our third useless classifier. This one will predict only the majority class:

max_class_n <- max(class_table)
max_class_name <- names(which(class_table == max_class_n))
uniform_classify <- rep(max_class_name, dim(testdata)[1])

What did we get?

table(uniform_classify)
## uniform_classify
## conventional 
##         6305

Only ‘conventional’ was predicted, just as we wanted.

However, some of our metrics won’t be defined for that classifier, and it might be interesting just to see what they would be for a very similar classifier. So let’s add a fourth useless classifier. Like the previous one, this one will predict everything as the majority class, except for one case:

min_class_n <- min(class_table)
min_class_name <- names(which(class_table == min_class_n))
almost_uniform_classify <- uniform_classify
almost_uniform_classify[1] <- min_class_name

What did we get this time?

table(almost_uniform_classify)
## almost_uniform_classify
##  alternative conventional 
##            1         6304

Perfect, that’s exactly what we wanted: one case classified as ‘alternative’ and everything else classified as ‘conventional’.

Evaluate model: Calculate metrics

Now, let’s calculate!

classify_list <- list(rf_oob_classify, rf_classify, random_classify_5050,
                      random_classify_skew, uniform_classify,
                      almost_uniform_classify)

confusion_matrices <- vector('list', length(classify_list))
col_names <- c('sensitivity', 'specificity', 'inform', 'mark', 'mcc', 'f1',
               'auc')
metrics <- data.frame(matrix(nrow=length(classify_list),
                             ncol=length(col_names)))
rownames(metrics) <- c('rf_oob', 'rf', 'random_50-50', 'random_skew',
                       'uniform', 'almost_uniform')
colnames(metrics) <- col_names
for (i in 1:length(classify_list)) {
     if (i == 1) {
          # the OOB predictions are compared with the training responses
          confusion_matrices[[i]] <- confusionMatrix(classify_list[[i]],
                                                     traindata[ , response_col])
     } else {
          # all other classifiers are compared with the testing responses
          confusion_matrices[[i]] <- confusionMatrix(classify_list[[i]],
                                                     testdata[ , response_col])
     }
     metrics[i, 1] <- confusion_matrices[[i]]$byClass[1]   # sensitivity
     metrics[i, 2] <- confusion_matrices[[i]]$byClass[2]   # specificity
     metrics[i, 3:(dim(metrics)[2]-1)] <- get_metrics(confusion_matrices[[i]])
     metrics[i, dim(metrics)[2]] <- calculate_auc_one_classifier(metrics[i, 1],
                                                                 metrics[i, 2])
}

Evaluate model: Assess metrics

First, let’s look at the confusion matrix for our random forest model:

confusion_matrices[[2]]
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     alternative conventional
##   alternative          992          243
##   conventional         155         4915
##                                           
##                Accuracy : 0.9369          
##                  95% CI : (0.9306, 0.9428)
##     No Information Rate : 0.8181          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7941          
##  Mcnemar's Test P-Value : 1.295e-05       
##                                           
##             Sensitivity : 0.8649          
##             Specificity : 0.9529          
##          Pos Pred Value : 0.8032          
##          Neg Pred Value : 0.9694          
##              Prevalence : 0.1819          
##          Detection Rate : 0.1573          
##    Detection Prevalence : 0.1959          
##       Balanced Accuracy : 0.9089          
##                                           
##        'Positive' Class : alternative     
## 

Our accuracy is 94%, but that doesn’t mean much with imbalanced classes, because we could get a high accuracy even if we predict poorly for the minority class but well on the majority class.

Fortunately, we’re predicting rather well for our minority class, which comprises reviews for alternative medicine. Out of all those reviews, we correctly classify 86% of them as for alternative medicine, which is the sensitivity, or recall. Out of the reviews that we classify as for alternative medicine, 80% really are for alternative medicine, which is the positive predictive value, or precision.

Not surprisingly, we’re predicting even better for our majority class, the reviews for conventional medicine. We correctly classify 95% of them (specificity), and 97% of our classifications as reviews for conventional medicine are correct (negative predictive value).

Let’s compare our random forest model to the useless classifiers:

format(metrics, digits = 2)
##                sensitivity specificity  inform   mark    mcc     f1  auc
## rf_oob             0.84584        0.95 0.79931 0.7669 0.7829 0.8231 0.90
## rf                 0.86486        0.95 0.81775 0.7727 0.7949 0.8329 0.91
## random_50-50       0.51787        0.50 0.02253 0.0134 0.0174 0.2765 0.51
## random_skew        0.19180        0.81 0.00278 0.0027 0.0027 0.1879 0.50
## uniform            0.00000        1.00 0.00000    NaN    NaN    NaN 0.50
## almost_uniform     0.00087        1.00 0.00087 0.8182 0.0267 0.0017 0.50

First, the OOB metrics (‘rf_oob’ in the table above) are very similar to the model’s performance on the testing data (‘rf’ in the table above). Thus, our model didn’t overfit. So as it turns out, we probably would’ve been fine not having a testing data set and simply relying on the OOB estimates of the model’s performance on new data.

Clearly, our random forest model is much better than the useless classifiers (the rows labeled ‘random_50-50’, ‘random_skew’, ‘uniform’, and ‘almost_uniform’ in the table above).

Remember that informedness (‘inform’ in the table above) can be thought of as a non-biased version of sensitivity, and in our case, it’s a bit below sensitivity. Markedness (‘mark’ in the table above) has a similar value, which suggests that our model is balancing them well. The useless classifiers have values near zero for both metrics, except for the markedness of the almost-uniform classifier. Its markedness is pretty good because it’s predicting nearly all reviews as for conventional medicine, but its informedness is terrible.

The Matthews correlation coefficient (MCC; ‘mcc’ in the table above) is a single metric that balances informedness and markedness. If we had to choose only one of these metrics to evaluate our model on imbalanced classes, the MCC is probably the best choice. Our model clearly scores far above the useless classifiers, but since the maximum possible value is 1, our model’s value of 0.79 could likely be improved upon.

The F1 score (‘f1’ in the table above) is included here because it’s commonly used, so some readers may wish to see it.

The area under the curve (AUC) for the receiver operating characteristic (ROC) curve (‘auc’ in the table above) is 0.91, fairly near its maximum possible value of 1. The useless classifiers all have AUCs around 0.50, as we would expect. These AUCs are calculated for single classifiers, each at a single value for sensitivity and a single value for specificity. But it would be nice to calculate a full ROC curve. Fortunately, a random forest model can provide us with more than just its predicted dichotomous classifications. It can also provide us with the proportion of trees that voted for a classification, and we can use that information to construct a full ROC curve with various combinations of sensitivities and specificities at different voting thresholds.

Evaluate model: ROC curve

rf_classify_prob <- predict(rfmodel, testdata, type='prob')
# the label ordering gives the negative class ('conventional') first and the
#    positive class ('alternative') second
rf_test_pred <- prediction(rf_classify_prob[ , 1], testdata[ , response_col],
                           c(max_class_name, min_class_name))
rf_test_roc <- performance(rf_test_pred, 'tpr', 'fpr')

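Here’s a minimal sketch, using base graphics, of how a plot like the one described below could be drawn (the colors, point symbols, and AUC call are illustrative choices):

plot(rf_test_roc, col='red', lwd=2)              # full ROC curve
abline(0, 1, lty=2)                              # chance line
# single-classifier ROC for the random forest: (0,0) -> (FPR, TPR) -> (1,1)
lines(c(0, 1 - metrics['rf', 'specificity'], 1),
      c(0, metrics['rf', 'sensitivity'], 1), col='red')
# points for the random forest (red) and the four useless classifiers (black)
points(1 - metrics$specificity[2:6], metrics$sensitivity[2:6],
       pch=19, col=c('red', 'black', 'black', 'black', 'black'))
performance(rf_test_pred, 'auc')@y.values[[1]]   # AUC for the full ROC curve
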
In the plot above, the full ROC curve is the curved red line. The dots represent individual classifiers. The red one is our random forest model, and it falls exactly on the full ROC curve. The ROC curve for the single classifier by itself is shown by the straight red lines. For the single classifier, we previously calculated an AUC of 0.91. But for the full ROC curve, the AUC is larger – 0.97 – as is apparent from the plot.

The four useless classifiers are the four black dots; only 3 dots are visible because the ‘uniform’ and ‘almost_uniform’ classifiers are plotted right on top of each other. All the useless classifiers are near the dashed black line, which represents classifiers that perform no better than chance. Just as we saw when we inspected the metrics, the random forest clearly performs much better than the useless classifiers.

Our random forest predictions on the testing set were based on a single voting threshold. Specifically, the model defaulted to a majority vote; if more than 50% of the trees voted that a review was about conventional (or alternative) medicine, then that’s how the model classified it. That 50% voting threshold produced a single confusion matrix and represented a single choice in balancing sensitivity, specificity, precision, and other metrics. But perhaps we can find a better voting threshold.

First, we’ll plot an iso-performance (or iso-cost) line. Any classifiers that fall on the same iso-performance line will perform equally well. The slope of this line is determined by the relative costs of misclassifications and by the prevalences or (im)balances of our classes. In our case, we’re judging our two classes to be equally important, so false positives and false negatives have the same cost. But our majority class of conventional medicine has many more cases than our minority class, alternative medicine. The ratio of the numbers of their cases determines the slope of the iso-performance line:

get_y_intercept <- function(x, y, slope) {
     return(y - (slope * x))
}

slope <- max_class_n / min_class_n
intercept <- get_y_intercept(1 - metrics[2, 2], metrics[2, 1], slope)

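Here’s a minimal sketch of re-drawing the ROC plot with the iso-performance line added (again, the plotting details are assumptions):

plot(rf_test_roc, col='red', lwd=2)
abline(0, 1, lty=2)
points(1 - metrics['rf', 'specificity'], metrics['rf', 'sensitivity'],
       pch=19, col='red')
abline(intercept, slope)   # iso-performance line through the random forest point
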
We re-plotted our ROC curve, but now we’ve added the iso-performance line for our random forest model with its default voting threshold of 50%. There are some points on the ROC curve that are to the upper left of the iso-performance line, suggesting that some other voting thresholds might perform better than our default.

Evaluate model: Maximizing MCC

We mentioned earlier that the MCC is probably the best single metric by which to judge our model. Let’s find out what voting threshold maximizes the MCC.

rf_test_mcc <- performance(rf_test_pred, 'mat')
mcc_max <- max(rf_test_mcc@y.values[[1]], na.rm=T)
mcc_max_idx <- which(rf_test_mcc@y.values[[1]] == mcc_max)
mcc_max_threshold <- rf_test_mcc@x.values[[1]][mcc_max_idx]

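Here’s a minimal sketch of the MCC-versus-threshold plot (plotting details are assumptions):

plot(rf_test_mcc)                    # MCC against the voting threshold ('Cutoff')
abline(v=mcc_max_threshold, lty=2)   # threshold that maximizes the MCC
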
Across all the voting thresholds, the maximum MCC is 0.807. When we inspected our metrics earlier, the MCC was 0.795, so we can improve upon that by a little bit. The plot above shows the MCC plotted against the voting threshold (labeled ‘Cutoff’) with a dashed vertical line at the threshold that maximizes the MCC. That value is 0.544.

thresholds <- rf_test_roc@alpha.values[[1]]
thresholds_idx <- which(thresholds == mcc_max_threshold)
roc_y <- rf_test_roc@y.values[[1]][thresholds_idx]
roc_x <- rf_test_roc@x.values[[1]][thresholds_idx]

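A minimal sketch of the zoomed-in ROC plot (the axis limits and colors are assumptions):

plot(rf_test_roc, col='red', lwd=2, xlim=c(0, 0.2), ylim=c(0.8, 1))
abline(intercept, slope)                    # iso-performance line
points(1 - metrics['rf', 'specificity'], metrics['rf', 'sensitivity'],
       pch=19, col='red')                   # default 50% voting threshold
points(roc_x, roc_y, pch=19, col='green')   # MCC-maximizing voting threshold
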
We’ve returned to our ROC plot, but now we’re zoomed in on the upper left corner. The new green point is the model with the voting threshold that maximizes MCC. Notice that it’s to the upper left of the iso-performance line, so it will perform better than our original voting threshold (at the red dot). I haven’t seen a reference verifying this, but I suspect that that green dot maximizes the distance between the iso-performance line and the ROC curve in the direction orthogonal to the iso-performance line and heading toward the upper left.

Let’s see how our new, MCC-optimizing threshold compares to the original threshold:

# classify a review as 'conventional' when the proportion of trees voting for
#    'alternative' falls below the MCC-maximizing threshold
rf_predict_optimal_threshold <- as.factor(rf_classify_prob[ , 1] < mcc_max_threshold)
levels(rf_predict_optimal_threshold) <- c(min_class_name, max_class_name)
rf_confusion_optimal_threshold <- confusionMatrix(rf_predict_optimal_threshold,
                                                  testdata[ , response_col])
rf_optimized_metrics <- metrics[2:3, 1:(dim(metrics)[2] - 1)]
rownames(rf_optimized_metrics)[2] <- 'rf_optimized'
rf_optimized_metrics[2, 1] <- rf_confusion_optimal_threshold$byClass[1]
rf_optimized_metrics[2, 2] <- rf_confusion_optimal_threshold$byClass[2]
rf_optimized_metrics[2, 3:6] <- get_metrics(rf_confusion_optimal_threshold)
rf_optimized_metrics
##              sensitivity specificity    inform      mark       mcc
## rf             0.8648649   0.9528887 0.8177536 0.7726669 0.7948906
## rf_optimized   0.8151700   0.9724699 0.7876400 0.8276014 0.8073735
##                     f1
## rf           0.8329135
## rf_optimized 0.8408273

The model with the optimized threshold is labeled ‘rf_optimized’. Its informedness is a bit lower, but its markedness is a bit higher and drives the small increase in the MCC.

Thus, our model with the optimized voting threshold performs better than the model with the original threshold.

Here’s the confusion matrix based on the optimized voting threshold, which we can see by printing the confusion-matrix object we created above:

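rf_confusion_optimal_threshold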
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     alternative conventional
##   alternative          935          142
##   conventional         212         5016
##                                           
##                Accuracy : 0.9439          
##                  95% CI : (0.9379, 0.9494)
##     No Information Rate : 0.8181          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8068          
##  Mcnemar's Test P-Value : 0.0002451       
##                                           
##             Sensitivity : 0.8152          
##             Specificity : 0.9725          
##          Pos Pred Value : 0.8682          
##          Neg Pred Value : 0.9594          
##              Prevalence : 0.1819          
##          Detection Rate : 0.1483          
##    Detection Prevalence : 0.1708          
##       Balanced Accuracy : 0.8938          
##                                           
##        'Positive' Class : alternative     
## 

Evaluate model: PRC curve

Finally, we’ll briefly look at a precision-recall (PRC) plot, which might be especially useful for imbalanced classes:

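Here’s a minimal sketch of how the precision-recall curve could be computed and plotted with ‘ROCR’ (the baseline and colors are assumptions):

rf_test_prc <- performance(rf_test_pred, 'prec', 'rec')
plot(rf_test_prc, col='red', lwd=2)
# chance-level baseline: precision equal to the prevalence of the positive class
abline(h=min_class_n / (min_class_n + max_class_n), lty=2)
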
The red curve is the PRC curve. Classifiers that perform at a chance level would appear near the horizontal dashed line. Our model appears as solid dots on the PRC curve at the original (red) and optimized (green) voting thresholds. Given that markedness and informedness can be viewed as non-biased versions of precision and recall, respectively, I also plotted the model’s markedness-informedness points at the two voting thresholds as open circles. The markedness-informedness points don’t rate the model as well as the precision-recall points do. A full markedness-informedness curve might be useful, especially for imbalanced data sets.

A quick methodological note: the R package ‘PRROC’ provides proper non-linear interpolation when constructing PRC curves. The package that we used, ‘ROCR’, doesn’t do so.

Evaluate model: Where are the learning curves?

For random forest models, plotting the in-bag training error generally isn’t meaningful, because the error is always zero. But that’s with the default settings, which assume that at least 50% of the cases will be in-bag. We downsampled, so we had less than 50% of cases in-bag. So, we can plot a meaningful training curve, right? Well, no, not exactly. We downsampled only the majority class, conventional medicine, so we could calculate meaningful training error for it. But for alternative medicine, over 50% of cases (~63%, on average) would still be in-bag. We wouldn’t want to assess our model’s bias and variance based only on the majority class, and we wouldn’t want to downsample our minority class, as we already have few cases for it. One option might be to increase the node size, so that trees don’t reach their maximum size, but we would have to test that to see whether the resulting model would still perform as well as our current one.
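
To illustrate that last idea, here’s a hypothetical sketch of retraining with a larger terminal node size. The nodesize value of 50 is arbitrary, and it assumes all remaining columns of ‘traindata’ are valid predictors; the tuned ‘mtry’ and downsampling settings from our original training run would also need to be carried over before making any comparison:

rf_larger_nodes <- randomForest(x=traindata[ , -response_col],
                                y=traindata[ , response_col],
                                nodesize=50)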