Creating Training and Testing Data Sets

Previously, we read the data set of Yelp business reviews into R, identified which reviews were about conventional and alternative medical establishments, and removed reviews that couldn’t be validly analyzed (e.g., non-English reviews). Then we added and modified features that we can use to predict whether a review is about an establishment of conventional or alternative medicine.

Now we need to split our data into training and testing sets. We’ll train our prediction model on the training set and then test its performance on the testing set.

We saved our data to an external file that we can read back into R:

expdata <- readRDS("revsfeaturesculled.rds")

In addition to hosting reviews, Yelp lets its users rate businesses on their overall experience. The ratings range from the lowest, 1 star, to the highest, 5 stars. Let’s look at how conventional and alternative medical establishments fare in these ratings:
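
One quick way to see the distribution is a within-group proportion table; for example, using base R (with margin = 1, each row sums to 1, so the entries are the share of each star rating within a type of medicine):

round(prop.table(table(expdata$medicine, expdata$stars), margin = 1), 3)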

First, notice that most of the ratings are either 1 or 5 stars – the lowest or highest possible. I suspect that most patrons or patients of medical establishments actually have middling experiences – 2s, 3s, or 4s. But the patrons/patients who are motivated enough to write about and rate their experiences on Yelp are probably the ones who feel the strongest about those experiences. That self-selection may be why we see such extremes in the ratings. We don’t know for certain that this is the reason, but we should keep it in mind as a possibility while interpreting the data.

Second, note that alternative medicine is rated more highly than conventional medicine: it receives a notably higher proportion of 5s and a lower proportion of 1s. The 2s, 3s, and 4s are rare enough that they don’t matter much.

However, a majority of the conventional medicine patrons/patients still rate their experiences at the highest possible score. So conventional medicine still does well – just not as well as alternative medicine.

When we split our data into training and testing sets, we’re going to maintain (nearly) equal proportions of each medicine-by-rating subgroup in the two sets. That way, when our model makes predictions for the testing set, classification errors won’t arise simply because the testing set’s composition of ratings differs from the data the model trained on. We’ll place 60% of the data in our training set and the other 40% in our testing set.

Let’s identify the rows in our data frame that will be in our testing set:

var1 <- levels(expdata$medicine)
var2 <- sort(unique(expdata$stars))
testrows <- integer(0)
set.seed(43317)

# For each medicine-by-rating subgroup, sample 40% of its rows
# (without replacement) for the testing set
for (iter1 in seq_along(var1)) {
     for (iter2 in seq_along(var2)) {
          set <- which(expdata$medicine == var1[iter1] & expdata$stars == var2[iter2])
          # sample() is safe here because every subgroup contains multiple rows;
          # with a length-1 vector it would sample from 1:set instead
          sampledrows <- sample(set, size = round(0.4 * length(set)), replace = FALSE)
          testrows <- append(testrows, sampledrows)
     }
}

traindata <- expdata[-testrows, ]
testdata <- expdata[testrows, ]

Let’s check our split to make sure we did it correctly. Here are the overall proportions for the training and testing sets, respectively:
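
We can compute these, for example, by dividing each set’s row count by the total number of rows:

nrow(traindata) / nrow(expdata)
nrow(testdata) / nrow(expdata)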

## [1] 0.5999619
## [1] 0.4000381

Those are very close to our intended 60%-40% split. Good.

The subgroups defined by the two types of medicine and the five star ratings should also have (nearly) the same proportions in both sets. Here are the proportions for all of the data:
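
These tables can be produced, for example, with base R’s prop.table over a two-way table (the same call with traindata or testdata in place of expdata gives the tables for the two splits):

prop.table(table(expdata$medicine, expdata$stars))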

##               
##                          1           2           3           4           5
##   alternative  0.010976461 0.004060656 0.004250999 0.016750206 0.145802931
##   conventional 0.197132162 0.043905844 0.037243830 0.088890299 0.450986613

For the training set:

##               
##                          1           2           3           4           5
##   alternative  0.010998308 0.004018613 0.004230118 0.016708968 0.145833333
##   conventional 0.197123519 0.043887479 0.037225042 0.088938240 0.451036379

For the testing set:

##               
##                          1           2           3           4           5
##   alternative  0.010943695 0.004123711 0.004282316 0.016812054 0.145757335
##   conventional 0.197145123 0.043933386 0.037272006 0.088818398 0.450911975

The corresponding proportions for every subgroup are nearly equal, so our split worked the way we intended.
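
Rather than eyeballing the tables, we can also check the largest discrepancy between corresponding cells programmatically; for example:

trainprop <- prop.table(table(traindata$medicine, traindata$stars))
testprop <- prop.table(table(testdata$medicine, testdata$stars))
max(abs(trainprop - testprop))

A result near zero confirms that the subgroup proportions in the two sets are (nearly) equal.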

Let’s save our data sets to external files:

saveRDS(traindata, file = "traindata.rds")
saveRDS(testdata, file = "testdata.rds")