Cleaning the Data

Previously, we read the data set of Yelp business reviews into R and identified which reviews were about conventional and alternative medical establishments.

We saved those reviews to an external file that we can read back into R:

revs <- readRDS("revs.rds")

Let’s re-arrange the data frame to make it easier to work with:

revs <- cbind(revs$business_id, revs[ , 2:3], revs$medicine, revs$stars,
              revs$votes, revs$text, stringsAsFactors = FALSE)
colnames(revs) <- gsub("revs\\$", "", colnames(revs))  # drop the "revs$" prefix cbind adds
revs$medicine <- as.factor(revs$medicine)
revs$business_id <- as.factor(revs$business_id)
revs <- revs[order(revs$review_id), ]  # sort by review ID

Our data set has 15840 reviews, one per row. Here are the first few rows of our data, minus the review text itself:

head(revs[ , 1:8])
##                    business_id                user_id
## 253534  JtJCJTfMarEgkE4GQWYszw L9-X4KASFfdhOeVeYSDgvA
## 82661   kvEo5n57c-x_KcEpUzS1yA DOJM58OkGSsIdk2qCUZnLQ
## 14888   81R8SNN9_RfJoMIx3lqNYg hBZ4tkZJqkLZRf215I-WDg
## 304946  w-oT9vw9y1Y0Jd5Df-J_ng JQfPnTVHsRhRpK5dl-qxbA
## 1003011 lkCRIX6odrbZOuwtACr9Dw c0MADTAt4wiavdG_tEujDQ
## 74521   CGne_Sr7m9HT4ztsWQw6Aw tCVQyYHcmOrx2i81C47Tew
##                      review_id     medicine stars funny useful cool
## 253534  009WfPydDpJF1W6C_GnpLQ conventional     5     1      4    1
## 82661   00sSZKtSKhCS0Cxndmovlw conventional     3     0      2    0
## 14888   00VfFQXlIy1dPYmf0DMPaA conventional     2     0      2    0
## 304946  00yA91ROMC4LwMGKnJH2oQ  alternative     5     0      1    0
## 1003011 0140MOVX8ec10sT7kvWuVg  alternative     5     0      1    0
## 74521   01I4YPgjvz8r-UoZb2XW-w conventional     5     0      0    0

And here’s the first review:

revs[1, 9]
## [1] "Wow, what a great experience.  The place is vast (come a little early) but it's beautiful, flawlessly orgranized and has lovely service.  \n\nThere's an information desk complete with someone to walk you part of the way to your goal.  On your floor, there's a brisk check-in line where you hand in your insurance info as needed, and they alert the office you're visting to let them know you're ready for them.  There's a large, really nice seating area and someone comes and gets you within a few minutes.\n\nIt's clean and smells nice, and everyone is pleasant and professional.  The doctors and nurses were both knowledgable and extremely pleasant - they exuded competence and had a lovely bedside manner.  Even though my problem was minor and quick, they were thorough.\n\nI actually felt at ease, and even with all this great service, I was still in and out of there in record time (never more than a few moments waiting at any point), and I wasn't the first appointment of the day or anything.  It's amazing to go to a doctor's office that runs punctually!! \n\nOver all, I felt a bit like I'd walked into Lifestyles of the Rich and Famous - it felt like VIP treatment and atmosphere."

Now we need to clean the data set by removing any reviews that will pose problems for our analysis later.

We’re going to use the R package ‘qdap’ to do a lot of our text analysis. Let’s load it:

library(qdap)

Cleaning data: UTF-8 format

Later in our analysis, we’re going to use qdap’s ‘word_stats’ function, but it doesn’t accept text in UTF-8 encoding. Do we have any reviews encoded as UTF-8?

utfrows <- which(Encoding(revs$text) == "UTF-8")
length(utfrows)
## [1] 76

Yep, 76 of our 15840 reviews are encoded as UTF-8. I tried converting them but ran into problems. There aren’t many of them, and we don’t want to go too far down that rabbit hole, so let’s just remove them:

revs <- revs[-utfrows, ]

Now we’re down to 15764 reviews.

Cleaning data: Non-English reviews

Some of our reviews may not be in English. How would we find them? Well, an English spell-checker will return a very high error rate for non-English reviews, so let’s try that. First we’ll create a data frame with the number and rate of misspellings for each review:

# count the misspellings that check_spelling() finds in each review
misspellnumbs <- table(check_spelling(revs$text, assume.first.correct = FALSE)$row)
misspells <- cbind(as.integer(rownames(misspellnumbs)),
                   as.integer(misspellnumbs))
colnames(misspells) <- c("textrownumber", "misspellingsnumber")
# reviews with no misspellings don't appear in the table, so add them with a count of 0
missingrows <- setdiff(1:dim(revs)[1], misspells[ , 1])
nomisspells <- cbind(missingrows, rep(0, length(missingrows)))
misspells <- as.data.frame(rbind(misspells, nomisspells))
misspells <- misspells[order(misspells[ , 1]), ]
rownames(misspells) <- 1:dim(misspells)[1]
# misspelling rate = misspellings / total words in the review
misspells$totalwords <- word_count(revs$text)
misspells$errorrate <- misspells[ , 2] / misspells[ , 3]

Let’s plot the misspelling rate for each review to see if any are really high:
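The plot itself isn’t reproduced here, but a base-graphics sketch along these lines, using the misspells data frame built above, would make the outliers stand out:

```r
# plot each review's misspelling rate; non-English reviews should
# appear as points far above the rest
plot(misspells$textrownumber, misspells$errorrate,
     xlab = "Review row number", ylab = "Misspelling rate")
```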

There are a few reviews with really high misspelling rates. Let’s look at the ones with a rate higher than 50%:
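A sketch of that lookup, assuming the misspells rows are still aligned one-to-one with the rows of revs:

```r
# pull the text of every review whose misspelling rate exceeds 50%
revs$text[which(misspells$errorrate > 0.5)]
```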

## [1] "Dauert manchmal etwas lang aber sonst ist alles okay"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
## [2] "NT"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [3] "NO LLEVEN A SUS HIJOS A ESTE LUGAR!\nNO LLEVEN A SUS HIJOS A ESTE LUGAR!\n\ntengo una bebe de 9 meses, la lleve a este lugar desde que nacio. Resulta que cambie de pediatra por que me hacian esperar horas para que me atiendan. Ahora con nuevo pediatra me avisan que a mi nena la vacunaron en fechas equivocadas que no le correspondian a su edad, ahora tengo que esperar que mi nena cumpla 15 meses para que le hagan examenes de sangre para asegurarme de que este bien y estas vacunas no le hayan afectado. Obviamente que si algo le pasa a corte los llevare, pero les recomiendo no llevar a sus hijos aqui para que no pasen por la preocupacion que estoy pasando y ademas para que no sufran sus bebitos con tantas agujas."
## [4] "Best ob/gyns!"

Yep, two of these reviews (one German, one Spanish) are clearly not in English. After removing them, we have 15762 reviews remaining.
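The removal step isn’t shown above; one way to sketch it, assuming the German and Spanish texts are the first and third entries in the over-50% group as listed:

```r
# rows of revs whose misspelling rate exceeds 50%
highrows <- which(misspells$errorrate > 0.5)
# drop the two non-English reviews (positions 1 and 3 in the listing above);
# the position indices here are illustrative, not taken from the original code
revs <- revs[-highrows[c(1, 3)], ]
```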

I also checked reviews with error rates as low as ~20% and didn’t see any other non-English reviews.

Cleaning data: Formality processing error

Later in our analysis we’ll use qdap’s ‘formality’ function to process the reviews, but it returns an error for one review. I haven’t tracked down the cause of the error, so we’ll simply remove that review:

problemrows <- c(4364)
revs <- revs[-problemrows, ]

That leaves us with 15761 reviews: 12895 reviews of 1847 conventional medical establishments and 2866 reviews of 368 alternative medical establishments.
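Those counts can be recovered from the data frame itself; a sketch, assuming the medicine and business_id columns shown earlier:

```r
# reviews per category
table(revs$medicine)
# distinct establishments per category
tapply(revs$business_id, revs$medicine, function(x) length(unique(x)))
```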

Cleaning data: Formatting

To finish preparing our reviews for analysis with qdap, we need to mark sentences that lack a proper endmark, since qdap’s sentence-level functions expect one; these two functions handle that:

revs$text <- add_incomplete(revs$text)
revs$text <- incomplete_replace(revs$text)

We have now removed from our data set the reviews that would impede our later analysis, and we’ve prepared the review text for that analysis. This is a good place to pause and save our cleaned data set to an external file:

saveRDS(revs, file = "revsculledprepped.rds")