Reading the Data – Andrew Fairless, Ph.D.

The data set of business reviews was downloaded in 2015 from Yelp’s Dataset Challenge webpage at http://www.yelp.com/dataset_challenge. It is stored in a subdirectory of the current working directory called ‘yelp_dataset_challenge_academic_dataset’.

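The data set comprises five JSON files that we can read using the R package ‘jsonlite’. Each file stores one JSON object per line, so the code below joins the lines of a file into a single JSON array before parsing it: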

library(jsonlite)

dirname <- "yelp_dataset_challenge_academic_dataset"
filestem <- "yelp_academic_dataset_"
filenamepart <- c("business", "checkin", "review", "tip", "user")
fileext <- ".json"
alldata <- list()

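for (iter in seq_along(filenamepart)) {
     # Build the path to the current JSON file
     filename <- file.path(dirname, paste0(filestem, filenamepart[iter], fileext))
     # Each line is a separate JSON object, so join the lines with commas
     # and wrap them in brackets to form a single JSON array
     jsonlines <- readLines(filename)
     alldata[[iter]] <- fromJSON(sprintf("[%s]", paste(jsonlines, collapse = ",")))
     # Report progress
     print(filename)
     print(filenamepart[iter])
}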
     
names(alldata) <- filenamepart
     
saveRDS(alldata, file = "revs_json_read.rds")
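
The bracket-and-comma wrapping in the loop above is easier to see on a small example. The sketch below uses two made-up records (the IDs and star ratings are hypothetical, not taken from the Yelp data) and also shows ‘stream_in’, a jsonlite function that reads line-delimited JSON directly and could be used as an alternative:

```r
library(jsonlite)

# Two hypothetical records in the one-object-per-line layout
jsonlines <- c('{"business_id": "a1", "stars": 4.5}',
               '{"business_id": "b2", "stars": 3.0}')

# Approach used in the loop: join the lines into one JSON array and parse it
joined <- fromJSON(sprintf("[%s]", paste(jsonlines, collapse = ",")))

# Alternative: stream_in() parses line-delimited JSON from a connection
con <- textConnection(jsonlines)
streamed <- stream_in(con, verbose = FALSE)
close(con)

# Both results are data frames with one row per input line
print(joined)
print(streamed)
```

For files of this size, ‘stream_in’ may also use less memory, because it avoids building one very long string first; I haven’t benchmarked that on this data set.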

We read each JSON file into a data frame, stored the data frames in a named list, and saved the list to an external RDS file. Saving the data this way is handy because parsing the JSON files takes a while (on my computer, at least), and reading the list back from the RDS file is faster.
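
As a minimal sketch of that round trip, the example below saves a small stand-in list to a temporary file and reads it back with ‘readRDS’. The toy list and the temporary file are assumptions for illustration; with the real data, the call would be readRDS("revs_json_read.rds").

```r
# Hypothetical stand-in for the list of data frames built above
toydata <- list(business = data.frame(business_id = c("a1", "b2")),
                review = data.frame(stars = c(5, 3, 4)))

# Save the list to a temporary RDS file, then read it back
rdsfile <- tempfile(fileext = ".rds")
saveRDS(toydata, file = rdsfile)
restored <- readRDS(rdsfile)

# The restored object keeps its names and contents
print(names(restored))
print(identical(toydata, restored))
```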

The Yelp data set includes all kinds of businesses – restaurants, hotels, hardware stores – but we’re interested only in medical establishments. Each business is tagged with one or more categories, which are stored in the ‘categories’ column of the ‘business’ data frame:

colnames(alldata$business)

##  [1] "business_id"   "full_address"  "hours"         "open"         
##  [5] "categories"    "city"          "review_count"  "name"         
##  [9] "neighborhoods" "longitude"     "state"         "stars"        
## [13] "latitude"      "attributes"    "type"

head(alldata$business$categories)

## [[1]]
## [1] "Doctors"          "Health & Medical"
## 
## [[2]]
## [1] "Nightlife"
## 
## [[3]]
## [1] "Active Life" "Mini Golf"   "Golf"       
## 
## [[4]]
## [1] "Shopping"                   "Home Services"             
## [3] "Internet Service Providers" "Mobile Phones"             
## [5] "Professional Services"      "Electronics"               
## 
## [[5]]
## [1] "Bars"           "American (New)" "Nightlife"      "Lounges"       
## [5] "Restaurants"   
## 
## [[6]]
## [1] "Bars"                   "American (Traditional)"
## [3] "Nightlife"              "Restaurants"
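
Because ‘categories’ is a list column (one character vector per business), the usual vectorized comparisons don’t apply to it directly. The sketch below shows a few ways to summarize such a column, using a toy list copied from the first three entries printed above rather than the full data. The variable names are my own, and the tag lookup at the end is only an illustration, not necessarily the approach taken in the next section.

```r
# Toy stand-in for alldata$business$categories (first three entries above)
categories <- list(c("Doctors", "Health & Medical"),
                   "Nightlife",
                   c("Active Life", "Mini Golf", "Golf"))

# Number of category tags attached to each business
ncategories <- lengths(categories)

# Frequency of each tag across all businesses, most common first
tagcounts <- sort(table(unlist(categories)), decreasing = TRUE)

# Flag the businesses that carry a given tag
ismedical <- vapply(categories,
                    function(tags) "Health & Medical" %in% tags,
                    logical(1))

print(ncategories)
print(tagcounts)
print(ismedical)
```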

The reviews of the businesses are stored in the ‘review’ data frame.

colnames(alldata$review)

## [1] "votes"       "user_id"     "review_id"   "stars"       "date"       
## [6] "text"        "type"        "business_id"

alldata$review$text[c(3, 22, 54, 88)]

## [1] "Dr. Goldberg has been my doctor for years and I like him.  I've found his office to be fairly efficient.  Today I actually got to see the doctor a few minutes early!  \n\nHe seems very engaged with his patients and his demeanor is friendly, yet authoritative.    \n\nI'm glad to have Dr. Goldberg as my doctor."
## [2] "I visited this store several months ago to simply ask about smartphone plans. The agent was pleasant and helpful. I would recommend a visit to this store."                                                                                                                                                            
## [3] "Don't waste your time.  We had two different people come to our house to give us estimates for a deck (one of them the OWNER).  Both times, we never heard from them.  Not a call, not the estimate, nothing."                                                                                                         
## [4] "Great Diner!  Their breakfast is the best in the area!  Lots of  choices and always good.  Love the unique mugs they use for coffee and all the writing on the wall.  Get there early otherwise you will not get a seat!  They have homemade hot/sweet sausage and texas toast."
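
Both the ‘business’ and ‘review’ data frames have a ‘business_id’ column, so a review can be matched to the business it describes. The sketch below assumes that the two columns share the same IDs and uses made-up toy data frames (the IDs, names, and review texts are hypothetical) to show the join with ‘merge’:

```r
# Hypothetical miniature versions of the two data frames
business <- data.frame(business_id = c("a1", "b2"),
                       name = c("Clinic A", "Diner B"),
                       stringsAsFactors = FALSE)
review <- data.frame(business_id = c("a1", "b2", "a1"),
                     stars = c(5, 4, 2),
                     text = c("Friendly doctor.", "Great breakfast.", "Long wait."),
                     stringsAsFactors = FALSE)

# Attach each review to its business by matching on business_id
reviewed <- merge(review, business, by = "business_id")

print(reviewed)
```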

In the next section, we’ll figure out how to identify the medical establishments that we’re interested in.