Identifying Reviews of Medicine

Previously, we read the data set of Yelp business reviews into R.

Now we want to identify which reviews are about conventional medical establishments, like hospitals and physicians’ offices, and which are about alternative medicine, like acupuncture.

We saved our data to an external file that we can read back into R:

alldata <- readRDS("revs_json_read.rds")

The Yelp data set includes all kinds of businesses – restaurants, hotels, hardware stores – but we’re interested only in medical establishments. Each business is tagged with the categories that it’s in. We can check in the ‘business’ data frame to find the categories:

colnames(alldata$business)
##  [1] "business_id"   "full_address"  "hours"         "open"         
##  [5] "categories"    "city"          "review_count"  "name"         
##  [9] "neighborhoods" "longitude"     "state"         "stars"        
## [13] "latitude"      "attributes"    "type"
head(alldata$business$categories)
## [[1]]
## [1] "Doctors"          "Health & Medical"
## 
## [[2]]
## [1] "Nightlife"
## 
## [[3]]
## [1] "Active Life" "Mini Golf"   "Golf"       
## 
## [[4]]
## [1] "Shopping"                   "Home Services"             
## [3] "Internet Service Providers" "Mobile Phones"             
## [5] "Professional Services"      "Electronics"               
## 
## [[5]]
## [1] "Bars"           "American (New)" "Nightlife"      "Lounges"       
## [5] "Restaurants"   
## 
## [[6]]
## [1] "Bars"                   "American (Traditional)"
## [3] "Nightlife"              "Restaurants"
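
One way to do that kind of inspection is to flatten the list of categories and count the tags. The sketch below uses a small made-up list called `cats` as a stand-in for `alldata$business$categories`; the names `cats`, `cat_counts`, `is_health`, and `health_tags` are only illustrative.

```r
# Toy stand-in for alldata$business$categories (made-up values)
cats <- list(c("Doctors", "Health & Medical"),
             c("Nightlife"),
             c("Dentists", "Health & Medical"),
             c("Bars", "Nightlife"))

# Flatten the list and count how often each tag occurs, most frequent first
cat_counts <- sort(table(unlist(cats)), decreasing = TRUE)
print(cat_counts)

# Tags that appear on businesses also tagged "Health & Medical"
is_health <- vapply(cats, function(x) "Health & Medical" %in% x, logical(1))
health_tags <- sort(unique(unlist(cats[is_health])))
print(health_tags)
```

Scanning the tags that co-occur with a broad category such as ‘Health & Medical’ is a quick way to spot candidate search terms.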

First, we’ll search for only the conventional medical establishments. After some further inspection of the categories, I chose the medical terms in the code below to identify conventional medicine:

medrows = unique(c(grep("Doctor", alldata$business$categories), 
                   grep("Hospital", alldata$business$categories),
                   grep("Allergist", alldata$business$categories),
                   grep("Anesthesiologist", alldata$business$categories),
                   grep("Cardiologist", alldata$business$categories),
                   grep("Surgeon", alldata$business$categories),
                   grep("Dentist", alldata$business$categories),
                   grep("Drugstore", alldata$business$categories),
                   grep("Ear Nose & Throat", alldata$business$categories),
                   grep("Endodontist", alldata$business$categories),
                   grep("Internal Medicine", alldata$business$categories),
                   grep("Laser Eye Surgery/Lasik", alldata$business$categories),
                   grep("Obstetrician", alldata$business$categories),
                   grep("Gastroenterologist", alldata$business$categories),
                   grep("Gynecologist", alldata$business$categories),
                   grep("Ophthalmologist", alldata$business$categories),
                   grep("Oncologist", alldata$business$categories),
                   grep("Orthodontist", alldata$business$categories),
                   grep("Orthopedist", alldata$business$categories),
                   grep("Orthotic", alldata$business$categories),
                   grep("Pediatric", alldata$business$categories),
                   grep("Periodontist", alldata$business$categories),
                   grep("Pharmacy", alldata$business$categories),
                   grep("Podiatrist", alldata$business$categories),
                   grep("Psychiatrist", alldata$business$categories),
                   grep("Pulmonologist", alldata$business$categories),
                   grep("Radiologist", alldata$business$categories),
                   grep("Rheumatologist", alldata$business$categories),
                   grep("Urologist", alldata$business$categories),
                   grep("Medical Center", alldata$business$categories)))
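
The repeated grep calls above could also be written more compactly. The sketch below is one possible way, not the code used for this analysis: `match_any_term` is a hypothetical helper, and `cats` and `terms` are made-up toy data. Like the code above, it relies on grep coercing each list element to a single string before matching, and it keeps the hits in the same term-by-term order.

```r
# Hypothetical helper: indices of list elements whose categories match any term.
# Each term is passed to grep() as the pattern; hits are combined in term order
# and duplicates are dropped, mirroring unique(c(grep(...), grep(...), ...)).
match_any_term <- function(terms, categories) {
  hits <- lapply(terms, grep, x = categories)
  unique(unlist(hits))
}

# Toy stand-in for alldata$business$categories (made-up values)
cats <- list(c("Doctors", "Health & Medical"),
             c("Nightlife"),
             c("Health & Medical", "Hospitals"),
             c("Dentists", "Doctors"))

terms <- c("Hospital", "Doctor", "Dentist")
rows <- match_any_term(terms, cats)
print(rows)
```

Keeping the terms in a single character vector makes the list easier to review and extend than thirty separate calls.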

There are some judgment calls when choosing which terms to use. For example, I thought about including ‘Health’ as a term. Including it didn’t change the number of physicians or hospitals, but it did increase the numbers of establishments in the categories of ‘Counseling & Mental Health’, ‘Diagnostic Imaging’, ‘Diagnostic Services’, and ‘Optometrists’. It’s debatable whether those extra establishments should be included, but I chose to leave them out.

One thing that we definitely want to exclude from our conventional medical establishments is anything associated with alternative medicine, so that the two categories stay cleanly separated.

Some of our conventional medical establishments might be involved in things other than just conventional medicine:

alldata$business$categories[medrows[c(89, 43, 62)]]
## [[1]]
## [1] "Doctors"                   "Shopping"                 
## [3] "Beauty & Spas"             "Health & Medical"         
## [5] "Cosmetics & Beauty Supply" "Hair Removal"             
## [7] "Dermatologists"           
## 
## [[2]]
## [1] "Doctors"          "Health & Medical" "Urgent Care"     
## [4] "Chiropractors"   
## 
## [[3]]
## [1] "Doctors"             "Shopping"            "Optometrists"       
## [4] "Ophthalmologists"    "Health & Medical"    "Eyewear & Opticians"

In the examples above, there are some chiropractors, some shopping, and apparently a spa. I don’t know exactly what happens at any of these specific establishments – maybe some of the categories are mistaken – but these establishments might be providing services other than conventional medicine. To be certain that the establishments we include are ‘purely’ conventional medical establishments, we’ll exclude any that stray into these other categories.

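The code below lists the terms that disqualify an establishment from our conventional medical category. Note that each grep searches only the categories of the businesses in medrows, so the indices it returns are positions within medrows, not rows of the ‘business’ data frame: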

rmaltmedrows = unique(c(grep("Acupuncture", alldata$business$categories[medrows]),
                        grep("Massage", alldata$business$categories[medrows]),
                        grep("Naturopath", alldata$business$categories[medrows]),
                        grep("Psychic", alldata$business$categories[medrows]),
                        grep("Yoga", alldata$business$categories[medrows]),
                        grep("Spas", alldata$business$categories[medrows]),
                        grep("Food", alldata$business$categories[medrows]),
                        grep("Fitness", alldata$business$categories[medrows]),
                        grep("Osteopath", alldata$business$categories[medrows]),
                        grep("Chinese Medicine", alldata$business$categories[medrows]),
                        grep("Shopping", alldata$business$categories[medrows]),
                        grep("Chiropractor", alldata$business$categories[medrows]),
                        grep("Cannabis", alldata$business$categories[medrows]),
                        grep("Reflexology", alldata$business$categories[medrows]),
                        grep("Rolfing", alldata$business$categories[medrows]),
                        grep("Coach", alldata$business$categories[medrows]),
                        grep("Reiki", alldata$business$categories[medrows])))

Before excluding the categories above, we have 2632 businesses identified as conventional medical establishments.

medrows = medrows[-rmaltmedrows]

Afterwards, we have 1879 establishments remaining.
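
A general caution about dropping elements with a negative index in R: if the vector of positions to drop happens to be empty, the result is an empty vector rather than the original one. That did not happen here, since the exclusion terms matched many businesses, but it can bite when a search finds nothing. The sketch below uses made-up values; `rows`, `drop`, and `none` are illustrative names.

```r
rows <- c(10L, 20L, 30L, 40L)   # made-up row indices

drop <- c(2L, 4L)               # positions within 'rows' to remove
kept <- rows[-drop]             # 10 30, as intended

none <- integer(0)              # what grep() returns when nothing matches
oops <- rows[-none]             # integer(0): every element is lost

# A variant that behaves sensibly in both cases (assumes 'rows' has no duplicates)
safe_some <- setdiff(rows, rows[drop])
safe_none <- setdiff(rows, rows[none])
```

Using setdiff on the values, rather than a negative index on the positions, leaves the vector untouched when there is nothing to remove.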

Let’s take a look at all the business categories that are associated with our conventional medical establishments:

table(unlist(alldata$business$categories[medrows]))
## 
##                    Allergists             Anesthesiologists 
##                            15                             3 
##                   Audiologist                 Cardiologists 
##                             3                            10 
##       Colleges & Universities             Cosmetic Dentists 
##                             1                           263 
##             Cosmetic Surgeons    Counseling & Mental Health 
##                            26                             3 
##                      Dentists                Dermatologists 
##                           747                            62 
##            Diagnostic Imaging           Diagnostic Services 
##                            13                            18 
##                       Doctors             Ear Nose & Throat 
##                           906                            28 
##                     Education           Employment Agencies 
##                             3                             1 
##              Endocrinologists                  Endodontists 
##                             7                            26 
##               Family Practice                     Fertility 
##                           142                            10 
##             First Aid Classes            Gastroenterologist 
##                             1                             7 
##             General Dentistry              Health & Medical 
##                           494                          1879 
##         Hearing Aid Providers              Home Health Care 
##                             2                             1 
##                     Hospitals             Internal Medicine 
##                            98                            60 
##            Laboratory Testing       Laser Eye Surgery/Lasik 
##                             3                            21 
##               Medical Centers                      Midwives 
##                           214                             6 
##                   Neurologist                 Nutritionists 
##                             9                             1 
## Obstetricians & Gynecologists          Occupational Therapy 
##                           126                             1 
##                    Oncologist              Ophthalmologists 
##                             5                            40 
##                  Optometrists                 Oral Surgeons 
##                            34                            65 
##                 Orthodontists                  Orthopedists 
##                           113                            46 
##                     Orthotics            Pediatric Dentists 
##                             2                            89 
##                 Pediatricians                 Periodontists 
##                            94                            25 
##                      Pharmacy              Physical Therapy 
##                            10                            15 
##                   Podiatrists         Professional Services 
##                            29                             1 
##                   Prosthetics                 Psychiatrists 
##                             1                             4 
##                 Pulmonologist                  Radiologists 
##                             1                            17 
##         Rehabilitation Center              Retirement Homes 
##                             2                             1 
##               Rheumatologists             Specialty Schools 
##                             3                             2 
##             Speech Therapists               Sports Medicine 
##                             1                            27 
##                      Surgeons                Tattoo Removal 
##                             5                             3 
##                   Urgent Care                    Urologists 
##                            72                             5 
##               Walk-in Clinics           Weight Loss Centers 
##                             1                             9

The categories look okay. None of the categories are clearly associated with alternative medicine, and nearly all of them are clearly associated with conventional medicine (with a few exceptions like ‘Colleges & Universities’ and ‘Employment Agencies’).

Now we’ll follow the same process for alternative medicine: search for the alternative medical establishments and exclude any associated with conventional medicine.

Below are the terms for alternative medicine:

altmedrows = unique(c(grep("Acupuncture", alldata$business$categories), 
                      grep("Chiropractor", alldata$business$categories),
                      grep("Chinese Medicine", alldata$business$categories),
                      grep("Reflexology", alldata$business$categories),
                      grep("Reiki", alldata$business$categories),
                      grep("Osteopath", alldata$business$categories),
                      grep("Rolfing", alldata$business$categories),
                      grep("Naturopathic", alldata$business$categories)))

And here are the terms associated with conventional medicine that we’ll exclude:

rmmedrows = unique(c(grep("Dermatologists", alldata$business$categories[altmedrows]),
                     grep("Neurologist", alldata$business$categories[altmedrows]),
                     grep("Obstetrician", alldata$business$categories[altmedrows]),
                     grep("Gynecologist", alldata$business$categories[altmedrows]),
                     grep("Orthopedist", alldata$business$categories[altmedrows]),
                     grep("Allergist", alldata$business$categories[altmedrows]),
                     grep("Internal Medicine", alldata$business$categories[altmedrows])))

Before excluding the categories above, we have 378 businesses identified as alternative medical establishments.

altmedrows = altmedrows[-rmmedrows]

Afterwards, we have 370 alternative medical establishments remaining.

Here are all the business categories associated with the alternative medical establishments:

table(unlist(alldata$business$categories[altmedrows]))
## 
##                  Active Life                  Acupuncture 
##                            5                           66 
##         Arts & Entertainment                Beauty & Spas 
##                            4                           66 
##   Books, Mags, Music & Video                   Bookstores 
##                            1                            1 
##             Cannabis Clinics                Chiropractors 
##                            2                          246 
##      Colleges & Universities                     Day Spas 
##                            1                           12 
##          Diagnostic Services                      Doctors 
##                            1                           51 
##                   Drugstores                    Education 
##                            1                            1 
##              Family Practice                    Fertility 
##                            6                            1 
##        Fitness & Instruction                         Food 
##                            4                            1 
##                 Hair Removal               Health Markets 
##                            2                            1 
##             Health & Medical             Home Health Care 
##                          370                            1 
##        Hypnosis/Hypnotherapy                   Life Coach 
##                            1                            5 
##                      Massage              Massage Therapy 
##                           48                           47 
##              Medical Centers                 Medical Spas 
##                            8                           12 
##           Meditation Centers        Naturopathic/Holistic 
##                            1                           40 
##                Nutritionists       Osteopathic Physicians 
##                           10                            2 
##             Physical Therapy                      Pilates 
##                           29                            2 
##        Professional Services       Psychics & Astrologers 
##                            5                            4 
##                  Reflexology        Rehabilitation Center 
##                           43                            3 
##                        Reiki                      Rolfing 
##                           13                            2 
##                     Shopping                    Skin Care 
##                            2                            7 
##               Specialty Food              Sports Medicine 
##                            1                            3 
## Traditional Chinese Medicine                     Trainers 
##                           15                            2 
##                  Urgent Care          Weight Loss Centers 
##                            1                            6

The categories look reasonable: lots of chiropractors, some acupuncture, some massage, some naturopathy, and so on. There are some terms that are also associated with conventional medicine, like ‘Doctors’ and ‘Physical Therapy’, but those terms could reasonably apply to alternative medicine, too. Overall, I think we have a pretty good set of businesses that are largely involved in alternative medicine.

As I mentioned above, there’s some subjectivity in what to include and exclude from conventional and alternative medicine. For example, yoga is often associated with alternative medicine. But yoga is exercise, which is clearly consistent with conventional medical advice, even though it’s not specifically provided by conventional medical establishments. Thus, yoga businesses don’t fit clearly into either category, and we want to clearly distinguish between the two kinds of medicine for our analysis. So, we haven’t included yoga businesses, but others might choose differently for other purposes.

Much of the problem in distinguishing conventional and alternative medicine is that there’s overlap between them, even while they are different. One key feature that distinguishes modern conventional medicine from earlier medicine and from today’s alternative medicine is that conventional medicine works. That is, we can measure the effects of conventional medical treatments with rigorous, scientifically accepted methods and show that they prevent, alleviate, or cure diseases. You want to prove that your treatment works? Show us the evidence.

That simple story is largely true, but in the real world there are complications; boundaries are often fuzzier than we would like them to be. Conventional medicine is largely evidence-based, but not entirely. And alternative medicine generally has little or no scientific evidence on its side, though some of it does. So inevitably, there must be some judgment calls in trying to distinguish between them. The categories that we’ve outlined above seem to capture the common classification wisdom well enough.

Did any of the establishments get classified as both conventional and alternative medicine?

intersect(medrows, altmedrows)
## integer(0)
intersect(alldata$business$business_id[medrows], alldata$business$business_id[altmedrows])
## character(0)

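Good, none of the establishments are classified as both; our conventional and alternative medical establishments are distinct from each other. That is what we should expect, because every term that we used to select alternative medicine was also among the terms that we excluded from conventional medicine.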

Now that we’ve identified our conventional and alternative medical establishments, let’s get the reviews about them:

# get row indices for reviews of conventional medicine businesses
# (start from an empty vector so no placeholder has to be removed afterwards)
medrevrows = integer(0)
for (iter in seq_along(medrows)) {
     medrevrows = c(medrevrows, which(alldata$review$business_id ==
                                      alldata$business$business_id[medrows[iter]]))
}

# get row indices for reviews of alternative medicine businesses
# (start from an empty vector so no placeholder has to be removed afterwards)
altmedrevrows = integer(0)
for (iter in seq_along(altmedrows)) {
     altmedrevrows = c(altmedrevrows, which(alldata$review$business_id ==
                                            alldata$business$business_id[altmedrows[iter]]))
}
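
The same selection can also be expressed without a loop, using a vectorized membership test with %in%. The sketch below is an alternative under toy assumptions, not the code that produced the numbers in this post: `business_ids`, `review_ids`, and `sel` are made-up stand-ins for `alldata$business$business_id`, `alldata$review$business_id`, and `medrows`.

```r
business_ids <- c("b1", "b2", "b3")                   # made-up business IDs
review_ids <- c("b2", "b9", "b1", "b2", "b3", "b9")   # made-up review table column
sel <- c(3, 1)                                        # selected business rows

# Loop version: review rows come out grouped by business, in the order of 'sel'
loop_rows <- integer(0)
for (i in seq_along(sel)) {
  loop_rows <- c(loop_rows, which(review_ids == business_ids[sel[i]]))
}

# Vectorized version: the same review rows, but in ascending row order
vec_rows <- which(review_ids %in% business_ids[sel])

print(loop_rows)
print(vec_rows)
```

Both versions pick out the same set of reviews; only the order of the rows differs, which matters only if later code depends on that order.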

That gives us 12957 reviews about conventional medicine and 2883 reviews about alternative medicine.

We’ll put them into their own data frame:

medrevs = alldata$review[medrevrows, c(1:4, 6, 8)]
altmedrevs = alldata$review[altmedrevrows, c(1:4, 6, 8)]

medrevs[ , "medicine"] = "conventional"
altmedrevs[ , "medicine"] = "alternative"

revs = rbind(medrevs, altmedrevs)

But wait. The number of medical establishments that we originally identified isn’t the same as the number of establishments that ended up in our data frame of reviews:

length(medrows)
## [1] 1879
length(altmedrows)
## [1] 370
length(medrows) + length(altmedrows)
## [1] 2249
length(unique(revs$business_id))
## [1] 2226

It looks like 23 establishments don’t appear in the reviews data frame. Did we make a mistake?

Or maybe those 23 establishments simply weren’t reviewed. Let’s check whether the business IDs for those 23 establishments appear anywhere in Yelp’s data table of reviews:

length(setdiff(alldata$business$business_id[medrows], alldata$review$business_id))
## [1] 21
length(setdiff(alldata$business$business_id[altmedrows], alldata$review$business_id))
## [1] 2

Aha, there were 21 conventional medical establishments that weren’t reviewed and 2 alternative medical establishments that weren’t reviewed. Together, those fully account for the 23 missing establishments (2249 − 23 = 2226). We didn’t make a mistake. Those establishments simply had no reviews to add to our reviews data frame.
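
The bookkeeping behind that check can be sketched on toy data: every selected business is either found in the review table or not, so the two counts must add up to the number selected. The IDs below are made up, and `selected`, `reviewed`, and `unreviewed` are illustrative names.

```r
selected <- c("b1", "b2", "b3", "b4", "b5")   # made-up IDs of selected businesses
reviewed <- c("b1", "b1", "b3", "b4", "b9")   # made-up business_id column of reviews

# Selected businesses that never appear in the review table
unreviewed <- setdiff(selected, reviewed)

# Selected businesses that do appear at least once
n_in_reviews <- length(intersect(selected, reviewed))

# The two groups partition the selected businesses
stopifnot(length(selected) == n_in_reviews + length(unreviewed))
print(unreviewed)
```

With the real data, the same identity is what reconciles the 2249 selected establishments with the 2226 that appear in the reviews data frame.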

Now that we’re sure we have all the correct reviews, we’ll save the data to an external file:

saveRDS(revs, file = "revs.rds")