Identifying Reviews of Medicine
Previously, we read the data set of Yelp business reviews into R.
Now we want to identify which reviews are about conventional medical establishments, like hospitals and physicians’ offices, and which are about alternative medicine, like acupuncture.
We saved our data to an external file that we can read back into R:
alldata <- readRDS("revs_json_read.rds")
The Yelp data set includes all kinds of businesses – restaurants, hotels, hardware stores – but we’re interested only in medical establishments. Each business is tagged with the categories that it’s in. We can check in the ‘business’ data frame to find the categories:
colnames(alldata$business)
## [1] "business_id" "full_address" "hours" "open"
## [5] "categories" "city" "review_count" "name"
## [9] "neighborhoods" "longitude" "state" "stars"
## [13] "latitude" "attributes" "type"
head(alldata$business$categories)
## [[1]]
## [1] "Doctors" "Health & Medical"
##
## [[2]]
## [1] "Nightlife"
##
## [[3]]
## [1] "Active Life" "Mini Golf" "Golf"
##
## [[4]]
## [1] "Shopping" "Home Services"
## [3] "Internet Service Providers" "Mobile Phones"
## [5] "Professional Services" "Electronics"
##
## [[5]]
## [1] "Bars" "American (New)" "Nightlife" "Lounges"
## [5] "Restaurants"
##
## [[6]]
## [1] "Bars" "American (Traditional)"
## [3] "Nightlife" "Restaurants"
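One way to carry out that kind of inspection is to flatten the list of category tags and count them. The sketch below uses a small made-up list named `categories` as a stand-in for `alldata$business$categories`; the data and variable names are illustrative assumptions, not part of the Yelp data set.

```r
# Hypothetical stand-in for alldata$business$categories
categories <- list(
  c("Doctors", "Health & Medical"),
  c("Nightlife"),
  c("Bars", "Nightlife", "Restaurants"),
  c("Dentists", "Health & Medical")
)

# Flatten the list and count how often each category tag occurs
cat_counts <- sort(table(unlist(categories)), decreasing = TRUE)
print(cat_counts)

# Restrict the count to businesses tagged "Health & Medical"
is_health <- vapply(categories,
                    function(x) "Health & Medical" %in% x,
                    logical(1))
health_counts <- table(unlist(categories[is_health]))
print(health_counts)
```

Counting only the tags that co-occur with ‘Health & Medical’ is one plausible way to shortlist candidate medical terms before choosing among them by hand.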
First, we’ll search for only the conventional medical establishments. After some further inspection of the categories, I chose the medical terms in the code below to identify conventional medicine:
medrows = unique(c(grep("Doctor", alldata$business$categories),
grep("Hospital", alldata$business$categories),
grep("Allergist", alldata$business$categories),
grep("Anesthesiologist", alldata$business$categories),
grep("Cardiologist", alldata$business$categories),
grep("Surgeon", alldata$business$categories),
grep("Dentist", alldata$business$categories),
grep("Drugstore", alldata$business$categories),
grep("Ear Nose & Throat", alldata$business$categories),
grep("Endodontist", alldata$business$categories),
grep("Internal Medicine", alldata$business$categories),
grep("Laser Eye Surgery/Lasik", alldata$business$categories),
grep("Obstetrician", alldata$business$categories),
grep("Gastroenterologist", alldata$business$categories),
grep("Gynecologist", alldata$business$categories),
grep("Ophthalmologist", alldata$business$categories),
grep("Oncologist", alldata$business$categories),
grep("Orthodontist", alldata$business$categories),
grep("Orthopedist", alldata$business$categories),
grep("Orthotic", alldata$business$categories),
grep("Pediatric", alldata$business$categories),
grep("Periodontist", alldata$business$categories),
grep("Pharmacy", alldata$business$categories),
grep("Podiatrist", alldata$business$categories),
grep("Psychiatrist", alldata$business$categories),
grep("Pulmonologist", alldata$business$categories),
grep("Radiologist", alldata$business$categories),
grep("Rheumatologist", alldata$business$categories),
grep("Urologist", alldata$business$categories),
grep("Medical Center", alldata$business$categories)))
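The same kind of search can also be written more compactly by keeping the terms in a vector and joining them into a single regular expression. This is only a sketch on made-up data: `categories`, `med_terms`, `cat_strings`, and `medrows_toy` are hypothetical names, and the three terms are just a subset of the list above.

```r
# Hypothetical stand-in for alldata$business$categories
categories <- list(
  c("Doctors", "Health & Medical"),
  c("Nightlife"),
  c("Dentists", "Health & Medical"),
  c("Acupuncture", "Health & Medical")
)

# Search terms kept in one vector so they are easy to review and edit
med_terms <- c("Doctor", "Hospital", "Dentist")

# Join the terms into one alternation pattern: "Doctor|Hospital|Dentist"
med_pattern <- paste(med_terms, collapse = "|")

# Collapse each business's tags into a single string to search
cat_strings <- vapply(categories, paste, character(1), collapse = "; ")

# Row indices of businesses whose tags match any of the terms
medrows_toy <- grep(med_pattern, cat_strings)
print(medrows_toy)
```

This version returns the matching rows in ascending order rather than grouped by term, and it assumes that none of the terms contain regular-expression metacharacters that would need escaping.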
There are some judgment calls when choosing which terms to use. For example, I thought about including ‘Health’ as a term. Including it didn’t change the number of physicians or hospitals, but it did increase the numbers of establishments in the categories of ‘Counseling & Mental Health’, ‘Diagnostic Imaging’, ‘Diagnostic Services’, and ‘Optometrists’. It’s debatable whether those extra establishments should be included, but I chose to leave them out.
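A comparison like the one just described can be sketched by running the search with and without the extra term and tabulating the categories of the businesses that only the wider search picks up. Everything here is illustrative: the toy `categories` list and the helper `find_rows()` are assumptions made for the example.

```r
# Hypothetical stand-in for alldata$business$categories
categories <- list(
  c("Doctors", "Health & Medical"),
  c("Nightlife"),
  c("Optometrists", "Health & Medical"),
  c("Counseling & Mental Health", "Health & Medical"),
  c("Hospitals", "Health & Medical")
)

# Hypothetical helper: rows whose tags match any of the given terms
find_rows <- function(terms, cats) {
  cat_strings <- vapply(cats, paste, character(1), collapse = "; ")
  grep(paste(terms, collapse = "|"), cat_strings)
}

base_rows <- find_rows(c("Doctor", "Hospital"), categories)
wide_rows <- find_rows(c("Doctor", "Hospital", "Health"), categories)

# Businesses that appear only when "Health" is added as a term
extra_rows <- setdiff(wide_rows, base_rows)

# Which categories do those extra businesses belong to?
print(table(unlist(categories[extra_rows])))
```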
But one thing that we definitely want to exclude from our conventional medical establishments is anything associated with alternative medicine, because we want to make sure that there is a distinct separation between those two categories.
Some of our conventional medical establishments might be involved in things other than just conventional medicine:
alldata$business$categories[medrows[c(89, 43, 62)]]
## [[1]]
## [1] "Doctors" "Shopping"
## [3] "Beauty & Spas" "Health & Medical"
## [5] "Cosmetics & Beauty Supply" "Hair Removal"
## [7] "Dermatologists"
##
## [[2]]
## [1] "Doctors" "Health & Medical" "Urgent Care"
## [4] "Chiropractors"
##
## [[3]]
## [1] "Doctors" "Shopping" "Optometrists"
## [4] "Ophthalmologists" "Health & Medical" "Eyewear & Opticians"
In the examples above, there are some chiropractors, some shopping, and apparently a spa. I don’t know exactly what happens at any of these specific establishments – maybe some of the categories are mistaken – but these establishments might be providing services other than conventional medicine. To be certain that the establishments we include are ‘purely’ conventional medical establishments, we’ll exclude any that stray into these other categories.
The code below lists the terms that we’ll use to exclude establishments from our conventional medical category:
rmaltmedrows = unique(c(grep("Acupuncture", alldata$business$categories[medrows]),
grep("Massage", alldata$business$categories[medrows]),
grep("Naturopath", alldata$business$categories[medrows]),
grep("Psychic", alldata$business$categories[medrows]),
grep("Yoga", alldata$business$categories[medrows]),
grep("Spas", alldata$business$categories[medrows]),
grep("Food", alldata$business$categories[medrows]),
grep("Fitness", alldata$business$categories[medrows]),
grep("Osteopath", alldata$business$categories[medrows]),
grep("Chinese Medicine", alldata$business$categories[medrows]),
grep("Shopping", alldata$business$categories[medrows]),
grep("Chiropractor", alldata$business$categories[medrows]),
grep("Cannabis", alldata$business$categories[medrows]),
grep("Reflexology", alldata$business$categories[medrows]),
grep("Rolfing", alldata$business$categories[medrows]),
grep("Coach", alldata$business$categories[medrows]),
grep("Reiki", alldata$business$categories[medrows])))
Before excluding the categories above, we have 2632 businesses identified as conventional medical establishments.
medrows = medrows[-rmaltmedrows]
Afterwards, we have 1879 establishments remaining.
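One caveat about removing rows with a negative index: if the vector of positions to drop happens to be empty, R returns an empty result instead of leaving the original vector unchanged. That doesn’t happen here, because the exclusion terms do match some establishments, but a more defensive version is sketched below. The vector `rows` and the helper `drop_positions()` are hypothetical names used only for illustration.

```r
rows <- c(10L, 20L, 30L, 40L)

# Usual case: drop the 2nd and 4th elements
print(rows[-c(2L, 4L)])

# Pitfall: an empty set of positions selects nothing at all
print(rows[-integer(0)])

# Hypothetical helper that keeps every position not listed in 'drop',
# so an empty 'drop' leaves 'rows' unchanged
drop_positions <- function(rows, drop) {
  rows[setdiff(seq_along(rows), drop)]
}

print(drop_positions(rows, c(2L, 4L)))
print(drop_positions(rows, integer(0)))
```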
Let’s take a look at all the business categories that are associated with our conventional medical establishments:
table(unlist(alldata$business$categories[medrows]))
##
## Allergists Anesthesiologists
## 15 3
## Audiologist Cardiologists
## 3 10
## Colleges & Universities Cosmetic Dentists
## 1 263
## Cosmetic Surgeons Counseling & Mental Health
## 26 3
## Dentists Dermatologists
## 747 62
## Diagnostic Imaging Diagnostic Services
## 13 18
## Doctors Ear Nose & Throat
## 906 28
## Education Employment Agencies
## 3 1
## Endocrinologists Endodontists
## 7 26
## Family Practice Fertility
## 142 10
## First Aid Classes Gastroenterologist
## 1 7
## General Dentistry Health & Medical
## 494 1879
## Hearing Aid Providers Home Health Care
## 2 1
## Hospitals Internal Medicine
## 98 60
## Laboratory Testing Laser Eye Surgery/Lasik
## 3 21
## Medical Centers Midwives
## 214 6
## Neurologist Nutritionists
## 9 1
## Obstetricians & Gynecologists Occupational Therapy
## 126 1
## Oncologist Ophthalmologists
## 5 40
## Optometrists Oral Surgeons
## 34 65
## Orthodontists Orthopedists
## 113 46
## Orthotics Pediatric Dentists
## 2 89
## Pediatricians Periodontists
## 94 25
## Pharmacy Physical Therapy
## 10 15
## Podiatrists Professional Services
## 29 1
## Prosthetics Psychiatrists
## 1 4
## Pulmonologist Radiologists
## 1 17
## Rehabilitation Center Retirement Homes
## 2 1
## Rheumatologists Specialty Schools
## 3 2
## Speech Therapists Sports Medicine
## 1 27
## Surgeons Tattoo Removal
## 5 3
## Urgent Care Urologists
## 72 5
## Walk-in Clinics Weight Loss Centers
## 1 9
The categories look okay. None of the categories are clearly associated with alternative medicine, and nearly all of them are clearly associated with conventional medicine (with a few exceptions like ‘Colleges & Universities’ and ‘Employment Agencies’).
Now we’ll follow the same process for alternative medicine: search for the alternative medical establishments and exclude any associated with conventional medicine.
Below are the terms for alternative medicine:
altmedrows = unique(c(grep("Acupuncture", alldata$business$categories),
grep("Chiropractor", alldata$business$categories),
grep("Chinese Medicine", alldata$business$categories),
grep("Reflexology", alldata$business$categories),
grep("Reiki", alldata$business$categories),
grep("Osteopath", alldata$business$categories),
grep("Rolfing", alldata$business$categories),
grep("Naturopathic", alldata$business$categories)))
And here are the terms associated with conventional medicine that we’ll exclude:
rmmedrows = unique(c(grep("Dermatologists", alldata$business$categories[altmedrows]),
grep("Neurologist", alldata$business$categories[altmedrows]),
grep("Obstetrician", alldata$business$categories[altmedrows]),
grep("Gynecologist", alldata$business$categories[altmedrows]),
grep("Orthopedist", alldata$business$categories[altmedrows]),
grep("Allergist", alldata$business$categories[altmedrows]),
grep("Internal Medicine", alldata$business$categories[altmedrows])))
Before excluding the categories above, we have 378 businesses identified as alternative medical establishments.
altmedrows = altmedrows[-rmmedrows]
Afterwards, we have 370 alternative medical establishments remaining.
Here are all the business categories associated with the alternative medical establishments:
table(unlist(alldata$business$categories[altmedrows]))
##
## Active Life Acupuncture
## 5 66
## Arts & Entertainment Beauty & Spas
## 4 66
## Books, Mags, Music & Video Bookstores
## 1 1
## Cannabis Clinics Chiropractors
## 2 246
## Colleges & Universities Day Spas
## 1 12
## Diagnostic Services Doctors
## 1 51
## Drugstores Education
## 1 1
## Family Practice Fertility
## 6 1
## Fitness & Instruction Food
## 4 1
## Hair Removal Health Markets
## 2 1
## Health & Medical Home Health Care
## 370 1
## Hypnosis/Hypnotherapy Life Coach
## 1 5
## Massage Massage Therapy
## 48 47
## Medical Centers Medical Spas
## 8 12
## Meditation Centers Naturopathic/Holistic
## 1 40
## Nutritionists Osteopathic Physicians
## 10 2
## Physical Therapy Pilates
## 29 2
## Professional Services Psychics & Astrologers
## 5 4
## Reflexology Rehabilitation Center
## 43 3
## Reiki Rolfing
## 13 2
## Shopping Skin Care
## 2 7
## Specialty Food Sports Medicine
## 1 3
## Traditional Chinese Medicine Trainers
## 15 2
## Urgent Care Weight Loss Centers
## 1 6
The categories look reasonable: lots of chiropractors, some acupuncture, some massage, some naturopathy, and so on. There are some terms that are also associated with conventional medicine, like ‘Doctors’ and ‘Physical Therapy’, but those terms could reasonably apply to alternative medicine, too. Overall, I think we have a pretty good set of businesses that are largely involved in alternative medicine.
As I mentioned above, there’s some subjectivity in what to include in and exclude from conventional and alternative medicine. For example, yoga is often associated with alternative medicine. But yoga is exercise, which is consistent with conventional medical advice, even though conventional medical establishments don’t specifically provide it. Thus, yoga businesses don’t fit neatly into one category of medicine rather than the other, and we want to clearly distinguish between the two for our analysis. So we haven’t included yoga businesses, though others might choose differently for other purposes.
Much of the problem in distinguishing conventional and alternative medicine is that there’s overlap between them, even while they are different. One key distinguishing feature of modern conventional medicine versus earlier medicine and today’s alternative medicine is that it works. That is, we can measure the effects of conventional medical treatments with rigorous, scientifically accepted methods and show that they prevent, alleviate, or cure diseases. You want to prove that your treatment works? Show us the evidence.
The simple story I just told is largely true, but in the real world there are complications; boundaries are often fuzzier than we would like them to be. Conventional medicine is largely evidence-based, but not entirely. And alternative medicine generally has little or no scientific evidence on its side, though some of it does. So inevitably, there must be some judgment calls in trying to distinguish between them. The categories that we’ve outlined above seem to capture the common classification wisdom well enough.
Did any of the establishments get classified as both conventional and alternative medicine?
intersect(medrows, altmedrows)
## integer(0)
intersect(alldata$business$business_id[medrows], alldata$business$business_id[altmedrows])
## character(0)
Good, none of the establishments are classified as both; our conventional and alternative medical establishments are distinct from each other.
Now that we’ve identified our conventional and alternative medical establishments, let’s get the reviews about them:
# get row indices for reviews of conventional medicine businesses
# (start from an empty integer vector so no placeholder has to be removed)
medrevrows = integer(0)
for (iter in seq_along(medrows)) {
  medrevrows = c(medrevrows, which(alldata$review$business_id ==
                                     alldata$business$business_id[medrows[iter]]))
}
# get row indices for reviews of alternative medicine businesses
altmedrevrows = integer(0)
for (iter in seq_along(altmedrows)) {
  altmedrevrows = c(altmedrevrows, which(alldata$review$business_id ==
                                           alldata$business$business_id[altmedrows[iter]]))
}
That gives us 12957 reviews about conventional medicine and 2883 reviews about alternative medicine.
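The same set of review rows can also be found without a loop by using `%in%`. The sketch below compares the two approaches on made-up data; the `business` and `review` data frames and the `medrows_toy` vector are hypothetical stand-ins for the Yelp objects.

```r
# Hypothetical stand-ins for alldata$business and alldata$review
business <- data.frame(business_id = c("b1", "b2", "b3"),
                       stringsAsFactors = FALSE)
review <- data.frame(business_id = c("b2", "b1", "b3", "b1", "b2"),
                     stringsAsFactors = FALSE)

# Pretend businesses 1 and 3 were identified as medical establishments
medrows_toy <- c(1L, 3L)

# Loop version: collect matching review rows one business at a time
loop_rows <- integer(0)
for (i in seq_along(medrows_toy)) {
  loop_rows <- c(loop_rows,
                 which(review$business_id ==
                         business$business_id[medrows_toy[i]]))
}

# Vectorized version: test every review's business ID in one step
vec_rows <- which(review$business_id %in%
                    business$business_id[medrows_toy])

print(loop_rows)
print(vec_rows)
```

Both versions select the same reviews, but the order differs: the loop groups rows by business, while `%in%` returns them in the order they appear in the review table.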
We’ll put them into their own data frame:
medrevs = alldata$review[medrevrows, c(1:4, 6, 8)]
altmedrevs = alldata$review[altmedrevrows, c(1:4, 6, 8)]
medrevs[ , "medicine"] = "conventional"
altmedrevs[ , "medicine"] = "alternative"
revs = rbind(medrevs, altmedrevs)
But wait. The number of medical establishments that we originally identified isn’t the same as the number of establishments that ended up in our data frame of reviews:
length(medrows)
## [1] 1879
length(altmedrows)
## [1] 370
length(medrows) + length(altmedrows)
## [1] 2249
length(unique(revs$business_id))
## [1] 2226
It looks like 23 establishments don’t appear in the reviews data frame. Did we make a mistake?
Or maybe those 23 establishments simply weren’t reviewed. Let’s check whether the business IDs for those 23 establishments appear anywhere in Yelp’s data table of reviews:
length(setdiff(alldata$business$business_id[medrows], alldata$review$business_id))
## [1] 21
length(setdiff(alldata$business$business_id[altmedrows], alldata$review$business_id))
## [1] 2
Ah ha, there were 21 conventional medical establishments that weren’t reviewed and 2 alternative medical establishments that weren’t reviewed. Together, those fully account for the 23 missing establishments. We didn’t make a mistake. Those establishments simply had no reviews to add to our reviews data frame.
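The arithmetic behind that reconciliation can be written out as a quick check. The numbers are the counts reported above; the variable names are made up for this sketch.

```r
# Counts reported above
n_med        <- 1879  # conventional establishments identified
n_alt        <- 370   # alternative establishments identified
n_in_reviews <- 2226  # distinct establishments in the reviews data frame
n_med_unrev  <- 21    # conventional establishments with no reviews
n_alt_unrev  <- 2     # alternative establishments with no reviews

# Establishments identified but absent from the reviews data frame
n_missing <- (n_med + n_alt) - n_in_reviews

# The unreviewed establishments should account for all of them
stopifnot(n_missing == n_med_unrev + n_alt_unrev)
print(n_missing)
```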
Now that we’re sure we have all the correct reviews, we’ll save the data to an external file:
saveRDS(revs, file = "revs.rds")