Classifying Medicine

Introduction

In recent centuries, conventional medicine has developed practices to treat disease that are largely based on scientific evidence, and it has developed professional training and licensing standards to ensure that medical treatments are delivered safely and effectively. Practitioners of alternative medicine – such as naturopathy and reiki – also claim to treat disease and improve health, even though their practices are seldom supported by robust scientific evidence.

Patients, or patrons, of conventional and alternative medicines have similar goals of alleviating disease and improving health. But do these patients/patrons have different perceptions of their treatment experiences with each kind of medicine? And if so, what drives these different perceptions? Any such differences might arise from different types of patients/patrons, different goals for treatment, different treatment results, different “customer” service (e.g., “bedside manner”), or other factors.

Analyzing customer reviews of medical businesses/establishments might provide insight into these questions. The company Yelp hosts a website where customers may rate and review many types of businesses, including conventional and alternative medical establishments. Yelp reviews were analyzed to provide insight into the questions posed above. More specifically, do patients/patrons of each kind of medicine rate their treatment experiences differently? Can textual analysis of the reviews of medical establishments predict whether the review was written about conventional or alternative medicine? If so, the predictive factors may illuminate how the treatment experiences differ between conventional and alternative medicine. Such information might be useful to those who run medical establishments, who may wish to understand why patrons find their own establishment, or competing establishments, appealing. Public policy makers, healthcare investors, and researchers of the healthcare system might also find the results of this analysis interesting.

Methods

Data source. Yelp reviews were obtained from a publicly available data set that included reviews of businesses in 10 cities in Canada, Germany, the United Kingdom, and the United States (http://www.yelp.com/dataset_challenge).

Identifying reviews of conventional and alternative medicine. Conventional and alternative medical establishments were identified by searching for a broad set of business category terms associated with each type of medicine and then eliminating any business that was categorized with any term associated with the other type of medicine. This approach excluded establishments associated with both conventional and alternative medicine, so that each remaining establishment was unambiguously associated with only one type of medicine.
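
As an illustration of this filtering step, the sketch below assumes each business record carries a comma-separated category string, as in some releases of the Yelp data set; the category terms shown are illustrative examples, not the full lists used in the analysis.

    # Illustrative (not exhaustive) category terms for each kind of medicine.
    CONVENTIONAL_TERMS = {"doctors", "dentists", "hospitals", "optometrists"}
    ALTERNATIVE_TERMS = {"naturopathic/holistic", "acupuncture", "chiropractors", "reiki"}

    def classify_business(categories):
        """Return 'conventional', 'alternative', or None for ambiguous or unrelated
        businesses, given a business's comma-separated category string."""
        cats = {c.strip().lower() for c in (categories or "").split(",")}
        matches_conventional = bool(cats & CONVENTIONAL_TERMS)
        matches_alternative = bool(cats & ALTERNATIVE_TERMS)
        if matches_conventional and not matches_alternative:
            return "conventional"
        if matches_alternative and not matches_conventional:
            return "alternative"
        return None  # excluded: matched terms for both kinds of medicine, or for neither

    print(classify_business("Acupuncture, Reiki"))           # alternative
    print(classify_business("Dentists, General Dentistry"))  # conventional
    print(classify_business("Chiropractors, Doctors"))       # None (ambiguous, excluded)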

Ratings analysis. Each Yelp review had an associated rating of 1 – 5 stars (1 = lowest rating of the business; 5 = highest).

Features / Predictor variables. The final feature set for each review included the numbers of words, characters, syllables, and poly-syllables; the numbers of characters, syllables, and poly-syllables per word; the numbers of periods, question marks, exclamation marks, and incomplete statements; the numbers of pronouns (e.g., “I”, “you”, “it”); polarity; diversity (Shannon, Simpson, and Berger-Parker indices); lexical classification; formality; and the usage rates of the most common words that appeared in the reviews. The number of sentences (and variables based on this metric) was excluded as a feature because the function measuring sentence number relies on punctuation consistent with formal English, which was not reliably present in Yelp reviews.
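
To make the feature set concrete, the sketch below computes a simplified subset of these features for a single review (counts, punctuation, pronouns, and the diversity indices). Syllable counts, polarity, lexical classification, and formality would come from text-analysis libraries and are omitted here; the pronoun list and regular expression are illustrative choices, not necessarily those used in the original analysis.

    import math
    import re
    from collections import Counter

    PRONOUNS = {"i", "you", "it", "we", "they", "he", "she", "me", "my", "your"}

    def review_features(text, common_words=()):
        """Compute a simplified subset of the review-level features described above."""
        words = re.findall(r"[a-z']+", text.lower())
        n = len(words) or 1  # avoid division by zero for empty reviews
        counts = Counter(words)

        features = {
            "n_words": len(words),
            "n_chars": sum(len(w) for w in words),
            "chars_per_word": sum(len(w) for w in words) / n,
            "n_periods": text.count("."),
            "n_questions": text.count("?"),
            "n_exclamations": text.count("!"),
            "n_pronouns": sum(counts[p] for p in PRONOUNS),
            # Diversity of word usage within the review.
            "shannon": -sum((c / n) * math.log(c / n) for c in counts.values()),
            "simpson": 1 - sum((c / n) ** 2 for c in counts.values()),  # Gini-Simpson form
            "berger_parker": (max(counts.values()) / n) if counts else 0.0,
        }
        # Usage rate of each of the most common words across the corpus.
        for w in common_words:
            features["word_" + w] = counts[w] / n
        return features

    print(review_features("I love this place! The massage was so relaxing.",
                          common_words=("massage", "wait")))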

Machine learning classification. The reviews were divided so that 60% were assigned to a training data set and the remainder were assigned to a testing set. These assignments were stratified by star ratings, so that approximately equal proportions of each rating (1 – 5 stars) were in the training and testing sets.

Because there were far fewer reviews of alternative medicine establishments than of conventional medicine establishments (i.e., imbalanced classes), reviews of conventional medicine were downsampled so that equal numbers of reviews of conventional and alternative medicine were used during training iterations.
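
A minimal sketch of the split and the downsampling is below, assuming a pandas DataFrame named reviews that holds the engineered features together with each review's 'stars' rating and 'medicine_type' label (both column names are assumptions for illustration). The original analysis downsampled within training iterations; for simplicity, this sketch draws a single balanced training sample, which a per-tree scheme (e.g., a balanced random forest) would approximate more faithfully.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # 60/40 split, stratified so each star rating appears in similar proportions in both sets.
    train_df, test_df = train_test_split(
        reviews,
        train_size=0.6,
        stratify=reviews["stars"],
        random_state=0,
    )

    # Downsample the majority class (conventional medicine) in the training data only,
    # so that the classifier is trained on balanced classes.
    alternative = train_df[train_df["medicine_type"] == "alternative"]
    conventional = train_df[train_df["medicine_type"] == "conventional"].sample(
        n=len(alternative), random_state=0
    )
    train_balanced = pd.concat([alternative, conventional]).sample(frac=1, random_state=0)  # shuffle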

A random forest algorithm was selected to classify reviews as being about conventional or alternative medicine. Random forest was chosen because it provides more interpretable results than many alternatives. A parameter search was conducted on the training data for the number of predictors to try for optimal splitting at each tree node. The model was then trained at the optimal parameter value and tested. The model’s performance was evaluated with several metrics that are unbiased in the presence of imbalanced classes, including informedness, markedness, and the Matthews correlation coefficient (MCC), as well as with ROC and precision-recall curves.
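
Continuing the sketch above, the model below uses scikit-learn's RandomForestClassifier, with a grid search over max_features standing in for the tuned number of predictors tried at each split; the specific grid and number of trees are illustrative choices, not the values used in the original analysis.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import matthews_corrcoef, make_scorer

    feature_cols = [c for c in train_balanced.columns if c not in ("medicine_type", "stars")]
    X_train, y_train = train_balanced[feature_cols], train_balanced["medicine_type"]
    X_test, y_test = test_df[feature_cols], test_df["medicine_type"]

    # Tune the number of features considered at each split, scoring by MCC.
    search = GridSearchCV(
        RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0),
        param_grid={"max_features": ["sqrt", "log2", 0.1, 0.3]},
        scoring=make_scorer(matthews_corrcoef),
        cv=5,
    )
    search.fit(X_train, y_train)

    model = search.best_estimator_
    print("Test-set MCC:", matthews_corrcoef(y_test, model.predict(X_test)))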

Once the model’s performance was shown to be sufficient, Mean Decreases in Accuracy and in the Gini Index were used to assess the importance of features in making classifications.
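
In scikit-learn terms, the two importance measures could be approximated as below: the impurity-based importances correspond to Mean Decrease in the Gini Index, and permutation importance on held-out data plays the role of Mean Decrease in Accuracy.

    import pandas as pd
    from sklearn.inspection import permutation_importance

    # Mean Decrease in Gini (impurity-based importance), computed during training.
    gini_importance = pd.Series(
        model.feature_importances_, index=feature_cols
    ).sort_values(ascending=False)

    # Mean Decrease in Accuracy, approximated by permutation importance on the test set.
    perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
    accuracy_importance = pd.Series(
        perm.importances_mean, index=feature_cols
    ).sort_values(ascending=False)

    print(gini_importance.head(30))
    print(accuracy_importance.head(30))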

Results

For analysis, 12,895 reviews of 1,847 conventional medical establishments and 2,866 reviews of 368 alternative medical establishments were identified.

I first investigated whether Yelp reviewers rated their treatment experiences differently between conventional and alternative medicine. The proportions of reviews receiving 1, 2, 3, 4, or 5 stars for each kind of medicine are below.
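
For reference, the plotted proportions could be computed directly from the review table, again assuming 'stars' and 'medicine_type' columns as in the earlier sketch.

    # Proportion of reviews at each star rating, within each kind of medicine.
    rating_proportions = (
        reviews.groupby("medicine_type")["stars"]
        .value_counts(normalize=True)
        .unstack()
        .sort_index(axis=1)
    )
    print(rating_proportions)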

Most ratings fall at the extremes of the scale – 1 or 5 stars. This pattern may occur because the patrons, or patients, who are most motivated to write about and rate their experiences on Yelp are probably those who feel most strongly about those experiences. Notably, alternative medicine is rated more highly than conventional medicine: it receives a higher proportion of 5-star ratings and a lower proportion of 1-star ratings, and ratings of 2, 3, or 4 stars are rare enough for both kinds of medicine that they matter little. Even so, a majority of conventional medicine patrons/patients still rate their experiences at the highest possible score, so conventional medicine performs well – just not as well as alternative medicine.

When the random forest model predicted whether reviews were about conventional or alternative medicine on the testing data set, it achieved an MCC of 0.79. The out-of-bag metrics during training were very similar to the corresponding metrics on the testing set, suggesting that the model did not overfit, despite the parameter tuning on the training set. The ROC curve is plotted below.

The random forest model at the default tree voting threshold of 50% is plotted as a solid red dot on the red ROC curve. Black dots represent random and otherwise uninformative classifiers. The black line passing through the red dot is the iso-performance line. The line’s slope deviates from 1 due to the presence of imbalanced classes. The costs of false positives and false negatives were considered equal, and so did not affect the slope of the line.
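
For readers unfamiliar with iso-performance lines, the slope follows a standard result from ROC analysis (Provost & Fawcett): classifiers on the same line have equal expected cost, and the slope is the product of the class ratio and the cost ratio, as sketched below.

    def iso_performance_slope(n_negative, n_positive, cost_fp=1.0, cost_fn=1.0):
        """Slope of an iso-performance line in ROC space.

        With equal misclassification costs, the slope reduces to the
        negative:positive class ratio, which here exceeds 1 because
        conventional-medicine reviews greatly outnumber alternative-medicine reviews."""
        return (n_negative * cost_fp) / (n_positive * cost_fn)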

The presence of a region of the ROC curve to the upper left of the iso-performance line suggested that a different voting threshold could improve the model’s classification performance. A search that maximized the MCC (raising it to 0.81) identified such a threshold (0.54), represented by the green dot in the zoomed-in plot of the ROC space below.
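
One simple way to find such a threshold is to sweep over candidate vote thresholds and keep the one that maximizes the MCC, as sketched below. Note that scikit-learn averages per-tree class probabilities rather than counting hard votes, so this only approximates a vote-share threshold, and in practice the sweep would ideally use training or out-of-bag votes rather than the test set.

    import numpy as np
    from sklearn.metrics import matthews_corrcoef

    # Probability (vote share) that a review is about alternative medicine.
    alt_index = list(model.classes_).index("alternative")
    proba = model.predict_proba(X_test)[:, alt_index]

    thresholds = np.linspace(0.05, 0.95, 181)
    mccs = [
        matthews_corrcoef(y_test, np.where(proba >= t, "alternative", "conventional"))
        for t in thresholds
    ]
    best = thresholds[int(np.argmax(mccs))]
    print("Best threshold: %.2f, MCC: %.2f" % (best, max(mccs)))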

The same classifiers are plotted on a corresponding precision-recall curve below.

The open circles are markedness-informedness points that correspond to the closed precision-recall dots of the same color. As unbiased metrics in the presence of imbalanced classes, the markedness and informedness suggest a less optimistic appraisal of the model’s performance than do precision and recall.
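
For clarity, informedness and markedness are defined from the confusion matrix as below; for binary classification, the MCC satisfies MCC^2 = informedness × markedness, which is why the three metrics are reported together.

    from sklearn.metrics import confusion_matrix

    def informedness_markedness(y_true, y_pred, positive="alternative", negative="conventional"):
        """Informedness = sensitivity + specificity - 1 (Youden's J);
        markedness   = precision + negative predictive value - 1.
        Both are chance-corrected, so they are not inflated by imbalanced classes."""
        tp, fn, fp, tn = confusion_matrix(y_true, y_pred, labels=[positive, negative]).ravel()
        informedness = tp / (tp + fn) + tn / (tn + fp) - 1
        markedness = tp / (tp + fp) + tn / (tn + fn) - 1
        return informedness, markedness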

Given the model’s sufficient performance, the importance of the model’s features might provide some insights into how patrons’/patients’ experiences with conventional and alternative medicine differ. Variable importances for the top 30 features as measured by two methods are plotted below.

Many of the most important features provided sensible but not novel information. The words ‘massage’ and ‘chiropractor’ appeared far more often in reviews of alternative medicine than of conventional medicine, while ‘dentist’ was far more prevalent in reviews of conventional medicine. Such information provides confidence that the model identified salient features, but it does not provide any new insights.

Cursory inspection did reveal some potentially more informative insights. Patients mentioned pain more often for alternative medicine than for conventional, and they rated their alternative medical providers higher when they mentioned pain. Patients wrote about waiting more for conventional medicine than for alternative medicine, and they apparently did not like it. Patients of alternative medicine also mentioned ‘relaxing’ and ‘price’ more often, and they associated their treatment experiences with more positive emotions, compared with conventional medicine. These insights merit a more systematic and rigorous investigation.

Discussion

The random forest model provided generally accurate classifications of reviews according to whether they were written about conventional or alternative medical establishments. Patrons/patients rated alternative medical providers more highly than conventional medical providers and wrote about alternative treatment experiences with more positive emotion. Alternative medical patients wrote about pain more often, compared with conventional medical patients, and they were apparently satisfied with their treatment experience. They also wrote about being relaxed and prices, whereas conventional medical patients wrote about waiting, which they apparently did not like.

The random forest model’s performance generalized very well from the training set to the testing set, suggesting that it did not overfit. Even so, the analysis was implemented on reviews of only 368 alternative medical establishments in only a few cities in North America and Europe. A larger and more geographically diverse data set would give greater confidence that the model’s results truly reflect widespread differences between conventional and alternative medicine.

While the model performs well on multiple metrics, it might be underfit, and the MCC leaves realistic room for notable improvement in its performance. Only the usage rates of the most common words were used as features, which saved computational resources but probably sacrificed some valuable features. Conversely, many of the most common words – especially ‘stop words’ – were probably not useful discriminators and could have been removed during preprocessing. Stemming the words might also reduce the size of the feature set and improve consistency in feature importance.
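
As one example of such preprocessing, stop words could be dropped and the remaining words stemmed before computing usage rates; a minimal sketch using scikit-learn's built-in English stop-word list and NLTK's Porter stemmer is below (other stop-word lists or stemmers would work equally well).

    from nltk.stem import PorterStemmer
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

    stemmer = PorterStemmer()

    def preprocess_tokens(words):
        """Drop common stop words and reduce the remaining words to their stems,
        so that, e.g., 'relaxed' and 'relaxing' map to the same feature."""
        return [stemmer.stem(w) for w in words if w not in ENGLISH_STOP_WORDS]

    print(preprocess_tokens(["i", "was", "relaxed", "and", "relaxing", "after", "the", "massage"]))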

Feature importance might also be improved by re-scaling some of the features. While random forest does not require feature normalization for accurate classifications, non-normalized features can bias the importance scores. In this case, the vast majority of the features, including the most important ones, were on the same scale, but a few were not, and their importance might be exaggerated as currently measured.

Importance was assessed for each feature individually. However, many of the most important words were semantically related to each other. Instead of manually inspecting the individual results to detect patterns, one might use an automated approach like topic modeling. Such an approach could improve this pattern detection and produce more novel insights into the differences between conventional and alternative medicine.
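
A minimal topic-modeling sketch is below, using scikit-learn's LatentDirichletAllocation and assuming the raw review strings are available in a list named review_texts; the number of topics and vocabulary size are illustrative choices.

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(stop_words="english", max_features=5000)
    doc_term = vectorizer.fit_transform(review_texts)

    lda = LatentDirichletAllocation(n_components=20, random_state=0)
    doc_topics = lda.fit_transform(doc_term)  # per-review topic proportions

    # Per-review topic proportions could supplement or replace raw word usage rates as
    # features, grouping semantically related words into a single interpretable signal.
    terms = vectorizer.get_feature_names_out()
    for k, weights in enumerate(lda.components_):
        top_terms = [terms[i] for i in weights.argsort()[-8:][::-1]]
        print("topic %d: %s" % (k, ", ".join(top_terms)))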

Further understanding how patrons/patients of conventional and alternative medicine experience their treatments differently would require a more proactive approach. Online reviewers can provide substantial information, but they are a small and non-representative segment of all patrons/patients. Businesses and other institutions that wish to understand these differences further would want to expand their approach to include survey data, as well as official records, if available. These more diverse sources of information could provide a more complete view of how patients make and evaluate their medical choices.