“Every good work of software starts by scratching a developer’s personal itch.” - Eric Raymond
I recently moved to San Francisco to join the Insight Health Data Science program as a Fellow. One of the first things that I needed to do in a new city was to find a primary care physician, but I didn’t have time to read all of the reviews of doctors online.
I realized it would be nice to have a tool that could distill the pros and cons of each doctor into concise snippets of information, saving me needless hours of digging through lengthy reviews. I decided to build such a tool myself for my Insight project and call it DoctorSnapshot.
When people search for doctors using business review websites, they naturally choose among the doctors that have the highest ratings and a large number of reviews that support those high ratings. These highly-rated doctors could have hundreds or even thousands of reviews under their profiles, and comparing highly-rated doctors to each other becomes a tedious task.
Furthermore, even if there is only one highly-rated doctor, one may still want to read the reviews to see why people like this doctor and if the reviewers addressed his or her concerns. This, again, could be time-consuming. In both cases, some sort of review summarizer would be helpful.
Web services such as Zocdoc and Yelp have offered their own versions of “doctor snapshots” to help users quickly see what other reviewers have said about doctors. Zocdoc rates doctors on three categories: “overall rating,” “bedside manner,” and “wait time.” However, these categories miss other useful points that users raise in their specific reviews. Yelp automatically highlights representative review sentences that share common phrases with other sentences (see example), but no explicit rating is given for the topics mentioned in those sentences.
I decided that my tool would combine the best of both the above products. DoctorSnapshot first detects the topics that have been discussed in the reviews (e.g. bedside manner). Then, it analyzes whether people were talking positively or negatively about those topics, and finally assigns appropriate ratings to the topics.
My first step in building DoctorSnapshot was collecting a large number of reviews. As far as I knew, there was no existing dataset of doctor reviews available online, so the only way to acquire these data was through web-scraping. I started by looking at a database of physicians called BetterDoctor, whose API allows for easy querying of doctors’ profiles by geographical location.
Although BetterDoctor itself does not contain reviews of doctors, it provides a doctor’s Yelp URL if he or she has a Yelp page. I retrieved all doctors in San Francisco from BetterDoctor based on the longitudes and latitudes of their addresses and ended up with 187 doctors who have their own Yelp pages. While 187 doesn’t sound like a large number, this gave me 5,088 reviews totaling about 700,000 words. (For comparison, all seven Harry Potter books together contain about 1 million words.)
Latent Dirichlet Allocation (LDA) is a popular Natural Language Processing (NLP) tool that can automatically identify topics from a corpus. LDA assumes each topic is made of a bag of words with certain probabilities, and each document is made of a bag of topics with certain probabilities — this concept is illustrated in the figure below (see here for a more detailed explanation). The goal of LDA is to learn the word and topic distribution underlying the corpus. Gensim is an NLP package that is particularly suited for LDA and other word-embedding machine learning algorithms, so I used Gensim to implement my project.
I obtained 11 meaningful topics from LDA, which I manually categorized into “general topics” and “doctor specialty topics.”
The general topics are:
The doctor specialty topics are:
As I mentioned above, in LDA each topic is composed of words with associated probabilities. For example, the most frequent words for the “Payments” topic are insurance, pay, company, cover, charge, cost, paid, pocket, price, office, payment, medicine, service, amount, and claim. I should point out that LDA only groups words into topics; it takes a human to interpret a topic and assign it a meaning by looking at its words.
In order to score doctors by the topics mentioned in their reviews, I needed to analyze the sentiments of their reviews. I defined a percentage rating for a topic as the percent of reviews that gave a positive comment when they mentioned the topic (similar to Rotten Tomatoes). I used this metric to assign sentiment scores to topics.
More specifically, I used my trained LDA model to determine the topic composition of each sentence in a doctor’s reviews. If a sentence was dominated by one topic by 70% or more, I considered that sentence as belonging to that specific topic. Then, I calculated the sentiment of the sentence, either positive or negative, and finally counted the total percent of positive sentences in each topic for the final ratings of that doctor.
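The scoring logic above can be sketched in a few lines of plain Python. The topic mixtures and sentiment labels below are made-up stand-ins for the LDA and sentiment-analyzer outputs; only the 70% dominance rule and the percentage rating are from the method described here.

```python
# Per-topic "Rotten Tomatoes" rating: a sentence counts toward a topic only
# if that topic dominates its LDA mixture by >= 70%; the topic's rating is
# the percentage of its sentences with positive sentiment.
from collections import defaultdict

sentences = [
    # (topic mixture, sentiment: +1 positive / -1 negative)
    ({"bedside": 0.9, "payments": 0.1}, +1),
    ({"bedside": 0.8, "payments": 0.2}, +1),
    ({"bedside": 0.5, "payments": 0.5}, -1),   # no dominant topic -> skipped
    ({"bedside": 0.1, "payments": 0.9}, -1),
    ({"bedside": 0.25, "payments": 0.75}, +1),
]

DOMINANCE = 0.70
counts = defaultdict(lambda: [0, 0])           # topic -> [positive, total]

for mixture, sentiment in sentences:
    topic, prob = max(mixture.items(), key=lambda kv: kv[1])
    if prob >= DOMINANCE:                      # keep only clearly on-topic sentences
        counts[topic][1] += 1
        if sentiment > 0:
            counts[topic][0] += 1

ratings = {t: 100 * pos // total for t, (pos, total) in counts.items()}
print(ratings)   # {'bedside': 100, 'payments': 50}
```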
To supplement my ratings by topic, I also added in highlights from reviews for users to read. These highlights are the three most positive and three most negative sentences in a doctor’s reviews, based on the sentiment scores.
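Selecting the highlights is a simple ranking step. The sentences and scores below are invented stand-ins for real review sentences and their sentiment scores:

```python
# Review highlights: rank sentences by sentiment score, keep the three most
# positive and the three most negative.
scored = [
    ("She listened carefully and explained everything.", 0.85),
    ("The wait was over an hour.", -0.55),
    ("Billing was a nightmare.", -0.80),
    ("Best doctor I have ever had.", 0.95),
    ("The front desk was rude.", -0.60),
    ("Very gentle and professional.", 0.70),
    ("Parking nearby is hard to find.", -0.30),
]

ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
top_positive = [text for text, _ in ranked[:3]]
top_negative = [text for text, _ in ranked[-3:]]
print(top_positive)
print(top_negative)
```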
Before using my trained models to generate snapshots of doctors, I tried a couple of ways to validate the model. First, I made sure that medical specialty topics appeared in the reviews of doctors with the matching specialty. I found that the specialty topics automatically learned by my LDA model aligned with the doctors’ recorded specialties (found via the BetterDoctor API). The figure below illustrates this fact. For example, the topic “Skin procedures” only appears in the reviews of dermatologists.
For the topics that can appear in any doctor’s reviews, i.e. the “general topics,” I used a different validation method. LDA is only one of many ways to model text; another widely used word-embedding method, word2vec, is well suited for validating LDA topics. I trained a word2vec neural network and projected the top words of the LDA general topics into the word2vec space. The figure below shows the result in two dimensions, produced with the t-SNE dimensionality reduction method. The LDA general topics separate nicely in the word2vec space, further validation that they are meaningful.
I also tested my choice of sentiment analyzer, VADER. Before settling on VADER, I tried another sentiment analyzer called TextBlob. I plotted each analyzer’s sentiment scores for the reviews (-1 meaning most negative and 1 meaning most positive) against the ratings associated with those reviews. TextBlob gave very similar sentiments across different review ratings, while VADER gave more positive sentiments for higher ratings and vice versa, which made it a good fit for DoctorSnapshot.
After validating both the LDA topics and the sentiment analyzer, I generated “snapshots” for all 187 doctors in my dataset. These snapshots are hosted on Heroku. On the website, you can search for doctors by name and/or address.
For example, in the snapshot below, most people like Dr. Mitchell for her dental work, but people generally had negative experiences with payments and appointments at her clinic. These ratings are reflected in the review highlights section. However, notice that the first and second negative highlights express negative feelings about dentists in general, not about Dr. Mitchell; these reviewers most likely mentioned their bad experiences to illustrate how much they appreciate her.
While these sentences are definitely negative, my algorithm does not yet have the ability to recognize what specifically the negative comments are in reference to. This is certainly something to improve upon in the future.
Lastly, I’d like to share an interesting finding. Term frequency–inverse document frequency (TF-IDF) is a commonly used NLP technique for filtering out high-frequency words that carry little information while emphasizing informative low-frequency words. However, as shown in the toy example below, when I applied TF-IDF to my dataset, common but meaningful words like “teeth” were penalized in dentist reviews, and the TF-IDF-weighted LDA ended up learning from artificially inflated rare words that are not meaningful for my dataset. This is because, unlike Wikipedia articles, where every article is relatively unique, reviews of the same type of doctor tend to share a large number of topics.
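The pitfall is easy to reproduce with a toy corpus and the standard IDF formula (the review snippets below are invented):

```python
# When every dentist review mentions "teeth", its document frequency equals
# the corpus size, so its IDF -- and thus its TF-IDF weight -- drops to
# zero, while a one-off rare word gets boosted instead.
import math

docs = [
    "teeth cleaning was gentle".split(),
    "my teeth feel great".split(),
    "teeth whitening and a free xylophone".split(),   # one odd rare word
]
N = len(docs)

def idf(word):
    df = sum(word in doc for doc in docs)   # document frequency
    return math.log(N / df)

print(idf("teeth"))       # 0.0 -- appears in every review, fully penalized
print(idf("xylophone"))   # log(3) ~ 1.1 -- the rare word dominates
```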
After observing this pitfall, it seemed obvious why TF-IDF doesn’t make sense here, but I had not realized it beforehand. I was happy to learn something new about a commonly used technique in NLP pipelines. Every dataset is unique, so one should always test different methods, validate the results, and try to improve them. It’s a good way to learn and grow as a data scientist.
The concept of DoctorSnapshot has the potential to be a very useful tool for patients looking for new doctors or, more generally, for people shopping for products online. However, there is a lot of room for improvement, particularly around entity recognition and pronoun resolution, which are still very hard to achieve given the current state of NLP research.
During my first three weeks at Insight, I learned a tremendous amount about machine learning, NLP in particular. I was able to build a fun and fulfilling project, DoctorSnapshot, that turns large numbers of doctor reviews into concise snapshots to help patients like me efficiently select the doctor that is best for them.