Analyzing The Yelp Review Dataset With Tidepool

Peter Gao
Feb 14, 2024

Something like 90% of company data is unstructured. Unstructured text data can include emails, messaging conversations, user-generated content (surveys, feedback, reviews, etc.), or user prompts to LLM apps.

However, it’s traditionally been pretty hard to derive any useful insights from this data. SQL is very accessible, but it only allows you to do basic string matching or keyword search. Training a traditional natural language processing model works well but requires you to set up a machine learning stack and laboriously label data to train and evaluate on, limiting usage to competent data scientists in mature machine learning teams.

Tidepool combines two key technologies:

  1. Large language models (LLMs) that allow a user to ask a question of text data with natural language and get an answer (via categorizations of the data) without needing a human to manually label data.
  2. Lightweight embedding classifiers that can be fine-tuned on LLM labels to categorize data at scale, at roughly 100x lower cost than running an LLM across the entire dataset.

In this post, we’re going to walk through an example of how you can use Tidepool to derive useful insights from text data by looking at the Yelp Review dataset.

Yelp Review Dataset

The Yelp Review dataset on Huggingface consists of reviews from Yelp, each with a label ranging from 0 to 4 that is just the star rating minus one (label = 0 is a 1 star review, label = 4 is a 5 star review). The dataset contains 130,000 training samples for each star rating, for a total of 650,000 training samples. Yelp is a review site for various businesses ranging from hotels and restaurants to home services and other types of physical stores, so there’s a lot of variety in this quantity of data.

An example row from the YelpReviewFull test set looks like:


{
    'label': 0,
    'text': 'I got \'new\' tires from them and within two weeks got a flat. I took my car to a local mechanic to see if i could get the hole patched, but they said the reason I had a flat was because the previous patch had blown - WAIT, WHAT? I just got the tire and never needed to have it patched? This was supposed to be a new tire. \\nI took the tire over to Flynn\'s and they told me that someone punctured my tire, then tried to patch it. So there are resentful tire slashers? I find that very unlikely. After arguing with the guy and telling him that his logic was far fetched he said he\'d give me a new tire \\""this time\\"". \\nI will never go back to Flynn\'s b/c of the way this guy treated me and the simple fact that they gave me a used tire!'
}

The entire dataset is pretty compact at 323 MB. For this blog post, we’re only going to look at the training set because it’s a lot bigger than the test set.
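
If you want to poke at the raw data locally before doing anything else, the Hugging Face datasets library can pull down the training split in a couple of lines. This is just a quick exploration sketch; the Tidepool upload happens separately below.

from datasets import load_dataset

# Pull the 650,000-row training split of the Yelp Review dataset.
train = load_dataset("yelp_review_full", split="train")

print(len(train))  # 650000
print(train[0])    # {'label': ..., 'text': '...'}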

We can create a project in Tidepool and type in some context about the analysis we’re trying to do. In this case, we can tell Tidepool that we’re analyzing Yelp reviews and that Yelp is a review site for businesses. Here, I actually copy-pasted the description of Yelp from an AI-powered overview for a Google search of “Yelp.”

We can upload the training set with a simple Python script that reads the Yelp Review parquet file and then sends some POST requests to the Tidepool API to upload that data – here’s the script I used to do the upload.
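
For reference, here’s a rough sketch of the shape such an upload script can take. The endpoint URL, auth header, and payload fields below are placeholders rather than the real Tidepool API, so treat this as an outline and use the linked script for the actual details.

import pandas as pd
import requests

# Placeholder values: swap in the real Tidepool endpoint and API key.
API_URL = "https://api.tidepool.example/v1/datasets/yelp-reviews/rows"
API_KEY = "<YOUR_API_KEY>"

# Path to the downloaded parquet file (columns: label, text); adjust as needed.
df = pd.read_parquet("yelp_review_full_train.parquet")

BATCH_SIZE = 1000
for start in range(0, len(df), BATCH_SIZE):
    batch = df.iloc[start:start + BATCH_SIZE]
    rows = [
        {"text": row.text, "metadata": {"star": int(row.label)}}
        for row in batch.itertuples()
    ]
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"rows": rows},
    )
    resp.raise_for_status()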

Once the upload finishes, we can look at the dataset in Tidepool, filter by “star” label, and then inspect a few examples from each label value.

Filtering for reviews where label = 4 (5 star reviews)
Clicking into an arbitrary review to see the text of the review and its associated metadata (in this case, just the “star” label)

Sentiment vs Stars

Let’s start asking some questions of the data! The Yelp Review dataset is commonly used for sentiment classification tasks where one trains a traditional NLP model on the text and labels. However, LLMs let us directly predict sentiment without needing to explicitly train a model! Let’s try this in Tidepool and see how well it works.

Creating an Attribute

First we can go into Tidepool and create an attribute. An attribute is a customized grouping of data – in this case, we are going to create a “sentiment” attribute that groups data according to its sentiment. We can simply type out a description of the attribute…

Creating an attribute with a natural language description

And then Tidepool can generate suggestions for the groupings based on the attribute description. In this case, we will categorize the data into positive, negative, neutral, and “not applicable” (basically unknown) sentiment.

Tidepool’s AI suggested categories

Now Tidepool will run an LLM on a subset of the data, categorize some examples according to the attribute and category descriptions, then train a lightweight embedding classifier on the LLM labels. Note that in this example, the LLM does not use the star label information in any way; it only uses the descriptions we typed in to categorize the text in the dataset.
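
To make that division of labor concrete, here’s a minimal sketch of the general pattern outside of Tidepool. This is not Tidepool’s internals: the keyword rule below is a stand-in for a real LLM call, and the sentence encoder and logistic regression are just one reasonable choice of embedding and lightweight classifier.

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

def llm_label(text: str) -> str:
    # Stand-in for an LLM call that picks a category using only the
    # attribute/category descriptions; dummy keyword rule for illustration.
    return "positive" if "great" in text.lower() else "negative"

texts = [row["text"] for row in train]            # `train` from the earlier snippet

sample = texts[:2000]                             # LLM labels only a small sample (expensive)
sample_labels = [llm_label(t) for t in sample]

clf = LogisticRegression(max_iter=1000)
clf.fit(encoder.encode(sample), sample_labels)    # train the lightweight classifier on embeddings

sentiment = clf.predict(encoder.encode(texts))    # cheap inference over all 650,000 rows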

Category Refinement

We can then go into the “category refinement” view, which lets us inspect examples of each category and adjust the categories – editing category descriptions, creating new categories or deleting existing ones, or manually labeling certain examples into a category.

You can run a few rounds of refinement until you’re happy with the result, but in this case the results are pretty good so we’re just going to “finalize” the attribute.

The category refinement view also contains confidence scores for each row
Examples with positive sentiment
Examples with negative sentiment

Exploring A Finalized Attribute

Now that we’ve finalized the attribute, Tidepool will run the lightweight embedding classifier across all 650,000 rows of the training dataset and categorize each row into one of the categories we defined. Not only is the categorization pretty accurate, but it’s much, much cheaper than running an LLM over all of these rows – on the order of a few dollars for the lightweight classifier vs thousands of dollars for a model like GPT-3.5.

At this point, we can export this “enriched” categorization data as a CSV or push it back into a data warehouse so we can do further analysis in a BI tool like Mode or Looker.
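
If you go the export route, the enriched data is just another table. The snippet below (with illustrative column names; the actual export schema comes from Tidepool) attaches the predicted categories to the original rows and writes a CSV you can load into a warehouse or a BI tool.

import pandas as pd

# Illustrative column names; reuses `texts`, `train`, and `sentiment` from the sketches above.
enriched = pd.DataFrame({
    "text": texts,
    "star_label": [row["label"] for row in train],
    "sentiment": sentiment,
})
enriched.to_csv("yelp_reviews_enriched.csv", index=False)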

However, Tidepool also has some rudimentary charting functionality. We’ll use that for the rest of the post for convenience’s sake. First, let’s chart the distribution of sentiment across the entire dataset…

Chart of sentiment across the entire dataset

And we can even chart the labels colored by sentiment.

Labels colored by sentiment

As we expect, label = 4 rows (5 star reviews) tend to have positive sentiment in the review text, label = 0 rows (1 star reviews) tend to have negative sentiment in the review text, and middling reviews have a more even split across sentiment.
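
If you’d rather reproduce this breakdown outside of Tidepool, a pandas crosstab over the exported CSV gives you the same stacked view (again assuming the illustrative export columns from above):

import matplotlib.pyplot as plt
import pandas as pd

enriched = pd.read_csv("yelp_reviews_enriched.csv")
counts = pd.crosstab(enriched["star_label"], enriched["sentiment"])

counts.plot(kind="bar", stacked=True)
plt.xlabel("star label (0 = 1 star, 4 = 5 stars)")
plt.ylabel("number of reviews")
plt.tight_layout()
plt.show()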

However, there is a small minority of positive sentiment reviews with label = 0. What’s going on here? Let’s filter for that subset and look at some examples. We see that these reviews were written by some very polite and optimistic people despite their low star rating:

“Went for dinner with friends tonight. The place itself looks great, loved the lighting and projections on walls. That is the only thing that we found we liked. The music was a little loud and played during the commercials on TV, it was good club music though… Staff were friendly and for the most part attentive although apologized a lot for things they did not yet have… Hope this is just the issue that happens with a new opening and that they have success.”
“I jumped on to write a review, and sadly looks like I’ll just be echoing a lot of bad reviews. Despite my poor review, my 5 year old had a great time… Positives: The Staff was incredible. Really great people wanting to help you have a good time, many in character, with great attitudes. I bet they heard a lot of complaints which is unfortunate as we liked every person we met… I would give this event a thumbs down, and the VIP Pass a double thumbs down. If they reduce the number of folks, it may improve the event.”

We can also look at the opposite: 5 star reviews with negative sentiment. These reviews mention some positives, but most of the text complains about things that didn’t go well, which is kind of puzzling:

“Don’t come here without making a reservation. I didn’t know you could make reservations, but Yelp is showing an Open Table link, so if it’s true, MAKE A RESERVATION… The only negative is sitting in the line next to SMOKERS at the craps table. I don’t mind sitting in line, but I do mind choking on someone else’s cancer sticks. And just as the SLS was poorly designed, this restaurant should have anticipated the crowd madness and built a larger space.”
“I have great expectations, but they let me down, i order the carne asada salad… the dressing is to watery, too runny, teh steak was blend and the rice mushy… the staff was very friendly, thats why i give the 5 stars, i hope you fix it all…”

Pretty interesting! Now let’s take a look at another question.

Topics vs Stars

We can answer a more nebulous question: what types of topics do reviews tend to mention? What review topics correlate with high or low ratings? For example, if the majority of 1 star reviews are about service quality, that would be a pretty useful thing for businesses on Yelp to optimize for.

To answer this question, we can create an attribute for the review topic. Again, Tidepool suggests some potential categories based on the attribute description, such as food quality, service quality, ambience, etc. However, we really want to have topics surfaced from the data bottom-up. Here we can use a handy feature called subcategory discovery, where Tidepool will discover subcategories of existing topics in the data during category refinement.

We actually created an “other” category to capture all the topics that we couldn’t think of up-front. When we look into the subcategories of the “other” category, we see a lot of interesting topics that we wouldn’t have initially thought of.

The “facility amenities” subcategory tends to contain reviews of hotels and their services:

The “location convenience” subcategory, meanwhile, talks about location, parking availability, etc.

We can then promote some of these subcategories into proper categories if they’re interesting enough, so we’ll do that with the two above and a few others before finalizing the categories.

When we chart the labels against the categories, we see that food and service quality are the most represented across the dataset, but the service quality and business practices categories are highly represented in the 1 star reviews.
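
The same crosstab trick works for the topic attribute. Normalizing each row makes the skew toward certain topics in 1 star reviews easier to see; this assumes the exported CSV has a hypothetical “topic” column alongside the star label.

import pandas as pd

enriched = pd.read_csv("yelp_reviews_enriched.csv")  # assumes a hypothetical "topic" column

# Share of each topic within each star label (rows sum to 1.0).
topic_share = pd.crosstab(enriched["star_label"], enriched["topic"], normalize="index")
print(topic_share.round(2))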

We can dig into 1 star reviews in the “service quality” category and see a lot of complaints about restaurant serving staff and wait times:

“We just left this Dennys after waiting for forty two minutes to eat. From the time we sat down, it took twenty minutes to actually have our server greet us. I actually had to go back up to the hostess to ask for a server… after today,I have absolutely no intention of ever going to another Dennys again…”

Meanwhile, 1 star reviews in the “business practices” category mostly feature complaints about shady billing schemes:

“Less than a week later I received an unknown charge for $70.00 from a company I didn’t recognize (SMELL/SNELL?) so I followed up on it and it turned out to be the hotel trying to charge me for the room that I had already pre-paid for on travelocity. When I called for a refund, they didn’t try to dispute/find the fault in their error, which makes me very suspicious. If you stay/have stayed in this hotel check your statements!”

Conclusion

That was quite the ride! Looking back, we were able to walk through the following steps on the Yelp Review dataset:

  • Created an “attribute” which defines a question you want to answer from the text data.
  • Defined the categories in that attribute through iterative refinement on a subset of the data, including subcategory suggestions from the app.
  • “Finalized” the attribute and categorized the entire dataset according to our attribute and category definitions.
  • Used some basic charting functionality to draw interesting correlations between the categorizations of the raw text and a target piece of metadata, the star rating of reviews.

Hopefully this was interesting! The advantage of using Tidepool is that you can ask arbitrary, very specific questions across large swaths of text data, get back a pretty good answer at a reasonable cost, and integrate it into the rest of your business intelligence stack. If you have any ideas about interesting questions to ask, feel free to email me at pgao@aquarium-learn.com or go to tidepool.so and try Tidepool out for yourself! 🙂