NoviLens – Text and Numerical Data? No Problem!
This case study explores a complex dataset to determine which wines match a particular need. We then develop a machine learning model to classify a wine’s score given its description.
Find the Right Wine
Imagine you want to find the most economical, highest-rated wine with a given flavor profile for a large dinner party. An example question to ask is “What are the least expensive wines that have a reviewer score of 90 or higher and cherry and sage flavor notes?”
To answer this question, we need wine review data that includes flavor descriptions, review scores, and prices. We can find this data on Kaggle, where it was collected from Wine Enthusiast. The dataset includes information from almost 130,000 wine reviews.
This data includes free-form text. After uploading the data into NoviLens, we need to create dictionaries that define context for the wine flavor descriptions. Fortunately, these descriptions are relatively structured and consistent, so creating a dictionary isn’t hard. After compiling a list of common wine flavor terms, upload the list into a NoviLens dictionary and link the dictionary to the text field of the data to run a dictionary match process. It is useful to split this dictionary into flavor categories, such as fruit and spice. This allows us to filter on a particular fruit flavor along with a particular spice.
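To make the dictionary match idea concrete, here is a minimal Python sketch of matching categorized flavor terms against a free-form description. It is not NoviLens code: the term lists, the sample review, and the function name match_flavors are all hypothetical, and the matching rule (whole-word, case-insensitive) is an assumption.

```python
import re

# Hypothetical flavor dictionary, split into categories (fruit and spice).
FLAVOR_DICTIONARY = {
    "fruit": ["cherry", "blackberry", "plum", "apple"],
    "spice": ["sage", "pepper", "cinnamon", "clove"],
}


def match_flavors(description, dictionary=FLAVOR_DICTIONARY):
    """Return {category: sorted list of dictionary terms found in the text}."""
    text = description.lower()
    matches = {}
    for category, terms in dictionary.items():
        # Whole-word match, so "sage" does not match inside "message".
        found = [
            term for term in terms
            if re.search(r"\b" + re.escape(term) + r"\b", text)
        ]
        matches[category] = sorted(found)
    return matches


# Made-up review text, for illustration only.
review = "Ripe cherry and plum aromas lead to a palate accented by sage and white pepper."
print(match_flavors(review))
```

Whole-word matching is strict: "cherries" would not match "cherry", so a real dictionary would likely need plural forms or stemming.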
After the data is uploaded and context is defined for the text field, we build a dashboard to interactively explore and filter the data. The first widget to build is a ‘count’ widget for the wine review data that shows the total number of records in the dataset and the number of records that fit the current filtering criteria. This gives you an idea of how specific the results of your filtering are.
Next, create visualization widgets as follows:
Histogram of wine price
Bar chart of wine scores
Histogram of dictionary matches for spice flavors
Histogram of dictionary matches for fruit flavors
Bar chart of wine titles
Table of raw data that lists filtered data
Pie chart of countries of origin
A selector widget for grape variety
A table widget that displays filtered data
On this dashboard, you can select data in the widgets to answer the question above. To do this, click on the following:
On the score widget, click on the bar that represents a score of 95+
On the fruit flavor widget, click on the bar that represents cherry flavor
On the spice flavor widget, click on the bar that represents sage flavor
On the price widget, click on the bar that corresponds to the price you are willing to pay (less than $50)
Here’s the resulting dashboard view:
This quickly reduces our choices from 130,000 to four of the best wines that fit our need.
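The same sequence of clicks can be thought of as a chain of filters over the records. The sketch below is plain Python, not NoviLens; the records, field names, and the filter_wines helper are hypothetical, and the flavor lists stand in for the output of the dictionary match step.

```python
# Hypothetical records; field names are assumptions, not the dataset's schema.
wines = [
    {"title": "Wine A", "points": 96, "price": 45.0, "fruit": ["cherry"], "spice": ["sage"]},
    {"title": "Wine B", "points": 95, "price": 120.0, "fruit": ["cherry"], "spice": ["sage"]},
    {"title": "Wine C", "points": 91, "price": 30.0, "fruit": ["cherry"], "spice": ["sage"]},
    {"title": "Wine D", "points": 97, "price": 38.0, "fruit": ["plum"], "spice": ["sage"]},
    {"title": "Wine E", "points": 95, "price": 28.0, "fruit": ["cherry", "plum"], "spice": ["sage", "pepper"]},
    {"title": "Wine F", "points": 98, "price": None, "fruit": ["cherry"], "spice": ["sage"]},
]


def filter_wines(records, min_points, max_price, fruit, spice):
    """Keep records that pass every filter, cheapest first."""
    selected = [
        r for r in records
        if r["points"] >= min_points
        and r["price"] is not None  # skip records with no listed price
        and r["price"] < max_price
        and fruit in r["fruit"]
        and spice in r["spice"]
    ]
    return sorted(selected, key=lambda r: r["price"])


matches = filter_wines(wines, min_points=95, max_price=50, fruit="cherry", spice="sage")

# Equivalent of the 'count' widget: filtered records versus total records.
print(f"{len(matches)} of {len(wines)} records match")
for wine in matches:
    print(wine["title"], wine["points"], wine["price"])
```

Sorting by price puts the most economical match first, which is the ordering the dinner-party question asks for.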
With NoviLens, analysis of complex data is easy.
Create a Machine Learning Model to Classify a Wine’s Score From the Review Text
Now that we have a dashboard to explore wine reviews, let’s consider creating a classification model based on the wine description. We’ll consider a score of 95 or higher a high score and anything below 95 a low score.
Now that we have an objective, let’s build the machine learning model.
First, we create a few new widgets in the dashboard to show other facets of the data:
A histogram of reviewer names
A Natural Language Processing dictionary match table to show matched phrases from the wine descriptions
A histogram of relevant phrases
Table showing the point score with related sentences from reviews
Here is the dashboard for reference:
We’ll go back through the data and map scores of 95 or higher to “high” and all other scores to “low”. Using these labels, we’ll build a machine learning model that classifies a wine’s score as “high” (>= 95) or “low” (< 95) from its description. Here is a report showing the effectiveness of the model:
From this report, we can see that the model has an accuracy of 73% when classifying wines with a low score from their descriptions and an accuracy of 83% when classifying wines with a high score.
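For readers who want to see the moving parts, here is a minimal Python sketch of the same workflow: map scores to “high” or “low”, train a text classifier on the descriptions, and report accuracy per class. The case study does not say which model NoviLens builds or how the report computes its figures, so the multinomial Naive Bayes classifier, the per_class_accuracy helper, and the made-up reviews below are all assumptions for illustration.

```python
import math
import re
from collections import Counter, defaultdict

HIGH_THRESHOLD = 95  # from the text: a score >= 95 is "high"


def label_score(points):
    """Map a numeric review score to the 'high' / 'low' class label."""
    return "high" if points >= HIGH_THRESHOLD else "low"


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one smoothing (an assumed stand-in model)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        tokens = [t for t in tokenize(text) if t in self.vocab]  # ignore unseen words
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for token in tokens:
                score += math.log((self.word_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of records in each true class that were predicted correctly."""
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: correct[label] / totals[label] for label in totals}


# Made-up (score, description) pairs, for illustration only.
training = [
    (97, "Stunning depth with layered cherry and sage, a profound and seamless finish"),
    (96, "Profound, complex and seamless, with extraordinary length"),
    (88, "Simple and light, with a short finish"),
    (85, "Thin and simple, slightly bitter on the short finish"),
]
held_out = [
    (96, "A seamless, profound wine with extraordinary depth"),
    (87, "Thin, bitter and short"),
]

model = NaiveBayesClassifier().fit(
    [text for _, text in training],
    [label_score(points) for points, _ in training],
)
true_labels = [label_score(points) for points, _ in held_out]
predicted = [model.predict(text) for _, text in held_out]
print(per_class_accuracy(true_labels, predicted))
```

Reporting accuracy separately for each class matters here because wines scoring 95 or higher are a small share of any review set, so a single overall accuracy figure could hide poor performance on the “high” class.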
Given this example, you could imagine building similar models to gauge customer feedback and to see how that feedback correlates with public reviews or third-party articles.