In this post, we will delve into a publicly-available dataset from Kaggle.com that includes environmental factors that could lead to someone having a stroke. We’ll walk through the NoviSystems process for developing the appropriate question to answer. After we’ve defined our question, we’ll upload the data to NoviLens and start investigating. At the end, we’ll assess the results.
Define the Question and Collect Data
Our first step is to craft a well-formed question and determine whether the data we have can answer it. Here our task is clear: can we create a model that predicts whether a person will have a stroke given the features in the dataset?
The dataset, Stroke Prediction, includes features such as previously diagnosed heart disease, rural/urban living, and marital status. NoviLens is capable of creating a model for any set of data with any features, but we need to ensure that the dataset features support answering the question. The dataset features include: age, gender, marital status, employment status, and urban vs. rural lifestyle. The dataset also includes more specific features that a Subject Matter Expert (SME) could confirm are known risk factors for strokes: previously diagnosed hypertension, diagnosed heart disease, BMI, average glucose level, and smoking status.
This dataset looks as though it has a good amount and type of information to answer our question. At this point we’re ready to dive into our data to create a model.
Create a Predictive Model
Now that we have defined the question and collected sufficient data, we can begin our investigation with NoviLens. We download the Stroke Prediction dataset from Kaggle and use a Python script to split the data into two CSVs: one with 70% of the data for training the model and another with the remaining 30% for testing it. Our script is shown below. Note that this functionality will be integrated into NoviLens in a future release. The script also handles a small amount of data cleansing for the BMI field, where “N/A” is used in place of a true null value. We’ll want to cast this field to a float later on, so we substitute “N/A” with an empty string.
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Kaggle data; replace the literal "N/A" strings (used for missing
# BMI values) with empty strings so the field can be cast to float later.
stroke_df = pd.read_csv('healthcare-dataset-stroke-data.csv', low_memory=False)
stroke_df = stroke_df.replace(["N/A"], "")

# 70/30 split: one CSV for training the model, one held out for testing it.
train, test = train_test_split(stroke_df, test_size=0.3)
train.to_csv("stroke-train.csv")
test.to_csv("stroke-test.csv")
```
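Because stroke positives turn out to be rare in this dataset, one refinement worth considering is passing `stratify=` to `train_test_split` so both CSVs keep the same positive rate. A minimal sketch with toy data (the frame below is a stand-in, not the real dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the stroke dataset: 10 positives out of 100 rows.
df = pd.DataFrame({
    "age": range(100),
    "stroke": [1] * 10 + [0] * 90,
})

# stratify= keeps the positive/negative ratio identical in both splits.
train, test = train_test_split(df, test_size=0.3, stratify=df["stroke"], random_state=0)
print(train["stroke"].mean(), test["stroke"].mean())  # both 0.1
```

Without stratification, a random 30% slice of a rare-positive dataset can end up with noticeably more or fewer positives than the training slice, which skews the later accuracy assessment.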
Next, we upload both stroke-train.csv and stroke-test.csv using the CSV Import function included with NoviLens. To save time, we’ll cast age, hypertension, heart_disease, and stroke as integers and avg_glucose_level and bmi as floats for both datatables.
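These casts are performed by the CSV Import function, but they can be mimicked in pandas for readers following along outside NoviLens. The column names below follow the Kaggle dataset; note that `bmi` contains blank strings after our cleansing step, so it needs a coercing conversion:

```python
import pandas as pd

# Small frame mimicking the imported datatable; "" is the blank left where "N/A" was.
df = pd.DataFrame({
    "age": ["67", "45"],
    "hypertension": ["0", "1"],
    "avg_glucose_level": ["228.69", "105.92"],
    "bmi": ["36.6", ""],
})

# Integer and float casts, matching the types we chose during import.
df["age"] = df["age"].astype(int)
df["hypertension"] = df["hypertension"].astype(int)
df["avg_glucose_level"] = df["avg_glucose_level"].astype(float)
df["bmi"] = pd.to_numeric(df["bmi"], errors="coerce")  # blank -> NaN

print(df.dtypes)
```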
After loading the data, we’ll create a dashboard to visually assess the data and spot trends.
With our Stroke Train and Stroke Test datatables and dashboard set up, we can begin using NoviLens Machine Learning to analyze the data. Thanks to our visualization widgets, we can see that every field (also known as a feature) in the dataset is categorical or continuous, with no unstructured text fields that would require natural language processing. Our categorical fields are represented by the pie charts and bar charts on our dashboard, and we’ve chosen to show our continuous float fields with scatter plots. This tells us that the type of machine learning we want is structured classification or regression, as opposed to unstructured classification or prediction. The next step is to create a new ML dataset based on our Stroke Train datatable.
We select the stroke field as our dependent feature (what we’re trying to predict) and all of the other fields, except for id, as our independent features (what will contribute to our model).
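This dependent/independent split is done through the NoviLens UI; in pandas terms (using an abbreviated toy frame with the same column layout), it amounts to:

```python
import pandas as pd

# Toy rows with the same column layout as the stroke dataset (abbreviated).
df = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [67, 45, 80],
    "hypertension": [0, 1, 0],
    "stroke": [1, 0, 1],
})

# Dependent feature: stroke. Independent features: everything except id and stroke.
y = df["stroke"]
X = df.drop(columns=["id", "stroke"])
print(list(X.columns))  # ['age', 'hypertension']
```

The `id` column is excluded because a row identifier carries no predictive signal and would only let a model memorize individual records.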
As we continue through the model creation process, we’re offered choices such as converting categorical features to binary or manually excluding low-scoring features. For now, we’ll go with the NoviLens recommendations including feature removal, algorithm selection, and algorithm parameters. Here’s what our first regression training report looks like:
It appears that this dataset is sparse: the ratio of stroke positives to negatives is very low, a condition also known as class imbalance. Left unaddressed, this leads the model to predict that no one will ever have a stroke, because that guess is correct for the vast majority of records.
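Checking for this kind of imbalance is a one-liner. The Kaggle stroke data has on the order of 5% positives; the toy frame below assumes that skew for illustration:

```python
import pandas as pd

# Toy frame with roughly the same skew as the stroke dataset: 5 positives in 100.
df = pd.DataFrame({"stroke": [1] * 5 + [0] * 95})

# Mean of a 0/1 column is the fraction of positives.
positive_rate = df["stroke"].mean()
print(f"{positive_rate:.0%} positive")  # 5% positive
```

A model that always predicts “no stroke” on such data scores 95% accuracy while being useless, which is exactly the failure mode seen in the first training report.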
NoviLens has a built-in capability for handling sparse data in classification algorithms. If a classification model is built and the system detects that the data is sparse, it automatically upsamples the dataset with the Synthetic Minority Oversampling Technique (SMOTE). The classification algorithms with upsampling offered by NoviLens include Decision Tree, Random Forest, Extra Trees, Adaptive Boosting, and Gradient Boosting. Ideally, we’d try all of them with various parameters to select the best model. For the purposes of this post, we’ll try Decision Tree because NoviLens provides an explanatory graphic at each decision point for us to view. We’ll create a new model with the same Stroke Train dataset, but configure it to build a Decision Tree instead of a Logistic Regression.
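NoviLens applies SMOTE internally (the technique is also available in open-source packages such as imbalanced-learn). For intuition, here is a toy sketch of the core idea, not the production implementation: synthesize new minority samples by interpolating between each minority point and one of its nearest minority neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, n_new, k=3, rng=None):
    """Minimal SMOTE sketch: create n_new synthetic minority points by
    interpolating between a minority sample and one of its k nearest
    minority neighbors."""
    rng = rng or np.random.default_rng(0)
    # +1 neighbor because each point's nearest neighbor is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))        # pick a minority sample
        j = idx[i][rng.integers(1, k + 1)]  # pick one of its k neighbors
        gap = rng.random()                  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Toy minority class: 6 points in 2-D, upsampled with 10 synthetic points.
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [1.1, 1.2], [0.9, 0.8], [1.0, 1.3]])
X_new = smote_oversample(X_min, n_new=10)
print(X_new.shape)  # (10, 2)
```

Because every synthetic point lies on a segment between two real minority points, the oversampled class fills in the minority region rather than simply duplicating rows.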
This time, the model’s training report looks much better, and we can now assess the results to determine how well the model predicts the occurrence of strokes. As we look through the Raw Data of our Decision Tree report, we can see that this model does predict that some people will have strokes, unlike our Logistic Regression model, which predicted no one would.
As seen here, different algorithms produce different models of varying performance. For this reason, NoviSystems encourages building multiple models with different algorithms and parameters. NoviLens makes it easy to prepare a dataset for ML and create new models with just a few clicks. The computer does what it does best — calculate statistics and create the models — leaving the decision about which approach is best to the person operating the software.
Assess the Model
The final step after training our model is to run it against the Stroke Test dataset to assess its accuracy. The process of creating a test dataset is similar to creating a training dataset in NoviLens, with some intelligence to guess which features you’ll need based on the training data.
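NoviLens handles this assessment internally; the equivalent train-then-score loop can be sketched in scikit-learn. The data below is synthetic (a toy rule stands in for stroke risk), so the printed accuracy illustrates the workflow, not the real model’s score:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the stroke features (scaled to [0, 1]).
rng = np.random.default_rng(0)
X = rng.random((200, 4))                    # e.g. glucose, age, hypertension, heart disease
y = (X[:, 0] + X[:, 1] > 1.2).astype(int)   # toy rule standing in for stroke risk

X_train, y_train = X[:140], y[:140]         # 70% for training
X_test, y_test = X[140:], y[140:]           # 30% held out for assessment

# Fit on the training split, then score only on the held-out split.
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(f"test accuracy: {accuracy_score(y_test, model.predict(X_test)):.2%}")
```

Scoring on held-out rows, rather than on the rows the tree was fit to, is what makes the reported accuracy an honest estimate of performance on unseen patients.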
When we run our Stroke Test dataset through our trained model, the model’s accuracy on the held-out test data is reported at 92.88%. Of course, we can always spend more time tweaking the model via the filtered dashboard to achieve a higher score, as well as compare models built with different machine learning algorithms. Below is the full Decision Tree that our model has created.
This diagram shows each of the features in the dataset (average glucose level, age, hypertension, and heart disease) and the threshold values at which the model branches toward an outcome. Orange paths lead to a prediction that no stroke occurs, while blue paths lead to a prediction that a stroke may occur. A quick review of the literature indicates that this is a clinically valid prediction [1-5].
This example illustrates the process of defining a question, collecting supporting data, creating a prediction model, and assessing the results using NoviLens.
Because this dataset is test data that’s heavily skewed towards well-known stroke prediction factors, the models we built in this case study will have limited clinical value. If we had access to real patient data through NoviLens’s FHIR interface, we would be able to train a model on the electronic medical records (EMRs) of patients with diagnosed strokes. We would use a combination of patient statistics over time (including vitals, lab values, and demographics) in conjunction with natural language processing to pick out important pieces of information physicians have written in patients’ medical records. By combining all types of data available to us, we can construct a model that looks at many different facets of a patient’s health without skewing the data towards features we believe are important — we can let the machine prove which features are important.
With NoviLens, it’s easy to aggregate your data to create useful and effective predictive models. Contact us to discuss how NoviLens can work in your organization.
1: Cui Q, Naikoo NA. Modifiable and non-modifiable risk factors in ischemic stroke: a meta-analysis. Afr Health Sci. 2019 Jun;19(2):2121-2129. doi: 10.4314/ahs.v19i2.36. PMID: 31656496; PMCID: PMC6794552.
2: Sakakibara BM, Kim AJ, Eng JJ. A Systematic Review and Meta-Analysis on Self-Management for Improving Risk Factor Control in Stroke Patients. Int J Behav Med. 2017 Feb;24(1):42-53. doi: 10.1007/s12529-016-9582-7. PMID: 27469998; PMCID: PMC5762183.
3: Mitsios JP, Ekinci EI, Mitsios GP, Churilov L, Thijs V. Relationship Between Glycated Hemoglobin and Stroke Risk: A Systematic Review and Meta-Analysis. J Am Heart Assoc. 2018 May 17;7(11):e007858. doi: 10.1161/JAHA.117.007858. PMID: 29773578; PMCID: PMC6015363.
4: Global Burden of Metabolic Risk Factors for Chronic Diseases Collaboration (BMI Mediated Effects), Lu Y, Hajifathalian K, Ezzati M, Woodward M, Rimm EB, Danaei G. Metabolic mediators of the effects of body-mass index, overweight, and obesity on coronary heart disease and stroke: a pooled analysis of 97 prospective cohorts with 1·8 million participants. Lancet. 2014 Mar 15;383(9921):970-83. doi: 10.1016/S0140-6736(13)61836-X. Epub 2013 Nov 22. PMID: 24269108; PMCID: PMC3959199.
5: Wei W, Li S, San F, Zhang S, Shen Q, Guo J, Zhang L. Retrospective analysis of prognosis and risk factors of patients with stroke by TOAST. Medicine (Baltimore). 2018 Apr;97(15):e0412. doi: 10.1097/MD.0000000000010412. PMID: 29642209; PMCID: PMC5908632.