6. Machine Learning¶
The fundamental principle of machine learning is for the machine to apply a series of calculations to the data points in your corpus in order to determine their relationship to each other. Basically, which group does a point fall into: a “one” or a “zero”? A number of algorithms can be deployed to try to provide this separation. The best one provides the best “fit”, produces the fewest false positives and false negatives, and can handle a wide range of data.
The algorithms used in NoviLens are: K-Means, NMF, LDA, MNB, Decision Trees, Random Forest, Linear Regression (Lasso, Ridge), Logistic Regression, AdaBoost, Gradient Boost, and Neural Net. New models are added at client request. Scikit-Learn is used extensively in this application. [1]
[1] Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
Machine Learning in NoviLens makes use of testing and validation grids across multiple algorithms to determine which algorithm provides the best overall “fit” for the task (classification or prediction). That doesn’t mean that the algorithm can’t be tuned by the user. Each step in the process can be modified by the user in an attempt to improve performance. In addition, the user can select the algorithm to use rather than let the machine decide. You must decide whether the difference between 80% and 85% accuracy makes a difference in the decision you’re making. Go back to the question. Is it important to classify all of the statements? The majority? Filter the key words? All of these determine how you use these tools.
If the machine is used to select and tune the algorithm, it is a time-consuming activity. Depending on the data set and the size of the appliance, this can take up to several days. Be prepared! There is some heavy number crunching happening. The good news is that once the algorithm is in place, you can just run your new data against it until you feel it’s time to update the model with new data. You need to decide how frequently you should update your model.
Finally, when using Machine Learning, remember what question you’re trying to answer. Machine learning is limited in its capabilities. It can solve simple, unambiguous problems. It does a good job of sorting information and reducing the features of data sets, but it does require adult supervision. The machine does not understand context. The user is the best judge of the quality of the return. To help you with these concepts, we’ve included the attached presentation.
6.1. Dealing with Features from Multiple Tables in a Dashboard¶
It is important to understand that widgets produce data filters, not actual tables. Depending on the question posed, this means you may have to aggregate the data behind your dashboard views before you can perform your machine learning tasks. For example, data from two different tables that are viewed in a dashboard and linked by a common column filter are not joined into a single table. The join must be done outside of NoviLens at this time. NoviSystems can provide you with an application that performs this task at your request. Just contact us at firstname.lastname@example.org.
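As a rough illustration of what joining two tables on a common column means, here is a minimal sketch in plain Python. The table contents, the `ticket_id` column, and the `inner_join` helper are all hypothetical; this is not the NoviSystems application mentioned above.

```python
from collections import defaultdict


def inner_join(left_rows, right_rows, key):
    """Join two lists of row dicts on a shared column (inner join)."""
    # Index the right-hand table by the join key for fast lookup.
    index = defaultdict(list)
    for row in right_rows:
        index[row[key]].append(row)

    joined = []
    for left in left_rows:
        # Rows without a match in the right-hand table are skipped.
        for right in index.get(left[key], []):
            # Merge the two rows; the shared key column appears once.
            joined.append({**left, **right})
    return joined


# Hypothetical tables linked by a common "ticket_id" column.
tickets = [
    {"ticket_id": 1, "category": "billing"},
    {"ticket_id": 2, "category": "outage"},
    {"ticket_id": 3, "category": "billing"},
]
ratings = [
    {"ticket_id": 1, "rating": 4},
    {"ticket_id": 3, "rating": 2},
]

combined = inner_join(tickets, ratings, "ticket_id")
for row in combined:
    print(row)
```

The combined rows can then be imported as a single table and used as a dataset.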
6.2. Adding a Dictionary as a Feature¶
Dictionary returns can be converted to features using the Generated Field function in the Table Feature of NoviLens. This function turns a dictionary term into a feature (attribute) whose value can be used in classification for machine learning.
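Conceptually, a dictionary-based feature is a new column that records whether a row's text matched any term in the dictionary. The sketch below shows the idea in plain Python; the `dictionary_feature` helper, the terms, and the comments are hypothetical, and the Generated Field function may work differently internally.

```python
import re


def dictionary_feature(texts, terms):
    """Return 1 for each text that contains any dictionary term, else 0."""
    # Match whole words only, ignoring case.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    return [1 if pattern.search(text) else 0 for text in texts]


# Hypothetical dictionary and text column.
complaint_terms = ["refund", "late", "broken"]
comments = [
    "The package arrived late.",
    "Great service, thank you!",
    "I want a refund for the broken unit.",
]

has_complaint = dictionary_feature(comments, complaint_terms)
print(has_complaint)  # [1, 0, 1]
```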
Machine Learning starts with the evaluation of a data set. The source of the dataset can be a filtered Dashboard, where a configured widget table’s results are sent to the Dataset configuration function of NoviLens via the Bookmark Button on the Dashboard Page, or an imported data table that is selected as a dataset.
Once a dataset has been selected, you must select a table to work with and the type of analytic to perform. The Datatable provides a view of the tables in the system. Select the table for analysis and then give the analysis a name.
There are a number of actions to select:
Create a new model: This creates a new model for use in evaluating data.
Update an Existing Model: Do you have more data that you want to use to improve the performance of an existing model?
Run an existing Model: You’re happy with the model and have some new data you want to evaluate.
6.3. The Problem Type¶
Predict or Classify mostly numerical or categorical data: this generally results in a regression analysis, decision tree, or Neural Net type model, depending on conditions.
Predict or Classify Full Text Field: used to classify text sentences by a given label such as a dictionary term.
Refine Dictionaries using Existing Data: This is a version of Topic Modelling that has been refined to improve specificity and context.
6.3.1. Understanding the Process¶
Once the steps above have been completed, a series of activities is initiated within the machine. These are illustrated in the following flow diagram:
6.4. Data Analysis¶
During this step in the machine learning process, the system evaluates the data in the datatable. It checks for missing data and, if any is found, prompts the user on how to handle it. Statistical analysis is performed to check the mean, mode, median, and kurtosis. These factors, along with others, are used to determine whether there are categorical data in the data sets.
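To make these checks concrete, the following sketch profiles a single column using only the Python standard library. The `profile_column` helper and the sample values are hypothetical, and missing values are assumed to be represented as `None`; the checks NoviLens runs internally may differ.

```python
import statistics


def profile_column(values):
    """Summarise one column: missing count plus basic descriptive statistics."""
    present = [v for v in values if v is not None]
    n = len(present)
    mean = statistics.fmean(present)
    # Second and fourth central moments (population form).
    m2 = sum((v - mean) ** 2 for v in present) / n
    m4 = sum((v - mean) ** 4 for v in present) / n
    # Excess kurtosis: 0 for a normal distribution.
    kurtosis = m4 / m2**2 - 3 if m2 > 0 else float("nan")
    return {
        "missing": len(values) - n,
        "mean": mean,
        "median": statistics.median(present),
        "mode": statistics.mode(present),
        "excess_kurtosis": kurtosis,
        "distinct": len(set(present)),
    }


# Hypothetical column with two missing entries.
ages = [34, 41, None, 29, 41, 52, None, 38]
profile = profile_column(ages)
print(profile)
```

A low distinct count relative to the number of rows is one hint that a column may be categorical rather than continuous.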
6.5. Choosing the Columns¶
Once the analysis is complete, the user will be asked to select the dependent column (what is to be evaluated) and then the independent columns. See the figure below.
The system will then perform another set of data cleaning tasks.
6.6. Fixing Missing Values¶
If values are missing in the tables, the user will be asked whether they want to drop the affected rows or replace the missing values with the column mean.
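The two options behave quite differently: dropping rows shrinks the data set, while filling with the mean keeps every row but flattens the column slightly. A minimal sketch of both, with hypothetical helper names and data, and with missing values assumed to be `None`:

```python
def drop_missing(rows, columns):
    """Keep only the rows that have a value in every listed column."""
    return [row for row in rows if all(row[c] is not None for c in columns)]


def fill_with_mean(rows, column):
    """Replace missing values in one column with the mean of the present values."""
    present = [row[column] for row in rows if row[column] is not None]
    mean = sum(present) / len(present)
    # Build new row dicts so the original table is left untouched.
    return [
        {**row, column: mean if row[column] is None else row[column]}
        for row in rows
    ]


# Hypothetical table with one missing age.
rows = [
    {"age": 30, "income": 50000},
    {"age": None, "income": 62000},
    {"age": 50, "income": 58000},
]

dropped = drop_missing(rows, ["age", "income"])
filled = fill_with_mean(rows, "age")
print(len(dropped))      # 2
print(filled[1]["age"])  # 40.0
```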
6.7. Data Normalization¶
Data normalization is performed by the system.
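This guide does not state which normalization method the system applies. As an illustration only, the sketch below shows two common choices, min-max scaling and z-score standardization; the function names and sample values are hypothetical.

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly so the smallest is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Centre values on 0 and scale them to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


# Hypothetical income column.
incomes = [20000, 40000, 60000, 100000]
scaled = min_max_scale(incomes)
standardized = z_score(incomes)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

Either way, the point is to put columns measured on very different scales on a comparable footing before a model is trained.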
6.8. Balanced Datasets¶
For numeric, float, and categorical data, the data is evaluated for an imbalance. If the ratio between classes is greater than 80:20, SMOTE (Synthetic Minority Over-sampling Technique) is applied to the data set.
Data used in text classification systems is not checked for “balance”. The user must evaluate the datasets to determine whether an imbalance exists between training label values.
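The sketch below illustrates the 80:20 check and the basic idea behind SMOTE: new minority-class points are created by interpolating between an existing minority point and one of its nearest minority neighbours. It is a simplified, standard-library illustration with hypothetical names and data, not the implementation NoviLens uses; in practice a library such as imbalanced-learn is the usual choice.

```python
import math
import random
from collections import Counter


def is_imbalanced(labels, threshold=0.80):
    """True when the largest class holds more than `threshold` of the rows."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels) > threshold


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k closest minority points to the base point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical data: 17 rows of class 0 and 3 rows of class 1 (85:15).
labels = [0] * 17 + [1] * 3
minority_points = [(1.0, 2.0), (2.0, 3.0), (1.5, 2.5)]

if is_imbalanced(labels):
    n_needed = labels.count(0) - labels.count(1)
    new_points = smote_like(minority_points, n_needed, k=2)
else:
    new_points = []
print(len(new_points))  # 14
```

For text classification, only the `is_imbalanced` style of check applies: count the rows per training label yourself before training.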
6.9. Convert Categorical Columns¶
THE MACHINE IS NOT PERFECT!!
The system uses a series of algorithms based on sample size, kurtosis, and other statistics to determine whether a column in the table is categorical or linear (continuous). It can make mistakes. We show you this in this example. If the data set is small and the range of the data is small, the system will interpret the column as a category. If you are doing regression analysis, be aware! You will need to correct the machine in a subsequent step.
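The following sketch shows why this kind of mistake happens. It uses a deliberately simple, hypothetical rule (few distinct values relative to the number of rows means "categorical"); the actual NoviLens logic also uses kurtosis and other statistics and is not reproduced here.

```python
def looks_categorical(values, max_distinct=10, max_ratio=0.5):
    """Guess 'categorical' when a column has few distinct values for its size."""
    distinct = len(set(values))
    return distinct <= max_distinct and distinct / len(values) <= max_ratio


# Hypothetical continuous measurement with a small sample and a narrow range:
# only three distinct values, so the rule mistakes it for a category.
dose_mg = [1, 2, 3, 2, 1, 3, 2, 2, 1, 3]

# Hypothetical continuous measurement with a wide range: every value differs.
income = [31250.0, 48100.0, 52975.5, 61000.0, 75420.0,
          88010.0, 93500.0, 104300.0, 120000.0, 133250.0]

print(looks_categorical(dose_mg))  # True  (misread as categorical)
print(looks_categorical(income))   # False (treated as continuous)
```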
6.10. Calculating Recommended Features¶
In this step, the system performs a Select K-Best data set reduction to lower the number of features analysed. The default value of K is five. You can either accept the features the machine recommends or override it and choose your own based on the return.
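In scikit-learn, this kind of reduction is available as `SelectKBest`. The sketch below keeps the five highest-scoring features of a synthetic data set. The scoring function (`f_classif`) and the generated data are assumptions made for illustration; this guide does not say which scoring function NoviLens uses.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic classification data: 12 features, only 4 of them informative.
X, y = make_classification(
    n_samples=200,
    n_features=12,
    n_informative=4,
    n_redundant=0,
    random_state=0,
)

# Score every feature against the target and keep the best five.
selector = SelectKBest(score_func=f_classif, k=5)
X_reduced = selector.fit_transform(X, y)

kept = selector.get_support(indices=True)
print("Kept feature columns:", kept)
print("Reduced shape:", X_reduced.shape)
```

Changing `k` corresponds to overriding the default of five.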
6.11. Verify Dataset Configuration¶
Verify Dataset configuration provides the overview of the data and is where you can CORRECT MACHINE CLASSIFICATION ERRORS. In the dataset highlighted in this tutorial, the machine classified the data as categorical while we wanted linear (continuous). Here is where we would change the fields.
6.11.1. The Algorithms¶
As stated previously, the machine will try to match the algorithm to the question you’re asking when selecting the Problem Type. You can change the algorithm at your discretion. Remember, the data type may restrict the algorithm selected. The most common issue is when numeric data is saved as text in the table and not converted to numeric or float values using the recast function in the Tables section of NoviLens.
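A quick way to spot the numeric-saved-as-text problem before you import a table is to test whether every value in a column can be converted to a number. This is a hypothetical standalone check, not the NoviLens recast function itself.

```python
def recast_numeric(values):
    """Convert text values to floats; return None if any value is not numeric."""
    converted = []
    for value in values:
        try:
            converted.append(float(value))
        except (TypeError, ValueError):
            return None  # at least one entry is genuinely non-numeric
    return converted


# Hypothetical columns as they might arrive from a CSV import.
prices_as_text = ["19.99", "5", " 12.50 "]
mixed_column = ["19.99", "n/a", "12.50"]

print(recast_numeric(prices_as_text))  # [19.99, 5.0, 12.5]
print(recast_numeric(mixed_column))    # None
```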
As a reminder:
Linear Regression: uses continuous (linear) data to calculate a value. Problem types are generally associated with numeric data and used for predictive modeling, such as determining changes over time. The different types of linear regression vary in their method of handling “outliers” in the data. Speed of processing: Fast
Logistic Regression: binary classification, good for non-linear or non-correlated data. This can be used to model yes/no questions, such as whether the service was good or bad. The independent features should not be mathematically related. Deals in probabilities. Speed of processing: Moderate
Decision Trees: non-parametric, good for non-linear data analysis. A good method for determining feature importance. Subgroups the data by features to arrive at a binary decision. Speed of processing: Moderate
Random Forest, Gradient Boost: ensemble methods built on Decision Trees. Speed of processing: Moderate/Slow
K-Means: clustering algorithm. This is an unsupervised method that creates groups of data based on the number of “clusters” or “data centroids” the user selects. Speed of processing: Moderate
SVM: complex, non-linear data (many features). Speed of processing: Slow
Stochastic Gradient Descent: classification, regression - data smoothing
Multinomial Naive Bayes: This is used in classification of unstructured text where the algorithm is “trained” with a given set of text and labels. Speed of processing: Fast
Neural Network: This classification system is used with text and images. Requires extensive training. Speed of processing: Very Slow
NoviLens has not been optimized for image analysis. The algorithms (SVM, Neural Net) can be used for this task but will perform very slowly. If you want to perform image analysis, we would suggest upgrading your appliance with GPUs.
NoviLens is constantly adding new features to the product. If you would like a new algorithm to be added to the application, simply e-mail us with your request.
The algorithms all have limitations. Understand the question you’re asking. Is the decision a classification - good/bad, yes/no? Are the data formatted to allow the algorithm to work? Do you need to make sub categories of categories? For example, above x is good, below x is bad.
If you are performing classification of sentences, make sure you have at least two different phrase types in the data set to train against and that there are sufficient and relatively balanced numbers of each. The system does not correct for imbalanced text data.
6.12. Running Models¶
6.12.1. Classification Systems¶
The classification algorithms produce a classification report consisting of the following:
Precision: The proportion of predicted positive cases that were correctly identified. It is useful when the cost of False Positives is high.
Recall: It is the measure of the correctly identified positive cases from all the actual positive cases. It is important when the cost of False Negatives is high.
Accuracy: One of the more obvious metrics, it is the measure of all the correctly identified cases. It is most used when all the classes are equally important.
F1-score: This is the harmonic mean of Precision and Recall and gives a better measure of the incorrectly classified cases than the Accuracy Metric.
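The four metrics above follow directly from the counts of true positives, false positives, and false negatives. The sketch below computes them for a binary problem in plain Python; the helper name and the sample labels are hypothetical.

```python
def classification_metrics(actual, predicted, positive=1):
    """Compute precision, recall, accuracy, and F1 for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    correct = sum(a == p for a, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": correct / len(pairs),
        "f1": f1,
    }


# Hypothetical actual vs predicted labels for ten rows.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

report = classification_metrics(actual, predicted)
print(report)
# {'precision': 0.75, 'recall': 0.75, 'accuracy': 0.8, 'f1': 0.75}
```

Here three of four predicted positives are correct (precision 0.75), three of four actual positives are found (recall 0.75), and eight of ten rows are right overall (accuracy 0.8).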
The default values deployed in each model were determined to provide a baseline or directional value. The machine can tune the values to improve the accuracy of the algorithm. The autotuning feature does consume compute resources and takes time but will improve the model.
The report will also produce a readout of the actual vs predicted values for the data analysed. This can be downloaded as a CSV.
6.12.2. Decision Trees¶
Unless otherwise directed, the Decision Tree algorithms perform cross validation and selection using a preconfigured tuning grid. The results are displayed for each algorithm, with the highest-scoring algorithm selected for use in the data model. The user has a number of options at this point:
accept the algorithm or change it
accept the model parameters, have the machine auto tune, or manually change parameters of the selected algorithm
Once these choices are made, selecting the model performs the model training. This may take time depending on the size of the data set. Once training is complete, the model is saved and can be used against subsequent data sets imported into NoviLens via the Table Import function.
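The select-then-train flow described above can be sketched with scikit-learn's `GridSearchCV`. The candidate algorithms, the tuning grids, and the synthetic data below are assumptions chosen to keep the example small; the grids preconfigured in NoviLens are not documented here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a prepared dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical candidates: each estimator is paired with a small tuning grid.
candidates = {
    "decision_tree": (
        DecisionTreeClassifier(random_state=0),
        {"max_depth": [3, 5, None]},
    ),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [25, 50], "max_depth": [3, None]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # 5-fold cross validation over every parameter combination in the grid.
    search = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    results[name] = (search.best_score_, search.best_params_, search.best_estimator_)
    print(name, round(search.best_score_, 3), search.best_params_)

# Keep the algorithm with the highest cross-validated score.
best_name = max(results, key=lambda n: results[n][0])
best_model = results[best_name][2]  # already refit on all of the data
print("Selected:", best_name)
```

Larger grids improve the chance of finding good parameters, but the run time grows with every value added, which is why auto-tuning can take a long time on big data sets.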
6.12.3. Regression Problems¶
Regression Analysis uses different methods for evaluating continuous data.
6.12.4. Classification Based on Full Text¶
These algorithms provide the opportunity to classify text by features that are either present in the data tables or isolated from the tables by the user. Sentiment analysis and call log analysis are frequent uses.
The options are presented to the user as follows:
Selecting Prediction / Classification Based on Full Text
Once data is analysed by the system the user must select the dependent and independent variables:
The results are presented in Tabular form:
Results of Analytic
While the user can tune aspects of the algorithm used (MNB/NN), better results are often obtained by using dictionaries to better define the sentence or paragraph that is being classified. In addition, we have found that MNB performs as well as NN in this system in a fraction of the time.
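For readers who want to see what MNB text classification looks like outside the interface, here is a minimal scikit-learn sketch. The training sentences and labels are made up, and the word-count vectorizer is an assumption; the preprocessing NoviLens applies is more extensive.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labelled sentences: text is the independent variable,
# the label is the dependent variable.
texts = [
    "great service and friendly staff",
    "very helpful and quick response",
    "excellent support, great experience",
    "terrible service and rude staff",
    "slow response and unhelpful agent",
    "awful experience, very rude",
]
labels = ["good", "good", "good", "bad", "bad", "bad"]

# Turn each sentence into word counts, then train Multinomial Naive Bayes.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

predictions = list(model.predict(["great friendly support", "rude and slow"]))
print(predictions)  # ['good', 'bad']
```

Note that the two label values are balanced (three sentences each), in line with the earlier advice on text data.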
6.12.5. Refine Dictionaries Using Existing Text Datasets¶
This machine learning model uses NMF as the default algorithm. The calculations depend on extensive text preprocessing (which can be modified in the dictionary configuration section of NoviLens), followed by TF-IDF and vector transformation. The results are used to generate comparative vectors consisting of noun-verb-noun (NVN) phrases. The comparison is between vectors of all NVN phrases and vectors of NVN phrases with those containing dictionary terms removed.
The process can be used for unsupervised text classification (Topic Modeling) if desired.
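The core of this approach, TF-IDF followed by NMF, can be sketched in a few lines of scikit-learn. This example covers only the topic modelling part: the NVN chunking and the dictionary comparison are omitted, and the documents and parameter values are hypothetical.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical short documents covering three loose themes.
documents = [
    "customer reported billing error on invoice",
    "invoice shows duplicate billing charge",
    "billing charge refunded to customer account",
    "network outage affected several offices",
    "outage caused by failed network router",
    "router replaced and network restored",
    "password reset requested for user account",
    "user locked out after failed password attempts",
]

# Convert the text to TF-IDF vectors, dropping common English stop words.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)

# Factorise into three topics.
nmf = NMF(n_components=3, init="nndsvda", random_state=0, max_iter=500)
doc_topics = nmf.fit_transform(tfidf)

# Recover the term for each column index, then list the top terms per topic.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
topics = []
for weights in nmf.components_:
    top = weights.argsort()[::-1][:4]
    topics.append([terms[i] for i in top])

for number, words in enumerate(topics, start=1):
    print(f"Topic {number}: {', '.join(words)}")
```

The top terms of each topic are the kind of output you would review as candidate dictionary entries.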
It begins with selecting the function in the Machine Learning Topic Dataset section:
Once selected, the interface offers several inputs:
Chunking Dictionary: this directs how the machine will set up vectors of the documents (Noun-Verb-Noun works very well)
Context Dictionary: the Dictionary that will be used for the mathematical comparison.
Text Field: What will be analysed
The recommended algorithms and parameters are auto-tuned.
The results are presented as new dictionary options.
Points to consider:
If there are no dictionary matches for the dictionary you’ve selected, the machine is interpreting the vectors as having no duplicate results between dictionaries. The new terms display the relationship between the known values of the terms used in the dictionaries and the new terms. Similar clusters are displayed as topics.
The default number of Topics is three.
You can use the results to modify existing dictionaries by copying selections then moving to the NLP Dictionary function.
If you want to change the chunking function, use NLTK syntax.
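As an illustration of NLTK chunking syntax, the sketch below defines a noun-verb-noun pattern with NLTK's `RegexpParser`. The grammar is a hypothetical example, and we are assuming here that this is the style of rule a chunking dictionary expects; check your dictionary configuration for the exact form. The tokens are tagged by hand so that no tagger model needs to be downloaded.

```python
import nltk

# Hypothetical NVN rule: one or more nouns, one or more verbs, then an
# optional determiner and adjectives followed by one or more nouns.
grammar = r"NVN: {<NN.*>+<VB.*>+<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)

# Pre-tagged tokens as (word, part-of-speech tag) pairs.
tagged = [
    ("The", "DT"),
    ("customer", "NN"),
    ("reported", "VBD"),
    ("an", "DT"),
    ("outage", "NN"),
]

tree = chunker.parse(tagged)

# Collect the words of every chunk labelled NVN.
phrases = [
    " ".join(word for word, _tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NVN")
]
print(phrases)
```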