2. Dictionaries

Dictionaries are used by Analytics applications to provide specificity and context. Grouping “like” terms together and searching for their co-location returns results that are meaningful for the topic under investigation. For instance, suppose three Dictionaries are used to search sentences or paragraphs of text: one containing the names of baseball teams, another containing baseball equipment, and a third containing synonyms for scoring. If a sentence or paragraph returns positive results for all three, it is most likely about baseball. This is a Rules-Based System.
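The baseball example can be sketched in a few lines of Python. This is only an illustration of the co-location rule, not NoviLens code: the dictionary contents, the function names, and the sample sentences are all assumptions made for the example.

```python
import re

# Hypothetical dictionaries; the terms are illustrative only.
DICTIONARIES = {
    "teams": {"yankees", "red sox", "dodgers"},
    "equipment": {"bat", "glove", "helmet"},
    "scoring": {"run", "scored", "home run"},
}


def matched_dictionaries(text, dictionaries):
    """Return the names of the dictionaries with at least one term in text."""
    lowered = text.lower()
    hits = set()
    for name, terms in dictionaries.items():
        for term in terms:
            # Whole-word match, so that "bat" does not match "combat".
            if re.search(r"\b" + re.escape(term) + r"\b", lowered):
                hits.add(name)
                break
    return hits


def is_about_topic(text, dictionaries):
    """Rule: every dictionary must match somewhere in the same passage."""
    return matched_dictionaries(text, dictionaries) == set(dictionaries)


sentence_a = "The Dodgers scored twice after he swung the bat."
sentence_b = "The company scored a new contract in combat gear."

print(is_about_topic(sentence_a, DICTIONARIES))
print(matched_dictionaries(sentence_b, DICTIONARIES))
```

The first sentence matches all three dictionaries and so satisfies the rule. The second matches only the scoring dictionary, which shows why a single dictionary hit on its own says little about the topic.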

The use of Dictionaries provides the following functions in Unstructured Text Analytics (UTA) and Natural Language Processing (NLP):

  • fact extraction using exact term or Regular Expression (REGEX) matching

  • specificity for sentence/paragraph extraction in a rules-based system

  • comparators for use in Machine Learning models

Rules-Based Systems are of little value in large documents, where combinations of terms that are spaced far apart may not reveal context. NoviLens applies Rules-Based Systems to paragraphs and sentences instead.

2.1. Building Dictionaries

Dictionaries can be built easily by importing CSV files or lists. A good way to find lists is simply to search for lists on whatever topic you’re exploring. Just choose the file and upload it.

Terms can also be added by “cutting and pasting” or by typing them in.

Once the terms have been uploaded, typed, or “pasted”, be sure to select Update Dictionary to save the terms.

NoviLens has a built-in thesaurus that can be used to add additional terms that you may want to consider.

2.2. Advanced Functions

The advanced functions within the Lens Dictionary tools include:

  • the ability to search for synonyms of the words in the dictionary

  • the ability to search for the lemma of terms in the dictionary

In addition, more advanced Natural Language Processing functions can be configured, including preparing documents for NLP by removing unwanted characters, removing pronouns, etc.

Here is where the user can define a REGEX and search for a fact or value. For instance,

REGEX: \$\d+\.\d{2} could be a dictionary named “Money” (a Dictionary can consist of just one term).
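As a quick check of what such a “Money” pattern extracts, here is a small sketch using Python's standard re module. It assumes the pattern is meant to be read as a dollar sign, one or more digits, a decimal point, and exactly two digits; the sample text is invented.

```python
import re

# Assumed reading of the "Money" pattern: "$", digits, ".", two digits.
MONEY = re.compile(r"\$\d+\.\d{2}")

text = "The ticket cost $45.00 and parking was $12.50, not 30 dollars."
amounts = MONEY.findall(text)
print(amounts)
```

Here amounts holds the two dollar values and ignores “30 dollars”. A pattern this simple does not allow for thousands separators, so a value such as $1,200.00 would not be found; the pattern would need to be extended if such values matter.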

The system also uses the NLTK Part of Speech parser, whose patterns can be configured as a dictionary term. For instance, a noun-verb-noun phrase can be defined as: NLTK:{<NN><V.*>*<NN>} and the Dictionary name would be noun_phrase.

Search is the activity of applying one or more Dictionaries to a data table. You only have to do this once. If you modify the dictionary, the machine will perform the search again.
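To see what the noun-verb-noun tag pattern given above for the noun_phrase Dictionary selects, the sketch below mimics tag-pattern matching with the standard library only. It does not call NLTK (whose RegexpParser class accepts chunk grammars of this shape) and is not how NoviLens implements the feature. The tokens are assumed to be already tagged with Penn Treebank tags, and the function name is hypothetical.

```python
import re

# Equivalent of <NN><V.*>*<NN> over a string of tags: a noun, zero or
# more verb tags (VB, VBD, VBZ, ...), then another noun.
NOUN_VERB_NOUN = r"<NN>(?:<V[^<>]*>)*<NN>"


def find_phrases(tagged_tokens, tag_regex):
    """Return the word sequences whose tags match tag_regex."""
    # Encode the tag sequence as one string, e.g. "<DT><NN><VBD><NN>".
    tag_string = "".join("<%s>" % tag for _, tag in tagged_tokens)
    phrases = []
    for match in re.finditer(tag_regex, tag_string):
        # Each "<" marks one token, so counting them maps the match
        # back to token positions.
        first = tag_string[:match.start()].count("<")
        length = match.group().count("<")
        words = [word for word, _ in tagged_tokens[first:first + length]]
        phrases.append(" ".join(words))
    return phrases


# Illustrative, hand-tagged input; a POS tagger would normally supply it.
tagged = [("the", "DT"), ("dog", "NN"), ("chased", "VBD"),
          ("cat", "NN"), ("quickly", "RB")]
print(find_phrases(tagged, NOUN_VERB_NOUN))
```

Because the verb part of the pattern is followed by *, it is optional: two adjacent nouns such as “baseball bat” also match. If a verb must be present, the * would need to be replaced with +.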

A dictionary generated by the machine is only executed if you select Create Dictionary. The search with that dictionary then runs on its own.

2.3. Dictionaries from Machine Learning Models

NoviLens has the capability of combining Rules-Based Systems (Dictionaries) with Machine Learning (Topic Modelling) to provide the user with terms that may be of relevance. The user can either use these terms as new dictionaries or modify existing Dictionaries with the terms suggested by the machine.

Access to this functionality is via the Machine Learning function.

The process begins by selecting the table to analyze. A Dictionary Search using user-defined dictionaries should already have been run on the table before this analysis. If none has been performed, the system defaults to Topic Modelling.

  • The Chunking Dictionary initiates the vectorization of the data. We have found that a noun-verb-noun pattern gives the best results.

  • The Context Dictionary provides a comparator for the vectors.

  • The Text Field is the field to be analyzed.

The default algorithm for this analysis is Non-Negative Matrix Factorization, which is sufficient in the majority of cases for supplementing the terms in dictionaries.
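For readers unfamiliar with Non-Negative Matrix Factorization, the sketch below shows the idea on a tiny document-by-term count matrix. It uses the classic multiplicative update rules in pure Python, which is one standard way to compute the factorization; how NoviLens computes it is not described here, and in practice a library implementation (for example the NMF class in scikit-learn) would normally be used. The vocabulary, counts, and function names are made up for illustration.

```python
import random


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def squared_error(V, W, H):
    """Sum of squared differences between V and the product W x H."""
    WH = matmul(W, H)
    return sum((v - a) ** 2 for rv, ra in zip(V, WH) for v, a in zip(rv, ra))


def nmf(V, rank, iterations=200, seed=0):
    """Factor V (docs x terms) into W (docs x rank) and H (rank x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_docs)]
    H = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(rank)]
    eps = 1e-9  # guards against division by zero
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H


# Illustrative counts: rows are documents, columns follow `vocabulary`.
vocabulary = ["pitcher", "inning", "umpire", "goal", "keeper", "penalty"]
V = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [0, 0, 0, 3, 2, 2],
    [0, 0, 0, 2, 3, 1],
]

W, H = nmf(V, rank=2)
for k, row in enumerate(H):
    ranked = sorted(zip(row, vocabulary), reverse=True)
    print("Topic", k, [term for _, term in ranked[:3]])
```

Each row of H weights the vocabulary for one topic, and the highest-weighted terms are the candidates for supplementing a dictionary. The topic numbers in this sketch are arbitrary and are not ordered by relevance.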

Create the model

Once the model has completed, view the report. The results are expressed as Topics, with Topic 0 being the most relevant (as determined by vector space). You can either add the terms as a new dictionary and then modify it, or add the terms to a current dictionary by “cut and paste” editing of the appropriate dictionary.

2.4. Libraries

Libraries are a convenient way to cluster Dictionaries together for a common topic.