Natural Language Processing and Electronic Medical Records

The ability to analyze EMRs with Natural Language Processing (NLP) and Unstructured Text Analytics (UTA) enables hospitals to enhance their revenue cycles, optimize care offerings for their patient populations, and optimize their allocation of resources.

NLP and UTA are well suited to extracting facts from electronic medical records. The challenge has been the scarcity and cost of developers and data scientists trained in these techniques. NoviSystems Inc. has developed a process and a system called NoviLens that anyone in the healthcare field can use to perform NLP and UTA in the healthcare setting.

The Concept

Machines can convert sentences to numeric representations. Each word is mapped to a number, so each sentence in a document becomes a vector, and the vectors for all the sentences in a document form a matrix. Linear algebra can then be applied to compare them: vectors with similar values in this matrix should be related. If a vector has a label (classification), then “like” sentences with a similar number pattern are likely to belong under the same label.
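The idea above can be sketched in a few lines of Python. This is a minimal illustration that assumes a simple word-count (bag-of-words) representation and cosine similarity as the measure of closeness; the text does not state which representation or similarity measure NoviLens uses, and the sample sentences and function names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def build_vocabulary(sentences):
    """Assign every distinct word a column position, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for token in tokenize(sentence):
            vocab.setdefault(token, len(vocab))
    return vocab

def vectorize(sentence, vocab):
    """Turn one sentence into a vector of word counts, one entry per vocabulary word."""
    counts = Counter(tokenize(sentence))
    return [counts.get(word, 0) for word in vocab]

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical example sentences, invented for illustration only.
sentences = [
    "Patient reports shortness of breath and leg swelling.",
    "Shortness of breath and swelling of the legs noted.",
    "Follow-up appointment scheduled for next month.",
]

vocab = build_vocabulary(sentences)
matrix = [vectorize(s, vocab) for s in sentences]  # one row per sentence

sim_related = cosine_similarity(matrix[0], matrix[1])
sim_unrelated = cosine_similarity(matrix[0], matrix[2])
print(sim_related > sim_unrelated)
```

The first two sentences share several words, so their vectors point in similar directions, while the third shares none with the first and scores zero.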

In many systems, the sentence classifications are established by bulk assessment of a large collection of documents (e.g., Bidirectional Encoder Representations from Transformers, or BERT). In medicine, a similar strategy has been built using research papers as the learning material. While effective, these approaches require significant compute resources, often including specialized hardware. An alternative strategy that is effective and less resource intensive is the combination of Rules with Machine Learning, outlined below.

Rules are simple to construct in NoviLens. The principal method is to construct a Dictionary: a list of common terms that can be grouped together. Figure 1 illustrates the NLP/UTA process implemented in NoviLens.

Figure 1. NLP/UTA Process

A medical example is the MeSH (Medical Subject Headings) vocabulary. The lexicon for each topic can be viewed as a rule. If a MeSH category such as Congestive Heart Failure (CHF) is selected, then all sentences that contain its terms or their synonyms as defined by MeSH are labeled as Congestive Heart Failure. As the machine searches for these terms, it “learns” which sentences can be classified as CHF. Combining dictionaries to create a label improves specificity and context.
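A Dictionary rule of this kind might be sketched as follows. The term lists are short illustrative stand-ins, not the actual MeSH entry terms, and the rule names and helper functions are hypothetical rather than part of NoviLens. A label applies only when every dictionary in its rule matches, which is one simple way to model combining dictionaries for greater specificity.

```python
import re

# Hypothetical dictionaries: illustrative terms only, not the real MeSH entry lists.
CHF_TERMS = ["congestive heart failure", "heart failure", "chf"]
EDEMA_TERMS = ["edema", "swelling"]

# Each label maps to one or more dictionaries; all of them must match the sentence.
RULES = {
    "Congestive Heart Failure": [CHF_TERMS],
    "CHF with edema": [CHF_TERMS, EDEMA_TERMS],  # combined dictionaries
}

def matches_dictionary(sentence, terms):
    """True if any dictionary term appears in the sentence as a whole word or phrase."""
    text = sentence.lower()
    return any(re.search(r"\b" + re.escape(term) + r"\b", text) for term in terms)

def label_sentence(sentence, rules):
    """Return the sorted labels whose dictionaries all match the sentence."""
    return sorted(
        label
        for label, dictionaries in rules.items()
        if all(matches_dictionary(sentence, terms) for terms in dictionaries)
    )

print(label_sentence("History of CHF with lower extremity edema.", RULES))
print(label_sentence("Admitted with acute heart failure.", RULES))
print(label_sentence("No cardiac complaints.", RULES))
```

This sketch does plain term matching and does not handle negation, so a sentence such as "no evidence of heart failure" would still be labeled; a production rule set would need to account for that.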

Because sentences have been reduced to vectors, a sentence that contains Dictionary terms now has both a vector and a label. Sentences whose vectors are mathematically similar can have the same label applied. The user can filter on the “closeness” of the mathematical relationship. This allows the extraction of unlabeled sentences that represent a concept similar to that of the labeled, Dictionary-identified sentences.
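One way to picture this step is a nearest-neighbor lookup with a user-chosen threshold. The sketch below assumes cosine similarity and a single nearest labeled sentence; both are illustrative choices, the vectors are made-up toy values, and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def propagate_labels(labeled, unlabeled, threshold):
    """Give each unlabeled vector the label of its closest labeled vector,
    but only when the similarity reaches the threshold; otherwise None."""
    results = []
    for vector in unlabeled:
        best_label, best_score = None, 0.0
        for labeled_vector, label in labeled:
            score = cosine_similarity(vector, labeled_vector)
            if score > best_score:
                best_label, best_score = label, score
        results.append(best_label if best_score >= threshold else None)
    return results

# Toy vectors invented for illustration: one Dictionary-labeled sentence, two unlabeled.
labeled = [([1, 1, 1, 0, 0], "CHF")]
unlabeled = [[1, 1, 0, 1, 0], [0, 0, 0, 0, 1]]

loose = propagate_labels(labeled, unlabeled, threshold=0.5)
strict = propagate_labels(labeled, unlabeled, threshold=0.9)
print(loose, strict)
```

Raising the threshold trades recall for precision: at 0.5 the first unlabeled sentence inherits the CHF label, while at 0.9 neither sentence does.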

Dictionary association depends on having the dictionaries in place.  NoviLens is available with MeSH and MedDRA terms as dictionaries.  Generating a new set of dictionaries is a straightforward process guided by the software (

Application of NLP/UTA 

COVID Pandemic

Determining risk factors and planning resources are ideal examples of how NLP/UTA fact extraction can lessen the stress on a healthcare facility. While a COVID diagnosis can be determined by testing, outcomes are somewhat difficult to predict. Using NLP/UTA, it is possible to extract facts across this patient population and determine whether patterns exist that can predict outcomes.

Dictionaries that group symptoms and patient characteristics, combined with lab values, can provide the features needed to classify patients as high or low risk and then compare the effectiveness of treatment. Once these attributes have been defined, the healthcare service can screen records to identify patients who can be proactively managed through enhanced communication efforts.

Having the machine sort and cluster records to augment treatment is not resource intensive when using appropriate software (e.g., NoviLens) and may decrease the overall burden of care by preventing high-risk behavior in the community setting.

Rare Disease Patient Identification

NLP/UTA can also identify patients who may have an undiagnosed Rare Disease. If a physician has a set of diagnostic criteria, these can be placed in a dictionary. Using the diagnostic criteria for alpha-1 antitrypsin deficiency as an example, the user constructs a series of dictionaries using the MedDRA terms for the following:

  • Chronic obstructive pulmonary disease (COPD), regardless of age or ethnicity
  • Unexplained chronic liver disease
  • Unexplained bronchiectasis
  • Necrotizing panniculitis
  • Granulomatosis with polyangiitis

The system can read through the various tables, including text narratives, and label the sentences that contain the above-mentioned terms. These sentences can then be grouped by patient ID, and the system can score each patient by the number of terms matched. The clinician can view the abstracted table to determine whether the diagnosis should be ruled out.
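The label, group, and score steps can be sketched as below. The term lists are simplified stand-ins rather than the actual MedDRA terms, the patient IDs and sentences are invented, and `score_patients` is a hypothetical helper, not a NoviLens function. Plain term matching also cannot tell whether a finding is "unexplained"; that judgment stays with the clinician reviewing the table.

```python
import re
from collections import defaultdict

# Hypothetical dictionaries, one per diagnostic criterion (not the real MedDRA terms).
CRITERIA = {
    "COPD": ["chronic obstructive pulmonary disease", "copd"],
    "Unexplained chronic liver disease": ["chronic liver disease"],
    "Unexplained bronchiectasis": ["bronchiectasis"],
    "Necrotizing panniculitis": ["necrotizing panniculitis"],
    "Granulomatosis with polyangiitis": ["granulomatosis with polyangiitis"],
}

def score_patients(rows, criteria):
    """Label each (patient_id, sentence) row, group matches by patient, and
    return (patient_id, score, matched criteria) sorted by descending score."""
    matched = defaultdict(set)
    for patient_id, sentence in rows:
        text = sentence.lower()
        for criterion, terms in criteria.items():
            if any(re.search(r"\b" + re.escape(t) + r"\b", text) for t in terms):
                matched[patient_id].add(criterion)
    table = [(pid, len(found), sorted(found)) for pid, found in matched.items()]
    return sorted(table, key=lambda row: (-row[1], row[0]))

# Invented sample rows for illustration only.
rows = [
    ("P001", "Long-standing COPD, on inhaled therapy."),
    ("P001", "CT shows bronchiectasis in both lower lobes."),
    ("P002", "Chronic liver disease of unclear etiology."),
    ("P003", "Routine visit, no complaints."),
]

ranked = score_patients(rows, CRITERIA)
for patient_id, score, found in ranked:
    print(patient_id, score, found)
```

Each criterion is counted once per patient no matter how many sentences mention it, so the score reflects how many distinct criteria were matched. Patients with no matches are left out of the table.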

This methodology can be applied to any set of criteria where patient grouping may be desired, such as identifying patients suitable for clinical trials by matching the desired attributes against the patient record.

Quality of Healthcare

Quality of care is another example where NLP/UTA of the EMR can improve understanding of the relationship between patient care and outcomes.  The challenge in preparing a discharge plan for a patient is aggregating the facts necessary to assure proper care following discharge.  A lack of synchronization of care or of understanding of the home environment can lead to readmissions.

Constructing Dictionaries for each domain allows the Discharge Planner to aggregate all the necessary facts and determine whether the discharge plan is adequate.  Because the NLP/UTA system “reads” the notes, information from each perspective can be isolated and then combined for viewing. Figure 2 highlights the possible domains that can be fused to provide a holistic view of patient needs.

Figure 2. Aggregation of Patient Care Domains

These are just a few examples of how NLP/UTA can enable the healthcare provider to improve care.  None of the cases involves having the machine “diagnose” a patient; rather, the machine aggregates the information so that healthcare professionals can use their knowledge and understanding more effectively (Intelligence Augmentation).

NoviLens for healthcare professionals is an appliance that comes with the software necessary to connect to any FHIR-compliant EMR system, preloaded dictionaries, and full machine learning capabilities. The system fuses data from multiple sources and provides easy-to-use NLP, UTA, and interactive visualizations for data analytics.

Visit to learn more about NoviLens and join our email list to get updates on new features and case studies. If you’d like to learn how NoviLens can help your organization, contact us for a 30-minute demo at