NLP/IE Resources


Download the Automated De-Identification Model License

Automated De-Identification of Distributional Semantics Models, available from Technology Commercialization

Access the Automated De-Identification Algorithm GitHub

GitHub - nlpie/modeldeid: De-identification for co-occurrence models

An automated model de-identification algorithm applies aggressive de-identification to a word co-occurrence model without sacrificing performance for word sense disambiguation. While some very common words must be included in the model (e.g., words that appear as names in some of their occurrences, like “white”), the de-identification process removes anything that is not part of the SPECIALIST Lexicon, as well as any words found in patient information databases (e.g., names and addresses). The one exception to this rule, critical to maintaining good word sense disambiguation performance, is that the 2,000 most common words in the patient database are included in the model to allow for homonyms like “white,” as mentioned above.
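The filtering rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the released algorithm; the function name and the toy word lists are hypothetical.

```python
from collections import Counter

def deidentify_vocabulary(model_vocab, specialist_lexicon, patient_terms,
                          patient_term_counts, keep_top_n=2000):
    """Sketch of the de-identification rule: keep words in the SPECIALIST
    Lexicon, drop words found in patient information databases, except the
    top-N most common patient-database words (so homonyms like 'white'
    survive)."""
    # The N most common patient-database words are exempt from removal.
    common_patient_terms = {
        w for w, _ in Counter(patient_term_counts).most_common(keep_top_n)
    }
    kept = set()
    for word in model_vocab:
        if word in patient_terms and word not in common_patient_terms:
            continue  # identifying term: remove from the model
        if word in specialist_lexicon or word in common_patient_terms:
            kept.add(word)
    return kept

# Toy example with made-up word lists (keep_top_n=1 for illustration)
vocab = {"white", "aspirin", "johnson", "fracture"}
lexicon = {"white", "aspirin", "fracture"}
patient = {"white", "johnson"}
counts = {"white": 5000, "johnson": 12}
print(deidentify_vocabulary(vocab, lexicon, patient, counts, keep_top_n=1))
# → {'white', 'aspirin', 'fracture'} (order may vary); 'johnson' is removed
```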


Learn More About Using BioMedICUS on GitHub

The BioMedical Information Collection and Understanding System (BioMedICUS) leverages open source solutions for text analysis and provides new analytic tools for processing and analyzing text of biomedical and clinical reports. The system is being developed by our biomedical NLP/IE program at the University of Minnesota. This is a collaborative project that aims to serve biomedical and clinical researchers, allowing for customization with different texts.


Family history information is essential for understanding disease risk; more specifically, it is critical for individualized disease prevention, diagnosis, and treatment. Our previous work has included analyzing the representation of family history information in the EHR and developing a more comprehensive family history representation model. BioMedICUS includes a family history module, which identifies family history statements, observations (e.g., disease or procedure), the relative or side of family with attributes (i.e., vital status, age of diagnosis, certainty, and negation), and predications (“indicator phrases”) that are used to establish relationships between observations and family members.
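The output categories listed above can be pictured as a small record structure. The class and field names below are illustrative only and do not reflect BioMedICUS's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record types mirroring the module's output categories.
@dataclass
class Relative:
    relation: str                       # e.g. "mother", "paternal uncle"
    side_of_family: Optional[str] = None
    vital_status: Optional[str] = None

@dataclass
class Observation:
    text: str                           # disease or procedure mention
    negated: bool = False
    certainty: str = "certain"
    age_of_diagnosis: Optional[int] = None

@dataclass
class Predication:
    indicator_phrase: str               # links observation to family member
    relative: Relative
    observation: Observation

# "Patient's mother was diagnosed with breast cancer at age 54."
p = Predication(
    indicator_phrase="was diagnosed with",
    relative=Relative(relation="mother"),
    observation=Observation(text="breast cancer", age_of_diagnosis=54),
)
print(p.relative.relation, p.observation.text)  # → mother breast cancer
```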


Proposed extended hierarchy (high-level) for roles

The HL7/LOINC Document Ontology (DO) aids the use and exchange of clinical documents using a multi-axis structure of document attributes for Kind of Document, Setting, Role, Subject Matter Domain, and Type of Service. Prior studies have demonstrated the need to extend the values in select axes. This master list of 220 Role values was created from seven resources: the HL7/LOINC Document Ontology Role values; proposed values based on the Healthcare Provider Taxonomy; values from a local research-oriented clinical data repository; CMS Medicare Specialty Code values; the International Standard Classification of Occupations (ISCO); the Standard Occupational Classification (SOC); and values from Title V of the Patient Protection and Affordable Care Act (ACA).
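To make the multi-axis structure concrete, a single document can be described as one value per axis. The example values below are illustrative, not actual DO codes.

```python
# One hypothetical clinical document described along the five DO axes.
document_attributes = {
    "Kind of Document": "Note",
    "Type of Service": "Consultation",
    "Setting": "Outpatient",
    "Role": "Cardiologist",          # the axis extended by the master list
    "Subject Matter Domain": "Cardiology",
}
for axis, value in document_attributes.items():
    print(f"{axis}: {value}")
```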


Learn more about MTAP

MTAP is a foundational framework that enables users to create text analysis and NLP components. MTAP bridges the gap between idea prototyping and production-scale deployments by providing distributed data models and processing tools that use gRPC as the underlying communication framework. MTAP supports Python- and Java-based components working in tandem and is designed with ease of use in mind to accommodate users with minimal development experience.
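The component model MTAP is built around can be sketched in plain Python. The class and function names here are purely illustrative and are not MTAP's actual API; in MTAP, each processor would run as a gRPC service, which is what lets Python and Java components interoperate.

```python
from abc import ABC, abstractmethod

# Illustrative sketch of a document-processor component; names are
# hypothetical and do not reflect MTAP's real Python API.
class DocumentProcessor(ABC):
    @abstractmethod
    def process(self, text: str) -> dict:
        """Return labels computed over the document text."""

class SentenceSplitter(DocumentProcessor):
    def process(self, text: str) -> dict:
        # Naive period-based splitting, purely for illustration.
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        return {"sentences": sentences}

def run_pipeline(processors, text):
    """Chain processors over one document, merging their label outputs."""
    results = {}
    for p in processors:
        results.update(p.process(text))
    return results

print(run_pipeline([SentenceSplitter()], "First sentence. Second sentence."))
# → {'sentences': ['First sentence', 'Second sentence']}
```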


The Natural Language Processing Artifact Discovery And Preparation Toolkit (NLP-ADAPT) is a collection of programs and scripts presented as a Virtual Machine (VM) image and as a repository of Docker and Kubernetes specifications (NLP-ADAPT-Kube). These formats are designed to help researchers who wish to use clinical Natural Language Processing get off the ground fast. The VM image is designed for initial investigations, but as research moves to larger amounts of text, users may wish to use the Docker and Kubernetes versions of the programs to scale their pipelines and make better use of their computing resources. Installation instructions and download links can be found at the project’s GitHub repositories.

Learn more about NLP-ADAPT

Learn more about NLP-ADAPT-Kube

Aspects of the tools and pipelines were presented as a workshop at the AMIA Symposium 2019. Helpful resources are still available in the GitHub repository assembled to accompany the workshop presentation.

View the Workshop presentation at AMIA Symposium 2019


Learn more about NLP-PIER

PIER (Patient Information Extraction for Research) is an Information Extraction (IE) platform that provides direct access to patient data stored in free text of clinical notes. The underlying framework of PIER uses Elasticsearch technology and features the University of Minnesota Clinical NLP/IE program’s open source Natural Language Processing (NLP) application, BioMedICUS (BioMedical Information Collection and Understanding System). This resource aims to serve biomedical and clinical researchers and is a result of the collaborative efforts between the NLP/IE program, Clinical Translational Science Institute (CTSI), Minnesota Supercomputing Institute (MSI), and Academic Health Center Information System Research Development and Support Group.
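Because PIER is backed by Elasticsearch, a search over clinical notes ultimately takes the form of an Elasticsearch query body. The query below is a hypothetical example of that general shape; the index fields (`note_text`, `note_type`) are made up for illustration and are not PIER's actual schema.

```python
import json

# Hypothetical Elasticsearch query of the kind a PIER-style search might
# issue: full-text match on note text, filtered to one note type.
query = {
    "query": {
        "bool": {
            "must": [{"match": {"note_text": "atrial fibrillation"}}],
            "filter": [{"term": {"note_type": "discharge summary"}}],
        }
    },
    "size": 10,  # return at most 10 matching notes
}
print(json.dumps(query, indent=2))
```

In a live deployment this body would be sent to an Elasticsearch endpoint; here it is only printed to show the structure.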


Access NLP-TAB

We present NLP Type and Annotation Browser (NLP-TAB), an open-source system that facilitates exploration and analysis of NLP applications and their components without prior knowledge of their implementation. By storing and analyzing the results produced by each NLP application on one or more corpora using a type-agnostic data model, we allow users to discover which annotations best match their specific information retrieval tasks, as well as run comparisons between annotation types of separate applications.

The ultimate goal of NLP-TAB is to facilitate the development and deployment of information extraction systems that make use of the results of multiple NLP applications developed using the Apache Unstructured Information Management Architecture (UIMA) platform, maximizing their relative strengths and minimizing their weaknesses. To reach that goal, NLP-TAB has a threefold purpose. First, it allows users to explore and evaluate disparate NLP applications and the annotations they create through several visualization and information retrieval techniques. Second, it combines the results of different NLP systems for subsequent information retrieval. Here, leveraging multiple NLP applications may improve the accuracy and reliability of information extraction from medical texts, particularly when the NLP applications produce complementary results; NLP-TAB is designed to elucidate the degree to which different NLP applications are complementary. Third, NLP-TAB may eventually enable the reuse and interoperability of components from different pipelines through analysis and unsupervised creation of mappings between data types.
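The idea of a type-agnostic comparison can be sketched as follows: annotations from different systems are reduced to character spans plus an opaque type label, and agreement is measured by span overlap rather than by type names. This is a minimal illustration under assumed data; the tuples and the `match_rate` function are hypothetical, not NLP-TAB's implementation.

```python
# Annotations as (begin, end, type_name) spans from two hypothetical systems.
def overlap(a, b):
    """True if two (begin, end) character spans overlap."""
    return a[0] < b[1] and b[0] < a[1]

def match_rate(system_a, system_b):
    """Fraction of system A's annotations that overlap some annotation
    from system B, ignoring the (system-specific) type names."""
    if not system_a:
        return 0.0
    hits = sum(1 for a in system_a
               if any(overlap(a[:2], b[:2]) for b in system_b))
    return hits / len(system_a)

system_a = [(0, 7, "DiseaseDisorderMention"), (20, 28, "MedicationMention")]
system_b = [(0, 7, "umls_concept"), (40, 45, "umls_concept")]
print(match_rate(system_a, system_b))  # → 0.5
```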


This is a collection of reference standards created to test and validate computerized approaches to quantifying the degree of semantic relatedness and similarity between medical terms. Each dataset consists of a list of term pairs that have been evaluated by various healthcare professionals (e.g., medical coders, residents, clinicians) to determine the degree of semantic relatedness and similarity. The details pertaining to each dataset are provided in the referenced publications. 
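A typical use of such reference standards is to score each term pair with an automated relatedness measure and correlate those scores with the human ratings. The sketch below uses made-up numbers and a hand-rolled Pearson correlation purely to show the evaluation pattern; it is not tied to any particular dataset in the collection.

```python
import math

# Hypothetical term pairs: human relatedness ratings and scores from
# some automated similarity measure (all numbers are made up).
human = [4.0, 3.5, 1.0, 2.0]
system = [0.9, 0.7, 0.1, 0.4]

def pearson(x, y):
    """Pearson correlation between system scores and human ratings."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(round(pearson(human, system), 3))  # → 0.994
```

A higher correlation indicates the automated measure better reproduces the human judgments of relatedness.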



Clinical abbreviation sense inventory

A sense inventory is a collection of abbreviations and acronyms (short forms) with their possible senses (long forms), along with other corresponding information about these terms. For our comprehensive sense inventory of clinical abbreviations and acronyms, the 440 most frequently used abbreviations and acronyms were selected from 604,944 dictated clinical notes. The senses of each abbreviation and acronym were manually annotated from 500 random instances within clinical notes, yielding 949 senses in total, which were lexically aligned with 17,359 long forms in the Unified Medical Language System (UMLS), 5,233 long forms in Another Database of Abbreviations in Medline (ADAM), and 4,879 long forms in Stedman’s Medical Abbreviations, Acronyms & Symbols (4th edition).
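Structurally, a sense inventory is a mapping from each short form to its candidate long forms. The toy entries below are illustrative only and are not drawn from the released inventory.

```python
# Toy sense inventory: short form → possible long forms (illustrative).
sense_inventory = {
    "ca": ["cancer", "calcium", "carcinoma"],
    "pt": ["patient", "physical therapy", "prothrombin time"],
}

def senses(short_form):
    """Return the candidate long forms for a short form; a word sense
    disambiguation step would then pick one based on context."""
    return sense_inventory.get(short_form.lower(), [])

print(senses("PT"))  # → ['patient', 'physical therapy', 'prothrombin time']
```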


The ‘procedure description’ section of an operative note contains detailed descriptions of the actions performed during an operation. Action predicates (e.g., fill, incision, irrigate) encode predicative relations between nominal arguments (e.g., chamber, viscoelastic, Murphy hook, L5 root, antibiotic solution). These predicate arguments convey the important details about the actions performed during a procedure. This dataset includes frequent action predicates collected from 362,310 operation narratives obtained from University of Minnesota-affiliated Fairview Health Services, with mappings to the UMLS and the SPECIALIST Lexicon.
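A predication of this kind can be modeled as a predicate plus its nominal arguments. The structure and argument-role names below are illustrative, not the dataset's actual schema.

```python
from collections import namedtuple

# One predication: the action predicate and its nominal arguments.
Predication = namedtuple("Predication", ["predicate", "arguments"])

# Example sentence: "The anterior chamber was filled with viscoelastic."
# (argument role names like "theme"/"instrument" are hypothetical)
p = Predication(
    predicate="fill",
    arguments={"theme": "anterior chamber", "instrument": "viscoelastic"},
)
print(p.predicate, sorted(p.arguments.values()))
# → fill ['anterior chamber', 'viscoelastic']
```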


This data was collected from the Semantic MEDLINE Database (SemMedDB), version 30, December 2016 release. It contains sentences, subject/object entity information, and predicate information as output by SemRep, along with annotations indicating whether each semantic predication is indeed expressed in the sentence. The data was used for the paper "Evaluating Active Learning Methods for Annotating Semantic Predications Extracted from MEDLINE"; the associated manuscript is under review.


The integrated Dietary Supplements Knowledge Base (iDISK) covers a variety of dietary supplements, including vitamins, herbs, minerals, etc. It was standardized and integrated from the Dietary Supplements Label Database (DSLD), the "About Herbs" database from Memorial Sloan Kettering Cancer Center (MSKCC), the Canadian Natural Health Products and Ingredients database (NHP), and the Natural Medicines Comprehensive Database (NMCD) developed by the Therapeutic Research Center (TRC). iDISK contains a variety of attributes and relationships describing information about each dietary supplement such as which products it is an ingredient of and what drugs it might interact with.
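An iDISK-style entry, with its attributes and relationships, can be pictured as a small record. The field and relationship names below are hypothetical and do not reflect iDISK's actual schema; the ginkgo–warfarin interaction is used only as a familiar example.

```python
# Illustrative record for one dietary supplement ingredient.
ingredient = {
    "name": "Ginkgo biloba",
    "source_databases": ["DSLD", "MSKCC", "NHP", "NMCD"],
    "attributes": {"category": "herb"},
    "relationships": {
        "ingredient_of": ["Ginkgo 60 mg tablets"],   # example product
        "interacts_with": ["warfarin"],              # example drug interaction
    },
}
print(ingredient["name"], "->", ingredient["relationships"]["interacts_with"])
```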