NLP/IE Resources

Our Resources

BioMedICUS

Access BioMedICUS

The BioMedical Information Collection and Understanding System (BioMedICUS) leverages open source solutions for text analysis and provides new analytic tools for processing and analyzing text of biomedical and clinical reports. The system is being developed by our biomedical NLP/IE program at the University of Minnesota. This is a collaborative project that aims to serve biomedical and clinical researchers, allowing for customization with different texts.

Family history extraction from clinical texts

Access the family history module demonstration

Family history information is essential for understanding disease risk and, more specifically, critical for individualized disease prevention, diagnosis, and treatment. Our previous work has included analyzing the representation of family history information in the EHR and developing a more comprehensive family history representation model. BioMedICUS includes a family history module that identifies family history statements, observations (e.g., disease or procedure), the relative or side of family along with attributes (i.e., vital status, age of diagnosis, certainty, and negation), and predications (“indicator phrases”) that establish relationships between observations and family members.

HL7/LOINC Document Ontology: Role axis evaluation

Proposed extended hierarchy (high-level) for roles

The HL7/LOINC Document Ontology (DO) aids the use and exchange of clinical documents using a multi-axis structure of document attributes: Kind of Document, Setting, Role, Subject Matter Domain, and Type of Service. Prior studies have demonstrated the need to extend the values in select axes. This master list of 220 Role values was created from seven resources: HL7/LOINC Document Ontology Role values, proposed values based on the Healthcare Provider Taxonomy, values from a local research-oriented clinical data repository, CMS Medicare Specialty Code values, the International Standard Classification of Occupations (ISCO), the Standard Occupational Classification (SOC), and values from Title V of the Patient Protection and Affordable Care Act (ACA).

HL7/LOINC Document Ontology: Setting axis evaluation

Proposed extensions to HL7/LOINC Document Ontology setting axis

The HL7/LOINC Document Ontology (DO) is an ontology that provides a standard representation of clinical document metadata in a hierarchical structure composed of five axes: Kind of Document (KOD), Type of Service (TOS), Setting, Subject Matter Domain (SMD), and Role. The DO supports exchange of clinical documents across organizations and systems and also facilitates retrieval and reuse of documents for research and other secondary uses. A number of studies, however, demonstrate the need to extend the values in DO axes. This dataset was created by evaluating several data sources, resulting in a reorganized hierarchy with 254 additional values drawn from a local research-oriented clinical data repository, the CMS Place of Service (POS) code set, the Minnesota Health Care Plans (MHCP) provider manual, Minnesota Electronic Health Record mandate guidance for settings, the HL7 version 3 (v3) code set, and the National Healthcare Safety Network (NHSN).
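To make the five-axis structure concrete, the following is a minimal Python sketch of how a document's DO metadata might be held in memory and filtered by axis. The class name, the helper function, the choice to treat every axis as optional, and the example axis values are all assumptions made for illustration; they are not taken from the dataset or from any LOINC tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DocumentOntologyEntry:
    """One document described along the five DO axes (hypothetical model)."""
    kind_of_document: Optional[str] = None
    type_of_service: Optional[str] = None
    setting: Optional[str] = None
    subject_matter_domain: Optional[str] = None
    role: Optional[str] = None


def matches(entry: DocumentOntologyEntry, **criteria: str) -> bool:
    """Return True when every requested axis has exactly the requested value."""
    return all(getattr(entry, axis) == value for axis, value in criteria.items())


# Illustrative values only; real values come from the DO axis hierarchies.
consult_note = DocumentOntologyEntry(
    kind_of_document="Note",
    type_of_service="Consultation",
    setting="Hospital",
    subject_matter_domain="Cardiology",
    role="Physician",
)
```

With this sketch, `matches(consult_note, setting="Hospital", role="Physician")` selects the document by two axes at once, which mirrors the retrieval use case described above.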

NLP-PIER

Learn more about NLP-PIER

PIER (Patient Information Extraction for Research) is an information extraction (IE) platform that provides direct access to patient data stored in the free text of clinical notes. The underlying framework of PIER uses Elasticsearch technology and features the University of Minnesota Clinical NLP/IE program’s open source natural language processing (NLP) application, BioMedICUS (BioMedical Information Collection and Understanding System). This resource aims to serve biomedical and clinical researchers and is the result of a collaboration among the NLP/IE program, the Clinical Translational Science Institute (CTSI), the Minnesota Supercomputing Institute (MSI), and the Academic Health Center Information System Research Development and Support Group.

NLP-TAB

Access NLP-TAB

We present the NLP Type and Annotation Browser (NLP-TAB), an open-source system that facilitates exploration and analysis of NLP applications and their components without prior knowledge of their implementation. By storing and analyzing the results produced by each NLP application on one or more corpora using a type-agnostic data model, we allow users to discover which annotations best match their specific information retrieval tasks, as well as run comparisons between annotation types of separate applications.

The ultimate goal of NLP-TAB is to facilitate the development and deployment of information extraction systems that make use of the results of multiple NLP applications developed using the Apache Unstructured Information Management Architecture (UIMA) platform (http://uima.apache.org/), maximizing their relative strengths and minimizing their weaknesses. To reach that goal, NLP-TAB has a threefold purpose. First, it allows users to explore and evaluate disparate NLP applications and the annotations they create through several visualization and information retrieval techniques. Second, it combines the results of different NLP systems for subsequent information retrieval. Here, leveraging multiple NLP applications may improve the accuracy and reliability of information extraction from medical texts, particularly when the NLP applications produce complementary results. NLP-TAB is designed to elucidate the degree to which different NLP applications are complementary. Third, NLP-TAB may eventually enable the reuse and interoperability of components from different pipelines through analysis and unsupervised creation of mappings between data types.

Semantic similarity and relatedness package

UMLS-Similarity is an open source Perl package for similarity and relatedness measures. Currently the package implements a variety of semantic measures, such as edge counts, information content, and shortest path, based on ontologies and terminologies found in the Unified Medical Language System (UMLS) and WordNet. The package assigns a numeric value to each pair of input medical concepts, depending on the selected measure type, indicating how similar or related they are.

The semantic similarity and relatedness package is publicly available: http://search.cpan.org/dist/UMLS-Similarity.
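As a rough illustration of what a shortest-path measure computes, the Python sketch below scores two concepts as the reciprocal of the number of nodes on the shortest path between them in an is-a hierarchy. This is one common path-based formulation, not the package's own code or API; the toy hierarchy, concept names, and function names are hypothetical and are not drawn from the UMLS.

```python
from collections import deque

# Hypothetical toy is-a hierarchy (child -> parent); not taken from the UMLS.
IS_A = {
    "myocardial infarction": "heart disease",
    "heart failure": "heart disease",
    "heart disease": "cardiovascular disease",
    "stroke": "cardiovascular disease",
}


def build_graph(is_a):
    """Turn child -> parent links into an undirected adjacency map."""
    graph = {}
    for child, parent in is_a.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def shortest_path_nodes(graph, source, target):
    """Count the nodes on the shortest path (breadth-first search).

    Returns None when either concept is unknown or no path exists.
    """
    if source not in graph or target not in graph:
        return None
    queue = deque([(source, 1)])
    seen = {source}
    while queue:
        node, count = queue.popleft()
        if node == target:
            return count
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, count + 1))
    return None


def path_similarity(graph, concept_a, concept_b):
    """Reciprocal of the node count on the shortest path; 0.0 if unconnected."""
    nodes = shortest_path_nodes(graph, concept_a, concept_b)
    return 0.0 if nodes is None else 1.0 / nodes


GRAPH = build_graph(IS_A)
```

Under this formulation a concept compared with itself scores 1.0, and the score falls as concepts sit farther apart in the hierarchy. Information-content measures additionally weight concepts by corpus frequency, which this sketch does not attempt.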

Sense inventories

Clinical abbreviation sense inventory

http://purl.umn.edu/137703

A sense inventory is a collection of abbreviations and acronyms (short forms) with their possible senses (long forms), along with other corresponding information about these terms. For our comprehensive sense inventory for clinical abbreviations and acronyms, the 440 most frequently used abbreviations and acronyms were selected from 604,944 dictated clinical notes. A total of 949 senses were manually annotated from 500 random instances of each abbreviation and acronym within clinical notes and lexically aligned with 17,359 long forms of the Unified Medical Language System (UMLS), 5,233 long forms of Another Database of Abbreviations in Medline (ADAM), and 4,879 long forms in Stedman’s Medical Abbreviations, Acronyms & Symbols (4th edition).
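The short-form-to-long-form structure described above can be sketched in Python as a mapping from each short form to a count of annotated senses. The function names and the sample annotations are hypothetical and only illustrate the shape of the data; they do not reproduce entries or counts from the published inventory or its file format.

```python
from collections import Counter, defaultdict


def build_inventory(annotations):
    """Group (short form, long form) annotations into per-abbreviation sense counts."""
    inventory = defaultdict(Counter)
    for short_form, long_form in annotations:
        inventory[short_form][long_form] += 1
    return inventory


def possible_senses(inventory, short_form):
    """List the long forms observed for a short form, in alphabetical order."""
    return sorted(inventory.get(short_form, {}))


def majority_sense(inventory, short_form):
    """Return the most frequently annotated long form, or None if unseen."""
    counts = inventory.get(short_form)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Illustrative annotations only; not drawn from the actual dataset.
SAMPLE_ANNOTATIONS = [
    ("RA", "rheumatoid arthritis"),
    ("RA", "right atrium"),
    ("RA", "rheumatoid arthritis"),
    ("PT", "physical therapy"),
]
SAMPLE_INVENTORY = build_inventory(SAMPLE_ANNOTATIONS)
```

A majority-sense lookup like this is a simple frequency baseline; choosing the right sense for a given note generally requires the surrounding context.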

Clinical symbol sense inventory

http://purl.umn.edu/137704

Although clinical texts contain many symbols, relatively little attention has been given to symbol resolution by medical natural language processing (NLP) researchers. Interpreting the meaning of symbols may be viewed as a special case of Word Sense Disambiguation (WSD). One thousand instances of four common non-alphanumeric symbols (‘+’, ‘–’, ‘/’, and ‘#’) were randomly extracted from a clinical document repository and annotated by experts. De-identified data are available for researchers.

Surgical Action Predicates with Mapping

http://purl.umn.edu/137705

The ‘procedure description’ section of an operative note contains a significant amount of description of the actions performed during an operation. The action predicates (e.g., fill, incision, irrigate) encode predicative relations between nominal arguments (e.g., chamber, viscoelastic, Murphy hook, L5 root, antibiotic solution). These predicate arguments convey the important details about actions performed during a procedure. This dataset includes frequent action predicates collected from 362,310 operation narratives obtained from University of Minnesota-affiliated Fairview Health Services, with mappings to the UMLS and the SPECIALIST Lexicon.