IN34A-07
Document Exploration and Automatic Knowledge Extraction for Unstructured Biomedical Text

Wednesday, 16 December 2015: 17:30
2020 (Moscone West)
Selina Chu (1), Giuseppe Totaro (2,3), Nipurn Doshi (2,4), Shivika Thapar (2,4), Chris A Mattmann (5) and Paul Ramirez (2); (1) NASA Jet Propulsion Laboratory, Machine Learning and Instrument Autonomy, Pasadena, CA, United States; (2) NASA Jet Propulsion Laboratory, Pasadena, CA, United States; (3) Sapienza University of Rome, Rome, Italy; (4) Indiana University Bloomington, Bloomington, IN, United States; (5) NASA Jet Propulsion Laboratory, Instrument and Data Systems Section, Pasadena, CA, United States
Abstract:
We describe our work on building a web-browser-based document reader for biomedical text, with a built-in exploration tool and automatic concept extraction of medical entities. Vast amounts of biomedical information are available only as unstructured text in scientific publications and R&D reports. Text mining can help extract relevant knowledge from this large body of biomedical text, and the ability to employ such technologies to help researchers cope with information overload is highly desirable. In recent years, there has been increased interest in automatic biomedical concept extraction [1, 2] and in intelligent PDF reader tools that can search on content and find related articles [3]. Such reader tools are typically desktop applications and are limited to specific platforms. Our goal is to provide researchers with a simple tool to aid them in finding, reading, and exploring documents. We therefore propose a web-based document explorer, which we call Shangri-Docs, that combines a document reader with automatic concept extraction and highlighting of relevant terms. Shangri-Docs also handles a wide variety of document formats (e.g. PDF, Word, PPT, plain text) and exploits the linked nature of the Web and of personal content by searching public sites (e.g. Wikipedia, PubMed) and private cataloged databases simultaneously.
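
As an illustration of searching public and private sources simultaneously, the following is a minimal sketch and not the Shangri-Docs implementation. It assumes each source can be wrapped as a function that takes a query string and returns a list of hits. The names `make_backend` and `federated_search` are hypothetical, and the backends here are in-memory stand-ins, not real Wikipedia or PubMed clients.

```python
from concurrent.futures import ThreadPoolExecutor


def make_backend(name, titles):
    """Return a search function over an in-memory list of titles.

    Hypothetical stand-in for a real public or private search service.
    """
    def search(query):
        q = query.lower()
        return [{"source": name, "title": t} for t in titles if q in t.lower()]
    return search


def federated_search(query, backends):
    """Query every backend concurrently and merge hits in backend order."""
    with ThreadPoolExecutor(max_workers=max(1, len(backends))) as pool:
        # Submit all searches first so they run in parallel.
        futures = [pool.submit(backend, query) for backend in backends]
        merged = []
        for future in futures:
            merged.extend(future.result())
    return merged


# Toy data standing in for a public site and a private catalog.
backends = [
    make_backend("public", ["Migraine treatment review", "Cardiac imaging"]),
    make_backend("private", ["Internal migraine cohort notes"]),
]
results = federated_search("migraine", backends)
for hit in results:
    print(hit["source"], "-", hit["title"])
```

Submitting all queries before collecting any result lets a slow source overlap with the fast ones, while iterating the futures in submission order keeps the merged list deterministic.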

Shangri-Docs uses Apache cTAKES (clinical Text Analysis and Knowledge Extraction System) [4] and the Unified Medical Language System (UMLS) to automatically identify and highlight terms and concepts mentioned in the text, such as specific symptoms, diseases, drugs, and anatomical sites. cTAKES was originally designed specifically to extract information from clinical medical records. Our investigation led us to extend the automatic knowledge extraction process of cTAKES to the biomedical research domain by improving its ontology-guided information extraction process. We will describe our experience with, and implementation of, the system and share lessons learned during development. We will also discuss ways in which this approach could be adapted to other science fields.
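
The general idea of dictionary-guided term highlighting can be sketched as follows. This is a toy example under stated assumptions: it does not call cTAKES or UMLS, the `LEXICON` entries and category names are hypothetical, and matching is plain case-insensitive string lookup, not the linguistic pipeline that cTAKES provides.

```python
import html
import re

# Hypothetical toy lexicon: lower-case surface term -> semantic category.
LEXICON = {
    "headache": "symptom",
    "migraine": "disease",
    "ibuprofen": "drug",
    "frontal lobe": "anatomical_site",
}


def build_pattern(lexicon):
    """Compile one regex for all terms, longest first so multi-word terms win."""
    terms = sorted(lexicon, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in terms)
    return re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)


def extract_concepts(text, lexicon):
    """Return (start, end, surface form, category) for each lexicon match."""
    pattern = build_pattern(lexicon)
    return [
        (m.start(), m.end(), m.group(0), lexicon[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


def highlight(text, lexicon):
    """Wrap each matched term in a <mark> tag, escaping all other text."""
    out, pos = [], 0
    for start, end, surface, category in extract_concepts(text, lexicon):
        out.append(html.escape(text[pos:start]))
        out.append(f'<mark class="{category}">{html.escape(surface)}</mark>')
        pos = end
    out.append(html.escape(text[pos:]))
    return "".join(out)


sample = "Patient reports Headache; ibuprofen given."
print(highlight(sample, LEXICON))
```

Tagging each match with its category lets a web reader color symptoms, diseases, drugs, and anatomical sites differently. A real ontology-backed system would also need to handle synonyms, abbreviations, and negation, which this sketch ignores.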

[1] Funk et al., 2014.

[2] Kang et al., 2014.

[3] Utopia Documents, http://utopiadocs.com

[4] Apache cTAKES, http://ctakes.apache.org