IN44A-04
GeoDeepDive: Towards a Machine Reading-Ready Digital Library and Information Integration Resource

Thursday, 17 December 2015: 16:53
2020 (Moscone West)
Shanan E Peters, Miron Livny and Ian Ross, University of Wisconsin Madison, Geoscience, Madison, WI, United States
Abstract:
Recent developments in machine reading and learning approaches to text and data mining hold considerable promise for accelerating the pace and quality of literature-based data synthesis, but these advances have outpaced even basic levels of access to the published literature. For many geoscience domains, particularly those based on physical samples and field-based descriptions, this limitation is significant. Here we describe a general infrastructure to support published literature-based machine reading and learning approaches to information integration and knowledge base creation. This infrastructure supports rate-controlled automated fetching of original documents, along with full bibliographic citation metadata, from remote servers, the secure storage of original documents, and the utilization of considerable high-throughput computing resources for the pre-processing of these documents by optical character recognition, natural language parsing, and other document annotation and parsing software tools. New tools and versions of existing tools can be automatically deployed against original documents when they are made available. The products of these tools (text/XML files) are managed by MongoDB and are available for use in data extraction applications. Basic search and discovery functionality is provided by ElasticSearch, which is used to identify documents of potential relevance to a given data extraction task. Relevant files derived from the original documents are then combined into basic starting points for application building; these starting points are kept up-to-date as new relevant documents are incorporated into the digital library. Currently, our digital library stores contains more than 360K documents supplied by Elsevier and the USGS and we are actively seeking additional content providers. By focusing on building a dependable infrastructure to support the retrieval, storage, and pre-processing of published content, we are establishing a foundation for complex, and continually improving, information integration and data extraction applications. We have developed one such application, which we present as an example, and invite new collaborations to develop other such applications.