IN33A-1794
MemexGATE: Unearthing Latent Content Features for Improved Search and Relevancy Ranking Across Scientific Literature

Wednesday, 16 December 2015
Poster Hall (Moscone South)
Lewis John McGibbney (1), Chris A Mattmann (2), Paul Ramirez (3), Michael Joyce (1), Kim D Whitehall (2) and Brian D Wilson (3); (1) NASA Jet Propulsion Laboratory, Pasadena, CA, United States, (2) Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, United States, (3) Jet Propulsion Laboratory, Pasadena, CA, United States
Abstract:
Quantifying scientific relevancy is of increasing importance to NASA and the research community. Scientific relevancy may be defined by mapping the impacts of a particular NASA mission, instrument, and/or retrieved variables to disciplines such as climate predictions, natural hazards detection and mitigation processes, education, and scientific discoveries. Related to relevancy is the ability to expose data with similar attributes. This in turn depends on our ability to extract latent, implicit document features from scientific data and resources and make them explicit, accessible, and usable for search and other activities.

This paper presents MemexGATE, a server-side application, command-line interface, and computing environment for running large-scale metadata extraction, general architecture text engineering, document classification, and indexing tasks over document resources such as social media streams, scientific literature archives, and legal documentation. This work builds on existing experience using MemexGATE (funded, developed, and validated through the DARPA Memex Program, PI Mattmann) to extract and leverage latent content features from document resources within the Materials Research domain. We extend the software's functionality to the domain of scientific literature, with emphasis on expanding gazetteer lists, named entity rules, and natural language construct labeling (e.g. synonym, antonym, hyponym) to enable extraction of latent content features from data hosted by a wide variety of vendors of earth science literature (AGU Meeting Abstract Database, Springer, Wiley Online, Elsevier, etc.). Such literature makes both implicit and explicit references to NASA datasets and to relationships between such concepts stored across EOSDIS DAACs. We therefore envisage that a significant part of this effort will also include developing and understanding relevancy signals that can ultimately be used for improved search and relevancy ranking across scientific literature.
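
To make the gazetteer idea concrete, the following is a minimal Python sketch of gazetteer-based entity lookup: a list of known terms, grouped by type, is matched against free text so that implicit mentions become explicit annotations. It is an illustration only and is not MemexGATE's or GATE's actual implementation or API. The gazetteer entries, the type labels, and the function names `build_pattern` and `annotate` are all assumptions made for this example.

```python
import re
from typing import Dict, List, Pattern, Tuple

# Hypothetical gazetteer: entity type -> known surface forms.
# These entries and labels are assumptions for illustration only.
GAZETTEER: Dict[str, List[str]] = {
    "Mission": ["Aqua", "Terra", "GRACE"],
    "Instrument": ["MODIS", "AIRS"],
    "DataCenter": ["EOSDIS", "DAAC"],
}

Annotation = Tuple[int, int, str, str]  # (start, end, matched text, label)


def build_pattern(gazetteer: Dict[str, List[str]]) -> Tuple[Pattern[str], Dict[str, str]]:
    """Compile one regex over all terms and a term -> label lookup table."""
    lookup = {term: label for label, terms in gazetteer.items() for term in terms}
    # Sort longest first so a longer term wins over a shorter prefix of it.
    terms = sorted(lookup, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in terms)
    # Word boundaries prevent matches inside longer words; matching is case-sensitive.
    return re.compile(r"\b(?:" + alternation + r")\b"), lookup


def annotate(text: str, gazetteer: Dict[str, List[str]] = GAZETTEER) -> List[Annotation]:
    """Return every gazetteer hit in the text, in order of appearance."""
    pattern, lookup = build_pattern(gazetteer)
    return [
        (m.start(), m.end(), m.group(0), lookup[m.group(0)])
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    for start, end, term, label in annotate("MODIS on Terra data is archived at a DAAC."):
        print(f"{start}-{end}\t{label}\t{term}")
```

In this sketch the annotations carry character offsets, so they could later be attached to an indexed document as explicit fields for search; expanding coverage to a new domain then amounts to adding terms and types to the gazetteer.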