Emergent Data-Networks from Long-Tail Collections 

Friday, 19 December 2014: 9:43 AM
Mostafa Elag (1), Praveen Kumar (2), Luigi Marini (1), Margaret Hedstrom (3), James D Myers (3) and Beth A Plale (4); (1) University of Illinois at Urbana Champaign, Urbana, IL, United States, (2) University of Illinois, Urbana, IL, United States, (3) University of Michigan Ann Arbor, School of Information, Ann Arbor, MI, United States, (4) Indiana University Bloomington, Bloomington, IN, United States
Synthesizing scientific data produced by individuals and small research groups, known as long-tail data, with existing resources can yield useful scientific knowledge. In general, long-tail data are irreplaceable, expensive to reproduce, infrequently reused, follow no predefined data model, and are often bound to different information systems. We define the contextual relationships across the many attributes of such data in a data collection as a data-network. These relationships can provide deep insights into scientific challenges that require multidisciplinary interaction, by placing a new data object in the broader context of other data objects and characterizing its spatial and temporal dependencies on them. Despite the advances achieved in various geoscience information models, identifying and characterizing the contextual relationships among long-tail data is not always straightforward, because information models focus on profiling data attributes more than on exploring the links between data. To address this need, we have designed the Long Tail Data Networks (LTDN) engine, which uses a context-based approach to analyze data attributes, predict contextual relationships among data, and publish these relationships as an RDF graph. The engine groups data by geographic location into spatial collections, and applies binary logic predicates to the spatial, temporal, and variable attributes of the data entities in each spatial collection to infer their relationships. Here we present the design of the LTDN engine and demonstrate its application for predicting the latent connectivity among long-tail data collections.
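The grouping-and-predicate idea described above can be illustrated with a minimal sketch. It is not the LTDN implementation: the `Dataset` fields, the one-degree grid used to form spatial collections, the two predicates, and the `http://example.org/ltdn/` namespace are all assumptions made for illustration, and the triples are written by hand in N-Triples syntax rather than with an RDF library.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from itertools import combinations

BASE = "http://example.org/ltdn/"  # hypothetical namespace


@dataclass(frozen=True)
class Dataset:
    """Hypothetical minimal description of one long-tail data entity."""
    id: str
    lat: float
    lon: float
    start: date
    end: date
    variables: frozenset


def spatial_cell(ds, size=1.0):
    """Assign a dataset to a grid cell (assumed grouping rule)."""
    return (math.floor(ds.lat / size), math.floor(ds.lon / size))


def overlaps_in_time(a, b):
    """Binary predicate: the two observation periods intersect."""
    return a.start <= b.end and b.start <= a.end


def shares_variable(a, b):
    """Binary predicate: the two datasets record a common variable."""
    return bool(a.variables & b.variables)


# Illustrative predicate set; the names become RDF property names.
PREDICATES = {
    "overlapsInTime": overlaps_in_time,
    "sharesVariable": shares_variable,
}


def infer_triples(datasets, size=1.0):
    """Group datasets into spatial collections, test every pair within a
    collection against each predicate, and return N-Triples lines."""
    collections = defaultdict(list)
    for ds in datasets:
        collections[spatial_cell(ds, size)].append(ds)

    triples = []
    for members in collections.values():
        for a, b in combinations(members, 2):
            for name, predicate in PREDICATES.items():
                if predicate(a, b):
                    triples.append(
                        f"<{BASE}{a.id}> <{BASE}{name}> <{BASE}{b.id}> ."
                    )
    return triples
```

Under these assumptions, two datasets that fall in the same grid cell and cover overlapping periods produce one `overlapsInTime` triple, while a dataset in a different cell is never compared with them. A real engine would likely use richer spatial grouping and a proper RDF toolkit for serialization.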
To demonstrate the capabilities of the engine, we implemented this approach within the Sustainable Environment Actionable Data (SEAD) environment, an open-source semantic content repository that supports long-tail data curation and preservation, and show how relationships among datasets can be extracted. The results demonstrate the capability of the LTDN engine to predict the latent connectivity among long-tail geoscience data across domain boundaries and across temporal and spatial windows, and to establish dynamic Web-based data-networks in the Semantic Web context.