IN23E-08
Scientific Datasets: Discovery and Aggregation for Semantic Interpretation.

Tuesday, 15 December 2015: 15:25
2020 (Moscone West)
Luis Alberto Lopez (1), Soren Scott (2), Siri-Jodha S Khalsa (2) and Ruth Duerr (3); (1) National Snow and Ice Data Center, Boulder, CO, United States, (2) University of Colorado at Boulder, Boulder, CO, United States, (3) Ronin Institute for Independent Scholarship, Westminster, CO, United States
Abstract:
One of the biggest challenges that interdisciplinary researchers face is finding suitable datasets to advance their science, and this problem persists across disciplines. A surprising number of scientists, when asked what tool they use for data discovery, reply “Google”. That is an acceptable solution in some cases, but not even Google can find, or cares to compile, all the data that is relevant to science, and to the geosciences in particular. If a dataset is not discoverable through a well-known search provider, it will remain dark data to the scientific world.


For the past year, BCube, an EarthCube Building Block project, has been developing, testing and deploying a technology stack capable of web-scale data discovery using the ultimate dataset: the Internet. This stack has two principal components: a web-scale crawling infrastructure and a semantic aggregator. The web crawler is a modified version of Apache Nutch (the project from which Hadoop and other big data technologies originated) that has been improved and tailored for data and data service discovery. The semantic aggregator is a Python-based workflow that extracts valuable metadata and stores it in the form of triples through the use of semantic technologies.
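
To make the idea of storing metadata as triples concrete, here is a minimal sketch in Python. It is not taken from the BCube code base: the function names, the metadata fields and the example URL are assumptions made for illustration. The only external facts it relies on are the RDF, DCAT and Dublin Core Terms vocabulary URIs. It takes the metadata extracted for one crawled resource and expresses it as subject-predicate-object statements in N-Triples syntax.

```python
# Illustrative sketch only; not the BCube implementation.
# Function names and metadata fields are hypothetical.

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCAT_DATASET = "http://www.w3.org/ns/dcat#Dataset"
DCT = "http://purl.org/dc/terms/"  # Dublin Core Terms namespace


def metadata_to_triples(url, metadata):
    """Turn a flat metadata dict into (subject, predicate, object) triples."""
    # Every crawled resource is first declared to be a dataset.
    triples = [(url, RDF_TYPE, DCAT_DATASET)]
    # Assumed field names; missing or empty fields are skipped.
    for field in ("title", "description", "format"):
        value = metadata.get(field)
        if value:
            triples.append((url, DCT + field, value))
    return triples


def to_ntriples(triples):
    """Serialize triples as N-Triples, one statement per line."""
    lines = []
    for s, p, o in triples:
        if o.startswith("http://") or o.startswith("https://"):
            obj = "<%s>" % o  # object is another resource
        else:
            # Object is a plain literal; escape characters N-Triples reserves.
            escaped = (
                o.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
            )
            obj = '"%s"' % escaped
        lines.append("<%s> <%s> %s ." % (s, p, obj))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical record, as a crawler might hand it to the aggregator.
    record = {"title": "Sea ice extent", "format": "application/x-netcdf"}
    print(to_ntriples(metadata_to_triples("http://example.org/data/sea-ice.nc", record)))
```

Statements in this form can be loaded into any triple store and queried together, regardless of which metadata standard the original record used.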


While implementing the BCube stack we have run into several challenges: a) scaling the project to cover large portions of the Internet at a reasonable cost, b) making sense of very diverse and heterogeneous data, and c) extracting facts about these datasets using semantic technologies in order to make them usable by the geosciences community.


Despite these challenges, we have shown that we can discover and characterize data that would otherwise have remained in the dark corners of the Internet. Having all this data indexed and ‘triplelized’ will enable scientists to access a trove of information relevant to their work in a more natural way. An important characteristic of the BCube stack is that all the code we have developed is open source and available to anyone who wants to experiment and collaborate with the project at: http://github.com/b-cube/