IN51C-06:
The BCube Crawler: Web Scale Data and Service Discovery for EarthCube.

Friday, 19 December 2014: 9:15 AM
Luis Alberto Lopez (1), Siri-Jodha S Khalsa (2), Ruth Duerr (1), Abeve Tayachow (1) and Erik Mingo (1); (1) National Snow and Ice Data Center, Boulder, CO, United States, (2) University of Colorado at Boulder, Boulder, CO, United States
Abstract:
The web-crawling component of the NSF-funded BCube project is researching and applying big data technologies to find and characterize web services, catalog interfaces, and data feeds, such as ESIP OpenSearch, OGC W*S, THREDDS, and OAI-PMH, that describe or provide access to scientific datasets.

Given the scale of the Internet, which challenges even large search providers such as Google, the BCube plan for discovering these web-accessible services is to subdivide the problem into three smaller, more tractable issues. The first is to discover likely sites where relevant data and data services might be found. The second is to crawl the discovered sites deeply to find any data and services that might be present. The third is to use semantic technologies to characterize the services and data found and to filter out everything except what is relevant to the geosciences.

To address the first two challenges, BCube uses an adapted version of Apache Nutch, a web-scale crawler (and the project from which Hadoop originated), together with Amazon’s Elastic MapReduce service for flexibility and cost effectiveness. For characterization of the services found, BCube is examining existing web service ontologies for their applicability to our needs and will reuse and/or extend them in order to query for services with specific, well-defined characteristics, such as the use of geospatial namespaces in scientific datasets.
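
As a rough illustration of what characterizing a crawled document by its namespaces could look like, the following minimal sketch inspects the root element of an XML response and maps its namespace to a service family. It is not the BCube implementation: the function names, the `KNOWN_NAMESPACES` table, and the coarse "OGC" label for any namespace under `http://www.opengis.net/` are assumptions made for this example, and a real classifier would need to handle many more variants.

```python
import xml.etree.ElementTree as ET

# Illustrative (assumed) mapping from root-element namespace to service family.
KNOWN_NAMESPACES = {
    "http://a9.com/-/spec/opensearch/1.1/": "OpenSearch",
    "http://www.openarchives.org/OAI/2.0/": "OAI-PMH",
    "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0": "THREDDS",
}

# Assumption: any namespace under this prefix is treated as an OGC service.
OGC_PREFIX = "http://www.opengis.net/"


def root_namespace(xml_text):
    """Return the namespace URI of the root element, or '' if it has none."""
    root = ET.fromstring(xml_text)
    tag = root.tag
    # ElementTree encodes qualified names as '{namespace}localname'.
    if tag.startswith("{"):
        return tag[1:].split("}", 1)[0]
    return ""


def classify(xml_text):
    """Guess the service family of a crawled document (hypothetical helper)."""
    try:
        namespace = root_namespace(xml_text)
    except ET.ParseError:
        # Not well-formed XML, so it cannot be one of the XML-based services.
        return "unknown"
    if namespace in KNOWN_NAMESPACES:
        return KNOWN_NAMESPACES[namespace]
    if namespace.startswith(OGC_PREFIX):
        return "OGC"
    return "unknown"
```

For example, `classify('<WMS_Capabilities xmlns="http://www.opengis.net/wms" version="1.3.0"/>')` falls through to the OGC prefix check, while an HTML page without a namespace is reported as unknown.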

The original proposal for the crawler won a grant from Amazon’s academic program, which allowed us to become operational. We successfully tested the BCube Crawler at web scale, obtaining a significant corpus, sizeable enough to enable work on characterization of the services and data found. Plenty of work remains: doing “smart crawls” by managing the frontier, developing and enhancing our scoring algorithms, and fully implementing the semantic characterization technologies. We describe the current status of the project, our successes, and the issues encountered.
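
To make the idea of managing the frontier with a scoring function concrete, here is a minimal sketch of a priority-ordered crawl frontier. It does not use or reproduce Apache Nutch's scoring API; the `Frontier` class, the `score` function, and the `HINTS` keyword list are hypothetical, and the URL-keyword heuristic stands in for whatever scoring the project actually develops.

```python
import heapq
import itertools

# Assumed URL substrings that hint at a data service endpoint.
HINTS = ("wms", "wfs", "wcs", "opensearch", "thredds", "oai")


def score(url):
    """Toy relevance score: the number of hint substrings found in the URL."""
    lowered = url.lower()
    return sum(1 for hint in HINTS if hint in lowered)


class Frontier:
    """Crawl frontier that hands out the highest-scoring unseen URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        # Tie-breaker so URLs with equal scores come out in insertion order.
        self._counter = itertools.count()

    def add(self, url):
        if url in self._seen:
            return  # never enqueue the same URL twice
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop high scores first.
        heapq.heappush(self._heap, (-score(url), next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

With this ordering, a URL such as `http://example.org/thredds/catalog.xml` would be fetched before `http://example.org/about`, regardless of the order in which the two were discovered.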

The final goal of the BCube crawler project is to provide relevant data services to other projects on the EarthCube stack and to third-party partners so that they can be brokered and used by a wider scientific community.