Mash-up of techniques between data crawling/transfer, data preservation/stewardship and data processing/visualization technologies on a science cloud system designed for Earth and space science: a report of successful operation and science projects of the NICT Science Cloud

Friday, 19 December 2014
Ken T. Murata, NICT National Institute of Information and Communications Technology, Tokyo, Japan
Data-intensive or data-centric science is 4th paradigm after observational and/or experimental science (1st paradigm), theoretical science (2nd paradigm) and numerical science (3rd paradigm). Science cloud is an infrastructure for 4th science methodology. The NICT science cloud is designed for big data sciences of Earth, space and other sciences based on modern informatics and information technologies [1]. Data flow on the cloud is through the following three techniques; (1) data crawling and transfer, (2) data preservation and stewardship, and (3) data processing and visualization. Original tools and applications of these techniques have been designed and implemented. We mash up these tools and applications on the NICT Science Cloud to build up customized systems for each project.
In this paper, we discuss science data processing through these three steps. For big data science, data file deployment on a distributed storage system should be well designed in order to save storage cost and transfer time. We developed a high-bandwidth virtual remote storage system (HbVRS) and data crawling tool, NICTY/DLA and Wide-area Observation Network Monitoring (WONM) system, respectively. Data files are saved on the cloud storage system according to both data preservation policy and data processing plan. The storage system is developed via distributed file system middle-ware (Gfarm: GRID datafarm). It is effective since disaster recovery (DR) and parallel data processing are carried out simultaneously without moving these big data from storage to storage. Data files are managed on our Web application, WSDBank (World Science Data Bank). The big-data on the cloud are processed via Pwrake, which is a workflow tool with high-bandwidth of I/O. There are several visualization tools on the cloud; VirtualAurora for magnetosphere and ionosphere, VDVGE for google Earth, STICKER for urban environment data and STARStouch for multi-disciplinary data. There are 30 projects running on the NICT Science Cloud for Earth and space science.
In 2003 56 refereed papers were published. At the end, we introduce a couple of successful results of Earth and space sciences using these three techniques carried out on the NICT Sciences Cloud.