IN21A-3692:
Heavy Analysis and Light Virtualization of Water Use Data with Python

Tuesday, 16 December 2014
Hyungtae Kim, Center for Hydrologic Modeling, Irvine, CA, United States, Neeta Bijoor, University of California Irvine, Irvine, CA, United States and James S Famiglietti, Univ California Irvine, Irvine, CA, United States
Abstract:
Water utilities possess a large amount of water data that could be used to inform urban ecohydrology, management decisions, and conservation policies, but such data are rarely analyzed owing to difficulty in analyzation, visualization, and interpretion. We have developed a high performance computing resource for this purpose. We partnered with 6 water agencies in Orange County who provided 10 years of parcel-level monthly water use billing data for a pilot study. The first challenge that we overcame was to refine all human errors and unify the many different formats of data over all agencies. Second, we tested and applied experimental approaches to the data, including complex calculations, with high efficiency. Third, we developed a method to refine the data so it can be browsed along a time series index and/or geo-spatial queries with high efficiency, no matter how large the data. Python scientific libraries were the best match to handle arbitrary data sets in our environment. Further milestones include agency entry, sets of formulae, and maintaining 15M rows X 70 columns of data with high performance of cpu-bound processes. To deal with billions of rows, we performed an analysis virtualization stack by leveraging iPython parallel computing. With this architecture, one agency could be considered one computing node or virtual machine that maintains its own data sets respectively. For example, a big agency could use a large node, and a small agency could use a micro node. Under the minimum required raw data specs, more agencies could be analyzed. The program developed in this study simplifies data analysis, visualization, and interpretation of large water datasets, and can be used to analyze large data volumes from water agencies nationally or worldwide.