Discovering and Responding to the Challenges of Data Quality Throughout the Data Lifecycle

Friday, 19 December 2014
David F Moroni, NASA Jet Propulsion Laboratory, Pasadena, CA, United States
Data quality is perhaps one of the most valuable yet misunderstood and unresolved elements of the science data life cycle. This is despite significant effort by many within the international science data community to develop and refine the meaning of data quality and the corresponding standards, tools, and services which, when properly applied, collectively serve the interests of the data provider, the data center, and ultimately the end user. It is often thought that data quality concerns should focus primarily on ensuring that science data is well characterized and understood by the end user. Although this is a crucial goal, the common result of this singular emphasis is a tendency toward dataset-specific solutions, which are often not planned with long-term preservation in mind. Given the recent proliferation of tools and standards with which many data quality concerns may be addressed, it can be almost a lifelong pursuit for a single data user or provider to sift through them all, or even to become an expert in one particular standard such as ISO 19157. A further concern is that not all standards are freely available (e.g., those published by ISO), which adds a financial hurdle on top of an already steep learning curve.

A systems engineering approach offers a solution to the current data quality debacle by establishing and promoting a uniform and ubiquitous application of standards and solutions across heterogeneous datasets from many science disciplines. Here I present real-world examples along with both existing and theoretical solutions to known data quality concerns using a NASA-inspired systems engineering approach. Part of the problem is “knowing” the specific data quality concerns, which is why one of the tools I use is a simple “Use Case” template tailored for data quality. This template is designed with the heterogeneity of data quality issues in mind. Accompanying this template is a corresponding “Use Case Response”, which provides the systems engineer with an inventory of existing solutions and the degree to which those solutions may meet the deliverables required by each use case. The coupling of the “Use Case” with the “Use Case Response” is the primary key to mapping the elements of the knowledge base to the solutions.
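
As a rough illustration of how a “Use Case” might be coupled with its “Use Case Response”, the following Python sketch pairs the two records and scores each solution by the share of required deliverables it meets. The abstract does not specify the template's fields, so every name here (UseCase, UseCaseResponse, case_id, concern, deliverables, solutions, coverage) and every sample entry is a hypothetical assumption, not the author's actual template.

```python
from dataclasses import dataclass, field


@dataclass
class UseCase:
    """One data quality concern in a uniform shape (hypothetical fields)."""
    case_id: str
    concern: str
    deliverables: list[str]


@dataclass
class UseCaseResponse:
    """Inventory of existing solutions for one use case (hypothetical layout)."""
    case_id: str
    # Maps a solution name to the deliverables it is judged to meet.
    solutions: dict[str, set[str]] = field(default_factory=dict)

    def coverage(self, use_case: UseCase) -> dict[str, float]:
        """Return, per solution, the fraction of required deliverables met."""
        if use_case.case_id != self.case_id:
            raise ValueError("response does not belong to this use case")
        required = set(use_case.deliverables)
        if not required:
            # Nothing is required, so no fraction can be computed; report 0.0.
            return {name: 0.0 for name in self.solutions}
        return {
            name: len(met & required) / len(required)
            for name, met in self.solutions.items()
        }


# Illustrative entries only; they are not drawn from the presentation.
uc = UseCase(
    case_id="UC-1",
    concern="Quality flags are undocumented",
    deliverables=["flag definitions", "machine-readable metadata"],
)
resp = UseCaseResponse(
    case_id="UC-1",
    solutions={
        "solution A": {"flag definitions"},
        "solution B": {"flag definitions", "machine-readable metadata"},
    },
)
scores = resp.coverage(uc)
```

Keying both records on a shared identifier is one simple way to express the coupling described above; a real inventory would likely grade the degree of fit more finely than a met or not-met set.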