Collections and user tools for utilization of persistent identifiers in cyberinfrastructures
Abstract:The main use of persistent identifiers (PIDs) for data objects has so far been for formal publication and citation purposes with a focus on long-term availability and trust. This core use case has now evolved and broadened to include basic data management tasks as identifiers are increasingly seen as a possible anchor element in the deluge of data for purposes of large-scale automation of tasks. The European Data Infrastructure (EUDAT) for instance uses PIDs in their back-end services and distinctly so for entities where the identifier may be more persistent than a resource with limited lifetime. Despite breaking with the traditional metaphor, this offers new opportunities for data management and end-user tools, but also requires a clear demonstrated benefit of value-added services because en masse identifier assignment does not come at zero costs.
There are several obstacles to overcome when establishing identifiers at large scale. The administration of large numbers of identifiers can be cumbersome if they are treated in an isolated manner. Here, identifier collections can enable automated mass operations on groups of associated objects. Several use cases rely on base information that is rapidly available from the identifier systems without the need to retrieve objects, yet they will not work efficiently if the information is not consistently typed. Tools that span cyberinfrastructures and address scientific end-users unaware of the varying back-ends must overcome such obstacles. The Working Group on PID Information Types of the Research Data Alliance (RDA) has developed an interface specification and prototype to access and manipulate typed base information. Concrete prototypes for identifier collections exist as well. We will present some first data and provenance tracking tools that make extensive use of these recent developments and address different user needs that span from administrative tasks to individual end-user services with particular focus on data available from the Earth System Grid Federation (ESGF). We will compare the tools along their respective use cases with existing approaches and discuss benefits and limitations.