Detailed Provenance Metadata from Statistical Analysis Software: TaPP Applications Track

Authors: 

George Alter, University of Michigan; Jack Gager, Pascal Heus, and Carson Hunter, Metadata Technology North America; Sanda Ionescu, University of Michigan; Jeremy Iverson, Colectica; H V Jagadish, University of Michigan; Bertram Ludaescher, University of Illinois Urbana-Champaign; Jared Lyle, University of Michigan; Timothy McPhillips, University of Illinois Urbana-Champaign; Alexander Mueller, University of Michigan; Sigve Nordgaard and Ørnulf Risnes, Norwegian Centre for Research Data; Dan Smith, Colectica; Jie Song, University of Michigan; Thomas Thelen, University of California Santa Barbara

Abstract: 

We have created a set of tools for automating the extraction of fine-grained provenance from statistical analysis software used for data management. Our tools create metadata about steps within programs and variables (columns) within data-frames in a way consistent with the ProvONE extension of the PROV model. Scripts from the most widely used statistical analysis programs are translated into Structured Data Transformation Language (SDTL), an intermediate language for describing data transformation commands. SDTL can be queried to create histories of each variable in a dataset. For example, we can ask, “Which commands modified variable X?” or “Which variables were affected by variable Y?” SDTL was created to solve several problems. First, researchers are divided among a number of mutually unintelligible statistical languages. SDTL serves as a lingua franca providing a common language for downstream applications. Second, SDTL is a structured language that can be serialized in JSON, XML, RDF, and other formats. Applications can read SDTL without specialized parsing, and relationships among elements in SDTL are not defined by an external grammar. Third, SDTL provides general descriptions for operations that are handled differently in each language. For example, the SDTL MergeDatasets command describes both earlier languages (SPSS, SAS, Stata), in which merging is based on sequentially sorted files, and recent languages (R, Python) modelled on SQL. In addition, we have developed a flexible tool that translates SDTL into natural language. Our tools also embed variable histories including both SDTL and natural language translations into standard metadata files, such as Data Documentation Initiative (DDI) and Ecological Metadata Language (EML), which are used by data repositories to inform data catalogs, data discovery services, and codebooks. Thus, users can receive detailed information about the effects of data transformation programs without understanding the language in which they were written.
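
The two example questions in the abstract can be illustrated with a minimal Python sketch. The record layout below (`id`, `command`, `produces`, `consumes`) and the variable names are hypothetical simplifications chosen for illustration; they are not the actual SDTL schema, and the functions are not part of the authors' tools. The sketch also ignores command ordering, which a real variable history would need to respect.

```python
from collections import deque

# Hypothetical, simplified command records. This is NOT the real SDTL schema;
# each record only lists the variables a command reads and writes.
commands = [
    {"id": 1, "command": "Compute", "produces": ["income_k"], "consumes": ["income"]},
    {"id": 2, "command": "Recode", "produces": ["income_grp"], "consumes": ["income_k"]},
    {"id": 3, "command": "Compute", "produces": ["age_sq"], "consumes": ["age"]},
    {"id": 4, "command": "Compute", "produces": ["score"],
     "consumes": ["income_grp", "age_sq"]},
]


def commands_modifying(variable, commands):
    """Answer 'Which commands modified variable X?' by scanning outputs."""
    return [c["id"] for c in commands if variable in c["produces"]]


def variables_affected_by(variable, commands):
    """Answer 'Which variables were affected by variable Y?'.

    Follows consumes -> produces links transitively (breadth-first),
    so indirect effects through intermediate variables are included.
    """
    affected = set()
    frontier = deque([variable])
    while frontier:
        current = frontier.popleft()
        for c in commands:
            if current in c["consumes"]:
                for out in c["produces"]:
                    if out not in affected:
                        affected.add(out)
                        frontier.append(out)
    return affected


print(commands_modifying("income_k", commands))
print(sorted(variables_affected_by("income", commands)))
```

Because the records are plain dictionaries and lists, the same traversal would work on any JSON serialization with equivalent read/write information, which is the property the abstract attributes to a structured intermediate language.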

BibTeX
@inproceedings {274853,
author = {George Alter and Jack Gager and Pascal Heus and Carson Hunter and Sanda Ionescu and Jeremy Iverson and H V Jagadish and Bertram Ludaescher and Jared Lyle and Timothy McPhillips and Alexander Mueller and Sigve Nordgaard and {\O}rnulf Risnes and Dan Smith and Jie Song and Thomas Thelen},
title = {Detailed Provenance Metadata from Statistical Analysis Software: {TaPP} Applications Track},
booktitle = {13th International Workshop on Theory and Practice of Provenance (TaPP 2021)},
year = {2021},
url = {https://www.usenix.org/conference/tapp2021/presentation/alter},
publisher = {USENIX Association},
month = jul
}