09:00–10:00 | Thursday | Session Chair: Jun Zhao, Lancaster University
Trevor Martin, University of Bristol The so-called big data revolution has been characterised by an increase in sources of data as well as in the volume of data to be processed. In many cases—for example, network behaviour and control, security monitoring, enterprise management information—the data for situation awareness and decision-making is drawn from multiple sources and must be integrated into a coherent whole as far as possible.
This process generally requires both machines and human analysts and experts. It includes compensating for different formats, granularities, and resolutions; identifying and correcting errors (both systematic and intermittent); and managing uncertainties and gaps in the data. Often the process requires assumptions and choices to be made in arriving at a reasonably robust overview of a situation—for example, in deciding that a failed attempt to access a building is potentially malicious, we might need to take account of someone's recent travel, long-term patterns of behaviour, current schedules of close colleagues, etc., where each of these components may have been derived from lower-level raw data. Provenance in this context refers to the derivation pathways and their overall reliability.
In this talk, I will describe the use of graded (fuzzy) representations in modelling and managing the uncertainty, reliability, and granularity of derived data when combining sources for situation awareness.
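As a rough illustration of the graded approach (not taken from the talk itself), the sketch below combines reliability grades along a derivation path using the min t-norm, one common fuzzy choice; all names and values are invented.

    # Sketch: propagating graded (fuzzy) reliability along a derivation
    # path. The min t-norm is one common conjunctive combiner; the
    # talk's actual model may differ.
    def combine_all_required(grades):
        """Support for a conclusion that needs ALL inputs (min t-norm)."""
        return min(grades)

    def combine_any_suffices(grades):
        """Support for a conclusion backed by ANY input (max s-norm)."""
        return max(grades)

    # Hypothetical evidence for "failed access attempt is malicious":
    evidence = {
        "recent_travel_anomaly": 0.7,       # graded membership, not probability
        "long_term_pattern_break": 0.4,
        "colleague_schedule_conflict": 0.9,
    }

    overall = combine_all_required(evidence.values())
    print(f"overall support: {overall:.2f}")   # 0.40: the weakest link dominates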
10:00–10:30 | Thursday | Break
10:30–12:10 | Thursday
Amit Chavan, University of Maryland; Silu Huang, University of Illinois at Urbana-Champaign; Amol Deshpande, University of Maryland; Aaron Elmore, University of Chicago; Samuel Madden, MIT; Aditya Parameswaran, University of Illinois at Urbana-Champaign Organizations and teams collect and acquire data from various sources, such as social interactions, financial transactions, sensor data, and genome sequencers. Different teams in an organization, as well as different data scientists within a team, are interested in extracting a variety of insights, which requires combining and collaboratively analyzing datasets in diverse ways. DataHub is a system that aims to provide robust version control and provenance management for such a scenario. To be truly useful for collaborative data science, one also needs the ability to specify queries and analysis tasks over the versioning and the provenance information in a unified manner. In this paper, we present an initial design of our query language, called VQuel, that aims to support such unified querying over both types of information, as well as the intermediate and final results of analyses. We also discuss some of the key language design and implementation challenges moving forward.
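The abstract does not show VQuel syntax, so the following Python sketch only illustrates the kind of unified version-plus-provenance query the language targets: walking a version DAG while reporting the analysis task that produced each derived version. Every class and field here is invented.

    # Hypothetical model of unified version/provenance querying in the
    # spirit of DataHub. This is not VQuel; all names are illustrative.
    from dataclasses import dataclass

    @dataclass
    class Version:
        vid: str
        parents: list      # version-level lineage (the version DAG)
        derived_by: str    # the analysis task that produced this version

    def versions_derived_from(versions, ancestor_vid):
        """All versions transitively derived from `ancestor_vid`, paired
        with the task (provenance) recorded for each derivation step."""
        out, seen = [], {ancestor_vid}
        frontier = {ancestor_vid}
        while frontier:
            nxt = {v.vid for v in versions
                   if set(v.parents) & frontier} - seen
            out += [(v.vid, v.derived_by) for v in versions if v.vid in nxt]
            seen |= nxt
            frontier = nxt
        return out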
Xing Niu, Raghav Kapoor, and Boris Glavic, Illinois Institute of Technology; Dieter Gawlick, Zhen Hua Liu, and Vasudha Krishnaswamy, Oracle Corporation; Venkatesh Radhakrishnan, Facebook Since its inception, the PROV standard has been widely adopted as a standardized exchange format for provenance information. Surprisingly, this standard is currently not supported by provenance-aware database systems, limiting their interoperability with other provenance-aware systems. In this work, we introduce techniques for exporting database provenance as PROV documents, importing PROV graphs alongside data, and linking outputs of an SQL operation to the imported provenance for its inputs. Our implementation in the GProM system offloads the generation of PROV documents to the backend database. This enables provenance tracking for applications that use a relational database for managing (part of) their data, but also execute some non-database operations.
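For readers unfamiliar with the target format, the sketch below uses the Python prov package (pip install prov) to build the shape of document such an exporter emits: an output tuple linked to the input tuples and the SQL operation it was derived from. This is not GProM's implementation, which generates PROV inside the database backend, and all identifiers are invented.

    # Sketch of a PROV document linking a query result to its inputs,
    # built with the Python `prov` package. Identifiers are illustrative.
    from prov.model import ProvDocument

    doc = ProvDocument()
    doc.add_namespace('db', 'http://example.org/db/')

    result = doc.entity('db:result_tuple_1')     # output of the SQL operation
    in1 = doc.entity('db:orders_tuple_42')       # input tuples
    in2 = doc.entity('db:customers_tuple_7')
    query = doc.activity('db:sql_join_query')    # the SQL operation itself

    doc.used(query, in1)
    doc.used(query, in2)
    doc.wasGeneratedBy(result, query)
    doc.wasDerivedFrom(result, in1)
    doc.wasDerivedFrom(result, in2)

    print(doc.serialize(indent=2))               # PROV-JSON by default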
Adam Bates and Kevin R.B. Butler, University of Florida; Thomas Moyer, MIT Lincoln Laboratory When performing automatic provenance collection within the operating system, inevitable storage overheads are made worse by the fact that much of the generated lineage is uninteresting, describing noise and background activities that lie outside the scope of the system’s intended use. In this work, we propose a novel approach to policy-based provenance pruning: leverage the confinement properties provided by Mandatory Access Control (MAC) systems to identify subdomains of system activity for which to collect provenance. We consider the assurances of completeness that such a system could provide by sketching algorithms that reconcile provenance graphs with the information flows permitted by the MAC policy. We go on to identify the design challenges in implementing such a mechanism. In a simplified experiment, we demonstrate that adding a policy component to the Hi-Fi provenance monitor could reduce storage overhead by as much as 82%. To our knowledge, this is the first practical policy-based provenance monitor to be proposed in the literature.
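As a toy rendering of the idea (far simpler than reconciling full MAC information flows, and not the paper's Hi-Fi implementation), the filter below records a provenance event only when its subject's label falls in a monitored domain; the SELinux-style labels are invented.

    # Toy policy-based pruning: keep only events whose subject label is
    # in a monitored MAC (sub)domain. Labels are hypothetical.
    MONITORED_DOMAINS = {"webserver_t", "db_t"}

    def should_record(event):
        return event["subject_label"] in MONITORED_DOMAINS

    events = [
        {"subject_label": "webserver_t", "op": "read",  "obj": "/var/www/index.html"},
        {"subject_label": "cron_t",      "op": "exec",  "obj": "/usr/bin/backup"},
        {"subject_label": "db_t",        "op": "write", "obj": "/var/lib/db/table"},
    ]

    pruned = [e for e in events if should_record(e)]
    print(f"kept {len(pruned)} of {len(events)} events")   # kept 2 of 3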
Nikilesh Balakrishnan, Thomas Bytheway, Lucian Carata, Oliver R. A. Chick, James Snee, Sherif Akoush, and Ripduman Sohan, University of Cambridge; Margo Seltzer, Harvard University; Andy Hopper, University of Cambridge In recent years several hardware and systems fields have made advances in technology that open new opportunities and challenges for provenance systems. In this paper we look at such technologies and discuss the implications they have for provenance. First, we discuss processor and memory controller technologies that enable fine-grained lineage capture, resulting in more precise and accurate provenance. Then, we look at programmable storage, 3D memory and co-processor technologies discussing how lineage capture in these heterogeneous environments results in richer and more complete provenance. We finally look at technological advances in the field of networking, namely NFV and SDN, discussing how these technologies enable easier provenance capture in the network.
12:10–12:40 | Thursday | 3-Minute Gong Show—Poster Pitches
12:40–14:00 | Thursday | Poster Session and Lunch
14:00–15:15 | Thursday
Daniel de Oliveira, Universidade Federal Fluminense; Vítor Silva and Marta Mattoso, Federal University of Rio de Janeiro Provenance databases are an important asset in data analytics of large-scale scientific data. The data derivation path allows for identifying parameters, files, and domain data values of interest. In scientific workflows, provenance data is automatically captured by workflow systems. However, the power of provenance data analyses depends on the expressiveness of domain-specific data along the provenance traces. While much has been done through the W3C PROV initiative and its PROV-DM to represent generic provenance data, representing domain-specific data in provenance traces has received little attention, yet it accounts for a large number of provenance analytical queries. Such queries are based on selections on data values from input/output artifacts along workflow activities. There are several problems in modeling and capturing values from domain-specific attributes: some are related to managing provenance granularity, others to addressing data values hidden inside files and to representing the semantics of domain data. In this work, we discuss these open issues and propose some alternatives for domain-specific provenance data capture, representation, storage, and querying. Addressing these issues may be decisive in using provenance to drive scientific data analyses at large scale.
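A sketch of the kind of analytical query at issue, with an invented trace schema: selecting workflow activities by domain data values that were extracted from their output files.

    # Sketch: selection over domain-specific values attached to a
    # provenance trace. The schema and values are invented.
    trace = [
        {"activity": "mesh_gen", "output": "mesh.dat",
         "domain": {"cells": 120000, "max_skew": 0.42}},
        {"activity": "solver", "output": "field.h5",
         "domain": {"residual": 1e-6, "max_pressure": 310.5}},
    ]

    # Which activities produced an output with max_pressure above 300?
    hits = [t["activity"] for t in trace
            if t["domain"].get("max_pressure", 0) > 300]
    print(hits)   # ['solver']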
João Felipe Nicolaci Pimentel, Vanessa Braganholo, and Leonardo Murta, Universidade Federal Fluminense; Juliana Freire, New York University Interactive notebooks help users explore code, run simulations, visualize results, and share them with other people. While these notebooks have been widely adopted in teaching as well as by scientists and data scientists who perform exploratory analyses, their provenance support is limited to the visualization of some intermediate results and code sharing. Once a user arrives at a result, it is hard, and sometimes impossible, to retrace the steps that led to it, since notebooks do not collect provenance for intermediate results or for the environment. As a result, users must fill this gap using external tools such as workflow management systems. To overcome this limitation, we propose a new approach to capture provenance from notebooks. We build upon noWorkflow, a system that systematically collects provenance for Python scripts. By integrating noWorkflow and notebooks, provenance is automatically and transparently captured, allowing users to focus on their exploratory tasks within the notebook. In addition, they are able to analyze provenance information within the notebook, to both reason about and debug their work, using visualizations, SQL queries, Prolog queries, and Python code.
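noWorkflow persists the provenance it collects in a local SQLite database, so one way to analyze it from inside a notebook is plain SQL; the path and schema in this sketch are assumptions that may differ across noWorkflow versions.

    # Sketch: inspecting noWorkflow provenance with SQL from a notebook.
    # Database path and table/column names are assumed, not guaranteed.
    import sqlite3

    conn = sqlite3.connect(".noworkflow/db.sqlite")    # assumed location
    for row in conn.execute(
            "SELECT id, script, start FROM trial ORDER BY start DESC LIMIT 5"):
        print(row)    # recent trials: which script ran, and when
    conn.close()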
Saumen Dey, University of California, Davis; Khalid Belhajjame, Université Paris-Dauphine; David Koop, University of Massachusetts Dartmouth; Meghan Raul, University of California, Davis; Bertram Ludäscher, University of Illinois at Urbana-Champaign Scripting languages like Python, R, and MATLAB have seen significant use across a variety of scientific domains. To assist scientists in the analysis of script executions, a number of mechanisms, e.g., noWorkflow, have been recently proposed to capture the provenance of script executions. The provenance information recorded can be used, e.g., to trace the lineage of a particular result by identifying the data inputs and the processing steps that were used to produce it. By and large, the provenance information captured for scripts is fine-grained in the sense that it captures data dependencies at the level of script statements, and does so for every variable within the script. While useful, the amount of recorded provenance information can be overwhelming for users and cumbersome to use. This suggests the need for abstraction mechanisms that focus attention on the specific parts of provenance relevant for analyses. Toward this goal, we propose that the fine-grained provenance information recorded as the result of script execution can be abstracted using user-specified, workflow-like views. Specifically, we show how the provenance traces recorded by noWorkflow can be mapped to the workflow specifications generated by YesWorkflow from scripts based on user annotations. We examine the issues in constructing a successful mapping, provide an initial implementation of our solution, and present competency queries illustrating how a workflow view generated from the script can be used to explore the provenance recorded during script execution.
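To make the mapping concrete, here is a small invented script carrying YesWorkflow-style annotations in its comments; YesWorkflow derives the workflow view from the @begin/@in/@out/@end markers, onto which noWorkflow's fine-grained trace can then be mapped.

    # Invented example of YesWorkflow annotations embedded in comments.
    # @begin clean_data
    # @in raw_file @uri file:raw.csv
    # @out clean_file @uri file:clean.csv
    def clean_data():
        pass   # drop malformed rows from raw.csv, write clean.csv
    # @end clean_data

    # @begin plot_results
    # @in clean_file @uri file:clean.csv
    # @out plot @uri file:results.png
    def plot_results():
        pass   # render clean.csv into results.png
    # @end plot_results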
15:15–15:45 | Thursday | Break
15:45–17:00 | Thursday
Stefan Fehrenbach and James Cheney, University of Edinburgh Today’s programming languages provide no support for data provenance. In a world that increasingly relies on data, we need provenance to judge the reliability of data, and we should therefore aim to make it easily accessible to programmers. We report our work in progress on an extension to the Links programming language that builds on its support for language-integrated query to support where-provenance queries through query rewriting and a type system extension that distinguishes provenance metadata from other data. Our approach works solely within the language implementation and thus requires no changes to the database system. The type system, together with automatic propagation of provenance metadata, prevents programmers from accidentally changing provenance, losing it, or misattributing it to other data.
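Links' concrete syntax is not shown in the abstract, but the essence of where-provenance can be mimicked in a few lines of Python: each value carries its (table, column, row) origin in a type that keeps the metadata separate from the data. All names below are illustrative.

    # Sketch of where-provenance: a value paired with its origin triple,
    # kept distinct from ordinary values by its type. Mimics in plain
    # Python what the Links extension enforces statically.
    from typing import NamedTuple

    class Prov(NamedTuple):
        table: str
        column: str
        row: int

    class Traced(NamedTuple):
        value: object
        prov: Prov

    presidents = [
        Traced("Lincoln", Prov("people", "name", 16)),
        Traced("Grant",   Prov("people", "name", 18)),
    ]

    # Queries that project or filter values propagate provenance untouched:
    names = [t for t in presidents if t.value.startswith("G")]
    print(names[0].prov)   # Prov(table='people', column='name', row=18)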
Boris Glavic, Illinois Institute of Technology; Sven Köhler, University of California, Davis; Sean Riddle, Athenahealth Corporation; Bertram Ludäscher, University of Illinois at Urbana-Champaign Explaining why an answer is present (traditional provenance) or absent (missing answer provenance) from a query result is important for many use cases. Most existing approaches use the existence (or absence) of input data to explain a (missing) answer. However, for realistically sized databases, these explanations can be very large and, thus, may not be very helpful to a user. In this paper, we argue that constraints as a concise description of large (or even infinite) sets of existing or missing inputs can provide a natural way of answering a Why- or Why-not question. For instance, to explain why no non-US citizen is in the answer of a query returning US presidents we could list all possible combinations of persons and countries of citizenship that are missing from the input. However, a more concise and insightful explanation is that the constraint that US presidency implies US citizenship prevents any results from being returned by our query. We demonstrate how a taxonomy expressed as inclusion dependencies can provide meaningful justifications for (non-) answers and outline how to find a most general such explanation for a given query using datalog. Furthermore, we sketch several variations of this framework derived by considering other types of constraints as well as alternative definitions of explanation and generalization.
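Written out (notation assumed, not taken from the paper), the example's explanatory constraint is the inclusion dependency

    \forall x \,\bigl( \mathit{President}(x, \mathit{US}) \rightarrow \mathit{Citizen}(x, \mathit{US}) \bigr)

which makes the query \{ x \mid \mathit{President}(x, \mathit{US}) \wedge \neg\, \mathit{Citizen}(x, \mathit{US}) \} unsatisfiable on every instance, explaining the missing answers without enumerating any tuples.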
Maxime Debosschere and Floris Geerts, Universiteit Antwerpen In recent work, Salimi and Bertossi provide a tight connection between causality and tuple-based data repairs. We investigate this connection between causality and two other kinds of repair models. First, we consider cell-based V-repairs, i.e., repairs that are obtained by modifying cells in the data. In contrast, tuple-based repairs only allow for the deletion of tuples. Second, we introduce a new notion of repairs, called chase repairs, that take into account the procedural (chase) steps that lead to a repair. We establish a connection between causes (and the associated notion of responsibility) and V-repairs, and analyse the complexity of verifying whether a cell is a cause and whether its responsibility is above a certain threshold. Our understanding of chase repairs is still very preliminary, and we argue that provenance models that are specifically targeted to data repairs and data quality in general are needed to make formal connections between causality and chase repairs.
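For reference, the usual responsibility measure from the database causality literature (assuming the paper follows the standard Chockler–Halpern-style definition) is

    \rho(t) = \frac{1}{1 + \min_{\Gamma} |\Gamma|}

where \Gamma ranges over contingency sets: minimal sets of tuples (or, for V-repairs, cells) whose removal or modification turns t into a counterfactual cause. Checking whether a responsibility is above a certain threshold \tau then amounts to deciding \rho(t) > \tau.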