Agenda

Thursday, June 12, 2014

9:00 a.m.–9:15 a.m. Thursday

Welcome

9:15 a.m.–10:15 a.m. Thursday
10:15 a.m.–10:45 a.m. Thursday

Break

10:45 a.m.–noon Thursday

Session I: Usage

Reorganizing Workflow Evolution Provenance

David Koop and Juliana Freire, New York University

The provenance of related computations presents the opportunity to better understand and explore the differences and similarities of various approaches. As users design and refine workflows, evolution provenance captures the relationships between workflows as actions that mutate one workflow to another. However, such provenance may not always be the most compact or intuitive. This paper presents algorithms to update and transform workflow evolution provenance to achieve a representation that better exposes the correspondences between computations. We evaluate these algorithms based on the efficiency of the representation as well as the speed of the transformation.
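
As a rough illustration of how evolution provenance might be represented and compacted (a minimal sketch in Python; the action encoding and the compaction rule here are illustrative, not the paper's algorithms):

    # Evolution provenance as a sequence of actions that mutate one
    # workflow version into the next.
    def compact(actions):
        """Drop add/delete pairs that cancel out, yielding a more compact
        action sequence with the same net effect."""
        result = []
        for op, module in actions:
            if op == "delete" and ("add", module) in result:
                result.remove(("add", module))  # the add never took effect
            else:
                result.append((op, module))
        return result

    history = [("add", "Filter"), ("add", "Plot"), ("delete", "Filter")]
    print(compact(history))  # [('add', 'Plot')]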

Influence Factor: Extending the PROV Model With a Quantitative Measure of Influence

Matthew Gamble and Carole Goble, University of Manchester

A central tenet of provenance is to support the assessment of the quality, reliability, or trustworthiness of data. The World Wide Web Consortium’s (W3C) PROV provenance data model shares this goal, and provides a domain-agnostic interchange language for provenance representation. In this paper we suggest that, given the PROV model as it stands, there are cases where the information about how one entity has influenced another falls short of what is required to make these assessments. In light of this, we propose a simple extension to the model to capture a quantitative measure of influence.

To understand how provenance publishers use PROV to describe influence, we consulted the current ProvBench datasets and evaluated the usage of the 13 subproperties of wasInfluencedBy. The findings suggest that publishers are willing to provide additional information about how an influencer affected an influencee, beyond a simple wasInfluencedBy relation.

In the paper, we define influence factor as a quantitative measure of the influence that one PROV entity, agent, or activity has had over another, and introduce influenceFactor as a property to enrich any qualified influence in the PROV model.

To demonstrate the use of influenceFactor, we have extended the Wikipedia-provenance dataset and tooling from ProvBench to capture a quantitative measure of influence between the provenance elements involved. We also briefly discuss how we have used the proposed influence factor to support the development of a probabilistic approach to information quality (IQ) assessment using Bayesian networks.
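
A hypothetical sketch of what such an enriched influence record could look like (the dictionary encoding and the influence_factor key are illustrative, not the paper's syntax):

    # Two qualified influences on the same entity, each carrying a
    # numeric influence factor; the "ex:" identifiers are made up.
    influences = [
        {"type": "prov:Revision",          # one of PROV's qualified influences
         "influencee": "ex:article_v2",
         "influencer": "ex:article_v1",
         "influence_factor": 0.9},
        {"type": "prov:Quotation",
         "influencee": "ex:article_v2",
         "influencer": "ex:press_release",
         "influence_factor": 0.1},
    ]

    # Such weights make questions like "which input mattered most?" direct:
    dominant = max(influences, key=lambda i: i["influence_factor"])
    print(dominant["influencer"])  # ex:article_v1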

Model-based Abstraction of Data Provenance

Christian W. Probst, Technical University of Denmark; René Rydhof Hansen, Aalborg University

Identifying the provenance of data provides insights into the origin of data and intermediate results, and has recently gained increased interest due to data-centric applications. In this work we extend a data-centric system view with actors handling the data and policies restricting actions. This extension is based on provenance analysis performed on system models. System models have been introduced to model and analyse spatial and organisational aspects of organisations, to identify, e.g., potential insider threats. Both the models and the analyses are naturally modular; models can be combined into bigger models, and the analyses adapt accordingly. Our approach extends provenance with the origin of data, the actors and processes involved in handling it, and the policies applied while doing so. The model and corresponding analyses are based on a formal model of spatial and organisational aspects, and on static analyses of permissible actions in the models. While currently applied to organisational models, our approach can also be extended to workflows, thus targeting a more traditional model of provenance.

Noon–1:15 p.m. Thursday

Lunch

1:15 p.m.–2:45 p.m. Thursday

Session II: Capture

Approximated Provenance for Complex Applications

Eleanor Ainy, Tel Aviv University; Susan B. Davidson, University of Pennsylvania; Daniel Deutch and Tova Milo, Tel Aviv University

Many applications now involve collecting large amounts of data from multiple users and then aggregating and manipulating it in intricate ways. The complexity of such applications, combined with the size of the collected data, makes it difficult to understand how information was derived, and consequently difficult to assess its credibility, to optimize and debug its derivation, etc. Provenance has been helpful in achieving such goals in different contexts, and we illustrate its potential for novel complex applications such as those performing crowd-sourcing. Maintaining (and presenting) the full and exact provenance information may be infeasible for such applications, due to the size of the provenance and its complex structure. We propose some initial directions towards addressing this challenge, through the notion of approximated provenance.

RDataTracker: Collecting Provenance in an Interactive Scripting Environment

Barbara Lerner, Mount Holyoke College; Emery Boose, Harvard University

Provenance Capture Disparities Highlighted through Datasets

Blake Coe, The MITRE Corporation; R. Christopher Doty, Georgia Institute of Technology; M. David Allen and Adriane Chapman, The MITRE Corporation

Provenance information is inherently affected by the method of its capture. Different capture mechanisms create very different provenance graphs. In this work, we describe an academic use case that has corollaries in offices everywhere. We also describe two distinct possibilities for provenance capture methods within this domain. We generate three datasets using these two capture methods: each capture method run individually, plus a trace of what an omniscient capture agent would see. We describe how the different capture methods lead to such different graphs, and release the graphs for others to use via the ProvBench effort.

UP & DOWN: Improving Provenance Precision by Combining Workflow- and Trace-Level Information

Saumen Dey, University of California, Davis; Khalid Belhajjame, Université Paris-Dauphine; David Koop, New York University; Tianhong Song, University of California, Davis; Paolo Missier, Newcastle University; Bertram Ludäscher, University of California, Davis

Workflow-level provenance declarations can improve the precision of coarse provenance traces by reducing the number of “false” dependencies (not every output of a step depends on every input). Conversely, fine-grained execution provenance can be used to improve the precision of input-output dependencies of workflow actors. We present a new logic-based approach for improving provenance precision by combining downward and upward inference, i.e., from workflows to traces and vice versa.
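
A minimal sketch of the intuition (the set encoding is illustrative, not the paper's logic rules): downward inference keeps only those trace-level dependencies that the workflow-level declarations license:

    # Coarse trace: every output of an invocation depends on every input.
    trace_deps = {("out1", "in1"), ("out1", "in2"),
                  ("out2", "in1"), ("out2", "in2")}

    # Workflow-level declaration: which output ports actually depend on
    # which input ports of this actor.
    declared = {("out1", "in1"), ("out2", "in2")}

    # Keep only the trace dependencies the workflow licenses.
    refined = trace_deps & declared
    print(sorted(refined))  # [('out1', 'in1'), ('out2', 'in2')]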

2:45 p.m.–3:00 p.m. Thursday

Coffee Break

3:00 p.m.–4:15 p.m. Thursday

Session III: Theory

Immutably Answering Why-Not Questions for Equivalent Conjunctive Queries

Nicole Bidoit, Melanie Herschel, and Katerina Tzompanaki, Université Paris-Sud

Answering Why-Not questions consists of explaining to developers of complex data transformations or manipulations why their data transformation did not produce some specific results, although they expected it to. Different types of explanations that serve as Why-Not answers have been proposed in the past; they are based on the available data, the query tree, or both. Solutions (partially) based on the query tree are generally more efficient and easier for developers to interpret than solutions based solely on data. However, existing algorithms that produce such query-based explanations may return different results for reordered conjunctive query trees and, even worse, these results may be incomplete. Clearly, this represents a significant usability problem: the explanations developers get may be partial, and developers have to worry about the query tree representation of their query, losing the advantage of using a declarative query language. As a remedy, we propose the Ted algorithm, which produces the same complete query-based explanations for reordered conjunctive query trees.

Towards Constraint Provenance Games

Sean Riddle, Sven Köhler, and Bertram Ludäscher, University of California, Davis

Provenance for positive queries is well understood and elegantly handled by provenance semirings, which subsume many earlier approaches. However, the semiring approach does not extend easily to why-not provenance or, more generally, to first-order queries with negation. An alternative approach is to view query evaluation as a game between two players who argue whether, for a given database I and query Q, a tuple t is in the answer Q(I) or not. For first-order logic, the resulting provenance games yield a new provenance model that coincides with provenance semirings (how-provenance) on positive queries, but is also applicable to first-order queries with negation, thus providing an elegant, uniform treatment of earlier approaches, including why-not provenance and negation. In order to obtain a finite answer to a why-not question, provenance games employ an active-domain semantics and enumerate tuples that contribute to failed derivations, resulting in a domain-dependent formalism. In this paper, we propose constraint provenance games as a means to address this issue. The key idea is to represent infinite answers (e.g., to why-not questions) by finite constraints, i.e., equalities and disequalities.
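
As an illustration of the key idea (the representation below is hypothetical), an infinite why-not answer can be described by finitely many disequalities instead of enumerating active-domain tuples:

    # Q(x) :- R(x), not S(x).  Suppose the why-not answer is the infinite
    # set {x | x != 'a' and x != 'b'}; two disequalities represent it
    # finitely and exactly.
    answer = {"variable": "x",
              "constraints": [("!=", "x", "a"), ("!=", "x", "b")]}

    def satisfies(value, answer):
        """Check whether a concrete value is covered by the constraint answer."""
        return all(value != const for (_, _, const) in answer["constraints"])

    print(satisfies("a", answer), satisfies("z", answer))  # False True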

Regular Expressions for Provenance

Michael Luttenberger and Maximilian Schlund, Technische Universität München

As noted by Green et al., several provenance analyses can be considered special cases of the general problem of computing formal polynomials (resp. power series) as solutions of an algebraic system. Specific provenance is then obtained by evaluating the formal polynomial under a suitable homomorphism.

Recently, we presented the idea of approximating the least solution of such algebraic systems by unfolding the system into a sequence of simpler algebraic systems. Similar ideas are at the heart of the semi-naive evaluation algorithm for Datalog.

We apply these results to provenance problems: semi-naive evaluation can be seen as a particular implementation of fixed-point iteration, which can only be used to compute (finite) provenance polynomials. Other unfolding schemes, e.g., those based on Newton’s method, allow us to compute a regular expression that yields a finite representation of (possibly infinite) provenance power series in the case of commutative and idempotent semirings. For specific semirings (e.g., Why(X)) we can then, in a second step, transform these regular expressions (resp. power series) into polynomials that capture the provenance. Using techniques like subterm sharing, both the regular expressions and the polynomials can be succinctly represented.
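
For readers less familiar with the semiring view, a minimal worked example (after Green et al.; this is not the authors' unfolding construction): a provenance polynomial evaluated under a homomorphism into a concrete semiring:

    # The provenance polynomial 2*x*y + z, encoded as a sum of products.
    polynomial = [["x", "y"], ["x", "y"], ["z"]]

    def evaluate(poly, hom, add, mul, zero, one):
        """Evaluate a sum-of-products provenance polynomial under a
        homomorphism 'hom' from provenance tokens into a semiring."""
        total = zero
        for monomial in poly:
            prod = one
            for token in monomial:
                prod = mul(prod, hom(token))
            total = add(total, prod)
        return total

    # Counting semiring: how many derivations survive if token z is absent?
    counts = {"x": 1, "y": 1, "z": 0}
    print(evaluate(polynomial, counts.get,
                   lambda a, b: a + b, lambda a, b: a * b, 0, 1))  # 2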

Friday, June 13, 2014

9:00 a.m.–10:30 a.m. Friday

Session IV: Practice

Provenance-Only Integration

Ashish Gehani and Dawood Tariq, SRI International

As provenance records are collected from an increasingly diverse set of sources, the need to integrate them grows. The alternative approach of reconciling semantics scales when the records are queried infrequently. However, as the use of provenance grows, normalizing the diverse provenance via formal integration will yield better query performance. We describe two motivating cases for integrating provenance, provide an initial formal model for integration that is domain-agnostic, and identify a possible direction for optimizing the integration process itself.

A Generic Provenance Middleware for Queries, Updates, and Transactions

Bahareh Arab, Illinois Institute of Technology; Dieter Gawlick and Venkatesh Radhakrishnan, Oracle Corporation; Hao Guo and Boris Glavic, Illinois Institute of Technology

We present an architecture and prototype implementation for a generic provenance database middleware (GProM) that is based on the concept of query rewrites, which are applied to an algebraic graph representation of database operations. The system supports a wide range of provenance types and representations for queries, updates, transactions, and operations spanning multiple transactions. GProM supports several strategies for provenance generation, e.g., on-demand, rule-based, and “always on”. To the best of our knowledge, we are the first to present a solution for computing the provenance of concurrent database transactions. Our solution can retroactively trace transaction provenance as long as an audit log and time travel functionality are available (both are supported by most DBMS). Other noteworthy features of GProM include: extensibility through a declarative rewrite rule specification language, support for multiple database backends, and an optimizer for rewritten queries.
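
A toy illustration of the general idea of provenance computation by query rewriting (not GProM's actual rewrite rules; the function below is made up): the rewritten query carries an identifier of the contributing input row alongside each result:

    def rewrite_selection(table, predicate, relname):
        """Evaluate 'SELECT * FROM relname WHERE predicate', with each
        result row annotated with the provenance of its input row."""
        return [dict(row, prov=(relname, i))   # annotate with source tuple id
                for i, row in enumerate(table)
                if predicate(row)]

    r = [{"a": 1}, {"a": 2}, {"a": 3}]
    print(rewrite_selection(r, lambda row: row["a"] > 1, "r"))
    # [{'a': 2, 'prov': ('r', 1)}, {'a': 3, 'prov': ('r', 2)}]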

Start Smart and Finish Wise: The Kiel Marine Science Provenance-Aware Data Management Approach

Peer C. Brauer, Kiel University; Andreas Czerniak, GEOMAR Helmholtz Centre for Ocean Research Kiel; Wilhelm Hasselbring, Kiel University

While creating or processing scientific data, it is very important to capture and archive the corresponding provenance data. “Start smart and finish wise” is our approach for provenance-aware tooling, which helps data managers and scientists not only to manage their data, but also to capture their scientific data in the field, to record the provenance data, to store it for further analysis, and finally to publish the scientific data to the data centres. The tool chain consists of four major components: (1) the digital pen for capturing the (meta)data and the corresponding provenance information in the field, (2) the OCN database for data-acquisition workflows and the data repository, (3) the PubFlow framework for scientific data publication, and (4) CAPS for capturing provenance data in Java-based scientific software. During each processing step in “Start smart and finish wise”, the provenance data for the scientific data is captured and archived.

Report from the Coalface: Lessons Learnt Building a General-Purpose Always-On Provenance System

Nikilesh Balakrishnan, Thomas Bytheway, Ripduman Sohan, and Andy Hopper, University of Cambridge

Over the past year we have implemented OPUS, an always-on system for observed provenance capture in user space. In this paper we present some important lessons for anyone hoping to implement a general-purpose provenance system operating at user level. In particular, we highlight the problems and solutions associated with the explosion of interposition requirements attributable to function variants, the challenges in maintaining semantic equivalence with POSIX, and the importance of deactivating function interception in response to runtime errors. We also provide some insights on choosing the right database to manage provenance data.
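
A hypothetical Python analogue of the last lesson (OPUS itself interposes on library calls in user space; this sketch only mirrors the pattern): record provenance around a call, and permanently deactivate the interception if the recording fails at runtime:

    import functools

    def intercepted(record):
        """Wrap a function so each call is reported to 'record'; if the
        recording itself raises, switch interception off for good."""
        def decorator(fn):
            active = {"on": True}
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                if active["on"]:
                    try:
                        record(fn.__name__, args, kwargs)
                    except Exception:
                        active["on"] = False  # deactivate rather than break the app
                return fn(*args, **kwargs)    # the original call always proceeds
            return wrapper
        return decorator

    @intercepted(lambda name, args, kwargs: print("prov:", name, args))
    def copy_file(src, dst):
        pass

    copy_file("a.txt", "b.txt")  # emits a provenance record, then runs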

10:45 a.m.–11:15 a.m. Friday

Coffee Break

11:15 a.m.–12:15 p.m. Friday

TaPP Town Hall

12:15 p.m.–12:30 p.m. Friday

Closing Comments

12:30 p.m.–1:30 p.m. Friday

Lunch