Workshop Program

All sessions will be held in Constitution B unless otherwise noted.


June 14, 2012

9:00 a.m.–9:10 a.m. Thursday

Opening Remarks

Program Co-Chairs: Umut A. Acar, Max Planck Institute for Software Systems, and Todd J. Green, University of California, Davis, and LogicBlox

9:10 a.m.–10:10 a.m. Thursday

Invited Talk on Provenance and Security

Session Chair: Umut A. Acar, Max Planck Institute for Software Systems, and Todd J. Green, University of California, Davis, and LogicBlox

Speaker: Deepak Garg, Max Planck Institute for Software Systems

10:10 a.m.–10:30 a.m. Thursday

Break

Constitution Foyer

10:30 a.m.–Noon Thursday

Provenance and Security

Session Chair: Deepak Garg, Max Planck Institute for Software Systems

Tag-based Information Flow Analysis for Document Classification in Provenance

Jyothsna Rachapalli, Murat Kantarcioglu, and Bhavani Thuraisingham, The University of Texas at Dallas

A crucial aspect of certain applications, such as those in the intelligence and health-care domains, is to manage and protect sensitive information effectively and efficiently. In this paper, we propose a tagging mechanism to track the flow of sensitive or valuable information in a provenance graph and to automate the process of document classification. When provenance is initially recorded, the documents of a provenance graph are assumed to be annotated with tags representing their sensitivity or priority. We then propagate the tags appropriately to newly generated documents using additional inference rules defined in this paper. This approach enables users to conveniently query for sensitive or valuable information, which can then be efficiently managed or protected once identified.
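The propagation idea can be sketched as follows. This is an illustrative reading only, not the paper's actual inference rules; the tag vocabulary and the rule that a derived document inherits the highest sensitivity among its sources are invented for the example:

```python
# Hypothetical sketch of tag propagation over a provenance DAG: each
# derived document inherits the highest sensitivity tag among the
# documents it was derived from.
LEVELS = {"public": 0, "internal": 1, "sensitive": 2}

def propagate_tags(edges, initial_tags):
    """edges: dict mapping a document to the source documents it derives from.
    initial_tags: tags recorded when provenance was first captured."""
    tags = dict(initial_tags)

    def tag_of(doc):
        if doc in tags:
            return tags[doc]
        # A derived document is at least as sensitive as any of its sources.
        sources = edges.get(doc, [])
        tags[doc] = max((tag_of(s) for s in sources),
                        key=LEVELS.__getitem__, default="public")
        return tags[doc]

    for doc in edges:
        tag_of(doc)
    return tags
```

For instance, a report derived from one sensitive and one public input would be tagged sensitive under this rule.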


Toward Provenance-Based Security for Configuration Languages

Paul Anderson and James Cheney, University of Edinburgh

Large system installations are increasingly configured using high-level, mostly-declarative languages. Often, different users contribute data that is compiled centrally and distributed to individual systems. Although the systems themselves have been developed with reliability and availability in mind, the configuration compilation process can lead to unforeseen vulnerabilities because of the lack of access control on the different components combined to build the final configuration. Even if simple change-based access controls are applied to validate changes to the final version, changes can be lost or incorrectly attributed. Based on the growing literature on provenance for database queries and other models of computation, we identify securing configuration languages as a potential application area for provenance.


Provenance as a Security Control

Andrew Martin, John Lyle, and Cornelius Namilkuo, University of Oxford

Much has been written about security and provenance. Although both have their own large areas of concern, there is a very significant intersection, and one is often brought to bear upon the other in the study of the security of provenance. We discuss through a series of examples how provenance might be regarded as a security control in its own right. We argue that a risk-based approach to provenance is appropriate and is already being used informally. A case study illustrates the applicability of this line of reasoning.


Dependency Path Patterns as the Foundation of Access Control in Provenance-aware Systems

Dang Nguyen, Jaehong Park, and Ravi Sandhu, Institute for Cyber Security, University of Texas at San Antonio

A unique characteristic of provenance data is that it forms a directed acyclic graph (DAG) in accordance with the underlying causal dependencies between the entities (acting users, action processes, and data objects) involved in transactions. Data provenance raises at least two distinct security-related issues. One is how to control access to provenance data, which we call Provenance Access Control (PAC). The other is Provenance-based Access Control (PBAC), which focuses on how to utilize provenance data to control access to data objects. Both PAC and PBAC are built on a common foundation that requires security architects to define application-specific dependency path patterns of provenance data. Assigning application-specific semantics to these path patterns provides the foundation for effective security policy specification and administration. This paper elaborates on this common foundation of PAC and PBAC and identifies some of the differences in how it is applied in these two contexts.
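As a toy illustration of the idea (the pattern language, edge labels, and helper names here are invented for the example, not taken from the paper), a dependency path pattern can be checked by matching a regular expression against the sequence of edge labels along a causal path through the provenance DAG:

```python
import re

def paths_from(node, edges, labels=()):
    """Enumerate label sequences of maximal causal paths starting at node.
    edges: dict mapping a node to a list of (label, predecessor) edges."""
    succs = edges.get(node, [])
    if not succs:
        yield labels
    for label, pred in succs:
        yield from paths_from(pred, edges, labels + (label,))

def matches_pattern(node, edges, pattern):
    """True if some causal path from node matches the label pattern."""
    regex = re.compile(pattern)
    return any(regex.fullmatch(" ".join(p)) for p in paths_from(node, edges))
```

A policy could then grant access to an object only when, say, the pattern "wasGeneratedBy wasControlledBy" matches a path back to an authorized user.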

Noon–1:30 p.m. Thursday

FCW Luncheon

Back Bay CD

1:30 p.m.–2:30 p.m. Thursday

Practical Tools

Session Chair: Bertram Ludäscher, University of California, Davis

BioLite, a Lightweight Bioinformatics Framework with Automated Tracking of Diagnostics and Provenance

Mark Howison, Nicholas A. Sinnott-Armstrong, and Casey W. Dunn, Brown University

We present a new Python/C++ framework, BioLite, for implementing bioinformatics pipelines for Next-Generation Sequencing (NGS) data. BioLite tracks provenance of analyses, automates the collection and reporting of diagnostics (such as summary statistics and plots at intermediate stages), and profiles computational requirements. These diagnostics can be accessed across multiple stages of a pipeline, from other pipelines, and in HTML reports. Finally, we describe several use cases for diagnostics in our own analyses.


A General-Purpose Provenance Library

Peter Macko and Margo Seltzer, Harvard University

Most provenance capture takes place inside particular tools – a workflow engine, a database, an operating system, or an application. However, most users have an existing toolset – a collection of different tools that work well for their needs and with which they are comfortable. Currently, such users have limited ability to collect provenance without disrupting their work and changing environments, which most users are hesitant to do. Even users who are willing to adopt new tools may realize limited benefit from provenance in those tools if they do not integrate with their entire environment, which may include multiple languages and frameworks.

We present the Core Provenance Library (CPL), a portable, multi-lingual library that application programmers can easily incorporate into a variety of tools to collect and integrate provenance. Although the manual instrumentation adds extra work for application programmers, we show that in most cases, the work is minimal, and the resulting system solves several problems that plague more constrained provenance collection systems.


BURRITO: Wrapping Your Lab Notebook in Computational Infrastructure

Philip J. Guo, Stanford University; Margo Seltzer, Harvard University

Researchers in fields such as bioinformatics, CS, finance, and applied math have trouble managing the numerous code and data files generated by their computational experiments, comparing the results of trials executed with different parameters, and keeping up-to-date notes on what they learned from past successes and failures.

We created a Linux-based system called BURRITO that automates aspects of this tedious experiment organization and notetaking process, thus freeing researchers to focus on more substantive work. BURRITO automatically captures a researcher’s computational activities and provides user interfaces to annotate the captured provenance with notes and then make queries such as, “Which script versions and command-line parameters generated the output graph that this note refers to?”

2:30 p.m.–3:00 p.m. Thursday

Break

Constitution Foyer

3:00 p.m.–3:40 p.m. Thursday

Provenance and Ranking

Session Chair: Philip Guo, Stanford University

It’s About the Data: Provenance as a Tool for Assessing Data Fitness

Adriane Chapman, M. David Allen, and Barbara Blaustein, The MITRE Corporation

The end goal of provenance is to assist users in understanding their data: How was it created? When? By whom? How was it manipulated? In other words, provenance is a powerful tool to help users answer the question, “Is this data fit for use?” However, there is no one set of criteria that make data “fit for use”. The criteria depend on the user, the task at hand, and the current situation. In this work we describe Fitness Widgets, predefined queries over provenance graphs that users can customize to determine data fitness. We have implemented Fitness Widgets in our provenance system, PLUS.


Querying Provenance for Ranking and Recommending

Zachary G. Ives, Andreas Haeberlen, and Tao Feng, University of Pennsylvania; Wolfgang Gatterbauer, Carnegie Mellon University

As has been frequently observed in the literature, there is a strong connection between a derived data item’s provenance and its authoritativeness, utility, relevance, or probability. A standard way of obtaining a score for a derived tuple is by first assigning scores to the “base” tuples from which it is derived — then using the semantics of the query and the score measure to derive a value for the tuple. This “provenance-enabled” scoring has led to a variety of scenarios where tuples’ intrinsic value is based on their provenance, independent of whatever other tuples exist in the data set.

However, there is another class of applications, revolving around sharing and recommendation, in which our goal may be to rank tuples by their “importance” or the structure of their connectivity within the provenance graph. We argue that the most natural approach is to exploit the structure of a provenance graph to rank and recommend “interesting” or “relevant” items to users, based on global and/or local provenance graph structure and random walk-based algorithms. We further argue that it is desirable to have a high-level declarative language to extract portions of the provenance graph and then apply the random walk computations. We extend the ProQL provenance query language to support a wide array of random walk algorithms in a high-level way, and identify opportunities for query optimization.
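As a rough sketch of the kind of computation involved (this is a generic PageRank-style walk, not the ProQL extension described in the paper), ranking nodes of a provenance graph by connectivity structure looks like:

```python
# Minimal PageRank-style random walk over a provenance graph: a node is
# ranked highly when many well-ranked nodes link to it (e.g. a derived
# tuple pointing to the base tuples in its provenance).
def rank(edges, damping=0.85, iters=50):
    """edges: dict mapping a node to the list of nodes it links to."""
    nodes = set(edges) | {v for vs in edges.values() for v in vs}
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in edges.items():
            if outs:
                share = damping * score[n] / len(outs)
                for v in outs:
                    new[v] += share
        # Nodes with no outgoing links redistribute their mass uniformly.
        dangling = sum(score[n] for n in nodes if not edges.get(n))
        for n in nodes:
            new[n] += damping * dangling / len(nodes)
        score = new
    return score
```

A base tuple that many derived results depend on would accumulate a high score, marking it as "important" in the sense discussed above.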

3:40 p.m.–5:00 p.m. Thursday

June 15, 2012

9:00 a.m.–10:00 a.m. Friday

Invited Talk: Provenance and Higher-Order Software Contracts

Session Chair: Umut A. Acar, Max Planck Institute for Software Systems

Speaker: Christos Dimoulas, Northeastern University

Provenance information plays a critical role in judging whether the semantics of software contracts is correct. Higher-order software contracts dynamically check whether objects and functions meet the interface specifications of a component. When an object or a function fails to live up to a specification, the contract system must pinpoint the guilty party. Equipped with this blame information, a software engineer can narrow down the search for the violation—if it is correct. Provenance offers a way, based on the origin and history of the values that contracts check, to reason about the correctness of blame assignment and the effectiveness of contracts. Thus provenance provides the key element for evaluating the semantics of software contracts.

In this talk, I will introduce software contracts and the problems of checking higher-order contracts. I will present two distinct attempts to assign a semantics to contract checking in this world, and I will then demonstrate the shortcomings of both. These failures motivate the search for a formal compass for contract system designers. With semantic provenance information, we found such a compass that helped us explain why the proposed semantics failed to be useful and that guided the design of a new semantics, which is now implemented in the Racket contract system.

10:00 a.m.–10:30 a.m. Friday

Break

Constitution Foyer

10:30 a.m.–11:15 a.m. Friday

Provenance Models

Session Chair: Todd J. Green, University of California, Davis, and LogicBlox

Hierarchical Models of Provenance

Peter Buneman, James Cheney, and Egor V. Kostylev, University of Edinburgh

There is general agreement that we need to understand provenance at various levels of granularity; however, there appears, as yet, to be no general agreement on what granularity means. It can refer either to the detail with which we view a process or to the detail with which we view the data. We describe a simple and straightforward method for imposing a hierarchical structure on a provenance graph and show how it can, if we want, be derived from the program whose execution created that graph.


Provenance Management in Databases Under Schema Evolution

Shi Gao and Carlo Zaniolo, University of California, Los Angeles

Since changes caused by database updates combine with the internal changes caused by database schema evolution, an integrated provenance management for data and metadata represents a key requirement for modern information systems. In this paper, we introduce the Archived Metadata and Provenance Manager (AM&PM) system which addresses this requirement by (i) extending the Information Schema with the capability of representing the provenance of the schema and other metadata, (ii) providing a simple time-stamp based representation of the provenance of the actual data, and (iii) supporting powerful queries on the provenance of the data and the history of the metadata.

11:15 a.m.–Noon Friday

Querying Provenance

Session Chair: Todd J. Green, University of California, Davis, and LogicBlox

Experiment Explorer: Lightweight Provenance Search over Metadata

Delmar B. Davis and Hazeline U. Asuncion, University of Washington, Bothell; Ghaleb Abdulla, Lawrence Livermore National Laboratory

Scientific experiments typically produce a plethora of files in the form of intermediate data or experimental results. As the project grows in scale, there is an increased need for tools and techniques that link together relevant experimental artifacts, especially if the files are heterogeneous and distributed across multiple locations. Current provenance and search techniques, however, fall short in efficiently retrieving experiment-related files, presumably because they are not tailored towards the common use cases of researchers. In this position paper, we propose Experiment Explorer, a lightweight and efficient approach that takes advantage of metadata to retrieve and visualize relevant experiment-related files.


Datalog as a Lingua Franca for Provenance Querying and Reasoning

Saumen Dey and Sven Köhler, UC Davis; Shawn Bowers, Gonzaga University; Bertram Ludäscher, UC Davis

Provenance, i.e., the lineage and processing history of data, has become increasingly important within scientific workflow systems. Provenance information can be used, e.g., to explain, debug, and reproduce the results of computational experiments as well as to determine the validity and quality of data products. Standard models for representing provenance information (such as OPM) largely focus on providing a minimal, common set of observables and constraints (in terms of causal and temporal relationships). For scientific workflow applications, however, the workflow itself and the corresponding (implicit) constraints on provenance relationships are often essential for interpreting and querying provenance information. In this paper, we propose Datalog as a “lingua franca” for representing, querying, and specifying integrity constraints over provenance information, and introduce a unifying provenance model for specifying workflows, traces, and temporal constraints. We also demonstrate advantages of using Datalog together with the unified model through a number of examples.
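To illustrate the flavor of the approach (a minimal sketch, not the paper's unified model), consider the classic Datalog rules for transitive derivation, derived(X, Y) :- dep(X, Y) and derived(X, Z) :- dep(X, Y), derived(Y, Z), evaluated bottom-up to a fixpoint:

```python
# Naive bottom-up evaluation of a recursive Datalog query over
# provenance facts: dep(x, y) means x directly depends on y, and we
# compute every transitive derivation derived(x, z).
def transitive_derivation(dep):
    """dep: set of (x, y) facts; returns the set of derived(x, z) facts."""
    derived = set(dep)                     # derived(X, Y) :- dep(X, Y).
    while True:
        # derived(X, Z) :- dep(X, Y), derived(Y, Z).
        new = {(x, z) for (x, y) in dep
                      for (y2, z) in derived if y == y2}
        if new <= derived:                 # fixpoint reached
            return derived
        derived |= new
```

Such a query answers lineage questions like "which raw inputs did this result ultimately depend on?" directly over recorded provenance facts.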

Noon–1:30 p.m. Friday

FCW Luncheon

Back Bay CD

1:30 p.m.–1:50 p.m. Friday

Provenance and Software Engineering

Session Chair: James Cheney, University of Edinburgh

Provenance Support for Rework

Xiang Zhao, University of Massachusetts Amherst; Barbara Staudt Lerner, Mount Holyoke College; Leon J. Osterweil, University of Massachusetts Amherst; Emery R. Boose and Aaron M. Ellison, Harvard University

Rework occurs commonly in software development. This paper describes a simple rework example, namely the code refactoring process. We show that contextual information is central to supporting such rework, and we present an artifact provenance support approach that can help developers keep track of previous decisions to improve their effectiveness in rework.

1:50 p.m.–2:30 p.m. Friday

Provenance Instrumentation

Session Chair: James Cheney, University of Edinburgh

Toward Provenance Capturing as Cross-Cutting Concern

Martin Schäler, Sandro Schulze, and Gunter Saake, University of Magdeburg, Germany

Although provenance has gained much attention, existing solutions for capturing it do not meet all requirements. For instance, most solutions currently assume a closed world and are explicitly designed to capture provenance. Thus, they fail to integrate the provenance concern into existing environments. Hence, we argue that provenance should be considered a cross-cutting concern that can easily be integrated into existing systems, with the aim of establishing a universe of provenance. In this paper, we propose a solution concept, introduce different types of provenance systems and adequate software engineering techniques, and report our experiences with a first prototype.


Towards Automated Collection of Application-Level Data Provenance

Dawood Tariq, Maisem Ali, and Ashish Gehani, SRI International

Gathering data provenance at the operating system level is useful for capturing system-wide activity. However, many modern programs are complex and can perform numerous tasks concurrently. Capturing their provenance at this level, where processes are treated as single entities, may lead to the loss of useful intra-process detail. This can, in turn, produce false dependencies in the provenance graph. Using the LLVM compiler framework and SPADE provenance infrastructure, we investigate adding provenance instrumentation to allow intra-process provenance to be captured automatically. This results in a more accurate representation of the provenance relationships and eliminates some false dependencies. Since the capture of fine-grained provenance incurs increased overhead for storage and querying, we minimize the records retained by allowing users to declare aspects of interest and then automatically infer which provenance records are unnecessary and can be discarded.

2:30 p.m.–3:00 p.m. Friday

Break

Constitution Foyer

3:00 p.m.–4:00 p.m. Friday