Workshop Program

All sessions will be held in Constitution B unless otherwise noted.


June 14, 2012

9:00 a.m.–9:10 a.m. Thursday

Opening Remarks

Program Co-Chairs: Umut A. Acar, Max Planck Institute for Software Systems, and Todd J. Green, University of California, Davis, and LogicBlox

9:10 a.m.–10:10 a.m. Thursday

Invited Talk on Provenance and Security

Session Chair: Umut A. Acar, Max Planck Institute for Software Systems, and Todd J. Green, University of California, Davis, and LogicBlox

Speaker: Deepak Garg, Max Planck Institute for Software Systems

10:10 a.m.–10:30 a.m. Thursday

Break

Constitution Foyer

10:30 a.m.–Noon Thursday

Provenance and Security

Session Chair: Deepak Garg, Max Planck Institute for Software Systems

Tag-based Information Flow Analysis for Document Classification in Provenance

Jyothsna Rachapalli, Murat Kantarcioglu, and Bhavani Thuraisingham, The University of Texas at Dallas

A crucial aspect of certain applications, such as those in the intelligence and health-care domains, is to manage and protect sensitive information effectively and efficiently. In this paper, we propose a tagging mechanism to track the flow of sensitive or valuable information in a provenance graph and to automate the process of document classification. When provenance is initially recorded, the documents of a provenance graph are assumed to be annotated with tags representing their sensitivity or priority. We then propagate the tags appropriately to newly generated documents using additional inference rules defined in this paper. This approach enables users to conveniently query for sensitive or valuable information, which can then be efficiently managed or protected once identified.
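The propagation idea can be sketched as follows. This is an illustrative reading only, not the paper's actual inference rules; the tag vocabulary and the rule that a derived document inherits the highest sensitivity among its sources are invented for the example:

```python
# Hypothetical sketch of tag propagation over a provenance DAG: each
# derived document inherits the highest sensitivity tag among the
# documents it was derived from.
LEVELS = {"public": 0, "internal": 1, "sensitive": 2}

def propagate_tags(edges, initial_tags):
    """edges: dict mapping a document to the source documents it derives from.
    initial_tags: tags recorded when provenance was first captured."""
    tags = dict(initial_tags)

    def tag_of(doc):
        if doc in tags:
            return tags[doc]
        # A derived document is at least as sensitive as any of its sources.
        sources = edges.get(doc, [])
        tags[doc] = max((tag_of(s) for s in sources),
                        key=LEVELS.__getitem__, default="public")
        return tags[doc]

    for doc in edges:
        tag_of(doc)
    return tags
```

For instance, a report derived from one sensitive and one public input would be tagged sensitive under this rule.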


Toward Provenance-Based Security for Configuration Languages

Paul Anderson and James Cheney, University of Edinburgh

Large system installations are increasingly configured using high-level, mostly-declarative languages. Often, different users contribute data that is compiled centrally and distributed to individual systems. Although the systems themselves have been developed with reliability and availability in mind, the configuration compilation process can lead to unforeseen vulnerabilities because of the lack of access control on the different components combined to build the final configuration. Even if simple change-based access controls are applied to validate changes to the final version, changes can be lost or incorrectly attributed. Based on the growing literature on provenance for database queries and other models of computation, we identify securing configuration languages as a potential application area for provenance.


Provenance as a Security Control

Andrew Martin, John Lyle, and Cornelius Namilkuo, University of Oxford

Much has been written about security and provenance. Although both have their own large areas of concern, there is a very significant intersection, and one is often brought to bear upon the other in the study of the security of provenance. We discuss through a series of examples how provenance might be regarded as a security control in its own right. We argue that a risk-based approach to provenance is appropriate and is already being used informally. A case study illustrates the applicability of this line of reasoning.


Dependency Path Patterns as the Foundation of Access Control in Provenance-aware Systems

Dang Nguyen, Jaehong Park, and Ravi Sandhu, Institute for Cyber Security, University of Texas at San Antonio

A unique characteristic of provenance data is that it forms a directed acyclic graph (DAG) in accordance with the underlying causal dependencies between the entities (acting users, action processes, and data objects) involved in transactions. Data provenance raises at least two distinct security-related issues. One is how to control access to provenance data, which we call Provenance Access Control (PAC). The other is Provenance-based Access Control (PBAC), which focuses on how to utilize provenance data to control access to data objects. Both PAC and PBAC are built on a common foundation that requires security architects to define application-specific dependency path patterns of provenance data. Assigning application-specific semantics to these path patterns provides the foundation for effective security policy specification and administration. This paper elaborates on this common foundation of PAC and PBAC and identifies some of the differences in how it is applied in these two contexts.
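As a toy illustration of the idea (the pattern language, edge labels, and helper names here are invented for the example, not taken from the paper), a dependency path pattern can be checked by matching a regular expression against the sequence of edge labels along a causal path through the provenance DAG:

```python
import re

def paths_from(node, edges, labels=()):
    """Enumerate label sequences of maximal causal paths starting at node.
    edges: dict mapping a node to a list of (label, predecessor) edges."""
    succs = edges.get(node, [])
    if not succs:
        yield labels
    for label, pred in succs:
        yield from paths_from(pred, edges, labels + (label,))

def matches_pattern(node, edges, pattern):
    """True if some causal path from node matches the label pattern."""
    regex = re.compile(pattern)
    return any(regex.fullmatch(" ".join(p)) for p in paths_from(node, edges))
```

A policy could then grant access to an object only when, say, the pattern "wasGeneratedBy wasControlledBy" matches a path back to an authorized user.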

Noon–1:30 p.m. Thursday

FCW Luncheon

Back Bay CD

1:30 p.m.–2:30 p.m. Thursday

Practical Tools

Session Chair: Bertram Ludäscher, University of California, Davis

BioLite, a Lightweight Bioinformatics Framework with Automated Tracking of Diagnostics and Provenance

Mark Howison, Nicholas A. Sinnott-Armstrong, and Casey W. Dunn, Brown University

We present a new Python/C++ framework, BioLite, for implementing bioinformatics pipelines for Next-Generation Sequencing (NGS) data. BioLite tracks provenance of analyses, automates the collection and reporting of diagnostics (such as summary statistics and plots at intermediate stages), and profiles computational requirements. These diagnostics can be accessed across multiple stages of a pipeline, from other pipelines, and in HTML reports. Finally, we describe several use cases for diagnostics in our own analyses.


A General-Purpose Provenance Library

Peter Macko and Margo Seltzer, Harvard University

Most provenance capture takes place inside particular tools – a workflow engine, a database, an operating system, or an application. However, most users have an existing toolset – a collection of different tools that work well for their needs and with which they are comfortable. Currently, such users have limited ability to collect provenance without disrupting their work and changing environments, which most users are hesitant to do. Even users who are willing to adopt new tools may realize limited benefit from provenance in those tools if they do not integrate with their entire environment, which may include multiple languages and frameworks.

We present the Core Provenance Library (CPL), a portable, multi-lingual library that application programmers can easily incorporate into a variety of tools to collect and integrate provenance. Although the manual instrumentation adds extra work for application programmers, we show that in most cases, the work is minimal, and the resulting system solves several problems that plague more constrained provenance collection systems.


BURRITO: Wrapping Your Lab Notebook in Computational Infrastructure

Philip J. Guo, Stanford University; Margo Seltzer, Harvard University

Researchers in fields such as bioinformatics, CS, finance, and applied math have trouble managing the numerous code and data files generated by their computational experiments, comparing the results of trials executed with different parameters, and keeping up-to-date notes on what they learned from past successes and failures.

We created a Linux-based system called BURRITO that automates aspects of this tedious experiment organization and notetaking process, thus freeing researchers to focus on more substantive work. BURRITO automatically captures a researcher’s computational activities and provides user interfaces to annotate the captured provenance with notes and then make queries such as, “Which script versions and command-line parameters generated the output graph that this note refers to?”

2:30 p.m.–3:00 p.m. Thursday

Break

Constitution Foyer

3:00 p.m.–3:40 p.m. Thursday

Provenance and Ranking

Session Chair: Philip Guo, Stanford University

It’s About the Data: Provenance as a Tool for Assessing Data Fitness

Adriane Chapman, M. David Allen, and Barbara Blaustein, The MITRE Corporation

The end goal of provenance is to assist users in understanding their data: How was it created? When? By whom? How was it manipulated? In other words, provenance is a powerful tool to help users answer the question, “Is this data fit for use?” However, there is no one set of criteria that make data “fit for use”. The criteria depend on the user, the task at hand, and the current situation. In this work we describe Fitness Widgets, predefined queries over provenance graphs that users can customize to determine data fitness. We have implemented Fitness Widgets in our provenance system, PLUS.


Querying Provenance for Ranking and Recommending

Zachary G. Ives, Andreas Haeberlen, and Tao Feng, University of Pennsylvania; Wolfgang Gatterbauer, Carnegie Mellon University

As has been frequently observed in the literature, there is a strong connection between a derived data item’s provenance and its authoritativeness, utility, relevance, or probability. A standard way of obtaining a score for a derived tuple is by first assigning scores to the “base” tuples from which it is derived — then using the semantics of the query and the score measure to derive a value for the tuple. This “provenance-enabled” scoring has led to a variety of scenarios where tuples’ intrinsic value is based on their provenance, independent of whatever other tuples exist in the data set.

However, there is another class of applications, revolving around sharing and recommendation, in which our goal may be to rank tuples by their “importance” or the structure of their connectivity within the provenance graph. We argue that the most natural approach is to exploit the structure of a provenance graph to rank and recommend “interesting” or “relevant” items to users, based on global and/or local provenance graph structure and random walk-based algorithms. We further argue that it is desirable to have a high-level declarative language to extract portions of the provenance graph and then apply the random walk computations. We extend the ProQL provenance query language to support a wide array of random walk algorithms in a high-level way, and identify opportunities for query optimization.
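As a rough sketch of the kind of computation involved (this is a generic PageRank-style walk, not the ProQL extension described in the paper), ranking nodes of a provenance graph by connectivity structure looks like:

```python
# Minimal PageRank-style random walk over a provenance graph: a node is
# ranked highly when many well-ranked nodes link to it (e.g. a derived
# tuple pointing to the base tuples in its provenance).
def rank(edges, damping=0.85, iters=50):
    """edges: dict mapping a node to the list of nodes it links to."""
    nodes = set(edges) | {v for vs in edges.values() for v in vs}
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in edges.items():
            if outs:
                share = damping * score[n] / len(outs)
                for v in outs:
                    new[v] += share
        # Nodes with no outgoing links redistribute their mass uniformly.
        dangling = sum(score[n] for n in nodes if not edges.get(n))
        for n in nodes:
            new[n] += damping * dangling / len(nodes)
        score = new
    return score
```

A base tuple that many derived results depend on would accumulate a high score, marking it as "important" in the sense discussed above.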

3:40 p.m.–5:00 p.m. Thursday

June 15, 2012

9:00 a.m.–10:00 a.m. Friday

Invited Talk: Provenance and Higher-Order Software Contracts

Session Chair: Umut A. Acar, Max Planck Institute for Software Systems

Speaker: Christos Dimoulas, Northeastern University

Provenance information plays a critical role in judging whether the semantics of software contracts is correct. Higher-order software contracts dynamically check whether objects and functions meet the interface specifications of a component. When an object or a function fails to live up to a specification, the contract system must pinpoint the guilty party. Equipped with this blame information, a software engineer can narrow down the search for the violation—if it is correct. Provenance offers a way, based on the origin and history of the values that contracts check, to reason about the correctness of blame assignment and the effectiveness of contracts. Thus provenance provides the key element for evaluating the semantics of software contracts.

In this talk, I will introduce software contracts and the problems of checking higher-order contracts. I will present two distinct attempts to assign a semantics to contract checking in this world, and I will then demonstrate the shortcomings of both. These failures motivate the search for a formal compass for contract system designers. With semantic provenance information, we found such a compass that helped us explain why the proposed semantics failed to be useful and that guided the design of a new semantics, which is now implemented in the Racket contract system.

10:00 a.m.–10:30 a.m. Friday

Break

Constitution Foyer

10:30 a.m.–11:15 a.m. Friday

Provenance Models

Session Chair: Todd J. Green, University of California, Davis, and LogicBlox

Hierarchical Models of Provenance

Peter Buneman, James Cheney, and Egor V. Kostylev, University of Edinburgh

There is general agreement that we need to understand provenance at various levels of granularity; however, there appears, as yet, to be no general agreement on what granularity means. It can refer either to the detail with which we view a process or to the detail with which we view the data. We describe a simple and straightforward method for imposing a hierarchical structure on a provenance graph and show how it can, if we want, be derived from the program whose execution created that graph.


Provenance Management in Databases Under Schema Evolution

Shi Gao and Carlo Zaniolo, University of California, Los Angeles

Since changes caused by database updates combine with the internal changes caused by database schema evolution, an integrated provenance management for data and metadata represents a key requirement for modern information systems. In this paper, we introduce the Archived Metadata and Provenance Manager (AM&PM) system which addresses this requirement by (i) extending the Information Schema with the capability of representing the provenance of the schema and other metadata, (ii) providing a simple time-stamp based representation of the provenance of the actual data, and (iii) supporting powerful queries on the provenance of the data and the history of the metadata.

11:15 a.m.–Noon Friday

Querying Provenance

Session Chair: Todd J. Green, University of California, Davis, and LogicBlox

Experiment Explorer: Lightweight Provenance Search over Metadata

Delmar B. Davis and Hazeline U. Asuncion, University of Washington, Bothell; Ghaleb Abdulla, Lawrence Livermore National Laboratory

Scientific experiments typically produce a plethora of files in the form of intermediate data or experimental results. As the project grows in scale, there is an increased need for tools and techniques that link together relevant experimental artifacts, especially if the files are heterogeneous and distributed across multiple locations. Current provenance and search techniques, however, fall short in efficiently retrieving experiment-related files, presumably because they are not tailored towards the common use cases of researchers. In this position paper, we propose Experiment Explorer, a lightweight and efficient approach that takes advantage of metadata to retrieve and visualize relevant experiment-related files.


Datalog as a Lingua Franca for Provenance Querying and Reasoning

Saumen Dey and Sven Köhler, UC Davis; Shawn Bowers, Gonzaga University; Bertram Ludäscher, UC Davis

Provenance, i.e., the lineage and processing history of data, has become increasingly important within scientific workflow systems. Provenance information can be used, e.g., to explain, debug, and reproduce the results of computational experiments as well as to determine the validity and quality of data products. Standard models for representing provenance information (such as OPM) largely focus on providing a minimal, common set of observables and constraints (in terms of causal and temporal relationships). For scientific workflow applications, however, the workflow itself and the corresponding (implicit) constraints on provenance relationships are often essential for interpreting and querying provenance information. In this paper, we propose Datalog as a “lingua franca” for representing, querying, and specifying integrity constraints over provenance information, and introduce a unifying provenance model for specifying workflows, traces, and temporal constraints. We also demonstrate advantages of using Datalog together with the unified model through a number of examples.
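To illustrate the flavor of the approach (a minimal sketch, not the paper's unified model), consider the classic Datalog rules for transitive derivation, derived(X, Y) :- dep(X, Y) and derived(X, Z) :- dep(X, Y), derived(Y, Z), evaluated bottom-up to a fixpoint:

```python
# Naive bottom-up evaluation of a recursive Datalog query over
# provenance facts: dep(x, y) means x directly depends on y, and we
# compute every transitive derivation derived(x, z).
def transitive_derivation(dep):
    """dep: set of (x, y) facts; returns the set of derived(x, z) facts."""
    derived = set(dep)                     # derived(X, Y) :- dep(X, Y).
    while True:
        # derived(X, Z) :- dep(X, Y), derived(Y, Z).
        new = {(x, z) for (x, y) in dep
                      for (y2, z) in derived if y == y2}
        if new <= derived:                 # fixpoint reached
            return derived
        derived |= new
```

Such a query answers lineage questions like "which raw inputs did this result ultimately depend on?" directly over recorded provenance facts.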

Noon–1:30 p.m. Friday

FCW Luncheon

Back Bay CD

1:30 p.m.–1:50 p.m. Friday

Provenance and Software Engineering

Session Chair: James Cheney, University of Edinburgh

Provenance Support for Rework

Xiang Zhao, University of Massachusetts Amherst; Barbara Staudt Lerner, Mount Holyoke College; Leon J. Osterweil, University of Massachusetts Amherst; Emery R. Boose and Aaron M. Ellison, Harvard University

Rework occurs commonly in software development. This paper describes a simple rework example, namely the code refactoring process. We show that contextual information is central to supporting such rework, and we present an artifact provenance support approach that can help developers keep track of previous decisions to improve their effectiveness in rework.

1:50 p.m.–2:30 p.m. Friday

Provenance Instrumentation

Session Chair: James Cheney, University of Edinburgh

Toward Provenance Capturing as Cross-Cutting Concern

Martin Schäler, Sandro Schulze, and Gunter Saake, University of Magdeburg, Germany

Although provenance has gained much attention, existing solutions for capturing it do not meet all requirements. For instance, most solutions currently assume a closed world and are explicitly designed to capture provenance. Thus, they fail to integrate the provenance concern into existing environments. Hence, we argue that provenance should be considered a cross-cutting concern that can easily be integrated into existing systems, with the aim of establishing a universe of provenance. In this paper, we propose a solution concept, introduce different types of provenance systems and adequate software engineering techniques, and report our experiences with a first prototype.


Towards Automated Collection of Application-Level Data Provenance

Dawood Tariq, Maisem Ali, and Ashish Gehani, SRI International

Gathering data provenance at the operating system level is useful for capturing system-wide activity. However, many modern programs are complex and can perform numerous tasks concurrently. Capturing their provenance at this level, where processes are treated as single entities, may lead to the loss of useful intra-process detail. This can, in turn, produce false dependencies in the provenance graph. Using the LLVM compiler framework and SPADE provenance infrastructure, we investigate adding provenance instrumentation to allow intra-process provenance to be captured automatically. This results in a more accurate representation of the provenance relationships and eliminates some false dependencies. Since the capture of fine-grained provenance incurs increased overhead for storage and querying, we minimize the records retained by allowing users to declare aspects of interest and then automatically infer which provenance records are unnecessary and can be discarded.

2:30 p.m.–3:00 p.m. Friday

Break

Constitution Foyer

3:00 p.m.–4:00 p.m. Friday