9:00 a.m.–9:10 a.m. |
Thursday |
Program Co-Chairs: Umut A. Acar, Max Planck Institute for Software Systems, and Todd J. Green, University of California, Davis, and LogicBlox
|
9:10 a.m.–10:10 a.m. |
Thursday |
Session Chair: Umut A. Acar, Max Planck Institute for Software Systems, and Todd J. Green, University of California, Davis, and LogicBlox
Speaker: Deepak Garg, Max Planck Institute for Software Systems
|
10:10 a.m.–10:30 a.m. |
Thursday |
|
10:30 a.m.–Noon |
Thursday |
Session Chair: Deepak Garg, Max Planck Institute for Software Systems
Jyothsna Rachapalli, Murat Kantarcioglu, and Bhavani Thuraisingham, The University of Texas at Dallas
A crucial aspect of certain applications, such as those in the intelligence or health-care domains, is managing and protecting sensitive information effectively and efficiently. In this paper, we propose a tagging mechanism to track the flow of sensitive or valuable information through a provenance graph and to automate the process of document classification. When provenance is initially recorded, the documents of a provenance graph are assumed to be annotated with tags representing their sensitivity or priority. We then propagate the tags appropriately to newly generated documents using additional inference rules defined in this paper. This approach enables users to conveniently query for sensitive or valuable information, which, once identified, can be efficiently managed or protected.
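As a rough illustration of the tagging idea, the Python sketch below propagates sensitivity tags through a toy provenance graph. The tag levels, the graph encoding, and the "highest tag wins" rule are assumptions made for this example, not the inference rules defined in the paper:

```python
# Hypothetical sketch of sensitivity-tag propagation over a provenance DAG.
from enum import IntEnum

class Tag(IntEnum):
    PUBLIC = 0
    SENSITIVE = 1
    SECRET = 2

# Provenance graph: maps each derived document to the documents it came from.
derived_from = {
    "report.pdf": ["raw_intel.csv", "public_stats.csv"],
    "summary.txt": ["report.pdf"],
}

# Initial tags on source documents, annotated when provenance is first recorded.
tags = {"raw_intel.csv": Tag.SECRET, "public_stats.csv": Tag.PUBLIC}

def propagate(doc):
    """Return doc's tag, deriving it from its sources if not yet annotated."""
    if doc not in tags:
        # One simple inference rule: a document is at least as sensitive
        # as the most sensitive document it was derived from.
        tags[doc] = max(propagate(src) for src in derived_from[doc])
    return tags[doc]

print(propagate("summary.txt"))  # Tag.SECRET
```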
Paul Anderson and James Cheney, University of Edinburgh
Large system installations are increasingly configured using high-level, mostly declarative languages. Often, different users contribute data that is compiled centrally and distributed to individual systems. Although the systems themselves have been developed with reliability and availability in mind, the configuration compilation process can lead to unforeseen vulnerabilities because of the lack of access control on the different components combined to build the final configuration. Even if simple change-based access controls are applied to validate changes to the final version, changes can be lost or incorrectly attributed. Based on the growing literature on provenance for database queries and other models of computation, we identify securing configuration languages as a potential application area for provenance.
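To make the attribution problem concrete, here is a minimal Python sketch (not the authors' system) of a configuration compiler that records which user contributed each final value and rejects changes outside a user's declared authority; all names and the access rule are illustrative assumptions:

```python
# Toy "configuration compiler" that keeps per-key provenance.
def compile_config(fragments, allowed_keys):
    """fragments: list of (user, {key: value}); allowed_keys: {user: set of keys}."""
    final, provenance = {}, {}
    for user, settings in fragments:
        for key, value in settings.items():
            if key not in allowed_keys.get(user, set()):
                raise PermissionError(f"{user} may not set {key}")
            final[key] = value
            provenance[key] = user  # who last wrote this key
    return final, provenance

config, prov = compile_config(
    [("alice", {"ntp_server": "ntp1.example.org"}),
     ("bob", {"dns_server": "10.0.0.53"})],
    {"alice": {"ntp_server"}, "bob": {"dns_server"}},
)
print(prov["dns_server"])  # bob
```

With the provenance map in hand, a change to a distributed system's final configuration can be traced back to the contributing user rather than lost or misattributed.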
Andrew Martin, John Lyle, and Cornelius Namiluko, University of Oxford
Much has been written about security and provenance. Although both have their own large areas of concern, there is a very significant intersection. One is often brought to bear upon the other in the study of the security of provenance. We discuss through a series of examples how provenance might be regarded as a security control in its own right. We argue that a risk-based approach to provenance is appropriate, and is already being used informally. A case study illustrates the applicability of this line of reasoning.
Dang Nguyen, Jaehong Park, and Ravi Sandhu, Institute for Cyber Security, University of Texas at San Antonio
A unique characteristic of provenance data is that it forms a directed acyclic graph (DAG) in accordance with the underlying causal dependencies between the entities (acting users, action processes, and data objects) involved in transactions. Data provenance raises at least two distinct security-related issues. One is how to control access to provenance data, which we call Provenance Access Control (PAC). The other is Provenance-based Access Control (PBAC), which focuses on how to utilize provenance data to control access to data objects. Both PAC and PBAC are built on a common foundation that requires security architects to define application-specific dependency path patterns in provenance data. Assigning application-specific semantics to these path patterns provides the foundation for effective security policy specification and administration. This paper elaborates on this common foundation of PAC and PBAC and identifies some of the differences in how it is applied in these two contexts.
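The following Python sketch illustrates the PBAC side of this idea: access to an object is decided by walking its dependency paths in the provenance DAG. The graph encoding, edge labels, and the particular policy are assumptions chosen for illustration, not the paper's formal model:

```python
# Illustrative provenance-based access control (PBAC) check over a DAG.
edges = {  # child -> list of (dependency_label, parent)
    "report": [("wasGeneratedBy", "merge_proc")],
    "merge_proc": [("used", "dataset_A"), ("used", "dataset_B")],
}

def ancestors(obj):
    """All entities reachable via dependency edges from obj (DAG walk)."""
    seen, stack = set(), [obj]
    while stack:
        node = stack.pop()
        for _label, parent in edges.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def pbac_allow(user, obj, forbidden_sources):
    # Application-specific rule: user may access obj only if none of its
    # provenance ancestors are in the user's forbidden set.
    return not (ancestors(obj) & forbidden_sources.get(user, set()))

print(pbac_allow("alice", "report", {"alice": {"dataset_B"}}))  # False
```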
|
Noon–1:30 p.m. |
Thursday |
|
1:30 p.m.–2:30 p.m. |
Thursday |
Session Chair: Bertram Ludäscher, University of California, Davis
Mark Howison, Nicholas A. Sinnott-Armstrong, and Casey W. Dunn, Brown University
We present a new Python/C++ framework, BioLite, for implementing bioinformatics pipelines for Next-Generation Sequencing (NGS) data. BioLite tracks provenance of analyses, automates the collection and reporting of diagnostics (such as summary statistics and plots at intermediate stages), and profiles computational requirements. These diagnostics can be accessed across multiple stages of a pipeline, from other pipelines, and in HTML reports. Finally, we describe several use cases for diagnostics in our own analyses.
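As a hedged sketch of the diagnostics idea (the decorator and record fields below are invented for illustration and are not BioLite's API), a pipeline stage can be wrapped so that each run appends timing and summary statistics to a provenance log:

```python
# Toy pipeline stage wrapper that records provenance and diagnostics per run.
import time

run_log = []  # provenance: one record per executed stage

def stage(name):
    def decorator(fn):
        def wrapped(*args):
            start = time.time()
            result = fn(*args)
            run_log.append({
                "stage": name,
                "inputs": args,
                "seconds": time.time() - start,       # computational profile
                "diagnostics": {"n_records": len(result)},  # summary statistic
            })
            return result
        return wrapped
    return decorator

@stage("filter_reads")
def filter_reads(reads):
    return [r for r in reads if len(r) >= 4]  # stand-in for quality filtering

filtered = filter_reads(["ACGT", "AC", "ACGTAC"])
print(run_log[0]["diagnostics"])  # {'n_records': 2}
```

Because every stage writes to the same log, later stages (or other pipelines and reports) can read diagnostics produced earlier, which is the cross-stage access the abstract describes.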
Peter Macko and Margo Seltzer, Harvard University
Most provenance capture takes place inside particular tools: a workflow engine, a database, an operating system, or an application. However, most users have an existing toolset, a collection of different tools that work well for their needs and with which they are comfortable. Currently, such users have limited ability to collect provenance without disrupting their work and changing environments, which most users are hesitant to do. Even users who are willing to adopt new tools may realize limited benefit from provenance in those tools if they do not integrate with their entire environment, which may include multiple languages and frameworks.
We present the Core Provenance Library (CPL), a portable, multi-lingual library that application programmers can easily incorporate into a variety of tools to collect and integrate provenance. Although the manual instrumentation adds extra work for application programmers, we show that in most cases, the work is minimal, and the resulting system solves several problems that plague more constrained provenance collection systems.
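As a rough sketch of what manual instrumentation can look like in application code, the toy session below creates provenance objects and records data-flow edges at the points where data actually moves. The API is invented for this example and is not CPL's actual interface:

```python
# Invented, CPL-like provenance session: objects plus data-flow edges.
class Session:
    def __init__(self):
        self.objects, self.edges = {}, []

    def lookup_or_create(self, name, kind):
        """Return the provenance object for (name, kind), creating it if new."""
        return self.objects.setdefault((name, kind), {"name": name, "kind": kind})

    def data_flow(self, dest, src):
        self.edges.append((src["name"], dest["name"]))  # src -> dest

session = Session()
infile = session.lookup_or_create("input.csv", "file")
proc = session.lookup_or_create("cleaner.py", "process")
outfile = session.lookup_or_create("clean.csv", "file")

# Instrument the two points where data moves.
session.data_flow(proc, infile)   # the process read the input file
session.data_flow(outfile, proc)  # the process wrote the output file
print(session.edges)
```

The point of the abstract is that these few calls are typically all the instrumentation a tool needs, and that a shared library lets provenance from different tools and languages integrate into one graph.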
Philip J. Guo, Stanford University; Margo Seltzer, Harvard University
Researchers in fields such as bioinformatics, CS, finance, and applied math have trouble managing the numerous code and data files generated by their computational experiments, comparing the results of trials executed with different parameters, and keeping up-to-date notes on what they learned from past successes and failures.
We created a Linux-based system called BURRITO that automates aspects of this tedious experiment organization and notetaking process, thus freeing researchers to focus on more substantive work. BURRITO automatically captures a researcher’s computational activities and provides user interfaces to annotate the captured provenance with notes and then make queries such as, “Which script versions and command-line parameters generated the output graph that this note refers to?”
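A hypothetical example of the kind of query this enables: given a log of captured activities, find every script version and parameter set that produced a particular output file. The record format and helper below are assumptions, not BURRITO's actual storage model:

```python
# Invented activity log in the spirit of automatic provenance capture.
activity_log = [
    {"script": "plot.py", "version": "a1b2c3", "args": ["--bins", "20"],
     "outputs": ["hist_v1.png"]},
    {"script": "plot.py", "version": "d4e5f6", "args": ["--bins", "50"],
     "outputs": ["hist_v2.png"]},
]

def who_made(output_file):
    """Which script versions and command-line parameters generated this output?"""
    return [(r["script"], r["version"], r["args"])
            for r in activity_log if output_file in r["outputs"]]

print(who_made("hist_v2.png"))  # [('plot.py', 'd4e5f6', ['--bins', '50'])]
```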
|
2:30 p.m.–3:00 p.m. |
Thursday |
|
3:00 p.m.–3:40 p.m. |
Thursday |
Session Chair: Philip Guo, Stanford University
Adriane Chapman, M. David Allen, and Barbara Blaustein, The MITRE Corporation
The end goal of provenance is to assist users in understanding their data: How was it created? When? By whom? How was it manipulated? In other words, provenance is a powerful tool to help users answer the question, “Is this data fit for use?” However, there is no single set of criteria that makes data “fit for use”. The criteria depend on the user, the task at hand, and the current situation. In this work we describe Fitness Widgets, predefined queries over provenance graphs that users can customize to determine data fitness. We have implemented Fitness Widgets in our provenance system, PLUS.
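As an illustration of the concept (not the PLUS implementation), a fitness widget can be modeled as a predefined provenance query with user-tunable parameters, such as a freshness check over all ancestors of a data item; the graph structure and field names are illustrative assumptions:

```python
# Toy "fitness widget": a canned provenance query with a user-set parameter.
def freshness_widget(graph, node, max_age_days):
    """Fit for use only if every ancestor was produced within max_age_days."""
    def ancestors(n):
        for parent in graph.get(n, {}).get("inputs", []):
            yield parent
            yield from ancestors(parent)
    return all(graph[a]["age_days"] <= max_age_days for a in ancestors(node))

graph = {
    "analysis": {"inputs": ["sensor_feed"], "age_days": 1},
    "sensor_feed": {"inputs": [], "age_days": 10},
}
print(freshness_widget(graph, "analysis", max_age_days=7))   # False
print(freshness_widget(graph, "analysis", max_age_days=30))  # True
```

The same data can be fit for one task and unfit for another simply by turning the widget's parameter, which is the customization the abstract emphasizes.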
Zachary G. Ives, Andreas Haeberlen, and Tao Feng, University of Pennsylvania; Wolfgang Gatterbauer, Carnegie Mellon University
As has been frequently observed in the literature, there is a strong connection between a derived data item’s provenance and its authoritativeness, utility, relevance, or probability. A standard way of obtaining a score for a derived tuple is by first assigning scores to the “base” tuples from which it is derived — then using the semantics of the query and the score measure to derive a value for the tuple. This “provenance-enabled” scoring has led to a variety of scenarios where tuples’ intrinsic value is based on their provenance, independent of whatever other tuples exist in the data set.
However, there is another class of applications, revolving around sharing and recommendation, in which our goal may be to rank tuples by their “importance” or the structure of their connectivity within the provenance graph. We argue that the most natural approach is to exploit the structure of a provenance graph to rank and recommend “interesting” or “relevant” items to users, based on global and/or local provenance graph structure and random walk-based algorithms. We further argue that it is desirable to have a high-level declarative language to extract portions of the provenance graph and then apply the random walk computations. We extend the ProQL provenance query language to support a wide array of random walk algorithms in a high-level way, and identify opportunities for query optimization.
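To ground the random-walk idea, here is a minimal PageRank-style sketch over a toy provenance graph. The graph, damping factor, and iteration count are illustrative choices, and this is not ProQL's actual evaluation strategy:

```python
# PageRank-style random walk to rank nodes of a provenance graph.
def random_walk_rank(out_edges, damping=0.85, iters=50):
    nodes = list(out_edges)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Teleport mass, then push each node's rank along its edges.
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, succs in out_edges.items():
            if succs:
                share = damping * rank[n] / len(succs)
                for s in succs:
                    new[s] += share
            else:  # dangling node: redistribute its mass uniformly
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
        rank = new
    return rank

# Tiny provenance graph: derived tuples point back at their base tuples.
prov = {"t1": ["base_a"], "t2": ["base_a", "base_b"], "base_a": [], "base_b": []}
ranks = random_walk_rank(prov)
print(sorted(ranks, key=ranks.get, reverse=True))  # base_a ranks highest
```

Here base_a ends up most "important" because more derivations flow into it, which is the structural, graph-global notion of relevance the paper contrasts with per-tuple intrinsic scores.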
|
3:40 p.m.–5:00 p.m. |
Thursday |
|