09:00–10:00 | Thursday | Session Chair: Jun Zhao, Lancaster University
Trevor Martin, University of Bristol The so-called big data revolution has been characterised by an increase in sources of data as well as in the volume of data to be processed. In many cases—for example, network behaviour and control, security monitoring, enterprise management information—the data for situation awareness and decision-making is drawn from multiple sources and must be integrated into a coherent whole as far as possible.
This process generally requires both machines and human analysts and experts. It includes compensating for different formats, granularities, and resolutions; identifying and correcting errors (both systematic and intermittent); and managing uncertainties and gaps in the data. Often the process requires assumptions and choices to be made in arriving at a reasonably robust overview of a situation—for example, in deciding that a failed attempt to access a building is potentially malicious, we might need to take account of someone's recent travel, long-term patterns of behaviour, current schedules of close colleagues, etc., where each of these components may have been derived from lower-level raw data. Provenance in this context refers to the derivation pathways and their overall reliability.
In this talk, I will describe the use of graded (fuzzy) representations in modelling and managing the uncertainty, reliability, and granularity of derived data when combining sources for situation awareness.
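As a rough illustration of the graded approach (not taken from the talk itself), the sketch below combines reliability grades along a derivation path using the min t-norm, one common fuzzy choice; all names and values are invented.

    # Sketch: propagating graded (fuzzy) reliability along a derivation
    # path. The min t-norm is one common conjunctive combiner; the
    # talk's actual model may differ.
    def combine_all_required(grades):
        """Support for a conclusion that needs ALL inputs (min t-norm)."""
        return min(grades)

    def combine_any_suffices(grades):
        """Support for a conclusion backed by ANY input (max s-norm)."""
        return max(grades)

    # Hypothetical evidence for "failed access attempt is malicious":
    evidence = {
        "recent_travel_anomaly": 0.7,       # graded membership, not probability
        "long_term_pattern_break": 0.4,
        "colleague_schedule_conflict": 0.9,
    }

    overall = combine_all_required(evidence.values())
    print(f"overall support: {overall:.2f}")   # 0.40: the weakest link dominates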
10:00–10:30 | Thursday | Break
10:30–12:10 | Thursday
Amit Chavan, University of Maryland; Silu Huang, University of Illinois at Urbana-Champaign; Amol Deshpande, University of Maryland; Aaron Elmore, University of Chicago; Samuel Madden, MIT; Aditya Parameswaran, University of Illinois at Urbana-Champaign Organizations and teams collect and acquire data from various sources, such as social interactions, financial transactions, sensor data, and genome sequencers. Different teams in an organization, as well as different data scientists within a team, are interested in extracting a variety of insights, which requires combining and collaboratively analyzing datasets in diverse ways. DataHub is a system that aims to provide robust version control and provenance management for such a scenario. To be truly useful for collaborative data science, one also needs the ability to specify queries and analysis tasks over the versioning and the provenance information in a unified manner. In this paper, we present an initial design of our query language, called VQuel, that aims to support such unified querying over both types of information, as well as the intermediate and final results of analyses. We also discuss some of the key language design and implementation challenges moving forward.
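The abstract does not show VQuel syntax, so the following Python sketch only illustrates the kind of unified version-plus-provenance query the language targets: walking a version DAG while reporting the analysis task that produced each derived version. Every class and field here is invented.

    # Hypothetical model of unified version/provenance querying in the
    # spirit of DataHub. This is not VQuel; all names are illustrative.
    from dataclasses import dataclass

    @dataclass
    class Version:
        vid: str
        parents: list      # version-level lineage (the version DAG)
        derived_by: str    # the analysis task that produced this version

    def versions_derived_from(versions, ancestor_vid):
        """All versions transitively derived from `ancestor_vid`, paired
        with the task (provenance) recorded for each derivation step."""
        out, seen = [], {ancestor_vid}
        frontier = {ancestor_vid}
        while frontier:
            nxt = {v.vid for v in versions
                   if set(v.parents) & frontier} - seen
            out += [(v.vid, v.derived_by) for v in versions if v.vid in nxt]
            seen |= nxt
            frontier = nxt
        return out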
Xing Niu, Raghav Kapoor, and Boris Glavic, Illinois Institute of Technology; Dieter Gawlick, Zhen Hua Liu, and Vasudha Krishnaswamy, Oracle Corporation; Venkatesh Radhakrishnan, Facebook Since its inception, the PROV standard has been widely adopted as a standardized exchange format for provenance information. Surprisingly, this standard is currently not supported by provenance-aware database systems, limiting their interoperability with other provenance-aware systems. In this work, we introduce techniques for exporting database provenance as PROV documents, importing PROV graphs alongside data, and linking outputs of an SQL operation to the imported provenance for its inputs. Our implementation in the GProM system offloads the generation of PROV documents to the backend database. This enables provenance tracking for applications that use a relational database for managing (part of) their data, but also execute some non-database operations.
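For readers unfamiliar with the target format, the sketch below uses the Python prov package (pip install prov) to build the shape of document such an exporter emits: an output tuple linked to the input tuples and the SQL operation it was derived from. This is not GProM's implementation, which generates PROV inside the database backend, and all identifiers are invented.

    # Sketch of a PROV document linking a query result to its inputs,
    # built with the Python `prov` package. Identifiers are illustrative.
    from prov.model import ProvDocument

    doc = ProvDocument()
    doc.add_namespace('db', 'http://example.org/db/')

    result = doc.entity('db:result_tuple_1')     # output of the SQL operation
    in1 = doc.entity('db:orders_tuple_42')       # input tuples
    in2 = doc.entity('db:customers_tuple_7')
    query = doc.activity('db:sql_join_query')    # the SQL operation itself

    doc.used(query, in1)
    doc.used(query, in2)
    doc.wasGeneratedBy(result, query)
    doc.wasDerivedFrom(result, in1)
    doc.wasDerivedFrom(result, in2)

    print(doc.serialize(indent=2))               # PROV-JSON by default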
Adam Bates and Kevin R.B. Butler, University of Florida; Thomas Moyer, MIT Lincoln Laboratory When performing automatic provenance collection within the operating system, inevitable storage overheads are made worse by the fact that much of the generated lineage is uninteresting, describing noise and background activities that lie outside the scope of the system’s intended use. In this work, we propose a novel approach to policy-based provenance pruning: leverage the confinement properties provided by Mandatory Access Control (MAC) systems to identify subdomains of system activity for which to collect provenance. We consider the assurances of completeness that such a system could provide by sketching algorithms that reconcile provenance graphs with the information flows permitted by the MAC policy. We go on to identify the design challenges in implementing such a mechanism. In a simplified experiment, we demonstrate that adding a policy component to the Hi-Fi provenance monitor could reduce storage overhead by as much as 82%. To our knowledge, this is the first practical policy-based provenance monitor to be proposed in the literature.
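As a toy rendering of the idea (far simpler than reconciling full MAC information flows, and not the paper's Hi-Fi implementation), the filter below records a provenance event only when its subject's label falls in a monitored domain; the SELinux-style labels are invented.

    # Toy policy-based pruning: keep only events whose subject label is
    # in a monitored MAC (sub)domain. Labels are hypothetical.
    MONITORED_DOMAINS = {"webserver_t", "db_t"}

    def should_record(event):
        return event["subject_label"] in MONITORED_DOMAINS

    events = [
        {"subject_label": "webserver_t", "op": "read",  "obj": "/var/www/index.html"},
        {"subject_label": "cron_t",      "op": "exec",  "obj": "/usr/bin/backup"},
        {"subject_label": "db_t",        "op": "write", "obj": "/var/lib/db/table"},
    ]

    pruned = [e for e in events if should_record(e)]
    print(f"kept {len(pruned)} of {len(events)} events")   # kept 2 of 3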
Nikilesh Balakrishnan, Thomas Bytheway, Lucian Carata, Oliver R. A. Chick, James Snee, Sherif Akoush, and Ripduman Sohan, University of Cambridge; Margo Seltzer, Harvard University; Andy Hopper, University of Cambridge In recent years several hardware and systems fields have made advances in technology that open new opportunities and challenges for provenance systems. In this paper we look at such technologies and discuss the implications they have for provenance. First, we discuss processor and memory controller technologies that enable fine-grained lineage capture, resulting in more precise and accurate provenance. Then, we look at programmable storage, 3D memory and co-processor technologies discussing how lineage capture in these heterogeneous environments results in richer and more complete provenance. We finally look at technological advances in the field of networking, namely NFV and SDN, discussing how these technologies enable easier provenance capture in the network.
12:10–12:40 | Thursday | 3-Minute Gong Show—Poster Pitches
12:40–14:00 | Thursday | Poster Session and Lunch
14:00–15:15 | Thursday
Daniel de Oliveira, Universidade Federal Fluminense; Vítor Silva and Marta Mattoso, Federal University of Rio de Janeiro Provenance databases are an important asset in data analytics of large-scale scientific data. The data derivation path allows for identifying parameters, files, and domain data values of interest. In scientific workflows, provenance data is automatically captured by workflow systems. However, the power of provenance data analyses depends on the expressiveness of domain-specific data along the provenance traces. While much has been done through the W3C PROV initiative and its PROV-DM to represent generic provenance data, representing domain-specific data in provenance traces has received little attention, yet it accounts for a large number of provenance analytical queries. Such queries are based on selections on data values from input/output artifacts along workflow activities. There are several problems in modeling and capturing values from domain-specific attributes: some are related to managing provenance granularity, others to addressing data values hidden inside files and to representing the semantics of domain data. In this work, we discuss these open issues and propose some alternatives for domain-specific provenance data capture, representation, storage, and querying. Addressing these issues may be decisive in using provenance to drive scientific data analyses at large scale.
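A sketch of the kind of analytical query at issue, with an invented trace schema: selecting workflow activities by domain data values that were extracted from their output files.

    # Sketch: selection over domain-specific values attached to a
    # provenance trace. The schema and values are invented.
    trace = [
        {"activity": "mesh_gen", "output": "mesh.dat",
         "domain": {"cells": 120000, "max_skew": 0.42}},
        {"activity": "solver", "output": "field.h5",
         "domain": {"residual": 1e-6, "max_pressure": 310.5}},
    ]

    # Which activities produced an output with max_pressure above 300?
    hits = [t["activity"] for t in trace
            if t["domain"].get("max_pressure", 0) > 300]
    print(hits)   # ['solver']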
João Felipe Nicolaci Pimentel, Vanessa Braganholo, and Leonardo Murta, Universidade Federal Fluminense; Juliana Freire, New York University Interactive notebooks help users explore code, run simulations, visualize results, and share them with other people. While these notebooks have been widely adopted in teaching as well as by scientists and data scientists who perform exploratory analyses, their provenance support is limited to the visualization of some intermediate results and code sharing. Once a user arrives at a result, it is hard, and sometimes impossible, to retrace the steps that led to it, since notebooks do not collect provenance for intermediate results or for the environment. As a result, users must fill this gap using external tools such as workflow management systems. To overcome this limitation, we propose a new approach to capture provenance from notebooks. We build upon noWorkflow, a system that systematically collects provenance for Python scripts. By integrating noWorkflow and notebooks, provenance is automatically and transparently captured, allowing users to focus on their exploratory tasks within the notebook. In addition, they are able to analyze provenance information within the notebook, to both reason about and debug their work, using visualizations, SQL queries, Prolog queries, and Python code.
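noWorkflow persists the provenance it collects in a local SQLite database, so one way to analyze it from inside a notebook is plain SQL; the path and schema in this sketch are assumptions that may differ across noWorkflow versions.

    # Sketch: inspecting noWorkflow provenance with SQL from a notebook.
    # Database path and table/column names are assumed, not guaranteed.
    import sqlite3

    conn = sqlite3.connect(".noworkflow/db.sqlite")    # assumed location
    for row in conn.execute(
            "SELECT id, script, start FROM trial ORDER BY start DESC LIMIT 5"):
        print(row)    # recent trials: which script ran, and when
    conn.close()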
Saumen Dey, University of California, Davis; Khalid Belhajjame, Université Paris-Dauphine; David Koop, University of Massachusetts Dartmouth; Meghan Raul, University of California, Davis; Bertram Ludäscher, University of Illinois at Urbana-Champaign Scripting languages like Python, R, and MATLAB have seen significant use across a variety of scientific domains. To assist scientists in the analysis of script executions, a number of mechanisms, e.g., noWorkflow, have been recently proposed to capture the provenance of script executions. The provenance information recorded can be used, e.g., to trace the lineage of a particular result by identifying the data inputs and the processing steps that were used to produce it. By and large, the provenance information captured for scripts is fine-grained in the sense that it captures data dependencies at the level of script statements, and does so for every variable within the script. While useful, the amount of recorded provenance information can be overwhelming for users and cumbersome to use. This suggests the need for abstraction mechanisms that focus attention on the specific parts of provenance relevant for analyses. Toward this goal, we propose that the fine-grained provenance information recorded as the result of script execution can be abstracted using user-specified, workflow-like views. Specifically, we show how the provenance traces recorded by noWorkflow can be mapped to the workflow specifications generated by YesWorkflow from scripts based on user annotations. We examine the issues in constructing a successful mapping, provide an initial implementation of our solution, and present competency queries illustrating how a workflow view generated from the script can be used to explore the provenance recorded during script execution.
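To make the mapping concrete, here is a small invented script carrying YesWorkflow-style annotations in its comments; YesWorkflow derives the workflow view from the @begin/@in/@out/@end markers, onto which noWorkflow's fine-grained trace can then be mapped.

    # Invented example of YesWorkflow annotations embedded in comments.
    # @begin clean_data
    # @in raw_file @uri file:raw.csv
    # @out clean_file @uri file:clean.csv
    def clean_data():
        pass   # drop malformed rows from raw.csv, write clean.csv
    # @end clean_data

    # @begin plot_results
    # @in clean_file @uri file:clean.csv
    # @out plot @uri file:results.png
    def plot_results():
        pass   # render clean.csv into results.png
    # @end plot_results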
15:15–15:45 | Thursday | Break
15:45–17:00 | Thursday
Stefan Fehrenbach and James Cheney, University of Edinburgh Today’s programming languages provide no support for data provenance. In a world that increasingly relies on data, we need provenance to judge the reliability of data, and we should therefore aim to make it easily accessible to programmers. We report our work in progress on an extension to the Links programming language that builds on its support for language-integrated query to support where-provenance queries through query rewriting and a type system extension that distinguishes provenance metadata from other data. Our approach works solely within the language implementation and thus requires no changes to the database system. The type system, together with automatic propagation of provenance metadata, prevents programmers from accidentally changing provenance, losing it, or misattributing it to other data.
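Links' concrete syntax is not shown in the abstract, but the essence of where-provenance can be mimicked in a few lines of Python: each value carries its (table, column, row) origin in a type that keeps the metadata separate from the data. All names below are illustrative.

    # Sketch of where-provenance: a value paired with its origin triple,
    # kept distinct from ordinary values by its type. Mimics in plain
    # Python what the Links extension enforces statically.
    from typing import NamedTuple

    class Prov(NamedTuple):
        table: str
        column: str
        row: int

    class Traced(NamedTuple):
        value: object
        prov: Prov

    presidents = [
        Traced("Lincoln", Prov("people", "name", 16)),
        Traced("Grant",   Prov("people", "name", 18)),
    ]

    # Queries that project or filter values propagate provenance untouched:
    names = [t for t in presidents if t.value.startswith("G")]
    print(names[0].prov)   # Prov(table='people', column='name', row=18)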
Boris Glavic, Illinois Institute of Technology; Sven Köhler, University of California, Davis; Sean Riddle, Athenahealth Corporation; Bertram Ludäscher, University of Illinois at Urbana-Champaign Explaining why an answer is present (traditional provenance) or absent (missing answer provenance) from a query result is important for many use cases. Most existing approaches use the existence (or absence) of input data to explain a (missing) answer. However, for realistically sized databases, these explanations can be very large and, thus, may not be very helpful to a user. In this paper, we argue that constraints as a concise description of large (or even infinite) sets of existing or missing inputs can provide a natural way of answering a Why- or Why-not question. For instance, to explain why no non-US citizen is in the answer of a query returning US presidents we could list all possible combinations of persons and countries of citizenship that are missing from the input. However, a more concise and insightful explanation is that the constraint that US presidency implies US citizenship prevents any results from being returned by our query. We demonstrate how a taxonomy expressed as inclusion dependencies can provide meaningful justifications for (non-) answers and outline how to find a most general such explanation for a given query using datalog. Furthermore, we sketch several variations of this framework derived by considering other types of constraints as well as alternative definitions of explanation and generalization.
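Written out (notation assumed, not taken from the paper), the example's explanatory constraint is the inclusion dependency

    \forall x \,\bigl( \mathit{President}(x, \mathit{US}) \rightarrow \mathit{Citizen}(x, \mathit{US}) \bigr)

which makes the query \{ x \mid \mathit{President}(x, \mathit{US}) \wedge \neg\, \mathit{Citizen}(x, \mathit{US}) \} unsatisfiable on every instance, explaining the missing answers without enumerating any tuples.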
Maxime Debosschere and Floris Geerts, Universiteit Antwerpen In recent work, Salimi and Bertossi provide a tight connection between causality and tuple-based data repairs. We investigate this connection between causality and two other kinds of repair models. First, we consider cell-based V-repairs, i.e., repairs that are obtained by modifying cells in the data. In contrast, tuple-based repairs only allow for the deletion of tuples. Second, we introduce a new notion of repairs, called chase repairs, that take into account the procedural (chase) steps that lead to a repair. We establish a connection between causes (and the associated notion of responsibility) and V-repairs, and analyse the complexity of verifying whether a cell is a cause and whether its responsibility is above a certain threshold. Our understanding of chase repairs is still very preliminary, and we argue that provenance models that are specifically targeted to data repairs and data quality in general are needed to make formal connections between causality and chase repairs.
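For reference, the usual responsibility measure from the database causality literature (assuming the paper follows the standard Chockler–Halpern-style definition) is

    \rho(t) = \frac{1}{1 + \min_{\Gamma} |\Gamma|}

where \Gamma ranges over contingency sets: minimal sets of tuples (or, for V-repairs, cells) whose removal or modification turns t into a counterfactual cause. Checking whether a responsibility is above a certain threshold \tau then amounts to deciding \rho(t) > \tau.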