A Method to Build and Analyze Scientific Workflows from Provenance through Process Mining
Scientific workflows have recently emerged as a new paradigm for representing and managing complex distributed scientific computations and are used to accelerate the pace of scientific discovery. In many disciplines, individual workflows are large because of the large quantities of data used. As scientific workflows scale quickly, they become very hard to build and maintain. Recent efforts from the scientific workflow community aimed at large-scale capture of provenance present a new opportunity for building scientific workflows using provenance. Process mining focuses on extracting information about processes by examining event logs, and has been successfully applied to business workflow management. This paper presents a method that uses process mining based on provenance to build and analyze scientific workflows, which provides a new direction in using captured provenance.
Incremental Workflow Improvement Through Analysis of Its Data Provenance
Repeated executions of resource-intensive workflows over a large number of runs are commonly observed in e-science practice. We explore the hypothesis that, in some cases, provenance traces recorded for past runs of a workflow can be used to make future runs more efficient. This investigation is an initial step into the systematic study of the role that provenance analysis can play in the broader context of self-managing software systems. We have tested our hypothesis on a concrete case study involving a Chemical Engineering workflow deployed on a cloud infrastructure, where we can measure the cost of its repeated execution. Our approach involves augmenting the workflow with a feedback loop in which incremental analysis of the provenance of past runs is used to control some of the workflow steps in subsequent executions. We present initial experimental results and hint at future improvements as part of ongoing work.
CRMdig: A Generic Digital Provenance Model for Scientific Observation
The systematic large-scale production of digital scientific objects, the diversity of the processes involved, and the complexity of describing historical relationships among them impose the need for an innovative knowledge management system capable of handling all the semantic information in order to monitor, manage and document the origins and derivation of products in a flexible manner. We have implemented CRMdig, an extension of the CIDOC-CRM ontology, which is able to capture the modeling and query requirements regarding the provenance of digital objects for e-science. CRMdig is particularly rich in describing the physical circumstances of scientific observation resulting in digital data.
On the Limitations of Provenance for Queries with Difference
The annotation of the results of database transformations was shown to be very effective for various applications. Until recently, most works in this context focused on positive query languages. Provenance semirings are a particular approach that was proven effective for these languages, and it was shown that when propagating provenance with semirings, the expected equivalence axioms of the corresponding query languages are satisfied. There have been several attempts to extend the framework to account for relational algebra queries with difference. We show here that these suggestions fail to satisfy some expected equivalence axioms (that in particular hold for queries on "standard" set and bag databases). Interestingly, we show that this is not a pitfall of these particular attempts, but rather that every such attempt is bound to fail in satisfying these axioms, for some semirings. Finally, we show particular semirings for which an extension for supporting difference is (im)possible.
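As a rough illustration of semiring annotation for positive queries only (not of the difference extensions the paper analyses), the sketch below tags each tuple with an element of a commutative semiring: union and projection add annotations, join multiplies them. The `Semiring` class, the dict-based relation encoding, and all helper names are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)
class Semiring:
    """A commutative semiring (K, plus, times, zero, one)."""
    zero: Any
    one: Any
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]


# Counting semiring: annotations are bag multiplicities.
COUNTING = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)
# Boolean semiring: annotations say whether a tuple is present (set semantics).
BOOLEAN = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)

# A K-relation maps each tuple to its annotation (hypothetical encoding).
Relation = Dict[Tuple, Any]


def union(k: Semiring, r: Relation, s: Relation) -> Relation:
    """Union adds the annotations of tuples that occur in both inputs."""
    out = dict(r)
    for t, ann in s.items():
        out[t] = k.plus(out.get(t, k.zero), ann)
    return out


def join(k: Semiring, r: Relation, s: Relation) -> Relation:
    """Join R(a, b) with S(b, c) on b; annotations are multiplied."""
    out: Relation = {}
    for (a, b), ann_r in r.items():
        for (b2, c), ann_s in s.items():
            if b == b2:
                t = (a, b, c)
                out[t] = k.plus(out.get(t, k.zero), k.times(ann_r, ann_s))
    return out


def project(k: Semiring, r: Relation, cols: Sequence[int]) -> Relation:
    """Projection adds the annotations of tuples that collapse together."""
    out: Relation = {}
    for t, ann in r.items():
        key = tuple(t[i] for i in cols)
        out[key] = k.plus(out.get(key, k.zero), ann)
    return out


# Bag semantics: multiplicities propagate through join and projection.
R = {("a", "x"): 2, ("b", "x"): 1}
S = {("x", "p"): 3}
bag_result = project(COUNTING, join(COUNTING, R, S), [2])

# Set semantics: the same query over the Boolean semiring.
R_set = {t: True for t in R}
S_set = {t: True for t in S}
set_result = project(BOOLEAN, join(BOOLEAN, R_set, S_set), [2])
```

Swapping `COUNTING` for `BOOLEAN` changes the interpretation of the annotations without touching the query operators, which is the property that makes the semiring view attractive for positive queries.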
Getting It Together: Enabling Multi-organization Provenance Exchange
We present an architecture that supports provenance queries in large, dynamic, multi-organizational environments. The Provenance Challenges have explored exchange across disparate provenance systems, yet this is only a first step. We describe requirements for multi-organizational provenance, evaluate candidate architectures, describe the approach implemented in the PLUS prototype provenance manager, and present performance results that indicate the approach is scalable.
Provenance Map Orbiter: Interactive Exploration of Large Provenance Graphs
Provenance systems can produce enormous provenance graphs that can be used for a variety of tasks from determining the inputs to a particular process to debugging entire workflow executions or tracking difficult-to-find dependencies. Visualization can be a useful tool to support such tasks, but graphs of such scale (thousands to millions of nodes) are notoriously difficult to visualize. This paper presents the Provenance Map Orbiter, a tool for interactively exploring large provenance graphs using graph summarization and semantic zoom. It presents its users with a high-level abstracted view of the graph and the ability to incrementally drill down to the details.
Tracking Emigrant Data via Transient Provenance
Information leaks are a constant worry for companies and government organizations. After a leak occurs, it is very important for the data owner to determine not only the extent of the leak but also who originally leaked the information. We propose a technique to extend data provenance to aid in determining potential sources of information leaks. While data provenance is commonly defined as the ancestry of a file, the ancestry recorded depends on the provenance collector. Instead of only recording where a file came from, we propose to also track when and where a file leaves the system. To track these departures, we suggest the use of ghost objects when a file is either written to a mounted external storage device or copied to a client machine via NFS or any other network interface such as SSH or FTP. We present our solution for tracking emigrant data and explain the minor changes to current provenance-aware storage systems required to enable our solution.
Provenance in Dynamic Data Systems
Most digital data sets are subject to modifications. For example, scientific data may be updated according to the new experimental results, and sales data updated periodically according to new sales made. We often have data derived from these digital data sets.
Our concern in this paper is the provenance of such derived data. Can we explain what a particular derived datum depends on, even if a value used in its derivation has since been modified? Can we determine whether a particular derived value is still valid without performing full view maintenance? Questions of this sort are likely to arise when we derive results from modifiable data.
We present in this paper an overview of problems that arise in this context, with regard to fine-grain data provenance, and outline solutions to some of these problems.
A Framework for Policies over Provenance
Provenance captures the history of a data item. This ensures the quality, the trustworthiness and the correctness of shared information, but the provenance may contain sensitive information, so we may need to hide it. Sometimes we need access control policies to protect sensitive components and allow access based on certain properties. In other cases, we may need to share provenance but use redaction policies to circumvent the release of sensitive information. In this paper, we formulate an automatic procedure over provenance by combining these policies in a unified framework.
Challenges for Provenance in Cloud Computing
Many applications which require provenance are now moving to cloud infrastructures. However, it is not widely realised that clouds have their own need for provenance due to their dynamic nature and the burden this places on their administrators. We analyse the structure of cloud computing to identify the unique challenges facing provenance collection and the scenarios in which additional provenance data could be useful.
Publishing Provenance-rich Scientific Papers
Complete documentation and reproducibility of results are important goals for scientific publications. Standard scientific papers, however, usually contain only final results and document only parameters and processing steps that the authors considered important enough. By recording the complete provenance history of the data leading to a publication one can overcome this limitation and allow reproducibility for reviewers, publishers and readers of scientific publications. While the process of capturing provenance information is a growing research subject, here we discuss usually overlooked challenges involved in publishing provenance-complete papers. We report on our experience in preparing and publishing two specific executable papers where we used the VisTrails workflow system to embed full provenance information of the paper and discuss open challenges and issues we encountered.
Efficient Query Computing for Uncertain Possibilistic Databases with Provenance
We propose an extension of possibilistic databases that also includes provenance. The introduction of provenance makes our model closed under selection with equalities, projection and join. In addition, query computation with possibilities is polynomial, in contrast with current models that combine provenance with probabilities and have #P complexity.
One of These Records Is Not Like the Others
This position paper argues the need to develop provenances, and provenance systems, in such a way that errors in the provenance (whether deliberate or not) can be detected and corrected. The requirement that a provenance have high assurance leads to some suggestions about the way a provenance should be constructed.
Fine Grain Provenance Using Temporal Databases
Database applications often require rigorous provenance; e.g., it is important to know who is responsible for a change and at what time a change was made. Additionally, it is important to know which program, rule, or model was executed, and which data was used as input. The standard solution is to include provenance in the application logic. The consequence is that such a program grows significantly (a factor of 3 to 5 is often cited) and, most importantly, provenance implemented by applications is normally badly designed and hard to use. This paper shows that temporal databases can be used to automate the provisioning process and that they enable an application design based on three key concepts: facts, knowledge, and information.
Provenance Needs Incentives for Everyone
Despite fervent early adopters, a rich research community and top-down mandates requiring its use, digital provenance has not become a pervasive and mainstream technology. While technological barriers still exist, the provenance community also must address thorny nontechnical issues. In particular, for critical stakeholders, the cost (time, expenses) of using and maintaining a provenance system is, from their viewpoint, often not worth the investment. In this work, we describe a real military use case and identify the various stakeholders. We then introduce the concept of incentives, to increase the return on investment for provenance usage, illustrating incentives with our use case.
Provenance, End-User Trust and Reuse: An Empirical Investigation
Provenance theorists and practitioners assume that provenance is essential for trust in and reuse of data. However, little empirical research has been conducted to more closely examine this assumption. This qualitative study explores how provenance affects end-users' trust in and reuse of data. Toward this end, the authors conducted semi-structured interviews with 17 proteomics researchers who interact with data from ProteomeCommons.org, a large scientific data repository. Empirical findings from this study suggest that provenance does help end-users gauge the trustworthiness of data and build their confidence in reusing data. However, provenance also needs to be accompanied by other kinds of information, including: more specific data quality information, the data itself, and author reputation information. Implications of this study stress the value of end-user studies in provenance research, specifically to assess the 'real-world' impact of provenance encoded and communicated to end-users in systems.
Collecting Provenance via the Xen Hypervisor
The Provenance Aware Storage Systems project (PASS) currently collects system-level provenance by intercepting system calls in the Linux kernel and storing the provenance in a stackable filesystem. While this approach is reasonably efficient, it suffers from two significant drawbacks: each new revision of the kernel requires reintegration of PASS changes, the stability of which must be continually tested; also, the use of a stackable filesystem makes it difficult to collect provenance on root volumes, especially during early boot. In this paper we describe an approach to collecting system-level provenance from virtual guest machines running under the Xen hypervisor. We make the case that our approach alleviates the aforementioned difficulties and promotes adoption of provenance collection within cloud computing platforms.
Dynamic Provenance for SPARQL Updates Using Named Graphs
The (Semantic) Web currently does not have an official or de facto standard way to exhibit provenance information. While some provenance models and annotation techniques originally developed with databases or workflows in mind transfer readily to RDF, RDFS and SPARQL, these techniques do not readily adapt to describing changes in dynamic RDF datasets over time. Named graphs have been introduced to RDF as a way of grouping triples in order to facilitate annotation, provenance and other descriptive metadata. Although their semantics is not yet officially part of RDF, there appears to be a consensus based on their syntax and semantics in SPARQL queries. Meanwhile, updates are being introduced as part of the next version of SPARQL. In this paper we explore how to adapt the dynamic copy-paste provenance model of Buneman et al. [2] to RDF datasets that change over time in response to SPARQL updates, how to represent the resulting provenance records themselves as RDF using named graphs, and how the provenance information can be provided as a SPARQL end-point.
TAP: Time-aware Provenance for Distributed Systems
In this paper, we explore the use of provenance for analyzing execution dynamics in distributed systems. We argue that provenance could have significant practical benefits for system administrators, e.g., for reasoning about changes in a system's state, diagnosing protocol misconfigurations, detecting intrusions, and pinpointing performance bottlenecks. However, to realize this vision, we must revisit several aspects of provenance management. As a first step, we present time-aware provenance (TAP), an enhanced provenance model that explicitly represents time, distributed state, and state changes. We outline our research agenda towards developing novel query processing, languages, and optimization techniques that can be used to efficiently and securely query time-aware provenance, even in the presence of transient state or untrusted nodes.
Provenance Integration Requires Reconciliation
While there has been a great deal of research on provenance systems, there has been little discussion about challenges that arise when making different provenance systems interoperate. In fact, most of the literature focuses on provenance systems in isolation and does not discuss interoperability — what it means, its requirements, and how to achieve it. We designed the Provenance-Aware Storage System to be a general-purpose substrate on top of which it would be "easy" to add other provenance-aware systems in a way that would provide "seamless integration" for the provenance captured at each level. While the system did exactly what we wanted on toy problems, when we began integrating StarFlow, a Python-based workflow/provenance system, we discovered that integration is far trickier and more subtle than anyone has suggested in the literature. This work describes our experience undertaking the integration of StarFlow and PASS, identifying several important additions to existing provenance models necessary for interoperability among provenance systems.
On Factorisation of Provenance Polynomials
Provenance polynomials generated by query evaluation in relational databases can have a regular structure that can be exploited for a more succinct representation via algebraic factorisations.
In this paper we highlight key properties and potential benefits of factorised provenance polynomials. We also present a list of challenges and outline results obtained so far in managing factorised polynomials of query results.
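To give a feel for the kind of saving factorisation can offer, the sketch below compares a flat sum-of-products polynomial with an equivalent factorised form. It uses a generic textbook case, the polynomial of a product-like query whose monomials pair every variable of one group with every variable of another; the variable names, the valuation, and the helper functions are assumptions for this example and are not taken from the paper.

```python
from itertools import product
from typing import Dict, List, Tuple


def flat_polynomial(left: List[str], right: List[str]) -> List[Tuple[str, str]]:
    """Flat form: one monomial l*r for every pair of variables."""
    return list(product(left, right))


def evaluate_flat(monomials: List[Tuple[str, str]], val: Dict[str, int]) -> int:
    """Evaluate the sum of products under a numeric valuation."""
    return sum(val[l] * val[r] for l, r in monomials)


def evaluate_factorised(left: List[str], right: List[str], val: Dict[str, int]) -> int:
    """Evaluate the factorised form (sum of left) * (sum of right)."""
    return sum(val[v] for v in left) * sum(val[v] for v in right)


def size_flat(monomials: List[Tuple[str, str]]) -> int:
    """Number of variable occurrences in the flat form."""
    return 2 * len(monomials)


def size_factorised(left: List[str], right: List[str]) -> int:
    """Number of variable occurrences in the factorised form."""
    return len(left) + len(right)


left = ["r1", "r2", "r3"]
right = ["s1", "s2"]
valuation = {"r1": 1, "r2": 2, "r3": 3, "s1": 4, "s2": 5}

flat = flat_polynomial(left, right)
# Both forms denote the same polynomial, so they agree on every valuation.
same_value = evaluate_flat(flat, valuation) == evaluate_factorised(left, right, valuation)
# The flat form needs 2*n*m variable occurrences, the factorised form n+m.
sizes = (size_flat(flat), size_factorised(left, right))
```

For n and m variables the flat form has 2nm variable occurrences while the factorised form has n + m, so the gap widens quickly as the groups grow.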
A Fine-Grained Workflow Model with Provenance-Aware Security Views
In this paper we propose a fine-grained workflow model, based on context-free graph grammars, in which the dependency relation between the inputs and outputs of a module is explicitly specified as a bipartite graph. Using this model, we develop an access control mechanism that supports provenance-aware security views. Our security model not only protects sensitive data and modules from unauthorized access, but also provides the flexibility to expose correct or partially correct data dependency relationships within the provenance information.
Provenance Query Patterns for Many-Task Scientific Computing
Provenance information enables the analysis of large-scale many-task computations, which are often specified as scientific workflows. It allows one to determine how each resulting data set was derived from other data sets and applications. In this work, we survey queries used for exploring provenance information about many-task computations. We present a set of patterns that can be identified in these queries, which is being used as a basis for the design and implementation of a provenance management system for many-task scientific computations, integrated with the Swift parallel scripting system. The system has a data model similar to the Open Provenance Model, with extensions that enrich core structural provenance data, represented as consumption and production relationships between applications and data sets, with information about the runtime behavior of each application and domain-specific information such as the scientific parameters used by applications.
Compressing Provenance Graphs
The provenance community has built a number of systems to collect provenance, most of which assume that provenance will be retained indefinitely. However, it is not cost-effective to retain provenance information inefficiently. Since provenance can be viewed as a graph, we note the similarities to web graphs and draw upon techniques from the web compression domain to provide our own novel and improved graph compression solutions for provenance graphs. Our preliminary results show that adapting web compression techniques results in a compression ratio of 2.12:1 to 2.71:1, which we can improve upon to reach ratios of up to 3.31:1.
Bringing Provenance to Its Full Potential Using Causal Reasoning
Provenance information is often used to explain query results and outcomes, exploit results of prior reasoning, and establish trust in data. The generality of the notion makes it applicable in a variety of domains, including data warehousing [7], curated databases [4], and various scientific applications. The recent introduction of causal reasoning in a database setting exploits provenance in ways that expand its applicability to more complex problems and establish new directions, making a step towards achieving provenance's full potential. In this paper we explore through a variety of examples how causality improves on provenance information, discuss the challenges of building causality-enabled systems, and propose some new directions.
Default-all is dangerous!
We show that the default-all propagation scheme for database annotations is dangerous. Dangerous here means that it can propagate annotations to the query output which are semantically irrelevant to the query the user asked. This is the result of considering all relationally equivalent queries and returning the union of their where-provenance in an attempt to define a propagation scheme that is insensitive to query rewriting.
We propose an alternative query-rewrite-insensitive (QRI) where-provenance called minimal propagation. It is analogous to the minimal witness basis for why-provenance, is straightforward to evaluate, and returns all relevant and only relevant annotations.
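For readers unfamiliar with the minimal witness basis that the abstract uses as its analogy, the sketch below computes it from a collection of witnesses, keeping only those that contain no smaller witness. It illustrates the why-provenance notion only, not the proposed minimal propagation scheme; the function name and the tuple identifiers are assumptions for this example.

```python
from typing import FrozenSet, Iterable, Set


def minimal_witness_basis(witnesses: Iterable[Iterable[str]]) -> Set[FrozenSet[str]]:
    """Keep the witnesses that are minimal under set inclusion."""
    ws = {frozenset(w) for w in witnesses}
    # A witness is dropped if some other witness is a proper subset of it.
    return {w for w in ws if not any(other < w for other in ws)}


# Hypothetical witnesses for one output tuple, as sets of input tuple ids.
witnesses = [{"t1"}, {"t1", "t2"}, {"t2", "t3"}]
basis = minimal_witness_basis(witnesses)
```

Here `{"t1", "t2"}` is discarded because `{"t1"}` alone already suffices to produce the output tuple, while `{"t2", "t3"}` is kept because neither of its proper subsets is a witness.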
Provenance for System Troubleshooting
System administrators use a variety of techniques to track down and repair (or avoid) problems that occur in the systems under their purview. Reviewing log files, cross-correlating events on different machines, establishing liveness and performance monitors, and automating configuration procedures are just a few of the approaches used to stave off entropy. These efforts are often stymied by the presence of hidden dependencies between components in a system (e.g., processes, pipes, and files). In this paper we argue that system-level provenance can help expose these dependencies, giving system administrators a more complete picture of component interactions and thus easing the task of troubleshooting.
Reexamining Some Holy Grails of Data Provenance
We reconsider some of the explicit and implicit properties that underlie well-established definitions of data provenance semantics. Previous work on comparing provenance semantics has mostly focused on expressive power (does the provenance generated by a certain semantics subsume the provenance generated by other semantics?) and on understanding whether a semantics is insensitive to query rewrite (i.e., do equivalent queries have the same provenance?). In contrast, we try to investigate why certain semantics possess specific properties (like insensitivity) and whether these properties are always desirable. We present a new property, stability with respect to query language extension, that, to the best of our knowledge, has not been isolated and studied on its own.
Challenges in Managing Implicit and Abstract Provenance Data: Experiences with ProvManager
Running scientific workflows in distributed and heterogeneous environments has motivated the definition of provenance gathering approaches that are loosely coupled to workflow management systems. We have developed a provenance management system named ProvManager to manage provenance data in distributed and heterogeneous environments independently of any specific Scientific Workflow Management System. The experience of using ProvManager in real workflow applications has revealed many provenance management issues that are not addressed in current related work. We have faced challenges such as the necessity of dealing with implicit provenance data and the lack of higher provenance abstraction levels. This paper discusses these challenges and points to directions for addressing them, contextualizing them according to our experience in developing ProvManager.