
Towards Semantics for Provenance Security

Stephen Chong
School of Engineering and Applied Sciences
Harvard University

Abstract: Provenance records the history of data. Careless use of provenance may violate the security policies of data. Moreover, the provenance itself may be sensitive information, necessitating restrictions on the use of both data and provenance to enforce security requirements. This paper proposes extensional semantic definitions for provenance security. The semantic definitions require that provenance information released to the user does not reveal confidential data, and that neither the provenance information given to the user nor the program’s output reveals sensitive provenance information.

1  Introduction

The interaction between information security and provenance is complex, and not well understood. When data manipulated by a system contains confidential information, careless use of provenance can lead to violations of information security. When some or all of the provenance information itself is confidential, care must be taken with the use of both data and provenance information to ensure appropriate security is enforced.

It is an open question what constitutes “appropriate security” for provenance. Previous work has considered access control to provenance information (e.g., [4,1,2,7,6]), but it is unclear what security guarantees this provides. Preventing unauthorized access to confidential provenance is insufficient to ensure that confidential provenance remains confidential. Also, permitting access to (non-confidential) provenance may reveal information about confidential data.

In this paper, we propose extensional semantic definitions for provenance security. We require that provenance information released to the user does not reveal confidential data, and that neither the provenance information given to the user nor the program’s output reveals sensitive provenance information.

Data and provenance can have different security requirements. For example, at the end of the peer review process, authors are allowed to know the contents of the reviews, but are typically not allowed to know the identity of the reviewers. Thus, the data (the reviews) are public, but the provenance (who wrote the reviews) is confidential. Another example, due to Braun et al. [1], is a letter of recommendation, where the subject of the letter is typically not allowed to know the contents, but is permitted to know the author. In this case, the data (the contents of the letter) are confidential, but the provenance (who authored the letter) is public.

Many applications manipulate both public and confidential data, and data with different provenance security requirements. In such settings, it is not always clear what provenance information a given user is allowed to access. Some supposedly public provenance may reveal confidential data, and some supposedly public data may reveal confidential provenance. This work provides a semantic basis for determining whether provenance information provided to a user satisfies the security policies of provenance and data.

We build on the work of Cheney, Acar, and Ahmed [3], who propose provenance traces as a semantic foundation for provenance in databases. Provenance traces explain a program’s execution by showing how the output was produced from the inputs. However, a complete explanation of a program’s execution may violate the security of data or provenance: it may reveal confidential data or confidential provenance. Thus, to respect the security of data and provenance, it may not be possible to provide a full provenance trace to a user. We allow parts of a provenance trace to be elided, and define conditions to ensure that the provenance trace contains as much information as possible, while still respecting the security of both data and provenance.

The rest of this paper is structured as follows. Section 2 presents syntax and semantics of a simple language, and defines provenance traces for it. Section 3 defines security policies for data and provenance. Section 4 defines semantic conditions for provenance security. Section 5 discusses future work.

2  Language

We use a simple calculus that is sufficient to investigate issues in provenance security. We expect the results of this paper to apply in a straightforward manner to richer languages. Programs c include let-commands, conditionals, and expressions e. We assume that expressions contain variables x, literal values v, and locations l, but leave them otherwise unspecified.

     
  c ::= e ∣ let x = c0 in c1 ∣ if x then c0 else c1
  e ::= x ∣ v ∣ l ∣ …
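
To fix intuitions, this grammar can be transcribed into Haskell datatypes. The following sketch is illustrative only: the paper leaves expressions largely unspecified, so the Xor form below (used in the examples of Section 4) is a hypothetical instance, and all constructor names are ours.

    import qualified Data.Map as Map
    import Data.Bits (xor)

    type Loc   = Int
    type Val   = Int
    type Store = Map.Map Loc Val   -- stores map locations to values

    -- Expressions: variables, literal values, locations, and (hypothetically) xor.
    data Expr = Var String | Lit Val | Ref Loc | Xor Expr Expr
      deriving (Eq, Show)

    -- Commands: expressions, let-commands, and conditionals. The test of an If
    -- is a Var in source programs and becomes a Ref after substitution.
    data Cmd = E Expr
             | Let String Cmd Cmd      -- let x = c0 in c1
             | If Expr Cmd Cmd         -- if x then c0 else c1
      deriving (Eq, Show)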

 
  l ∉ dom(σ)
  ──────────────────────
  σ, l ⇐ e ⇓ σ[l ↦ σ(e)]

  l′ ∉ dom(σ)    σ, l′ ⇐ c0 ⇓ σ′    σ′, l ⇐ c1[l′/x] ⇓ σ″
  ─────────────────────────────────────────────────────────
  σ, l ⇐ let x = c0 in c1 ⇓ σ″

  σ, l ⇐ ci ⇓ σ′    where i = 0 if σ(l′) ≠ 0, and i = 1 otherwise
  ────────────────────────────────────────────────────────────────
  σ, l ⇐ if l′ then c0 else c1 ⇓ σ′

Figure 1: Operational semantics

We define a large-step operational semantics for the language. The operational semantics uses stores, which are functions from locations to values. Following the notation of Cheney et al. [3], the judgment σ, l ⇐ c ⇓ σ′ means that given initial store σ, the program c evaluates to store σ′, and the result of the program is σ′(l); we call l the output location, as this is where the result is stored. Inference rules for the semantics are given in Figure 1. We write σ(e) for the result of evaluating e with all locations l replaced by σ(l). We write σ[l ↦ v] for the store that maps location l to value v and otherwise behaves exactly like σ. We write c[l/x] for the result of substituting location l for every occurrence of free variable x in program c.

Note that a value stored in a location is never overwritten: if σ, l ⇐ c ⇓ σ′ then l ∉ dom(σ) and, for all l′ ∈ dom(σ), σ(l′) = σ′(l′). Also, c will read the value of location l in initial store σ only if l appears in c. We write loc(c) for the set of locations that appear in c, and refer to loc(c) as the input locations of c.
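
A minimal executable sketch of the judgment σ, l ⇐ c ⇓ σ′, building on the datatypes above, appears below. The rules only require fresh locations to lie outside dom(σ); choosing them deterministically, as freshLoc does, is our own assumption (it also lets the consistency checker of Section 2.1 replay the same choices).

    -- sigma(e): evaluate a closed expression under store s.
    evalE :: Store -> Expr -> Val
    evalE _ (Lit v)   = v
    evalE s (Ref l)   = s Map.! l
    evalE s (Xor a b) = evalE s a `xor` evalE s b
    evalE _ (Var x)   = error ("unbound variable " ++ x)

    -- c[l/x]: substitute location l for free occurrences of variable x.
    subst :: Loc -> String -> Cmd -> Cmd
    subst l x (E e)         = E (substE l x e)
    subst l x (Let y c0 c1) = Let y (subst l x c0)
                                    (if y == x then c1 else subst l x c1)
    subst l x (If e c0 c1)  = If (substE l x e) (subst l x c0) (subst l x c1)

    substE :: Loc -> String -> Expr -> Expr
    substE l x (Var y)   = if y == x then Ref l else Var y
    substE l x (Xor a b) = Xor (substE l x a) (substE l x b)
    substE _ _ e         = e

    -- The judgment: evaluate c, storing its result at output location l
    -- (assumed fresh for the initial store, as the rules require).
    eval :: Store -> Loc -> Cmd -> Store
    eval s l (E e)         = Map.insert l (evalE s e) s
    eval s l (Let x c0 c1) = let l' = freshLoc l s      -- l' not in dom(s)
                                 s' = eval s l' c0
                             in  eval s' l (subst l' x c1)
    eval s l (If (Ref l') c0 c1)
      | s Map.! l' /= 0    = eval s l c0
      | otherwise          = eval s l c1
    eval _ _ (If _ _ _)    = error "test must be a location after substitution"

    -- A fresh location above both the output location and all allocations.
    freshLoc :: Loc -> Store -> Loc
    freshLoc out s
      | Map.null s = out + 1
      | otherwise  = 1 + max out (fst (Map.findMax s))

For example, eval (Map.fromList [(0, 42)]) 100 (If (Ref 0) (E (Lit 1)) (E (Lit 2))) yields a store whose location 100 holds 1.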

2.1  Provenance traces

A provenance trace summarizes the execution of a program. The provenance traces used in this paper are similar to those of Cheney et al. [3], but we permit parts of the trace to be elided, indicated with the symbol ⋆. The syntax of provenance traces is:

     
  T ::= l ← e ∣ T0; T1 ∣ cond(l, v, T)
      ∣ l ← ⋆ ∣ cond(⋆, v, T) ∣ cond(l, ⋆, ⋆) ∣ ⋆
           

Assignment trace l ← e records that new location l was created, and that the value assigned to l was the result of evaluating expression e. Partial assignment trace l ← ⋆ records that new location l was created, but does not record the provenance of the value stored there. Sequential composition trace T0; T1 records that first trace T0 was performed, and then trace T1. Conditional trace cond(l, v, T) records that a conditional command was executed: location l was evaluated to determine which branch to take, v ∈ {0, 1} indicates which branch was taken, and trace T records the evaluation of the branch. The partial conditional trace cond(⋆, v, T) records which branch was taken, and the trace for that branch, but does not record which location was used as the test. Similarly, the partial conditional trace cond(l, ⋆, ⋆) records which location was used as the test, but not the result of the test or the trace for the branch taken. Finally, the unknown trace ⋆ records no information about the execution.
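
This syntax transcribes directly to a datatype; the constructor names are ours, with Star standing for ⋆.

    -- Provenance traces, with elided parts represented by Star (the paper's ⋆).
    data Trace = Assign Loc Expr         -- l <- e
               | AssignStar Loc          -- l <- *
               | Seq Trace Trace         -- T0; T1
               | Cond Loc Int Trace      -- cond(l, v, T)
               | CondStarTest Int Trace  -- cond(*, v, T)
               | CondStarBranch Loc      -- cond(l, *, *)
               | Star                    -- *
      deriving (Eq, Show)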


    
  ───────────────────────
  σ, l ⇐ e ⇓ σ′ ⊨ l ← e

  ───────────────────────
  σ, l ⇐ e ⇓ σ′ ⊨ l ← ⋆

  σ, l′ ⇐ c0 ⇓ σ′ ⊨ T    σ′, l ⇐ c1[l′/x] ⇓ σ″ ⊨ T′
  ───────────────────────────────────────────────────
  σ, l ⇐ let x = c0 in c1 ⇓ σ″ ⊨ T; T′

  σ, l ⇐ cv ⇓ σ′ ⊨ T    where v = 0 if σ(l′) ≠ 0, and v = 1 otherwise
  ─────────────────────────────────────────────────────────────────────
  σ, l ⇐ if l′ then c0 else c1 ⇓ σ′ ⊨ cond(l′, v, T)

  σ, l ⇐ cv ⇓ σ′ ⊨ T    where v = 0 if σ(l′) ≠ 0, and v = 1 otherwise
  ─────────────────────────────────────────────────────────────────────
  σ, l ⇐ if l′ then c0 else c1 ⇓ σ′ ⊨ cond(⋆, v, T)

  σ, l ⇐ cv ⇓ σ′ ⊨ T    where v = 0 if σ(l′) ≠ 0, and v = 1 otherwise
  ─────────────────────────────────────────────────────────────────────
  σ, l ⇐ if l′ then c0 else c1 ⇓ σ′ ⊨ cond(l′, ⋆, ⋆)

  ─────────────────────
  σ, l ⇐ c ⇓ σ′ ⊨ ⋆

Figure 2: Trace consistency σ, l ⇐ c ⇓ σ′ ⊨ T


Given an evaluation σ, l ⇐ c ⇓ σ′ and a provenance trace T, the trace consistency judgment σ, l ⇐ c ⇓ σ′ ⊨ T states that T is consistent with the evaluation. Figure 2 presents inference rules for the judgment. Intuitively, the elided trace ⋆ is consistent with any evaluation, the partial assignment trace l ← ⋆ is consistent with any assignment to location l, and the partial conditional traces cond(⋆, v, T) and cond(l, ⋆, ⋆) are consistent with conditional commands regardless of, respectively, the location tested and the branch taken. Assignment trace l ← e is consistent only with the evaluation of expression e into location l. Conditional trace cond(l, v, T) is consistent with conditional command if l then c0 else c1 provided that branch cv was evaluated and T is consistent with that evaluation. Finally, trace T; T′ is consistent with command let x = c0 in c1 if T is consistent with the evaluation of c0 and T′ is consistent with the evaluation of c1, where variable x is replaced with the output location for the evaluation of c0.
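
Consistency can be checked by replaying the evaluation alongside the trace. The sketch below reuses eval, subst, and freshLoc from Section 2 and inherits their deterministic choice of fresh locations; it is illustrative, not part of the formal development.

    -- Checks sigma, l <= c ⇓ sigma' ⊨ T against the replayed evaluation.
    consistent :: Store -> Loc -> Cmd -> Trace -> Bool
    consistent _ _ _ Star = True                   -- ⋆ matches any evaluation
    consistent _ l (E e) (Assign l' e')  = l' == l && e' == e
    consistent _ l (E _) (AssignStar l') = l' == l
    consistent s l (Let x c0 c1) (Seq t0 t1) =
      let l' = freshLoc l s
          s' = eval s l' c0
      in  consistent s l' c0 t0 && consistent s' l (subst l' x c1) t1
    consistent s l (If (Ref l') c0 c1) t =
      let v  = if s Map.! l' /= 0 then 0 else 1    -- which branch was taken
          cv = if v == 0 then c0 else c1
      in case t of
           Cond l'' v' t'     -> l'' == l' && v' == v && consistent s l cv t'
           CondStarTest v' t' -> v' == v && consistent s l cv t'
           CondStarBranch l'' -> l'' == l'
           _                  -> False
    consistent _ _ _ _ = False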

Trace consistency is similar to consistency as defined by Cheney et al. [3]: both require that the trace describe what happened during the execution. Cheney et al. [3] also define fidelity, which requires that traces contain enough information to describe how the program would have executed with different inputs. We ignore fidelity in this paper: traces with parts elided for security purposes may not contain enough information to fully reconstruct the execution of a program, and thus will not satisfy fidelity.¹


   
  T ≤ T        ⋆ ≤ T        l ← ⋆ ≤ l ← e        cond(l, ⋆, ⋆) ≤ cond(l, v, T)

  T0 ≤ T1
  ─────────────────────────────────
  cond(l, v, T0) ≤ cond(l, v, T1)

  T0 ≤ T1
  ─────────────────────────────────
  cond(⋆, v, T0) ≤ cond(l, v, T1)

  T0 ≤ T1    T0′ ≤ T1′
  ───────────────────────
  T0; T0′ ≤ T1; T1′

Figure 3: Information order ≤ on traces


We define an information order ≤ on traces. Inference rules for the relation are given in Figure 3. Intuitively, if T0 ≤ T1 then T0 contains less information about an evaluation than T1. The following theorem shows that if T0 contains less information than T1, and T1 is consistent with an evaluation, then so is T0.

Theorem 1   Given evaluation σ, l ⇐ c ⇓ σ′ and traces T0 and T1 such that T0 ≤ T1, if σ, l ⇐ c ⇓ σ′ ⊨ T1 then σ, l ⇐ c ⇓ σ′ ⊨ T0.
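
The order is a simple recursive predicate over the Trace type above; the final clause implements reflexivity. Theorem 1 can then be exercised against the sketches, e.g., on randomly generated traces.

    -- The information order T0 ≤ T1 of Figure 3.
    leq :: Trace -> Trace -> Bool
    leq Star _                            = True
    leq (AssignStar l) (Assign l' _)      = l == l'
    leq (Seq a b) (Seq a' b')             = leq a a' && leq b b'
    leq (Cond l v t) (Cond l' v' t')      = l == l' && v == v' && leq t t'
    leq (CondStarTest v t) (Cond _ v' t') = v == v' && leq t t'
    leq (CondStarBranch l) (Cond l' _ _)  = l == l'
    leq t t'                              = t == t'   -- reflexivity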

We say that input location l affects the result of program c if the result produced by c depends on the value stored in location l. That is, changing the value stored in l has the potential to change the result of the program.

Definition 1   Given initial store σ and location l ∈ dom(σ) ∩ loc(c), l affects the result of program c if there exist values v0 and v1 such that σ[l ↦ v0], l′ ⇐ c ⇓ σ′0 and σ[l ↦ v1], l′ ⇐ c ⇓ σ′1 and σ′0(l′) ≠ σ′1(l′).

We write depend(c,σ) for the set of all locations in σ that affect the result of c.

A location may appear in a (consistent) provenance trace of an evaluation of c even if it does not affect the result of c. For example, in the program l0 × 0 + l1, location l0 does not affect the result, and l1 does. However, both locations appear in the trace l2 ← l0 × 0 + l1, which is consistent with the program’s evaluation.
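
Definition 1 quantifies over all values, so it is not directly computable; the sketch below bounds the search to a small candidate value set (an assumption made purely for illustration) and makes the output location an explicit parameter.

    -- Bounded version of Definition 1: does l affect the result of c?
    affects :: Store -> Loc -> Cmd -> Loc -> Bool
    affects s outL c l = or [ result v0 /= result v1 | v0 <- vals, v1 <- vals ]
      where
        vals     = [0, 1]   -- assumption: finite candidate values
        result v = eval (Map.insert l v s) outL c Map.! outL

    -- depend(c, sigma), relative to output location outL.
    depend :: Cmd -> Store -> Loc -> [Loc]
    depend c s outL = [ l | l <- Map.keys s, affects s outL c l ]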

3  Security policies

In this section we define security policies for locations. The security policies declare whether the data stored in a location is confidential data or public data, and whether the location has confidential provenance or public provenance. A location has confidential provenance if it should not be known whether this location affects the result of the computation.

We assume a set of security levels S with a partial order ⊑. If s0 ⊑ s1 then security level s1 is at least as restrictive as security level s0. For the rest of the paper we use the two-element lattice S = {L, H} where L ⊑ H. Level H represents high confidentiality, or secret, and level L represents low confidentiality, or public. The results generalize to arbitrary sets S.

Policies for locations are of the form s0 loc s1, where s0, s1 ∈ S. The level s0 is the data security level: the security level to enforce on the contents of the location. The level s1 is the provenance security level: the security level to enforce on information about whether the location affects the result of the evaluation.

A security context for program c is a function Γ:loc(c) → Policy that assigns security policies to the input locations of c. In this paper, we are concerned with enforcing security policies of input locations; determining and enforcing security policies on intermediate and output locations is left to future work.

For convenience we use Γ to define some sets of locations. The set DataLowΓ contains the locations in dom(Γ) with data security level L; the set DataHighΓ contains the locations in dom(Γ) with data security level H (equivalently, the locations not in DataLowΓ). The set ProvHighΓ contains the locations in dom(Γ) with provenance security level H.

     
  DataLowΓ  = { l ∈ dom(Γ) ∣ Γ(l) = L loc s, s ∈ S }
  DataHighΓ = { l ∈ dom(Γ) ∣ l ∉ DataLowΓ }
  ProvHighΓ = { l ∈ dom(Γ) ∣ Γ(l) = s loc H, s ∈ S }
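
These policies and sets transcribe as follows; Level, Policy, and Context are our names for the paper's level set S, the policies s0 loc s1, and security contexts Γ.

    -- Security levels, location policies s0 loc s1, and security contexts.
    data Level  = L | H deriving (Eq, Show)
    data Policy = Policy { dataLevel :: Level, provLevel :: Level }
      deriving (Eq, Show)

    type Context = Map.Map Loc Policy

    dataLow, dataHigh, provHigh :: Context -> [Loc]
    dataLow  g = [ l | (l, p) <- Map.toList g, dataLevel p == L ]
    dataHigh g = [ l | (l, p) <- Map.toList g, dataLevel p == H ]
    provHigh g = [ l | (l, p) <- Map.toList g, provLevel p == H ]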

4  Provenance security

Given program c, initial store σ, and security context Γ, we assume that there is a low observer: an agent that can observe the contents of any input location l ∈ DataLowΓ. The low observer does not directly observe the evaluation of c, but given evaluation σ, l ⇐ c ⇓ σ′, the low observer can observe σ′(l), the content of output location l. The observer is then given a provenance trace T that is consistent with the evaluation σ, l ⇐ c ⇓ σ′. (The observer could be given the entire provenance trace T, or T could be used to determine the access control policy for the observer’s access to provenance.)

To ensure the security policies of input locations are respected, we must define security requirements that ensure neither the program result nor the provenance trace reveal anything about the values stored in locations DataHighΓ or which locations in ProvHighΓ affect the result of the program. Ideally, the trace T should be maximal in the information ordering ≤ while satisfying the security requirements.

In this paper we propose extensional security requirements that restrict what information is revealed by the provenance trace, and that ensure the program result respects the provenance security policies. We do not consider what information the program result reveals about confidential input values; much work in language-based security considers this problem [5].

A naive attempt to ensure that the trace T does not reveal which locations in ProvHighΓ affect the result is to require that T does not contain any location in ProvHighΓ, but allow it to contain any other location, i.e., to require that loc(T) ∩ ProvHighΓ = ∅, where loc(T) is the set of locations that appear in trace T. This is analogous to using access control to prevent access to parts of the provenance marked as sensitive, but allowing all other accesses.
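
This naive requirement is easy to state over the Trace type (locsT is our transcription of loc(T)); the example below shows why it is inadequate.

    -- loc(T): the locations appearing in a trace.
    locsT :: Trace -> [Loc]
    locsT (Assign l e)       = l : locsE e
    locsT (AssignStar l)     = [l]
    locsT (Seq t0 t1)        = locsT t0 ++ locsT t1
    locsT (Cond l _ t)       = l : locsT t
    locsT (CondStarTest _ t) = locsT t
    locsT (CondStarBranch l) = [l]
    locsT Star               = []

    locsE :: Expr -> [Loc]
    locsE (Ref l)   = [l]
    locsE (Xor a b) = locsE a ++ locsE b
    locsE _         = []

    -- The naive requirement: loc(T) ∩ ProvHigh is empty.
    naiveProvCheck :: Context -> Trace -> Bool
    naiveProvCheck g t = not (any (`elem` provHigh g) (locsT t))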

However, consider the following program.

let x = l0 in
if x then (l1 xor l2) else (l1 xor l3)

where the policies of locations are

     
  Γ(l0) = L loc L      Γ(l1) = H loc L
  Γ(l2) = H loc H      Γ(l3) = H loc H.

Note that the low observer is allowed to know the value stored in location l0, but not the values stored in other locations. The low observer is permitted to learn whether l1 affects the result, but not whether l2 or l3 do. Suppose that the value in location l0 is 42. Then trace T = l4 ← l0; cond(l4, 0, ⋆) is consistent with the program evaluation, and does not contain any locations from ProvHighΓ. However, given trace T (and assuming the observer knows the program text), the low observer knows which branch of the if command was taken, and thus can infer that location l2 affects the result, violating the security policy of l2. Thus we reject loc(T) ∩ ProvHighΓ = ∅ as a suitable security requirement.

We propose instead provenance security, which requires that for any location l ∈ ProvHighΓ, the provenance trace T is consistent with two evaluations: one in which l affects the result, and one in which l does not. The two evaluations must have the same values in the low-observable input locations, and must produce the same result. This ensures that neither the provenance trace nor the result reveals which locations in ProvHighΓ affect the result.

Definition 2 (Provenance security)   Provenance trace T satisfies provenance security for σ0, l ⇐ c ⇓ σ′0 exactly when:
•  σ0, l ⇐ c ⇓ σ′0 ⊨ T, and
•  for all l′ ∈ ProvHighΓ there is a store σ1 such that
   –  σ0(l″) = σ1(l″) for all l″ ∈ DataLowΓ, and
   –  σ1, l ⇐ c ⇓ σ′1 and σ′0(l) = σ′1(l), and
   –  σ1, l ⇐ c ⇓ σ′1 ⊨ T, and
   –  l′ ∈ depend(c, σ0) ⇐⇒ l′ ∉ depend(c, σ1).

For any location l ∈ ProvHighΓ, provenance security provides plausible deniability: if l affects the result, then there is an evaluation in which l does not affect the result, and that evaluation produces the same result, is consistent with trace T, and has the same values in the low-observable input locations. Similarly, if l does not affect the result, then there is an evaluation in which l does affect the result, and that evaluation likewise produces the same result, is consistent with trace T, and has the same values in the low-observable input locations.

Returning to the example above, we see that the trace T = l4 ← l0; cond(l4, 0, ⋆) does not satisfy provenance security for the program let x = l0 in if x then (l1 xor l2) else (l1 xor l3), since only evaluations that execute the l1 xor l2 branch are compatible with T, and thus T reveals that the result depends on l2. Indeed, no trace satisfies provenance security for any evaluation of this command, since l0 ∈ DataLowΓ, and so the low observer always knows which branch is executed.
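
Definition 2 quantifies over all stores σ1, so any implementation must approximate. As a purely illustrative sketch, the bounded check below searches candidate stores that agree with σ0 on DataLow locations and range over a small value set elsewhere, reusing consistent, depend, and the policy sets from the earlier sketches.

    -- Bounded check of Definition 2 (approximate: values range over vals only).
    provenanceSecure :: Context -> Store -> Loc -> Cmd -> Trace -> Bool
    provenanceSecure g s0 outL c t =
      consistent s0 outL c t && all witnessed (provHigh g)
      where
        vals  = [0, 1]                          -- assumption: candidate values
        lows  = dataLow g
        highs = [ l | l <- Map.keys s0, l `notElem` lows ]
        out0  = eval s0 outL c Map.! outL
        -- Every store agreeing with s0 on DataLow locations, over vals elsewhere.
        cands = [ Map.union (Map.fromList hv) s0
                | hv <- mapM (\l -> [ (l, v) | v <- vals ]) highs ]
        witnessed l' = any ok cands
          where
            ok s1 = eval s1 outL c Map.! outL == out0    -- same result
                    && consistent s1 outL c t            -- consistent with T
                    && (l' `elem` depend c s0 outL)      -- dependence flips
                       /= (l' `elem` depend c s1 outL)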

Implicit in provenance security is the conservative assumption that the low observer knows the program c that was evaluated. If we make the stronger (and potentially dangerous²) assumption that the low observer does not know some parts of the program, then we can use a weaker security condition: for every location in ProvHighΓ there is both an initial store σ1 and a program c′ such that the evaluation of c′ from initial store σ1 satisfies the appropriate conditions, and c′ is appropriately similar to c. Depending on the application, this weaker security condition may be appropriate. For example, if the program is executing on a trusted server, and the observer has access to neither the source code nor the binary executable, then it may be reasonable to assume that the observer does not know the program being evaluated, and cannot distinguish programs that take approximately the same amount of time to execute.

Provenance security ensures that neither the result nor the provenance trace reveals information about which locations in ProvHighΓ affect the result. Additionally, the provenance trace should not inappropriately reveal secret input data. For example, consider the program

let x = l0 mod 2 in
if x then 0 else 0

where Γ(l0) = H loc L. Note that the output of the program does not reveal the secret input data stored in location l0. However, the trace T = l1 ← l0 mod 2; cond(l1, 1, ⋆) is consistent with an evaluation of the program and satisfies provenance security, but inappropriately reveals that the value stored in l0 is even.

We propose the security condition data security to enforce that provenance does not reveal confidential data. Data security requires that a provenance trace does not reveal anything about values stored in DataHighΓ locations: trace T must be consistent with all evaluations from initial states with identical values in low-observable input locations.

Definition 3 (Data security)   Provenance trace T satisfies data security for σ0, l ⇐ c ⇓ σ′0 exactly when:
•  σ0, l ⇐ c ⇓ σ′0 ⊨ T, and
•  for all σ1 such that σ0(l′) = σ1(l′) for all l′ ∈ DataLowΓ, if σ1, l ⇐ c ⇓ σ′1 then σ1, l ⇐ c ⇓ σ′1 ⊨ T.

Like provenance security, data security provides plausible deniability: for any evaluation of the command that is consistent with trace T, there is another evaluation that is also consistent with T where the public input values are the same, but the secret input values are completely different. Returning to the example above, the trace T = l1 ← l0 mod 2; cond(l1, ⋆, ⋆) satisfies data security for any evaluation of the command let x = l0 mod 2 in if x then 0 else 0.
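
Data security admits the same kind of bounded, illustrative approximation: every candidate store that agrees with σ0 on DataLow locations must yield an evaluation consistent with T.

    -- Bounded check of Definition 3 (approximate: values range over vals only).
    dataSecure :: Context -> Store -> Loc -> Cmd -> Trace -> Bool
    dataSecure g s0 outL c t =
      consistent s0 outL c t && all (\s1 -> consistent s1 outL c t) cands
      where
        vals  = [0, 1]                          -- assumption: candidate values
        lows  = dataLow g
        highs = [ l | l <- Map.keys s0, l `notElem` lows ]
        cands = [ Map.union (Map.fromList hv) s0
                | hv <- mapM (\l -> [ (l, v) | v <- vals ]) highs ]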

5  Future work

The ultimate goal of this work is to provide precise, useful, and intuitive semantic definitions of provenance security, and efficient mechanisms for enforcing them. This paper proposes some semantic definitions for provenance security. Much work remains, including constructive techniques for producing maximal provenance traces that satisfy the security conditions, and proof techniques for establishing provenance security and data security for all possible evaluations of a program. Extensions to this work include defining other elisions of provenance traces, and considering the impact of treating provenance as a first-class entity, allowing the program to inspect provenance as it executes.

This work is compatible with the use of (discretionary) access control to restrict users’ access to provenance: it provides a semantic basis for determining the access control policy for provenance information.

In many provenance-aware applications intermediate results will persist after program execution terminates. As such, it is important to determine appropriate security policies and semantic security requirements for intermediate results. Currently, this work considers specification and enforcement of security only for input locations, and assumes the value in the output location is observable by the low observer.

Acknowledgments

Thanks to Kim Bruce, Melissa O’Neill, and Chris Stone for giving useful feedback on an earlier version of this work. This paper was written while visiting the Computer Science Departments at Pomona College and Harvey Mudd College. Thanks also to the anonymous reviewers for their useful comments.

References

[1]
U. Braun, A. Shinnar, and M. Seltzer. Securing provenance. In Proceedings of the 3rd USENIX Workshop on Hot Topics in Security, 2008.
[2]
A. Chebotko, S. Chang, S. Lu, F. Fotouhi, and P. Yang. Secure scientific workflow provenance querying with security views. In Proceedings of the Ninth International Conference on Web-Age Information Management, pages 349–356. IEEE Computer Society, 2008.
[3]
J. Cheney, U. Acar, and A. Ahmed. Provenance traces. arXiv:0812.0564v1, December 2008.
[4]
R. Hasan, R. Sion, and M. Winslett. The case of the fake Picasso: Preventing history forgery with secure provenance. In Proceedings of the 7th USENIX Conference on File and Storage Technologies, June 2009.
[5]
A. Sabelfeld and A. Myers. Language-based information-flow security. IEEE Journal on Selected Areas in Communications, 21(1):5–19, Jan. 2003.
[6]
C. Sar and P. Cao. Lineage file system. Online at https://theory.stanford.edu/~cao/lineage.
[7]
V. Tan, P. Groth, S. Miles, S. Jiang, S. Munroe, S. Tsasakou, and L. Moreau. Security issues in a SOA-based provenance system. Lecture Notes in Computer Science, 4145, 2006.

¹ Our disregard of fidelity is also the reason our conditional traces cond(l, v, T) have fewer annotations than those of Cheney et al. [3].
² Security by obscurity is generally regarded as an inadequate mechanism to ensure security.
