Check out the new USENIX Web site. next up previous
Next: Data Consistency Up: The Design of the Previous: Queries, Query-Results and Semantic

   
Scope of Queries and Scope Consistency

In HAC, every query - and its corresponding semantic directory - has a scope which is the set of files over which the query is evaluated. A query does not return symbolic links to files that are outside its scope even if those files match the query. The scope of a query depends on the parent of the corresponding semantic directory. If ..../parent/child is a path such that both parent and child are semantic directories, then the scope of child is defined to be the existing set of symbolic links in parent. This set of symbolic links is also called the scope "provided" by parent. The scope provided by the root of a HAC file system is defined to be all the files in that file system. A change in the scope provided by parent, for example, will also change the scope of child. In this case, we say that child depends on parent. Note that all directories in the file system directly or indirectly depend on the root. By these definitions, the scope provided by a newly created child semantic directory is always a refinement of the scope provided by its parent ([bdbcp:94], [gjso:91]). When a user creates a new semantic subdirectory, HAC guarantees that the new set of symbolic links in that directory is always a subset of the set of the existing symbolic links in its parent. In other words, HAC treats the sets of symbolic links in different semantic directories as separate entities whose contents depend on how these directories are related to each other hierarchically. Hence, semantic directories allow users to organize both files and results of queries in a hierarchical fashion.

Semantic directories also allow users to tune the results of queries according to their personal tastes. HAC interprets the existing set of symbolic links in a semantic directory as its existing (``current'') query result. Since each query-result is a separate entity, users can modify the result of any query by (i) deleting some irrelevant links returned by the query, (ii) creating new links to files that have related information, but were missed by the query, or (iii) adding regular files to that directory. In our fingerprint example, users may want to add a set of C programs implementing fingerprints, email messages from a certain user or about a certain topic, and/or image files to the fingerprint semantic directory, even though these files do not match fingerprint's query. They may also decide that news stories about a certain crime should be removed from fingerprint even though they do match the query. They can do that by making the query more complex (e.g., "fingerprint AND NOT murder"), but often it is easier to remove a few files manually. Users can also build email semantic directories, allowing a message to be in more than one directory (e.g., by sender, receipient, topic, and/or a combination).

No query system is perfect, and currently most are not even close. HAC gives users more power to customize and adjust. It allows users to refine queries by using either the query language or the file system directly. Both methods are valid and being able to apply both at any time makes HAC very powerful and intuitive. But there's a major problem with this freedom. Since HAC allows users to edit the results of queries, it is now possible for them to create a hierarchy of semantic directories that makes sense to them intuitively, but still violates the scope restrictions discussed earlier. We call this the Scope Consistency Problem.

As far as we know, no existing file system has addressed the scope consistency problem. The Semantic File System [gjso:91] and Nebula [bdbcp:94] do not allow users to modify results of queries without modifying the query or the files in the file system. Prospero [neum:92] allows users complete freedom to define and manipulate queries (or ``filters'') and their results, but does not talk about enforcing any kind of consistency when results of queries are arranged in a hierarchy (or a graph). Search systems like Harvest [bdhms:94] and various WWW search engines are geared to bring search results to users, but not to organize results in any meaningful way.

Our approach to this problem in one of the main contributions of this paper. We now describe our solution in detail and show that it gives rise to a powerful new paradigm. To begin with, we classify symbolic links in a semantic directory in three ways (this distinction, for the most part, is hidden from users):

Permanent symbolic links:
links that were explicitly added by the user to the directory.

Transient symbolic links:
links that were obtained by evaluating a query.

Prohibited symbolic links:
links (whether transient or permanent) that were once present in the directory but at some point were explicitly deleted from it by the user. HAC will ensure that these links will not be implicitly added later without a direct action by the user.

Given a semantic directory sd, which is not the root of a HAC file system, we define the scope restriction on the set of symbolic links in sd as the following invariant:

1.
The set of transient symbolic links in sd is always a subset of the scope provided by its parent parent, and

2.
sd should have transient symbolic links to all the files in the scope provided by parent that satisfy sd's query, except for links that are explicitly prohibited in sd.

Changes to sd's scope can lead to a breakup of these invariants, a situation we call scope-inconsistency. This can happen, for example, whenever

1.
a user modifies the set of symbolic links in sd's parent parent,

2.
a user moves sd toa different part of the file system,

3.
there is a change in the scope of parent, or

4.
a user changes the query of sd after he/she creates it. (HAC allows users to access and modify the query associated with a semantic directory.)

A major part of HAC is an algorithm to maintain scope consistency, which is briefly described below. First, HAC uses the CBA mechanism to re-evaluate sd's query on all the files in its current scope. Then, from this result, HAC discards the links that occur in sd's set of permanent and prohibited symbolic links. The links that remain are the new transient symbolic links of sd. Note that HAC does not add a prohibited symbolic link to the above result even if that link points to a file that is in sd's scope and matches its query. Similarly, HAC does not delete a permanent symbolic link from sd even if that link points to a file that is no longer in sd's scope or does not match its query. Also note that HAC re-computes only the set of transient symbolic links of sd -- HAC does not change the set of permanent or prohibited symbolic links associated with sd 3. When the algorithm modifies the set of transient symbolic links in sd, it changes the scope provided by sd. Hence, the algorithm will also re-evaluate the queries of all the directories which directly or indirectly depend on sd. These are the descendents of sd and are present in the sub-tree rooted at sd. Any top-down traversal of this sub-tree (e.g., a breadth-first search) gives us the order in which we must re-evaluate the queries.

We decided to define the set of transient symbolic links in sd to be a refinement of the scope provided by its parent parent. We rejected the idea of defining this set to be, say, the union of the transient symbolic links in sd and the scope provided by all its children. (In this case, sd will depend on its children, not the other way round.) If we use this definition, users can never add a symbolic link sl to a child of sd such that sl does not automatically belong to the scope provided by sd. In other words, we cannot take care of the possibility that some information cannot be classified in a strict hierarchical fashion. This is unacceptable. We also rejected the idea of defining the set of transient symbolic links in sd to be the union of the transient symbolic links in sd and all its children, since in that case, changes to the set of transient links in a child semantic directory can effect the set of transient links in a parent. This is counter-intuitive since in hierarchical file systems, changes to the contents of a subdirectory do not effect the contents of its parent in any way.

To conclude: we allow users to edit and fine-tune the results of queries without modifying the query since we feel that the query of a semantic directory is not as important as the set of symbolic links in it. The query is just a quick first step to obtain more or less the information users are looking for. On the other hand, the set of symbolic links in a semantic directory may be the result of many (possibly time-consuming) browsing and editing steps. Hence, HAC does not modify this set unless it is explicitly asked to do so. Moreover, with this design, HAC is responsible only for the transient symbolic links in the file system, while users are responsible for all the permanent and prohibited symbolic links. HAC gives advice and help -- users decide how to organize their file system.


next up previous
Next: Data Consistency Up: The Design of the Previous: Queries, Query-Results and Semantic
Burra Gopal
1999-01-04