
   
Implementation and Performance

We implemented HAC on top of a UNIX file system (SunOS) using Glimpse as the default CBA mechanism [mw:94]. Our prototype contains about 25,000 lines of C code. It is implemented as a dynamically linked library that can be accessed by all user-level applications, with no kernel modifications. This made the design easier to experiment with, easier to port, and easier to persuade people to use. The obvious disadvantage is a performance penalty compared to a native UNIX file system but, as we show below, the penalty is small. We start with a brief overview of the implementation.
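Although we do not show HAC's actual code here, the following sketch illustrates the general technique of intercepting file system calls from a user-level dynamically linked library; the functions hac_owns_path and hac_open are hypothetical names used only for illustration, not HAC's real entry points.

  /* Illustrative sketch of user-level call interposition; not HAC's
   * actual code.  A preloaded library can override libc entry points
   * such as open() and divert calls that fall inside HAC's directory. */
  #define _GNU_SOURCE                  /* for RTLD_NEXT on some systems */
  #include <dlfcn.h>
  #include <stdarg.h>
  #include <fcntl.h>
  #include <sys/types.h>

  extern int hac_owns_path(const char *path);                     /* hypothetical */
  extern int hac_open(const char *path, int flags, mode_t mode);  /* hypothetical */

  int open(const char *path, int flags, ...)
  {
      static int (*real_open)(const char *, int, ...);
      va_list ap;
      mode_t mode = 0;

      va_start(ap, flags);
      if (flags & O_CREAT)
          mode = (mode_t)va_arg(ap, int);   /* mode only present with O_CREAT */
      va_end(ap);

      if (hac_owns_path(path))              /* inside HAC's name space? */
          return hac_open(path, flags, mode);

      if (real_open == NULL)                /* otherwise fall through to libc */
          real_open = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");
      return real_open(path, flags, mode);
  }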

HAC allows users to define their own personal name spaces (i.e., a personal file system). HAC uses this name space to resolve the users' path names and evaluate their queries. The name space exists within a directory in the UNIX file system. HAC intercepts all file system calls that access this directory or its contents, and provides a transparent interface to all applications. Well-known file system commands, such as cd, ls, mkdir, mv, rm, etc., can be used to access and manipulate objects in the file system in the usual way. HAC also provides additional commands that manipulate queries and semantic directories; these are, for the most part, intuitive extensions of regular file system commands. For example, smkdir creates a semantic directory; smv modifies the query of a directory and sreadln retrieves it; scat takes a symbolic link in a semantic directory and returns the information in the corresponding file that matches the query of the directory; smount defines new syntactic and semantic mount points; and ssync re-evaluates the queries of all the directories that directly or indirectly depend on a given directory.
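As a concrete illustration, the following sketch shows how an application might drive these operations through HAC's library rather than the shell. The entry points hac_smkdir, hac_smv, hac_sreadln, and hac_ssync are hypothetical names that simply mirror the commands above; they are not HAC's documented API.

  /* Hypothetical usage sketch; function names mirror the shell commands
   * described above and are illustrative only. */
  #include <stdio.h>

  extern int hac_smkdir(const char *dir, const char *query);    /* smkdir  */
  extern int hac_smv(const char *dir, const char *new_query);   /* smv     */
  extern int hac_sreadln(const char *dir, char *buf, int len);  /* sreadln */
  extern int hac_ssync(const char *dir);                        /* ssync   */

  int main(void)
  {
      char query[256];

      /* Create a semantic directory whose members are the files that
       * match the query "file system". */
      if (hac_smkdir("papers/fs", "file system") != 0)
          return 1;

      /* Later, change the query and re-evaluate every directory that
       * directly or indirectly depends on this one. */
      hac_smv("papers/fs", "semantic file system");
      hac_ssync("papers/fs");

      /* Retrieve the directory's current query. */
      hac_sreadln("papers/fs", query, sizeof(query));
      printf("query: %s\n", query);
      return 0;
  }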

HAC interacts with UNIX using a well-defined API that assumes very little about the native file system -- HAC can be used even on ``flat'' file systems and on file systems that do not support symbolic links. HAC interacts with Glimpse using another simple, well-defined API, which we believe is general enough to integrate any CBA mechanism into HAC. Since HAC is a user-level file system, it does not contain any security or access-control features of its own: it borrows them from the underlying operating system.
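The CBA API itself is not listed in this section, so the following is only a guess at its general shape: a small set of operations that any content-based access mechanism could supply, with query results returned in the compact bitmap form described later in this section. All names here are assumptions.

  /* A minimal sketch, under the assumptions stated above, of what a
   * pluggable CBA-mechanism interface could look like. */
  typedef struct cba_ops {
      /* Build or update the index for all files under 'root'. */
      int (*index)(const char *root);

      /* Evaluate 'query' over the indexed files and set one bit per
       * matching file in 'result', which holds 'nfiles' bits. */
      int (*search)(const char *query, unsigned char *result, int nfiles);
  } cba_ops;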

We ran several experiments to measure the overhead of HAC's extended file system operations compared with regular file system operations and regular Glimpse queries. In the first experiment, we measured the overhead of using HAC as a purely syntactic file system, like UNIX, by running the Andrew Benchmark [saty:90] on both systems. The Andrew Benchmark has been used as a standard for evaluating the performance of many new file systems. The benchmark has 5 phases: (i) Makedir: constructs a destination directory hierarchy that is identical to the source directory hierarchy; (ii) Copy: copies each file in the source hierarchy to the destination hierarchy; (iii) Scan: recursively traverses the whole destination hierarchy and examines the status of every file in the hierarchy without reading the actual data in the files; (iv) Read: reads every byte of every file in the destination hierarchy; and (v) Make: compiles and links the files in the destination hierarchy. The results for HAC are shown in Tables 1 and 2 below:


 
Table 1: Results of Andrew Benchmark

File System   Makedir   Copy   Scan   Read   Make   Total
UNIX          2s        5s     5s     8s     19s    39s
HAC           4s        9s     8s     14s    22s    57s
 


 
Table 2: Comparison with Other File Systems

File System   % Slowdown
Jade FS       36
Pseudo FS     33-41
HAC FS        46
 

From Table 1, we see that phases 1 and 2 have the largest overhead. In phase 1, when HAC creates a new directory, it also creates and initializes (to ``empty'') the data structures that store the directory's query, its query-result, and its sets of permanent and prohibited symbolic links. HAC also records the name of the directory in a global map, so that it can track changes to the structure of the file system, and creates a new (empty) node for the directory in the dependency graph. All of these are stored on disk and require extra I/O operations. In phase 2, when HAC creates a new file, it also initializes the open file-descriptor and the attribute-cache for that file; these are stored in UNIX shared memory so that different processes can access them, which helps speed up later Scan and Read operations on that file. Phases 3 and 4 have a medium overhead: in phase 3, HAC accesses the attribute-cache to retrieve the appropriate status information, and in phase 4, HAC accesses and updates the per-process file-descriptor table to implement the read operation. Phase 5 has the least overhead since it is computationally intensive. On the whole, HAC is about 46% slower than UNIX. From Table 2, we see that HAC is only slightly slower than the Jade [rp:93] and Pseudo [weou:89] file systems, both of which are user-level file systems like HAC. We also calculated the space overhead of storing HAC's data structures (the extra per-directory information mentioned above, along with a fixed amount of book-keeping information). In the example we used, HAC required 222 KB while UNIX required 210 KB, about 5% more. The average amount of shared memory needed per process (including the attribute cache and the descriptor table) is about 16 KB. Both of these overheads are negligible. We believe that HAC's performance is quite reasonable since, unlike the other two file systems, HAC must also create and maintain data structures that provide content-based access to files.
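To make the per-directory cost of phase 1 concrete, the following sketch shows the kind of state that the description above implies HAC must create and write to disk for each new directory; the declarations are illustrative, not HAC's actual data structures.

  /* Illustrative per-directory state, based on the description above;
   * not HAC's actual declarations. */
  typedef struct hac_dir {
      char           *query;       /* the directory's query (empty at first) */
      unsigned char  *result;      /* query-result: bitmap of matching files */
      char          **permanent;   /* symbolic links pinned by the user      */
      char          **prohibited;  /* symbolic links removed by the user     */
      int             dep_node;    /* node in the dependency graph that      */
                                   /* ssync walks to re-evaluate queries     */
  } hac_dir;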

In the second experiment, we first used Glimpse to index a database consisting of over 17,000 files that occupy about 150 MB. We ran the indexing mechanism directly over UNIX to estimate the time Glimpse takes to index the database and the space needed to store the index. We then indexed a different copy of the same database through the HAC file system library instead. The results are shown in Table 3. We see that HAC has a 27% time overhead and a 15% space overhead; we believe that both of these are reasonable.


 
Table 3: Results of Indexing

No. of files          17154
Size of files         149 MB
Size of UNIX index    10 MB
Size of HAC index     11.5 MB
Time taken in UNIX    25 min
Time taken in HAC     31 min 48 sec
 

In the second part of this experiment, we used the smkdir command in HAC to create a semantic directory with a query Q. We also ran Glimpse directly over UNIX to search the same database for the same query. We chose three kinds of queries: (i) those that matched very few files, (ii) those that matched many files, and (iii) those that matched an intermediate number of files. (We believe that queries of types (i) and (iii) are the most realistic -- and the most useful.) The results are shown in Table 4.


 
Table 4: Results of Searching

                            Type (i)   Type (ii)      Type (iii)
No. of files that matched   1          6556           98
Time taken in UNIX          0.45 sec   4 min 23 sec   7 sec
Time taken in HAC           2 sec      4 min 28 sec   8 sec
 

For queries that matched very few files, Glimpse running on UNIX is more than 4 times as fast as HAC. This is because, to interact with the CBA mechanism in HAC, we must create a semantic directory; we do not incur this overhead when we run Glimpse directly on UNIX to search files. While this may seem like a large relative overhead, in absolute terms it is very small. The cost of creating a semantic directory also becomes less significant as the number of files that match the query increases: for queries that match an intermediate number of files, the overhead is about 15%, and for queries that match many files, it is only 2%.

Regarding the space overhead, note that we need to store, with each semantic directory, the list of files matching its query. Instead of storing actual file names, which could take quite a bit of space, we use a compact representation of the list of matching files; this representation is part of HAC's API for the CBA mechanism. We currently use bitmaps, since they are simple to implement and have speed advantages for Glimpse. The extra space we need per semantic directory is therefore N/8 bytes, where N is the number of indexed files; this comes to about 2 KB in this experiment. We plan to improve this in the future by using better sparse-set representations, so that it becomes practical to index a very large number of files.
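The arithmetic and the basic bitmap operations are straightforward: with one bit per indexed file, the 17154 files in this experiment need 17154/8, or roughly 2 KB, per semantic directory. The helper names in the sketch below are illustrative, but the representation is the N/8-byte bitmap described above.

  /* One bit per indexed file: N indexed files cost N/8 bytes per
   * semantic directory.  Helper names are illustrative. */
  #include <stdlib.h>

  static unsigned char *result_new(int nfiles)
  {
      return calloc((nfiles + 7) / 8, 1);       /* all bits start cleared */
  }

  static void result_add(unsigned char *bm, int file_id)
  {
      bm[file_id / 8] |= (unsigned char)(1 << (file_id % 8));
  }

  static int result_contains(const unsigned char *bm, int file_id)
  {
      return (bm[file_id / 8] >> (file_id % 8)) & 1;
  }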

