Evaluation

Linn and Debray evaluated their obfuscation tool using the SPECint 95 benchmark suite, a set of eight benchmark applications written in C. These programs were compiled with gcc version egcs-2.91.66 at optimization level -O3 and then obfuscated.

Table 1: Disassembler accuracy.

				Our tool

compress95	56.07	69.96	24.19	91.04	98.07
gcc	65.54	82.18	45.09	88.45	95.17
go	66.08	78.12	43.01	91.81	96.80
ijpeg	60.82	74.23	31.46	91.60	97.53
li	56.65	72.78	29.07	89.86	97.35
m88ksim	58.42	75.66	29.56	90.39	97.49
perl	57.66	72.01	31.36	86.93	96.28
vortex	66.02	76.97	42.65	90.71	96.65
Mean	60.91	75.24	34.55	90.10	96.92

To measure the efficacy of the obfuscation process, the confusion factor for instructions was introduced. This metric measures how many program instructions were incorrectly disassembled. More formally, let V be the set of valid program instructions and O the set of instructions that a disassembler outputs. Then, the confusion factor CF is defined as $\mbox{CF} = \frac{\vert V-O\vert}{V}$ . Because our work focuses on the efficacy of the disassembler in identifying valid instructions, we define the disassembler accuracy DA as $\mbox{DA} = 1 - \mbox{CF}$ .

Linn and Debray used three different disassemblers to evaluate the quality of their obfuscator. The first one was the GNU objdump utility, which implements a standard linear sweep algorithm. The second disassembler was implemented by Linn and Debray themselves. It is a recursive disassembler that uses speculative linear disassembly (comparable to our gap completion) for regions that are not reachable by the recursive part. This disassembler was also provided with additional information about the start and end addresses of all program functions. The purpose of this disassembler was to serve as an upper bound estimator for the disassembler accuracy and to avoid reporting ``unduly optimistic results'' [13]. The third disassembler was IDA Pro 4.3x, a commercial disassembler that is often considered to be among the best commercially available disassemblers. This belief is also reflected in the fact that IDA Pro is used to provide disassembly as input for static analysis tools such as [3].

We developed a disassembler that implements the general techniques and the tool-specific modification presented in the two previous sections. Our tool was then run on the eight obfuscated SPECint 95 applications. The results for our tool and a comparison to the three disassemblers used by Linn and Debray are shown in Table 1. Note that we report two results for our disassembler. One shows the disassembler accuracy when only general techniques are utilized. The second result shows the disassembler accuracy when the tool-specific modification is also enabled.

These results demonstrate that our disassembler provides a significant improvement over the best disassembler used in the evaluation by Linn and Debray. Even without using tool-specific knowledge, the disassembler accuracy is higher than their recursive disassembler used to estimate the upper bound for the disassembler accuracy. When the tool-specific modification is enabled, the binary is disassembled almost completely. The poor results for IDA Pro can be explained with the fact that the program only disassembles addresses that can be guaranteed (according to the tool) to be instructions. As a result, many functions that are invoked through the branch function are not disassembled at all. In addition, IDA Pro continues directly after call instructions and is frequently mislead by junk bytes there.

Given the satisfying results of our disassembler, the disassembly process was analyzed in more detail. It is interesting to find the ratio between the number of valid instructions identified by the control flow graph and the number of valid instructions identified by the gap completion phase. Although the gap completion phase is important in filling regions not covered by the CFG, our key observation is the fact that the control transfer instructions and the resulting control flow graph constitute the skeleton of an analyzed function. Therefore, one would expect that most valid instructions can be derived from the control flow graph, and only small gaps (e.g., caused by indirect calls or unconditional jumps) need to be completed later. Table 2 shows the fraction (in percent) of correctly identified, valid instructions that were obtained using the control flow graph and the fraction obtained in the gap completion phase. Because the numbers refer to correctly identified instructions only, the two fractions sum up to unity. Both the results with tool-specific support and the results with the general techniques alone are provided. When tool specific support is available, the control flow graph contributes noticeable more to the output. In this case, the disassembler can include all regions following call instructions into the CFG. However, in both experiments, a clear majority of the output was derived from the control flow graph, confirming our key observation.

Table 2: CFG vs. gap completion.

	general		tool-specific
		CFG	Gap	CFG
compress95	87.09	12.91	96.36	3.64
gcc	85.12	14.88	93.10	6.90
go	89.13	10.87	95.11	4.89
ijpeg	87.02	12.98	95.03	4.97
li	85.63	14.37	95.11	4.89
m88ksim	87.18	12.82	96.00	4.00
perl	86.22	13.78	95.57	4.43
vortex	88.04	11.96	94.67	5.33
Mean	86.93	13.07	95.12	4.88

Table 3: Conflict resolution.

		Conflict Resolution
				Step 1	Step 2	Step 3
compress95	54674	7021	4693	4242	93	48	38577
gcc	245586	21762	25680	29801	900	565	166878
go	91140	10667	8934	9405	231	154	61749
ijpeg	70255	9414	6069	5299	140	95	49238
li	63459	8350	5297	4952	125	78	44657
m88ksim	77344	10061	6933	6938	177	101	53134
perl	104841	10940	11442	11750	291	152	70266
vortex	118703	15004	9221	13424	407	373	80274

Because most of the output is derived from the control flow graph, it is important that the conflict resolution phase is effective. One third of the control transfer instructions that are used to create the initial control flow graphs are invalid. To achieve a good disassembler accuracy, it is important to remove the invalid nodes from the CFG. The first two steps of the conflict resolution phase remove nodes that are guaranteed to be invalid, given our assumptions. The third and forth step implement two heuristics and the fifth step randomly selects one of two conflicting nodes. It is evident that it is desirable to have as many conflicts as possible resolved by the first and second step, while the fifth step should never be required.

Table 3 shows for each program the number of basic blocks in the initial control flow graphs (column Initial Blocks) and the number of basic blocks in the control flow graphs after the conflict resolution phase (column Final Blocks). In addition, the number of basic blocks that were removed in each of the five steps of the conflict resolution phase are shown. The numbers given in Table 3 were collected when the tool-specific modification was enabled. The results were very similar when only general techniques were used.

It can be seen that most conflicts were resolved after the first three steps. About two thirds of the removed basic blocks were guaranteed to be invalid. This supports our claim that invalid control flow instructions, caused by the misinterpretation of instruction arguments, often result in impossible control flows that can be easily detected. Most of the remaining blocks are removed by the first heuristic that checks how tight a block is connected with the rest of the CFG. Invalid blocks are often loosely coupled and can taken out during this step. The last two steps were only responsible for a small fraction of the total removed blocks. The heuristic in step four was sometimes able to provide an indication of which block was valid. Otherwise, a random node had to be selected.

Static analysis tools are traditionally associated with poor scalability and the inability to deal with real-world input. Therefore, it is important to ascertain that our disassembler can process even large real-world binaries in an acceptable amount of time. In Section 4, we claimed that the processing overhead of the program is linear in the number of instructions of the binary. The intuitive reason is the fact that the binary is partitioned into functions that are analyzed independently. Assuming that the average size of an individual function is relatively independent of the size of the binary, the amount of work per function is also independent of the size of the binary. As a result, more functions have to be analyzed as the size of the binary increases. Because the number of functions increases linearly with the number of instructions and the work per function is constant (again, assuming a constant average function size), the overhead of the static analysis process is linear in the number of instructions.

Table 4: Disassembler processing times.

Program	Size (Bytes)	Instructions	Time (s)
openssh	263,684	46,343	4
compress95	1,768,420	92,137	9
li	1,820,768	109,652	7
ijpeg	1,871,776	127,012	9
m88ksim	2,001,616	127,358	8
go	2,073,728	145,953	11
perl	2,176,268	169,054	15
vortex	2,340,196	204,230	16
gcc	2,964,740	387,289	28
emacs	4.765,512	405,535	38

To support this claim with experimental data, the time for a complete disassembly of each evaluation binary was taken. The size of obfuscated programs of the SPECint 95 benchmark are in the range of 1.77 MB to 2.96 MB. To obtain more diversified results, we also disassembled one smaller (openssh 3.7) and one larger binary (emacs 21.3). The processing times were taken as the average of ten runs on a 1.8 GHz Pentium IV system with 512 MB of RAM, running Gentoo Linux 2.6. The results (in seconds) for the disassembler are listed in Table 4. There was no noticeable difference when using tool-specific modification.

Figure 7 shows a plot of the processing times and the corresponding number of instructions for each binary. The straight line represents the linear regression line. The close proximity of all points to this line demonstrates that the processing time increases proportional to the number of instructions, allowing our disassembler to operate on large binaries with acceptable cost.

**Figure 7:** Processing times and linear regression.
$\scalebox{0.31}{\includegraphics{time}}$