Next: Conclusions Up: Static Disassembly of Obfuscated Previous: Tool-Specific Techniques


Evaluation


Linn and Debray evaluated their obfuscation tool using the SPECint 95 benchmark suite, a set of eight benchmark applications written in C. These programs were compiled with gcc version egcs-2.91.66 at optimization level -O3 and then obfuscated.


Table 1: Disassembler accuracy (percent of valid instructions correctly disassembled).

Program      objdump   Linn/Debray   IDA Pro   Our tool (general)   Our tool (tool-specific)
compress95   56.07     69.96         24.19     91.04                98.07
gcc          65.54     82.18         45.09     88.45                95.17
go           66.08     78.12         43.01     91.81                96.80
ijpeg        60.82     74.23         31.46     91.60                97.53
li           56.65     72.78         29.07     89.86                97.35
m88ksim      58.42     75.66         29.56     90.39                97.49
perl         57.66     72.01         31.36     86.93                96.28
vortex       66.02     76.97         42.65     90.71                96.65
Mean         60.91     75.24         34.55     90.10                96.92


To measure the efficacy of the obfuscation process, the confusion factor for instructions was introduced. This metric measures how many program instructions were incorrectly disassembled. More formally, let V be the set of valid program instructions and O the set of instructions that a disassembler outputs. Then, the confusion factor CF is defined as $\mbox{CF} = \frac{\vert V - O \vert}{\vert V \vert}$. Because our work focuses on the efficacy of the disassembler in identifying valid instructions, we define the disassembler accuracy DA as $\mbox{DA} = 1 - \mbox{CF}$.
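The two metrics follow directly from set operations on instruction addresses. The following sketch is purely illustrative: the helper names and the sample addresses are made up, with V and O represented as sets of instruction start addresses.

```python
def confusion_factor(valid, output):
    """CF = |V - O| / |V|: fraction of valid instructions the disassembler missed."""
    return len(valid - output) / len(valid)

def disassembler_accuracy(valid, output):
    """DA = 1 - CF: fraction of valid instructions correctly identified."""
    return 1.0 - confusion_factor(valid, output)

# Hypothetical example: four valid instruction addresses (V), of which the
# disassembler reports three correctly and one junk address (O).
valid = {0x1000, 0x1003, 0x1005, 0x100a}
output = {0x1000, 0x1003, 0x1007, 0x100a}

print(confusion_factor(valid, output))      # one of four valid addresses missed -> 0.25
print(disassembler_accuracy(valid, output)) # 0.75
```

Note that extra junk instructions in O (such as 0x1007 above) do not raise CF; the metric only penalizes valid instructions that are missing from the output.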

Linn and Debray used three different disassemblers to evaluate the quality of their obfuscator. The first one was the GNU objdump utility, which implements a standard linear sweep algorithm. The second disassembler was implemented by Linn and Debray themselves. It is a recursive disassembler that uses speculative linear disassembly (comparable to our gap completion) for regions that are not reachable by the recursive part. This disassembler was also provided with additional information about the start and end addresses of all program functions. The purpose of this disassembler was to serve as an upper bound estimator for the disassembler accuracy and to avoid reporting "unduly optimistic results" [13]. The third disassembler was IDA Pro 4.3x, a commercial disassembler that is widely considered to be among the best available. This belief is also reflected in the fact that IDA Pro is used to provide disassembly as input for static analysis tools such as [3].

We developed a disassembler that implements the general techniques and the tool-specific modification presented in the two previous sections. Our tool was then run on the eight obfuscated SPECint 95 applications. The results for our tool and a comparison to the three disassemblers used by Linn and Debray are shown in Table 1. Note that we report two results for our disassembler. One shows the disassembler accuracy when only general techniques are utilized. The second result shows the disassembler accuracy when the tool-specific modification is also enabled.

These results demonstrate that our disassembler provides a significant improvement over the best disassembler used in the evaluation by Linn and Debray. Even without using tool-specific knowledge, our disassembler accuracy is higher than that of their recursive disassembler, which was used to estimate the upper bound for the disassembler accuracy. When the tool-specific modification is enabled, the binary is disassembled almost completely. The poor results for IDA Pro can be explained by the fact that the program only disassembles addresses that can be guaranteed (according to the tool) to be instructions. As a result, many functions that are invoked through the branch function are not disassembled at all. In addition, IDA Pro continues directly after call instructions and is frequently misled by junk bytes there.

Given these satisfactory results, the disassembly process was analyzed in more detail. Of particular interest is the ratio between the number of valid instructions identified by the control flow graph and the number of valid instructions identified by the gap completion phase. Although the gap completion phase is important in filling regions not covered by the CFG, our key observation is that the control transfer instructions and the resulting control flow graph constitute the skeleton of an analyzed function. Therefore, one would expect that most valid instructions can be derived from the control flow graph, and only small gaps (e.g., caused by indirect calls or unconditional jumps) need to be completed later. Table 2 shows the fraction (in percent) of correctly identified, valid instructions that were obtained using the control flow graph and the fraction obtained in the gap completion phase. Because the numbers refer to correctly identified instructions only, the two fractions sum to unity. Both the results with tool-specific support and the results with the general techniques alone are provided. When tool-specific support is available, the control flow graph contributes noticeably more to the output. In this case, the disassembler can include all regions following call instructions into the CFG. However, in both experiments, a clear majority of the output was derived from the control flow graph, confirming our key observation.


Table 2: CFG vs. gap completion (percent of correctly identified instructions).

             General            Tool-specific
Program      CFG      Gap       CFG      Gap
compress95   87.09    12.91     96.36    3.64
gcc          85.12    14.88     93.10    6.90
go           89.13    10.87     95.11    4.89
ijpeg        87.02    12.98     95.03    4.97
li           85.63    14.37     95.11    4.89
m88ksim      87.18    12.82     96.00    4.00
perl         86.22    13.78     95.57    4.43
vortex       88.04    11.96     94.67    5.33
Mean         86.93    13.07     95.12    4.88



Table 3: Conflict resolution (basic blocks removed in each step).

Program      Initial Blocks   Step 1   Step 2   Step 3   Step 4   Step 5   Final Blocks
compress95   54674            7021     4693     4242     93       48       38577
gcc          245586           21762    25680    29801    900      565      166878
go           91140            10667    8934     9405     231      154      61749
ijpeg        70255            9414     6069     5299     140      95       49238
li           63459            8350     5297     4952     125      78       44657
m88ksim      77344            10061    6933     6938     177      101      53134
perl         104841           10940    11442    11750    291      152      70266
vortex       118703           15004    9221     13424    407      373      80274


Because most of the output is derived from the control flow graph, it is important that the conflict resolution phase is effective. One third of the control transfer instructions that are used to create the initial control flow graphs are invalid. To achieve good disassembler accuracy, it is important to remove the invalid nodes from the CFG. The first two steps of the conflict resolution phase remove nodes that are guaranteed to be invalid, given our assumptions. The third and fourth steps implement two heuristics, and the fifth step randomly selects one of two conflicting nodes. It is evident that it is desirable to have as many conflicts as possible resolved by the first and second steps, while the fifth step should never be required.
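The five-step cascade can be sketched as follows. This is a minimal illustration of the ordering only: the predicate and scoring functions passed in are hypothetical stand-ins for the actual analyses (impossible control flow checks in steps 1 and 2, CFG coupling in step 3, and the second heuristic in step 4), which are described in the previous sections.

```python
import random

def resolve_conflict(a, b, is_invalid_step1, is_invalid_step2,
                     coupling, step4_score):
    """Return the basic block to remove from a conflicting pair (a, b).

    The four callbacks are assumed stand-ins for the real analyses:
    each returns evidence about a single block.
    """
    # Steps 1 and 2: remove a block that is guaranteed to be invalid.
    for invalid in (is_invalid_step1, is_invalid_step2):
        if invalid(a):
            return a
        if invalid(b):
            return b
    # Step 3 (heuristic): remove the block that is more loosely
    # coupled to the rest of the CFG.
    if coupling(a) != coupling(b):
        return a if coupling(a) < coupling(b) else b
    # Step 4 (heuristic): a second scoring function may break the tie.
    if step4_score(a) != step4_score(b):
        return a if step4_score(a) < step4_score(b) else b
    # Step 5: no evidence either way; select one of the two at random.
    return random.choice([a, b])
```

Because the steps are tried strictly in order, a conflict is only ever decided randomly when both guaranteed checks and both heuristics fail to distinguish the blocks, matching the goal stated above.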

Table 3 shows for each program the number of basic blocks in the initial control flow graphs (column Initial Blocks) and the number of basic blocks in the control flow graphs after the conflict resolution phase (column Final Blocks). In addition, the number of basic blocks that were removed in each of the five steps of the conflict resolution phase are shown. The numbers given in Table 3 were collected when the tool-specific modification was enabled. The results were very similar when only general techniques were used.

It can be seen that most conflicts were resolved after the first three steps. About two-thirds of the removed basic blocks were guaranteed to be invalid. This supports our claim that invalid control flow instructions, caused by the misinterpretation of instruction arguments, often result in impossible control flows that can be easily detected. Most of the remaining blocks are removed by the first heuristic, which checks how tightly a block is connected with the rest of the CFG. Invalid blocks are often loosely coupled and can be taken out during this step. The last two steps were only responsible for a small fraction of the total removed blocks. The heuristic in step four was sometimes able to provide an indication of which block was valid. Otherwise, a random node had to be selected.

Static analysis tools are traditionally associated with poor scalability and the inability to deal with real-world input. Therefore, it is important to ascertain that our disassembler can process even large real-world binaries in an acceptable amount of time. In Section 4, we claimed that the processing overhead of the program is linear in the number of instructions of the binary. The intuitive reason is the fact that the binary is partitioned into functions that are analyzed independently. Assuming that the average size of an individual function is relatively independent of the size of the binary, the amount of work per function is also independent of the size of the binary. As a result, more functions have to be analyzed as the size of the binary increases. Because the number of functions increases linearly with the number of instructions and the work per function is constant (again, assuming a constant average function size), the overhead of the static analysis process is linear in the number of instructions.
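The structure of this argument can be made concrete with a short sketch. The function names below are hypothetical stand-ins for the disassembler's actual phases: the point is only that analysis proceeds function by function, so if the average function size is bounded, each iteration does a bounded amount of work and the loop is linear in the number of instructions.

```python
def disassemble(binary_bytes, partition_into_functions, analyze_function):
    """Analyze a binary one function at a time.

    partition_into_functions and analyze_function are assumed stand-ins
    for the real phases; each function region is processed independently,
    so total cost = (number of functions) x (cost per function).
    """
    results = []
    for func_region in partition_into_functions(binary_bytes):
        # Per-function work is independent of total binary size.
        results.append(analyze_function(func_region))
    return results
```

With a bounded average function size, the number of regions grows linearly with the instruction count while the per-region cost stays constant, giving the claimed linear overhead.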


Table 4: Disassembler processing times.
Program Size (Bytes) Instructions Time (s)
openssh 263,684 46,343 4
compress95 1,768,420 92,137 9
li 1,820,768 109,652 7
ijpeg 1,871,776 127,012 9
m88ksim 2,001,616 127,358 8
go 2,073,728 145,953 11
perl 2,176,268 169,054 15
vortex 2,340,196 204,230 16
gcc 2,964,740 387,289 28
emacs 4,765,512 405,535 38


To support this claim with experimental data, the time for a complete disassembly of each evaluation binary was measured. The sizes of the obfuscated SPECint 95 binaries range from 1.77 MB to 2.96 MB. To obtain more diversified results, we also disassembled one smaller binary (openssh 3.7) and one larger binary (emacs 21.3). The processing times were taken as the average of ten runs on a 1.8 GHz Pentium IV system with 512 MB of RAM, running Gentoo Linux 2.6. The results (in seconds) for the disassembler are listed in Table 4. There was no noticeable difference when the tool-specific modification was enabled.

Figure 7 shows a plot of the processing times and the corresponding number of instructions for each binary. The straight line represents the linear regression line. The close proximity of all points to this line demonstrates that the processing time increases in proportion to the number of instructions, allowing our disassembler to operate on large binaries with acceptable cost.

Figure 7: Processing times and linear regression.



2004-05-18