Next: Data Set Up: MEF: Malicious Email Filter Previous: Monitoring the Propagation of

Methodology for Building Data Mining Detection Models

We gathered a large set of programs from public sources and defined a learning problem with two classes: malicious and benign executables. Each example in the data set is a Windows or MS-DOS format executable, although the framework we present is applicable to other formats. To standardize our data-set, we used MacAfee's [5] virus scanner and labeled our programs as either malicious or benign executables.

We split the dataset into two subsets: the training set and the test set. The data mining algorithms used the training set while generating the rule sets, and after training we used a test set to test the accuracy of the classifiers on a set of unseen examples.

Data Set
Detection Algorithms
Data Mining Approach
Signature-Based Approach
Preliminary Results
Performance on New Executables
Performance on Known Executables

Matthew G. Schultz
2001-05-01