We gathered a large set of programs from public sources and defined a learning problem with two classes: malicious and benign executables. Each example in the data set is a Windows or MS-DOS format executable, although the framework we present is applicable to other formats. To standardize our data set, we used McAfee's [5] virus scanner to label each program as either a malicious or a benign executable.
We split the data set into two subsets: the training set and the test set. The data mining algorithms used the training set to generate the rule sets, and after training we used the test set to measure the accuracy of the classifiers on unseen examples.
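As an illustration of such a split, the following minimal sketch partitions labeled examples into disjoint training and test sets. The function name `split_train_test`, the 20% test fraction, the per-class shuffling, and the file names are all assumptions made for this example; the text above does not specify how the split was performed.

```python
import random
from collections import defaultdict


def split_train_test(examples, test_fraction=0.2, seed=0):
    """Split (name, label) pairs into disjoint training and test sets.

    Assumption: examples are shuffled within each label so that both
    subsets keep roughly the original malicious/benign ratio.
    """
    rng = random.Random(seed)

    # Group the examples by their class label.
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        # Hold out a fixed fraction of each class for testing.
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical file names and labels, for illustration only.
examples = [(f"mal_{i}.exe", "malicious") for i in range(10)]
examples += [(f"ben_{i}.exe", "benign") for i in range(5)]

train, test = split_train_test(examples, test_fraction=0.2, seed=0)
```

Because no example appears in both subsets, accuracy measured on `test` reflects performance on programs the classifier has not seen during training.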