
1 A Feature Selection and Evaluation Scheme for Computer Virus Detection
Olivier Henchiri and Nathalie Japkowicz
School of Information Technology and Engineering, University of Ottawa

2 Motivation
- Traditional anti-virus systems are signature-based. This technique is appropriate for detecting existing viruses, but it falls short of detecting new, unseen viruses or variants of existing ones.
- Yet virus writers strategically modify their viruses so that existing virus signatures do not match the new viruses. They do so in random and unpredictable ways, each time the virus replicates.
- Heuristic scanners attempt to compensate for this gap by using more general features from viral code. However, the process requires human intervention and falls short of yielding both good detection rates for new viruses and low false-positive rates.
→ Automated searches for general features are needed.

3 Purpose: To improve on current automated search methods for general features
This talk presents:
- A Feature Search and Selection approach for Virus Detection that performs an exhaustive search on a data set of viruses, yielding a large number of short generic features that are then filtered with respect to how representative they are of viral properties.
- A Stringent Cross-Validation scheme allowing us to simulate real-world conditions of new virus outbreaks.
- Evidence that our Feature Selection approach has high predictive power.

4 Background
- Computer viruses are often organized within sets of Virus Families.
- Virus families are characterized by their similarities in:
  - Structure
  - Code
  - Methods of infection
- Consideration of Virus Families is crucial to the task of detection. Indeed, the first virus of a family is usually devastating, while its family variants are typically less so.
→ Our approach uses a priori knowledge of virus families, but our evaluation scheme focuses on evaluating classifiers in their detection of viruses of a family they were not trained on.

5 Feature Search and Selection I
Our feature search and selection algorithm comprises three steps:
- Scanning & Recording: A scanning window of length SequenceLength moves across the binary code, recording the frequency within each family of each sequence it encounters.
- Selection: The features whose family frequency is at or above the threshold IntraFamilySupport are selected. → Only the features most representative of a family are retained.
- Elimination: The features that fall below the threshold InterFamilySupport are eliminated. → Features that are too exclusive to a particular family are rejected.
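The three steps on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the slide does not define "frequency within each family", so the sketch takes it to be the fraction of a family's files that contain a sequence, and it reads InterFamilySupport as the minimum number of families in which a sequence must have been selected. The function names `scan_family` and `select_features` and the toy data are hypothetical.

```python
from collections import defaultdict


def scan_family(files, sequence_length):
    """Scanning & Recording: slide a window over each file of one family.

    Assumption: the "family frequency" of a sequence is the fraction of the
    family's files in which the sequence occurs at least once.
    """
    counts = defaultdict(int)
    for data in files:
        # Collect each distinct window once per file.
        seen = {data[i:i + sequence_length]
                for i in range(len(data) - sequence_length + 1)}
        for seq in seen:
            counts[seq] += 1
    return {seq: n / len(files) for seq, n in counts.items()}


def select_features(families, sequence_length,
                    intra_family_support, inter_family_support):
    """Run Selection within each family, then Elimination across families.

    Assumption: inter_family_support is a count of families.
    """
    families_supporting = defaultdict(int)
    for files in families.values():
        freqs = scan_family(files, sequence_length)
        # Selection: keep sequences frequent enough within this family.
        for seq, freq in freqs.items():
            if freq >= intra_family_support:
                families_supporting[seq] += 1
    # Elimination: drop sequences selected in too few families.
    return {seq for seq, n in families_supporting.items()
            if n >= inter_family_support}


# Toy data: three hypothetical families of tiny "binaries".
toy_families = {
    "famA": [b"\x01\x02\x03\x04", b"\x01\x02\x03\x09"],
    "famB": [b"\x07\x01\x02\x03", b"\x01\x02\x03\x08"],
    "famC": [b"\x05\x06\x07\x08"],
}
toy_features = select_features(toy_families, sequence_length=3,
                               intra_family_support=0.5,
                               inter_family_support=2)
```

In the toy data, only the sequence `01 02 03` is selected in more than one family, so it is the only feature that survives the Elimination step.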

6 Feature Search and Selection II
- Our Feature Search and Selection method is hierarchical and thus scalable to large datasets:
  - The Scanning and Recording step is done only once.
  - The Selection step is conducted on small family subsets.
  - The Elimination step is conducted on shorter feature lists.
- Our Feature Search and Selection method ensures that all retained features represent viral properties common to many types of viruses, as opposed to idiosyncrasies specific to one family.

7 Evaluation Scheme I
Purpose:
- To simulate an environment where a virus detection system is faced with the outbreak of a new, unseen virus.
Procedure:
- Form k folds f_1..f_k such that, for each pair of folds (f_i, f_j) with i = 1..k, j = 1..k, and i ≠ j, the set of families represented in f_i is disjoint from the set of families represented in f_j.
- Benign programs are added, at random, to each fold.
- Perform a regular cross-validation scheme.
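The fold construction described on this slide might look like the Python sketch below. It is a minimal illustration rather than the authors' code: the slide does not say how families are assigned to folds or how benign programs are distributed, so the sketch assumes a shuffled round-robin assignment for both. The names `family_disjoint_folds` and `cross_validation_splits` and the sample data are hypothetical.

```python
import random


def family_disjoint_folds(virus_families, benign_files, k, seed=0):
    """Build k folds so that no virus family appears in more than one fold.

    Assumption: families and benign files are shuffled, then dealt to the
    folds in round-robin order.
    """
    rng = random.Random(seed)
    folds = [{"families": set(), "viruses": [], "benign": []}
             for _ in range(k)]

    family_names = sorted(virus_families)
    rng.shuffle(family_names)
    for i, name in enumerate(family_names):
        fold = folds[i % k]
        fold["families"].add(name)
        fold["viruses"].extend(virus_families[name])

    benign = list(benign_files)
    rng.shuffle(benign)
    for i, sample in enumerate(benign):
        folds[i % k]["benign"].append(sample)
    return folds


def cross_validation_splits(folds):
    """Regular cross-validation: each fold is the test set exactly once."""
    for i, test_fold in enumerate(folds):
        train_folds = [f for j, f in enumerate(folds) if j != i]
        yield train_folds, test_fold


# Hypothetical data: six families of two viruses each, nine benign files.
demo_families = {f"family{n}": [f"virus{n}a", f"virus{n}b"] for n in range(6)}
demo_benign = [f"benign{n}" for n in range(9)]
demo_folds = family_disjoint_folds(demo_families, demo_benign, k=3)
```

Because whole families are dealt to folds, every virus in a test fold belongs to a family the classifier never saw during training, which is the condition the slide aims to simulate.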

8 Evaluation Scheme II

9 Results
- Traditional Feature Search (best strategy to date): retain 16-byte sequences appearing with a support of at least 1% [Schultz et al., 2001].
- Data Set: 1512 viruses + 1488 benign executables. The viruses belong to 110 families.
- Parameter Setting:
  - SequenceLength = 8
  - IntraFamilySupport = 40%
  - InterFamilySupport = 3
- We obtain up to 93.65% accuracy versus 65.04% obtained by the traditional feature search approach.

10 Other Observations
Extra Experiments Set-up:
- An additional set of experiments was performed in which the three search parameters were varied.
- The IntraFamilySupport threshold was adjusted according to the other two parameters, so that a maximum of 500 features per family are selected in the second step of our algorithm.
Observations:
- Classifiers perform better with shorter sequence lengths. Sequence lengths of 5, 4 and 3 seem optimal.
- Low InterFamilySupport thresholds yield better results, especially for longer sequences.
- Performance generally decreases when the feature set contains fewer than 200 features. Large numbers of short features perform better than small numbers of long ones.
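The slide does not say how the intra-family support threshold was adjusted to respect the cap of 500 features per family. One possible way, shown here purely as an assumption, is to pick the lowest threshold at which no more than the allowed number of features pass the Selection step. The function name `support_for_feature_cap` and the sample frequencies are hypothetical.

```python
def support_for_feature_cap(family_freqs, max_features=500):
    """Return the lowest threshold that keeps at most max_features features.

    family_freqs maps each sequence to its frequency within one family.
    A feature is kept when its frequency is at or above the threshold.
    Returns None if even the highest observed frequency keeps too many
    features (for example, because of ties).
    """
    values = list(family_freqs.values())
    for threshold in sorted(set(values)):
        kept = sum(1 for v in values if v >= threshold)
        if kept <= max_features:
            return threshold
    return None


# Hypothetical per-family frequencies for five candidate sequences.
example_freqs = {"a": 1.0, "b": 0.8, "c": 0.8, "d": 0.5, "e": 0.2}
example_threshold = support_for_feature_cap(example_freqs, max_features=3)
```

With a cap of three features, the threshold settles at 0.8, which keeps sequences `a`, `b` and `c` and rejects the two less frequent ones.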

11 Conclusion and Future Work
Summary:
- Our Feature Search and Selection and Evaluation methods focus on selecting generic features useful on new, unseen families of viruses.
- Our results demonstrate the usefulness of our method in this setting.
Future Work:
- To reduce the false positive rate further, by using a larger number of benign files for training, or simply through stratification or cost-sensitive learning.
- To test our Feature Search and Selection method in a Retrospective Testing setting that would involve a set of older viruses in the training set and a set of more recent ones in the test set.

