Detecting and Mitigating Sampling Bias in Cybersecurity with Unlabeled Data

Authors: 

Saravanan Thirumuruganathan, Independent Researcher; Fatih Deniz, Issa Khalil, and Ting Yu, Qatar Computing Research Institute, HBKU; Mohamed Nabeel, Palo Alto Networks; Mourad Ouzzani, Qatar Computing Research Institute, HBKU

Abstract: 

Machine Learning (ML)-based systems have demonstrated remarkable success in addressing various challenges within the ever-evolving cybersecurity landscape, particularly in the domain of malware detection/classification. However, a notable performance gap becomes evident when such classifiers are deployed in production. This discrepancy, often observed between accuracy scores reported in research papers and those achieved in real-world deployments, can be largely attributed to sampling bias. Intuitively, the data distribution in production differs from that of training, resulting in reduced classifier performance. How to deal with such sampling bias is an important problem in cybersecurity practice. In this paper, we propose principled approaches to detect and mitigate the adverse effects of sampling bias. First, we propose two simple and intuitive algorithms, based on domain discrimination and the distribution of k-th nearest neighbor distances, to detect discrepancies between training and production data distributions. Second, we propose two algorithms based on the self-training paradigm to alleviate the impact of sampling bias. Our approaches are inspired by domain adaptation and judiciously harness unlabeled data to enhance the generalizability of ML classifiers. Critically, our approach does not require any modifications to the classifiers themselves, thus ensuring seamless integration into existing deployments. We conducted extensive experiments on four diverse datasets from malware, web domains, and intrusion detection. In an adversarial setting with large sampling bias, our proposed algorithms can improve the F-score by as much as 10-16 percentage points. Concretely, the F-score of a malware classifier on the AndroZoo dataset increases from 0.83 to 0.937.
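To make the k-th nearest neighbor detection idea concrete, the following is a minimal sketch, not the authors' exact algorithm: it compares the k-th NN distances of production points (measured against the training set) with the baseline k-th NN distances within the training set itself, and flags sampling bias when production points sit markedly farther away. The function names, the median-ratio statistic, and the threshold value are illustrative assumptions.

```python
import numpy as np

def kth_nn_distances(queries, reference, k=5):
    """Distance from each query point to its k-th nearest neighbor in `reference`."""
    # Pairwise Euclidean distances, shape (n_queries, n_reference)
    d = np.linalg.norm(queries[:, None, :] - reference[None, :, :], axis=2)
    # k-th smallest distance per row (0-indexed: column k-1)
    return np.sort(d, axis=1)[:, k - 1]

def detect_shift(train, prod, k=5, ratio_threshold=1.5):
    """Hypothetical detector: flag sampling bias when production points lie much
    farther from the training set than training points lie from each other."""
    # Baseline: k-th non-self NN distance within the training set
    # (column 0 after sorting is the self-distance of 0, so take column k)
    d_self = np.linalg.norm(train[:, None, :] - train[None, :, :], axis=2)
    baseline = np.median(np.sort(d_self, axis=1)[:, k])
    shifted = np.median(kth_nn_distances(prod, train, k))
    return bool(shifted / baseline > ratio_threshold)

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(200, 4))
prod_same = rng.normal(0.0, 1.0, size=(100, 4))     # same distribution as training
prod_shifted = rng.normal(3.0, 1.0, size=(100, 4))  # mean-shifted "production" data
print(detect_shift(train, prod_same))
print(detect_shift(train, prod_shifted))
```

The paper's companion detector, domain discrimination, can be sketched analogously: train a binary classifier to distinguish training from production samples, and treat a discrimination accuracy well above chance as evidence of distribution shift.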

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX
@inproceedings {299673,
author = {Saravanan Thirumuruganathan and Fatih Deniz and Issa Khalil and Ting Yu and Mohamed Nabeel and Mourad Ouzzani},
title = {Detecting and Mitigating Sampling Bias in Cybersecurity with Unlabeled Data},
booktitle = {33rd USENIX Security Symposium (USENIX Security 24)},
year = {2024},
isbn = {978-1-939133-44-1},
address = {Philadelphia, PA},
pages = {1741--1758},
url = {https://www.usenix.org/conference/usenixsecurity24/presentation/thirumuruganathan},
publisher = {USENIX Association},
month = aug
}
