Improving ML-based Binary Function Similarity Detection by Assessing and Deprioritizing Control Flow Graph Features

Authors: 

Jialai Wang, Tsinghua University; Chao Zhang, Tsinghua University and Zhongguancun Laboratory; Longfei Chen and Yi Rong, Tsinghua University; Yuxiao Wu, Huazhong University of Science and Technology; Hao Wang, Wende Tan, and Qi Li, Tsinghua University; Zongpeng Li, Tsinghua University and Quancheng Labs

Abstract: 

Machine learning-based binary function similarity detection (ML-BFSD) has made significant progress recently. Existing solutions often choose the control flow graph (CFG) as an important feature to learn from functions, as CFGs characterize the control dependencies between basic blocks of code. However, the exact role of CFGs in model decisions has not been explored, and the extent to which CFGs might lead to model errors is unknown. This work takes a first step towards assessing the role of CFGs in ML-BFSD solutions both theoretically and practically, and improves their performance accordingly. First, we adapt existing explanation methods to interpret ML-BFSD solutions, and theoretically reveal that existing models rely heavily on CFG features. Then, we design a solution, deltaCFG, to manipulate CFGs and practically demonstrate the lack of robustness of existing models. We have extensively evaluated deltaCFG on 11 state-of-the-art (SOTA) ML-BFSD solutions, and find that the models' results flip if we manipulate the query functions' CFGs while preserving their semantics, showing that most models are biased toward CFG features. Our theoretical and practical assessment solutions can also serve as a robustness validator for the development of future ML-BFSD solutions. Lastly, we present a solution that uses deltaCFG to augment training data, which helps deprioritize CFG features and enhance the performance of existing ML-BFSD solutions. Evaluation results show that the MRR, Recall@1, AUC, and F1 score of existing models are improved by up to 10.1%, 12.7%, 5.1%, and 27.2%, respectively, proving that reducing the models' bias on CFG features can improve their performance.
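
Two of the ranking metrics named in the abstract, MRR and Recall@1, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names and the sample ranks are hypothetical. It assumes each query function has exactly one true match in the candidate pool, and that the model's ranking gives that match a 1-based position.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank, where ranks[i] is the 1-based position
    of the true match in the ranked candidate list for query i."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k(ranks: Sequence[int], k: int = 1) -> float:
    """Fraction of queries whose true match appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the true match for four query functions.
ranks = [1, 2, 1, 4]
mrr = mean_reciprocal_rank(ranks)
r1 = recall_at_k(ranks, k=1)
print(f"MRR={mrr:.4f} Recall@1={r1:.2f}")
```

Under these assumptions, a model whose ranking changes when a query function's CFG is altered without changing its semantics would push the true match further down the list, lowering both values.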

BibTeX
@inproceedings {299641,
author = {Jialai Wang and Chao Zhang and Longfei Chen and Yi Rong and Yuxiao Wu and Hao Wang and Wende Tan and Qi Li and Zongpeng Li},
title = {Improving {ML-based} Binary Function Similarity Detection by Assessing and Deprioritizing Control Flow Graph Features},
booktitle = {33rd USENIX Security Symposium (USENIX Security 24)},
year = {2024},
isbn = {978-1-939133-44-1},
address = {Philadelphia, PA},
pages = {4265--4282},
url = {https://www.usenix.org/conference/usenixsecurity24/presentation/wang-jialai},
publisher = {USENIX Association},
month = aug
}
