Ruofan Liu, Shanghai Jiao Tong University/National University of Singapore; Yun Lin, Shanghai Jiao Tong University; Xiwen Teoh, National University of Singapore; Gongshen Liu, Shanghai Jiao Tong University; Zhiyong Huang and Jin Song Dong, National University of Singapore
Phishing, a pervasive form of social engineering attack that compromises user credentials, has led to significant financial losses and undermined public trust. Modern phishing detection has gravitated to reference-based methods for their explainability and robustness against zero-day phishing attacks. These methods maintain and update predefined reference lists to specify domain-brand relationships, alarming phishing websites by the inconsistencies between their domain (e.g., payp0l.com) and intended brand (e.g., PayPal). However, the curated lists are largely limited by their lack of comprehensiveness and high maintenance costs in practice.
In this work, we present PhishLLM as a novel reference-based phishing detector that operates without an explicit pre-defined reference list. Our rationale lies in that modern LLMs have encoded far more extensive brand-domain information than any predefined list. Further, the detection of many webpage semantics such as credential-taking intention analysis is more like a linguistic problem, but they are processed as a vision problem now. Thus, we design PhishLLM to decode (or retrieve) the domain-brand relationships from LLM and effectively parse the credential-taking intention of a webpage, without the cost of maintaining and updating an explicit reference list. Moreover, to control the hallucination of LLMs, we introduce a search-engine-based validation mechanism to remove the misinformation. Our extensive experiments show that PhishLLM significantly outperforms state-of-the-art solutions such as Phishpedia and PhishIntention, improving the recall by 21% to 66%, at the cost of negligible precision. Our field studies show that PhishLLM discovers (1) 6 times more zero-day phishing webpages compared to existing approaches such as PhishIntention and (2) close to 2 times more zero-day phishing webpages even if it is enhanced by DynaPhish. Our code is available at https://github.com/code-philia/PhishLLM/.
Open Access Media
USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.
author = {Ruofan Liu and Yun Lin and Xiwen Teoh and Gongshen Liu and Zhiyong Huang and Jin Song Dong},
title = {Less Defined Knowledge and More True Alarms: Reference-based Phishing Detection without a Pre-defined Reference List},
booktitle = {33rd USENIX Security Symposium (USENIX Security 24)},
year = {2024},
isbn = {978-1-939133-44-1},
address = {Philadelphia, PA},
pages = {523--540},
url = {https://www.usenix.org/conference/usenixsecurity24/presentation/liu-ruofan},
publisher = {USENIX Association},
month = aug
}