Can I Hear Your Face? Pervasive Attack on Voice Authentication Systems with a Single Face Image

Authors: 

Nan Jiang, Bangjie Sun, and Terence Sim, National University of Singapore; Jun Han, KAIST

Abstract: 

We present Foice, a novel deepfake attack against voice authentication systems. Foice generates a synthetic voice of the victim from just a single image of the victim's face, without requiring any voice sample. This synthetic voice is realistic enough to fool commercial authentication systems. Since face images are generally easier to obtain than voice samples, Foice makes it easier for an attacker to mount large-scale attacks. The key idea is to learn the partial correlation between face and voice features, and then to add a face-independent voice feature sampled from a Gaussian distribution. We demonstrate the effectiveness of Foice with a comprehensive set of real-world experiments involving ten offline participants and an online dataset of 1029 unique individuals. By evaluating eight state-of-the-art systems, including WeChat's Voiceprint and Microsoft Azure, we show that all of these systems are vulnerable to the Foice attack.
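The key idea in the abstract can be illustrated with a toy sketch: a voice feature vector is formed as a face-dependent part plus a face-independent part drawn from a Gaussian. This is not the authors' implementation. The linear map, the feature dimensions, the noise scale, and all function names below are assumptions chosen only to make the decomposition concrete; the abstract does not specify the model that learns the face-to-voice correlation.

```python
import random


def face_dependent_part(face_features, weights):
    """Hypothetical stand-in for a learned face-to-voice mapping (a plain linear map)."""
    return [sum(w * x for w, x in zip(row, face_features)) for row in weights]


def sample_voice_features(face_features, weights, noise_std, rng):
    """Add a Gaussian, face-independent component to the face-dependent part."""
    dependent = face_dependent_part(face_features, weights)
    # One independent Gaussian draw per voice feature dimension (assumed zero mean).
    independent = [rng.gauss(0.0, noise_std) for _ in dependent]
    return [d + i for d, i in zip(dependent, independent)]


rng = random.Random(0)
# Placeholder values: a real system would use face features extracted from an image
# and weights obtained by training, neither of which is modeled here.
face_features = [rng.gauss(0.0, 1.0) for _ in range(8)]
weights = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]

# Several candidate voice feature vectors for the same face.
candidates = [sample_voice_features(face_features, weights, 0.1, rng) for _ in range(3)]
```

Because the face-independent part is sampled, the same face yields a different candidate on each draw, which is one way to read how a single image could produce multiple synthetic voices to try.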

BibTeX
@inproceedings {299701,
author = {Nan Jiang and Bangjie Sun and Terence Sim and Jun Han},
title = {Can I Hear Your Face? Pervasive Attack on Voice Authentication Systems with a Single Face Image},
booktitle = {33rd USENIX Security Symposium (USENIX Security 24)},
year = {2024},
isbn = {978-1-939133-44-1},
address = {Philadelphia, PA},
pages = {1045--1062},
url = {https://www.usenix.org/conference/usenixsecurity24/presentation/jiang-nan},
publisher = {USENIX Association},
month = aug
}
