CodexLeaks: Privacy Leaks from Code Generation Language Models in GitHub Copilot

Liang Niu; Shujaat Mirza; Zayd Maradni; Christina Pöpper

Authors:

Liang Niu and Shujaat Mirza, New York University; Zayd Maradni and Christina Pöpper, New York University Abu Dhabi

Abstract:

Code generation language models are trained on billions of lines of source code to provide code generation and auto-completion features, such as those offered by the code assistant GitHub Copilot, which has more than a million users. The training datasets may contain sensitive personal information (personally identifiable, private, or secret) that these models may regurgitate.

This paper introduces and evaluates a semi-automated pipeline for extracting sensitive personal information from the Codex model used in GitHub Copilot. We employ carefully designed templates to construct prompts that are more likely to result in privacy leaks. Because the training data is not public, we propose a semi-automated filtering method using a blind membership inference attack. We validate the effectiveness of our membership inference approach on different code generation models. We use the hit rate obtained through the GitHub Search API as a distinguishing heuristic, followed by human-in-the-loop evaluation, and find that approximately 8% (43) of the prompts yield privacy leaks. Notably, we observe that the model tends to produce indirect leaks, compromising privacy as contextual integrity by generating information from individuals closely related to the queried subject in the training corpus.
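
The filtering step mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: "hit rate" is read as the fraction of a prompt's generated outputs that a code search finds at least once, the 0.5 threshold is hypothetical, and `count_hits` is a stand-in callable rather than a real GitHub Search API client. The paper itself may define and apply the heuristic differently.

```python
from typing import Callable, Dict, List


def hit_rate(outputs: List[str], count_hits: Callable[[str], int]) -> float:
    """Fraction of generated outputs that return at least one search hit.

    Assumption: this reading of "hit rate" is illustrative only; the
    paper may define the quantity differently.
    """
    if not outputs:
        return 0.0
    found = sum(1 for text in outputs if count_hits(text) > 0)
    return found / len(outputs)


def flag_for_review(
    generations: Dict[str, List[str]],
    count_hits: Callable[[str], int],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[str]:
    """Return prompts whose outputs look memorized, for manual checking."""
    return [
        prompt
        for prompt, outputs in generations.items()
        if hit_rate(outputs, count_hits) >= threshold
    ]


# Stand-in for a real search backend; the strings and counts are made up.
fake_index = {"alice@example.com": 3, "555-0100": 1}


def fake_count_hits(text: str) -> int:
    return fake_index.get(text, 0)


generations = {
    "prompt_a": ["alice@example.com", "555-0100"],
    "prompt_b": ["zzz-not-found", "alice@example.com", "qqq"],
}
flagged = flag_for_review(generations, fake_count_hits)
```

In this toy run only `prompt_a` is flagged, since both of its outputs are found by the stand-in search, while only one of the three outputs for `prompt_b` is. Passing the search function in as a parameter keeps the sketch runnable offline; flagged prompts would then go to a human reviewer, mirroring the human-in-the-loop step the abstract describes.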

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX

@inproceedings {291327,
author = {Liang Niu and Shujaat Mirza and Zayd Maradni and Christina P{\"o}pper},
title = {{CodexLeaks}: Privacy Leaks from Code Generation Language Models in {GitHub} Copilot},
booktitle = {32nd USENIX Security Symposium (USENIX Security 23)},
year = {2023},
isbn = {978-1-939133-37-3},
address = {Anaheim, CA},
pages = {2133--2150},
url = {https://www.usenix.org/conference/usenixsecurity23/presentation/niu},
publisher = {USENIX Association},
month = aug
}
