Bag of Tricks for Training Data Extraction from Language Models
Yu, Weichen1,2; Pang, Tianyu1; Liu, Qian1; Du, Chao1; Kang, Bingyi1; Huang, Yan2; Yan, Shuicheng1
2023-05
Conference date: 2023-07
Venue: Hawaii, US
Abstract

With the advance of language models, privacy protection is receiving more attention. Training data extraction is therefore of great importance, as it can serve as a potential tool to assess privacy leakage. However, due to the difficulty of this task, most of the existing methods are proof-of-concept and still not effective enough. In this paper, we investigate and benchmark tricks for improving training data extraction using a publicly available dataset. Because most existing extraction methods use a pipeline of generating-then-ranking, i.e., generating text candidates as potential training data and then ranking them based on specific criteria, our research focuses on the tricks for both text generation (e.g., sampling strategy) and text ranking (e.g., token-level criteria). The experimental results show that several previously overlooked tricks can be crucial to the success of training data extraction. Based on the GPT-Neo 1.3B evaluation results, our proposed tricks outperform the baseline by a large margin in most cases, providing a much stronger baseline for future research. The code is available at https://github.com/weichen-yu/LM-Extraction.
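The generating-then-ranking pipeline described above can be sketched in a few lines. The snippet below is an illustrative toy, not the paper's implementation: a smoothed unigram model stands in for GPT-Neo, and the function names (`generate_candidates`, `score_candidate`, `extract`) are hypothetical. It shows the two stages the abstract names: sampling candidate sequences from the model, then ranking them by a token-level criterion (average negative log-likelihood, where lower scores suggest memorized training data).

```python
import math
import random

# Toy "training corpus"; a unigram model over it stands in for the LM.
TRAINING_CORPUS = "the secret code is 1234 . the cat sat on the mat ."
COUNTS = {}
for tok in TRAINING_CORPUS.split():
    COUNTS[tok] = COUNTS.get(tok, 0) + 1
TOTAL = sum(COUNTS.values())
VOCAB = list(COUNTS)

def score_candidate(tokens):
    """Token-level ranking criterion: average negative log-likelihood.
    Lower = the model finds the sequence more likely (candidate leak)."""
    nll = 0.0
    for tok in tokens:
        p = (COUNTS.get(tok, 0) + 0.5) / (TOTAL + 0.5 * len(VOCAB))  # smoothed
        nll -= math.log(p)
    return nll / len(tokens)

def generate_candidates(n, length, rng):
    """Generation stage: sample candidate token sequences from the model."""
    weights = [COUNTS[t] for t in VOCAB]
    return [rng.choices(VOCAB, weights=weights, k=length) for _ in range(n)]

def extract(n_candidates=50, length=6, seed=0):
    rng = random.Random(seed)
    candidates = generate_candidates(n_candidates, length, rng)
    # Ranking stage: keep the candidates the model scores as most likely.
    return sorted(candidates, key=score_candidate)

if __name__ == "__main__":
    best = extract()[0]
    print(" ".join(best))
```

The paper's tricks slot into exactly these two stages: the sampling strategy changes how `generate_candidates` draws sequences, and the token-level criteria change what `score_candidate` computes.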

Published in: International Conference on Machine Learning (ICML)
Language: English
Content type: Conference paper
Source URL: [http://ir.ia.ac.cn/handle/173211/52305]
Research group: Institute of Automation, Center for Research on Intelligent Perception and Computing
Corresponding authors: Pang, Tianyu; Huang, Yan
Affiliations: 1. SEA AI Lab; 2. Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714):
Yu, Weichen, Pang, Tianyu, Liu, Qian, et al. Bag of Tricks for Training Data Extraction from Language Models[C]. In: International Conference on Machine Learning (ICML). Hawaii, US. 2023-07.
