Contamination results based on "Data Contamination Quiz"
What are you reporting:
- Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
- Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)
Evaluation dataset(s): Name(s) of the evaluation dataset(s). If available in the HuggingFace Hub please write the path (e.g. uonlp/CulturaX), otherwise provide a link to a paper, GitHub repository, or dataset card.
- imdb
- ag_news
- yelp_review_full
- nyu-mll/glue (rte)
- nyu-mll/glue (wnli)
- samsum
- EdinburghNLP/xsum
- openai_humaneval
- ucinlp/drop
- gsm8k
Contaminated model(s): Name of the model(s) (if any) that have been contaminated with the evaluation dataset. If available in the HuggingFace Hub please list the corresponding paths (e.g. allenai/OLMo-7B).
- GPT-4
- GPT-3.5
Contaminated corpora: N/A
Contaminated split(s): If the dataset has Train, Development and/or Test splits, please report the contaminated split(s). You can report a percentage of the dataset contaminated; if the entire dataset is compromised, report 100%.
- imdb/test: {GPT-4: 82.00, GPT-3.5: 55.00}
- ag_news/test: {GPT-4: 91.00, GPT-3.5: 82.00}
- yelp_review_full/test: {GPT-4: 80.00, GPT-3.5: 13.00}
- nyu-mll/glue (rte)/validation: {GPT-4: 60.00, GPT-3.5: 71.00}
- nyu-mll/glue (wnli)/validation: {GPT-4: 50.70, GPT-3.5: 12.68}
- samsum/test: {GPT-4: 77.00, GPT-3.5: 74.00}
- EdinburghNLP/xsum/test: {GPT-4: 95.00, GPT-3.5: 79.00}
- openai_humaneval/test: {GPT-4: 56.71}
- ucinlp/drop/validation: {GPT-4: 44.00}
- gsm8k/train: {GPT-4: 79.00}
You may also report instances where there is no contamination. In such cases, follow the previous instructions but report a contamination level of 0%.
Briefly describe your method to detect data contamination
- Data-based approach
- Model-based approach
Description of your method, 3-4 sentences. Evidence of data contamination (Read below):
The results are based on the findings of Golchin and Surdeanu (2023), who estimated contamination levels using a quiz-based method. The quiz directs an LLM to pick the option containing the original dataset instance from among three word-level perturbations of it. A correct identification indicates the LLM's prior exposure to the data. The quiz accuracy thus serves as the estimated contamination level of a specific dataset partition for the LLM taking the quiz.
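The quiz construction described above can be sketched as follows. This is an illustrative Python sketch, not the authors' exact implementation: the prompt wording, option labels, and helper names are assumptions, and generating the word-level perturbations themselves is out of scope here.

```python
import random

def build_quiz(original: str, perturbations: list[str], rng: random.Random):
    """Assemble one quiz item: the original dataset instance hidden among
    its word-level perturbations, presented in random order."""
    options = perturbations + [original]
    rng.shuffle(options)
    labels = "ABCD"
    gold = labels[options.index(original)]  # label of the correct option
    prompt = (
        "Which option is an exact instance from the dataset?\n"
        + "\n".join(f"{labels[i]}) {opt}" for i, opt in enumerate(options))
    )
    return prompt, gold

def contamination_level(model_answers: list[str], gold_answers: list[str]) -> float:
    """Quiz accuracy (as a percentage) over a dataset partition, taken as
    the estimated contamination level for the quizzed LLM."""
    correct = sum(m == g for m, g in zip(model_answers, gold_answers))
    return 100.0 * correct / len(gold_answers)
```

For example, if a model answers 3 of 4 quiz items correctly, `contamination_level` reports 75.0 for that partition.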
Citation
Is there a paper that reports the data contamination or describes the method used to detect data contamination?
URL: https://arxiv.org/abs/2311.06233
Citation:
@article{DBLP:journals/corr/abs-2311-06233,
author = {Shahriar Golchin and
Mihai Surdeanu},
title = {Data Contamination Quiz: {A} Tool to Detect and Estimate Contamination
in Large Language Models},
journal = {CoRR},
volume = {abs/2311.06233},
year = {2023},
url = {https://doi.org/10.48550/arXiv.2311.06233},
doi = {10.48550/ARXIV.2311.06233},
eprinttype = {arXiv},
eprint = {2311.06233},
timestamp = {Wed, 15 Nov 2023 16:23:10 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2311-06233.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.
- Full name: Shahriar Golchin, Mihai Surdeanu
- Institution: University of Arizona
- Email: golchin@arizona.edu; msurdeanu@arizona.edu
Hi @shahriargolchin ,
Seems that some of the evidence overlaps with previous evidence already in the database. Particularly these:
- imdb (line 449-450)
- ag_news (line 452-453)
- yelp_review_full (line 455-456)
- rte (line 458-459)
- wnli (line 461-462)
- samsum (line 464-465)
- xsum (line 467-468)
Could you please remove the outdated evidence? I think I added them from your paper "Time Travel in LLMs", but there were no specific numbers, if I recall correctly.
Best,
Oscar
Hi @OSainz ,
Thanks for your reply. To clarify, the results from the Time Travel paper report contamination at the partition level in a binary manner, i.e., they indicate whether a dataset partition (e.g., the IMDB train set) is contaminated or not. The results from the Data Contamination Quiz, however, are estimates of contamination levels, so the two methods differ in the nature of their detection. With that in mind, I was wondering whether you would prefer to collect this information separately or replace the existing entries with the new information obtained from the Data Contamination Quiz.
Thank you,
Shahriar
Okay, we can have duplicates if they are computed by different methods (and/or reported by different sources).
(Comment edited because I changed my mind)
Thanks again @shahriargolchin for your contribution. We are merging to main.
Perfect! Thank you, @OSainz