GPT-3.5 HumanEval_R CodeForces2305 contamination based on https://arxiv.org/abs/2402.15938
## What are you reporting:
- [ ] Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)
**Evaluation dataset(s)**: Name(s) of the evaluation dataset(s). If available in the HuggingFace Hub please write the path (e.g. `uonlp/CulturaX`), otherwise provide a link to a paper, GitHub or dataset-card.
HumanEval_R, CodeForces2305. Both datasets are introduced in https://arxiv.org/pdf/2402.15938
**Contaminated model(s)**: Name of the model(s) (if any) that have been contaminated with the evaluation dataset. If available in the HuggingFace Hub please list the corresponding paths (e.g. `allenai/OLMo-7B`).
GPT-3.5-turbo-0613, GPT-3.5-turbo-1106
> You may also report instances where there is no contamination. In such cases, follow the previous instructions but report a contamination level of 0%.
## Briefly describe your method to detect data contamination
- [ ] Model-based approach
The cited paper constructs two new datasets:
1) CodeForces2305: 90 of the easiest programming problems published on the CodeForces website since May 2023.
2) HumanEval_R: a modified version of HumanEval. The changes include updated function signatures, requirements translated into German, French, and Chinese, and different public test cases selected for prompting, following the work by Dong et al. (2023a).
The paper addresses the difficulty of detecting data contamination in large language models (LLMs) by introducing CDD (Contamination Detection via output Distribution). CDD samples texts from the model and uses them to assess the peakedness of the LLM's output distribution, on the hypothesis that training data tends to make the output distribution more peaked and thus biases the model towards specific outputs.
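The peakedness idea can be illustrated with a minimal sketch. It is not the paper's reference implementation: it assumes that peakedness is estimated as the fraction of sampled outputs whose token-level edit distance to a reference output (for example, a greedy decode) is at most `alpha` times the longest sample length. The function names, the whitespace tokenization, and the default `alpha` are illustrative assumptions.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def peakedness(reference: Sequence[str],
               samples: List[Sequence[str]],
               alpha: float = 0.05) -> float:
    """Fraction of samples that lie close to the reference output.

    Assumption: "close" means an edit distance of at most
    alpha * (length of the longest sample). A value near 1.0 means the
    sampled outputs collapse onto one answer (a peaked distribution).
    """
    if not samples:
        raise ValueError("at least one sample is required")
    max_len = max(len(s) for s in samples)
    threshold = alpha * max_len
    close = sum(1 for s in samples if edit_distance(reference, s) <= threshold)
    return close / len(samples)


# Hypothetical usage: three samples repeat the reference, one differs.
reference = "def add ( a , b ) : return a + b".split()
samples = [
    reference,
    reference,
    reference,
    "def add ( x , y ) : return x + y".split(),
]
score = peakedness(reference, samples)
```

A contamination decision would then compare `score` against a chosen threshold; the threshold and the sampling setup should be taken from the paper rather than from this sketch.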
## Citation
Is there a paper that reports the data contamination or describes the method used to detect data contamination?
URL: `https://arxiv.org/pdf/2402.15938`
Citation: `@misc{dong2024generalization,
title={Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models},
author={Yihong Dong and Xue Jiang and Huanyu Liu and Zhi Jin and Ge Li},
year={2024},
eprint={2402.15938},
archivePrefix={arXiv},
primaryClass={cs.CL}
}`
*Important!* If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.
- Full name: Suryansh Sharma
- Institution: Indian Institute of Technology Kharagpur
- Email: suryansh.s@kgpian.iitkgp.ac.in
- contamination_report.csv +5 -0
@@ -717,3 +717,8 @@ zest;;EleutherAI/pile;;corpus;;;0.0;data-based;https://arxiv.org/abs/2310.20707;
717 | zest;;allenai/c4;;corpus;;;0.0;data-based;https://arxiv.org/abs/2310.20707;2
718 | zest;;oscar-corpus/OSCAR-2301;;corpus;;;0.0;data-based;https://arxiv.org/abs/2310.20707;2
719 | zest;;togethercomputer/RedPajama-Data-V2;;corpus;;;0.0;data-based;https://arxiv.org/abs/2310.20707;2
720 | +
721 | + HumanEval_R;;GPT-3.5-turbo;0613;model;;;9.76;model-based;https://arxiv.org/abs/2402.15938;
722 | + HumanEval_R;;GPT-3.5-turbo;1106;model;;;10.97;model-based;https://arxiv.org/abs/2402.15938;
723 | + CodeForces2305;;GPT-3.5-turbo;0613;model;;;0.0;model-based;https://arxiv.org/abs/2402.15938;
724 | + CodeForces2305;;GPT-3.5-turbo;1106;model;;;0.0;model-based;https://arxiv.org/abs/2402.15938;
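Rows like the ones added here can be sanity-checked before submission. The sketch below assumes only what is visible in this diff: fields are separated by `;`, every row has 11 fields, and the eighth field holds the contamination percentage. The file's real header is not shown, so the constants and the `check_rows` helper are hypothetical.

```python
import csv
import io
from typing import List

# Assumptions inferred from the rows in this diff, not from the file header.
EXPECTED_FIELDS = 11
PERCENT_INDEX = 7  # eighth field, e.g. 9.76

NEW_ROWS = """\
HumanEval_R;;GPT-3.5-turbo;0613;model;;;9.76;model-based;https://arxiv.org/abs/2402.15938;
HumanEval_R;;GPT-3.5-turbo;1106;model;;;10.97;model-based;https://arxiv.org/abs/2402.15938;
CodeForces2305;;GPT-3.5-turbo;0613;model;;;0.0;model-based;https://arxiv.org/abs/2402.15938;
CodeForces2305;;GPT-3.5-turbo;1106;model;;;0.0;model-based;https://arxiv.org/abs/2402.15938;
"""


def check_rows(text: str) -> List[str]:
    """Return a list of human-readable problems found in the given rows."""
    problems = []
    reader = csv.reader(io.StringIO(text), delimiter=";")
    for line_no, row in enumerate(reader, start=1):
        if not row:  # blank lines are allowed and skipped
            continue
        if len(row) != EXPECTED_FIELDS:
            problems.append(f"line {line_no}: expected {EXPECTED_FIELDS} "
                            f"fields, found {len(row)}")
            continue
        try:
            percent = float(row[PERCENT_INDEX])
        except ValueError:
            problems.append(f"line {line_no}: percentage is not a number")
            continue
        if not 0.0 <= percent <= 100.0:
            problems.append(f"line {line_no}: percentage {percent} out of range")
    return problems


issues = check_rows(NEW_ROWS)
```

An empty `issues` list means the rows are structurally consistent with the neighbouring entries; it does not verify the reported percentages themselves.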