CONDA-Workshop/Data-Contamination-Database · Update contamination

Update contamination_report.csv 502b10ab

suryanshs16103

May 17

What are you reporting:

Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

Evaluation dataset(s): openai_humaneval

Contaminated model(s): gpt-3.5-turbo-1106, gpt-3.5-turbo-0613

Contaminated split(s): 41.47%, 23.79%

Briefly describe your method to detect data contamination

Model-based approach

Model-based approaches

The cited paper reports that ChatGPT shows high contamination levels when tested on the HumanEval dataset. This is evident from its high Average Peak and Leak Ratio values, especially in comparison with the clean CodeForces2305 dataset, where ChatGPT's performance drops. The paper's TED method is reported to be effective in identifying and mitigating these contamination issues. The values can be verified in Table 5 of the cited paper.

Citation

Is there a paper that reports the data contamination or describes the method used to detect data contamination?

URL: https://arxiv.org/pdf/2402.15938
Citation:
@misc{dong2024generalization,
  title={Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models},
  author={Yihong Dong and Xue Jiang and Huanyu Liu and Zhi Jin and Ge Li},
  year={2024},
  eprint={2402.15938},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.

Full name: Suryansh Sharma
Institution: Indian Institute of Technology Kharagpur
Email: suryansh.s@kgpian.iitkgp.ac.in

OSainz

Workshop on Data Contamination org May 20

Hi @suryanshs16103,

The evidence you are trying to add is already in the database. Please check this PR.

Before creating a new PR, please check whether the evidence you want to add is already in the database. I will close this PR.

Best,
Oscar

OSainz changed pull request status to closed May 20