Added Contamination Evidence on MMLU and Other Benchmarks for GPT-3.5 (ChatGPT)/GPT-4 from "Investigating data contamination in modern benchmarks for large language models"
What are you reporting:
- Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
- Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)
Contaminated Evaluation Dataset(s):
cais/mmlu
winogrande
truthful_qa
allenai/openbookqa
Rowan/hellaswag
Contaminated model(s): GPT-3.5 (ChatGPT), GPT-4, LLaMA 2-13B, Mistral-7B
Approach:
- Data-based approach
- Model-based approach
Description of your method, 3-4 sentences. Evidence of data contamination (Read below):
The paper masks an incorrect answer choice in each test instance and asks the model to fill in the masked option; contamination is measured by how often the model reproduces the original wrong option exactly (Exact Match). Because the space of plausible wrong options is large, effectively unbounded, a model generating the exact wrong option verbatim is convincing evidence of contamination.
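To make the test concrete, here is a minimal sketch of this masked-wrong-option check. It is an illustration rather than the paper's implementation: the prompt wording, the normalization, and the hypothetical `query_model` helper are my own assumptions.

```python
# Minimal sketch (assumed details, not the paper's exact code) of the
# masked-wrong-option test: hide one incorrect choice, ask the model to
# guess it, and count an exact match as contamination evidence.
import string

LETTERS = ["A", "B", "C", "D"]

def normalize(text: str) -> str:
    """Lowercase and strip punctuation/whitespace before comparing strings."""
    text = text.lower().strip()
    return text.translate(str.maketrans("", "", string.punctuation)).strip()

def build_masked_prompt(question: str, options: list[str], masked_idx: int) -> str:
    """Show the question with one incorrect option replaced by [MASK]."""
    shown = [opt if i != masked_idx else "[MASK]" for i, opt in enumerate(options)]
    lines = [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(shown)]
    return (
        "Fill in the [MASK] option of this multiple-choice question.\n"
        f"Question: {question}\n" + "\n".join(lines) + "\nMasked option:"
    )

def is_exact_match(prediction: str, masked_option: str) -> bool:
    """Contamination signal: the model reproduces the masked wrong option verbatim."""
    return normalize(prediction) == normalize(masked_option)

# Usage, with a hypothetical `query_model(prompt)` call to the evaluated model:
# prompt = build_masked_prompt(example["question"], example["choices"], masked_idx=2)
# contaminated_hit = is_exact_match(query_model(prompt), example["choices"][2])
```

Aggregating this exact-match rate over a benchmark's test set yields the per-dataset contamination percentages discussed below (e.g. the 1-2% vs. >3% thresholds from Table 3).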
Citation
Is there a paper that reports the data contamination or describes the method used to detect data contamination? Yes
url: https://arxiv.org/abs/2311.09783
@article{deng2023investigating,
  title={Investigating data contamination in modern benchmarks for large language models},
  author={Deng, Chunyuan and Zhao, Yilun and Tang, Xiangru and Gerstein, Mark and Cohan, Arman},
  journal={arXiv preprint arXiv:2311.09783},
  year={2023}
}
Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.
Full name: Ameya Prabhu
Institution: Tübingen AI Center, University of Tübingen
Email: ameya@prabhu.be
Hi @AmeyaPrabhu !
Thanks for the contribution :) For consistency reasons we label "ChatGPT" as "GPT-3.5", because that is the name of the underlying model. Could you please change it?
Best,
Oscar
Changed! Thanks for the quick feedback. I have additionally reported all contaminations > 3% (instead of my previous 50% threshold) on the datasets from Table 3, and added the details of the additional datasets to the reporting comment above for clarity. Should I add the 1% and 2% contaminations as well? I suspect those could be overly common sentences showing up as false positives.
Regards,
Ameya
I think it is okay to add those too. In addition to contamination evidence, we welcome evidence for lack of contamination. If someone does not find their dataset in the database, it just means that no reports have been made, not that it is necessarily free of contamination.
I will add the 1% and 2% contamination values as well!
On evidence for lack of contamination: I worry that just because a certain method did not find contamination, it does not mean the contamination doesn't exist; definite cases of contamination, on the other hand, can be found by any method.
I worry that if people see here that some downstream benchmark is clean, they might infer that it is definitely not contaminated, which might not be true. What do you think? I am quite new to this domain; I was reading related literature and thought contributing the evidence I found while reading papers would be useful here.
We accept multiple submissions for the same dataset and source pair, so if someone finds contamination for a given pair in the future, it will be reflected. In this database we do not plan to judge the evidence, but rather to gather the evidence that is dispersed across different papers and reports. That is why we report the source of the method together with the evidence.
Hi! I added the numbers for all the non-zero cases in the table and updated the report. This should be ready to merge. Could you check and confirm?