CONDA-Workshop/Data-Contamination-Database · Add data from WIMBD paper

OSainz

Workshop on Data Contamination org Mar 22

edited Mar 24

What are you reporting:

Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

Evaluation dataset(s):

UCLNLP/adversarial_qa
aeslc
amazon_reviews_multi
billsum
cosmos_qa
crows_pairs
ibm/duorc
esnli
gigaword
glue
head_qa
health_fact
hlgd
liar
math_dataset
math_qa
mc_taco
mocha
openai_humaneval
paws-x
paws
piqa
race
allenai/ropes
samsum
scan
allenai/scicite
scitail
sem_eval_2014_task_1
sick
snli
squadshifts
stsb_multi_mt
subjqa
super_glue
swag
tab_fact
wiki_qa
winograd_wsc
winogrande
xnli
xsum
zest

Contaminated model(s): None

Contaminated corpora:

allenai/c4
oscar-corpus/OSCAR-2301
EleutherAI/pile
togethercomputer/RedPajama-Data-V2

Contaminated split(s): Test splits

Briefly describe your method to detect data contamination

Data-based approach
Model-based approach

Description of your method, 3-4 sentences. Evidence of data contamination (Read below):

We detected contamination with a data-based approach: we used the WIMBD tool to identify contamination of the test splits in the corpora listed above. See Section 4.1.1 of the paper.
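To illustrate what a data-based check looks like in general, here is a minimal sketch. It does not use the WIMBD tool or its API; the function names (`normalize`, `is_contaminated`, `contamination_rate`) and the matching rule (an example counts as contaminated when all of its non-empty fields occur verbatim, after lowercasing and whitespace normalization, in a single corpus document) are assumptions made for illustration only. A real corpus would need an index rather than a linear scan.

```python
from typing import Iterable, List, Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not block a match."""
    return " ".join(text.lower().split())


def is_contaminated(example_fields: Sequence[str], normalized_doc: str) -> bool:
    """Return True if every non-empty field of the example occurs in the document.

    Assumption (illustrative only): `normalized_doc` has already been passed
    through `normalize`, and matching is plain substring search.
    """
    fields = [normalize(f) for f in example_fields if f.strip()]
    # An example with no usable fields is never counted as contaminated.
    return bool(fields) and all(f in normalized_doc for f in fields)


def contamination_rate(
    test_examples: Iterable[Sequence[str]],
    corpus_documents: Iterable[str],
) -> float:
    """Fraction of test examples whose fields all appear together in one document."""
    docs: List[str] = [normalize(d) for d in corpus_documents]
    examples = list(test_examples)
    if not examples:
        return 0.0
    hits = sum(
        1 for ex in examples if any(is_contaminated(ex, doc) for doc in docs)
    )
    return hits / len(examples)


if __name__ == "__main__":
    # Hypothetical toy data: each example is (input, output).
    corpus = [
        "Premise: A man is sleeping. Hypothesis: A man is awake. Label: contradiction",
        "Completely unrelated web text.",
    ]
    examples = [
        ("A man is sleeping.", "A man is awake."),
        ("A dog runs.", "An animal moves."),
    ]
    print(contamination_rate(examples, corpus))
```

Requiring all fields to appear in the same document is a deliberately strict choice: it avoids flagging an example only because its input text is common on the web, at the cost of missing paraphrased or reformatted copies.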

Citation

Is there a paper that reports the data contamination or describes the method used to detect data contamination?

URL: https://aclanthology.org/2023.findings-emnlp.722/

@inproceedings{elazar2024whats,
  title={What's In My Big Data?},
  author={Yanai Elazar and Akshita Bhagia and Ian Helgi Magnusson and Abhilasha Ravichander and Dustin Schwenk and Alane Suhr and Evan Pete Walsh and Dirk Groeneveld and Luca Soldaini and Sameer Singh and Hannaneh Hajishirzi and Noah A. Smith and Jesse Dodge},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=RvfPnOkPV4}
}

Add data from WIMBD paper (commit 6b9f531f)

OSainz changed pull request status to merged Mar 24