Add data from WIMBD paper
#2 opened by OSainz
What are you reporting:
- Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
- Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)
Evaluation dataset(s):
- UCLNLP/adversarial_qa
- aeslc
- amazon_reviews_multi
- billsum
- cosmos_qa
- crows_pairs
- ibm/duorc
- esnli
- gigaword
- glue
- head_qa
- health_fact
- hlgd
- liar
- math_dataset
- math_qa
- mc_taco
- mocha
- openai_humaneval
- paws-x
- paws
- piqa
- race
- allenai/ropes
- samsum
- scan
- allenai/scicite
- scitail
- sem_eval_2014_task_1
- sick
- snli
- squadshifts
- stsb_multi_mt
- subjqa
- super_glue
- swag
- tab_fact
- wiki_qa
- winograd_wsc
- winogrande
- xnli
- xsum
- zest
Contaminated model(s): None
Contaminated corpora:
- allenai/c4
- oscar-corpus/OSCAR-2301
- EleutherAI/pile
- togethercomputer/RedPajama-Data-V2
Contaminated split(s): Test splits
Briefly describe your method to detect data contamination
- Data-based approach
- Model-based approach
Description of your method, 3-4 sentences. Evidence of data contamination (Read below):
Contamination was detected with a data-based approach: we used the WIMBD tool to identify evaluation data in the corpora listed above. See Section 4.1.1 in the paper.
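To illustrate what a data-based check can look like, here is a minimal sketch. It is not the WIMBD implementation; it assumes a simple criterion (an example counts as contaminated when all of its text fields occur verbatim, after lowercasing and whitespace normalization, in a single document), and the function names `is_contaminated` and `contamination_rate` are hypothetical.

```python
def _normalize(text):
    # Lowercase and collapse whitespace so trivial formatting differences do not hide a match.
    return " ".join(text.lower().split())


def is_contaminated(example_fields, document):
    """Return True if every field of the example occurs verbatim in the document.

    Assumption: exact substring match on normalized text; a real tool would
    use an index over the corpus instead of scanning each document.
    """
    doc = _normalize(document)
    return all(_normalize(field) in doc for field in example_fields)


def contamination_rate(test_examples, corpus_documents):
    """Fraction of test examples whose fields all appear together in at least one document."""
    if not test_examples:
        return 0.0
    hits = sum(
        any(is_contaminated(example, doc) for doc in corpus_documents)
        for example in test_examples
    )
    return hits / len(test_examples)


if __name__ == "__main__":
    docs = ["The premise is here. And the Hypothesis follows.", "unrelated text"]
    examples = [["the premise is here", "the hypothesis follows"], ["missing", "text"]]
    print(contamination_rate(examples, docs))
```

Scanning every document for every example is only practical on toy inputs; at the scale of the corpora listed above, a search index is needed.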
Citation
Is there a paper that reports the data contamination or describes the method used to detect data contamination?
URL: https://openreview.net/forum?id=RvfPnOkPV4
@inproceedings{
elazar2024whats,
title={What's In My Big Data?},
author={Yanai Elazar and Akshita Bhagia and Ian Helgi Magnusson and Abhilasha Ravichander and Dustin Schwenk and Alane Suhr and Evan Pete Walsh and Dirk Groeneveld and Luca Soldaini and Sameer Singh and Hannaneh Hajishirzi and Noah A. Smith and Jesse Dodge},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=RvfPnOkPV4}
}
OSainz changed pull request status to merged