VictorSanh committed 3449b08 (parent: 68ebcd0): update training datasets list

README.md CHANGED
@@ -61,15 +61,17 @@ We trained different variants of T0 with different mixtures of datasets.
 
 |Model|Training datasets|
 |--|--|
-|T0_11B|- Multiple-Choice QA: CommonsenseQA, DREAM, QUAIL, QuaRTz, Social IQA, WiQA, Cosmos, QASC, Quarel, SciQ, Wiki Hop<br>- Extractive QA: Adversarial QA, Quoref, TyDiQA, DuoRC, ROPES<br>- Closed-Book QA: Hotpot QA
-|T0p_11B|Same as T0_11B with
-|T0pp_11B|Same as T0p_11B with a few additional datasets from SuperGLUE:<br>- BoolQ<br>- COPA<br>- MultiRC<br>- ReCoRD<br>- WiC<br>- WSC|
+|T0_11B|- Multiple-Choice QA: CommonsenseQA, DREAM, QUAIL, QuaRTz, Social IQA, WiQA, Cosmos, QASC, Quarel, SciQ, Wiki Hop<br>- Extractive QA: Adversarial QA, Quoref, TyDiQA, DuoRC, ROPES<br>- Closed-Book QA: Hotpot QA*, Wiki QA<br>- Structure-To-Text: Common Gen, Wiki Bio<br>- Sentiment: Amazon, App Reviews, IMDB, Rotten Tomatoes, Yelp<br>- Summarization: CNN Daily Mail, Gigaword, MultiNews, SamSum, XSum<br>- Topic Classification: AG News, DBPedia, TREC<br>- Paraphrase Identification: MRPC, PAWS, QQP|
+|T0p_11B|Same as T0_11B with additional datasets from GPT-3's evaluation suite:<br>- Multiple-Choice QA: ARC, OpenBook QA, PiQA, RACE, HellaSwag<br>- Extractive QA: SQuAD v2<br>- Closed-Book QA: Trivia QA, Web Questions|
+|T0pp_11B|Same as T0p_11B with a few additional datasets from SuperGLUE (excluding NLI sets):<br>- BoolQ<br>- COPA<br>- MultiRC<br>- ReCoRD<br>- WiC<br>- WSC|
 |T0_11B_single_prompt|Same as T0_11B but only one prompt per training dataset|
 |T0_11B_original_task_only|Same as T0_11B but only original tasks templates|
 |T0_3B|Same as T0_11B but starting from a T5-LM XL (3B parameters) pre-trained model|
 
 For reproducibility, we release the data we used for training (and evaluation) in the [P3 dataset](TODO). Prompt examples can be found on the dataset page.
 
+*: We recast Hotpot QA as closed-book QA due to its long input sequence length.
+
 # Evaluation data
 
 We systematically evaluate our models on a suite of held-out tasks:
@@ -82,20 +84,20 @@ We systematically evaluate our models on a suite of held-out tasks:
 |Sentence completion|COPA, HellaSwag, Story Cloze|
 
 We also evaluate T0_11B, T0p_11B and T0pp_11B on a subset of the [BIG-bench benchmark](https://github.com/google/BIG-bench):
-- Language
-- VitaminC
+- Code description task
+- Conceptual combinations
+- Hindu knowledge json
+- Known unknowns
+- Language identification
+- Logic grid puzzle task
+- Logical deduction
+- Common misconceptions
+- Movie dialog same or different
+- Novel concepts
+- Strategyqa
+- Formal fallacies syllogisms negation
+- VitaminC
+- Winowhy multiple choice
 
 # Limitations
 
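The updated README points readers to the P3 dataset for the released training and evaluation prompts, though the link is still a TODO in this commit. As a rough illustration, here is a minimal sketch of peeking at one prompted subset with the `datasets` library; the `bigscience/P3` repo ID, the config name, and the `*_pretokenized` field names are assumptions for illustration, not details taken from this commit.

```python
from datasets import load_dataset

# Assumed Hub repo ID; the README link is still marked TODO in this commit.
P3_REPO = "bigscience/P3"

# Hypothetical config name: each P3 config is expected to correspond to one
# (dataset, prompt template) pair. Check the dataset page for the real names.
CONFIG = "super_glue_copa_best_option"

# Stream the split to avoid downloading the whole subset just to inspect it.
ds = load_dataset(P3_REPO, CONFIG, split="train", streaming=True)

example = next(iter(ds))
print(sorted(example.keys()))  # inspect which fields this config provides

# Prompted input/target text is assumed to live in string fields; the lookup
# is guarded because the exact field names are not documented in this commit.
for field in ("inputs_pretokenized", "targets_pretokenized"):
    if field in example:
        print(f"{field}: {example[field][:200]!r}")
```

Dropping `streaming=True` would materialize the split locally instead of iterating over it lazily.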