Access to this model has been disabled
Given its research scope, intentionally using the model to generate harmful content (non-exhaustive examples: hate speech, spam generation, fake news, harassment and abuse, disparagement, and defamation), as well as deploying it on any website where bots are prohibited, is considered a misuse of this model. Head over to the Community page for further discussion and potential next steps.
GPT-4chan
Hugging Face has decided to permanently disable access to this model on the hub.
A complete model card can be found at https://ykilcher.com/gpt-4chan-model-card
Model Description
GPT-4chan is a language model fine-tuned from GPT-J 6B on 3.5 years' worth of data from 4chan's politically incorrect (/pol/) board.
Training data
GPT-4chan was fine-tuned on the dataset Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board.
Training procedure
The model was trained for 1 epoch following GPT-J's fine-tuning guide.
Intended Use
GPT-4chan is trained on anonymously posted and sparsely moderated discussions of political topics. Its intended use is to reproduce text according to the distribution of its input data. It may also be a useful tool to investigate discourse in such anonymous online communities. Lastly, it has potential applications in tasks such as toxicity detection, as initial experiments show promising zero-shot results when comparing a string's likelihood under GPT-4chan to its likelihood under GPT-J 6B.
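As a rough sketch of that likelihood comparison (an illustration only, not a validated detector; it assumes both checkpoints fit in memory and scores a string by its total token log-likelihood under each model):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def log_likelihood(model, tokenizer, text):
    # Total log-probability of the tokens in `text` under the model.
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # The returned loss is the mean cross-entropy over the predicted tokens.
        loss = model(input_ids, labels=input_ids).loss
    return -loss.item() * (input_ids.shape[1] - 1)

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
gpt4chan = AutoModelForCausalLM.from_pretrained("ykilcher/gpt-4chan")
gptj = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

text = "some text to score"
# A string that is much more likely under GPT-4chan than under GPT-J 6B
# can be treated as a (zero-shot, unvalidated) toxicity signal.
score = log_likelihood(gpt4chan, tokenizer, text) - log_likelihood(gptj, tokenizer, text)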
How to use
The following is copied from the Hugging Face documentation on GPT-J. Refer to the original for more details.
For inference parameters, we recommend a temperature of 0.8, along with either a top_p of 0.8 or a typical_p of 0.3.
For the float32 model (CPU):
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-4chan reuses the GPT-J 6B tokenizer.
model = AutoModelForCausalLM.from_pretrained("ykilcher/gpt-4chan")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
prompt = (
    "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
    "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
    "researchers was the fact that the unicorns spoke perfect English."
)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    max_length=100,
)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
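The example above follows the GPT-J documentation and uses top_p=0.9. To follow the sampling recommendation given earlier instead (a sketch, assuming a transformers version that supports typical decoding), the generate call could be replaced with:
gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.8,
    typical_p=0.3,
    max_length=100,
)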
For the float16 model (GPU):
import torch
from transformers import GPTJForCausalLM, AutoTokenizer

# Load the half-precision weights and move the model to the GPU.
model = GPTJForCausalLM.from_pretrained(
    "ykilcher/gpt-4chan", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True
)
model.cuda()
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
prompt = (
    "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
    "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
    "researchers was the fact that the unicorns spoke perfect English."
)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.cuda()
gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    max_length=100,
)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
Limitations and Biases
This is a statistical model. As such, it continues text as is likely under the distribution the model has learned from the training data. Outputs should not be interpreted as "correct", "truthful", or otherwise as anything more than a statistical function of the input. That being said, GPT-4chan does significantly outperform GPT-J (and GPT-3) on the TruthfulQA benchmark, which measures whether a language model is truthful in generating answers to questions.
The dataset is time- and domain-limited. It was collected from 2016 to 2019 on 4chan's politically incorrect board. As such, political topics from that era will be overrepresented in the model's distribution compared to other models (e.g. GPT-J 6B). Also, due to the very lax rules and the anonymity of posters, a large part of the dataset contains offensive material. Thus, it is very likely that the model will produce offensive outputs, including but not limited to: toxicity, hate speech, racism, sexism, homo- and transphobia, xenophobia, and anti-semitism.
Due to the above limitations, it is strongly recommended not to deploy this model into a real-world environment unless its behavior is well understood, and explicit and strict limitations on the scope, impact, and duration of the deployment are enforced.
Evaluation results
Language Model Evaluation Harness
The following table compares GPT-J 6B to GPT-4chan on a subset of the Language Model Evaluation Harness. Differences exceeding standard errors are marked in the "Significant" column with a minus sign (-) indicating an advantage for GPT-J 6B and a plus sign (+) indicating an advantage for GPT-4chan.
Task | Metric | GPT-4chan | stderr | GPT-J-6B | stderr | Significant |
---|---|---|---|---|---|---|
copa | acc | 0.85 | 0.035887 | 0.83 | 0.0377525 | |
blimp_only_npi_scope | acc | 0.712 | 0.0143269 | 0.787 | 0.0129537 | - |
hendrycksTest-conceptual_physics | acc | 0.251064 | 0.028347 | 0.255319 | 0.0285049 | |
hendrycksTest-conceptual_physics | acc_norm | 0.187234 | 0.0255016 | 0.191489 | 0.0257221 | |
hendrycksTest-high_school_mathematics | acc | 0.248148 | 0.0263357 | 0.218519 | 0.0251958 | + |
hendrycksTest-high_school_mathematics | acc_norm | 0.3 | 0.0279405 | 0.251852 | 0.0264661 | + |
blimp_sentential_negation_npi_scope | acc | 0.734 | 0.01398 | 0.733 | 0.0139967 | |
hendrycksTest-high_school_european_history | acc | 0.278788 | 0.0350144 | 0.260606 | 0.0342774 | |
hendrycksTest-high_school_european_history | acc_norm | 0.315152 | 0.0362773 | 0.278788 | 0.0350144 | + |
blimp_wh_questions_object_gap | acc | 0.841 | 0.0115695 | 0.835 | 0.0117436 | |
hendrycksTest-international_law | acc | 0.214876 | 0.0374949 | 0.264463 | 0.0402619 | - |
hendrycksTest-international_law | acc_norm | 0.438017 | 0.0452915 | 0.404959 | 0.0448114 | |
hendrycksTest-high_school_us_history | acc | 0.323529 | 0.0328347 | 0.289216 | 0.0318223 | + |
hendrycksTest-high_school_us_history | acc_norm | 0.323529 | 0.0328347 | 0.29902 | 0.0321333 | |
openbookqa | acc | 0.276 | 0.0200112 | 0.29 | 0.0203132 | |
openbookqa | acc_norm | 0.362 | 0.0215137 | 0.382 | 0.0217508 | |
blimp_causative | acc | 0.737 | 0.0139293 | 0.761 | 0.013493 | - |
record | f1 | 0.878443 | 0.00322394 | 0.885049 | 0.00314367 | - |
record | em | 0.8702 | 0.003361 | 0.8765 | 0.00329027 | - |
blimp_determiner_noun_agreement_1 | acc | 0.996 | 0.00199699 | 0.995 | 0.00223159 | |
hendrycksTest-miscellaneous | acc | 0.305236 | 0.0164677 | 0.274585 | 0.0159598 | + |
hendrycksTest-miscellaneous | acc_norm | 0.269476 | 0.0158662 | 0.260536 | 0.015696 | |
hendrycksTest-virology | acc | 0.343373 | 0.0369658 | 0.349398 | 0.0371173 | |
hendrycksTest-virology | acc_norm | 0.331325 | 0.0366431 | 0.325301 | 0.0364717 | |
mathqa | acc | 0.269012 | 0.00811786 | 0.267002 | 0.00809858 | |
mathqa | acc_norm | 0.261642 | 0.00804614 | 0.270687 | 0.00813376 | - |
squad2 | exact | 10.6123 | 0 | 10.6207 | 0 | - |
squad2 | f1 | 17.8734 | 0 | 17.7413 | 0 | + |
squad2 | HasAns_exact | 17.2571 | 0 | 15.5027 | 0 | + |
squad2 | HasAns_f1 | 31.8 | 0 | 29.7643 | 0 | + |
squad2 | NoAns_exact | 3.98654 | 0 | 5.75273 | 0 | - |
squad2 | NoAns_f1 | 3.98654 | 0 | 5.75273 | 0 | - |
squad2 | best_exact | 50.0716 | 0 | 50.0716 | 0 | |
squad2 | best_f1 | 50.077 | 0 | 50.0778 | 0 | - |
mnli_mismatched | acc | 0.320586 | 0.00470696 | 0.376627 | 0.00488687 | - |
blimp_animate_subject_passive | acc | 0.79 | 0.0128867 | 0.781 | 0.0130847 | |
blimp_determiner_noun_agreement_with_adj_irregular_1 | acc | 0.834 | 0.0117721 | 0.878 | 0.0103549 | - |
qnli | acc | 0.491305 | 0.00676439 | 0.513454 | 0.00676296 | - |
blimp_intransitive | acc | 0.806 | 0.0125108 | 0.858 | 0.0110435 | - |
ethics_cm | acc | 0.512227 | 0.00802048 | 0.559846 | 0.00796521 | - |
hendrycksTest-high_school_computer_science | acc | 0.2 | 0.0402015 | 0.25 | 0.0435194 | - |
hendrycksTest-high_school_computer_science | acc_norm | 0.26 | 0.0440844 | 0.27 | 0.0446196 | |
iwslt17-ar-en | bleu | 21.4685 | 0.64825 | 20.7322 | 0.795602 | + |
iwslt17-ar-en | chrf | 0.452175 | 0.00498012 | 0.450919 | 0.00526515 | |
iwslt17-ar-en | ter | 0.733514 | 0.0201688 | 0.787631 | 0.0285488 | + |
hendrycksTest-security_studies | acc | 0.391837 | 0.0312513 | 0.363265 | 0.0307891 | |
hendrycksTest-security_studies | acc_norm | 0.285714 | 0.0289206 | 0.285714 | 0.0289206 | |
hendrycksTest-global_facts | acc | 0.29 | 0.0456048 | 0.25 | 0.0435194 | |
hendrycksTest-global_facts | acc_norm | 0.26 | 0.0440844 | 0.22 | 0.0416333 | |
anli_r1 | acc | 0.297 | 0.0144568 | 0.322 | 0.0147829 | - |
blimp_left_branch_island_simple_question | acc | 0.884 | 0.0101315 | 0.867 | 0.0107437 | + |
hendrycksTest-astronomy | acc | 0.25 | 0.0352381 | 0.25 | 0.0352381 | |
hendrycksTest-astronomy | acc_norm | 0.348684 | 0.0387814 | 0.335526 | 0.038425 | |
mrpc | acc | 0.536765 | 0.024717 | 0.683824 | 0.0230483 | - |
mrpc | f1 | 0.63301 | 0.0247985 | 0.812227 | 0.0162476 | - |
ethics_utilitarianism | acc | 0.525374 | 0.00720233 | 0.509775 | 0.00721024 | + |
blimp_determiner_noun_agreement_2 | acc | 0.99 | 0.003148 | 0.977 | 0.00474273 | + |
lambada_cloze | ppl | 388.123 | 13.1523 | 405.646 | 14.5519 | + |
lambada_cloze | acc | 0.0116437 | 0.00149456 | 0.0199884 | 0.00194992 | - |
truthfulqa_mc | mc1 | 0.225214 | 0.0146232 | 0.201958 | 0.014054 | + |
truthfulqa_mc | mc2 | 0.371625 | 0.0136558 | 0.359537 | 0.0134598 | |
blimp_wh_vs_that_with_gap_long_distance | acc | 0.441 | 0.0157088 | 0.342 | 0.0150087 | + |
hendrycksTest-business_ethics | acc | 0.28 | 0.0451261 | 0.29 | 0.0456048 | |
hendrycksTest-business_ethics | acc_norm | 0.29 | 0.0456048 | 0.3 | 0.0460566 | |
arithmetic_3ds | acc | 0.0065 | 0.00179736 | 0.046 | 0.0046854 | - |
blimp_determiner_noun_agreement_with_adjective_1 | acc | 0.988 | 0.00344498 | 0.978 | 0.00464086 | + |
hendrycksTest-moral_disputes | acc | 0.277457 | 0.0241057 | 0.283237 | 0.0242579 | |
hendrycksTest-moral_disputes | acc_norm | 0.309249 | 0.0248831 | 0.32659 | 0.0252483 | |
arithmetic_2da | acc | 0.0455 | 0.00466109 | 0.2405 | 0.00955906 | - |
qa4mre_2011 | acc | 0.425 | 0.0453163 | 0.458333 | 0.0456755 | |
qa4mre_2011 | acc_norm | 0.558333 | 0.0455219 | 0.533333 | 0.045733 | |
blimp_regular_plural_subject_verb_agreement_1 | acc | 0.966 | 0.00573384 | 0.968 | 0.00556839 | |
hendrycksTest-human_sexuality | acc | 0.389313 | 0.0427649 | 0.396947 | 0.0429114 | |
hendrycksTest-human_sexuality | acc_norm | 0.305344 | 0.0403931 | 0.343511 | 0.0416498 | |
blimp_passive_1 | acc | 0.878 | 0.0103549 | 0.885 | 0.0100934 | |
blimp_drop_argument | acc | 0.784 | 0.0130197 | 0.823 | 0.0120755 | - |
hendrycksTest-high_school_microeconomics | acc | 0.260504 | 0.0285103 | 0.277311 | 0.0290794 | |
hendrycksTest-high_school_microeconomics | acc_norm | 0.390756 | 0.0316938 | 0.39916 | 0.0318111 | |
hendrycksTest-us_foreign_policy | acc | 0.32 | 0.0468826 | 0.34 | 0.0476095 | |
hendrycksTest-us_foreign_policy | acc_norm | 0.4 | 0.0492366 | 0.35 | 0.0479372 | + |
blimp_ellipsis_n_bar_1 | acc | 0.846 | 0.0114199 | 0.841 | 0.0115695 | |
hendrycksTest-high_school_physics | acc | 0.264901 | 0.0360304 | 0.271523 | 0.0363133 | |
hendrycksTest-high_school_physics | acc_norm | 0.284768 | 0.0368488 | 0.271523 | 0.0363133 | |
qa4mre_2013 | acc | 0.362676 | 0.028579 | 0.401408 | 0.0291384 | - |
qa4mre_2013 | acc_norm | 0.387324 | 0.0289574 | 0.383803 | 0.0289082 | |
blimp_wh_vs_that_no_gap | acc | 0.963 | 0.00597216 | 0.969 | 0.00548353 | - |
headqa_es | acc | 0.238877 | 0.00814442 | 0.251276 | 0.0082848 | - |
headqa_es | acc_norm | 0.290664 | 0.00867295 | 0.286652 | 0.00863721 | |
blimp_sentential_subject_island | acc | 0.359 | 0.0151773 | 0.421 | 0.0156206 | - |
hendrycksTest-philosophy | acc | 0.241158 | 0.0242966 | 0.26045 | 0.0249267 | |
hendrycksTest-philosophy | acc_norm | 0.327974 | 0.0266644 | 0.334405 | 0.0267954 | |
hendrycksTest-elementary_mathematics | acc | 0.248677 | 0.0222618 | 0.251323 | 0.0223405 | |
hendrycksTest-elementary_mathematics | acc_norm | 0.275132 | 0.0230001 | 0.26455 | 0.0227175 | |
math_geometry | acc | 0.0187891 | 0.00621042 | 0.0104384 | 0.00464863 | + |
blimp_wh_questions_subject_gap_long_distance | acc | 0.886 | 0.0100551 | 0.883 | 0.0101693 | |
hendrycksTest-college_physics | acc | 0.205882 | 0.0402338 | 0.205882 | 0.0402338 | |
hendrycksTest-college_physics | acc_norm | 0.22549 | 0.0415831 | 0.245098 | 0.0428011 | |
hellaswag | acc | 0.488747 | 0.00498852 | 0.49532 | 0.00498956 | - |
hellaswag | acc_norm | 0.648277 | 0.00476532 | 0.66202 | 0.00472055 | - |
hendrycksTest-logical_fallacies | acc | 0.269939 | 0.0348783 | 0.294479 | 0.0358117 | |
hendrycksTest-logical_fallacies | acc_norm | 0.343558 | 0.0373113 | 0.355828 | 0.0376152 | |
hendrycksTest-machine_learning | acc | 0.339286 | 0.0449395 | 0.223214 | 0.039523 | + |
hendrycksTest-machine_learning | acc_norm | 0.205357 | 0.0383424 | 0.178571 | 0.0363521 | |
hendrycksTest-high_school_psychology | acc | 0.286239 | 0.0193794 | 0.273394 | 0.0191093 | |
hendrycksTest-high_school_psychology | acc_norm | 0.266055 | 0.018946 | 0.269725 | 0.0190285 | |
prost | acc | 0.256298 | 0.00318967 | 0.268254 | 0.00323688 | - |
prost | acc_norm | 0.280156 | 0.00328089 | 0.274658 | 0.00326093 | + |
blimp_determiner_noun_agreement_with_adj_irregular_2 | acc | 0.898 | 0.00957537 | 0.916 | 0.00877616 | - |
wnli | acc | 0.43662 | 0.0592794 | 0.464789 | 0.0596131 | |
hendrycksTest-professional_law | acc | 0.284876 | 0.0115278 | 0.273794 | 0.0113886 | |
hendrycksTest-professional_law | acc_norm | 0.301825 | 0.0117244 | 0.292699 | 0.0116209 | |
math_algebra | acc | 0.0126369 | 0.00324352 | 0.0117944 | 0.00313487 | |
wikitext | word_perplexity | 11.4687 | 0 | 10.8819 | 0 | - |
wikitext | byte_perplexity | 1.5781 | 0 | 1.56268 | 0 | - |
wikitext | bits_per_byte | 0.658188 | 0 | 0.644019 | 0 | - |
anagrams1 | acc | 0.0125 | 0.00111108 | 0.0008 | 0.000282744 | + |
math_prealgebra | acc | 0.0195178 | 0.00469003 | 0.0126292 | 0.00378589 | + |
blimp_principle_A_domain_2 | acc | 0.887 | 0.0100166 | 0.889 | 0.0099387 | |
cycle_letters | acc | 0.0331 | 0.00178907 | 0.0026 | 0.000509264 | + |
hendrycksTest-college_mathematics | acc | 0.26 | 0.0440844 | 0.26 | 0.0440844 | |
hendrycksTest-college_mathematics | acc_norm | 0.31 | 0.0464823 | 0.4 | 0.0492366 | - |
arithmetic_1dc | acc | 0.077 | 0.00596266 | 0.089 | 0.00636866 | - |
arithmetic_4da | acc | 0.0005 | 0.0005 | 0.007 | 0.00186474 | - |
triviaqa | acc | 0.150888 | 0.00336543 | 0.167418 | 0.00351031 | - |
boolq | acc | 0.673394 | 0.00820236 | 0.655352 | 0.00831224 | + |
random_insertion | acc | 0.0004 | 0.00019997 | 0 | 0 | + |
qa4mre_2012 | acc | 0.4 | 0.0388514 | 0.4125 | 0.0390407 | |
qa4mre_2012 | acc_norm | 0.4625 | 0.0395409 | 0.50625 | 0.0396495 | - |
math_asdiv | acc | 0.00997831 | 0.00207066 | 0.00563991 | 0.00156015 | + |
hendrycksTest-moral_scenarios | acc | 0.236872 | 0.0142196 | 0.236872 | 0.0142196 | |
hendrycksTest-moral_scenarios | acc_norm | 0.272626 | 0.0148934 | 0.272626 | 0.0148934 | |
hendrycksTest-high_school_geography | acc | 0.247475 | 0.0307463 | 0.20202 | 0.0286062 | + |
hendrycksTest-high_school_geography | acc_norm | 0.287879 | 0.0322588 | 0.292929 | 0.032425 | |
gsm8k | acc | 0 | 0 | 0 | 0 | |
blimp_existential_there_object_raising | acc | 0.812 | 0.0123616 | 0.792 | 0.0128414 | + |
blimp_superlative_quantifiers_2 | acc | 0.917 | 0.00872853 | 0.865 | 0.0108117 | + |
hendrycksTest-college_chemistry | acc | 0.28 | 0.0451261 | 0.24 | 0.0429235 | |
hendrycksTest-college_chemistry | acc_norm | 0.31 | 0.0464823 | 0.28 | 0.0451261 | |
blimp_existential_there_quantifiers_2 | acc | 0.545 | 0.0157551 | 0.383 | 0.0153801 | + |
hendrycksTest-abstract_algebra | acc | 0.17 | 0.0377525 | 0.26 | 0.0440844 | - |
hendrycksTest-abstract_algebra | acc_norm | 0.26 | 0.0440844 | 0.3 | 0.0460566 | |
hendrycksTest-professional_psychology | acc | 0.26634 | 0.0178832 | 0.28268 | 0.0182173 | |
hendrycksTest-professional_psychology | acc_norm | 0.256536 | 0.0176678 | 0.259804 | 0.0177409 | |
ethics_virtue | acc | 0.249849 | 0.00613847 | 0.200201 | 0.00567376 | + |
ethics_virtue | em | 0.0040201 | 0 | 0 | 0 | + |
arithmetic_5da | acc | 0 | 0 | 0.0005 | 0.0005 | - |
mutual | r@1 | 0.455982 | 0.0167421 | 0.468397 | 0.0167737 | |
mutual | r@2 | 0.732506 | 0.0148796 | 0.735892 | 0.0148193 | |
mutual | mrr | 0.675226 | 0.0103132 | 0.682186 | 0.0103375 | |
blimp_irregular_past_participle_verbs | acc | 0.869 | 0.0106749 | 0.876 | 0.0104275 | |
ethics_deontology | acc | 0.497775 | 0.00833904 | 0.523637 | 0.0083298 | - |
ethics_deontology | em | 0.00333704 | 0 | 0.0355951 | 0 | - |
blimp_transitive | acc | 0.818 | 0.0122076 | 0.855 | 0.01114 | - |
hendrycksTest-college_computer_science | acc | 0.29 | 0.0456048 | 0.27 | 0.0446196 | |
hendrycksTest-college_computer_science | acc_norm | 0.27 | 0.0446196 | 0.26 | 0.0440844 | |
hendrycksTest-professional_medicine | acc | 0.283088 | 0.0273659 | 0.272059 | 0.027033 | |
hendrycksTest-professional_medicine | acc_norm | 0.279412 | 0.0272572 | 0.261029 | 0.0266793 | |
sciq | acc | 0.895 | 0.00969892 | 0.915 | 0.00882343 | - |
sciq | acc_norm | 0.869 | 0.0106749 | 0.874 | 0.0104992 | |
blimp_anaphor_number_agreement | acc | 0.993 | 0.00263779 | 0.995 | 0.00223159 | |
blimp_wh_questions_subject_gap | acc | 0.925 | 0.00833333 | 0.913 | 0.00891687 | + |
blimp_wh_vs_that_with_gap | acc | 0.482 | 0.015809 | 0.429 | 0.015659 | + |
math_num_theory | acc | 0.0351852 | 0.00793611 | 0.0203704 | 0.00608466 | + |
blimp_complex_NP_island | acc | 0.538 | 0.0157735 | 0.535 | 0.0157805 | |
blimp_expletive_it_object_raising | acc | 0.777 | 0.0131698 | 0.78 | 0.0131062 | |
lambada_mt_en | ppl | 4.62504 | 0.10549 | 4.10224 | 0.0884971 | - |
lambada_mt_en | acc | 0.648554 | 0.00665142 | 0.682127 | 0.00648741 | - |
hendrycksTest-formal_logic | acc | 0.309524 | 0.0413491 | 0.34127 | 0.042408 | |
hendrycksTest-formal_logic | acc_norm | 0.325397 | 0.041906 | 0.325397 | 0.041906 | |
blimp_matrix_question_npi_licensor_present | acc | 0.663 | 0.0149551 | 0.727 | 0.014095 | - |
blimp_superlative_quantifiers_1 | acc | 0.791 | 0.0128641 | 0.871 | 0.0106053 | - |
lambada_mt_de | ppl | 89.7905 | 5.30301 | 82.2416 | 4.88447 | - |
lambada_mt_de | acc | 0.312245 | 0.0064562 | 0.312827 | 0.00645948 | |
hendrycksTest-computer_security | acc | 0.37 | 0.0485237 | 0.27 | 0.0446196 | + |
hendrycksTest-computer_security | acc_norm | 0.37 | 0.0485237 | 0.33 | 0.0472582 | |
ethics_justice | acc | 0.501479 | 0.00961712 | 0.526627 | 0.00960352 | - |
ethics_justice | em | 0 | 0 | 0.0251479 | 0 | - |
blimp_principle_A_reconstruction | acc | 0.296 | 0.0144427 | 0.444 | 0.0157198 | - |
blimp_existential_there_subject_raising | acc | 0.877 | 0.0103913 | 0.875 | 0.0104635 | |
math_precalc | acc | 0.014652 | 0.00514689 | 0.0018315 | 0.0018315 | + |
qasper | f1_yesno | 0.632997 | 0.032868 | 0.666667 | 0.0311266 | - |
qasper | f1_abstractive | 0.113489 | 0.00729073 | 0.118383 | 0.00692993 | |
cb | acc | 0.196429 | 0.0535714 | 0.357143 | 0.0646096 | - |
cb | f1 | 0.149038 | 0 | 0.288109 | 0 | - |
blimp_animate_subject_trans | acc | 0.858 | 0.0110435 | 0.868 | 0.0107094 | |
hendrycksTest-high_school_statistics | acc | 0.310185 | 0.031547 | 0.291667 | 0.0309987 | |
hendrycksTest-high_school_statistics | acc_norm | 0.361111 | 0.0327577 | 0.314815 | 0.0316747 | + |
blimp_irregular_plural_subject_verb_agreement_2 | acc | 0.881 | 0.0102442 | 0.919 | 0.00863212 | - |
lambada_mt_es | ppl | 92.1172 | 5.05064 | 83.6696 | 4.57489 | - |
lambada_mt_es | acc | 0.322337 | 0.00651139 | 0.326994 | 0.00653569 | |
anli_r2 | acc | 0.327 | 0.0148422 | 0.337 | 0.0149551 | |
hendrycksTest-nutrition | acc | 0.346405 | 0.0272456 | 0.346405 | 0.0272456 | |
hendrycksTest-nutrition | acc_norm | 0.385621 | 0.0278707 | 0.401961 | 0.0280742 | |
anli_r3 | acc | 0.336667 | 0.0136476 | 0.3525 | 0.0137972 | - |
blimp_regular_plural_subject_verb_agreement_2 | acc | 0.897 | 0.00961683 | 0.916 | 0.00877616 | - |
blimp_tough_vs_raising_2 | acc | 0.826 | 0.0119945 | 0.857 | 0.0110758 | - |
mnli | acc | 0.316047 | 0.00469317 | 0.374733 | 0.00488619 | - |
drop | em | 0.0595638 | 0.00242379 | 0.0228607 | 0.0015306 | + |
drop | f1 | 0.120355 | 0.00270951 | 0.103871 | 0.00219977 | + |
blimp_determiner_noun_agreement_with_adj_2 | acc | 0.95 | 0.00689547 | 0.936 | 0.00774364 | + |
arithmetic_2dm | acc | 0.061 | 0.00535293 | 0.14 | 0.00776081 | - |
blimp_determiner_noun_agreement_irregular_2 | acc | 0.93 | 0.00807249 | 0.932 | 0.00796489 | |
lambada | ppl | 4.62504 | 0.10549 | 4.10224 | 0.0884971 | - |
lambada | acc | 0.648554 | 0.00665142 | 0.682127 | 0.00648741 | - |
arithmetic_3da | acc | 0.007 | 0.00186474 | 0.0865 | 0.00628718 | - |
blimp_irregular_past_participle_adjectives | acc | 0.947 | 0.00708811 | 0.956 | 0.00648892 | - |
hendrycksTest-college_biology | acc | 0.201389 | 0.0335365 | 0.284722 | 0.0377381 | - |
hendrycksTest-college_biology | acc_norm | 0.222222 | 0.0347659 | 0.270833 | 0.0371618 | - |
headqa_en | acc | 0.324945 | 0.00894582 | 0.335522 | 0.00901875 | - |
headqa_en | acc_norm | 0.375638 | 0.00925014 | 0.383297 | 0.00928648 | |
blimp_determiner_noun_agreement_irregular_1 | acc | 0.912 | 0.00896305 | 0.944 | 0.0072744 | - |
blimp_existential_there_quantifiers_1 | acc | 0.985 | 0.00384575 | 0.981 | 0.00431945 | |
blimp_inchoative | acc | 0.653 | 0.0150605 | 0.683 | 0.0147217 | - |
mutual_plus | r@1 | 0.395034 | 0.0164328 | 0.409707 | 0.016531 | |
mutual_plus | r@2 | 0.674944 | 0.015745 | 0.680587 | 0.0156728 | |
mutual_plus | mrr | 0.632713 | 0.0103391 | 0.640801 | 0.0104141 | |
blimp_tough_vs_raising_1 | acc | 0.736 | 0.0139463 | 0.734 | 0.01398 | |
winogrande | acc | 0.636148 | 0.0135215 | 0.640884 | 0.0134831 | |
race | acc | 0.374163 | 0.0149765 | 0.37512 | 0.0149842 | |
blimp_irregular_plural_subject_verb_agreement_1 | acc | 0.908 | 0.00914438 | 0.918 | 0.00868052 | - |
hendrycksTest-high_school_macroeconomics | acc | 0.284615 | 0.0228783 | 0.284615 | 0.0228783 | |
hendrycksTest-high_school_macroeconomics | acc_norm | 0.284615 | 0.0228783 | 0.276923 | 0.022688 | |
blimp_adjunct_island | acc | 0.888 | 0.00997775 | 0.902 | 0.00940662 | - |
hendrycksTest-high_school_chemistry | acc | 0.236453 | 0.0298961 | 0.211823 | 0.028749 | |
hendrycksTest-high_school_chemistry | acc_norm | 0.300493 | 0.032258 | 0.29064 | 0.0319474 | |
arithmetic_2ds | acc | 0.051 | 0.00492053 | 0.218 | 0.00923475 | - |
blimp_principle_A_case_2 | acc | 0.955 | 0.00655881 | 0.953 | 0.00669596 | |
blimp_only_npi_licensor_present | acc | 0.926 | 0.00828206 | 0.953 | 0.00669596 | - |
math_counting_and_prob | acc | 0.0274262 | 0.00750954 | 0.0021097 | 0.0021097 | + |
cola | mcc | -0.0854256 | 0.0304519 | -0.0504508 | 0.0251594 | - |
webqs | acc | 0.023622 | 0.00336987 | 0.0226378 | 0.00330058 | |
arithmetic_4ds | acc | 0.0005 | 0.0005 | 0.0055 | 0.00165416 | - |
blimp_wh_vs_that_no_gap_long_distance | acc | 0.94 | 0.00751375 | 0.939 | 0.00757208 | |
pile_bookcorpus2 | word_perplexity | 28.7786 | 0 | 27.0559 | 0 | - |
pile_bookcorpus2 | byte_perplexity | 1.79969 | 0 | 1.78037 | 0 | - |
pile_bookcorpus2 | bits_per_byte | 0.847751 | 0 | 0.832176 | 0 | - |
blimp_sentential_negation_npi_licensor_present | acc | 0.994 | 0.00244335 | 0.982 | 0.00420639 | + |
hendrycksTest-high_school_government_and_politics | acc | 0.274611 | 0.0322102 | 0.227979 | 0.0302769 | + |
hendrycksTest-high_school_government_and_politics | acc_norm | 0.259067 | 0.0316188 | 0.248705 | 0.0311958 | |
blimp_ellipsis_n_bar_2 | acc | 0.937 | 0.00768701 | 0.916 | 0.00877616 | + |
hendrycksTest-clinical_knowledge | acc | 0.283019 | 0.0277242 | 0.267925 | 0.0272573 | |
hendrycksTest-clinical_knowledge | acc_norm | 0.343396 | 0.0292245 | 0.316981 | 0.0286372 | |
mc_taco | em | 0.125375 | 0 | 0.132883 | 0 | - |
mc_taco | f1 | 0.487131 | 0 | 0.499712 | 0 | - |
wsc | acc | 0.365385 | 0.0474473 | 0.365385 | 0.0474473 | |
hendrycksTest-college_medicine | acc | 0.231214 | 0.0321474 | 0.190751 | 0.0299579 | + |
hendrycksTest-college_medicine | acc_norm | 0.289017 | 0.0345643 | 0.265896 | 0.0336876 | |
hendrycksTest-high_school_world_history | acc | 0.295359 | 0.0296963 | 0.2827 | 0.0293128 | |
hendrycksTest-high_school_world_history | acc_norm | 0.312236 | 0.0301651 | 0.312236 | 0.0301651 | |
hendrycksTest-anatomy | acc | 0.296296 | 0.0394462 | 0.281481 | 0.03885 | |
hendrycksTest-anatomy | acc_norm | 0.288889 | 0.0391545 | 0.266667 | 0.0382017 | |
hendrycksTest-jurisprudence | acc | 0.25 | 0.0418609 | 0.277778 | 0.0433004 | |
hendrycksTest-jurisprudence | acc_norm | 0.416667 | 0.0476608 | 0.425926 | 0.0478034 | |
logiqa | acc | 0.193548 | 0.0154963 | 0.211982 | 0.016031 | - |
logiqa | acc_norm | 0.281106 | 0.0176324 | 0.291859 | 0.0178316 | |
ethics_utilitarianism_original | acc | 0.767679 | 0.00609112 | 0.941556 | 0.00338343 | - |
blimp_principle_A_c_command | acc | 0.827 | 0.0119672 | 0.81 | 0.0124119 | + |
blimp_coordinate_structure_constraint_complex_left_branch | acc | 0.794 | 0.0127956 | 0.764 | 0.0134345 | + |
arithmetic_5ds | acc | 0 | 0 | 0 | 0 | |
lambada_mt_it | ppl | 96.8846 | 5.80902 | 86.66 | 5.1869 | - |
lambada_mt_it | acc | 0.328158 | 0.00654165 | 0.336891 | 0.0065849 | - |
wsc273 | acc | 0.827839 | 0.0228905 | 0.827839 | 0.0228905 | |
blimp_coordinate_structure_constraint_object_extraction | acc | 0.852 | 0.0112349 | 0.876 | 0.0104275 | - |
blimp_principle_A_domain_3 | acc | 0.79 | 0.0128867 | 0.819 | 0.0121814 | - |
blimp_left_branch_island_echo_question | acc | 0.638 | 0.0152048 | 0.519 | 0.0158079 | + |
rte | acc | 0.534296 | 0.0300256 | 0.548736 | 0.0299531 | |
blimp_passive_2 | acc | 0.892 | 0.00982 | 0.899 | 0.00953362 | |
hendrycksTest-electrical_engineering | acc | 0.344828 | 0.0396093 | 0.358621 | 0.0399663 | |
hendrycksTest-electrical_engineering | acc_norm | 0.372414 | 0.0402873 | 0.372414 | 0.0402873 | |
sst | acc | 0.626147 | 0.0163938 | 0.493119 | 0.0169402 | + |
blimp_npi_present_1 | acc | 0.565 | 0.0156851 | 0.576 | 0.0156355 | |
piqa | acc | 0.739391 | 0.0102418 | 0.754081 | 0.0100473 | - |
piqa | acc_norm | 0.755169 | 0.0100323 | 0.761697 | 0.00994033 | |
hendrycksTest-professional_accounting | acc | 0.312057 | 0.0276401 | 0.265957 | 0.0263581 | + |
hendrycksTest-professional_accounting | acc_norm | 0.27305 | 0.0265779 | 0.22695 | 0.0249871 | + |
arc_challenge | acc | 0.325085 | 0.0136881 | 0.337884 | 0.013822 | |
arc_challenge | acc_norm | 0.352389 | 0.0139601 | 0.366041 | 0.0140772 | |
hendrycksTest-econometrics | acc | 0.263158 | 0.0414244 | 0.245614 | 0.0404934 | |
hendrycksTest-econometrics | acc_norm | 0.254386 | 0.0409699 | 0.27193 | 0.0418577 | |
headqa | acc | 0.238877 | 0.00814442 | 0.251276 | 0.0082848 | - |
headqa | acc_norm | 0.290664 | 0.00867295 | 0.286652 | 0.00863721 | |
wic | acc | 0.482759 | 0.0197989 | 0.5 | 0.0198107 | |
hendrycksTest-high_school_biology | acc | 0.270968 | 0.0252844 | 0.251613 | 0.024686 | |
hendrycksTest-high_school_biology | acc_norm | 0.274194 | 0.0253781 | 0.283871 | 0.0256494 | |
hendrycksTest-management | acc | 0.281553 | 0.0445325 | 0.23301 | 0.0418583 | + |
hendrycksTest-management | acc_norm | 0.291262 | 0.0449868 | 0.320388 | 0.0462028 | |
blimp_npi_present_2 | acc | 0.645 | 0.0151395 | 0.664 | 0.0149441 | - |
hendrycksTest-prehistory | acc | 0.265432 | 0.0245692 | 0.243827 | 0.0238919 | |
hendrycksTest-prehistory | acc_norm | 0.225309 | 0.0232462 | 0.219136 | 0.0230167 | |
hendrycksTest-world_religions | acc | 0.321637 | 0.0358253 | 0.333333 | 0.0361551 | |
hendrycksTest-world_religions | acc_norm | 0.397661 | 0.0375364 | 0.380117 | 0.0372297 | |
math_intermediate_algebra | acc | 0.00996678 | 0.00330749 | 0.00332226 | 0.00191598 | + |
anagrams2 | acc | 0.0347 | 0.00183028 | 0.0055 | 0.000739615 | + |
arc_easy | acc | 0.647306 | 0.00980442 | 0.669613 | 0.00965143 | - |
arc_easy | acc_norm | 0.609848 | 0.0100091 | 0.622896 | 0.00994504 | - |
blimp_anaphor_gender_agreement | acc | 0.993 | 0.00263779 | 0.994 | 0.00244335 | |
hendrycksTest-marketing | acc | 0.311966 | 0.0303515 | 0.307692 | 0.0302364 | |
hendrycksTest-marketing | acc_norm | 0.34188 | 0.031075 | 0.294872 | 0.0298726 | + |
blimp_principle_A_domain_1 | acc | 0.997 | 0.00173032 | 0.997 | 0.00173032 | |
blimp_wh_island | acc | 0.856 | 0.011108 | 0.852 | 0.0112349 | |
hendrycksTest-sociology | acc | 0.303483 | 0.0325101 | 0.278607 | 0.0317006 | |
hendrycksTest-sociology | acc_norm | 0.298507 | 0.0323574 | 0.318408 | 0.0329412 | |
blimp_distractor_agreement_relative_clause | acc | 0.774 | 0.0132325 | 0.719 | 0.0142212 | + |
truthfulqa_gen | bleurt_max | -0.811655 | 0.0180743 | -0.814228 | 0.0172128 | |
truthfulqa_gen | bleurt_acc | 0.395349 | 0.0171158 | 0.329253 | 0.0164513 | + |
truthfulqa_gen | bleurt_diff | -0.0488385 | 0.0204525 | -0.185905 | 0.0169617 | + |
truthfulqa_gen | bleu_max | 20.8747 | 0.717003 | 20.2238 | 0.711772 | |
truthfulqa_gen | bleu_acc | 0.330477 | 0.0164668 | 0.281518 | 0.015744 | + |
truthfulqa_gen | bleu_diff | -2.12856 | 0.832693 | -6.66121 | 0.719366 | + |
truthfulqa_gen | rouge1_max | 47.0293 | 0.962404 | 45.3457 | 0.89238 | + |
truthfulqa_gen | rouge1_acc | 0.341493 | 0.0166007 | 0.257038 | 0.0152981 | + |
truthfulqa_gen | rouge1_diff | -2.29454 | 1.2086 | -10.1049 | 0.8922 | + |
truthfulqa_gen | rouge2_max | 31.0617 | 1.08725 | 28.7438 | 0.981282 | + |
truthfulqa_gen | rouge2_acc | 0.247246 | 0.0151024 | 0.201958 | 0.014054 | + |
truthfulqa_gen | rouge2_diff | -2.84021 | 1.28749 | -11.0916 | 1.01664 | + |
truthfulqa_gen | rougeL_max | 44.6463 | 0.966119 | 42.6116 | 0.893252 | + |
truthfulqa_gen | rougeL_acc | 0.334149 | 0.0165125 | 0.24235 | 0.0150007 | + |
truthfulqa_gen | rougeL_diff | -2.50853 | 1.22016 | -10.4299 | 0.904205 | + |
hendrycksTest-public_relations | acc | 0.3 | 0.0438931 | 0.281818 | 0.0430912 | |
hendrycksTest-public_relations | acc_norm | 0.190909 | 0.0376443 | 0.163636 | 0.0354343 | |
blimp_distractor_agreement_relational_noun | acc | 0.859 | 0.0110109 | 0.833 | 0.0118004 | + |
lambada_mt_fr | ppl | 57.0379 | 3.15719 | 51.7313 | 2.90272 | - |
lambada_mt_fr | acc | 0.388512 | 0.0067906 | 0.40947 | 0.00685084 | - |
blimp_principle_A_case_1 | acc | 1 | 0 | 1 | 0 | |
hendrycksTest-medical_genetics | acc | 0.37 | 0.0485237 | 0.31 | 0.0464823 | + |
hendrycksTest-medical_genetics | acc_norm | 0.41 | 0.0494311 | 0.39 | 0.0490207 | |
qqp | acc | 0.364383 | 0.00239348 | 0.383626 | 0.00241841 | - |
qqp | f1 | 0.516391 | 0.00263674 | 0.451222 | 0.00289696 | + |
iwslt17-en-ar | bleu | 2.35563 | 0.188638 | 4.98225 | 0.275369 | - |
iwslt17-en-ar | chrf | 0.140912 | 0.00503101 | 0.277708 | 0.00415432 | - |
iwslt17-en-ar | ter | 1.0909 | 0.0122111 | 0.954701 | 0.0126737 | - |
multirc | acc | 0.0409234 | 0.00642087 | 0.0178384 | 0.00428994 | + |
hendrycksTest-human_aging | acc | 0.264574 | 0.0296051 | 0.264574 | 0.0296051 | |
hendrycksTest-human_aging | acc_norm | 0.197309 | 0.0267099 | 0.237668 | 0.0285681 | - |
reversed_words | acc | 0.0003 | 0.000173188 | 0 | 0 | + |
Some results are missing due to errors or computational constraints.
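For reference, the "Significant" marks above can be reproduced roughly as follows (an illustrative sketch; it assumes a difference counts as significant when it exceeds both reported standard errors, and that the sign is flipped for metrics where lower is better, such as ppl and ter):
def significance_marker(value_4chan, stderr_4chan, value_j, stderr_j, higher_is_better=True):
    # '+' favours GPT-4chan, '-' favours GPT-J 6B, '' means the gap is within the standard errors.
    diff = value_4chan - value_j
    if abs(diff) <= max(stderr_4chan, stderr_j):
        return ""
    gpt4chan_is_better = diff > 0 if higher_is_better else diff < 0
    return "+" if gpt4chan_is_better else "-"

# e.g. copa acc: significance_marker(0.85, 0.035887, 0.83, 0.0377525) -> "" (within standard errors)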