TITLE = 'Open Multilingual LLM Evaluation Leaderboard (Dutch only)'
' INTRO_TEXT = f""" ## About This is a fork of the [Open Multilingual LLM Evaluation Leaderboard](https://huggingface.co/spaces/uonlp/open_multilingual_llm_leaderboard), but restricted to only Dutch models and augmented with additional model results. We test the models on the following benchmarks **for the Dutch version only!!**, which have been translated into Dutch automatically by the original authors of the Open Multilingual LLM Evaluation Leaderboard with `gpt-35-turbo`. - AI2 Reasoning Challenge (25-shot) - HellaSwag (10-shot) - MMLU (5-shot) - TruthfulQA (0-shot) I do not maintain those datasets, I only run benchmarks and add the results to this space. For questions regarding the test sets or running them yourself, see [the original Github repository](https://github.com/laiviet/lm-evaluation-harness). All models are benchmarked in 8-bit precision. """ CREDIT = f""" ## Credit This leaderboard has borrowed heavily from the following sources: - Datasets (AI2_ARC, HellaSwag, MMLU, TruthfulQA) - Evaluation code (EleutherAI's lm_evaluation_harness repo) - Leaderboard code (Huggingface4's open_llm_leaderboard repo) - The multilingual version of the leaderboard (uonlp's open_multilingual_llm_leaderboard repo) """ CITATION = f""" ## Citation If you use or cite the Dutch benchmark results or this specific leaderboard page, please cite the following paper: TDB If you use the multilingual benchmarks, please cite the following paper: ```bibtex @misc{{lai2023openllmbenchmark, author = {{Viet Lai and Nghia Trung Ngo and Amir Pouran Ben Veyseh and Franck Dernoncourt and Thien Huu Nguyen}}, title={{Open Multilingual LLM Evaluation Leaderboard}}, year={{2023}} }} ``` """