import os
import base64

from src.display.utils import ModelType

current_dir = os.path.dirname(os.path.realpath(__file__))

# Read the logo files and base64-encode them so they can be inlined as data URIs.
with open(os.path.join(current_dir, "main_logo.png"), "rb") as image_file:
    main_logo = base64.b64encode(image_file.read()).decode('utf-8')
with open(os.path.join(current_dir, "host_sponsor.png"), "rb") as image_file:
    host_sponsor = base64.b64encode(image_file.read()).decode('utf-8')

# The files are PNGs, so the data URIs declare image/png.
TITLE = f"""<img src="data:image/png;base64,{main_logo}" style="width:30%;display:block;margin-left:auto;margin-right:auto">"""
BOTTOM_LOGO = f"""<img src="data:image/png;base64,{host_sponsor}" style="width:75%;display:block;margin-left:auto;margin-right:auto">"""
INTRODUCTION_TEXT = f"""
The previous Leaderboard version is live [here](https://huggingface.co/spaces/choco9966/open-ko-llm-leaderboard-old) 📊

🚀 The Open Ko-LLM Leaderboard2 🇰🇷 objectively evaluates the performance of Korean large language models (LLMs). When you submit a model on the "Submit here!" page, it is automatically evaluated.

This leaderboard is co-hosted by [Upstage](https://www.upstage.ai/) and [NIA](https://www.nia.or.kr/site/nia_kor/main.do), which provides various Korean datasets through [AI-Hub](https://aihub.or.kr/), and is operated by [Upstage](https://www.upstage.ai/). The GPUs used for evaluation are operated with the support of [KT](https://cloud.kt.com/) and [AICA](https://aica-gj.kr/main.php). Whereas Season 1 focused on evaluating LLM capabilities in reasoning, language understanding, hallucination, and commonsense through academic benchmarks, Season 2 focuses on assessing LLMs' practical abilities and reliability. The datasets for this season are sponsored by [Flitto](https://www.flitto.com/portal/en), [SELECTSTAR](https://selectstar.ai/ko/), and [KAIST AI](https://gsai.kaist.ac.kr/?lang=ko&ckattempt=1). The evaluation datasets are kept private and are available only for the evaluation process. More detailed information about the benchmark datasets is provided on the “About” page.

There you'll find explanations of the evaluations we use, reproducibility guidelines, best practices for submitting a model, and our FAQ.
"""
LLM_BENCHMARKS_TEXT = f"""
# Motivation
While outstanding LLMs are being released at a competitive pace, most of them are centered on English and are most familiar with English-speaking cultures. We operate the Korean leaderboard, 🚀 Open Ko-LLM, to evaluate models that reflect the characteristics of the Korean language and Korean culture. We hope that users can conveniently use the leaderboard, participate, and contribute to the advancement of Korean-language research.

## How it works
📈 We evaluate models on 9 key benchmarks using the [Eleuther AI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness), a unified framework for testing generative language models on a large number of different evaluation tasks.

- Ko-GPQA (provided by [Flitto](https://www.flitto.com/portal/en))
- Ko-WinoGrande (provided by [Flitto](https://www.flitto.com/portal/en))
- Ko-GSM8K (provided by [Flitto](https://www.flitto.com/portal/en))
- Ko-EQ-Bench (provided by [Flitto](https://www.flitto.com/portal/en))
- Ko-IFEval (provided by [Flitto](https://www.flitto.com/portal/en))
- KorNAT-Knowledge (provided by [SELECTSTAR](https://selectstar.ai/ko/) and [KAIST AI](https://gsai.kaist.ac.kr/?lang=ko&ckattempt=1))
- KorNAT-Social-Value (provided by [SELECTSTAR](https://selectstar.ai/ko/) and [KAIST AI](https://gsai.kaist.ac.kr/?lang=ko&ckattempt=1))
- Ko-Harmlessness (provided by [SELECTSTAR](https://selectstar.ai/ko/) and [KAIST AI](https://gsai.kaist.ac.kr/?lang=ko&ckattempt=1))
- Ko-Helpfulness (provided by [SELECTSTAR](https://selectstar.ai/ko/) and [KAIST AI](https://gsai.kaist.ac.kr/?lang=ko&ckattempt=1))

For all these evaluations, a higher score is better. We chose these benchmarks because they test reasoning, harmlessness, helpfulness, and general knowledge across a wide variety of fields in 0-shot and few-shot settings.

The final score is the average of the scores on the individual evaluation datasets.

GPUs for the evaluations are provided by [KT](https://cloud.kt.com/) and [AICA](https://aica-gj.kr/main.php).

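As a rough illustration of the averaging described above, the sketch below computes a final score as the unweighted mean of per-benchmark scores. The score values and the `final_score` helper are hypothetical and are not taken from the leaderboard's code; the actual pipeline may normalize or weight scores differently.

```python
from statistics import mean

# Hypothetical per-benchmark scores (illustrative values only).
scores = [
    ("Ko-GPQA", 30.0),
    ("Ko-WinoGrande", 60.0),
    ("Ko-GSM8K", 45.0),
]


def final_score(benchmark_scores):
    # Unweighted mean of the score in each (name, score) pair.
    return mean(score for _, score in benchmark_scores)


print(final_score(scores))  # 45.0
```
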
## **Results**
- Detailed numerical results in the `results` Upstage dataset: https://huggingface.co/datasets/open-ko-llm-leaderboard/results
- Community queries and running status in the `requests` Upstage dataset: https://huggingface.co/datasets/open-ko-llm-leaderboard/requests

## More resources
If you still have questions, you can check our FAQ [here](https://huggingface.co/spaces/upstage/open-ko-llm-leaderboard/discussions/1)!
"""

FAQ_TEXT = """
"""

EVALUATION_QUEUE_TEXT = f"""
# Evaluation Queue for the 🤗 Open Ko-LLM Leaderboard
Models added here will be automatically evaluated on the 🤗 cluster.

## Submission Disclaimer
**By submitting a model, you acknowledge that:**
- We store information about who submitted each model in the [Requests dataset](https://huggingface.co/datasets/open-ko-llm-leaderboard/requests).
- This practice helps maintain the integrity of our leaderboard, prevent spam, and ensure responsible submissions.
- Your submission will be visible to the community, and you may be contacted regarding your model.
- Please submit carefully and responsibly 💛

## First Steps Before Submitting a Model

### 1. Ensure Your Model Loads with AutoClasses
Verify that you can load your model and tokenizer using AutoClasses:
```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

revision = "main"  # the branch, tag, or commit hash you plan to submit
config = AutoConfig.from_pretrained("your model name", revision=revision)
model = AutoModel.from_pretrained("your model name", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
```
Note:
- If this step fails, debug your model before submitting.
- Ensure your model is public.
- We are working on adding support for models requiring `trust_remote_code=True`.

### 2. Convert Weights to Safetensors
[Safetensors](https://huggingface.co/docs/safetensors/index) is a new format for storing weights that is safer and faster to load and use. It also allows us to add your model's parameter count to the `Extended Viewer`!

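One common way to convert is to reload the model and save it again with `safe_serialization=True`. This is a minimal sketch, assuming `transformers` with a PyTorch backend is installed and that the model is a causal LM; the `convert_to_safetensors` helper and the names passed to it are placeholders, not part of the leaderboard tooling.

```python
def convert_to_safetensors(model_name, output_dir, revision="main"):
    # Imported lazily so the helper can be defined without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(model_name, revision=revision)
    tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)
    # safe_serialization=True writes the weights as .safetensors files.
    model.save_pretrained(output_dir, safe_serialization=True)
    tokenizer.save_pretrained(output_dir)
    return output_dir
```

Calling `convert_to_safetensors("your model name", "./converted")` would write the converted files to a local directory, which you can then upload to your model repository.
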
### 3. Verify That Your Model Has an Open License
This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model 🤗

### 4. Complete Your Model Card
When we add extra information about models to the leaderboard, it will be taken automatically from the model card.

### 5. Select the Correct Precision
Choose the right precision to avoid evaluation errors:
- Not all models convert properly from float16 to bfloat16.
- Incorrect precision can cause issues (e.g., loading a bf16 model in fp16 may generate NaNs).

> Important: When you submit, git branches and tags are pinned to the specific commit present at the time of submission to ensure revision consistency.
>
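One way to pick the precision is to read the `torch_dtype` recorded in the model's `config.json` and submit with the matching value. The sketch below is illustrative only: the `suggest_precision` helper is hypothetical, and the set of precisions accepted by the submission form may differ from what it assumes.

```python
def suggest_precision(config):
    # config: a dict parsed from the model's config.json.
    dtype = config.get("torch_dtype")
    if dtype in ("float16", "bfloat16"):
        # Submit in the dtype the weights were saved in to avoid lossy casts.
        return dtype
    # Unknown or missing dtype: no safe guess, so check the model manually.
    return None


print(suggest_precision(dict(torch_dtype="bfloat16")))  # bfloat16
```
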

## Model types
- 🟢 : 🟢 pretrained model: new, base models, trained on a given text corpus using masked modelling
- 🟩 : 🟩 continuously pretrained model: new, base models, continuously trained on a further corpus (which may include IFT/chat data) using masked modelling
- 🔶 : 🔶 fine-tuned on domain-specific datasets model: pretrained models fine-tuned on more data
- 💬 : 💬 chat models (RLHF, DPO, IFT, ...) model: chat-like fine-tunes, using IFT (datasets of task instructions), RLHF, or DPO (changing the model loss a bit with an added policy), etc.
- 🤝 : 🤝 base merges and moerges model: merges or MoErges, models which have been merged or fused without additional fine-tuning.

Please provide information about the model through an issue! 🤩

🏴‍☠️ : 🏴‍☠️ This icon indicates that the community has flagged the model as a subject of caution, meaning that users should exercise caution when using it. Clicking on the icon will take you to a discussion about that model. (Models flagged this way include, among others, those that used the evaluation set for training to achieve a high leaderboard ranking.)
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results. Authors of open-ko-llm-leaderboard are ordered alphabetically."
CITATION_BUTTON_TEXT = r"""
@inproceedings{park2024open,
  title={Open Ko-LLM Leaderboard: Evaluating Large Language Models in Korean with Ko-H5 Benchmark},
  author={Chanjun Park and Hyeonwoo Kim and Dahyun Kim and Seonghwan Cho and Sanghoon Kim and Sukyung Lee and Yungi Kim and Hwalsuk Lee},
  year={2024},
  booktitle={The 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)}
}
@software{eval-harness,
  author = {Gao, Leo and
            Tow, Jonathan and
            Biderman, Stella and
            Black, Sid and
            DiPofi, Anthony and
            Foster, Charles and
            Golding, Laurence and
            Hsu, Jeffrey and
            McDonell, Kyle and
            Muennighoff, Niklas and
            Phang, Jason and
            Reynolds, Laria and
            Tang, Eric and
            Thite, Anish and
            Wang, Ben and
            Wang, Kevin and
            Zou, Andy},
  title = {A framework for few-shot language model evaluation},
  month = sep,
  year = 2021,
  publisher = {Zenodo},
  version = {v0.0.1},
  doi = {10.5281/zenodo.5371628},
  url = {https://doi.org/10.5281/zenodo.5371628},
}
@misc{rein2023gpqagraduatelevelgoogleproofqa,
  title={GPQA: A Graduate-Level Google-Proof Q&A Benchmark},
  author={David Rein and Betty Li Hou and Asa Cooper Stickland and Jackson Petty and Richard Yuanzhe Pang and Julien Dirani and Julian Michael and Samuel R. Bowman},
  year={2023},
  eprint={2311.12022},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2311.12022},
}
@article{sakaguchi2021winogrande,
  title={Winogrande: An adversarial winograd schema challenge at scale},
  author={Sakaguchi, Keisuke and Bras, Ronan Le and Bhagavatula, Chandra and Choi, Yejin},
  journal={Communications of the ACM},
  volume={64},
  number={9},
  pages={99--106},
  year={2021},
  publisher={ACM New York, NY, USA}
}
@article{cobbe2021training,
  title={Training verifiers to solve math word problems},
  author={Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and others},
  journal={arXiv preprint arXiv:2110.14168},
  year={2021}
}
@article{paech2023eq,
  title={Eq-bench: An emotional intelligence benchmark for large language models},
  author={Paech, Samuel J},
  journal={arXiv preprint arXiv:2312.06281},
  year={2023}
}
@misc{zhou2023instructionfollowingevaluationlargelanguage,
  title={Instruction-Following Evaluation for Large Language Models},
  author={Jeffrey Zhou and Tianjian Lu and Swaroop Mishra and Siddhartha Brahma and Sujoy Basu and Yi Luan and Denny Zhou and Le Hou},
  year={2023},
  eprint={2311.07911},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2311.07911},
}
@article{lee2024kornat,
  title={KorNAT: LLM Alignment Benchmark for Korean Social Values and Common Knowledge},
  author={Lee, Jiyoung and Kim, Minwoo and Kim, Seungho and Kim, Junghwan and Won, Seunghyun and Lee, Hwaran and Choi, Edward},
  journal={arXiv preprint arXiv:2402.13605},
  year={2024}
}
"""