`this space` => `this dataset` (#2)
(6a0fba90794888ffb10b0a2b20a605c2282fae1b)
Co-authored-by: Pierric Cistac <pierric@users.noreply.huggingface.co>
content.py +1 -1
```diff
@@ -7,7 +7,7 @@ GAIA is a benchmark which aims at evaluating next-generation LLMs (LLMs with aug
 GAIA is made of more than 450 non-trivial question with an unambiguous answer, requiring different levels of tooling and autonomy to solve.
 It is therefore divided in 3 levels, where level 1 should be breakable by very good LLMs, and level 3 indicate a strong jump in model capabilities. Each level is divided into a fully public dev set for validation, and a test set with private answers and metadata.
 
-GAIA data can be found in [this
+GAIA data can be found in [this dataset](https://huggingface.co/datasets/gaia-benchmark/GAIA). Questions are contained in `metadata.jsonl`. Some questions come with an additional file, that can be found in the same folder and whose id is given in the field `file_name`.
 
 ## Submissions
 Results can be submitted for both validation and test. Scores are expressed as the percentage of correct answers for a given split.
```
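The updated text describes the dataset layout: one JSON record per line in `metadata.jsonl`, with an optional attachment in the same folder named by the `file_name` field. A minimal sketch of reading such a file, assuming a local copy; the helper name `load_questions` and any field other than `file_name` are illustrative, not part of the dataset's documented schema:

```python
import json
import os

def load_questions(path="metadata.jsonl"):
    """Read one JSON record per line from a GAIA-style metadata.jsonl file.

    If a record has a non-empty `file_name`, the attachment is assumed to
    sit in the same folder as metadata.jsonl, and its resolved path is
    stored under the (hypothetical) key `file_path`.
    """
    questions = []
    folder = os.path.dirname(os.path.abspath(path))
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            record = json.loads(line)
            if record.get("file_name"):
                record["file_path"] = os.path.join(folder, record["file_name"])
            questions.append(record)
    return questions
```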