Spaces: gaia-benchmark/leaderboard

gregmialz committed on Nov 14, 2023

Commit bb6c22e • 1 Parent(s): ac816a0

Update content.py

Files changed (1)

content.py +9 -3

@@ -3,6 +3,9 @@ TITLE = """<h1 align="center" id="space-title">GAIA Leaderboard</h1>"""
 CANARY_STRING = "" # TODO
 INTRODUCTION_TEXT = """
 Large language models have seen their potential capabilities increased by several orders of magnitude with the introduction of augmentations, from simple prompting adjustment to actual external tooling (calculators, vision models, ...) or online web retrieval.
 To evaluate the next generation of LLMs, we argue for a new kind of benchmark, simple and yet effective to measure actual progress on augmented capabilities, and therefore present GAIA. Details in the paper.
@@ -10,7 +13,11 @@ GAIA is made of 3 evaluation levels, depending on the added level of tooling and
 We expect the level 1 to be breakable by very good LLMs, and the level 3 to indicate a strong jump in model capabilities.
 Each of these levels is divided into two sets: a fully public dev set, on which people can test their models, and a test set with private answers and metadata. Results can be submitted for both validation and test.
-The data can be found in this space (https://huggingface.co/datasets/gaia-benchmark/GAIA). Questions are contained in `metadata.jsonl`. Some questions come with an additional file, that can be found in the same folder and whose id is given in the field `file_name`.
 We expect submissions to be json-line files with the following format. The first two fields are mandatory, `reasoning_trace` is optional:
 ```
@@ -19,8 +26,7 @@ We expect submissions to be json-line files with the following format. The first
 ...
 ```
-Scores are expressed as the percentage of correct answers for a given split.
 Submissions made by our team are labelled "GAIA authors". While we report average scores over different runs when possible in our paper, we only report the best run in the leaderboard.
 Please do not repost the public dev set, nor use it in training data for your models.

 CANARY_STRING = "" # TODO
 INTRODUCTION_TEXT = """
+# Summary
 Large language models have seen their potential capabilities increased by several orders of magnitude with the introduction of augmentations, from simple prompting adjustment to actual external tooling (calculators, vision models, ...) or online web retrieval.
 To evaluate the next generation of LLMs, we argue for a new kind of benchmark, simple and yet effective to measure actual progress on augmented capabilities, and therefore present GAIA. Details in the paper.
 We expect the level 1 to be breakable by very good LLMs, and the level 3 to indicate a strong jump in model capabilities.
 Each of these levels is divided into two sets: a fully public dev set, on which people can test their models, and a test set with private answers and metadata. Results can be submitted for both validation and test.
+# Data
+GAIA data can be found in this space (https://huggingface.co/datasets/gaia-benchmark/GAIA). It consists of ~466 questions distributed across two splits, with a similar distribution of levels. Questions are contained in `metadata.jsonl`. Some questions come with an additional file, that can be found in the same folder and whose id is given in the field `file_name`.
+# Submissions
 We expect submissions to be json-line files with the following format. The first two fields are mandatory, `reasoning_trace` is optional:
 ```
 ...
 ```
+Scores are expressed as the percentage of correct answers for a given split.
 Submissions made by our team are labelled "GAIA authors". While we report average scores over different runs when possible in our paper, we only report the best run in the leaderboard.
 Please do not repost the public dev set, nor use it in training data for your models.
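
Note on the text above: the added `# Data` and `# Submissions` sections describe a JSON-lines metadata file, an optional `file_name` field pointing to an extra file in the same folder, and scores given as the percentage of correct answers per split. A minimal Python sketch of how a reader might work with that layout follows. The helper names (`read_jsonl`, `attached_files`, `percent_correct`) are hypothetical, the id-to-answer mappings are an assumption because the submission field names are elided in this diff, and exact string matching is an assumption rather than the leaderboard's actual scorer.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON-lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def attached_files(records, folder):
    """Return paths of the extra files referenced by a non-empty `file_name`."""
    return [Path(folder) / r["file_name"] for r in records if r.get("file_name")]


def percent_correct(predictions, references):
    """Percentage of reference ids whose predicted answer matches exactly.

    Both arguments map a question id to an answer string (hypothetical layout).
    Exact match after stripping whitespace is an assumption for illustration,
    not the scoring rule used by the leaderboard.
    """
    if not references:
        return 0.0
    hits = sum(
        1
        for qid, truth in references.items()
        if predictions.get(qid, "").strip() == truth.strip()
    )
    return 100.0 * hits / len(references)
```

A question missing from `predictions` counts as incorrect here, so the denominator is always the number of reference questions in the split.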