
In-context learning (ICL) evaluation

This folder contains the MosaicML LLM evaluation suite. It is a blazingly fast, multi-GPU-enabled ICL evaluation suite with native FSDP compatibility, and it works with any model on the HuggingFace Hub as well as any PyTorch model that implements the ComposerModel interface. We also include a collection of ICL datasets, which we refer to as our Model Gauntlet, organized into 6 broad categories of competency that we expect good foundation models to have.

You can evaluate a model by preparing an evaluation YAML following the format of the examples in the scripts/eval/yamls directory.


Quickstart

Offline evaluation

To run offline evaluation, download this repo and run the following commands:

cd llm-foundry/scripts
composer eval/eval.py eval/yamls/hf_eval.yaml

This will run EleutherAI/gpt-neo-125m through the MosaicML Eval Gauntlet, a diverse evaluation suite consisting of over 30 benchmarks. You can update the configuration directly in the hf_eval.yaml YAML file, or override the values in the YAML with CLI args, such as:

cd llm-foundry/scripts
composer eval/eval.py eval/yamls/hf_eval.yaml \
    model_name_or_path=mosaicml/mpt-7b

You can also modify the specific benchmarks executed and their formatting by editing the contents of tasks.yaml, and you can modify the choice of composite scores and the set of tasks they consist of by editing eval_gauntlet.yaml.

Evaluation during training

To run evaluation during training, download this repo, follow the instructions in scripts/train/README.md to perform single-node pre-training, and run the following commands:

cd llm-foundry/scripts/train
composer train.py yamls/pretrain/mpt-125m_eval.yaml train_loader.dataset.split=train_small eval_loader.dataset.split=val_small

You can also modify the specific benchmarks executed and their formatting by editing the contents of tasks.yaml, and you can modify the choice of composite scores and the set of tasks they consist of by editing eval_gauntlet.yaml. You can also choose to either run the full evaluation or run each benchmark on a subset of batches by setting icl_subset_num_batches.


In-depth walkthrough

ICL evaluation can be done offline via the scripts/eval/eval.py script or during training via scripts/train/train.py.

In order to do ICL evaluation you must specify a set of benchmarks you'd like to run via the icl_tasks key in your eval/training config. icl_tasks can either be a list of task configs specified directly in your YAML, or a file path pointing to a locally accessible YAML config (see scripts/eval/yamls/tasks.yaml for an example).

ICL task YAML format

Your YAML must have a config section entitled icl_tasks that specifies the benchmarks to evaluate against. This can either be a list of dictionaries of the form

icl_tasks:
  -
    label: piqa
    dataset_uri: # ADD YOUR OWN DATASET URI
    num_fewshot: [5]
    icl_task_type: multiple_choice
    continuation_delimiter: ' '
    example_delimiter: "\n"
    prompt_string: ''
  -
    label: lambada
    dataset_uri: # ADD YOUR OWN DATASET URI
    num_fewshot: [0]
    icl_task_type: language_modeling

or a local path pointing to a YAML containing an icl_tasks config.

Note that if continuation_delimiter, example_delimiter, or prompt_string are omitted they will default to the values below:

continuation_delimiter: ' '
example_delimiter: "\n"
prompt_string: ''

Eval gauntlet YAML format

Your YAML may optionally have a config section entitled eval_gauntlet specifying how to aggregate the results (if absent, only the individual benchmark accuracies will be reported). After the tasks listed in the icl_tasks config are evaluated, the eval script will use the eval_gauntlet config, if specified, to aggregate the individual benchmarks into composite scores.

An eval_gauntlet config must specify the list of categories you'd like to generate composite scores for, as well as the list of benchmarks included in each category. For each benchmark you need to list the name and the num_fewshot; these two values must exactly match the values specified in the icl_tasks config. Additionally, you must specify the random baseline accuracy for each benchmark.

There are also three flags indicating how to perform the aggregation:

  1. weighting can either be EQUAL (all tasks are weighted equally), SAMPLE_SZ (tasks are weighted proportional to the size of the dataset), or LOG_SAMPLE_SZ (tasks are weighted proportional to the logarithm of the dataset size).
  2. subtract_random_baseline can either be true or false. If true, we will subtract the random baseline accuracy from the final accuracy before averaging; otherwise it will be averaged in as is.
  3. rescale_accuracy can either be true or false. If true (and if subtract_random_baseline was also true), the accuracy will be rescaled to be <= 1 before averaging.

An example config is below:

eval_gauntlet:
  weighting: EQUAL
  subtract_random_baseline: true
  rescale_accuracy: true
  categories:
  - name: world_knowledge
    benchmarks:
    - name: jeopardy
      num_fewshot: 10
      random_baseline: 0
    - name: mmlu
      num_fewshot: 10
      random_baseline: 0.25
  - name: language_understanding
    benchmarks:
    - name: lambada_openai
      num_fewshot: 0
      random_baseline: 0.0
    - name: hellaswag
      num_fewshot: 10
      random_baseline: 0.25

You can either specify your eval_gauntlet config directly in your eval/train YAML or via a local path pointing to a YAML containing an eval_gauntlet config.
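
For intuition, the sketch below shows how these flags could combine per-benchmark accuracies into a category score. It is an illustration of the rules described above rather than the actual implementation: the accuracies, baselines, and dataset sizes are made up, and the exact rescaling formula is an assumption.

import math

# Hypothetical per-benchmark results: (accuracy, random_baseline, dataset_size)
results = {
    "jeopardy": (0.42, 0.0, 2117),
    "mmlu":     (0.31, 0.25, 14042),
}

def category_score(results, weighting="EQUAL",
                   subtract_random_baseline=True, rescale_accuracy=True):
    scores, weights = [], []
    for acc, baseline, size in results.values():
        if subtract_random_baseline:
            acc = acc - baseline
            if rescale_accuracy:
                # One plausible rescaling so a perfect model still scores 1.0;
                # the exact formula used by the eval script may differ.
                acc = acc / (1.0 - baseline)
        scores.append(acc)
        if weighting == "EQUAL":
            weights.append(1.0)
        elif weighting == "SAMPLE_SZ":
            weights.append(size)
        elif weighting == "LOG_SAMPLE_SZ":
            weights.append(math.log(size))
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

print(category_score(results))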

Offline evaluation

You can run the evaluation script on a model checkpoint via composer eval/eval.py YOUR_YAML from the scripts directory, or launch it on the MosaicML platform using an MCLI YAML following the format of llm-foundry/mcli/mcli-1b-eval.yaml.

You can use the default icl_tasks and eval_gauntlet configs or specify your own following the instructions above.

Evaluation during training

You can use ICL evaluation during training by taking an ordinary training YAML and adding an icl_tasks and eval_gauntlet config. You should also specify icl_seq_len in your training YAML, and you can optionally run a truncated version of eval on a random subset of each benchmark by specifying a value for icl_subset_num_batches.

An example is given below:

  icl_tasks: eval/yamls/tasks.yaml # or use tasks_light.yaml
  icl_subset_num_batches: 100 # -1, or omit this key entirely, to evaluate on all batches
  eval_gauntlet: 'eval/yamls/eval_gauntlet.yaml'
  icl_seq_len: 1024

For training, we recommend you do not run the full eval gauntlet. Instead, either use tasks_light.yaml, which is a subset of the full gauntlet benchmarks, or set icl_subset_num_batches to a small number (on the order of 100), which will run each benchmark on only a random sample of icl_subset_num_batches batches.

You can use the default icl_tasks and eval_gauntlet configs or specify your own following the instructions above.


ICL Tasks

ICL evaluation measures a model’s ability to solve novel problems by being provided examples in-context without ever being specifically trained to answer such questions.

Composer supports a number of different standard ICL formats and allows users to upload their own datasets that correspond to those formats.

This document explains the ICL formats compatible with Composer, summarizes how to add new datasets in those formats, and catalogs the datasets currently used by the research team to evaluate models.


Supported ICL formats

Composer currently supports five ICL formats:

  1. InContextLearningQATaskDataset
  2. InContextLearningLMTaskDataset
  3. InContextLearningMultipleChoiceTaskDataset
  4. InContextLearningSchemaTaskDataset
  5. InContextLearningCodeEvalDataset

InContextLearningQATaskDataset

The ICL question answering (QA) task supports free response question answering evaluation using the model’s generate function. A QA dataset consists of a list of JSONs containing a question (under the key context), a correct answer (under the key answer), and a list of alternative spellings of the answer that are considered permissible (under the key aliases). The QA task works with the NLP metric InContextLearningQAAccuracy, which marks a model's output as "correct" if, conditioned on the context, the model's generate method produces a string that, after normalization, begins with either the answer or any of the aliases.

Required keys for each datum:

  • context: str
  • answer: str
  • aliases: List[str]

An example datum is below:

{"context": "What star sign is Jamie Lee Curtis?", "answer": "Scorpio", "aliases": ["Scorpio", "Skorpio"]}

The QA task expects a prompt string, a continuation delimiter to separate questions from answers, an example delimiter to separate few shot examples from one another, and a question prelimiter to put before each question. If using the following settings, with 2 examples in context, the above datum may be rendered to the model as:

prompt_string: "Answer the following trivia question:\n", example_delimiter: "\n", continuation_delimiter: " Answer: ", question_prelimiter: "Question: "

Answer the following trivia question: Question: What is the Japanese share index called? Answer: Nikkei Question: Who was the man behind The Chipmunks? Answer: David Seville Question: What star sign is Jamie Lee Curtis? Answer:

The model would then be expected to generate a series of tokens beginning with either of the aliases: Scorpio/Skorpio.
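
Concretely, the check can be thought of as in the sketch below. This is a rough illustration only: the normalization shown (lowercasing, stripping punctuation and articles) is an assumption, and the metric's actual normalization rules may differ.

import re
import string

def normalize(text):
    # Assumed normalization: lowercase, drop punctuation, drop articles.
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\b(a|an|the)\b", " ", text).strip()

def is_correct(generated, answer, aliases):
    # Correct if the normalized generation begins with a normalized answer/alias.
    gen = normalize(generated)
    return any(gen.startswith(normalize(a)) for a in [answer, *aliases])

print(is_correct("Scorpio, of course", "Scorpio", ["Scorpio", "Skorpio"]))  # True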

Below is a complete YAML section that works with the TriviaQA dataset in scripts/eval/local_data/triviaqa.jsonl:

label: triviaqa
dataset_uri: local_data/triviaqa.jsonl
num_fewshot:
- 0
- 1
- 5
- 10
batch_size: 4
icl_task_type: question_answering
metric_names:
- InContextLearningQAAccuracy
prompt_string: '' # this goes at the beginning of each input
example_delimiter: "\n" # this goes between fewshot examples
continuation_delimiter: ' ' # this separates questions from answers

InContextLearningLMTaskDataset

The ICL language modeling (LM) task assesses the model’s ability to predict a precise sequence of tokens (called a continuation) following some context using the model’s forward function. An LM dataset consists of a list of JSONs containing a context (under the key context) and a continuation (under the key continuation) that the model must correctly predict conditioned on the context. The LM task uses the NLP metric InContextLearningLMAccuracy, which assigns a model's output to be "correct" if, conditioned on the context tokens, the model's argmax output logits exactly match the tokens in the continuation.

Required keys for each datum:

  • context: str
  • continuation: str

An example datum is below:

{"context": "With Tristran's next step he was standing beside a lake, and the candlelight shone brightly on the water; and then he was walking through the mountains, through lonely crags, where the candlelight was reflected in the eyes of the creatures of the high snows; and then he was walking through the clouds, which, while not entirely substantial, still supported his weight in comfort; and then, holding tightly to his candle, he was underground, and the candlelight glinted back at him from the wet cave walls; now he was in the mountains once more; and then he was on a road through wild forest, and he glimpsed a chariot being pulled by two goats, being driven by a woman in a red dress who looked, for the glimpse he got of her, the way Boadicea was drawn in his history books; and another step and he was in a leafy glen, and he could hear the chuckle of water as it splashed and sang its way into a small brook.\n\nHe took another step, but he was still in the", "continuation": " glen"}

The LM task expects a prompt string, a continuation delimiter to separate continuation from context, and an example delimiter to separate few shot examples from one another. If using the following settings, with 0 examples in context, the above datum may be rendered to the model as:

With Tristran's next step he was standing beside a lake, and the candlelight shone brightly on the water; and then he was walking through the mountains, through lonely crags, where the candlelight was reflected in the eyes of the creatures of the high snows; and then he was walking through the clouds, which, while not entirely substantial, still supported his weight in comfort; and then, holding tightly to his candle, he was underground, and the candlelight glinted back at him from the wet cave walls; now he was in the mountains once more; and then he was on a road through wild forest, and he glimpsed a chariot being pulled by two goats, being driven by a woman in a red dress who looked, for the glimpse he got of her, the way Boadicea was drawn in his history books; and another step and he was in a leafy glen, and he could hear the chuckle of water as it splashed and sang its way into a small brook. He took another step, but he was still in the

The model would then be expected to output “ glen”.
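
As a rough illustration of this exact-match criterion, the sketch below scores a single LM datum with a HuggingFace causal LM. The use of gpt2 here is purely a stand-in for illustration; the actual dataset class batches and pads these comparisons internally.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in model for illustration
model = AutoModelForCausalLM.from_pretrained("gpt2")

context = "He took another step, but he was still in the"
continuation = " glen"

ctx_ids = tokenizer(context, return_tensors="pt").input_ids
cont_ids = tokenizer(continuation, return_tensors="pt").input_ids
input_ids = torch.cat([ctx_ids, cont_ids], dim=1)

with torch.no_grad():
    logits = model(input_ids).logits

# Predictions for the continuation positions come from the logits at the
# positions immediately preceding each continuation token.
start = ctx_ids.shape[1]
preds = logits[0, start - 1 : input_ids.shape[1] - 1].argmax(dim=-1)
print(torch.equal(preds, cont_ids[0]))  # True only if every token matches exactly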

Below is a YAML section that works with the Lambada OpenAI dataset in scripts/eval/local_data/lambada_openai.jsonl:

label: lambada_openai
dataset_uri: local_data/lambada_openai.jsonl
num_fewshot:
- 0
batch_size: 4
icl_task_type: language_modeling
metric_names:
- InContextLearningLMAccuracy
prompt_string: '' # this goes at the beginning of each input
example_delimiter: "\n" # this goes between fewshot examples
continuation_delimiter: ' ' # this separates contexts from continuations

InContextLearningMultipleChoiceTaskDataset

The ICL multiple choice (MC) task assesses the model’s ability to answer multiple choice questions by assigning highest per token probability to the correct answer. An MC dataset consists of a list of JSONs containing a query (under the key query), a list of choices (under the key choices), and the index indicating the correct answer (under the key gold). The MC task works with the NLP metric InContextLearningMultipleChoiceAccuracy, which separately runs the model's forward() method on the query prepended to each choice, and then determines the model to be correct if the correct choice has the lowest per token perplexity conditioned on the query.

Required keys for each datum:

  • query: str
  • choices: List[str]
  • gold: int

An example datum is below:

{"query": "High jump: A boy is running down a track. The boy", "choices": ["runs into a car.", "gets in a mat.", "lifts his body above the height of a pole.", "stands on his hands and springs."], "gold": 2}

The MC task expects a prompt string, a continuation delimiter to separate continuation from context, and an example delimiter to separate few shot examples from one another. If using the following settings, with 0 examples in context, the above datum may be rendered as four different inputs to the model:

High jump: A boy is running down a track. The boy runs into a car.

High jump: A boy is running down a track. The boy gets in a mat.

High jump: A boy is running down a track. The boy lifts his body above the height of a pole.

High jump: A boy is running down a track. The boy stands on his hands and springs.

The model would be deemed correct if it assigns the lowest per token perplexity to the sequence " lifts his body above the height of a pole."
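
A minimal sketch of this scoring rule is shown below: it computes the average cross-entropy of each choice conditioned on the query (equivalent to comparing per token perplexities) and picks the lowest. Again, gpt2 is only a stand-in model, and the real dataset class batches this computation rather than looping choice by choice.

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in model for illustration
model = AutoModelForCausalLM.from_pretrained("gpt2")

query = "High jump: A boy is running down a track. The boy"
choices = ["runs into a car.", "gets in a mat.",
           "lifts his body above the height of a pole.",
           "stands on his hands and springs."]

def per_token_loss(query, choice):
    q_ids = tokenizer(query, return_tensors="pt").input_ids
    c_ids = tokenizer(" " + choice, return_tensors="pt").input_ids
    input_ids = torch.cat([q_ids, c_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Average cross-entropy of the choice tokens given everything before them.
    start = q_ids.shape[1]
    choice_logits = logits[0, start - 1 : input_ids.shape[1] - 1]
    return F.cross_entropy(choice_logits, c_ids[0]).item()

pred = min(range(len(choices)), key=lambda i: per_token_loss(query, choices[i]))
print(pred)  # the model is "correct" if this equals the gold index (2)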

Below is a YAML section that works with the HellaSwag dataset in scripts/eval/local_data/hellaswag.jsonl:

label: hellaswag
dataset_uri: local_data/hellaswag.jsonl # ADD YOUR OWN DATASET URI
num_fewshot:
- 0
- 1
- 5
- 10
batch_size: 4
icl_task_type: multiple_choice
metric_names:
- InContextLearningMultipleChoiceAccuracy
- InContextLearningMCExpectedCalibrationError
prompt_string: '' # this goes at the beginning of each input
example_delimiter: "\n" # this goes between fewshot examples
continuation_delimiter: ' ' # this separates questions from answers

InContextLearningSchemaTaskDataset

The ICL schema task assesses the model’s ability to determine which of some set of possible contexts (under the key context_options) makes a sequence of tokens (under the key continuation) most likely, with the correct context indicated by "gold". This task is based on A Simple Method for Commonsense Reasoning.

The schema task works with the NLP metric InContextLearningMultipleChoiceAccuracy, which separately runs the model's forward() method on each context option prepended to the continuation and rates the model correct if it assigns minimum per token perplexity to the continuation conditioned on the true context.

Required keys for each datum:

  • context_options: List[str]
  • continuation: str
  • gold: int

An example datum is below:

{"context_options": ["Jim comforted Kevin because Jim", "Jim comforted Kevin because Kevin"], "continuation": "was so upset.", "gold": 1}

The Schema task expects a prompt string, a continuation delimiter to separate continuation from context, and an example delimiter to separate few shot examples from one another. If using the following settings, with 0 few shot examples in context, the above datum may be rendered as two different inputs to the model:

Jim comforted Kevin because Jim was so upset.

Jim comforted Kevin because Kevin was so upset.

The model would be marked correct if the per token perplexity of the sequence " was so upset" is lower in the second version than it is in the first version.

Below is a YAML section that works with the Winograd dataset in scripts/eval/local_data/winograd_wsc.jsonl:

label: winograd
dataset_uri: local_data/winograd_wsc.jsonl
num_fewshot:
- 0
- 1
- 5
- 10
batch_size: 4
icl_task_type: schema
metric_names:
- InContextLearningMultipleChoiceAccuracy
- InContextLearningMCExpectedCalibrationError
prompt_string: '' # this goes at the beginning of each input
example_delimiter: "\n" # this goes between fewshot examples
continuation_delimiter: ' ' # this separates questions from answers

InContextLearningCodeEvalDataset

The ICL CodeEvalDataset takes a prompt and, working with the NLP metric InContextLearningCodeEvalAccuracy, generates code which is run against the supplied tests, as in HumanEval (Evaluating Large Language Models Trained on Code) and MBPP (Program Synthesis with Large Language Models). This generation involves many decoding steps, so it can take longer per sample than other ICL tasks. An example datum:

{"task_id": "JavaScript/2", "prompt": "/* Given a positive floating point number, it can be decomposed into\n  and integer part (largest integer smaller than given number) and decimals\n  (leftover part always smaller than 1).\n\n  Return the decimal part of the number.\n  >>> truncateNumber(3.5)\n  0.5\n  */\nconst truncateNumber = (number) => {\n", "canonical_solution": "  return number % 1.0;\n}\n\n", "test": "const testTruncateNumber = () => {\n  console.assert(truncateNumber(3.5) === 0.5)\n\n  console.assert(Math.abs(truncateNumber(1.33) - 0.33) < 1e-6)\n\n  console.assert(Math.abs(truncateNumber(123.456 - 0.456) < 1e-6))\n}\n\ntestTruncateNumber()\n", "entry_point": "truncateNumber", "test_inputs": ["3.5", "1.33", "123.456"], "test_outputs": ["0.5", "0.33", "0.456"], "language": "javascript"}

Required keys for each datum:

  • prompt: str
  • test: str
  • entry_point: str
  • test_inputs: List[str]
  • test_outputs: List[str]
  • language: str

Code evaluation can happen locally (insecure) or inside an AWS Lambda function sandbox. This is controlled by setting the environment variable CODE_EVAL_DEVICE to LOCAL or LAMBDA. If set to LAMBDA, you must also provide CODE_EVAL_URL and CODE_EVAL_APIKEY to query the API gateway in the AWS Sandbox.
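
The snippet below simply illustrates which variables are involved when using the Lambda sandbox; in practice you would set them in the environment of the eval process, and the URL and API key values shown are placeholders.

import os

# Placeholders: the real URL and key come from your own AWS API Gateway deployment.
os.environ["CODE_EVAL_DEVICE"] = "LAMBDA"
os.environ["CODE_EVAL_URL"] = "https://<your-api-gateway-endpoint>"
os.environ["CODE_EVAL_APIKEY"] = "<your-api-key>"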


Build your own dataset (BYOD)

Building a dataset compatible with our eval suite is very easy if it fits one of the four supported non-code task types. Simply choose the appropriate task type (LM, MC, QA, or Schema) and process each dataset into a jsonl format in which each row has the format described above.

Below is a minimal script which prepares the Winograd schema challenge hosted on HuggingFace. This script can be modified to generate other datasets based on the HuggingFace dataset hub.

import json  # needed to write the jsonl output

from datasets import load_dataset

upper_pronouns = [
    "A",
    "An",
    "The",
    "She",
    "He",
    "It",
    "They",
    "My",
    "His",
    "Her",
    "Their",
]

def __normalize_option(doc, option):
    # this function adapted from EleutherAI/lm-evaluation-harness

    # Append `'s` to possessive determiner based options.
    if doc["pronoun"].lower() in ["my", "his", "her", "our", "their"]:
        option += "'s"
    # Appropriately lowercase the pronoun in the option.
    pronoun = option.split()[0]
    start_of_sentence = doc["text"][doc["pronoun_loc"] - 2] == "."
    if not start_of_sentence and pronoun in upper_pronouns:
        return option.replace(pronoun, pronoun.lower())
    return option

def lower_first_letter(s):
    return s[0:1].lower() + s[1:]

def prep_winograd_wsc(row):
    # this function adapted from EleutherAI/lm-evaluation-harness

    prefix = row['text'][:row['pronoun_loc']]
    continuation = row['text'][row['pronoun_loc'] + len(row['pronoun']):]

    context_options = [
        prefix + __normalize_option(row, o) for o in row['options']
    ]

    return {
        "context_options": context_options,
        "continuation": continuation,
        "gold": row['label']
    }

def prep_dataset(out_file):
    dataset_name = ('winograd_wsc', 'wsc273')
    dataset = load_dataset(*dataset_name)

    with open(out_file, "w", encoding='utf8') as f:
        if dataset_name[0] == 'winogrande':
            split = dataset['validation']
        else:
            split = dataset['test'] if 'test' in dataset \
                else dataset['validation']
        for row in split:
            row = prep_winograd_wsc(row)
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

Similarly, you can compile a dataset directly from EleutherAI/lm-evaluation-harness by modifying the script below:

import json

# Assumes EleutherAI's lm-evaluation-harness is installed and exposes the
# older tasks API (lm_eval.tasks.get_task_dict) used when this script was written.
from lm_eval import tasks as lm_eval_tasks

def prep_triviaqa(row):
    return {
        "context": f"Question: {row['question']}\nAnswer:",
        "answer": row['answer']['value'],
        "aliases": row['answer']['aliases']
    }

def prep_dataset(out_file):
    task = lm_eval_tasks.get_task_dict(['triviaqa'])['triviaqa']

    if task.has_test_docs():
        task_doc_func = task.test_docs
        task_set = "test"
    elif task.has_validation_docs():
        task_set = "val"
        task_doc_func = task.validation_docs

    with open(out_file, "w", encoding='utf8') as f:
        for row in task_doc_func():
            row = prep_triviaqa(row)
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

A note on delimiters and tokenizers

When formatting samples, prompt_string is prepended to the beginning, then num_fewshot examples from the dataset are concatenated. Within each few shot example, the context and continuation are separated by the continuation_delimiter, and the examples are separated from one another by the example_delimiter. Finally, we append the context/query/question/context options of the current sample to be evaluated, followed by the continuation_delimiter.

Thus the structure of each question's preamble is prompt | few shot examples | context | continuation delimiter. The continuation (aka choices for MC) is then tokenized separately and the tokens of the preamble and tokens of the continuation are concatenated. It is important to note that if the continuation delimiter has a trailing space, it is stripped and instead prepended to the continuation. Furthermore, if the continuation does not have a leading space, one will be prepended.
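
The sketch below illustrates this assembly and the trailing-space handling for a single example. It is a simplified illustration of the behavior described above, not the actual dataset code.

def build_preamble(prompt_string, fewshot_examples, context,
                   continuation_delimiter=' ', example_delimiter='\n'):
    # Few-shot examples: context + continuation_delimiter + continuation,
    # joined by example_delimiter.
    shots = example_delimiter.join(
        ex_ctx + continuation_delimiter + ex_cont
        for ex_ctx, ex_cont in fewshot_examples)
    preamble = prompt_string + shots
    if shots:
        preamble += example_delimiter
    # Append the current example's context and the continuation delimiter.
    return preamble + context + continuation_delimiter

def split_for_tokenization(preamble, continuation):
    # A trailing space on the preamble is stripped and effectively moved onto
    # the continuation; the continuation always gets a leading space.
    if preamble.endswith(' '):
        preamble = preamble.rstrip(' ')
    if not continuation.startswith(' '):
        continuation = ' ' + continuation
    return preamble, continuation  # each is tokenized separately, then concatenated

preamble = build_preamble('', [("Who wrote Hamlet?", "Shakespeare")],
                          "He took another step, but he was still in the")
print(split_for_tokenization(preamble, "glen"))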