athirdpath's picture
Adding Evaluation Results (#1)
47ee831 verified
metadata
language:
  - en
license: apache-2.0
tags:
  - text-generation-inference
  - transformers
  - unsloth
  - llama
  - trl
  - sft
base_model: athirdpath/Llama-3.1-Instruct_NSFW-pretrained_e1
model-index:
  - name: Llama-3.1-Instruct_NSFW-pretrained_e1-plus_reddit
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: IFEval (0-Shot)
          type: HuggingFaceH4/ifeval
          args:
            num_few_shot: 0
        metrics:
          - type: inst_level_strict_acc and prompt_level_strict_acc
            value: 45.21
            name: strict accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=athirdpath/Llama-3.1-Instruct_NSFW-pretrained_e1-plus_reddit
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: BBH (3-Shot)
          type: BBH
          args:
            num_few_shot: 3
        metrics:
          - type: acc_norm
            value: 28.02
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=athirdpath/Llama-3.1-Instruct_NSFW-pretrained_e1-plus_reddit
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MATH Lvl 5 (4-Shot)
          type: hendrycks/competition_math
          args:
            num_few_shot: 4
        metrics:
          - type: exact_match
            value: 8.84
            name: exact match
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=athirdpath/Llama-3.1-Instruct_NSFW-pretrained_e1-plus_reddit
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GPQA (0-shot)
          type: Idavidrein/gpqa
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 5.59
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=athirdpath/Llama-3.1-Instruct_NSFW-pretrained_e1-plus_reddit
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MuSR (0-shot)
          type: TAUR-Lab/MuSR
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 8.3
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=athirdpath/Llama-3.1-Instruct_NSFW-pretrained_e1-plus_reddit
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU-PRO (5-shot)
          type: TIGER-Lab/MMLU-Pro
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 28.5
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=athirdpath/Llama-3.1-Instruct_NSFW-pretrained_e1-plus_reddit
          name: Open LLM Leaderboard

athirdpath/Llama-3.1-Instruct_NSFW-pretrained_e1 further pretrained on 1 epoch of the dirty stories from nothingiisreal/Reddit-Dirty-And-WritingPrompts, with all scores below 2 dropped.


Why do this? I have a niche use case where I cannot increase compute over 8b, and L3/3.1 are the only models in this size category that meet my needs for logic. However, both versions of L3/3.1 have the damn repetition/token overconfidence problem, and this is meant to disrupt that certainty without disrupting the model's ability to function.

By the way, I think it's the lm_head that is causing the looping, but it might be the embeddings being too separated. I'm not going to pay two more times to test them separately, however :p

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric Value
Avg. 20.74
IFEval (0-Shot) 45.21
BBH (3-Shot) 28.02
MATH Lvl 5 (4-Shot) 8.84
GPQA (0-shot) 5.59
MuSR (0-shot) 8.30
MMLU-PRO (5-shot) 28.50