---
license: other
tags:
  - storywriting
  - text adventure
  - not-for-all-audiences
license_name: microsoft-research-license
model-index:
  - name: psyonic-cetacean-20B
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: IFEval (0-Shot)
          type: HuggingFaceH4/ifeval
          args:
            num_few_shot: 0
        metrics:
          - type: inst_level_strict_acc and prompt_level_strict_acc
            value: 25.44
            name: strict accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=jebcarter/psyonic-cetacean-20B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: BBH (3-Shot)
          type: BBH
          args:
            num_few_shot: 3
        metrics:
          - type: acc_norm
            value: 27.84
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=jebcarter/psyonic-cetacean-20B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MATH Lvl 5 (4-Shot)
          type: hendrycks/competition_math
          args:
            num_few_shot: 4
        metrics:
          - type: exact_match
            value: 0.98
            name: exact match
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=jebcarter/psyonic-cetacean-20B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GPQA (0-shot)
          type: Idavidrein/gpqa
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 3.13
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=jebcarter/psyonic-cetacean-20B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MuSR (0-shot)
          type: TAUR-Lab/MuSR
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 16.9
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=jebcarter/psyonic-cetacean-20B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU-PRO (5-shot)
          type: TIGER-Lab/MMLU-Pro
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 20.95
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=jebcarter/psyonic-cetacean-20B
          name: Open LLM Leaderboard
---

Presenting the FP16 files for Psyonic-Cetacean-20B! This is an experimental Llama2-based stack merge built from the models and recipe below:

```yaml
slices:
  - sources:
    - model: Orca2flat
      layer_range: [0, 16]
  - sources:
    - model: LLaMA2-13B-Psyfighter2  # FP16 not yet available
      layer_range: [8, 24]
  - sources:
    - model: Orca2flat
      layer_range: [17, 32]
  - sources:
    - model: LLaMA2-13B-Psyfighter2  # FP16 not yet available
      layer_range: [25, 40]
merge_method: passthrough
dtype: float16
```
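
A recipe like this is normally executed with [mergekit](https://github.com/arcee-ai/mergekit). Below is a minimal sketch of driving the merge from Python, assuming mergekit is installed, the config above is saved as `config.yml`, and the two parent models are available locally under the names used in the recipe; the output path and options are illustrative, not our exact invocation:

```python
# Minimal sketch: run the stack merge via mergekit's Python API.
# Assumes `pip install mergekit` and the recipe above saved as config.yml.
import torch
import yaml

from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

with open("config.yml", "r", encoding="utf-8") as fp:
    merge_config = MergeConfiguration.model_validate(yaml.safe_load(fp))

run_merge(
    merge_config,
    "./psyonic-cetacean-20B",            # output directory (illustrative)
    options=MergeOptions(
        cuda=torch.cuda.is_available(),  # merge on GPU when one is present
        copy_tokenizer=True,             # carry a tokenizer into the output
    ),
)
```

The same merge can be run from the shell with `mergekit-yaml config.yml ./psyonic-cetacean-20B`.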

Note: while we did run an inverted merge, the output was not satisfactory and will not be released.

We first flattened the additional ChatML vocabulary tokens out of Orca-2-13B (sketched below), then performed a stack merge with Psyfighter-2-13B. The results surprised us with their vividness, freshness of prose, obedience to instruction prompting, and formatting cohesion.
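
The exact flattening script isn't reproduced here, but conceptually it trims the embedding matrices back to the base Llama-2 vocabulary so both merge parents line up. A hypothetical sketch with transformers (the model paths and the use of the base tokenizer's 32,000-token vocabulary are assumptions, not our actual procedure):

```python
# Hypothetical sketch of "flattening" added ChatML tokens: shrink Orca-2's
# input and output embedding matrices back to the base Llama-2 vocabulary.
# resize_token_embeddings truncates the trailing rows when the new size is
# smaller, which is what drops the appended special tokens.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("microsoft/Orca-2-13b")
base_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")

model.resize_token_embeddings(len(base_tokenizer))  # back to 32,000 rows

model.save_pretrained("./Orca2flat")
base_tokenizer.save_pretrained("./Orca2flat")
```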

This model is focused on storywriting and text adventure, with a side order of Assistant and Chat functionality. Like its ancestor Psyfighter-2, this model will function better if you let it improvise and riff on your concepts rather than feeding it an excess of detail. Additionally, either the removal of the ChatML vocab or the stack-merging process itself has produced not just an uncensored model but an actively anti-censored one, so please be aware that this model can and will kill you during adventures or output NSFW material if prompted accordingly.

During testing, the model exhibited an especially strong affinity for science fiction and space opera writing, while handling fantasy elements quite well and horror elements slightly less so. Refer to the Psyfighter-2 model card for best prompting practices.

Despite that anti-censored lean, we have tested the model out to 16,000 tokens of context via RoPE scaling, and it does not drive toward NSFW on its own. It will follow your tone and style very well.
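
As a reference point, here is a minimal sketch of loading the model at extended context with linear RoPE scaling in transformers and running a short storywriting prompt; the scaling factor of 4.0 (4096 × 4 = 16,384), the prompt, and the sampling settings are assumptions rather than tested recommendations:

```python
# Sketch: load at ~16k context via linear RoPE scaling, then generate.
# Factor 4.0 stretches Llama-2's native 4096-token window to 16384.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jebcarter/psyonic-cetacean-20B")
model = AutoModelForCausalLM.from_pretrained(
    "jebcarter/psyonic-cetacean-20B",
    torch_dtype=torch.float16,
    device_map="auto",
    rope_scaling={"type": "linear", "factor": 4.0},  # illustrative values
)

prompt = "The survey ship Cetacean dropped out of foldspace above a dead world."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```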

Please enjoy, and if you encounter anything exciting or weird, please reach out to me at jebcarter@pm.me.

Special thanks as always to the KoboldAI crew who provided the mergebox, testing, and feedback on this model, and to gelukuMLG for the model mascot!

## Open LLM Leaderboard Evaluation Results

Detailed results can be found on the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=jebcarter/psyonic-cetacean-20B).

| Metric              | Value |
|---------------------|------:|
| Avg.                | 15.87 |
| IFEval (0-Shot)     | 25.44 |
| BBH (3-Shot)        | 27.84 |
| MATH Lvl 5 (4-Shot) |  0.98 |
| GPQA (0-shot)       |  3.13 |
| MuSR (0-shot)       | 16.90 |
| MMLU-PRO (5-shot)   | 20.95 |