asharsha30/LLAMA_Harsha_8_B_ORDP_10k

This model is a fine-tune of NousResearch/Meta-Llama-3-8B, trained with ORPO for 12,000 steps on mlabonne/orpo-dpo-mix-40k.
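
The fine-tuning script itself is not included in this card; for reference, a comparable run can be set up with the ORPOTrainer from Hugging Face TRL. The snippet below is only a minimal sketch under assumed hyperparameters (learning rate, batch size, beta, chat template), not the exact recipe used for this model.

# Minimal ORPO fine-tuning sketch with TRL -- assumed hyperparameters, not the exact recipe
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

base = "NousResearch/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(base)
# Note: the base tokenizer ships without a chat template; one (e.g. ChatML) usually has to be set first
model = AutoModelForCausalLM.from_pretrained(base)

# orpo-dpo-mix-40k provides the prompt/chosen/rejected columns that ORPOTrainer expects
dataset = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train")

config = ORPOConfig(
    output_dir="llama3-8b-orpo",
    max_steps=12_000,                 # matches the step count mentioned above
    per_device_train_batch_size=1,    # assumed
    learning_rate=8e-6,               # assumed
    beta=0.1,                         # ORPO odds-ratio weight, assumed
)

trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,       # use tokenizer=... on older TRL versions
)
trainer.train()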

💻 Usage

# Use a pipeline as a high-level helper
from transformers import pipeline

# Chat-style input: a list of {"role", "content"} messages
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe = pipeline("text-generation", model="asharsha30/LLAMA_Harsha_8_B_ORDP_10k")
print(pipe(messages))
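
For more control over generation, the model can also be loaded explicitly. The sketch below assumes the uploaded tokenizer includes a chat template and that FP16 weights fit on the available GPU; adjust the dtype and device settings as needed.

# Explicit load with AutoModelForCausalLM (dtype/device settings are assumptions)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "asharsha30/LLAMA_Harsha_8_B_ORDP_10k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

messages = [{"role": "user", "content": "Who are you?"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))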

📈 Training and Evaluation Report:

The training run is logged on Weights & Biases:

https://wandb.ai/asharshavardhana96-texas-a-m-university/huggingface/runs/gky6j4vn?nw=nwuserasharshavardhana96

Acknowledgment:

Huge thanks to Maxime Labonne for his brilliant blog post covering the techniques for fine-tuning Llama models with SFT and ORPO.

Evaluated Using:

The model was evaluated with https://github.com/mlabonne/llm-autoeval; the results below are summarized from the generated gist: https://gist.github.com/asharsha30-1996/4162fc98d9669aab3080645c54905bd0
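
Under the hood, llm-autoeval drives these suites through forks of EleutherAI's lm-evaluation-harness. The snippet below is a rough sketch of re-running a few GPT4All-style tasks locally with the upstream harness; the task selection and settings are assumptions and will not exactly reproduce the Nous suite used for the numbers below.

# Rough local re-run of a subset of the evaluation (assumed task names and settings)
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=asharsha30/LLAMA_Harsha_8_B_ORDP_10k,dtype=float16",
    tasks=["arc_challenge", "hellaswag", "winogrande", "truthfulqa_mc2"],
    batch_size=8,
)
print(results["results"])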

Accuracy on the Nous benchmark suite:

|Model|AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|---|---:|---:|---:|---:|---:|
|LLAMA_Harsha_8_B_ORDP_10k|35.54|71.15|55.39|37.96|50.01|

AGIEval

|Task|Version|Metric|Value| |Stderr|
|---|---:|---|---:|---|---:|
|agieval_aqua_rat|0|acc|26.77|±|2.78|
| | |acc_norm|27.17|±|2.80|
|agieval_logiqa_en|0|acc|31.34|±|1.82|
| | |acc_norm|33.03|±|1.84|
|agieval_lsat_ar|0|acc|18.70|±|2.58|
| | |acc_norm|19.57|±|2.62|
|agieval_lsat_lr|0|acc|42.94|±|2.19|
| | |acc_norm|35.10|±|2.12|
|agieval_lsat_rc|0|acc|52.42|±|3.05|
| | |acc_norm|43.87|±|3.03|
|agieval_sat_en|0|acc|65.53|±|3.32|
| | |acc_norm|54.37|±|3.48|
|agieval_sat_en_without_passage|0|acc|41.75|±|3.44|
| | |acc_norm|33.98|±|3.31|
|agieval_sat_math|0|acc|42.27|±|3.34|
| | |acc_norm|37.27|±|3.27|

Average: 35.54%

GPT4All

|Task|Version|Metric|Value| |Stderr|
|---|---:|---|---:|---|---:|
|arc_challenge|0|acc|49.91|±|1.46|
| | |acc_norm|54.10|±|1.46|
|arc_easy|0|acc|80.47|±|0.81|
| | |acc_norm|80.05|±|0.82|
|boolq|1|acc|82.08|±|0.67|
|hellaswag|0|acc|61.08|±|0.49|
| | |acc_norm|80.26|±|0.40|
|openbookqa|0|acc|34.00|±|2.12|
| | |acc_norm|45.00|±|2.23|
|piqa|0|acc|79.71|±|0.94|
| | |acc_norm|81.61|±|0.90|
|winogrande|0|acc|74.98|±|1.22|

Average: 71.15%

TruthfulQA

|Task|Version|Metric|Value| |Stderr|
|---|---:|---|---:|---|---:|
|truthfulqa_mc|1|mc1|37.45|±|1.69|
| | |mc2|55.39|±|1.50|

Average: 55.39%

Bigbench

|Task|Version|Metric|Value| |Stderr|
|---|---:|---|---:|---|---:|
|bigbench_causal_judgement|0|multiple_choice_grade|57.37|±|3.60|
|bigbench_date_understanding|0|multiple_choice_grade|68.02|±|2.43|
|bigbench_disambiguation_qa|0|multiple_choice_grade|31.01|±|2.89|
|bigbench_geometric_shapes|0|multiple_choice_grade|20.89|±|2.15|
| | |exact_str_match|0.00|±|0.00|
|bigbench_logical_deduction_five_objects|0|multiple_choice_grade|28.40|±|2.02|
|bigbench_logical_deduction_seven_objects|0|multiple_choice_grade|20.71|±|1.53|
|bigbench_logical_deduction_three_objects|0|multiple_choice_grade|48.67|±|2.89|
|bigbench_movie_recommendation|0|multiple_choice_grade|31.60|±|2.08|
|bigbench_navigate|0|multiple_choice_grade|50.60|±|1.58|
|bigbench_reasoning_about_colored_objects|0|multiple_choice_grade|63.25|±|1.08|
|bigbench_ruin_names|0|multiple_choice_grade|34.38|±|2.25|
|bigbench_salient_translation_error_detection|0|multiple_choice_grade|21.84|±|1.31|
|bigbench_snarks|0|multiple_choice_grade|44.20|±|3.70|
|bigbench_sports_understanding|0|multiple_choice_grade|50.30|±|1.59|
|bigbench_temporal_sequences|0|multiple_choice_grade|26.30|±|1.39|
|bigbench_tracking_shuffled_objects_five_objects|0|multiple_choice_grade|21.36|±|1.16|
|bigbench_tracking_shuffled_objects_seven_objects|0|multiple_choice_grade|15.77|±|0.87|
|bigbench_tracking_shuffled_objects_three_objects|0|multiple_choice_grade|48.67|±|2.89|

Average: 37.96%

Average score: 50.01%

Elapsed time: 02:36:38
