Model Card for EH-sentiment-finetuned-Llama-3.2-1B-Instruct

This is a test project: fine-tuning Llama-3.2-1B-Instruct for sentiment classification using ORPO, on a subset of the Amazon reviews dataset mteb/amazon_polarity.

The fine-tuned model achieves a moderate +10% improvement on sentiment classification (as measured by SST2, which asks the model to classify sentences in a single word, either 'positive' or 'negative'), without impacting general performance (as measured by HellaSwag, which asks the model to complete a sentence with a sensible response chosen from a list of choices).

| Metric Category | Metric | Base Model | Finetuned Model | Change |
|---|---|---|---|---|
| Sentiment | SST2/acc | 0.68 | 0.75 | +10% |
| General Completions | hellaswag/acc | 0.447 | 0.459 | +3% |
| General Completions | hellaswag/acc_norm | 0.550 | 0.560 | +2% |
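
The card does not state which harness produced these numbers, but the SST prompt shown under Prompt Formats matches the sst2 task in EleutherAI's lm-evaluation-harness. Assuming that harness was used (an assumption, not confirmed above), a run like this sketch would produce comparable metrics:

```python
# Hedged sketch: assumes EleutherAI's lm-evaluation-harness (pip install lm-eval)
# was the source of the SST2 and HellaSwag numbers above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=erichennings/EH-sentiment-finetuned-Llama-3.2-1B-Instruct",
    tasks=["sst2", "hellaswag"],
)
for task, metrics in results["results"].items():
    print(task, metrics)
```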

The training dataset was the first 10k samples from mteb/amazon_polarity, and the model was trained for 5 epochs. The dataset was nearly balanced across positive and negative sentiment - ~51% of examples were negative.
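
For concreteness, here is a minimal sketch of that setup using TRL's ORPOTrainer. This is not the author's actual training script: the dataset column names, the prompt construction, and the pairing of correct/incorrect labels as chosen/rejected are all assumptions.

```python
# Minimal ORPO fine-tuning sketch (assumptions noted inline).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

BASE = "meta-llama/Llama-3.2-1B-Instruct"

# SST-like training prompt (see Prompt Formats, below).
PROMPT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "###Instruction:\nDetermine the sentiment of the input sentence. "
    "Please respond as positive or negative.\n###Input:\n{text}\n"
)

def to_preference(example):
    # ORPO trains on preference pairs: the correct label as "chosen",
    # the incorrect label as "rejected". (Assumed pairing; the column
    # names "text"/"label" are also assumptions about mteb/amazon_polarity.)
    chosen = "positive" if example["label"] == 1 else "negative"
    rejected = "negative" if chosen == "positive" else "positive"
    return {
        "prompt": PROMPT.format(text=example["text"]),
        "chosen": chosen,
        "rejected": rejected,
    }

train = (
    load_dataset("mteb/amazon_polarity", split="train")
    .select(range(10_000))  # first 10k samples, as described above
    .map(to_preference)
)

trainer = ORPOTrainer(
    model=AutoModelForCausalLM.from_pretrained(BASE),
    args=ORPOConfig(output_dir="EH-sentiment-orpo", num_train_epochs=5),
    train_dataset=train,
    processing_class=AutoTokenizer.from_pretrained(BASE),  # `tokenizer=` in trl < 0.12
)
trainer.train()
```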

The fine-tuning training examples used an SST-like prompt format (see Prompt Formats, below). An attempt was also made to train using exactly the SST eval format. Surprisingly, using the SST eval format caused SST accuracy to drop (0.54 for 10k samples and 1 epoch, -20% relative to the base model). This was unexpected and bears further investigation.

The model was much worse at correctly identifying positive sentiment (57% accuracy) than negative sentiment (93% accuracy) - see Confusion Matrix, below. The performance on negative sentiment is good: state of the art for SST2 overall is 97% (achieved by T5-11B).

Since the training dataset was balanced across positive and negative examples, this mismatch seems likely to have been inherited from the base model, although this was not confirmed. Next steps for improvement should be to verify that the behavior is inherited and, if so, to train with a larger set of positive statements.

Confusion Matrix

(Figure: confusion matrix on the SST2 eval - the fine-tuned model classifies negative examples with 93% accuracy and positive examples with 57%.)
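
A minimal sketch of how the per-class accuracies quoted above could be computed with scikit-learn; y_true and y_pred are placeholders for the SST2 gold labels and the model's predictions:

```python
from sklearn.metrics import confusion_matrix

# Placeholder labels/predictions; in practice these come from running the
# model over the SST2 eval set.
y_true = ["negative", "negative", "positive", "positive", "positive"]
y_pred = ["negative", "negative", "positive", "negative", "negative"]

cm = confusion_matrix(y_true, y_pred, labels=["negative", "positive"])
print(cm)  # rows = true label, columns = predicted label

# Per-class accuracy (recall): diagonal over row sums.
per_class = cm.diagonal() / cm.sum(axis=1)
print(dict(zip(["negative", "positive"], per_class)))
```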

Prompt Formats

SST Eval: The SST evaluation uses prompts like this:

```
A complete waste of time. Typographical errors, poor grammar, and a totally pathetic plot add up to absolutely nothing. I'm embarrassed for this author and very disappointed I actually paid for this book.

Question: Is this sentence positive or negative?
Answer:
```

SST-like: Training examples were formulated using an SST-like prompt:

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

###Instruction:
Determine the sentiment of the input sentence. Please respond as positive or negative.
###Input:
The best soundtrack ever to anything.
```
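
For reference, the two formats as Python helpers (hypothetical function names; the templates follow the examples above verbatim):

```python
def sst_eval_prompt(sentence: str) -> str:
    # SST eval format.
    return f"{sentence}\n\nQuestion: Is this sentence positive or negative?\nAnswer:"

def sst_like_prompt(sentence: str) -> str:
    # SST-like training format.
    return (
        "Below is an instruction that describes a task. Write a response "
        "that appropriately completes the request.\n\n"
        "###Instruction:\n"
        "Determine the sentiment of the input sentence. "
        "Please respond as positive or negative.\n"
        "###Input:\n"
        f"{sentence}\n"
    )
```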

Model Details

Model Description

Fine-tuned model for sentiment classification.

  • Developed by: Eric Hennings
  • Finetuned from model: meta-llama/Llama-3.2-1B-Instruct
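
A minimal usage sketch with transformers, assuming the model is queried with the SST-like prompt it was trained on (decoding settings are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "erichennings/EH-sentiment-finetuned-Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.float32)

prompt = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "###Instruction:\nDetermine the sentiment of the input sentence. "
    "Please respond as positive or negative.\n"
    "###Input:\nThe best soundtrack ever to anything.\n"
)
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=3, do_sample=False)
print(tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```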
