Model Card: Uplimit Project 1, Part 1

Model Description: This model was created to test-run the model-publishing workflow. It has no real model-assessment value.
This is a Large Language Model (LLM) trained on the DIBT/10k_prompts_ranked dataset. It was evaluated using the EleutherAI LM Evaluation Harness.
HellaSwag: run with batch_size = auto:4; the harness auto-detected a largest batch size of 64. Evaluation configuration: hf (pretrained=EleutherAI/pythia-160m, revision=step100000, dtype=float), gen_kwargs: None, limit: None, num_fewshot: None, batch_size: auto:4 (64, 64, 64, 64, 64).
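As a reproducibility aid, here is a minimal sketch of the same run through the harness's Python API. The entry point and argument names assume lm-evaluation-harness v0.4+ (`simple_evaluate`); that the Python API accepts the `"auto:4"` batch-size string exactly as the CLI does is an assumption carried over from the logged configuration above.

```python
import lm_eval

# Sketch of the evaluation logged above (lm-evaluation-harness v0.4+ assumed).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=float",
    tasks=["hellaswag"],
    batch_size="auto:4",  # auto-detect the largest batch size, re-probing up to 4 times
)
print(results["results"]["hellaswag"])  # acc, acc_norm, and their stderrs
```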
| Tasks     | Version | Filter | n-shot | Metric   |   | Value  |   | Stderr |
|-----------|--------:|--------|-------:|----------|---|-------:|---|-------:|
| hellaswag |       1 | none   |      0 | acc      | ↑ | 0.2872 | ± | 0.0045 |
|           |         | none   |      0 | acc_norm | ↑ | 0.3082 | ± | 0.0046 |
Interpretation (courtesy of Perplexity.ai):
Accuracy Metrics: standard accuracy 0.2872 (28.72%); normalized accuracy 0.3082 (30.82%).
Context: The HellaSwag task is a challenging commonsense reasoning benchmark that tests a model's ability to complete sentences or scenarios in a sensible way. The task is considered difficult even for larger language models.
Baseline Performance: The model achieves 28.72% accuracy on the standard HellaSwag task, above the 25% random-guessing baseline for a 4-way multiple-choice task; with a reported stderr of 0.0045, that gap of roughly 3.7 points is about eight standard errors, so it is unlikely to be noise.
Normalized Performance: The normalized accuracy of 30.82% is slightly higher than the standard accuracy, suggesting that the model performs marginally better once length biases among the answer choices are accounted for.
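To make the distinction concrete: for each four-way item, standard accuracy picks the continuation with the highest total log-likelihood, while acc_norm divides each continuation's log-likelihood by its length before taking the argmax (the harness normalizes by byte length, to my understanding), removing the advantage short endings would otherwise have. A minimal sketch with made-up log-likelihood values:

```python
import numpy as np

def pick_choice(logliks, continuations, length_normalize=False):
    """Return the index of the chosen continuation for one multiple-choice item."""
    scores = np.asarray(logliks, dtype=float)
    if length_normalize:
        # acc_norm-style scoring: log-likelihood per byte of the continuation
        byte_lens = np.array([len(c.encode("utf-8")) for c in continuations])
        scores = scores / byte_lens
    return int(np.argmax(scores))

# Made-up example: ending 0 is short; ending 2 is a longer, more specific ending.
continuations = ["ran.", "sat down on the bench.",
                 "carefully tied both of her shoes.", "slept."]
logliks = [-9.0, -21.0, -26.0, -10.5]

print(pick_choice(logliks, continuations))                         # acc: picks 0 (short ending)
print(pick_choice(logliks, continuations, length_normalize=True))  # acc_norm: picks 2
```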
Model Size Consideration: Given that Pythia-160M is a relatively small language model (160 million parameters), these results are not unexpected.
Comparative Analysis: While not directly comparable without running the same benchmark on other models, this performance is likely lower than what larger models (e.g., GPT-3, PaLM) would achieve on the same task.
Learning Progress: As this is an intermediate checkpoint (step 100000), the model's performance could improve with further training; later checkpoints can be loaded and evaluated the same way, as sketched below.
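Checking this is straightforward, because EleutherAI publishes Pythia's intermediate checkpoints as Hugging Face repo revisions (step0 through step143000). A minimal loading sketch with transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pythia exposes intermediate checkpoints as repo revisions ("step0" ... "step143000").
# Swap the revision string to load a later checkpoint and rerun the same evaluation.
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-160m", revision="step100000"
)
tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/pythia-160m", revision="step100000"
)
```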