Model Card: Uplimit Project 1, Part 1

Model Description: This model was created to test-run the model-publishing workflow. It has no real model-assessment value.
This is a Large Language Model (LLM) trained on the DIBT/10k_prompts_ranked dataset. It was evaluated using the EleutherAI LM Evaluation Harness.
HellaSwag: run with batch_size = auto:4; the harness auto-detected a largest batch size of 64. Evaluation configuration: hf (pretrained=EleutherAI/pythia-160m, revision=step100000, dtype=float), gen_kwargs: None, limit: None, num_fewshot: None, batch_size: auto:4 (64, 64, 64, 64, 64).
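As a reproducibility aid, here is a minimal sketch of the same run through the harness's Python API. The entry point and argument names assume lm-evaluation-harness v0.4+ (`simple_evaluate`); that the Python API accepts the `"auto:4"` batch-size string exactly as the CLI does is an assumption carried over from the logged configuration above.

```python
import lm_eval

# Sketch of the evaluation logged above (lm-evaluation-harness v0.4+ assumed).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=float",
    tasks=["hellaswag"],
    batch_size="auto:4",  # auto-detect the largest batch size, re-probing up to 4 times
)
print(results["results"]["hellaswag"])  # acc, acc_norm, and their stderrs
```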
| Tasks     | Version | Filter | n-shot | Metric   |   | Value  |   | Stderr |
|-----------|--------:|--------|-------:|----------|---|-------:|---|-------:|
| hellaswag |       1 | none   |      0 | acc      | ↑ | 0.2872 | ± | 0.0045 |
|           |         | none   |      0 | acc_norm | ↑ | 0.3082 | ± | 0.0046 |
Interpretation (courtesy of Perplexity.ai):
Accuracy Metrics: standard accuracy 0.2872 (28.72%); normalized accuracy 0.3082 (30.82%).
Context: The HellaSwag task is a challenging commonsense reasoning benchmark that tests a model's ability to complete sentences or scenarios in a sensible way. The task is considered difficult even for larger language models.
Baseline Performance: The model achieves 28.72% accuracy on the standard HellaSwag task, above the 25% random-guessing baseline for a 4-way multiple-choice task; with a reported stderr of 0.0045, that gap of roughly 3.7 points is about eight standard errors, so it is unlikely to be noise.
Normalized Performance: The normalized accuracy of 30.82% is slightly higher than the standard accuracy, suggesting that the model performs marginally better once length biases among the answer choices are accounted for.
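To make the distinction concrete: for each four-way item, standard accuracy picks the continuation with the highest total log-likelihood, while acc_norm divides each continuation's log-likelihood by its length before taking the argmax (the harness normalizes by byte length, to my understanding), removing the advantage short endings would otherwise have. A minimal sketch with made-up log-likelihood values:

```python
import numpy as np

def pick_choice(logliks, continuations, length_normalize=False):
    """Return the index of the chosen continuation for one multiple-choice item."""
    scores = np.asarray(logliks, dtype=float)
    if length_normalize:
        # acc_norm-style scoring: log-likelihood per byte of the continuation
        byte_lens = np.array([len(c.encode("utf-8")) for c in continuations])
        scores = scores / byte_lens
    return int(np.argmax(scores))

# Made-up example: ending 0 is short; ending 2 is a longer, more specific ending.
continuations = ["ran.", "sat down on the bench.",
                 "carefully tied both of her shoes.", "slept."]
logliks = [-9.0, -21.0, -26.0, -10.5]

print(pick_choice(logliks, continuations))                         # acc: picks 0 (short ending)
print(pick_choice(logliks, continuations, length_normalize=True))  # acc_norm: picks 2
```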
Model Size Consideration: Given that Pythia-160M is a relatively small language model (160 million parameters), these results are not unexpected.
Comparative Analysis: While not directly comparable without running the same benchmark on other models, this performance is likely lower than what larger models (e.g., GPT-3, PaLM) would achieve on the same task.
Learning Progress: As this is an intermediate checkpoint (step 100000), the model's performance could improve with further training; later checkpoints can be loaded and evaluated the same way, as sketched below.
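Checking this is straightforward, because EleutherAI publishes Pythia's intermediate checkpoints as Hugging Face repo revisions (step0 through step143000). A minimal loading sketch with transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pythia exposes intermediate checkpoints as repo revisions ("step0" ... "step143000").
# Swap the revision string to load a later checkpoint and rerun the same evaluation.
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-160m", revision="step100000"
)
tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/pythia-160m", revision="step100000"
)
```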