Commit dc0cb2a by munish0838 (parent 636cac7): Create README.md (README.md added)
---
license: mit
tags:
- nlp
- math
language:
- en
pipeline_tag: text-generation
base_model: microsoft/rho-math-1b-interpreter-v0.1
---

# QuantFactory/rho-math-1b-interpreter-v0.1-GGUF

This is a quantized version of [microsoft/rho-math-1b-interpreter-v0.1](https://huggingface.co/microsoft/rho-math-1b-interpreter-v0.1) created using llama.cpp.

# Model Description

<h1 align="center">
Rho-1: Not All Tokens Are What You Need
</h1>

<p align="center">
<a href="https://arxiv.org/abs/2404.07965"><b>[📜 Arxiv]</b></a> •
<a href="https://huggingface.co/papers/2404.07965"><b>[💬 HF Paper]</b></a> •
<a href="https://huggingface.co/microsoft/rho-math-1b-v0.1"><b>[🤗 Models]</b></a> •
<a href="https://github.com/microsoft/rho"><b>[🐱 GitHub]</b></a>
</p>

<p align="center">
<img src="https://github.com/microsoft/rho/blob/main/docs/static/images/acc_vs_tokens_1b_7b.png?raw=true" width="1000">
<br>
<em>Figure 1: Rho-1 is pre-trained with Selective Language Modeling (SLM). SLM improves average few-shot accuracy on GSM8k and MATH by over 16%, reaching baseline performance 5-10x faster.</em>
</p>

## 🔥 News

- [2024/04/12] 🔥🔥🔥 Rho-Math-v0.1 models released at 🤗 HuggingFace!
    - [Rho-Math-1B](https://huggingface.co/microsoft/rho-math-1b-v0.1) and [Rho-Math-7B](https://huggingface.co/microsoft/rho-math-7b-v0.1) achieve 15.6% and 31.0% few-shot accuracy on the MATH dataset, respectively, matching DeepSeekMath with only 3% of the pretraining tokens.
    - [Rho-Math-1B-Interpreter](https://huggingface.co/microsoft/rho-math-1b-interpreter-v0.1) is the first 1B LLM to achieve over 40% accuracy on MATH.
    - [Rho-Math-7B-Interpreter](https://huggingface.co/microsoft/rho-math-7b-interpreter-v0.1) achieves 52% on the MATH dataset, using only 69k samples for fine-tuning.
- [2024/04/11] Rho-1 paper and repo released.

## 💡 Introduction

Rho-1 base models employ Selective Language Modeling (SLM) for pretraining, which selectively trains on the clean, useful tokens that align with the desired distribution.

### Selective Language Modeling (SLM)

<p align="center">
<img src="https://github.com/microsoft/rho/blob/main/docs/static/images/example.png?raw=true" width="1000">
<br>
<em>Figure 2:
<b>Upper:</b> Even an extensively filtered pretraining corpus contains token-level noise.
<b>Left:</b> Previous Causal Language Modeling (CLM) trains on all tokens.
<b>Right:</b> Our proposed Selective Language Modeling (SLM) selectively applies loss to the useful, clean tokens.</em>
</p>
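Written out, the selection in Figure 2 can be expressed as a masked training objective. The notation below paraphrases the Rho-1 paper: the excess loss of a token is its current-model loss minus its reference-model loss, and $\mathcal{I}_k(x_i)$ indicates whether token $x_i$ falls in the top $k\%$ of tokens by excess loss:

$$
\mathcal{L}_{\text{SLM}}(\theta) = -\frac{1}{N \cdot k\%} \sum_{i=1}^{N} \mathcal{I}_k(x_i) \cdot \log p_\theta(x_i \mid x_{<i})
$$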

<p align="center">
<img src="https://github.com/microsoft/rho/blob/main/docs/static/images/pipeline.png?raw=true" width="1000">
<br>
<em>Figure 3: <b>The pipeline of Selective Language Modeling.</b>
SLM optimizes language model performance by concentrating on valuable, clean tokens during pre-training.
It involves three steps:
(Step 1) Train a reference model on high-quality data.
(Step 2) Score each token's loss in the corpus using the reference model.
(Step 3) Train the language model selectively on tokens that show higher excess loss relative to the reference loss.</em>
</p>
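Steps 2-3 above can be sketched in a few lines of plain Python. This is an illustrative toy, not the official implementation: the function name, the list-based per-token losses, and the default keep fraction are all assumptions made for the example.

```python
# Sketch of SLM token selection (Steps 2-3 above). Inputs are per-token
# cross-entropy losses from the current model and the reference model;
# only the top-k fraction of tokens by "excess loss" contribute to training.

def slm_selected_loss(model_losses, ref_losses, k=0.6):
    """Average model loss over the top-k fraction of tokens by excess loss."""
    excess = [m - r for m, r in zip(model_losses, ref_losses)]
    n_keep = max(1, int(len(excess) * k))
    # Indices of the tokens with the largest excess loss.
    keep = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)[:n_keep]
    return sum(model_losses[i] for i in keep) / n_keep

# A token where the reference model also struggles (index 0: high loss but
# small excess) is skipped; tokens the reference finds easy but the current
# model gets wrong (indices 2 and 3) are selected.
losses = [2.0, 0.5, 3.0, 1.0]   # current model, per token
ref    = [1.9, 0.1, 0.5, 0.2]   # reference model, per token
print(slm_selected_loss(losses, ref, k=0.5))  # → 2.0 (mean of 3.0 and 1.0)
```

Skipping tokens whose loss is high under *both* models is the key design choice: those are treated as inherently noisy rather than informative.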

<!-- results: -->

### Evaluation Results

Base models (few-shot CoT):

| **Model** | **Size** | **Data** | **Uniq. Token** | **Train Token** | **GSM8K** | **MATH** | **MMLU STEM** | **SAT** |
|:-----------------:|:--------:|:--------:|:---------------:|:---------------:|:---------:|:--------:|:-------------:|:--------:|
| 1-2B Base Models | | | | | | | | |
| Qwen1.5 | 1.8B | - | - | - | 36.1 | 6.8 | 31.3 | 40.6 |
| Gemma | 2.0B | - | - | - | 18.8 | 11.4 | **34.4** | 50.0 |
| DeepSeekMath | 1.3B | - | 120B | 150B | 23.8 | 13.6 | 33.1 | **56.3** |
| [Rho-Math-1B-v0.1](https://huggingface.co/microsoft/rho-math-1b-v0.1) | 1.1B | OWM | 14B | 30B | **36.2** | **15.6** | 23.3 | 28.1 |
| >= 7B Base Models | | | | | | | | |
| Mistral | 7B | - | - | - | 41.2 | 11.6 | 49.5 | 59.4 |
| Minerva | 540B | - | 39B | 26B | 58.8 | 33.6 | **63.9** | - |
| LLemma | 34B | PPile | 55B | 50B | 54.2 | 23.0 | 54.7 | 68.8 |
| InternLM2-Math | 20B | - | 31B | 125B | 65.4 | 30.0 | 53.1 | 71.9 |
| DeepSeekMath | 7B | - | 120B | 500B | 64.1 | **34.2** | 56.4 | **84.4** |
| [Rho-Math-7B-v0.1](https://huggingface.co/microsoft/rho-math-7b-v0.1) | 7B | OWM | 14B | 10.5B | **66.9** | 31.0 | 54.6 | **84.4** |

[Tool-integrated reasoning](https://github.com/microsoft/ToRA) (Code Interpreter):

| **Model** | **Size** | **SFT Data** | **GSM8k** | **MATH** | **SVAMP** | **ASDiv** | **MAWPS** | **TabMWP** | **GSM-Hard** | **AVG** |
|------------------------------|----------|--------------|-----------|----------|-----------|-----------|-----------|------------|--------------|----------|
| gpt4-early (pal) | - | - | 94.2 | 51.8 | 94.8 | 92.6 | 97.7 | 95.9 | 77.6 | 86.4 |
| gpt-4-turbo-2024-04-09 (cot) | - | - | - | 73.4 | - | - | - | - | - | - |
| Open-Source Small Models | | | | | | | | | | |
| MAmmoTH | 70B | MI-260k | 76.9 | 41.8 | 82.4 | - | - | - | - | - |
| ToRA | 7B | ToRA-69k | 68.8 | 40.1 | 68.2 | 73.9 | 88.8 | 42.4 | 54.6 | 62.4 |
| ToRA | 70B | ToRA-69k | 84.3 | 49.7 | **82.7** | 86.8 | 93.8 | 74.0 | **67.2** | **76.9** |
| DeepSeekMath | 7B | ToRA-69k | 79.8 | **52.0** | 80.1 | **87.1** | 93.8 | **85.8** | 63.1 | 77.4 |
| [Rho-Math-1B-Interpreter-v0.1](https://huggingface.co/microsoft/rho-math-1b-interpreter-v0.1) | 1B | ToRA-69k | 59.4 | 40.6 | 60.7 | 74.2 | 88.6 | 26.7 | 48.1 | 56.9 |
| [Rho-Math-7B-Interpreter-v0.1](https://huggingface.co/microsoft/rho-math-7b-interpreter-v0.1) | 7B | ToRA-69k | 81.3 | **51.8** | 80.8 | 85.5 | **94.5** | 70.1 | 63.1 | 75.3 |

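The interpreter models follow ToRA-style tool-integrated reasoning: generation alternates natural-language rationale with code blocks, each block is executed, and its output is appended to the context before generation continues. The sketch below illustrates that loop only; `fake_generate`, the `output` fence convention, and the `\boxed{}` stopping marker are stand-ins for the example, not the actual ToRA harness.

```python
# Minimal sketch of a tool-integrated reasoning loop (ToRA style).
# `generate` stands in for the LLM call; code blocks it emits are run
# and their stdout is fed back into the transcript.
import io
import contextlib

def run_code(code):
    """Execute a generated code block and capture its stdout."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()

def tool_integrated_solve(question, generate, max_rounds=4):
    transcript = question
    for _ in range(max_rounds):
        step = generate(transcript)
        transcript += step
        if "```python" in step:
            code = step.split("```python")[1].split("```")[0]
            transcript += f"\n```output\n{run_code(code)}\n```\n"
        if "boxed" in step:  # final-answer marker, stop generating
            break
    return transcript

# Stub model: first emits code, then reads the execution result and answers.
def fake_generate(ctx):
    if "```output" not in ctx:
        return "Let me compute.\n```python\nprint(16 * 3 + 2)\n```"
    answer = ctx.split("```output\n")[1].split("\n```")[0]
    return f"The answer is \\boxed{{{answer}}}."

print(tool_integrated_solve("What is 16*3+2?", fake_generate))
```

The feedback step is what lets a small model like Rho-Math-1B-Interpreter offload arithmetic to the interpreter instead of computing it in-weights.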
## 🚀 Quick Start

### Evaluation

```sh
git clone git@github.com:microsoft/rho.git
cd rho-1/math-evaluation-harness
```

Base model few-shot evaluation:

```sh
bash scripts/run_eval.sh cot microsoft/rho-math-7b-v0.1
```

SFT model (code-interpreter) evaluation:

```sh
bash scripts/run_eval.sh tora microsoft/rho-math-7b-interpreter-v0.1
```

Our reproduced outputs are provided in `rho-1/outputs.zip`.