Yirany committed
Commit 4e95c51
1 Parent(s): 9c9ac63

Update README.md

Files changed (1): README.md (+33 -12)
README.md CHANGED
@@ -7,31 +7,42 @@ datasets:
 
 
 ## OmniLMM 12B
-**OmniLMM-12B** is the most capable version. The model is built on [EVA02-5B](https://github.com/baaivision/EVA/tree/master/EVA-CLIP) and [Zephyr-7B-β](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), connected by a perceiver resampler layer, and trained on multimodal data in a curriculum fashion. The model has three notable features:
+
+**OmniLMM-12B** is currently the most capable version of OmniLMM. The model is built on EVA02-5B and Zephyr-7B-β, connected by a perceiver resampler layer, and trained on multimodal data in a curriculum fashion. The model has three notable features:
 
 - 🔥 **Strong Performance.**
 
-  OmniLMM-12B achieves **leading performance** among models of comparable size, surpassing established LMMs on multiple benchmarks (including MME, MMBench, SEED-Bench, etc.). The model also **supports OCR** and exhibits **rich multimodal world knowledge**.
+  OmniLMM-12B achieves **leading performance** among models of comparable size, surpassing established LMMs on multiple benchmarks (including MME, MMBench, SEED-Bench, etc.). The model also exhibits rich multimodal world knowledge.
 
 - 🏆 **Trustworthy Behavior.**
 
-  LMMs are known to suffer from hallucination, often generating text that is not factually grounded in images (e.g., confidently describing objects that do not exist in the image). OmniLMM-12B is **the first state-of-the-art open-source LMM aligned via multimodal RLHF for trustworthy behavior** (using our recent [RLHF-V](https://rlhf-v.github.io/) technique) and is **ranked #1** among open-source models on [MMHal-Bench](https://huggingface.co/datasets/Shengcao1006/MMHal-Bench).
+  LMMs are known to suffer from hallucination, often generating text that is not factually grounded in images (e.g., confidently describing objects that do not exist in the image). OmniLMM-12B is **the first state-of-the-art open-source LMM aligned via multimodal RLHF for trustworthy behavior** (using the recent [RLHF-V](https://rlhf-v.github.io/) technique). It **ranks #1** among open-source models on [MMHal-Bench](https://huggingface.co/datasets/Shengcao1006/MMHal-Bench) and **outperforms GPT-4V** on [Object HalBench](https://arxiv.org/abs/2312.00849).
 
 - 🕹 **Real-time Multimodal Interaction.**
 
-  We combine OmniLMM-12B and GPT-3.5 into a **real-time multimodal interactive assistant**. The assistant accepts video streams from the camera and speech streams from the microphone, and emits speech output. While still preliminary, we find the model can **replicate some of the fun cases shown in the Gemini demo video, without any video editing**.
+  We combine OmniLMM-12B and GPT-3.5 (text-only) into a **real-time multimodal interactive assistant**. The assistant accepts video streams from the camera and speech streams from the microphone, and emits speech output. While still preliminary, we find the model can **replicate some of the fun cases shown in the Gemini demo video, without any video editing**.
+
 
 ## Evaluation
+
+
+<div align="center">
+<img src="https://cdn-uploads.huggingface.co/production/uploads/64abc4aa6cadc7aca585dddf/LuKikSY4CJiqtHocGP_xu.png" width="66%" />
+</div>
+<details>
+<summary>Click to view results on MME, MMBench, MMMU, MMHal-Bench, Object HalBench, SeedBench-I, LLaVA Bench W, and MathVista.</summary>
+
 <table>
 <thead>
 <tr>
 <th align="left">Model</th>
 <th>Size</th>
 <th>MME</th>
+<th nowrap="nowrap">MMB dev (en)</th>
 <th nowrap="nowrap">MMMU val</th>
 <th nowrap="nowrap">MMHal-Bench</th>
+<th nowrap="nowrap">Object HalBench</th>
 <th nowrap="nowrap">SeedBench-I</th>
-<th nowrap="nowrap">MMB dev (en)</th>
 <th>MathVista</th>
 <th nowrap="nowrap">LLaVA Bench W</th>
 </tr>
@@ -41,10 +52,11 @@ datasets:
 <td align="left">GPT-4V†</td>
 <td>-</td>
 <td>1409</td>
+<td>75.1 </td>
 <td>56.8</td>
 <td>3.53 / 70.8</td>
+<td>86.4 / 92.7</td>
 <td>71.6 </td>
-<td>75.1 </td>
 <td>47.8 </td>
 <td>93.1 </td>
 </tr>
@@ -52,10 +64,11 @@ datasets:
 <td nowrap="nowrap" align="left">Qwen-VL-Plus†</td>
 <td>-</td>
 <td>1681</td>
+<td>66.2 </td>
 <td>45.2</td>
 <td>- </td>
+<td>- </td>
 <td>65.7 </td>
-<td>66.2 </td>
 <td>36.0 </td>
 <td>73.7 </td>
 </tr>
@@ -63,10 +76,11 @@ datasets:
 <td align="left">Yi-VL 6B</td>
 <td align="right">6.7B </td>
 <td>- </td>
+<td>68.2 </td>
 <td>39.1 </td>
 <td>- </td>
+<td>- </td>
 <td>66.1 </td>
-<td>68.2 </td>
 <td>28.0 </td>
 <td>39.9 </td>
 </tr>
@@ -74,10 +88,11 @@ datasets:
 <td nowrap="nowrap" align="left">Qwen-VL-Chat</td>
 <td align="right">9.6B</td>
 <td>1488</td>
+<td>60.6 </td>
 <td>35.9</td>
 <td>2.93 / 59.4</td>
+<td>56.2 / 80.0</td>
 <td>64.8 </td>
-<td>60.6 </td>
 <td>33.8 </td>
 <td>67.7 </td>
 </tr>
@@ -85,10 +100,11 @@ datasets:
 <td align="left">CogVLM</td>
 <td align="right">17.4B</td>
 <td>1438</td>
+<td>63.7 </td>
 <td>32.1 </td>
 <td>2.68 / 52.1 </td>
+<td>73.6 / 87.4 </td>
 <td>68.8 </td>
-<td>63.7 </td>
 <td>34.7 </td>
 <td>73.9 </td>
 </tr>
@@ -96,10 +112,11 @@ datasets:
 <td align="left">LLaVA 1.5</td>
 <td align="right">13.6B </td>
 <td>1531 </td>
+<td>68.2 </td>
 <td>36.4 </td>
 <td>2.71 / 51.0 </td>
+<td>53.7 / 77.4 </td>
 <td>68.1 </td>
-<td>68.2 </td>
 <td>26.4 </td>
 <td>64.6 </td>
 </tr>
@@ -107,16 +124,20 @@ datasets:
 <td nowrap="nowrap" align="left"><b>OmniLMM-12B</b></td>
 <td align="right">11.6B </td>
 <td>1637 </td>
+<td>71.6 </td>
 <td>40.7 </td>
 <td>3.45 / 68.8 </td>
+<td>90.3 / 95.5 </td>
 <td>71.1 </td>
-<td>71.6 </td>
 <td>34.9 </td>
 <td>72.0 </td>
 </tr>
 </tbody>
 </table>
 <small>†: Proprietary models</small>
+<br>
+</details>
+
 
 ## Demo
 Click here to try the demo of [OmniLMM-12B](http://120.92.209.146:8081).
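A note on the perceiver resampler named in the model description above: it compresses the vision encoder's variable-length patch tokens into a fixed number of learned latent tokens via cross-attention, and those latents are what the LLM consumes. Below is a minimal sketch of the idea; the dimensions, layer structure, and names are illustrative assumptions, not OmniLMM-12B's actual implementation.

```python
# Minimal perceiver-resampler sketch. All sizes are illustrative assumptions;
# this is NOT the actual OmniLMM-12B code.
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    def __init__(self, vision_dim=1792, llm_dim=4096, num_latents=64, num_heads=8):
        super().__init__()
        # A fixed set of learned queries; their count bounds the number of
        # visual tokens handed to the LLM, however many patches come in.
        self.latents = nn.Parameter(torch.randn(num_latents, llm_dim) * 0.02)
        self.proj_in = nn.Linear(vision_dim, llm_dim)  # patch features -> LLM width
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(llm_dim)
        self.mlp = nn.Sequential(
            nn.Linear(llm_dim, 4 * llm_dim), nn.GELU(), nn.Linear(4 * llm_dim, llm_dim))

    def forward(self, patch_feats):            # (batch, num_patches, vision_dim)
        kv = self.proj_in(patch_feats)
        q = self.latents.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        x, _ = self.cross_attn(q, kv, kv)      # latents attend to all patches
        x = x + self.mlp(self.norm(x))         # residual MLP block
        return x                               # (batch, num_latents, llm_dim)

# Toy usage: one image, 256 patches compressed to 64 visual tokens.
tokens = PerceiverResampler()(torch.randn(1, 256, 1792))
print(tokens.shape)  # torch.Size([1, 64, 4096])
```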
 
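On the RLHF-V alignment cited in the Trustworthy Behavior bullet: RLHF-V optimizes the model directly on human preferences over corrected responses, building on a DPO-style objective. The sketch below shows the standard, simplified DPO loss only; RLHF-V's actual objective is a dense, segment-weighted variant described in its paper.

```python
# Schematic DPO-style preference loss. Simplified illustration only;
# RLHF-V uses a dense, segment-weighted variant of this objective.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Each argument is the summed log-probability of a response under the
    policy (logp_*) or the frozen reference model (ref_*)."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Toy check: a higher relative likelihood for the chosen response
# (vs. the reference model) yields a lower loss.
print(dpo_loss(torch.tensor([-5.0]), torch.tensor([-9.0]),
               torch.tensor([-6.0]), torch.tensor([-6.0])))
```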
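Finally, the real-time assistant from the last feature bullet is a pipeline: speech recognition turns the microphone stream into a question, OmniLMM-12B grounds the latest camera frame, text-only GPT-3.5 composes the reply, and TTS speaks it. A hedged sketch of that control loop follows; every component is a stand-in stub, since the README does not publish the actual wiring.

```python
# Hedged sketch of the camera + microphone assistant loop. All components
# below are stubs standing in for real ASR / LMM / LLM / TTS services.
class Stub:
    def __init__(self, name):
        self.name = name
    def __call__(self, *args):
        return f"<{self.name} output for {len(args)} input(s)>"

asr, omnilmm, gpt35, tts = Stub("asr"), Stub("omnilmm-12b"), Stub("gpt-3.5"), Stub("tts")

def assistant_turn(audio_chunk, camera_frame, play_audio):
    question = asr(audio_chunk)                   # microphone speech -> text
    grounding = omnilmm(camera_frame, question)   # OmniLMM describes the scene
    reply = gpt35(question, grounding)            # GPT-3.5 composes the answer
    play_audio(tts(reply))                        # text -> speech out

# Toy run with dummy I/O; a real system would stream frames and audio.
assistant_turn(b"\x00", None, print)
```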