linqq9 committed
Commit e4897ad
1 Parent(s): 948c604

Update README.md

Files changed (1):
  1. README.md +11 -63
README.md CHANGED
@@ -19,70 +19,18 @@ Take Hammer2.0-7b as an example, it is a fine-tuned model based on [Qwen2.5-Code
  Thanks so much for your attention; a report with all the technical details behind our models will be published soon.
 
  ## Evaluation
- First, we evaluate the Hammer series on the Berkeley Function-Calling Leaderboard (BFCL):
-
- | | | | Non-live (AST) | | | | | Non-live (Exec) | | | | | Live (AST) | | | | | Multi Turn | | | | | | Hallucination Measurement | | | |
- |:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
- | Rank | Overall Acc | Model | AST Summary | Simple | Multiple | Parallel | Multiple Parallel | Exec Summary | Simple | Multiple | Parallel | Multiple Parallel | Overall Acc | Simple | Multiple | Parallel | Multiple Parallel | Overall Acc | Base | Miss Func | Miss Param | Long Context | Composite | Relevance | Irrelevance | Organization | License |
- | 1 | 59.49 | GPT-4-turbo-2024-04-09 (FC) | 82.65 | 60.58 | 91 | 90 | 89 | 83.8 | 88.71 | 88 | 86 | 72.5 | 73.39 | 67.83 | 74.45 | 75 | 62.5 | 21.62 | 33.5 | 3.5 | 20 | 29.5 | N/A | 70.73 | 79.79 | OpenAI | Proprietary |
- | 2 | 59.29 | GPT-4o-2024-08-06 (FC) | 85.52 | 73.58 | 92.5 | 91.5 | 84.5 | 82.96 | 85.36 | 90 | 84 | 72.5 | 71.79 | 67.83 | 69.43 | 75 | 66.67 | 21.25 | 31 | 5 | 19.5 | 29.5 | N/A | 63.41 | 82.91 | OpenAI | Proprietary |
- | 3 | 59.13 | xLAM-8x22b-r (FC) | 89.75 | 77 | 95.5 | 92.5 | 94 | 89.32 | 98.29 | 94 | 90 | 75 | 72.81 | 70.93 | 77.72 | 75 | 75 | 15.62 | 21.5 | 3.5 | 17 | 20.5 | N/A | 97.56 | 75.23 | Salesforce | cc-by-nc-4.0 |
- | 4 | 58.45 | GPT-4o-mini-2024-07-18 (FC) | 82.83 | 67.83 | 90.5 | 89.5 | 83.5 | 81.8 | 83.21 | 92 | 82 | 70 | 67.53 | 67.83 | 69.82 | 81.25 | 70.83 | 25.75 | 36.5 | 9.5 | 24.5 | 32.5 | N/A | 82.93 | 71.83 | OpenAI | Proprietary |
- | 5 | 57.94 | xLAM-8x7b-r (FC) | 88.44 | 77.25 | 95.5 | 92 | 89 | 85.89 | 91.57 | 94 | 88 | 70 | 71.97 | 68.99 | 76.18 | 50 | 75 | 15.75 | 18.5 | 8 | 15.5 | 21 | N/A | 92.68 | 72.35 | Salesforce | cc-by-nc-4.0 |
- | 6 | 57.21 | GPT-4o-mini-2024-07-18 (Prompt) | 86.54 | 79.67 | 89.5 | 89 | 88 | 87.95 | 98.29 | 94 | 82 | 77.5 | 72.77 | 72.09 | 73.77 | 81.25 | 70.83 | 11.62 | 15 | 1.5 | 13 | 17 | N/A | 80.49 | 79.2 | OpenAI | Proprietary |
- | | 56.96 | MadeAgents/Hammer2.0-7b (FC) | 90.33 | 79.83 | 95 | 94 | 92.5 | 82.2 | 83.29 | 92 | 86 | 67.5 | 68.99 | 67.83 | 76.28 | 75 | 70.83 | 16.5 | 21.5 | 7.5 | 19 | 18 | N/A | 92.68 | 68.88 | MadeAgents | cc-by-nc-4.0 |
- | 7 | 55.82 | mistral-large-2407 (FC) | 84.12 | 57.5 | 94 | 93 | 92 | 83.09 | 76.86 | 92 | 86 | 77.5 | 67.17 | 79.07 | 78.88 | 87.5 | 75 | 20.5 | 29 | 13 | 19.5 | 20.5 | N/A | 78.05 | 48.93 | Mistral AI | Proprietary |
- | 8 | 55.67 | GPT-4-turbo-2024-04-09 (Prompt) | 91.31 | 82.25 | 94.5 | 95 | 93.5 | 88.12 | 99 | 96 | 80 | 77.5 | 67.97 | 78.68 | 83.12 | 81.25 | 75 | 10.62 | 12.5 | 5.5 | 11 | 13.5 | N/A | 82.93 | 61.82 | OpenAI | Proprietary |
- | 9 | 54.83 | Claude-3.5-Sonnet-20240620 (FC) | 70.35 | 75.42 | 93.5 | 62 | 50.5 | 66.34 | 95.36 | 86 | 44 | 40 | 71.39 | 72.48 | 70.68 | 68.75 | 75 | 23.5 | 30.5 | 8 | 27 | 28.5 | N/A | 63.41 | 75.91 | Anthropic | Proprietary |
- | 10 | 53.66 | GPT-4o-2024-08-06 (Prompt) | 80.9 | 64.08 | 86.5 | 88 | 85 | 77.89 | 70.57 | 88 | 78 | 75 | 73.88 | 67.44 | 67.21 | 56.25 | 58.33 | 6.12 | 9 | 1 | 7.5 | 7 | N/A | 53.66 | 89.56 | OpenAI | Proprietary |
- | 11 | 53.43 | o1-mini-2024-09-12 (Prompt) | 75.48 | 68.92 | 89 | 73.5 | 70.5 | 76.86 | 78.93 | 88 | 78 | 62.5 | 71.17 | 62.79 | 65.09 | 68.75 | 58.33 | 11 | 16 | 2 | 12.5 | 13.5 | N/A | 46.34 | 88.07 | OpenAI | Proprietary |
- | 12 | 53.01 | Gemini-1.5-Flash-Preview-0514 (FC) | 77.1 | 65.42 | 94.5 | 71.5 | 77 | 71.23 | 57.93 | 84 | 78 | 65 | 71.17 | 62.79 | 72.61 | 56.25 | 54.17 | 13.12 | 17.5 | 4 | 15.5 | 15.5 | N/A | 60.98 | 76.15 | Google | Proprietary |
- | 13 | 52.53 | Gemini-1.5-Pro-Preview-0514 (FC) | 75.54 | 50.17 | 89.5 | 83.5 | 79 | 77.46 | 71.86 | 86 | 82 | 70 | 69.26 | 60.08 | 66.35 | 75 | 54.17 | 10.87 | 15.5 | 1.5 | 11 | 15.5 | N/A | 60.98 | 80.56 | Google | Proprietary |
- | | 51.94 | MadeAgents/Hammer2.0-1.5b (FC) | 84.31 | 75.25 | 92.5 | 87.5 | 82 | 81.8 | 83.71 | 90 | 86 | 67.5 | 63.17 | 64.73 | 67.31 | 50 | 66.67 | 11.38 | 14 | 7 | 12 | 12.5 | N/A | 92.68 | 61.83 | MadeAgents | cc-by-nc-4.0 |
- | 14 | 51.93 | GPT-3.5-Turbo-0125 (FC) | 84.52 | 74.08 | 93 | 87.5 | 83.5 | 81.66 | 95.14 | 88 | 86 | 57.5 | 59 | 65.5 | 74.16 | 56.25 | 54.17 | 19.12 | 30 | 7.5 | 23 | 16 | N/A | 97.56 | 35.83 | OpenAI | Proprietary |
- | 15 | 51.78 | FireFunction-v2 (FC) | 85.71 | 78.83 | 92 | 91 | 81 | 84.23 | 94.43 | 88 | 82 | 72.5 | 61.71 | 69.38 | 70.97 | 56.25 | 54.17 | 11.62 | 21.5 | 1.5 | 17.5 | 6 | N/A | 87.8 | 52.94 | Fireworks | Apache 2.0 |
- | 16 | 51.78 | Open-Mistral-Nemo-2407 (FC) | 80.98 | 60.92 | 92 | 85.5 | 85.5 | 81.46 | 91.36 | 86 | 86 | 62.5 | 61.44 | 68.22 | 67.98 | 75 | 62.5 | 14.25 | 21 | 10 | 13.5 | 12.5 | N/A | 65.85 | 59.14 | Mistral AI | Proprietary |
- | 17 | 51.45 | xLAM-7b-fc-r (FC) | 86.83 | 77.33 | 92.5 | 91.5 | 86 | 85.02 | 91.57 | 88 | 88 | 72.5 | 68.81 | 63.57 | 63.36 | 56.25 | 50 | 0 | 0 | 0 | 0 | 0 | N/A | 80.49 | 79.76 | Salesforce | cc-by-nc-4.0 |
- | 18 | 51.01 | Gorilla-OpenFunctions-v2 (FC) | 87.29 | 77.67 | 95 | 89 | 87.5 | 84.96 | 95.86 | 96 | 78 | 70 | 68.59 | 63.95 | 63.93 | 62.5 | 45.83 | 0 | 0 | 0 | 0 | 0 | N/A | 85.37 | 73.13 | Gorilla LLM | Apache 2.0 |
- | | 49.88 | MadeAgents/Hammer2.0-3b (FC) | 86.77 | 77.08 | 92.5 | 89.5 | 88 | 80.25 | 81.5 | 86 | 86 | 67.5 | 66.06 | 63.95 | 72.81 | 56.25 | 66.67 | 0.5 | 1 | 0 | 0.5 | 0.5 | N/A | 92.68 | 68.59 | MadeAgents | cc-by-nc-4.0 |
- | 19 | 49.63 | Claude-3-Opus-20240229 (FC tools-2024-04-04) | 58.4 | 74.08 | 89.5 | 35 | 35 | 63.16 | 84.64 | 86 | 52 | 30 | 70.5 | 64.73 | 70.4 | 43.75 | 20.83 | 15.62 | 22 | 4 | 14.5 | 22 | N/A | 73.17 | 76.4 | Anthropic | Proprietary |
- | 20 | 49.55 | Meta-Llama-3-70B-Instruct (Prompt) | 87.21 | 75.83 | 94.5 | 91.5 | 87 | 87.41 | 94.14 | 94 | 84 | 77.5 | 63.39 | 69.77 | 78.01 | 75 | 66.67 | 1.12 | 1.5 | 1.5 | 1 | 0.5 | N/A | 92.68 | 50.63 | Meta | Meta Llama 3 Community |
- | 21 | 48.14 | Command-R-Plus (Prompt) (Original) | 75.54 | 71.17 | 85 | 80 | 66 | 77.57 | 91.29 | 86 | 78 | 55 | 67.88 | 65.12 | 71.26 | 75 | 58.33 | 0.25 | 0.5 | 0 | 0 | 0.5 | N/A | 75.61 | 69.31 | Cohere For AI | cc-by-nc-4.0 |
- | 22 | 47.66 | Granite-20b-FunctionCalling (FC) | 82.67 | 73.17 | 92 | 84 | 81.5 | 82.96 | 85.36 | 90 | 84 | 72.5 | 55.89 | 57.36 | 54.1 | 37.5 | 54.17 | 3.63 | 4.5 | 1.5 | 3.5 | 5 | N/A | 95.12 | 72.43 | IBM | Apache-2.0 |
- | 23 | 45.88 | Hermes-2-Pro-Llama-3-70B (FC) | 81.73 | 65.92 | 80.5 | 90.5 | 90 | 81.29 | 80.64 | 88 | 84 | 72.5 | 58.6 | 66.67 | 62.49 | 50 | 66.67 | 0.25 | 0.5 | 0 | 0 | 0.5 | N/A | 80.49 | 53.8 | NousResearch | apache-2.0 |
- | 24 | 45.4 | xLAM-1b-fc-r (FC) | 79.17 | 73.17 | 89.5 | 77.5 | 76.5 | 80.5 | 78 | 88 | 86 | 70 | 57.57 | 56.59 | 56.12 | 50 | 58.33 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | N/A | 95.12 | 61.26 | Salesforce | cc-by-nc-4.0 |
- | 25 | 45.22 | Command-R-Plus (FC) (Original) | 77.65 | 69.58 | 88 | 82.5 | 70.5 | 77.41 | 89.14 | 86 | 82 | 52.5 | 54.24 | 58.91 | 56.89 | 50 | 54.17 | 6.12 | 9.5 | 0 | 6.5 | 8.5 | N/A | 92.68 | 52.75 | Cohere For AI | cc-by-nc-4.0 |
- | 26 | 44.28 | Hermes-2-Pro-Llama-3-8B (FC) | 77.17 | 64.17 | 91 | 79.5 | 74 | 74.05 | 68.71 | 90 | 80 | 57.5 | 57.8 | 60.47 | 58.92 | 43.75 | 41.67 | 1.88 | 2.5 | 0.5 | 2.5 | 2 | N/A | 53.66 | 55.16 | NousResearch | apache-2.0 |
- | 27 | 44.23 | Hermes-2-Pro-Mistral-7B (FC) | 73.17 | 62.67 | 85.5 | 77 | 67.5 | 74.25 | 60.5 | 90 | 84 | 62.5 | 54.11 | 59.3 | 57.47 | 43.75 | 33.33 | 9.88 | 12 | 6.5 | 10 | 11 | N/A | 75.61 | 38.55 | NousResearch | apache-2.0 |
- | 28 | 43.9 | Hermes-2-Theta-Llama-3-8B (FC) | 73.56 | 61.25 | 82.5 | 75.5 | 75 | 72.54 | 69.14 | 88 | 78 | 55 | 59.57 | 55.81 | 53.13 | 43.75 | 41.67 | 1 | 1.5 | 0 | 1 | 1.5 | N/A | 51.22 | 62.66 | NousResearch | apache-2.0 |
- | 29 | 43 | Open-Mixtral-8x22b (FC) | 56.12 | 50.5 | 95 | 8.5 | 70.5 | 59.7 | 77.79 | 92 | 24 | 45 | 65.3 | 68.99 | 70.49 | 12.5 | 54.17 | 8.88 | 12.5 | 6.5 | 8 | 8.5 | N/A | 85.37 | 44.2 | Mistral AI | Proprietary |
- | | 39.51 | MadeAgents/Hammer2.0-0.5b (FC) | 67 | 62 | 80 | 68 | 58 | 65.73 | 48.43 | 82 | 80 | 52.5 | 51.62 | 47.67 | 42.14 | 50 | 37.5 | 0 | 0 | 0 | 0 | 0 | N/A | 87.8 | 67 | MadeAgents | cc-by-nc-4.0 |
- | 30 | 38.39 | Claude-3-Haiku-20240307 (Prompt) | 62.52 | 77.58 | 93 | 47.5 | 32 | 60.73 | 89.43 | 94 | 32 | 27.5 | 58.06 | 71.71 | 75.99 | 56.25 | 58.33 | 1.62 | 2.5 | 0.5 | 1 | 2.5 | N/A | 85.37 | 18.9 | Anthropic | Proprietary |
- | 31 | 37.77 | Claude-3-Haiku-20240307 (FC tools-2024-04-04) | 42.42 | 74.17 | 93 | 2 | 0.5 | 47.16 | 90.64 | 92 | 6 | 0 | 51.98 | 71.32 | 64.9 | 0 | 4.17 | 18.5 | 25 | 6.5 | 24 | 18.5 | N/A | 97.56 | 29.08 | Anthropic | Proprietary |
- | 32 | 16.66 | Hermes-2-Theta-Llama-3-70B (FC) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 38.87 | | | | | | | | | | | | | | |
-
- In addition, we evaluated our Hammer2.0 series (0.5b, 1.5b, 3b, 7b) on other academic benchmarks to further show our model's generalization ability:
-
- | Model | Size | Func-Name+Args Det. (F1 Func-Name \| F1 Args) | | | | | | | | | | F1 Average | |
- |:---------------------------:|:----:|:---------------------------------------------:|:-----:|:------------:|:-----:|:-----------:|:-----:|:---------------------:|:-----:|:-----------:|:-----:|:----------:|:-----:|
- | | | API-Bank L-1 | | API-Bank L-2 | | Tool-Alpaca | | SealTool(Single-Tool) | | Nexus Raven | | Func Name | Args |
- | GPT-4o-mini (Prompt) | -- | 95.1% | 89.3% | 84.3% | 67.5% | 64.3% | 54.7% | 87.9% | 86.0% | 91.7% | 84.6% | 84.7% | 76.4% |
- | qwen2-7b-instruct | 7B | 81.5% | 60.6% | 95.7% | 49.5% | 71.6% | 48.1% | 93.9% | 77.5% | 87.1% | 63.5% | 85.9% | 59.8% |
- | qwen1.5-4b-Chat | 4B | 55.3% | 59.8% | 46.7% | 38.5% | 35.4% | 17.0% | 48.4% | 62.3% | 29.0% | 33.7% | 43.0% | 42.2% |
- | qwen2-1.5b-instruct | 1.5B | 74.6% | 63.6% | 57.7% | 33.6% | 65.8% | 45.2% | 82.1% | 75.5% | 70.6% | 45.5% | 70.2% | 52.7% |
- | Gorilla-openfunctions-v2 | 7B | 69.2% | 70.3% | 48.8% | 54.7% | 72.9% | 51.3% | 93.2% | 91.1% | 72.8% | 68.4% | 71.4% | 67.2% |
- | GRANITE-20B-FUNCTIONCALLING | 20B | 90.4% | 77.8% | 78.9% | 59.2% | 77.3% | 58.0% | 94.9% | 92.7% | 94.5% | 75.1% | 87.2% | 72.6% |
- | xlam-7b-fc-r | 7B | 90.0% | 80.7% | 72.5% | 64.2% | 67.3% | 59.0% | 79.0% | 76.9% | 54.1% | 57.5% | 72.6% | 67.7% |
- | xlam-1b-fc-r | 1.3B | 94.9% | 83.7% | 91.8% | 64.3% | 64.9% | 50.6% | 90.7% | 80.4% | 64.4% | 54.8% | 81.3% | 66.8% |
- | Hammer-7b | 7B | 93.5% | 85.8% | 82.9% | 66.4% | 82.3% | 59.9% | 97.4% | 91.7% | 92.5% | 77.4% | 89.7% | 76.2% |
- | Hammer-4b | 4B | 91.6% | 81.5% | 77.6% | 61.0% | 85.1% | 57.0% | 96.4% | 92.4% | 81.7% | 64.9% | 86.5% | 71.4% |
- | Hammer-1.5b | 1.5B | 82.1% | 72.3% | 79.8% | 59.7% | 80.9% | 53.5% | 95.6% | 88.6% | 79.9% | 56.9% | 83.7% | 66.2% |
- | Hammer2.0-0.5B | 0.5B | 81.2% | 67.8% | 62.9% | 52.0% | 79.1% | 50.9% | 94.9% | 83.8% | 74.7% | 49.0% | 78.5% | 60.7% |
- | Hammer2.0-1.5B | 1.5B | 90.2% | 80.4% | 82.9% | 63.8% | 86.2% | 59.5% | 97.5% | 92.5% | 86.4% | 65.5% | 88.6% | 72.4% |
- | Hammer2.0-3B | 3B | 93.6% | 84.3% | 83.7% | 59.0% | 83.1% | 58.8% | 95.3% | 91.2% | 92.5% | 70.5% | 89.6% | 72.8% |
- | Hammer2.0-7B | 7B | 91.0% | 82.1% | 82.5% | 65.1% | 85.2% | 59.6% | 96.8% | 92.7% | 93.0% | 80.5% | 89.7% | 76.0% |
+ The evaluation results of the Hammer 2.0 series on the Berkeley Function-Calling Leaderboard (BFCL) are presented in the table below:
+ <div style="text-align: center;">
+ <img src="v2_figures/bfcl.PNG" alt="overview" width="1000" style="margin: auto;">
+ </div>
+
+ In addition, we evaluated Hammer2.0 on other academic benchmarks to further demonstrate our model's generalization ability:
+ <div style="text-align: center;">
+ <img src="v2_figures/others.PNG" alt="overview" width="1000" style="margin: auto;">
+ </div>
+
+ In comparison, Hammer 2.0 outperforms models of similar size and even surpasses many larger models overall.
  ## Requirements
  The code of Hammer2.0-1.5b is supported in the latest Hugging Face transformers, and we advise you to install `transformers>=4.37.0`.
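
As a quick reference, here is a minimal usage sketch. It is not taken from this commit: the Hub ID `MadeAgents/Hammer2.0-1.5b` is assumed from the leaderboard entries above, and only standard `transformers` APIs are used:

```python
# Minimal usage sketch, assuming transformers>=4.37.0 and the Hub ID below.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MadeAgents/Hammer2.0-1.5b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" additionally requires the `accelerate` package.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Prompt through the chat template, as is typical for function-calling models.
messages = [{"role": "user", "content": "What's the weather like in Boston?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```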