Model Name: Qwen2 orca_mini_v7_72b
Qwen2 orca_mini_v7_72b is trained on various SFT datasets.
Passionate about Generative AI? I help companies to privately train and deploy custom LLM/MLLM affordably. For startups, I can even assist with securing GPU grants to get you started. Let's chat! https://www.linkedin.com/in/pankajam Looking forward to connecting!
NOTICE
By providing proper credit and attribution, you are granted permission to use this model as a foundational base for further Full fine tuning, DPO, PPO or ORPO tuning and any kind of Merges. I actively encourage users to customize and enhance the model according to their specific needs, as this version is designed to be a comprehensive general model. Dive in and innovate!
Example Usage
Here is the ChatML prompt format:
<|im_start|>system
You are Orca Mini, a helpful AI assistant.<|im_end|>
<|im_start|>user
Hello Orca Mini, what can you do for me?<|im_end|>
<|im_start|>assistant
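As a rough illustration of how the format above is assembled, the following minimal sketch renders a list of messages into a ChatML string. The helper name `build_chatml_prompt` is hypothetical and not part of any library; in practice, `tokenizer.apply_chat_template` is the safer route because it uses the template shipped with the tokenizer.

```python
def build_chatml_prompt(messages):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string.

    Hypothetical helper for illustration only.
    """
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>ROLE ... <|im_end|>.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    # Leave the assistant turn open so the model continues from here.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are Orca Mini, a helpful AI assistant."},
    {"role": "user", "content": "Hello Orca Mini, what can you do for me?"},
]
prompt = build_chatml_prompt(messages)
print(prompt)
```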
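Below is a code example showing how to use this model:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_slug = "pankajmathur/orca_mini_v7_72b"
# AutoModelForCausalLM loads the language-modeling head needed for generate()
model = AutoModelForCausalLM.from_pretrained(model_slug, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_slug)
messages = [
    {"role": "system", "content": "You are Orca Mini, a helpful AI assistant."},
    {"role": "user", "content": "Hello Orca Mini, what can you do for me?"}
]
# add_generation_prompt opens the assistant turn; return_dict yields input_ids and attention_mask
gen_input = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
)
output_ids = model.generate(**gen_input, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt
new_tokens = output_ids[0][gen_input["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))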
Quants
GGUF: Coming Soon
AWQ: Coming Soon
Processing Long Texts (Based upon Qwen2-7B-Instruct suggestions at https://huggingface.co/Qwen/Qwen2-7B-Instruct)
To handle extensive inputs exceeding 32,768 tokens, we utilize YARN, a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.
For deployment, we recommend using vLLM. You can enable the long-context capabilities by following these steps:
- Install vLLM: You can install vLLM by running the following command.
pip install "vllm>=0.4.3"
Or you can install vLLM from source.
- Configure Model Settings: After downloading the model weights, modify the config.json file by adding the snippet below:
{
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  // ...
  "vocab_size": 152064,
  // add the following entry
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
This snippet enables YARN to support longer contexts. The // lines are annotations only and must not appear in the actual file, since JSON does not allow comments.
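Editing config.json by hand is easy to get wrong, so the change can also be scripted. The sketch below is one possible approach, not an official tool: the function name `enable_yarn` is hypothetical, and the demo writes to a temporary stand-in file rather than a real model directory. The rope_scaling values are the ones given in the snippet above.

```python
import json
import tempfile
from pathlib import Path

# Values taken from the config.json snippet above.
YARN_ROPE_SCALING = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}


def enable_yarn(config_path):
    """Add the rope_scaling entry to a config.json and return the updated config.

    Hypothetical helper for illustration only.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["rope_scaling"] = dict(YARN_ROPE_SCALING)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Demo on a temporary stand-in file (not the real model config).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(
        json.dumps({"architectures": ["Qwen2ForCausalLM"], "vocab_size": 152064}),
        encoding="utf-8",
    )
    updated = enable_yarn(demo_path)
    print(updated["rope_scaling"])
```

To apply it to a downloaded model, point `enable_yarn` at the config.json inside the local model directory.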
- Model Deployment: Use vLLM to deploy your model. For instance, you can set up an OpenAI-like server using the command:
python -u -m vllm.entrypoints.openai.api_server --model pankajmathur/orca_mini_v7_72b
Then you can access the Chat API by:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "pankajmathur/orca_mini_v7_72b",
    "messages": [
      {"role": "system", "content": "You are Orca Mini, a helpful AI assistant."},
      {"role": "user", "content": "Hello Orca Mini, what can you do for me?"}
    ]
  }'
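The same request can be issued from Python using only the standard library. This is a minimal sketch under two assumptions: the server from the previous step is listening on localhost:8000, and its response follows the OpenAI chat-completions schema (`choices[0].message.content`). The helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "pankajmathur/orca_mini_v7_72b"


def build_chat_request(
    user_message,
    system_message="You are Orca Mini, a helpful AI assistant.",
):
    """Build the POST request equivalent to the curl command above."""
    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_message):
    """Send the request and return the reply text (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(user_message)) as response:
        body = json.load(response)
    # Assumes an OpenAI-style response body.
    return body["choices"][0]["message"]["content"]


# Building the request does not touch the network; chat() does.
request = build_chat_request("Hello Orca Mini, what can you do for me?")
print(request.get_method(), request.full_url)
```

Once the server is up, `chat("Hello Orca Mini, what can you do for me?")` should return the assistant's reply as a string.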
Note: Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the rope_scaling configuration only when processing long contexts is required.
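The advice above can be expressed as a simple check. This sketch assumes the values from the config snippet (factor 4.0, original window 32768) and treats the product of the two as the nominal extended window; the function names are hypothetical.

```python
ORIGINAL_MAX_POSITION_EMBEDDINGS = 32768  # native context window, from the config snippet
YARN_FACTOR = 4.0                         # static scaling factor, from the config snippet


def extended_context_length():
    """Nominal context window once rope_scaling is enabled."""
    return int(ORIGINAL_MAX_POSITION_EMBEDDINGS * YARN_FACTOR)


def should_enable_yarn(longest_input_tokens):
    """Enable rope_scaling only if inputs exceed the native window."""
    return longest_input_tokens > ORIGINAL_MAX_POSITION_EMBEDDINGS


print(extended_context_length())   # 131072
print(should_enable_yarn(8000))    # False
print(should_enable_yarn(50000))   # True
```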
Open LLM Leaderboard Evaluation Results
Detailed results can be found here
Metric | Value |
---|---|
Avg. | 39.06 |
IFEval (0-Shot) | 59.30 |
BBH (3-Shot) | 55.06 |
MATH Lvl 5 (4-Shot) | 26.44 |
GPQA (0-shot) | 18.01 |
MuSR (0-shot) | 24.21 |
MMLU-PRO (5-shot) | 51.35 |
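As a quick sanity check, and assuming the Avg. row is the unweighted mean of the six benchmark scores, the figure in the table can be reproduced as follows:

```python
# Scores copied from the table above.
scores = {
    "IFEval (0-Shot)": 59.30,
    "BBH (3-Shot)": 55.06,
    "MATH Lvl 5 (4-Shot)": 26.44,
    "GPQA (0-shot)": 18.01,
    "MuSR (0-shot)": 24.21,
    "MMLU-PRO (5-shot)": 51.35,
}

# Unweighted mean across the six benchmarks.
average = sum(scores.values()) / len(scores)
print(round(average, 2))  # 39.06
```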