
	---
	language:
	- en
	thumbnail: null
	tags:
	- text generation
	pipeline_tag: text-generation
	inference: false
	license: llama2
	---

	# Llama-2 ONNX

	This repository contains an optimized version of Llama-2 7B in ONNX format.

	## Downloading the model

	You can use `huggingface_hub` to download this repository, either from a Python script or from
	the command line. Refer to the
	[Hugging Face Hub documentation](https://huggingface.co/docs/huggingface_hub/guides/download) for
	the Python examples.
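
For the Python route, a minimal sketch might look like the following. It assumes `huggingface_hub` is installed and uses its `snapshot_download` function; the helper names `build_download_kwargs` and `download_repo` are hypothetical and only serve this example.

```python
from pathlib import Path

REPO_ID = "alpindale/Llama-2-7b-ONNX"


def build_download_kwargs(local_dir, cache_dir=None):
    """Collect the keyword arguments passed to snapshot_download."""
    kwargs = {
        "repo_id": REPO_ID,
        "repo_type": "model",
        "local_dir": str(Path(local_dir)),
    }
    # Only override the cache location when one is given explicitly.
    if cache_dir is not None:
        kwargs["cache_dir"] = str(Path(cache_dir))
    return kwargs


def download_repo(local_dir, cache_dir=None):
    """Download the whole repository and return the local path."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**build_download_kwargs(local_dir, cache_dir))


# Example call (placeholder path, downloads the full repository):
# download_repo("/path/to/download/dir")
```

Passing `cache_dir` mirrors the `--cache-dir` flag of the CLI, and `local_dir` mirrors `--local-dir`.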

	With the CLI:

	1. Make sure you have an updated `huggingface_hub` installed.
	```sh
	pip install -U huggingface_hub
	```
	2. Download the repository.
	```sh
	huggingface-cli download alpindale/Llama-2-7b-ONNX --repo-type model --cache-dir /path/to/custom/cache/directory --local-dir /path/to/download/dir --local-dir-use-symlinks False
	```
	The `--cache-dir` option is only necessary if your default cache directory (`~/.cache/huggingface/hub`)
	does not have enough disk space to accommodate the entire repository.
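
To decide whether a custom cache directory is needed, you could check the free space first. This is a rough sketch using only the standard library; the default cache path and the `required_bytes` value are assumptions you should replace with your own, and the function names are hypothetical.

```python
import shutil
from pathlib import Path


def free_bytes(path):
    """Free bytes on the filesystem holding `path` or its nearest existing parent."""
    p = Path(path).expanduser().resolve()
    # Walk up until we reach a directory that exists (the root always does).
    while not p.exists():
        p = p.parent
    return shutil.disk_usage(p).free


def needs_custom_cache(required_bytes, cache_dir="~/.cache/huggingface/hub"):
    """True if the cache location has less free space than `required_bytes`."""
    return free_bytes(cache_dir) < required_bytes
```

If `needs_custom_cache(...)` returns `True` for the size of the files you plan to download, point `--cache-dir` at a disk with more room.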

	## Chat Interface
	You can use the Gradio chat interface to run the models.

	First, install the required packages:
	```sh
	pip install -r ChatApp/requirements.txt
	```

	Set the Python path to the root directory of the repository (necessary for importing the required modules):
	```sh
	export PYTHONPATH=$PYTHONPATH:$(pwd)
	```

	Then you can simply run:

	```sh
	python ChatApp/app.py
	```

	You can then navigate to [http://localhost:7860](http://localhost:7860) in your browser to access the interface.


	## CLI Interface
	The repository also provides example code for running the models.

	```sh
	python llama2_onnx_inference.py --onnx_file FP16/LlamaV2_7B_float16.onnx --embedding_file embeddings.pth --tokenizer_path tokenizer.model --prompt "What is the lightest element?"
	```

	Output:
	```
	The lightest element is hydrogen. Hydrogen is the lightest element on the periodic table, with an atomic mass of 1.00794 u (unified atomic mass units).
	```

	## FAQ
	### Why is the first inference session slow?
	The ONNX Runtime execution provider may need to generate JIT binaries for the underlying hardware. These binaries are typically cached and loaded directly in subsequent runs, which reduces the overhead.
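
One way to observe this warm-up cost is to time several identical runs and compare the first against the rest. The helper below is a generic sketch; `run_once` stands for any zero-argument callable you supply, for example a lambda wrapping your own `session.run(...)` call (both names are assumptions, not part of this repository).

```python
import time


def time_runs(run_once, repeats=3):
    """Call `run_once` several times and return each duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    return durations


# Example with a hypothetical, already created session and input dict:
# print(time_runs(lambda: session.run(None, feeds)))
```

If the first duration is much larger than the others, the extra time is likely one-off setup rather than steady-state inference cost.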

	### Why is FP16 slower than FP32 on my device?
	Your device may not support native FP16 math, in which case the weights are cast to FP32 at runtime. Using the FP32 version of the model avoids the cast overhead.

	### How do I optimize inference?
	Place inputs and outputs on the target device to avoid expensive data copies. Refer to the following documentation for details:

	[I/O Binding \| onnxruntime](https://onnxruntime.ai/docs/performance/tune-performance/iobinding.html)
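
As a rough illustration of I/O binding, the sketch below binds every model output to a chosen device before running. It assumes `session` is an existing `onnxruntime.InferenceSession` and `feeds` maps input names to NumPy arrays; the function name `run_with_io_binding` and the default `"cuda"` device are assumptions for this example, and the input names depend on the model you load.

```python
def run_with_io_binding(session, feeds, device_type="cuda", device_id=0):
    """Run `session` once, with outputs allocated on the target device."""
    binding = session.io_binding()

    # Inputs here start on the CPU; ONNX Runtime copies them to the device.
    # Data already on the device can be bound with bind_ortvalue_input instead.
    for name, array in feeds.items():
        binding.bind_cpu_input(name, array)

    # Let ONNX Runtime allocate each output directly on the target device.
    for output in session.get_outputs():
        binding.bind_output(output.name, device_type, device_id)

    session.run_with_iobinding(binding)

    # Copy results back to the host only once, at the end.
    return binding.copy_outputs_to_cpu()
```

In a token-by-token generation loop, the larger gain usually comes from keeping intermediate tensors on the device between steps rather than copying them back after every run.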

	### What generation parameters should I use with the model?
	You can perform temperature and top-p sampling with the provided example code. Please refer to [Meta's example code](https://github.com/facebookresearch/llama/) for reference.
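
For readers unfamiliar with these two parameters, here is a small standard-library sketch of temperature scaling followed by top-p (nucleus) sampling over a list of logits. It is not the code shipped in this repository, the function name `sample_top_p` is hypothetical, and the default values are illustrative rather than recommendations.

```python
import math
import random


def sample_top_p(logits, temperature=0.6, top_p=0.9, rng=None):
    """Pick a token index from raw logits using temperature and top-p sampling."""
    if rng is None:
        rng = random.Random()
    if temperature <= 0:
        # Treat a non-positive temperature as greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw from the kept tokens in proportion to their probabilities.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]
```

Lower temperatures sharpen the distribution toward the most likely token, while a smaller `top_p` discards more of the low-probability tail before sampling.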