This is the OpenNMT-py conversion of Mistral 7B Instruct v0.2, quantized to 4-bit with AWQ (gemm version, which is faster at large batch sizes).
The safetensors file is 4.2 GB, so the model runs smoothly on any RTX card.
The command line to run inference is:

```
python onmt/bin/translate.py --config /pathto/mistral-instruct-inference-awq.yaml --src /pathto/input-vicuna.txt --output /pathto/mistral-output.txt
```
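The referenced YAML is a standard OpenNMT-py inference config. Its exact contents are not reproduced here, but a minimal sketch could look like the following; the option names are the usual OpenNMT-py translate settings, while every path and value below is a placeholder assumption to adapt to your setup:

```yaml
# Hypothetical sketch of mistral-instruct-inference-awq.yaml; paths and
# generation settings are placeholders, not the original file.
transforms: [sentencepiece]
src_subword_model: "/pathto/tokenizer.model"   # Mistral SentencePiece tokenizer
tgt_subword_model: "/pathto/tokenizer.model"
model: "/pathto/mistral-instruct-v0.2-onmt-awq-gemm.pt"
seed: 42
max_length: 256      # maximum number of generated tokens per prompt
gpu: 0
batch_type: sents
batch_size: 60       # batch size used for the throughput run below
beam_size: 1
n_best: 1
report_time: true    # prints the timing lines shown below
```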
As an example, input-vicuna.txt could contain:

```
USER:⦅newline⦆Show me some attractions in Boston.⦅newline⦆⦅newline⦆ASSISTANT:⦅newline⦆
```

The ⦅newline⦆ placeholder stands for an actual line break, so each prompt fits on a single line of the source file (one line per prompt when batching).
The output will be:

```
Absolutely, Boston is rich in history and culture. Here are some must-visit attractions in Boston:⦅newline⦆⦅newline⦆1. Freedom Trail: This 2.5-mile-long path passes through 16 historical sites, including the Paul Revere House, the Old North Church, and the USS Constitution.⦅newline⦆⦅newline⦆2. Boston Common: Established in 1634, Boston Common is the oldest city park in the United States. It covers an area of 50 acres and is home to several monuments, including the Emancipation Monument, the Robert Gould Shaw and the 54th Massachusetts Regiment Memorial, and the Massachusetts Soldiers and Sailors Monument.⦅newline⦆⦅newline⦆3. New England Aquarium: Located on the Central Wharf in the Fort Point Channel, the New England Aquarium is one of the premier visitor attractions in Boston. It covers an area of 23 acres and is home to over 20,000 animals, representing more than 1,200 species. The aquarium is divided into several galleries, including the Giant Ocean Tank, the Caribbean Coral Reef Gallery, the Amazon Rainforest Exhibit, the Sh
```
If you run with a batch size of 60, you can get nice throughput:
```
[2023-12-27 11:57:35,513 INFO] Loading checkpoint from /mnt/InternalCrucial4/dataAI/mistral-7B/mistral-instruct-v0.2/mistral-instruct-v0.2-onmt-awq-gemm.pt
[2023-12-27 11:57:35,603 INFO] awq_gemm compression of layer ['w_1', 'w_2', 'w_3', 'linear_values', 'linear_query', 'linear_keys', 'final_linear']
[2023-12-27 11:57:39,574 INFO] Loading data into the model
step0 time: 1.2474071979522705
[2023-12-27 11:57:45,686 INFO] PRED SCORE: -0.2316, PRED PPL: 1.26 NB SENTENCES: 59
[2023-12-27 11:57:45,686 INFO] Total translation time (s): 5.2
[2023-12-27 11:57:45,686 INFO] Average translation time (ms): 87.7
[2023-12-27 11:57:45,686 INFO] Tokens per second: 2576.9
Time w/o python interpreter load/terminate: 10.182368755340576
```
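These numbers are self-consistent: 2,576.9 tokens/s × 5.2 s ≈ 13,400 generated tokens over 59 prompts, i.e. roughly 230 tokens per answer, and 5.2 s / 59 sentences ≈ 88 ms, matching the reported average translation time.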