	---
	library_name: peft
	license: mit
	---

	# Chinkara 7B (Improved)

	_Chinkara_ is a large language model based on Meta's Llama 2 with 7 billion parameters, fine-tuned on the [timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) dataset using the QLoRA technique and optimized for small consumer GPUs.
	![logo](chinkara-logo.png)

	## Information

	For more information about the model, please visit [prp-e/chinkara](https://github.com/prp-e/chinkara) on GitHub.

	## Inference Guide

	[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/prp-e/chinkara/blob/main/inference-7b-improved.ipynb)

	_NOTE: This section is for loading the model and running inference on your local machine. You still need 8 GB of VRAM on your GPU. The recommended GPU is at least an RTX 2080!_
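
As a rough sanity check on the 8 GB figure, the weight footprint can be estimated from the parameter count and the number of bits stored per parameter. The sketch below assumes a nominal 7 billion parameters and counts only the weights (no activations, KV cache, quantization constants, or layers kept in higher precision), so real usage will be higher. `weight_memory_gib` is a hypothetical helper name, not part of any library.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    total_bytes = n_params * bits_per_param / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    n_params = 7e9  # nominal size (assumption); the exact parameter count differs slightly
    for label, bits in [("16-bit", 16), ("4-bit", 4)]:
        print(f"{label}: ~{weight_memory_gib(n_params, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights come to roughly 3.3 GiB, versus about 13 GiB at 16-bit, which is why 4-bit loading leaves headroom on an 8 GB card.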

	### Installing libraries

	```
	pip install -U bitsandbytes
	pip install -U git+https://github.com/huggingface/transformers.git
	pip install -U git+https://github.com/huggingface/peft.git
	pip install -U git+https://github.com/huggingface/accelerate.git
	pip install -U datasets
	pip install -U einops
	```
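
Because several of these packages are installed from their development branches, it can be useful to confirm what actually ended up in the environment. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, and the package list simply mirrors the install commands above.

```python
from importlib import metadata

# Distribution names taken from the pip commands above.
PACKAGES = ["bitsandbytes", "transformers", "peft", "accelerate", "datasets", "einops"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```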

	### Loading the model

	```python
	import torch
	from peft import PeftModel
	from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

	model_name = "Trelis/Llama-2-7b-chat-hf-sharded-bf16"
	adapters_name = 'MaralGPT/chinkara-7b-improved'

	# Load the base model in 4-bit NF4 with double quantization.
	# The 4-bit settings are given only through BitsAndBytesConfig; newer
	# transformers releases raise an error if load_in_4bit=True is also
	# passed as a separate argument alongside quantization_config.
	model = AutoModelForCausalLM.from_pretrained(
	    model_name,
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	    max_memory={i: '24000MB' for i in range(torch.cuda.device_count())},
	    quantization_config=BitsAndBytesConfig(
	        load_in_4bit=True,
	        bnb_4bit_compute_dtype=torch.bfloat16,
	        bnb_4bit_use_double_quant=True,
	        bnb_4bit_quant_type='nf4'
	    ),
	)
	# The LoRA adapter and the tokenizer are loaded in the next step.
	```
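
The loading code above uses `torch.bfloat16` as the compute dtype. Not every GPU handles bfloat16 natively, so it may help to check before loading. The following is a minimal sketch and not part of the original guide: `pick_compute_dtype` and `detect_compute_dtype` are hypothetical helper names, and falling back to float16 is an assumption (float16 is the compute dtype listed in the training config). `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()` are existing PyTorch calls.

```python
def pick_compute_dtype(cuda_available: bool, bf16_supported: bool) -> str:
    """Return the name of the compute dtype to use for 4-bit inference."""
    if cuda_available and bf16_supported:
        return "bfloat16"
    # Assumption: fall back to float16 when bfloat16 is not available.
    return "float16"


def detect_compute_dtype() -> str:
    """Query the local GPU through PyTorch and pick a compute dtype name."""
    import torch  # imported lazily so the pure helper above works without torch

    cuda_available = torch.cuda.is_available()
    bf16_supported = cuda_available and torch.cuda.is_bf16_supported()
    return pick_compute_dtype(cuda_available, bf16_supported)
```

The returned name can be turned into a dtype with `getattr(torch, detect_compute_dtype())` and passed as `bnb_4bit_compute_dtype` and `torch_dtype`.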

	### Setting the model up

	```python
	# Attach the fine-tuned LoRA adapter to the quantized base model
	# and load the tokenizer that matches the base model.
	model = PeftModel.from_pretrained(model, adapters_name)
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	```

	### Prompt and inference

	```python
	prompt = "What is the answer to life, universe and everything?"

	prompt = f"###Human: {prompt} ###Assistant:"

	inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
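	# Pass the attention mask along with the input IDs; do_sample=True is required for temperature to take effect.
	outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.5, repetition_penalty=1.0)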
	answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
	print(answer)
	```
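
The decoded output contains the prompt followed by the completion, so printing it repeats the question. A small pair of helpers can build the prompt and keep only the assistant's reply. This is a sketch under stated assumptions: `build_prompt` and `extract_answer` are hypothetical names, the template is the one used above, and cutting the reply at the next `###` marker assumes the model starts a new turn that way.

```python
HUMAN_TAG = "###Human:"
ASSISTANT_TAG = "###Assistant:"


def build_prompt(user_message: str) -> str:
    """Wrap a user message in the prompt template used in this guide."""
    return f"{HUMAN_TAG} {user_message} {ASSISTANT_TAG}"


def extract_answer(decoded: str) -> str:
    """Return the text after the assistant tag, up to the next '###' marker."""
    _, found, tail = decoded.partition(ASSISTANT_TAG)
    if not found:
        # No assistant tag in the text: return it unchanged apart from whitespace.
        return decoded.strip()
    # Assumption: a following '###' marks the start of a new turn.
    return tail.split("###", 1)[0].strip()
```

With these in place, `extract_answer(tokenizer.decode(outputs[0], skip_special_tokens=True))` would give just the reply text.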

	## Training procedure

	The following `bitsandbytes` quantization config was used during training:
	- load_in_8bit: False
	- load_in_4bit: True
	- llm_int8_threshold: 6.0
	- llm_int8_skip_modules: None
	- llm_int8_enable_fp32_cpu_offload: False
	- llm_int8_has_fp16_weight: False
	- bnb_4bit_quant_type: nf4
	- bnb_4bit_use_double_quant: False
	- bnb_4bit_compute_dtype: float16
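
For reference, the 4-bit settings in the list above map onto `BitsAndBytesConfig` roughly as follows. This is a sketch reconstructed from the listed values, not the author's training script; `TRAINING_BNB_KWARGS` and `build_training_bnb_config` are hypothetical names. Note that the training config uses float16 compute and no double quantization, whereas the inference code earlier in this card uses bfloat16 with double quantization.

```python
# 4-bit settings copied from the training config listed above.
TRAINING_BNB_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": False,
    "bnb_4bit_compute_dtype": "float16",
}


def build_training_bnb_config():
    """Build a BitsAndBytesConfig from the values above (needs torch and transformers)."""
    import torch
    from transformers import BitsAndBytesConfig

    kwargs = dict(TRAINING_BNB_KWARGS)
    # Convert the dtype name into the actual torch dtype object.
    kwargs["bnb_4bit_compute_dtype"] = getattr(torch, kwargs["bnb_4bit_compute_dtype"])
    return BitsAndBytesConfig(**kwargs)
```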

	### Framework versions

	- PEFT 0.5.0.dev0