---
license: gemma
library_name: peft
tags:
- trl
- sft
- generated_from_trainer
base_model: google/gemma-1.1-2b-it
model-index:
- name: gemma-2b-it-example-v1
  results: []
language:
- ko
---


## Model Description  
**GitHub**: [https://github.com/aiqwe/instruction-tuning-with-rag-example](https://github.com/aiqwe/instruction-tuning-with-rag-example)  
This model was trained as a worked example for learning instruction tuning.  
It was fine-tuned from the [gemma-2b-it](https://huggingface.co/google/gemma-2b-it) model on roughly 10,000 real-estate-related instruction examples.  
Please refer to the GitHub repository above for the training code.  

## Usage
### Inference on GPU example
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "aiqwe/gemma-2b-it-example-v1",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)

input_text = "아파트 재건축에 대해 알려줘."  # "Tell me about apartment reconstruction."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))
```


### Inference on CPU example
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "aiqwe/gemma-2b-it-example-v1",
    device_map="cpu",
    torch_dtype=torch.bfloat16
)

input_text = "아파트 재건축에 대해 알려줘."  # "Tell me about apartment reconstruction."
input_ids = tokenizer(input_text, return_tensors="pt").to("cpu")

outputs = model.generate(**input_ids, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))
```

### Inference on GPU with the built-in RAG function
A helper function bundled with the repository provides RAG support via the Naver Search API.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from google.colab import userdata  # Colab secrets store holding the Naver API keys
from utils import generate  # helper from the GitHub repository above

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "aiqwe/gemma-2b-it-example-v1",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)

query = "아파트 재건축에 대해 알려줘."  # "Tell me about apartment reconstruction."
rag_config = {
    "api_client_id": userdata.get('NAVER_API_ID'),
    "api_client_secret": userdata.get('NAVER_API_SECRET')
}
completion = generate(
    model=model,
    tokenizer=tokenizer,
    query=query,
    max_new_tokens=512,
    rag=True,
    rag_config=rag_config
)
print(completion)
```
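
The retrieval step itself lives in `utils.py` in the GitHub repository. As a rough, illustrative sketch only (the endpoint, response fields, and helper name here are assumptions, not the repository's actual code), the Naver Search API call used to build the RAG context could look like this:
```python
import requests

def naver_search(query: str, client_id: str, client_secret: str, display: int = 5) -> list[str]:
    """Illustrative helper: fetch search snippets from the Naver Search API to use as RAG context."""
    response = requests.get(
        "https://openapi.naver.com/v1/search/webkr.json",  # web search endpoint (assumption)
        params={"query": query, "display": display},
        headers={
            "X-Naver-Client-Id": client_id,
            "X-Naver-Client-Secret": client_secret,
        },
        timeout=10,
    )
    response.raise_for_status()
    # Return the result snippets; these would be prepended to the prompt before generation.
    return [item["description"] for item in response.json().get("items", [])]
```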

## Chat Template
Gemma ๋ชจ๋ธ์˜ Chat Template์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.  
[gemma-2b-it Chat Template](https://huggingface.co/google/gemma-2b-it#chat-template)
```python
input_text = "아파트 재건축에 대해 알려줘."  # "Tell me about apartment reconstruction."

input_ids = tokenizer.apply_chat_template(
    conversation=[
        {"role": "user", "content": input_text}
    ],
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=512, repetition_penalty=1.5)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```
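
For reference, rendering the same single user turn with `tokenize=False` shows the prompt format the Gemma chat template produces (the exact string may vary slightly across tokenizer versions):
```python
prompt = tokenizer.apply_chat_template(
    conversation=[{"role": "user", "content": "아파트 재건축에 대해 알려줘."}],
    add_generation_prompt=True,
    tokenize=False
)
print(prompt)
# <bos><start_of_turn>user
# 아파트 재건축에 대해 알려줘.<end_of_turn>
# <start_of_turn>model
```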

## Training information
ํ•™์Šต์€ ๊ตฌ๊ธ€ ์ฝ”๋žฉ L4 Single GPU๋ฅผ ํ™œ์šฉํ•˜์˜€์Šต๋‹ˆ๋‹ค.  

| ๊ตฌ๋ถ„                          | ๋‚ด์šฉ               |
|-----------------------------|------------------|
| ํ™˜๊ฒฝ                          | Google Colab     |
| GPU                         | L4(22.5GB)       |
| ์‚ฌ์šฉ VRAM                     | ์•ฝ 13.8GB         |
| dtype                       | bfloat16         |
| Attention                   | flash attention2 |
| Tuning                      | Lora(r=4, alpha=32) |
| Learning Rate               | 1e-4             |
| LRScheduler                 | Cosine           |
| Optimizer                   | adamw_torch_fused |
| batch_size                  | 4                |
| gradient_accumulation_steps | 2                |
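
As a minimal sketch of how these hyperparameters map onto a `trl`/`peft` SFT run (this is not the repository's actual training script; argument names follow the trl 0.7/0.8-era `SFTTrainer` API, and a placeholder dataset stands in for the ~10,000 real-estate instruction examples):
```python
import torch
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")

# Placeholder dataset; the actual run used ~10,000 real-estate instruction examples.
train_dataset = Dataset.from_dict(
    {"text": ["<start_of_turn>user\n...<end_of_turn>\n<start_of_turn>model\n...<end_of_turn>"]}
)

# LoRA settings from the table above.
peft_config = LoraConfig(r=4, lora_alpha=32, task_type="CAUSAL_LM")

args = TrainingArguments(
    output_dir="gemma-2b-it-example-v1",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    optim="adamw_torch_fused",
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field="text",
    peft_config=peft_config,
)
trainer.train()
```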