---
language:
- ja
license: apache-2.0
library_name: peft
datasets:
- ohtaman/kokkai2022
pipeline_tag: text-generation
inference: false
base_model: tiiuae/falcon-7b
---

<a href="https://colab.research.google.com/drive/1oWHM5_DbltvrD27oZL4-fumXChkMkrC5?usp=sharing">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
</a>

This model was trained on the proceedings of the Japanese National Diet (parliament) in 2022.
The [dataset](https://huggingface.co/datasets/ohtaman/kokkai2022) was collected using the
[National Diet Library's Search API](https://kokkai.ndl.go.jp/api.html).
The model was built for a hackathon event, [第1回大規模言語モデル分散学習ハッカソン](https://abci.ai/event/2023/06/13/ja_event.html) (the 1st Large Language Model Distributed Training Hackathon) ([#ABCILLM](https://twitter.com/hashtag/ABCILLM)), as an example of training with multiple GPUs or multiple nodes.

An example input is as follows:

```
# question
麻生太郎

増税が必要とお考えでしょうか?
# answer
鈴木 俊一
```

and the response is:

```
「財政民主主義」のためには、国庫負担を引き下げるならば、企業の賃上げを実現するためにも、消費者物価の高騰対策等を含めて、経済対策を行い、成長と分配の好循環を実珉化することが重要でございます。
 今回の補正予算案においては、新型コロナウイルス感染症対応事業助成金及び電気料金支援交付金の拡充を盛り込んでおります。それ以外にも、令和四年度一般会計予備費二千五百億円を活用し、政府全体の方針を踏まえ、各種制度改革や人員削減等を通じて、様々な取組を進めてまいりますが、具体的な内容は決定しておりず、その際には、財源不足に関する議論を重ねていく中で検討してまいる所存です。

次に、賃上賃下の問題について申し上げます。

私自身は、賃金水準の向上を目指すことは大変重要であり、
```

## Training procedure

[tiiuae/falcon-7b](https://huggingface.co/tiiuae/falcon-7b) was fine-tuned on the [ohtaman/kokkai2022](https://huggingface.co/datasets/ohtaman/kokkai2022) dataset (currently private) using LoRA with the following configuration. Training took 2:34:32 on 16 V nodes (64 V100 GPUs).

|param|value|
|:--:|:--:|
|r| 4|
|lora_alpha| 2|
|target_modules|query_key_value<br>dense<br>dense_h_to_4h<br>dense_4h_to_h|
|lora_dropout| 0.01|
|bias| None|
|task_type| CAUSAL_LM|
|optimizer|AdamW|
|lr|4e-4|
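The LoRA parameters in the table map directly onto `peft.LoraConfig`. The exact training script is not included in this card, so the following is a sketch of how the configuration above would be expressed with the `peft` API (the optimizer and learning rate belong to the training loop, not to `LoraConfig`):

```python
import peft

# LoRA configuration matching the table above. Note that the card's
# "bias: None" corresponds to the string "none" in the peft API.
lora_config = peft.LoraConfig(
    r=4,
    lora_alpha=2,
    target_modules=[
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ],
    lora_dropout=0.01,
    bias="none",
    task_type="CAUSAL_LM",
)
```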

The prompt template is as follows:

```
# question
{questioner}

{question_text}

# answer
{answerer}

{answer_text}

```
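For inference, the template is filled in only up to the answerer's name, and the model continues with the answer text. A small helper makes this concrete; `build_prompt` is an illustrative name, not part of the training code, and it follows the exact string used in the example code below (which, unlike the template above, has no blank line before `# answer`):

```python
def build_prompt(questioner: str, question_text: str, answerer: str) -> str:
    # Fill the prompt template up to the answerer line; the model is
    # expected to generate the answer text that follows.
    return f"# question\n{questioner}\n\n{question_text}\n# answer\n{answerer}\n\n"
```

For example, `build_prompt("麻生太郎", "増税が必要とお考えでしょうか?", "鈴木 俊一")` reproduces the prompt string used in the example code.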

### Example Code

You can try the model on [Colaboratory](https://colab.research.google.com/drive/1oWHM5_DbltvrD27oZL4-fumXChkMkrC5?usp=sharing); no Pro or Pro+ subscription is needed.
Typical code to generate text with this model is as follows:

```python
tokenizer = transformers.AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
base_model = transformers.AutoModelForCausalLM.from_pretrained(base_model_name, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
peft_model = peft.PeftModelForCausalLM.from_pretrained(base_model, peft_model_name, torch_dtype=torch.bfloat16)


prompt = "# question\n麻生太郎\n\n増税が必要とお考えでしょうか?\n# answer\n鈴木 俊一\n\n"
input_tokens = tokenizer(prompt, return_tensors="pt").to(peft_model.device)
input_length = input_tokens.input_ids.shape[1]

with torch.no_grad():
    outputs = peft_model.generate(
        input_ids=input_tokens["input_ids"],
        attention_mask=input_tokens["attention_mask"],
        return_dict_in_generate=True,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
        max_length=max_length,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.05,
    )
    output_tokens = outputs.sequences[0, input_length:-1]

print(tokenizer.decode(output_tokens))
```