---
language:
- ja
license: apache-2.0
library_name: peft
datasets:
- ohtaman/kokkai2022
pipeline_tag: text-generation
inference: false
base_model: tiiuae/falcon-7b
---
<a href="https://colab.research.google.com/drive/1oWHM5_DbltvrD27oZL4-fumXChkMkrC5?usp=sharing">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
</a>
This model learned the proceedings of the Japanese parliament (the National Diet) in 2022.
The [dataset](https://huggingface.co/datasets/ohtaman/kokkai2022) was collected using the
[National Diet Library's Search API](https://kokkai.ndl.go.jp/api.html).
This model was built for a hackathon event, [第1回大規模言語モデル分散学習ハッカソン](https://abci.ai/event/2023/06/13/ja_event.html) ([#ABCILLM](https://twitter.com/hashtag/ABCILLM)), as an example of training with multiple GPUs or multiple nodes.
An example input is as follows:
```
# question
麻生太郎

増税が必要とお考えでしょうか?
# answer
鈴木 俊一
```
and the response is:
```
「財政民主主義」のためには、国庫負担を引き下げるならば、企業の賃上げを実現するためにも、消費者物価の高騰対策等を含めて、経済対策を行い、成長と分配の好循環を実珉化することが重要でございます。
今回の補正予算案においては、新型コロナウイルス感染症対応事業助成金及び電気料金支援交付金の拡充を盛り込んでおります。それ以外にも、令和四年度一般会計予備費二千五百億円を活用し、政府全体の方針を踏まえ、各種制度改革や人員削減等を通じて、様々な取組を進めてまいりますが、具体的な内容は決定しておりず、その際には、財源不足に関する議論を重ねていく中で検討してまいる所存です。
次に、賃上賃下の問題について申し上げます。
私自身は、賃金水準の向上を目指すことは大変重要であり、
```
## Training procedure
Fine-tune [tiiuae/falcon-7b](https://huggingface.co/tiiuae/falcon-7b) on the [ohtaman/kokkai2022](https://huggingface.co/datasets/ohtaman/kokkai2022) (currently private) dataset using LoRA with the following configuration (a sketch of the corresponding `LoraConfig` follows the table). Training took 2:34:32 on 16 V-nodes (64 V100 GPUs).
|param|value|
|:--:|:--:|
|r| 4|
|lora_alpha| 2|
|target_modules|query_key_value<br>dense<br>dense_h_to_4h<br>dense_4h_to_h|
|lora_dropout| 0.01|
|bias| None|
|task_type| CAUSAL_LM|
|optimizer|AdamW|
|lr|4e-4|
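In `peft` terms, these settings correspond roughly to the following `LoraConfig`. This is a minimal sketch based on the table above; the actual training script is not included in this card.

```python
from peft import LoraConfig, TaskType

# Sketch of a LoraConfig matching the hyperparameter table above
# (illustrative only; not the original training script).
lora_config = LoraConfig(
    r=4,
    lora_alpha=2,
    target_modules=["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"],
    lora_dropout=0.01,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
```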
The prompt template is as follows:
```
# question
{questioner}

{question_text}
# answer
{answerer}

{answer_text}
```
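For illustration, a small helper that fills this template might look like the following. `build_prompt` is a hypothetical name, not part of any released code; the blank line after each speaker name mirrors the prompt string used in the example code below.

```python
def build_prompt(questioner: str, question_text: str, answerer: str, answer_text: str = "") -> str:
    # Hypothetical helper: fills the prompt template shown above.
    # At inference time, leave answer_text empty so the model completes the answer.
    return (
        "# question\n"
        f"{questioner}\n\n"
        f"{question_text}\n"
        "# answer\n"
        f"{answerer}\n\n"
        f"{answer_text}"
    )
```

For example, `build_prompt("麻生太郎", "増税が必要とお考えでしょうか?", "鈴木 俊一")` reproduces the prompt used in the example code below.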
### Example Code
You can try the model on [Colaboratory](https://colab.research.google.com/drive/1oWHM5_DbltvrD27oZL4-fumXChkMkrC5?usp=sharing); no Pro or Pro+ subscription is needed.
The typical code to generate text with this model is as follows:
```python
import torch
import transformers
import peft

base_model_name = "tiiuae/falcon-7b"
peft_model_name = "..."  # set this to the adapter's repository id on the Hugging Face Hub
max_length = 256  # example value; adjust as needed

tokenizer = transformers.AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
base_model = transformers.AutoModelForCausalLM.from_pretrained(base_model_name, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
peft_model = peft.PeftModelForCausalLM.from_pretrained(base_model, peft_model_name, torch_dtype=torch.bfloat16)

prompt = "# question\n麻生太郎\n\n増税が必要とお考えでしょうか?\n# answer\n鈴木 俊一\n\n"
input_tokens = tokenizer(prompt, return_tensors="pt").to(peft_model.device)
input_length = input_tokens.input_ids.shape[1]

with torch.no_grad():
    outputs = peft_model.generate(
        input_ids=input_tokens["input_ids"],
        attention_mask=input_tokens["attention_mask"],
        return_dict_in_generate=True,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
        max_length=max_length,
        do_sample=True,  # enable sampling so temperature/top_p take effect
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.05,
    )

# Strip the prompt tokens (and the trailing EOS token) before decoding.
output_tokens = outputs.sequences[0, input_length:-1]
print(tokenizer.decode(output_tokens))
``` |