sho-takase committed · Commit 48f3ab9 · Parent: 82de7cf
Add readme

README.md CHANGED

---
license: apache-2.0
datasets:
- wikipedia
- mc4
- cc100
- oscar
language:
- ja
---

# japanese-large-lm-1.7b

This repository provides a 1.7B-parameter Japanese language model trained by [LINE Corporation](https://linecorp.com/ja/).

## How to use

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, set_seed

model = AutoModelForCausalLM.from_pretrained("line-corporation/japanese-large-lm-1.7b", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("line-corporation/japanese-large-lm-1.7b", use_fast=False)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)
set_seed(101)

text = generator(
    "おはようございます、今日の天気は",
    max_length=30,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    num_return_sequences=5,
)

for t in text:
    print(t)

# [{'generated_text': 'おはようございます、今日の天気は雨模様ですね。梅雨のこの時期の ジメジメ、ムシムシはたまらないですねえ~。 皆さんもお'},
#  {'generated_text': 'おはようございます、今日の天気は快晴。 そして、朝8時15分には、 8月9日現在の、 月島・勝どき・'},
#  {'generated_text': 'おはようございます、今日の天気は曇りです。 朝起きたら雪がチラついていました。 日中も雪が舞い散るような天気です。 朝から寒いですね。'},
#  {'generated_text': 'おはようございます、今日の天気は雨です。昨日、天気が悪く洗濯物を干しにベランダに出た時に雨に降られ、風邪が悪化しそうです。今日洗濯'},
#  {'generated_text': 'おはようございます、今日の天気は晴天ですが涼しい1日です、気温は午後になり 若干下がる予報です。 6月も10日を'}]
```
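
If you prefer to call the model directly rather than through `pipeline`, the following is only a minimal sketch (not part of the original instructions above); it assumes the same checkpoint and a CUDA device, mirroring `device=0` in the example:

```
# Sketch only: direct use of model.generate instead of the pipeline helper.
# Assumes a CUDA device is available, matching device=0 in the example above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "line-corporation/japanese-large-lm-1.7b"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

inputs = tokenizer("おはようございます、今日の天気は", return_tensors="pt").to("cuda")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_length=30,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```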

## Model architecture
| Model | Vocab size | Architecture | Position type | Layers | Hidden dim | Attention heads |
| :---: | :--------: | :----------- | :-----------: | :----: | :--------: | :-------------: |
| 1.7B | 51200 | GPT2 | Absolute | 24 | 2304 | 24 |
| 3.6B | 51200 | GPTNeoX | RoPE | 30 | 3072 | 32 |
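
As a quick sanity check (not part of the original model card), the 1.7B row of the table can be read back from the published configuration; the field names below (`n_layer`, `n_embd`, `n_head`) assume the GPT2-style config this model uses:

```
# Sketch: inspect the checkpoint's config and compare it with the table above.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("line-corporation/japanese-large-lm-1.7b")
print(config.model_type)  # GPT2-style architecture
print(config.vocab_size)  # vocab size, 51200 per the table
print(config.n_layer)     # layers, 24 per the table
print(config.n_embd)      # hidden dim, 2304 per the table
print(config.n_head)      # attention heads, 24 per the table
```
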
## Training Corpus
Our training corpus consists of the Japanese portions of publicly available corpora such as C4, CC-100, and OSCAR.
We also incorporated web texts crawled by our in-house system.
The total size of our training corpus is about 650 GB.
The trained model achieves a perplexity of 8.57 on our internal validation set of Japanese C4.

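The data and evaluation script behind the 8.57 figure are internal, but perplexity itself is simply the exponential of the average token-level cross-entropy on held-out text; a rough sketch (with an arbitrary illustrative sentence, not the actual validation set):

```
# Sketch: perplexity = exp(mean cross-entropy loss) on held-out text.
# The sentence below is only an illustration, not the internal validation data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "line-corporation/japanese-large-lm-1.7b"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("今日は良い天気なので散歩に出かけた。", return_tensors="pt")
with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean cross-entropy loss.
    loss = model(**inputs, labels=inputs["input_ids"]).loss
print(torch.exp(loss).item())
```
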
## Tokenization
We use a SentencePiece tokenizer with a unigram language model and byte-fallback.
We **do not** apply pre-tokenization with a Japanese tokenizer.
Thus, a user may directly feed raw sentences into the tokenizer.
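
For example, raw Japanese text can be passed straight to the tokenizer; this is a minimal sketch assuming the checkpoint above, and the sample sentence is only illustrative:

```
# Sketch: no morphological pre-tokenization step; feed the raw sentence as-is.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("line-corporation/japanese-large-lm-1.7b", use_fast=False)

sentence = "吾輩は猫である。"            # raw sentence, no pre-tokenization
print(tokenizer.tokenize(sentence))      # subword pieces from the unigram model
print(tokenizer(sentence)["input_ids"])  # corresponding token ids
```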

## License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)