omitakahiro committed
Commit 0ae4abc (1 parent: d9f6dad): Update README.md

README.md CHANGED
---
license: mit
language:
- ja
library_name: transformers
pipeline_tag: text-generation
tags:
- japanese
- llama-2
---

# stockmark/stockmark-13b

This repository provides a Llama-2 based model with 13B parameters, pre-trained on a Japanese corpus of about 220B tokens. The model was developed by [Stockmark Inc.](https://stockmark.co.jp/)

Please see our [blog](https://tech.stockmark.co.jp/blog/202310_stockmark_13b/) for more details.

This project is supported by the [AWS LLM development support program](https://aws.amazon.com/jp/local/llm-development-support-program/).

## How to use

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model in bfloat16 and spread it across the available devices.
model = AutoModelForCausalLM.from_pretrained("stockmark/stockmark-13b", device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("stockmark/stockmark-13b")

# Generate a continuation of the Japanese prompt 「自然言語処理とは」 ("Natural language processing is ...").
inputs = tokenizer("自然言語処理とは", return_tensors="pt").to(model.device)
with torch.no_grad():
    tokens = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7
    )

output = tokenizer.decode(tokens[0], skip_special_tokens=True)
print(output)
```

## Example

- LoRA tuning (in preparation): https://huggingface.co/stockmark/stockmark-13b/blob/main/notebooks/LoRA.ipynb (an unofficial sketch is shown below)
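
The notebook above is still in preparation, so the following is an unofficial sketch of how LoRA tuning of this model could be set up with the Hugging Face `peft` library. The rank, dropout, and `target_modules` values are illustrative assumptions for a Llama-2 style architecture, not settings confirmed by Stockmark.

```python
# Unofficial sketch of LoRA tuning with peft; the hyperparameters and target
# modules below are assumptions, not Stockmark's actual settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "stockmark/stockmark-13b", device_map="auto", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("stockmark/stockmark-13b")

# Attach low-rank adapters; only the adapter weights are trained.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # adapter rank (assumed)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # typical Llama-2 attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Train the adapters on your own dataset (e.g. with transformers.Trainer),
# then save only the adapter weights:
# model.save_pretrained("stockmark-13b-lora")
```

Because only the adapter parameters are updated, the memory needed for tuning stays far below that of full fine-tuning of the 13B weights.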

## Training dataset

We used a Japanese corpus totaling about 220 billion tokens.

|corpus|tokens after preprocessing|
|:---:|:---:|
|Stockmark Web Corpus (this dataset will not be released)|9.1 billion|
|Patent|34.8 billion|
|Wikipedia|1.0 billion|
|CC100|10.9 billion|
|mC4|53.2 billion|
|CommonCrawl (snapshots: 2023-23, 2022-49, 2022-21, 2021-21)|112.9 billion|

## Library and Accelerators
- Library: [neuronx-nemo-megatron](https://github.com/aws-neuron/neuronx-nemo-megatron)
- Accelerator: [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/)

## License
[The MIT license](https://opensource.org/licenses/MIT)

## Developed by
[Stockmark Inc.](https://stockmark.co.jp/)

## Author
[Takahiro Omi](https://huggingface.co/omitakahiro)