omitakahiro committed
Commit 0ae4abc
1 Parent(s): d9f6dad

Update README.md

Files changed (1): README.md +68 -0
README.md CHANGED
---
license: mit
language:
- ja
library_name: transformers
pipeline_tag: text-generation
tags:
- japanese
- llama-2
---

# stockmark/stockmark-13b

This repository provides a Llama-2-based model with 13B parameters, pre-trained on a Japanese corpus of about 220B tokens. The model was developed by [Stockmark Inc.](https://stockmark.co.jp/)

Please see our [blog post](https://tech.stockmark.co.jp/blog/202310_stockmark_13b/) for more details.

This project is supported by the [AWS LLM development support program](https://aws.amazon.com/jp/local/llm-development-support-program/).

## How to use

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model in bfloat16 and spread it across the available devices.
model = AutoModelForCausalLM.from_pretrained("stockmark/stockmark-13b", device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("stockmark/stockmark-13b")

# Tokenize a Japanese prompt ("自然言語処理とは" = "Natural language processing is ...")
# and sample a continuation.
inputs = tokenizer("自然言語処理とは", return_tensors="pt").to(model.device)
with torch.no_grad():
    tokens = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7
    )

output = tokenizer.decode(tokens[0], skip_special_tokens=True)
print(output)
```
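
For quick experiments, the checkpoint can also be driven through the Transformers `pipeline` API. The snippet below is a minimal sketch assuming a recent `transformers` release; the generation parameters simply mirror the example above and are not tuned recommendations.

```python
import torch
from transformers import pipeline

# Build a text-generation pipeline for the same checkpoint,
# loading the weights in bfloat16 across the available devices.
generator = pipeline(
    "text-generation",
    model="stockmark/stockmark-13b",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Sample a continuation of the same Japanese prompt.
result = generator("自然言語処理とは", max_new_tokens=128, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])
```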

## Examples

- LoRA tuning (in preparation): https://huggingface.co/stockmark/stockmark-13b/blob/main/notebooks/LoRA.ipynb (see the sketch below)
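
Until the notebook above is published, the following is a rough sketch of how LoRA tuning of this model could be set up with the Hugging Face PEFT library. It is not taken from the notebook: the target modules, rank, and other hyperparameters are illustrative assumptions for a Llama-2-style architecture.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "stockmark/stockmark-13b", device_map="auto", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("stockmark/stockmark-13b")

# Illustrative LoRA configuration (an assumption, not the notebook's settings):
# adapt only the attention query/value projections of the Llama-2 blocks.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are trainable

# The wrapped model can then be fine-tuned with a standard training loop or
# the Transformers Trainer on a Japanese instruction or domain dataset.
```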

## Training dataset

We used a Japanese corpus of about 220 billion tokens in total.

|Corpus|Tokens after preprocessing|
|:---:|:---:|
|Stockmark Web Corpus (this dataset will not be released)|9.1 billion|
|Patent|34.8 billion|
|Wikipedia|1.0 billion|
|CC100|10.9 billion|
|mC4|53.2 billion|
|CommonCrawl (snapshots: 2023-23, 2022-49, 2022-21, 2021-21)|112.9 billion|

## Library and Accelerator

- Library: [neuronx-nemo-megatron](https://github.com/aws-neuron/neuronx-nemo-megatron)
- Accelerator: [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/)

## License

[The MIT license](https://opensource.org/licenses/MIT)

## Developed by

[Stockmark Inc.](https://stockmark.co.jp/)

## Author

[Takahiro Omi](https://huggingface.co/omitakahiro)