Guan committed on
Commit
d68d475
1 Parent(s): fb9e083

Create README.md

Files changed (1)
  1. README.md +105 -0
README.md ADDED
@@ -0,0 +1,105 @@
+ ---
+ language:
+ - zh
+ thumbnail: http://coai.cs.tsinghua.edu.cn/coai/img/logo.png?v=13923
+ tags:
+ - pytorch
+ - lm-head
+ - zh
+ datasets:
+ metrics:
+ widget:
+ - text: "小咕噜对靳司寒完全是个自来熟,小家伙爬进他怀里小手搂着他的脖子,奶声奶气的要求:“靳蜀黎,你给咕噜讲故事好不好?”讲故事?童话故事吗?“我不会。”小家伙明显不信。嘟着小嘴大眼汪汪的盯着他,“哼。”小家伙轻轻哼了一声,靳司寒默了半晌,<extra_id_1>"
+ - text: "美女亲自打招呼,这可是破天荒第一次,之前不管他献多少次殷勤,美女<extra_id_1>甩他,难道今天真是老天<extra_id_2>不敢<extra_id_3>的兄连滚带爬的来到<extra_id_4>身边队友都带着艳<extra_id_5>他,<extra_id_6>连计算机系的那票球友都在那儿不住地偷看MAGGIE,这种感觉真<extra_id_7>毙了!"
+ inference:
+   parameters:
+     top_p: 0.9
+ ---
+ ## LongLM
+
+ ### 1. Parameters
+
+ | Versions     | $d_m$ | $d_{ff}$ | $d_{kv}$ | $n_h$ | $n_e/n_d$ | \#P  |
+ | ------------ | ----- | -------- | -------- | ----- | --------- | ---- |
+ | LongLM-small | 512   | 2,048    | 64       | 8     | 6/6       | 60M  |
+ | LongLM-base  | 768   | 3,072    | 64       | 12    | 12/12     | 223M |
+ | LongLM-large | 1,536 | 3,072    | 64       | 12    | 24/32     | 1B   |
+
+ - $d_m$: the dimension of hidden states
+ - $d_{ff}$: the dimension of the feed-forward layers
+ - $d_{kv}$: the dimension of the keys/values in the self-attention layers
+ - $n_h$: the number of attention heads
+ - $n_e$: the number of hidden layers of the encoder
+ - $n_d$: the number of hidden layers of the decoder
+ - \#P: the number of parameters
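
The table can also be read as a small configuration record. The sketch below is illustrative only: the `LongLMConfig` class, its field names, and the `VERSIONS` dictionary are hypothetical and not part of the released code; the values are copied from the table. It computes the inner attention width $n_h \times d_{kv}$, which, going by the listed values, equals $d_m$ for the small and base versions but not for the large one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LongLMConfig:
    """Hypothetical container for one row of the parameter table."""
    d_m: int   # dimension of hidden states
    d_ff: int  # dimension of the feed-forward layers
    d_kv: int  # dimension of the keys/values per attention head
    n_h: int   # number of attention heads
    n_e: int   # number of encoder layers
    n_d: int   # number of decoder layers

    @property
    def inner_attention_dim(self) -> int:
        # Width of the concatenated attention heads.
        return self.n_h * self.d_kv


# Values copied from the table above.
VERSIONS = {
    "LongLM-small": LongLMConfig(512, 2048, 64, 8, 6, 6),
    "LongLM-base": LongLMConfig(768, 3072, 64, 12, 12, 12),
    "LongLM-large": LongLMConfig(1536, 3072, 64, 12, 24, 32),
}

for name, cfg in VERSIONS.items():
    same = cfg.inner_attention_dim == cfg.d_m
    print(f"{name}: n_h * d_kv = {cfg.inner_attention_dim}, equals d_m: {same}")
```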
+
+ ### 2. Pretraining Tasks
+
+ Encoder-decoder models are typically trained by maximizing the likelihood of the target output given an input. To improve the capacities of both the encoder and the decoder, we train LongLM with two pretraining tasks: text infilling (Raffel et al., 2020) and conditional continuation (Radford et al., 2019). In the first task, the input is a text in which a number of spans are sampled and replaced by special tokens with unique IDs, and the output is the sequence of masked spans, delimited by the same special tokens used in the input. The lengths of the masked spans are drawn from a Poisson distribution with λ=3, and the masked tokens together comprise 15% of the original text. In the second task, a text is split at a random position into two parts; the input is the front part and the output is the back part.
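
The two tasks can be sketched on a plain token list as follows. This is a minimal illustration under stated assumptions, not the authors' preprocessing code: the function names are hypothetical, sentinel numbering starts at `<extra_id_1>` only because the widget examples do, zero-length Poisson draws are clamped to one token, and spans are kept non-adjacent so that each sentinel marks exactly one span.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) integer with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def text_infilling(tokens, rng, mask_ratio=0.15, lam=3.0):
    """Mask random spans; return (input tokens, target tokens)."""
    if not tokens:
        return [], []
    budget = max(1, round(len(tokens) * mask_ratio))  # tokens left to mask
    masked = [False] * len(tokens)
    attempts = 0
    while budget > 0 and attempts < 1000:
        attempts += 1
        # Assumption: clamp the span length to [1, remaining budget].
        length = min(max(1, sample_poisson(lam, rng)), budget)
        start = rng.randrange(0, len(tokens) - length + 1)
        lo, hi = max(0, start - 1), min(len(tokens), start + length + 1)
        if any(masked[lo:hi]):  # overlaps or touches an earlier span
            continue
        masked[start:start + length] = [True] * length
        budget -= length
    source, target, sentinel, i = [], [], 0, 0
    while i < len(tokens):
        if masked[i]:
            sentinel += 1
            tag = f"<extra_id_{sentinel}>"
            source.append(tag)  # the span is replaced by one sentinel
            target.append(tag)  # the same sentinel delimits the span
            while i < len(tokens) and masked[i]:
                target.append(tokens[i])
                i += 1
        else:
            source.append(tokens[i])
            i += 1
    return source, target


def conditional_continuation(tokens, rng):
    """Split at a random position; needs at least two tokens."""
    cut = rng.randrange(1, len(tokens))
    return tokens[:cut], tokens[cut:]


rng = random.Random(0)
words = [f"w{i}" for i in range(40)]
print(text_infilling(words, rng))
print(conditional_continuation(words, rng))
```

In practice the masking would run over tokenizer output rather than whitespace-style tokens; the list-based version only makes the span bookkeeping visible.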
+
+ ### 3. Pretraining Data
+
+ We collected 120G of novels as the pretraining data for LongLM.
+
+ ### 4. Checkpoints
+
+
+ 1. **Model Loading:**
+
+ ```python
+ from transformers import T5Tokenizer, T5ForConditionalGeneration
+ tokenizer = T5Tokenizer.from_pretrained('LongLM-large')  # adjust to where the checkpoint is stored
+ model = T5ForConditionalGeneration.from_pretrained('LongLM-large')
+ ```
+
+
+ 2. **Generation:**
+
+ ```python
+ device = model.device  # use whichever device the loaded model is on
+ input_ids = tokenizer("小咕噜对,<extra_id_1>", return_tensors="pt", padding=True, truncation=True, max_length=512).input_ids.to(device)
+ gen = model.generate(input_ids, do_sample=True, decoder_start_token_id=1, top_p=0.9, max_length=512)  # nucleus sampling
+ ```
+
+
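
For the infilling prompts, the generated text consists of spans delimited by the sentinel tokens, so a small post-processing step can put them back into the input. The sketch below assumes the output was decoded with the sentinels kept (for example `tokenizer.decode(gen[0], skip_special_tokens=False)`) and that padding and end-of-sequence markers appear as `<pad>` and `</s>`; the helper names and the English example strings are made up for illustration.

```python
import re

SENTINEL = re.compile(r"<extra_id_(\d+)>")


def parse_spans(generated):
    """Map each sentinel ID in the decoded output to the text that follows it."""
    cleaned = generated.replace("<pad>", "").replace("</s>", "")
    # With a capture group, split() yields [prefix, id, text, id, text, ...].
    parts = SENTINEL.split(cleaned)
    spans = {}
    for sentinel_id, text in zip(parts[1::2], parts[2::2]):
        spans.setdefault(int(sentinel_id), text.strip())
    return spans


def fill_in(source, generated):
    """Replace every sentinel in the input with its generated span."""
    spans = parse_spans(generated)
    # A sentinel without a generated span is left in place.
    return SENTINEL.sub(lambda m: spans.get(int(m.group(1)), m.group(0)), source)


# Hand-written stand-ins for a prompt and a decoded model output.
prompt = "The quick <extra_id_1> fox jumps over the <extra_id_2>."
decoded = "<pad><extra_id_1> brown<extra_id_2> lazy dog</s>"
print(fill_in(prompt, decoded))
```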
+ ### 5. Dependencies
+
+ ```
+ datasets 1.6.2
+ deepspeed 0.3.16
+ huggingface-hub 0.0.8
+ jieba 0.42.1
+ jsonlines 2.0.0
+ nltk 3.5
+ numpy 1.19.5
+ pytorch-lightning 1.2.0
+ regex 2020.11.13
+ rouge 1.0.1
+ rouge-score 0.0.4
+ sacrebleu 1.5.0
+ scipy 1.5.4
+ sentencepiece 0.1.95
+ tokenizers 0.10.1
+ torch 1.8.1
+ torchaudio 0.8.0
+ torchmetrics 0.2.0
+ torchvision 0.9.0
+ transformers 4.6.1
+ ```
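
One way to compare an environment against this list is to query the installed versions with the standard library. The sketch below is an assumption-laden helper, not part of the repository: `PINNED` holds only three of the listed packages for brevity, `check_versions` is a hypothetical name, and local build suffixes such as `+cu111` are ignored when comparing.

```python
from importlib.metadata import PackageNotFoundError, version

# A subset of the pinned versions listed above.
PINNED = {"transformers": "4.6.1", "torch": "1.8.1", "sentencepiece": "0.1.95"}


def check_versions(pinned, lookup=version):
    """Return {package: (expected, installed or None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        # Drop a local build suffix such as "+cu111" before comparing.
        base = installed.split("+")[0] if installed is not None else None
        if base != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for package, (expected, installed) in check_versions(PINNED).items():
    print(f"{package}: expected {expected}, found {installed}")
```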
+
+ ### 6. Contributors
+
+ [Jian Guan](https://jianguanthu.github.io/) at [thu-coai](http://coai.cs.tsinghua.edu.cn/)
+
+ ## Citation
+
+ ```txt
+ @misc{guan2021lot,
+       title={LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation},
+       author={Jian Guan and Zhuoer Feng and Yamei Chen and Ruilin He and Xiaoxi Mao and Changjie Fan and Minlie Huang},
+       year={2021},
+       eprint={2108.12960},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL}
+ }
+ ```