heack committed on
Commit e9a4aa4
1 Parent(s): 0bef822

initial publish

README.md CHANGED
@@ -6,4 +6,80 @@ pipeline_tag: summarization
  tags:
  - mT5
  - summarization
- ---
+ ---
+
+ # HeackMT5-ZhSum100k: A Summarization Model for Chinese Texts
+
+ This model, `heack/HeackMT5-ZhSum100k`, is a fine-tuned mT5 model for Chinese text summarization. It was trained on a diverse set of Chinese datasets and generates coherent, concise summaries for a wide range of texts.
+
+ ## Model Details
+
+ - Model: mT5
+ - Language: Chinese
+ - Training data: mainly Chinese financial news sources (no BBC or CNN content); about 100k lines in total
+ - Fine-tuning epochs: 10
+
+ ## Evaluation Results
+
+ The model achieved the following results:
+
+ - ROUGE-1: 56.46
+ - ROUGE-2: 45.81
+ - ROUGE-L: 52.98
+ - ROUGE-Lsum: 20.22
+
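For context, ROUGE-1 measures unigram overlap between a candidate summary and its reference. Below is a minimal character-level sketch of the metric for illustration only; it is not the official scorer that produced the numbers above, and the function name is ours:

```python
from collections import Counter

def rouge_1_f(reference: str, candidate: str) -> float:
    """Character-level ROUGE-1 F1: unigram overlap between reference
    and candidate (a simplification of the official scorer)."""
    ref, cand = Counter(reference), Counter(candidate)
    overlap = sum((ref & cand).values())  # matched character count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge_1_f("警方发布诈骗案例", "警方发布一起诈骗典型案例"), 2))  # → 0.8
```

Published ROUGE scores are normally computed with a tokenizer-aware scorer; this sketch only shows the precision/recall/F1 structure of the metric.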
+ ## Usage
+
+ Here is how you can use this model for text summarization:
+
+ ```python
+ from transformers import MT5ForConditionalGeneration, T5Tokenizer
+
+ model = MT5ForConditionalGeneration.from_pretrained("heack/HeackMT5-ZhSum100k")
+ tokenizer = T5Tokenizer.from_pretrained("heack/HeackMT5-ZhSum100k")
+
+ chunk = """
+ 财联社5月22日讯,据平安包头微信公众号消息,近日,包头警方发布一起利用人工智能(AI)实施电信诈骗的典型案例,福州市某科技公司法人代表郭先生10分钟内被骗430万元。
+ 4月20日中午,郭先生的好友突然通过微信视频联系他,自己的朋友在外地竞标,需要430万保证金,且需要公对公账户过账,想要借郭先生公司的账户走账。
+ 基于对好友的信任,加上已经视频聊天核实了身份,郭先生没有核实钱款是否到账,就分两笔把430万转到了好友朋友的银行卡上。郭先生拨打好友电话,才知道被骗。骗子通过智能AI换脸和拟声技术,佯装好友对他实施了诈骗。
+ 值得注意的是,骗子并没有使用一个仿真的好友微信添加郭先生为好友,而是直接用好友微信发起视频聊天,这也是郭先生被骗的原因之一。骗子极有可能通过技术手段盗用了郭先生好友的微信。幸运的是,接到报警后,福州、包头两地警银迅速启动止付机制,成功止付拦截336.84万元,但仍有93.16万元被转移,目前正在全力追缴中。
+ """
+ inputs = tokenizer.encode("summarize: " + chunk, return_tensors='pt', max_length=512, truncation=True)
+ summary_ids = model.generate(inputs, max_length=150, num_beams=4, length_penalty=1.5, no_repeat_ngram_size=2)
+ summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
+
+ print(summary)
+ # 包头警方发布一起利用AI实施电信诈骗典型案例:法人代表10分钟内被骗430万元
+ ```
+ ## Credits
+ This model is trained and maintained by KongYang from Shanghai Jiao Tong University. For any questions, please reach out via my WeChat ID: kongyang.
+
+ ## License
+ This model is released under the CC BY-NC-SA 4.0 license.
+
+ ## Other
+ Thanks also to this project:
+ https://github.com/csebuetnlp/xl-sum/tree/master/seq2seq
+
+ This model is iteratively trained from google/mt5-base for 10 epochs, which took approximately 7 days on my machine with a GTX 4070 graphics card. The training set consists of 100,000 lines, mainly domestic Chinese news, and does not include sources such as the BBC, so summaries will not contain artifacts like "This article does not represent the views of the BBC". The dataset has also undergone deep filtering, virtually eliminating unexpected text, which makes it well suited for summarization. My pipeline splits the input into chunks of about 300 characters, summarizes each chunk, and then joins the partial summaries to cover longer documents.
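The chunk-and-join workflow described above can be sketched as follows. The 300-character chunk size comes from the description; the sentence-splitting heuristic and function names are illustrative, and `summarize` stands in for a call to the model as in the Usage section:

```python
import re
from typing import Callable, List

def split_into_chunks(text: str, chunk_size: int = 300) -> List[str]:
    """Split text into pieces of roughly chunk_size characters,
    breaking at Chinese sentence-ending punctuation where possible."""
    # Zero-width split that keeps the sentence-final punctuation.
    sentences = re.split(r"(?<=[。!?])", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > chunk_size:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks

def summarize_long_text(text: str, summarize: Callable[[str], str]) -> str:
    """Summarize each chunk independently, then join the partial
    summaries into a single text."""
    return "".join(summarize(chunk) for chunk in split_into_chunks(text))
```

To use this with the model, wrap the tokenize/generate/decode steps from the Usage section in a `summarize(chunk)` function and pass it to `summarize_long_text`.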
+
+ In the future, I plan to continue training on my 1-million-line dataset and open-source the resulting model. Please stay tuned.
+
+ Additionally, I have already incorporated this summarization into my local knowledge base project: https://github.com/erickong/document.ai
+
+ ## Citation
+
+ If you use this model in your research, please cite:
+
+ ```bibtex
+ @misc{kongyang2023heackmt5zhsum100k,
+   title={HeackMT5-ZhSum100k: A Large-Scale Multilingual Abstractive Summarization for Chinese Texts},
+   author={Kong Yang},
+   year={2023}
+ }
+ ```
config.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "_name_or_path": "heack/HeackMT5-ZhSum100k",
+   "architectures": [
+     "MT5ForConditionalGeneration"
+   ],
+   "d_ff": 2048,
+   "d_kv": 64,
+   "d_model": 768,
+   "decoder_start_token_id": 0,
+   "dropout_rate": 0.1,
+   "eos_token_id": 1,
+   "feed_forward_proj": "gated-gelu",
+   "initializer_factor": 1.0,
+   "is_encoder_decoder": true,
+   "layer_norm_epsilon": 1e-06,
+   "length_penalty": 0.6,
+   "max_length": 84,
+   "model_type": "mt5",
+   "num_beams": 4,
+   "num_decoder_layers": 12,
+   "num_heads": 12,
+   "num_layers": 12,
+   "output_past": true,
+   "pad_token_id": 0,
+   "relative_attention_num_buckets": 32,
+   "tie_word_embeddings": false,
+   "tokenizer_class": "T5Tokenizer",
+   "transformers_version": "4.10.0.dev0",
+   "use_cache": true,
+   "vocab_size": 250112
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:18c8d23ba8ca99183a85c3dd7e7e6b87bb653b7296789a50288c94c8a616a1a9
+ size 2329706751
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>"}
spiece.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ef78f86560d809067d12bac6c09f19a462cb3af3f54d2b8acbba26e1433125d6
+ size 4309802
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", "extra_ids": 0, "additional_special_tokens": null, "special_tokens_map_file": "/home/patrick/.cache/torch/transformers/685ac0ca8568ec593a48b61b0a3c272beee9bc194a3c7241d15dcadb5f875e53.f76030f3ec1b96a8199b2593390c610e76ca8028ef3d24680000619ffb646276", "tokenizer_file": null, "name_or_path": "google/mt5-base"}