AgaMiko committed on
Commit
d387681
·
1 Parent(s): 2bb8b4d

Update README.md

Files changed (1)
  1. README.md +10 -3
README.md CHANGED
@@ -4,7 +4,7 @@ language:
4
  - pl
5
  - en
6
  datasets:
7
- - Curlicat
8
  pipeline_tag: text2text-generation
9
  tags:
10
  - keywords-generation
@@ -25,13 +25,20 @@ metrics:
25
  ---
26
  # Keyword Extraction from Short Texts with T5
27
 
28
- Our vlT5 model is a keyword generation model based on the encoder-decoder architecture with Transformer blocks presented by Google ([https://huggingface.co/t5-base](https://huggingface.co/t5-base)). The model's input is text preceded by a prefix, and the output is the target text; the prefix defines the type of task, e.g. "Translate from Polish to English:". vlT5 was trained on a corpus of scientific articles to predict a given set of keyphrases from the concatenation of the article’s abstract and title. It generates precise, though not always complete, keyphrases that describe the content of the article based only on the abstract.
29
 
30
  The biggest advantage of the vlT5 model is its transferability: it works well on all domains and types of text. The downside is that the text length and the number of keywords stay close to the training data: a piece of text of abstract length yields approximately 3 to 5 keywords. It works both extractively and abstractively. Longer pieces of text must be split into smaller chunks and then passed to the model.
31
 
32
  # Corpus
33
 
34
- The model was trained on the CURLICAT corpus.
35
 
36
 
37
  | Domains | Documents | With keywords |
 
4
  - pl
5
  - en
6
  datasets:
7
+ - posmac
8
  pipeline_tag: text2text-generation
9
  tags:
10
  - keywords-generation
 
25
  ---
26
  # Keyword Extraction from Short Texts with T5
27
 
28
+ Our vlT5 model is a keyword generation model based on the encoder-decoder architecture with Transformer blocks presented by Google ([https://huggingface.co/t5-base](https://huggingface.co/t5-base)). vlT5 was trained on a corpus of scientific articles to predict a given set of keyphrases from the concatenation of the article’s abstract and title. It generates precise, though not always complete, keyphrases that describe the content of the article based only on the abstract.
29
 
30
  The biggest advantage of the vlT5 model is its transferability: it works well on all domains and types of text. The downside is that the text length and the number of keywords stay close to the training data: a piece of text of abstract length yields approximately 3 to 5 keywords. It works both extractively and abstractively. Longer pieces of text must be split into smaller chunks and then passed to the model.
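
The chunking step mentioned above might look roughly like the following. This is only a sketch, not the authors' preprocessing: the function name `split_into_chunks`, the sentence-boundary regex, and the `max_words=200` budget (a rough stand-in for "abstract length") are all assumptions made for illustration.

```python
import re


def split_into_chunks(text: str, max_words: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most `max_words` words.

    `max_words` is a hypothetical budget approximating the length of an
    abstract; it is not a value taken from the model card.
    A single sentence longer than `max_words` is kept as its own chunk.
    """
    if max_words < 1:
        raise ValueError("max_words must be positive")

    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if n_words == 0:
            continue
        # Close the current chunk if adding this sentence would exceed the budget.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk would then be passed to the model separately and the resulting keywords merged, for example by removing duplicates. How the model is loaded and called is left out here because this excerpt of the README does not show it.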
31
 
32
+ ### Overview
33
+ - **Language model:** [t5-base](https://huggingface.co/t5-base)
34
+ - **Language:** pl, en (but works relatively well with others)
35
+ - **Training data:** POSMAC
36
+ - **Online Demo:** [https://nlp-demo-1.voicelab.ai/](https://nlp-demo-1.voicelab.ai/)
37
+ - **Paper:** [TBA](TBA)
38
+
39
  # Corpus
40
 
41
+ The model was trained on the POSMAC corpus. The Polish Open Science Metadata Corpus (POSMAC) is a collection of 216,214 abstracts of scientific publications compiled in the CURLICAT project.
42
 
43
 
44
  | Domains | Documents | With keywords |