guishe committed on
Commit
961dc97
1 Parent(s): bebb40b

Update README.md

  1. README.md +152 -29
README.md CHANGED
@@ -1,8 +1,10 @@
1
  ---
2
- license: mit
3
  base_model: numind/NuNER-v1.0
4
  tags:
5
- - generated_from_trainer
6
  metrics:
7
  - precision
8
  - recall
@@ -10,35 +12,142 @@ metrics:
10
  - accuracy
11
  model-index:
12
  - name: nuner-v1_ontonotes5
13
- results: []
14
  ---
15
 
16
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
17
- should probably proofread and complete it, then remove this comment. -->
18
-
19
- # nuner-v1_ontonotes5
20
-
21
- This model is a fine-tuned version of [numind/NuNER-v1.0](https://huggingface.co/numind/NuNER-v1.0) on an unknown dataset.
22
- It achieves the following results on the evaluation set:
23
- - Loss: 0.0728
24
- - Precision: 0.8712
25
- - Recall: 0.9000
26
- - F1: 0.8853
27
- - Accuracy: 0.9811
28
-
29
- ## Model description
30
-
31
- More information needed
32
-
33
- ## Intended uses & limitations
34
-
35
- More information needed
36
-
37
- ## Training and evaluation data
38
-
39
- More information needed
40
-
41
- ## Training procedure
42
 
43
  ### Training hyperparameters
44
 
@@ -70,3 +179,17 @@ The following hyperparameters were used during training:
70
  - Pytorch 2.0.0+cu117
71
  - Datasets 2.18.0
72
  - Tokenizers 0.15.2
1
  ---
2
+ license: apache-2.0
3
  base_model: numind/NuNER-v1.0
4
  tags:
5
+ - token-classification
6
+ - ner
7
+ - named-entity-recognition
8
  metrics:
9
  - precision
10
  - recall
 
12
  - accuracy
13
  model-index:
14
  - name: nuner-v1_ontonotes5
15
+ results:
16
+ - task:
17
+ type: token-classification
18
+ name: Named Entity Recognition
19
+ dataset:
20
+ name: OntoNotes5
21
+ type: tner/ontonotes5
22
+ split: eval
23
+ metrics:
24
+ - type: f1
25
+ value: 0.890930568316052
26
+ name: F1
27
+ - type: precision
28
+ value: 0.8777586206896552
29
+ name: Precision
30
+ - type: recall
31
+ value: 0.9045038642622368
32
+ name: Recall
33
+ - type: accuracy
34
+ value: 0.9818887790313181
35
+ name: Accuracy
36
+ datasets:
37
+ - tner/ontonotes5
38
+ language:
39
+ - en
40
+ library_name: transformers
41
+ pipeline_tag: token-classification
42
  ---
43
 
44
+ # numind/NuNER-v1.0 fine-tuned on OntoNotes5
45
+
46
+ This is a [NuNER](https://arxiv.org/abs/2402.15343) model fine-tuned on the [OntoNotes5](https://huggingface.co/datasets/tner/ontonotes5) dataset for Named Entity Recognition. NuNER uses [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base) as its backbone encoder and was further pre-trained on the [NuNER dataset](https://huggingface.co/datasets/numind/NuNER), a large and diverse set of 1M sentences synthetically labeled by gpt-3.5-turbo-0301. This further pre-training phase produces high-quality token embeddings, a good starting point for fine-tuning on more specialized datasets.
47
+
48
+ ## Model Details
49
+
50
+ The model was fine-tuned as a regular BERT-based token-classification model for the NER task, using the Hugging Face `Trainer` class.
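
Fine-tuning an encoder for NER usually requires aligning word-level labels with the sub-word tokens the tokenizer produces. The card does not show its preprocessing code, so the following is only a minimal sketch of that common step. The helper name `align_labels`, the example inputs, and the choice to label only the first sub-word of each word are assumptions, not the author's code.

```python
from typing import List, Optional

# -100 is the default ignore_index of PyTorch's cross-entropy loss,
# so positions carrying it do not contribute to the training loss.
IGNORE_INDEX = -100


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Spread word-level label ids over sub-word tokens.

    `word_ids` mirrors what a fast tokenizer's `word_ids()` returns: the
    index of the source word for each token, or None for special tokens.
    Only the first sub-word of each word keeps its label (an assumption).
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)            # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])    # first sub-word of a word
        else:
            aligned.append(IGNORE_INDEX)            # continuation sub-word
        previous = word_id
    return aligned


# Hypothetical example: three words, the second split into two sub-words.
example_word_ids = [None, 0, 1, 1, 2, None]
example_labels = [0, 5, 0]
print(align_labels(example_word_ids, example_labels))
# -> [-100, 0, 5, -100, 0, -100]
```

Labeling every sub-word instead of only the first is an equally common variant; which one was used here is not stated in the card.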
51
+
52
+ ## Model Labels
53
+
54
+ Entity Types: CARDINAL, DATE, PERSON, NORP, GPE, LAW, PERCENT, ORDINAL, MONEY, WORK_OF_ART, FAC, TIME, QUANTITY, PRODUCT, LANGUAGE, ORG, LOC, EVENT
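
A token-classification head predicts one tag per token, so the 18 entity types listed here are typically expanded into a tag set before training. The sketch below assumes a BIO scheme (`O`, `B-<TYPE>`, `I-<TYPE>`); the ordering is illustrative only, and the authoritative mapping is whatever the model's configuration stores in `id2label`.

```python
# Entity types exactly as listed in the model card.
ENTITY_TYPES = [
    "CARDINAL", "DATE", "PERSON", "NORP", "GPE", "LAW", "PERCENT", "ORDINAL",
    "MONEY", "WORK_OF_ART", "FAC", "TIME", "QUANTITY", "PRODUCT", "LANGUAGE",
    "ORG", "LOC", "EVENT",
]

# Assumed BIO expansion: one "outside" tag plus begin/inside tags per type.
# The index order here is hypothetical and may differ from the real model.
labels = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]
label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {idx: label for label, idx in label2id.items()}

print(len(labels))  # -> 37 (1 + 2 * 18)
```

Mappings like these are what one would pass as `id2label` and `label2id` when instantiating a token-classification model for fine-tuning.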
55
+
56
+ ## Uses
57
+
58
+ ### Direct Use for Inference
59
+
60
+ ```python
61
+ >>> from transformers import pipeline
62
+
63
+ >>> text = """Foreign governments may be spying on your smartphone notifications, senator says. Washington (CNN) — Foreign governments have reportedly attempted to spy on iPhone and Android users through the mobile app notifications they receive on their smartphones - and the US government has forced Apple and Google to keep quiet about it, according to a top US senator. Through legal demands sent to the tech giants, governments have allegedly tried to force Apple and Google to turn over sensitive information that could include the contents of a notification - such as previews of a text message displayed on a lock screen, or an update about app activity, Oregon Democratic Sen. Ron Wyden said in a new report. Wyden's report reflects the latest example of long-running tensions between tech companies and governments over law enforcement demands, which have stretched on for more than a decade. Governments around the world have particularly battled with tech companies over encryption, which provides critical protections to users and businesses while in some cases preventing law enforcement from pursuing investigations into messages sent over the internet."""
64
+
65
+ >>> classifier = pipeline(
66
+ "ner",
67
+ model="guishe/nuner-v1_ontonotes5",
68
+ aggregation_strategy="simple"
69
+ )
70
+ >>> classifier(text)
71
+
72
+ [{'entity_group': 'GPE',
73
+ 'score': 0.99179757,
74
+ 'word': ' Washington',
75
+ 'start': 82,
76
+ 'end': 92},
77
+ {'entity_group': 'ORG',
78
+ 'score': 0.9535868,
79
+ 'word': 'CNN',
80
+ 'start': 94,
81
+ 'end': 97},
82
+ {'entity_group': 'PRODUCT',
83
+ 'score': 0.6833637,
84
+ 'word': ' iPhone',
85
+ 'start': 157,
86
+ 'end': 163},
87
+ {'entity_group': 'PRODUCT',
88
+ 'score': 0.5540275,
89
+ 'word': ' Android',
90
+ 'start': 168,
91
+ 'end': 175},
92
+ {'entity_group': 'GPE',
93
+ 'score': 0.98848885,
94
+ 'word': ' US',
95
+ 'start': 263,
96
+ 'end': 265},
97
+ {'entity_group': 'ORG',
98
+ 'score': 0.9939406,
99
+ 'word': ' Apple',
100
+ 'start': 288,
101
+ 'end': 293},
102
+ {'entity_group': 'ORG',
103
+ 'score': 0.9933014,
104
+ 'word': ' Google',
105
+ 'start': 298,
106
+ 'end': 304},
107
+ {'entity_group': 'GPE',
108
+ 'score': 0.99083686,
109
+ 'word': ' US',
110
+ 'start': 348,
111
+ 'end': 350},
112
+ {'entity_group': 'ORG',
113
+ 'score': 0.99349517,
114
+ 'word': ' Apple',
115
+ 'start': 449,
116
+ 'end': 454},
117
+ {'entity_group': 'ORG',
118
+ 'score': 0.99239254,
119
+ 'word': ' Google',
120
+ 'start': 459,
121
+ 'end': 465},
122
+ {'entity_group': 'GPE',
123
+ 'score': 0.99598336,
124
+ 'word': ' Oregon',
125
+ 'start': 649,
126
+ 'end': 655},
127
+ {'entity_group': 'NORP',
128
+ 'score': 0.99030787,
129
+ 'word': ' Democratic',
130
+ 'start': 656,
131
+ 'end': 666},
132
+ {'entity_group': 'PERSON',
133
+ 'score': 0.9957912,
134
+ 'word': ' Ron Wyden',
135
+ 'start': 672,
136
+ 'end': 681},
137
+ {'entity_group': 'PERSON',
138
+ 'score': 0.83941424,
139
+ 'word': ' Wyden',
140
+ 'start': 704,
141
+ 'end': 709},
142
+ {'entity_group': 'DATE',
143
+ 'score': 0.87746465,
144
+ 'word': ' more than a decade',
145
+ 'start': 869,
146
+ 'end': 887}]
147
+ ```
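
The pipeline returns a flat list of span dictionaries, some with low scores (for example the `PRODUCT` spans above). A small post-processing step can make the result easier to consume. The following is a sketch only: `group_entities` and the `min_score` threshold of 0.8 are hypothetical choices, and `sample` is a hand-copied subset of the output shown above rather than a fresh model run.

```python
from collections import defaultdict


def group_entities(predictions, min_score=0.8):
    """Drop low-confidence spans and collect unique surface forms per type."""
    grouped = defaultdict(list)
    for pred in predictions:
        if pred["score"] < min_score:
            continue
        # RoBERTa-style tokenization leaves a leading space on most words.
        word = pred["word"].strip()
        if word not in grouped[pred["entity_group"]]:
            grouped[pred["entity_group"]].append(word)
    return dict(grouped)


# Subset of the pipeline output shown in the card.
sample = [
    {"entity_group": "GPE", "score": 0.99179757, "word": " Washington", "start": 82, "end": 92},
    {"entity_group": "ORG", "score": 0.9535868, "word": "CNN", "start": 94, "end": 97},
    {"entity_group": "PRODUCT", "score": 0.6833637, "word": " iPhone", "start": 157, "end": 163},
    {"entity_group": "ORG", "score": 0.9939406, "word": " Apple", "start": 288, "end": 293},
    {"entity_group": "ORG", "score": 0.99349517, "word": " Apple", "start": 449, "end": 454},
]

print(group_entities(sample))
# -> {'GPE': ['Washington'], 'ORG': ['CNN', 'Apple']}
```

The right threshold depends on the application; a lower value keeps uncertain spans such as `iPhone` at the cost of more false positives.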
148
+
149
+
150
+ ## Training Details
151
 
152
  ### Training hyperparameters
153
 
 
179
  - Pytorch 2.0.0+cu117
180
  - Datasets 2.18.0
181
  - Tokenizers 0.15.2
182
+
183
+ ## Citation
184
+
185
+ ### BibTeX
186
+ ```
187
+ @misc{bogdanov2024nuner,
188
+ title={NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data},
189
+ author={Sergei Bogdanov and Alexandre Constantin and Timothée Bernard and Benoit Crabbé and Etienne Bernard},
190
+ year={2024},
191
+ eprint={2402.15343},
192
+ archivePrefix={arXiv},
193
+ primaryClass={cs.CL}
194
+ }
195
+ ```