Sami92 committed on
Commit
04e6362
1 Parent(s): d7c567e

Create README.md

Files changed (1)
  1. README.md +102 -0
README.md ADDED
---
license: mit
language:
- de
- es
- en
pipeline_tag: token-classification
tags:
- politics
- communication
- public sphere
---
# Model Card for XLM-PER-L

This is a Named Entity Recognition (NER) model fine-tuned to recognize five types of public entities:
- Politicians
- Parties
- Authorities
- Media
- Journalists

## Model Details

The model performs Public Entity Recognition (PER), a domain-specific variant of NER trained for five entity types that are common in public discourse: politicians, parties, authorities, media, and journalists. PER can be used for preprocessing documents, in a pipeline with other classifiers, or directly for analyzing information in texts. The taxonomy for PER is taken from the database of (German) public speakers (Schmidt et al., 2023) and aims at low-threshold integration into computational social science research.

## Bias, Risks, and Limitations

The performance for female entities (which only applies to politicians and journalists) is slightly below that for male entities. This holds both for entities referred to by name (Annalena Baerbock/Olaf Scholz) and for entities referred to by profession (Innenministerin/Innenminister, i.e. the female/male forms of "interior minister").

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model, in particular the slightly lower performance on female entities described above.

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the fine-tuned Public Entity Recognition model and its tokenizer
model_name = "Sami92/XLM-PER-L"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Build a token-classification (NER) pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)

text = '''
Nach dem Treffen mit Außenministerin Baerbock betont Israels Premier die Eigenständigkeit seines Landes.
Baerbock hatte zur Zurückhaltung aufgerufen.
Nach seinem Treffen mit Außenministerin Annalena Baerbock und dem britischen Außenminister David Cameron dringt der israelische Ministerpräsident Benjamin Netanjahu auf die Unabhängigkeit seines Landes.
'''

entities = ner_pipeline(text)

# Print each detected entity with its predicted type and confidence score
for entity in entities:
    print(f"Entity: {entity['word']}, Type: {entity['entity']}, Score: {entity['score']:.4f}")
```
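
By default the pipeline returns one prediction per sub-word token. If you prefer whole entity spans, the `aggregation_strategy` argument of the Transformers token-classification pipeline merges the pieces. The following is a minimal sketch of that option; the sample sentence is only illustrative, and grouped predictions use the `entity_group` key instead of `entity`.

```python
from transformers import pipeline

# Sketch: let the pipeline merge sub-word tokens into full entity spans.
# aggregation_strategy="simple" groups consecutive tokens with the same predicted type.
grouped = pipeline(
    "token-classification",
    model="Sami92/XLM-PER-L",
    aggregation_strategy="simple",
)

sample = "Annalena Baerbock und Olaf Scholz sprachen mit Journalisten der Tagesschau."
for entity in grouped(sample):
    # Grouped predictions expose the span text, its type, and an averaged score
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2f})")
```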

## Training Details

### Training Data

The model was first fine-tuned on a weakly annotated dataset of German newspaper articles (267,786 in total) and German Wikipedia articles (4,348 in total).
The weak annotation was based on the [database of public speakers](https://github.com/Leibniz-HBI/DBoeS-data/).
In a second step, the model was fine-tuned on a manually annotated dataset of 3,090 sentences from similar sources. The test split of this data was used for evaluation.
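
For intuition, weak annotation of this kind can be produced by matching known names from the speaker database against raw text and emitting BIO tags. The sketch below is a simplified, hypothetical illustration: the `GAZETTEER` entries and label names are invented for the example and do not reproduce the actual DBoeS data layout or the annotation pipeline used for this model.

```python
# Hypothetical sketch of gazetteer-based weak labeling (not the actual pipeline).
# GAZETTEER maps surface forms to entity types; real entries would come from DBoeS.
GAZETTEER = {
    "Olaf Scholz": "POLITICIAN",
    "SPD": "PARTY",
    "Tagesschau": "MEDIA",
}

def weak_label(tokens):
    """Assign BIO tags by greedy longest match against the gazetteer."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try longer spans first so multi-word names win over single tokens
        for j in range(len(tokens), i, -1):
            span = " ".join(tokens[i:j])
            if span in GAZETTEER:
                entity_type = GAZETTEER[span]
                labels[i] = f"B-{entity_type}"
                for k in range(i + 1, j):
                    labels[k] = f"I-{entity_type}"
                i = j
                matched = True
                break
        if not matched:
            i += 1
    return labels

tokens = "Olaf Scholz traf Journalisten der Tagesschau .".split()
print(list(zip(tokens, weak_label(tokens))))
```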

#### Training Hyperparameters

- Learning Rate = 5e-6
- Scheduler = Reduce learning rate on plateau
- Batch size = 8
- Epochs = 20
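
As a rough illustration of how these hyperparameters fit together, here is a hypothetical PyTorch training skeleton using `torch.optim.lr_scheduler.ReduceLROnPlateau`. The base checkpoint, label count, and dummy data are assumptions made for the sketch; this is not the training script that produced the model.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumptions for the sketch: placeholder base checkpoint and 11 labels
# (BIO tags for the five entity types plus "O"); two dummy sentences as data.
checkpoint = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=11)

texts = ["Olaf Scholz traf Journalisten.", "Die SPD kritisierte den Bericht."]
encodings = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
dummy_labels = torch.zeros_like(encodings["input_ids"])  # all-"O" placeholder labels
dataset = [
    {
        "input_ids": encodings["input_ids"][i],
        "attention_mask": encodings["attention_mask"][i],
        "labels": dummy_labels[i],
    }
    for i in range(len(texts))
]
train_loader = DataLoader(dataset, batch_size=8)  # batch size = 8

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)         # learning rate = 5e-6
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)  # reduce LR on plateau

for epoch in range(20):  # epochs = 20
    model.train()
    epoch_loss = 0.0
    for batch in train_loader:
        optimizer.zero_grad()
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    # In the real setup the scheduler would watch a validation metric instead
    scheduler.step(epoch_loss)
```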

#### Metrics

Evaluated on the test split of the manually annotated dataset:

- F1: 0.82
- Recall: 0.80
- Precision: 0.85
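
Entity-level precision, recall, and F1 for NER are commonly computed with the `seqeval` package. The snippet below is a generic sketch with made-up gold and predicted tag sequences and illustrative label names; it is not the evaluation script behind the scores above.

```python
# pip install seqeval
from seqeval.metrics import f1_score, precision_score, recall_score

# Made-up gold and predicted BIO sequences for two sentences (label names illustrative)
y_true = [
    ["B-POLITICIAN", "I-POLITICIAN", "O", "B-MEDIA"],
    ["B-PARTY", "O", "O"],
]
y_pred = [
    ["B-POLITICIAN", "I-POLITICIAN", "O", "O"],
    ["B-PARTY", "O", "O"],
]

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
```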

## Model Card Authors

Sami Nenno

## Citation

```bibtex
@misc{nenno2024per,
  author    = {Nenno, Sami},
  title     = {Public Entity Recognition Model},
  year      = {2024},
  publisher = {HuggingFace},
  journal   = {HuggingFace Model Repository},
  url       = {https://huggingface.co/Sami92/XLM-PER-L}
}
```