emilys committed on
Commit c2cca7d
1 Parent(s): 1ac8449

Create README.md

Files changed (1): README.md +87 -0
README.md ADDED

---
license: cc-by-4.0
language:
- en
pipeline_tag: token-classification
---

# Byline Detection

## Model description

**byline_detection** is a fine-tuned DistilBERT token classification model that tags bylines and datelines in news articles.

It is trained to handle OCR noise.


## Intended uses

You can use this model with the Transformers pipeline for NER.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the fine-tuned model and tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("dell-research-harvard/byline-detection")
model = AutoModelForTokenClassification.from_pretrained("dell-research-harvard/byline-detection")

# Build a token-classification pipeline and run it on an OCR'd news snippet
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "NEW ORLEANS, (UP) — The Roman Catholic Church, through its leaders in the United States today appealed "

ner_results = nlp(example)
print(ner_results)
```

## Limitations and bias

This model was trained on historical news and may reflect biases from a specific period of time. It may also not generalise well to other settings.
Additionally, the model occasionally tags subword tokens as entities, and post-processing of the results may be necessary to handle those cases.
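
One way to handle this is the Transformers pipeline's `aggregation_strategy` option, which groups contiguous subword pieces into word-level spans. Treat the snippet below as a sketch and check on your own data whether the simple strategy holds up on noisy OCR text.

```python
from transformers import pipeline

# Sketch: let the pipeline merge subword tokens into word-level entity spans.
# aggregation_strategy="simple" is a generic Transformers option, not something
# specific to this model.
nlp = pipeline(
    "ner",
    model="dell-research-harvard/byline-detection",
    aggregation_strategy="simple",
)

example = "NEW ORLEANS, (UP) — The Roman Catholic Church, through its leaders in the United States today appealed "
for span in nlp(example):
    # Each result now covers a whole span, with start/end character offsets.
    print(span["entity_group"], repr(span["word"]), span["start"], span["end"], round(float(span["score"]), 3))
```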

## Training data

This model was fine-tuned on historical English-language news that had been OCR'd from American newspapers.

#### Number of training examples per split
Dataset|Count
-|-
Train|1,392
Dev|464
Test|464

## Training procedure

The data was used to fine-tune a DistilBERT model at a learning rate of 2e−5 with a batch size of 16 for 25 epochs.
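
As a rough sketch of this setup, the snippet below wires these hyperparameters into the Hugging Face `Trainer`. The base checkpoint, label names, and tiny toy dataset are placeholders for illustration, not the actual training data or label scheme.

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer,
)

# Hypothetical label scheme and toy examples; the real data is the OCR'd
# American news described above, with byline/dateline annotations.
label_list = ["O", "B-BYLINE", "I-BYLINE", "B-DATELINE", "I-DATELINE"]
tokens = [["By", "JOHN", "SMITH", "United", "Press"],
          ["NEW", "ORLEANS,", "(UP)", "—", "The", "church", "appealed"]]
tags = [[1, 2, 2, 0, 0],
        [3, 4, 4, 0, 0, 0, 0]]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_and_align(example):
    # Tokenize pre-split words and copy each word's label to its subword pieces
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = [
        -100 if word_id is None else example["tags"][word_id]
        for word_id in enc.word_ids()
    ]
    return enc

train_dataset = Dataset.from_dict({"tokens": tokens, "tags": tags}).map(tokenize_and_align)

model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(label_list)
)

args = TrainingArguments(
    output_dir="byline-detection",
    learning_rate=2e-5,               # hyperparameters from the card
    per_device_train_batch_size=16,
    num_train_epochs=25,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
)
trainer.train()
```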

## Eval results

Statistic|Result
-|-
F1 | 0.96
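
If the labels follow a BIO scheme (an assumption; the card does not list the label names), entity-level precision, recall, and F1 of this kind can be computed with the `seqeval` library, for example:

```python
from seqeval.metrics import classification_report, f1_score

# Toy illustration with hypothetical label names; the reported score was
# computed on the held-out test split, not on data like this.
y_true = [["B-BYLINE", "I-BYLINE", "O", "O", "O"]]
y_pred = [["B-BYLINE", "I-BYLINE", "O", "O", "O"]]

print(f1_score(y_true, y_pred))          # 1.0 on this toy example
print(classification_report(y_true, y_pred))
```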

## Notes

This model card was influenced by that of [dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER).

## Citation

If you use this model, you can cite the following paper:

```bibtex
@misc{silcock2024newswirelargescalestructureddatabase,
  title={Newswire: A Large-Scale Structured Database of a Century of Historical News},
  author={Emily Silcock and Abhishek Arora and Luca D'Amico-Wong and Melissa Dell},
  year={2024},
  eprint={2406.09490},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2406.09490},
}
```

## Applications

We applied this model to a century of historical news articles and georeferenced the bylines. You can see them all in the [NEWSWIRE dataset](https://huggingface.co/datasets/dell-research-harvard/newswire).
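
A minimal loading sketch is below; the configuration, split names, and fields are whatever the dataset repository defines, so inspect the returned object rather than relying on the names used here.

```python
from datasets import load_dataset

# Sketch: load the NEWSWIRE dataset referenced above and inspect its structure.
newswire = load_dataset("dell-research-harvard/newswire")
print(newswire)
```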