dell-research-harvard/byline-detection


emilys committed on Aug 15

Commit c2cca7d (1 parent: 1ac8449)

Create README.md

Files changed (1)

README.md +87 -0

README.md ADDED

	@@ -0,0 +1,87 @@

+---
+license: cc-by-4.0
+language:
+- en
+pipeline_tag: token-classification
+---
+# Byline Detection
+## Model description
+**byline_detection** is a fine-tuned DistilBERT token classification model that tags bylines and datelines in news articles.
+It is trained to deal with OCR noise.
+## Intended uses
+You can use this model with the Transformers `pipeline` for NER.
+```python
+from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
+
+# Load the fine-tuned tokenizer and model from the Hugging Face Hub.
+tokenizer = AutoTokenizer.from_pretrained("dell-research-harvard/byline-detection")
+model = AutoModelForTokenClassification.from_pretrained("dell-research-harvard/byline-detection")
+
+# Build a token-classification ("ner") pipeline around them.
+nlp = pipeline("ner", model=model, tokenizer=tokenizer)
+
+# An article opening that starts with a dateline and a wire-service tag.
+example = "NEW ORLEANS, (UP) — The Roman Catholic Church, through its leaders in the United States today appealed "
+ner_results = nlp(example)
+print(ner_results)  # a list with one dict per tagged token
+```
+## Limitations and bias
+This model was trained on historical news and may reflect biases from a specific period of time. It may also not generalise well to other settings.
+Additionally, the model occasionally tags subword tokens as entities, so post-processing of the results may be necessary to handle those cases.
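One possible way to handle the subword cases mentioned above is to fold WordPiece continuation pieces (those starting with `##`) back into the tagged token they follow. The sketch below is only an illustration under assumptions: the function name `merge_subword_tokens` is hypothetical, the `B-BYLINE` / `I-BYLINE` labels are placeholders (this card does not list the model's label names), and dropping a continuation piece whose word start was not tagged is a design choice, not something the authors prescribe. It works on the list of dicts that the `ner` pipeline returns.

```python
from typing import Any, Dict, List


def merge_subword_tokens(tokens: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Fold WordPiece continuation pieces ("##...") into the preceding tagged token."""
    merged: List[Dict[str, Any]] = []
    for tok in tokens:
        if tok["word"].startswith("##"):
            prev = merged[-1] if merged else None
            # Attach only if this piece directly continues the previous tagged token.
            if prev is not None and prev["end"] == tok["start"]:
                prev["word"] += tok["word"][2:]
                prev["end"] = tok["end"]
                prev["scores"].append(float(tok["score"]))
            # Otherwise the start of the word was not tagged: drop the orphan piece.
            continue
        merged.append({
            "entity": tok["entity"],  # the merged word keeps the label of its first piece
            "word": tok["word"],
            "start": tok["start"],
            "end": tok["end"],
            "scores": [float(tok["score"])],
        })
    # Replace the per-piece scores with their mean.
    for item in merged:
        scores = item.pop("scores")
        item["score"] = sum(scores) / len(scores)
    return merged


# Hypothetical pipeline output; the labels are placeholders, not the model's real ones.
sample = [
    {"entity": "B-BYLINE", "score": 1.0, "index": 1, "word": "ami", "start": 0, "end": 3},
    {"entity": "I-BYLINE", "score": 0.5, "index": 2, "word": "##co", "start": 3, "end": 5},
    {"entity": "B-BYLINE", "score": 0.9, "index": 9, "word": "##ed", "start": 30, "end": 32},
]
print(merge_subword_tokens(sample))
```

The Transformers token-classification pipeline also accepts an `aggregation_strategy` argument (for example `"first"`) that groups pieces in a similar way. Whether it suits this model's label scheme is not stated in this card, so it is worth checking on a few examples first.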
+## Training data
+This model was fine-tuned on historical English-language news that had been OCR'd from American newspapers.
+#### Number of examples per split
+Split|Count
+-|-
+Train|1,392
+Dev|464
+Test|464
+## Training procedure
+A DistilBERT model was fine-tuned on this data for 25 epochs with a learning rate of 2e-5 and a batch size of 16.
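To give a feel for the size of this training run, the sketch below collects the stated hyperparameters and works out how many optimizer steps they imply for the 1,392 training examples. The class name `FineTuneConfig` is hypothetical, and the step count assumes a single device with no gradient accumulation, neither of which this card confirms.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as stated in the model card (the class itself is illustrative)."""
    learning_rate: float = 2e-5
    batch_size: int = 16
    num_epochs: int = 25
    num_train_examples: int = 1392  # "Train" row of the table in this card

    def steps_per_epoch(self) -> int:
        # Assumption: one device, no gradient accumulation, final partial batch kept.
        return math.ceil(self.num_train_examples / self.batch_size)

    def total_steps(self) -> int:
        return self.steps_per_epoch() * self.num_epochs


config = FineTuneConfig()
print(config.steps_per_epoch(), config.total_steps())
```

If the Hugging Face `Trainer` were used, these values would map to the `learning_rate`, `per_device_train_batch_size` and `num_train_epochs` fields of `TrainingArguments`. The card does not say which training loop the authors actually used.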
+## Eval results
+Statistic|Result
+-|-
+F1 | 0.96
+## Notes
+This model card was influenced by that of [dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER/edit/main/README.md).
+## Citation
+If you use this model, you can cite the following paper:
+```
+@misc{silcock2024newswirelargescalestructureddatabase,
+      title={Newswire: A Large-Scale Structured Database of a Century of Historical News},
+      author={Emily Silcock and Abhishek Arora and Luca D'Amico-Wong and Melissa Dell},
+      year={2024},
+      eprint={2406.09490},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2406.09490},
+}
+```
+## Applications
+We applied this model to a century of historical news articles and georeferenced the bylines. You can see them all in the [NEWSWIRE dataset](https://huggingface.co/datasets/dell-research-harvard/newswire).