Catherine Arnett

catherinearnett

https://catherinearnett.github.io/

AI & ML interests

multilingual NLP, tokenization

Recent Activity

upvoted an article 17 days ago

Releasing the largest multilingual open pretraining dataset

authored a paper 26 days ago

BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

authored a paper 26 days ago

Structural Priming Demonstrates Abstract Grammatical Representations in Multilingual Language Models

Articles

Organizations

catherinearnett's activity

upvoted an article 17 days ago

Releasing the largest multilingual open pretraining dataset

Article • 17 days ago • 96

authored 2 papers 26 days ago

BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

Paper • 2409.04599 • Published Sep 6 • 1

Structural Priming Demonstrates Abstract Grammatical Representations in Multilingual Language Models

Paper • 2311.09194 • Published Nov 15, 2023

updated a model 26 days ago

PleIAs/OCRonos-Vintage-CT2

Updated 26 days ago • 8

New activity in PleIAs/ToxicCommons 27 days ago

Link to the annotation creation script private

#2 opened 29 days ago by

davanstrien

updated a dataset 27 days ago

PleIAs/ToxicCommons

Viewer • Updated 27 days ago • 1.96M • 147 • 6

updated a model 27 days ago

PleIAs/celadon

Text Classification • Updated 27 days ago • 282 • 20

authored a paper about 1 month ago

Toxicity of the Commons: Curating Open-Source Pre-Training Data

Paper • 2410.22587 • Published Oct 29 • 8

commented on a paper about 1 month ago

Toxicity of the Commons: Curating Open-Source Pre-Training Data

Paper • 2410.22587 • Published Oct 29 • 8

updated a collection about 1 month ago

Toxic Commons

Collection

Tools for de-toxifying public domain data, especially multilingual and historical text data and data with OCR errors. • 3 items • Updated about 1 month ago • 2

published an article 2 months ago

wHy DoNt YoU jUsT uSe ThE lLaMa ToKeNiZeR??

Article • Sep 27 • 36

updated 8 models 2 months ago

catherinearnett/B-GPT_pl_en_sequential

Text Generation • Updated Sep 26 • 436

catherinearnett/B-GPT_en_pl_sequential

Text Generation • Updated Sep 26 • 440

catherinearnett/B-GPT_pl_en_simultaneous

Text Generation • Updated Sep 26 • 426

catherinearnett/B-GPT_en_pl_simultaneous

Text Generation • Updated Sep 26 • 596

catherinearnett/B-GPT_el_en_sequential

Text Generation • Updated Sep 26 • 420

catherinearnett/B-GPT_en_el_sequential

Text Generation • Updated Sep 26 • 433

catherinearnett/B-GPT_el_en_simultaneous

Text Generation • Updated Sep 26 • 427

catherinearnett/B-GPT_en_el_simultaneous

Text Generation • Updated Sep 26 • 428

Articles

Releasing the largest multilingual open pretraining dataset

Detoxifying the Commons

wHy DoNt YoU jUsT uSe ThE lLaMa ToKeNiZeR??