---
library_name: transformers
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
  - ModernBERT
  - fineweb
  - filtering
  - regression
metrics:
  - precision
  - recall
  - accuracy
model-index:
  - name: 8e-5_one_label
    results: []
datasets:
  - HuggingFaceFW/fineweb-edu-llama3-annotations
language:
  - en
---

A one-off run using a modified version of the original Fineweb-Edu quality-filter regression training code, simply replacing the original model (Snowflake-arctic-embed-m, a model fine-tuned from BERT-base) with ModernBERT-base.

Without extensive tuning, the model trains considerably faster than BERT-base and gains +5 points of Weighted F1:
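Since the filter is trained as a regression over the 0-5 quality annotations, a score-to-label mapping is needed to produce the per-class reports below. A minimal sketch, assuming a round-and-clip mapping like the original classifier's (`scores_to_labels` is a hypothetical helper name, not from the training script):

```python
import numpy as np

def scores_to_labels(scores):
    """Map continuous regression outputs to integer quality labels 0-5
    by rounding and clipping (assumed mapping, not the authoritative script)."""
    return np.clip(np.round(np.asarray(scores)), 0, 5).astype(int)

# e.g. out-of-range scores are clipped into the 0-5 label space
labels = scores_to_labels([-0.3, 1.4, 5.7])
```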

## Results

### ModernBERT-base-fineweb-edu-example

Weighted F1: 0.76

Detailed:

```
Validation Report:
              precision    recall  f1-score   support

           0       0.80      0.55      0.65      5694
           1       0.82      0.86      0.84     26512
           2       0.64      0.71      0.67     10322
           3       0.65      0.60      0.63      3407
           4       0.80      0.37      0.51       807
           5       0.00      0.00      0.00         1

    accuracy                           0.76     46743
   macro avg       0.62      0.51      0.55     46743
weighted avg       0.76      0.76      0.76     46743
```

### Original Classifier ([HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier))

Weighted F1: 0.71

Detailed:

```
              precision    recall  f1-score   support

           0       0.75      0.49      0.59      5694
           1       0.78      0.84      0.81     26512
           2       0.57      0.61      0.59     10322
           3       0.56      0.50      0.53      3407
           4       0.58      0.35      0.44       807
           5       0.33      0.01      0.02       125

    accuracy                           0.71     46867
   macro avg       0.60      0.47      0.50     46867
weighted avg       0.71      0.71      0.71     46867
```

(For some reason, the currently available annotated dataset is identical except that it is missing 124 of the 125 examples rated 5. These examples are so few that they have no real impact on the weighted metrics.)
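Reports in the format above come straight out of scikit-learn's `classification_report`; a minimal sketch with toy labels (the actual validation pipeline and label arrays are assumptions):

```python
from sklearn.metrics import classification_report, f1_score

# Toy ground-truth quality labels and rounded model predictions,
# standing in for the real validation split.
y_true = [0, 1, 1, 2, 3, 4, 1]
y_pred = [0, 1, 2, 2, 3, 4, 1]

# Per-class precision/recall/F1 plus accuracy, macro avg, weighted avg.
report = classification_report(y_true, y_pred, digits=2)
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
```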

## Params

Most parameters are detailed in the training script. Key hyperparameters:

- Learning Rate: 5e-5
- Weight Decay: 0.1 (decoupled)
- Seed: 1
- Warmup: 10% of steps
- Schedule: Linear decay
- Max epochs: 10
- Best Epoch: #3
- Precision: bfloat16
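The hyperparameters above map onto a Hugging Face `TrainingArguments` roughly as follows. This is a sketch, not the actual training script; `output_dir` and the epoch-level eval/save cadence are assumptions:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="modernbert-fineweb-edu",  # placeholder path
    learning_rate=5e-5,
    weight_decay=0.1,            # decoupled (AdamW-style) with the default optimizer
    seed=1,
    warmup_ratio=0.1,            # warmup over 10% of steps
    lr_scheduler_type="linear",  # linear decay
    num_train_epochs=10,
    bf16=True,                   # bfloat16 mixed precision
    eval_strategy="epoch",       # assumed cadence for picking the best epoch
    save_strategy="epoch",
    load_best_model_at_end=True, # the best epoch (#3 in this run) is kept
)
```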