# distilbert-truncated

This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on the 20 Newsgroups dataset. It achieves the results reported under Training results below.
## Training and evaluation data
The dataset was split into training and test sets: the model was trained on 90% of the data, with the remaining 10% of the original dataset held out for testing.
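The card does not include the loading/splitting code; below is a minimal sketch of a 90/10 split, assuming scikit-learn's `fetch_20newsgroups` loader and `train_test_split` (the `random_state` and stratification are illustrative choices, not taken from the original run).

```python
# Minimal sketch of a 90/10 train/test split of 20 Newsgroups.
# The loader, random_state and stratification are assumptions, not from the original run.
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

newsgroups = fetch_20newsgroups(subset="all")
texts, labels = newsgroups.data, newsgroups.target

# 90% of the data for training, 10% held out for testing.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.1, random_state=42, stratify=labels
)
```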
## Training procedure
DistilBERT has a maximum input length of 512 tokens, so with this in mind the following was performed:

- I used the `distilbert-base-uncased` pretrained model to initialize an `AutoTokenizer`.
- Setting a maximum length of 256, each entry in the training, testing and validation data was truncated if it exceeded the limit and padded if it fell short of it (see the sketch below).
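A minimal sketch of that tokenization step, assuming the `train_texts`/`test_texts` lists from the split sketch above; the truncation and padding behaviour comes directly from the tokenizer arguments:

```python
# Tokenize with DistilBERT's tokenizer, truncating/padding every entry to 256 tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

MAX_LENGTH = 256  # half of DistilBERT's 512-token limit

def tokenize(texts):
    # Entries longer than MAX_LENGTH are truncated; shorter ones are padded up to it.
    return tokenizer(
        texts,
        max_length=MAX_LENGTH,
        truncation=True,
        padding="max_length",
        return_tensors="tf",
    )

train_encodings = tokenize(train_texts)  # from the split sketch above
test_encodings = tokenize(test_texts)
print(train_encodings["input_ids"].shape)  # (num_train_examples, 256)
```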
### Training hyperparameters
The following hyperparameters were used during training:
- optimizer: {'name': 'Adam', 'weight_decay': None, 'clipnorm': None, 'global_clipnorm': None, 'clipvalue': None, 'use_ema': False, 'ema_momentum': 0.99, 'ema_overwrite_frequency': None, 'jit_compile': True, 'is_legacy_optimizer': False, 'learning_rate': {'class_name': 'PolynomialDecay', 'config': {'initial_learning_rate': 2e-05, 'decay_steps': 1908, 'end_learning_rate': 0.0, 'power': 1.0, 'cycle': False, 'name': None}}, 'beta_1': 0.9, 'beta_2': 0.999, 'epsilon': 1e-08, 'amsgrad': False}
- training_precision: float32
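The dump above corresponds to a plain Keras Adam optimizer driving a linear `PolynomialDecay` of the learning rate from 2e-5 down to 0 over the 1908 training steps. A reconstruction of that configuration (inferred from the logged config, not copied from the original training script) could look like this:

```python
import tensorflow as tf

# Linear decay (power=1.0) of the learning rate from 2e-5 to 0 over 1908 steps,
# matching the PolynomialDecay config logged above.
lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=2e-5,
    decay_steps=1908,
    end_learning_rate=0.0,
    power=1.0,
    cycle=False,
)

# Plain Adam (no weight decay), matching the logged beta/epsilon values.
optimizer = tf.keras.optimizers.Adam(
    learning_rate=lr_schedule,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-8,
)
```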
### Training results
- Epochs: 3
- Batches per epoch: 636
- Total training steps: 1908
- Model accuracy: 0.8337758779525757
- Model loss: 0.568471074104309
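The step counts are consistent: 636 batches per epoch × 3 epochs = 1908 total steps, which is exactly the `decay_steps` used by the learning-rate schedule. A hedged sketch of the compile/fit step with TensorFlow/Keras follows; `tf_train_dataset` and `tf_test_dataset` are assumed `tf.data.Dataset` objects built from the encodings above, not names from the original code.

```python
from transformers import TFAutoModelForSequenceClassification

# 20 Newsgroups has 20 target classes.
model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=20
)

# With no loss passed to compile(), the transformers TF models fall back to
# their built-in classification loss; `optimizer` is the reconstruction above.
model.compile(optimizer=optimizer, metrics=["accuracy"])

model.fit(
    tf_train_dataset,            # assumed: 636 batches per epoch
    validation_data=tf_test_dataset,
    epochs=3,                    # 636 batches/epoch * 3 epochs = 1908 steps
)
```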
### Framework versions
- Transformers 4.28.0
- TensorFlow 2.12.0
- Datasets 2.12.0
- Tokenizers 0.13.3