---
license: apache-2.0
base_model: pszemraj/random-mega-small-2048
tags:
- generated_from_trainer
metrics:
- accuracy
datasets:
- pszemraj/simple_wikipedia_LM
pipeline_tag: fill-mask
---

# mega-small-2048 on simple wikipedia

A 'small' [MEGA](https://arxiv.org/abs/2209.10655) model for masked language modeling (12 layers, hidden size 512, 2048-token context processed in chunks of 1024), trained on the `pszemraj/simple_wikipedia_LM` dataset.
It achieves the following results on the evaluation set:
- Loss: 3.4773
- Accuracy: 0.4591
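
For a rough sense of scale, the reported loss can be converted to a perplexity. This is a minimal sketch, assuming the loss is a mean cross-entropy in nats over the masked tokens only; with a 50% mask rate the figure is not comparable to perplexities reported for causal language models.

```python
import math

# Final validation loss reported above.
eval_loss = 3.4773

# Perplexity is the exponential of the mean cross-entropy
# (assumption: the loss is measured in nats).
perplexity = math.exp(eval_loss)

print(f"masked-token perplexity: {perplexity:.1f}")
```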

## Model description

See the [config](https://huggingface.co/pszemraj/mega-small-2048-C1024-tk_id-simplewiki-MR50/blob/main/config.json) for architecture details. This model was trained from scratch and should not be treated as a ready-to-use 'pretrained' model.

This model uses the tokenizer from `roberta-base`.
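
A minimal sketch of querying the model through the `fill-mask` pipeline follows. It assumes an installed `transformers` release that still includes the MEGA architecture (for example the 4.33.1 listed under Framework versions) and that the repository ID taken from the config link above is this model's repository. The helper name `fill_mask_demo` and the example sentence are illustrative only.

```python
# Assumption: repo ID copied from the config link in this card.
MODEL_ID = "pszemraj/mega-small-2048-C1024-tk_id-simplewiki-MR50"


def fill_mask_demo(text=None, top_k=3):
    """Return the top_k (token, score) predictions for one masked position."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import pipeline

    pipe = pipeline("fill-mask", model=MODEL_ID)
    if text is None:
        # Use the tokenizer's own mask token rather than hard-coding it.
        text = f"Paris is the capital of {pipe.tokenizer.mask_token}."
    return [(p["token_str"], p["score"]) for p in pipe(text, top_k=top_k)]
```

Calling `fill_mask_demo()` downloads the weights on first use. Given the evaluation accuracy above, predictions are likely to be noisy.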

## Intended uses & limitations

More information needed

## Training and evaluation data

> **Note:** This model was trained in `bf16`. The [official recommendation](https://github.com/facebookresearch/mega#tips) is to use fp32; the effect of this choice is still being explored.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0005
- train_batch_size: 1
- eval_batch_size: 1
- seed: 3208
- gradient_accumulation_steps: 64
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-07
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 3.0
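
The relationship between these values can be sketched in plain Python. This is an illustration under assumptions, not the training code: it assumes a single device, takes `total_steps = 1350` from the last logged step in the training results table (the true total may be slightly higher), and reproduces the usual linear warmup followed by linear decay.

```python
import math

train_batch_size = 1
gradient_accumulation_steps = 64

# Effective batch size per optimizer step (assumption: one device).
effective_batch_size = train_batch_size * gradient_accumulation_steps


def linear_schedule_lr(step, total_steps, base_lr=5e-4, warmup_ratio=0.05):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 at total_steps.
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


total_steps = 1350  # assumption: last step logged in the results table
for step in (0, 34, 68, 700, 1350):
    print(step, linear_schedule_lr(step, total_steps))
```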

Additionally:

- mask rate of 50% (see the [paper](https://arxiv.org/abs/2202.08005) for details)
- whole-word masking
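
The two settings above can be illustrated with a simplified sketch. It is not the data collator used in training: it assumes byte-level BPE tokens in the `roberta-base` style, where a leading `Ġ` marks the start of a new word, selects 50% of words rather than 50% of tokens, and always substitutes the mask token. The function name `whole_word_mask` and the example tokens are hypothetical.

```python
import random


def whole_word_mask(tokens, mask_rate=0.5, mask_token="<mask>", seed=0):
    """Mask whole words: every sub-token of a chosen word is replaced."""
    # Group token indices into words. A token starting with "Ġ" opens a
    # new word; any other token continues the previous word.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("Ġ") or not words:
            words.append([i])
        else:
            words[-1].append(i)

    # Choose which words to mask (at least one).
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(words) * mask_rate))
    chosen = rng.sample(words, n_to_mask)

    masked = list(tokens)
    for word in chosen:
        for i in word:
            masked[i] = mask_token
    return masked


# Hypothetical tokenization of "The cat sat on the windowsill".
tokens = ["The", "Ġcat", "Ġsat", "Ġon", "Ġthe", "Ġwind", "ows", "ill"]
print(whole_word_mask(tokens))
```

Masking by word keeps the sub-tokens of a word together, so the model cannot recover a masked piece from its unmasked neighbours within the same word.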


### Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| 7.2691        | 0.11  | 50   | 7.1000          | 0.0677   |
| 7.1597        | 0.22  | 100  | 6.8388          | 0.0794   |
| 6.5476        | 0.33  | 150  | 6.4004          | 0.1359   |
| 6.5335        | 0.44  | 200  | 6.1776          | 0.1708   |
| 5.7228        | 0.55  | 250  | 5.6106          | 0.2437   |
| 5.4574        | 0.66  | 300  | 5.1391          | 0.2884   |
| 5.2275        | 0.78  | 350  | 4.8626          | 0.3174   |
| 4.9589        | 0.89  | 400  | 4.6454          | 0.3374   |
| 4.6406        | 1.0   | 450  | 4.4498          | 0.3578   |
| 4.8251        | 1.11  | 500  | 4.3055          | 0.3706   |
| 4.4728        | 1.22  | 550  | 4.1877          | 0.3821   |
| 4.3975        | 1.33  | 600  | 4.0709          | 0.3955   |
| 4.4245        | 1.44  | 650  | 3.9909          | 0.4045   |
| 4.2613        | 1.55  | 700  | 3.8976          | 0.4128   |
| 4.1806        | 1.66  | 750  | 3.8515          | 0.4177   |
| 3.9469        | 1.77  | 800  | 3.7883          | 0.4227   |
| 3.9563        | 1.88  | 850  | 3.7314          | 0.4306   |
| 4.0063        | 1.99  | 900  | 3.6975          | 0.4336   |
| 3.9274        | 2.1   | 950  | 3.6561          | 0.4378   |
| 3.788         | 2.21  | 1000 | 3.6280          | 0.4410   |
| 3.8711        | 2.33  | 1050 | 3.5736          | 0.4467   |
| 3.8623        | 2.44  | 1100 | 3.5535          | 0.4496   |
| 3.8575        | 2.55  | 1150 | 3.5407          | 0.4521   |
| 4.0079        | 2.66  | 1200 | 3.5172          | 0.4543   |
| 3.8265        | 2.77  | 1250 | 3.4786          | 0.4591   |
| 3.9513        | 2.88  | 1300 | 3.4741          | 0.4578   |
| 3.554         | 2.99  | 1350 | 3.4773          | 0.4591   |


### Framework versions

- Transformers 4.33.1
- Pytorch 2.2.0.dev20230907+cu118
- Datasets 2.13.1
- Tokenizers 0.13.3