pszemraj/mega-small-2048-C1024-tk_id-simplewiki-MR50

	---
	license: apache-2.0
	base_model: pszemraj/random-mega-small-2048
	tags:
	- generated_from_trainer
	metrics:
	- accuracy
	datasets:
	- pszemraj/simple_wikipedia_LM
	pipeline_tag: fill-mask
	---

	# mega-small-2048 on Simple Wikipedia

	A 'small' [MEGA](https://arxiv.org/abs/2209.10655) model for masked language modeling (12 layers, hidden size 512, 2048-token context processed in chunks of 1024), trained on the `pszemraj/simple_wikipedia_LM` dataset.
	It achieves the following results on the evaluation set:
	- Loss: 3.4773
	- Accuracy: 0.4591
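
For a rough sense of scale, a cross-entropy loss can be converted to a perplexity by exponentiation. The sketch below assumes the reported loss is a mean cross-entropy in nats over masked positions (the usual Trainer convention, not confirmed in this card), so the result is a masked-token pseudo-perplexity and is not comparable to a causal-LM perplexity.

```python
import math

# Final evaluation metrics reported in this card.
eval_loss = 3.4773
eval_accuracy = 0.4591

# Assumption: the loss is mean cross-entropy in nats over masked positions.
pseudo_perplexity = math.exp(eval_loss)

print(f"masked-token pseudo-perplexity: {pseudo_perplexity:.2f}")
print(f"masked-token accuracy: {eval_accuracy:.1%}")
```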

	## Model description

	See [config](https://huggingface.co/pszemraj/mega-small-2048-C1024-tk_id-simplewiki-MR50/blob/main/config.json) for architecture details. This model was trained from scratch and is not a ready-to-use 'pretrained' model.

	This model uses the tokenizer from `roberta-base`.
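
A minimal sketch of querying the model through the `fill-mask` pipeline follows. It assumes an installed Transformers release that includes the MEGA architecture (such as the 4.33.1 listed under Framework versions) and network access to download the weights. The helper names `make_masked_text` and `predict_masked` are hypothetical and exist only for this example; `<mask>` is the mask token of the `roberta-base` tokenizer.

```python
MODEL_ID = "pszemraj/mega-small-2048-C1024-tk_id-simplewiki-MR50"
MASK_TOKEN = "<mask>"  # mask token of the roberta-base tokenizer


def make_masked_text(template: str, mask_token: str = MASK_TOKEN) -> str:
    """Fill the `{mask}` placeholder in `template` with the mask token."""
    return template.format(mask=mask_token)


def predict_masked(text: str, top_k: int = 5):
    """Return the top_k fill-mask predictions for `text` (downloads the model)."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    return fill_mask(text, top_k=top_k)
```

Calling `predict_masked(make_masked_text("Paris is the capital of {mask}."))` returns a list of dicts, each with `token_str` and `score` keys. Given the evaluation accuracy above, predictions should be treated as illustrative only.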

	## Intended uses & limitations

	More information needed

	## Training and evaluation data

	> Note: this model was trained in `bf16`. The [official recommendation](https://github.com/facebookresearch/mega#tips) is fp32; this is still being explored.

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 0.0005
	- train_batch_size: 1
	- eval_batch_size: 1
	- seed: 3208
	- gradient_accumulation_steps: 64
	- total_train_batch_size: 64
	- optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-07
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_ratio: 0.05
	- num_epochs: 3.0
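
The relationship between these values can be sketched in plain Python: the effective batch size is the per-device batch size times the accumulation steps, and the linear scheduler warms up for a fraction of the run before decaying to zero. The total step count of 1350 is an assumption taken from the last logged step in the results table, and a single device is assumed; the function `linear_schedule_lr` is a hypothetical re-implementation for illustration, not the Trainer's own code.

```python
import math

learning_rate = 5e-4
train_batch_size = 1
gradient_accumulation_steps = 64
warmup_ratio = 0.05

# Effective batch size per optimizer step (assuming a single device).
total_train_batch_size = train_batch_size * gradient_accumulation_steps

# Assumption: total optimizer steps; the results table logs up to step 1350.
total_steps = 1350
warmup_steps = math.ceil(total_steps * warmup_ratio)


def linear_schedule_lr(step: int) -> float:
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return learning_rate * remaining / max(1, total_steps - warmup_steps)


print(total_train_batch_size, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, linear_schedule_lr(step))
```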

	Additionally:

	- mask rate of 50% (see the [paper](https://arxiv.org/abs/2202.08005) for details)
	- whole-word masking
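
To illustrate what these two settings mean together, here is a simplified sketch of whole-word masking at a 50% rate. It is not the collator used for training: it assumes RoBERTa-style BPE tokens where a leading `Ġ` marks the start of a new word, selects words rather than tokens, and always substitutes `<mask>` (real collators often mix in random or unchanged tokens). The function names are hypothetical.

```python
import random

MASK = "<mask>"


def group_into_words(tokens):
    """Group token indices into words; a token starting with 'Ġ' opens a new word."""
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("Ġ") or not words:
            words.append([i])
        else:
            words[-1].append(i)
    return words


def whole_word_mask(tokens, mask_rate=0.5, seed=0):
    """Mask roughly `mask_rate` of the words, replacing every sub-token of each."""
    words = group_into_words(tokens)
    if not words:
        return []
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(words) * mask_rate))
    masked = list(tokens)
    for word in rng.sample(words, n_to_mask):
        for i in word:
            masked[i] = MASK
    return masked


tokens = ["Sim", "ple", "ĠWikipedia", "Ġis", "Ġan", "Ġency", "clopedia"]
masked = whole_word_mask(tokens)
print(masked)
```

The key property is that a multi-token word such as `Ġency` + `clopedia` is either masked in full or left intact, so the model cannot recover a masked piece from its visible neighbor within the same word.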


	### Training results

	| Training Loss | Epoch | Step | Validation Loss | Accuracy |
	|:-------------:|:-----:|:----:|:---------------:|:--------:|
	| 7.2691 | 0.11 | 50 | 7.1000 | 0.0677 |
	| 7.1597 | 0.22 | 100 | 6.8388 | 0.0794 |
	| 6.5476 | 0.33 | 150 | 6.4004 | 0.1359 |
	| 6.5335 | 0.44 | 200 | 6.1776 | 0.1708 |
	| 5.7228 | 0.55 | 250 | 5.6106 | 0.2437 |
	| 5.4574 | 0.66 | 300 | 5.1391 | 0.2884 |
	| 5.2275 | 0.78 | 350 | 4.8626 | 0.3174 |
	| 4.9589 | 0.89 | 400 | 4.6454 | 0.3374 |
	| 4.6406 | 1.0 | 450 | 4.4498 | 0.3578 |
	| 4.8251 | 1.11 | 500 | 4.3055 | 0.3706 |
	| 4.4728 | 1.22 | 550 | 4.1877 | 0.3821 |
	| 4.3975 | 1.33 | 600 | 4.0709 | 0.3955 |
	| 4.4245 | 1.44 | 650 | 3.9909 | 0.4045 |
	| 4.2613 | 1.55 | 700 | 3.8976 | 0.4128 |
	| 4.1806 | 1.66 | 750 | 3.8515 | 0.4177 |
	| 3.9469 | 1.77 | 800 | 3.7883 | 0.4227 |
	| 3.9563 | 1.88 | 850 | 3.7314 | 0.4306 |
	| 4.0063 | 1.99 | 900 | 3.6975 | 0.4336 |
	| 3.9274 | 2.1 | 950 | 3.6561 | 0.4378 |
	| 3.788 | 2.21 | 1000 | 3.6280 | 0.4410 |
	| 3.8711 | 2.33 | 1050 | 3.5736 | 0.4467 |
	| 3.8623 | 2.44 | 1100 | 3.5535 | 0.4496 |
	| 3.8575 | 2.55 | 1150 | 3.5407 | 0.4521 |
	| 4.0079 | 2.66 | 1200 | 3.5172 | 0.4543 |
	| 3.8265 | 2.77 | 1250 | 3.4786 | 0.4591 |
	| 3.9513 | 2.88 | 1300 | 3.4741 | 0.4578 |
	| 3.554 | 2.99 | 1350 | 3.4773 | 0.4591 |


	### Framework versions

	- Transformers 4.33.1
	- Pytorch 2.2.0.dev20230907+cu118
	- Datasets 2.13.1
	- Tokenizers 0.13.3
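
When reproducing results, it can help to compare the local environment against the versions above. The sketch below uses only the standard library; `compare_versions` is a hypothetical helper, and `torch` is the PyPI distribution name for PyTorch.

```python
from importlib import metadata

# Versions listed in this card, keyed by PyPI distribution name.
CARD_VERSIONS = {
    "transformers": "4.33.1",
    "torch": "2.2.0.dev20230907+cu118",
    "datasets": "2.13.1",
    "tokenizers": "0.13.3",
}


def compare_versions(expected=CARD_VERSIONS):
    """Map each package to (installed_version_or_None, version_listed_in_card)."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed locally
        report[name] = (installed, wanted)
    return report


for name, (installed, wanted) in compare_versions().items():
    print(f"{name}: installed={installed}, card={wanted}")
```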