---
tags:
- generated_from_trainer
datasets:
- Graphcore/wikipedia-bert-128
- Graphcore/wikipedia-bert-512
model-index:
- name: Graphcore/bert-large-uncased
  results: []
---

# Graphcore/bert-large-uncased

This model is a pre-trained BERT-Large trained in two phases on the [Graphcore/wikipedia-bert-128](https://huggingface.co/datasets/Graphcore/wikipedia-bert-128) and [Graphcore/wikipedia-bert-512](https://huggingface.co/datasets/Graphcore/wikipedia-bert-512) datasets.

## Model description

Pre-trained BERT-Large model trained on Wikipedia data.

## Training and evaluation data

Trained on Wikipedia datasets:
- [Graphcore/wikipedia-bert-128](https://huggingface.co/datasets/Graphcore/wikipedia-bert-128)
- [Graphcore/wikipedia-bert-512](https://huggingface.co/datasets/Graphcore/wikipedia-bert-512)

## Training procedure

Trained with the MLM and NSP pre-training scheme from [Large Batch Optimization for Deep Learning: Training BERT in 76 minutes](https://arxiv.org/abs/1904.00962).

Trained on 64 Graphcore Mk2 IPUs using [`optimum-graphcore`](https://github.com/huggingface/optimum-graphcore).

Command lines:

Phase 1:
```
python examples/language-modeling/run_pretraining.py \
  --config_name bert-large-uncased \
  --tokenizer_name bert-large-uncased \
  --ipu_config_name Graphcore/bert-large-ipu \
  --dataset_name Graphcore/wikipedia-bert-128 \
  --do_train \
  --logging_steps 5 \
  --max_seq_length 128 \
  --max_steps 10550 \
  --is_already_preprocessed \
  --dataloader_num_workers 64 \
  --dataloader_mode async_rebatched \
  --lamb \
  --lamb_no_bias_correction \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 512 \
  --pod_type pod64 \
  --learning_rate 0.006 \
  --lr_scheduler_type linear \
  --loss_scaling 32768 \
  --weight_decay 0.01 \
  --warmup_ratio 0.28 \
  --config_overrides "layer_norm_eps=0.001" \
  --ipu_config_overrides "matmul_proportion=[0.14 0.19 0.19 0.19]" \
  --output_dir output-pretrain-bert-large-phase1
```

Phase 2:
```
python examples/language-modeling/run_pretraining.py \
  --config_name bert-large-uncased \
  --tokenizer_name bert-large-uncased \
  --model_name_or_path ./output-pretrain-bert-large-phase1 \
  --ipu_config_name Graphcore/bert-large-ipu \
  --dataset_name Graphcore/wikipedia-bert-512 \
  --do_train \
  --logging_steps 5 \
  --max_seq_length 512 \
  --max_steps 2038 \
  --is_already_preprocessed \
  --dataloader_num_workers 96 \
  --dataloader_mode async_rebatched \
  --lamb \
  --lamb_no_bias_correction \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 512 \
  --pod_type pod64 \
  --learning_rate 0.002828 \
  --lr_scheduler_type linear \
  --loss_scaling 16384 \
  --weight_decay 0.01 \
  --warmup_ratio 0.128 \
  --config_overrides "layer_norm_eps=0.001" \
  --ipu_config_overrides "matmul_proportion=[0.14 0.19 0.19 0.19]" \
  --output_dir output-pretrain-bert-large-phase2
```

### Training hyperparameters

The following hyperparameters were used during phase 1 training:
- learning_rate: 0.006
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: IPU
- gradient_accumulation_steps: 512
- total_train_batch_size: 65536
- total_eval_batch_size: 512
- optimizer: LAMB
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.28
- training_steps: 10550
- training precision: Mixed Precision

The following hyperparameters were used during phase 2 training:
- learning_rate: 0.002828
- train_batch_size: 2
- eval_batch_size: 8
- seed: 42
- distributed_type: IPU
- gradient_accumulation_steps: 512
- total_train_batch_size: 16384
- total_eval_batch_size: 512
- optimizer: LAMB
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.128
- training_steps: 2038
- training precision: Mixed Precision
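The `total_train_batch_size` figures above follow from per-device batch size × gradient accumulation steps × data-parallel replicas. A minimal sanity-check sketch, assuming a replication factor of 16 (inferred, not stated above: 64 IPUs on a POD64 with the model pipelined over 4 IPUs per replica, as the four-entry `matmul_proportion` override suggests):

```python
# Sanity check of the effective global batch sizes quoted above.
# replicas=16 is an assumption: 64 IPUs (pod64) / 4 IPUs per pipelined replica.
def global_batch_size(per_device_batch: int, grad_accum_steps: int, replicas: int = 16) -> int:
    return per_device_batch * grad_accum_steps * replicas

assert global_batch_size(8, 512) == 65536  # phase 1 total_train_batch_size
assert global_batch_size(2, 512) == 16384  # phase 2 total_train_batch_size
```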
### Training results

```
train/epoch: 2.04
train/global_step: 2038
train/loss: 1.2002
train/train_runtime: 12022.3897
train/train_steps_per_second: 0.17
train/train_samples_per_second: 2777.367
```

### Framework versions

- Transformers 4.17.0
- Pytorch 1.10.0+cpu
- Datasets 2.0.0
- Tokenizers 0.11.6
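The checkpoint can be loaded with plain `transformers` for masked-LM/NSP inference or as a starting point for fine-tuning. A minimal CPU sketch (not the IPU pipeline used for training; the example sentence is only illustrative):

```python
import torch
from transformers import AutoTokenizer, AutoModelForPreTraining

tokenizer = AutoTokenizer.from_pretrained("Graphcore/bert-large-uncased")
model = AutoModelForPreTraining.from_pretrained("Graphcore/bert-large-uncased")

# BertForPreTraining returns both pre-training heads:
# prediction_logits (masked LM) and seq_relationship_logits (NSP).
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = outputs.prediction_logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))  # should print a plausible fill for the mask
```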