--- language: vi tags: - vi - vietnamese - gpt2 - text-generation - lm - nlp datasets: - oscar widget: - text: "Việt Nam là quốc gia có" --- # GPT-2 Pretrained model on Vietnamese language using a causal language modeling (CLM) objective. It was introduced in [this paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) and first released at [this page](https://openai.com/blog/better-language-models/). # How to use the model ~~~~ import torch from transformers import GPT2Tokenizer, GPT2LMHeadModel tokenizer = GPT2Tokenizer.from_pretrained('NlpHUST/gpt2-vietnamese') model = GPT2LMHeadModel.from_pretrained('NlpHUST/gpt2-vietnamese') text = "Albert Einstein là nhà vật lý học tạo ra thuyết lượng tử" input_ids = tokenizer.encode(text, return_tensors='pt') max_length = 100 sample_outputs = model.generate(input_ids,pad_token_id=tokenizer.eos_token_id, do_sample=True, max_length=max_length, min_length=max_length, top_k=40, num_beams=5, early_stopping=True, no_repeat_ngram_size=2, num_return_sequences=3) for i, sample_output in enumerate(sample_outputs): print(">> Generated text {}\n\n{}".format(i+1, tokenizer.decode(sample_output.tolist()))) print('\n---') ~~~~ # Model architecture A 12-layer, 768-hidden-size transformer-based language model. # Training The model was trained on Vietnamese Oscar dataset (32 GB) to optimize a traditional language modelling objective on v3-8 TPU for around 6 days. It reaches around 13.4 perplexity on a chosen validation set from Oscar. ### GPT-2 Finetuning The following example fine-tunes GPT-2 on WikiText-2. We're using the raw WikiText-2 (no tokens were replaced before the tokenization). The loss here is that of causal language modeling. The script [here](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py) . ```bash python run_clm.py \ --model_name_or_path NlpHUST/gpt2-vietnamese \ --dataset_name wikitext \ --dataset_config_name wikitext-2-raw-v1 \ --per_device_train_batch_size 8 \ --per_device_eval_batch_size 8 \ --do_train \ --do_eval \ --output_dir /tmp/test-clm ```