Copyright 2023 Anugrah Akbar Praramadhan. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Model Description
GPT-2 (Generative Pretrained Transformer 2) is a transformer-based architecture for causal language modeling: given the tokens/words on the left as an input prompt, it generates the next token on the right. It was developed by OpenAI (Radford, Alec; Wu, Jeff; Child, Rewon; Luan, David; Amodei, Dario; Sutskever, Ilya). See the paper here: https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf
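As a minimal illustration of causal (next-token) prediction with this checkpoint, the sketch below scores a short Indonesian prompt and decodes the single most likely next token. The prompt text is only an example; the full generation pipeline is shown in the How To Use section below.
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained('anugrahap/gpt2-indo-textgen')
>>> model = AutoModelForCausalLM.from_pretrained('anugrahap/gpt2-indo-textgen')
>>> inputs = tokenizer("Skripsi merupakan tugas akhir", return_tensors="pt")
>>> logits = model(**inputs).logits              # shape: (batch, sequence_length, vocab_size)
>>> next_token_id = int(logits[0, -1].argmax())  # most probable token given the left context
>>> tokenizer.decode(next_token_id)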
Limitations
Since GPT-2 is trained in an unsupervised way on unlabelled text sequences, without any explicit supervision, its output is non-deterministic: the same prompt can produce different text on each run. To obtain reproducible output, set a specific random seed before generating. The supported languages are English (inherited from the GPT-2 pretrained model) and Indonesian (fine-tuned on the Indonesian Wikipedia dataset).
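For example, resetting the seed before each call should make repeated generations identical. The sketch below is self-contained and only illustrative; the prompt and the do_sample/max_length settings are assumptions, not recommended settings from this card.
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='anugrahap/gpt2-indo-textgen')
>>> set_seed(1)
>>> first = generator("Skripsi merupakan tugas akhir", max_length=30, do_sample=True)
>>> set_seed(1)
>>> second = generator("Skripsi merupakan tugas akhir", max_length=30, do_sample=True)
>>> first == second  # identical results once the seed is reset before each call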
How To Use
Direct use with PyTorch:
>>> from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM, set_seed
>>> model_name = 'anugrahap/gpt2-indo-textgen'
>>> tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side='left')  # decoder-only models pad on the left
>>> model = AutoModelForCausalLM.from_pretrained(model_name, pad_token_id=tokenizer.eos_token_id)  # GPT-2 has no pad token, so reuse EOS
>>> generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
>>> set_seed(1)  # optional: fix the random seed for reproducible output (see Limitations above)
>>> result = generator("Skripsi merupakan tugas akhir mahasiswa", min_length=10, max_length=30, num_return_sequences=1)
>>> result[0]["generated_text"]
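For finer control over decoding, the same generation can also be run without the pipeline wrapper. The sketch below reuses the tokenizer and model objects created above; the do_sample and top_k values are illustrative assumptions rather than settings recommended by this card.
>>> inputs = tokenizer("Skripsi merupakan tugas akhir mahasiswa", return_tensors="pt")
>>> output_ids = model.generate(**inputs, min_length=10, max_length=30, do_sample=True, top_k=50, pad_token_id=tokenizer.eos_token_id)
>>> tokenizer.decode(output_ids[0], skip_special_tokens=True)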
Learn more
| GPT-2 Pretrained Model Medium-345M Parameters
| Indonesian Wikipedia Dataset - 433MB by IndoNLP
| Project Repository