Every token with a non-zero probability has a chance of being selected, which reduces the risk of repetition. To enable multinomial sampling, set `do_sample=True` and `num_beams=1`.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

set_seed(0)  # For reproducibility

checkpoint = "openai-community/gpt2-large"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "Today was an amazing day because"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(**inputs, do_sample=True, num_beams=1, max_new_tokens=100)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Today was an amazing day because when you go to the World Cup and you don\'t, or when you don\'t get invited, that\'s a terrible feeling."']
```

## Beam-search decoding

Unlike greedy search, beam-search decoding keeps several hypotheses at each time step and eventually chooses the hypothesis that has the overall highest probability for the entire sequence.
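In the `generate()` API, this corresponds to setting `num_beams` greater than 1 while leaving `do_sample=False` (the default). Below is a minimal sketch reusing the same checkpoint and prompt as above; the value `num_beams=5` is an illustrative choice, not one prescribed by the text:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "openai-community/gpt2-large"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "Today was an amazing day because"
inputs = tokenizer(prompt, return_tensors="pt")

# num_beams > 1 (with do_sample left at its default of False) enables
# beam-search decoding: 5 candidate hypotheses are tracked at each step,
# and the complete sequence with the highest overall score is returned.
outputs = model.generate(**inputs, num_beams=5, max_new_tokens=50)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Because beam search is deterministic for a fixed model and prompt, no seed is needed here, unlike the sampling example above.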