File size: 36,987 Bytes

7848bdf

{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "papuGaPT2_text_generation.ipynb",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-jlP8InZ6FuU"
      },
      "source": [
        "# Examples of generating text with papuGaPT2 - Polish GPT2 language model\n",
        "\n",
        "This notebook intends to show some examples of generating text with the Polish GPT2 model, [papuGaPT2](https://huggingface.co/flax-community/papuGaPT2)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "zNXhY6w7oAY7",
        "outputId": "229305ac-1892-4603-9698-0dcdfada1ce2"
      },
      "source": [
        "!pip install transformers -qq"
      ],
      "execution_count": 1,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "\u001b[K     |████████████████████████████████| 2.5MB 5.0MB/s \n",
            "\u001b[K     |████████████████████████████████| 901kB 35.2MB/s \n",
            "\u001b[K     |████████████████████████████████| 3.3MB 38.3MB/s \n",
            "\u001b[?25h"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "d_XIbTMDoLeN"
      },
      "source": [
        "from transformers import pipeline, set_seed\n",
        "from transformers import AutoTokenizer, AutoModelWithLMHead"
      ],
      "execution_count": 20,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "o47RrqSU-hnS",
        "outputId": "081a2675-2b8d-4832-c9fb-6becc1e52c13"
      },
      "source": [
        "model = AutoModelWithLMHead.from_pretrained('flax-community/papuGaPT2')\n",
        "tokenizer = AutoTokenizer.from_pretrained('flax-community/papuGaPT2')\n",
        "set_seed(42) # reproducibility"
      ],
      "execution_count": 21,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "/usr/local/lib/python3.7/dist-packages/transformers/models/auto/modeling_auto.py:847: FutureWarning: The class `AutoModelWithLMHead` is deprecated and will be removed in a future version. Please use `AutoModelForCausalLM` for causal language models, `AutoModelForMaskedLM` for masked language models and `AutoModelForSeq2SeqLM` for encoder-decoder models.\n",
            "  FutureWarning,\n",
            "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n"
          ],
          "name": "stderr"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9DjG3LKELhAz"
      },
      "source": [
        "## Text Generation\n",
        "\n",
        "Let's first start with the text-generation pipeline. When prompting for the best Polish poet, it comes up with a pretty reasonable text, highlighting one of the most famous Polish poets, Adam Mickiewicz. \n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "s3mDGuxGoOA2",
        "outputId": "0b58cd6d-2cac-44f8-81d6-bf9a5790b217"
      },
      "source": [
        "generator = pipeline('text-generation', model='flax-community/papuGaPT2')"
      ],
      "execution_count": 22,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n"
          ],
          "name": "stderr"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "iTPH2S-rL_xn",
        "outputId": "3a2165ee-348f-4c6e-eb5c-2cd92435357d"
      },
      "source": [
        "generator('Największym polskim poetą był')"
      ],
      "execution_count": 40,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
          ],
          "name": "stderr"
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'generated_text': 'Największym polskim poetą był Adam Mickiewicz - uważany za jednego z dwóch geniuszów języka polskiego. \"Pan Tadeusz\" był jednym z najpopularniejszych dzieł w historii Polski. W 1801 został wystawiony publicznie w Teatrze Wilama Horzycy. Pod jego'}]"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 40
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xTZtviLSLsYf"
      },
      "source": [
        "Let's now explore the text generation/decoding method in more detail. The following code and examples were adapted from Patrick von Platen's [excellent article](https://huggingface.co/blog/how-to-generate).\n",
        "\n",
        "\n",
        "#### Greedy Search\n",
        "\n",
        "In this approach, we pick the most probable token at each step during the generation. As we can see, this results in a lot of repetitions. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "A8sspEnO-X6W",
        "outputId": "68f3ba22-491f-4776-f384-f98886876352"
      },
      "source": [
        "# encode context the generation is conditioned on\n",
        "input_ids = tokenizer.encode('Największym polskim poetą był', return_tensors='pt')\n",
        "\n",
        "# generate text until the output length (which includes the context length) reaches 50\n",
        "greedy_output = model.generate(input_ids, max_length=50)\n",
        "\n",
        "print(\"Output:\\n\" + 100 * '-')\n",
        "print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))"
      ],
      "execution_count": 25,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
          ],
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": [
            "Output:\n",
            "----------------------------------------------------------------------------------------------------\n",
            "Największym polskim poetą był Julian Tuwim, który w latach 60. i 70. był jednym z najbardziej znanych poetów. W latach 70. i 80. był jednym z najbardziej znanych poetów w Polsce.\n",
            "W latach 70. i 80. Tuwi\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ADNi9ehHOIJy"
      },
      "source": [
        "#### Beam Search\n",
        "\n",
        "Beam search allows us to maximize the probability of the entire sequence of generated tokens, as we search through the tree of possible options for the next probable token. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "hUmnyzJU-fXR",
        "outputId": "63bf0414-8854-49bc-e137-c8fed8746c81"
      },
      "source": [
        "# activate beam search and early_stopping\n",
        "beam_output = model.generate(\n",
        "    input_ids, \n",
        "    max_length=50, \n",
        "    num_beams=5, \n",
        "    early_stopping=True\n",
        ")\n",
        "\n",
        "print(\"Output:\\n\" + 100 * '-')\n",
        "print(tokenizer.decode(beam_output[0], skip_special_tokens=True))"
      ],
      "execution_count": 26,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n",
            "/usr/local/lib/python3.7/dist-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.\n",
            "To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)\n",
            "  return torch.floor_divide(self, other)\n"
          ],
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": [
            "Output:\n",
            "----------------------------------------------------------------------------------------------------\n",
            "Największym polskim poetą był Julian Przyboś, który pisał wiersze dla dzieci i dorosłych, a także dla dzieci i młodzieży, m.in. dla Jana Brzechwy, Juliana Tuwima, Jana Brzechwy, Jana Brzechwy i wielu innych.\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jSVLNwCWOjuC"
      },
      "source": [
        "#### N-gram repetitions\n",
        "\n",
        "We can prevent the generated text from repeating n-grams like this. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "2QeDJh5R_5bo",
        "outputId": "a0c530ef-adcc-4b78-b91f-a051742e0f10"
      },
      "source": [
        "# set no_repeat_ngram_size to 2\n",
        "beam_output = model.generate(\n",
        "    input_ids, \n",
        "    max_length=50, \n",
        "    num_beams=5, \n",
        "    no_repeat_ngram_size=2, \n",
        "    early_stopping=True\n",
        ")\n",
        "\n",
        "print(\"Output:\\n\" + 100 * '-')\n",
        "print(tokenizer.decode(beam_output[0], skip_special_tokens=True))"
      ],
      "execution_count": 27,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
          ],
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": [
            "Output:\n",
            "----------------------------------------------------------------------------------------------------\n",
            "Największym polskim poetą był Julian Przyboś, który pisał wiersze dla dzieci i młodzieży, a także dla dorosłych, m.in. dla Jana Brzechwy, Juliana Tuwima, Marii Pawlikowskiej-Jasnorzewskiej, Bolesława Leśmiana,\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "C1QtiC5HOsOn"
      },
      "source": [
        "#### Multiple Output Sentences\n",
        "\n",
        "We can ask the model to generate several output sentences. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ELSiU-nEAHY6",
        "outputId": "aa1416b4-2cdd-4c6e-c5bb-775c194e811b"
      },
      "source": [
        "# set return_num_sequences > 1\n",
        "beam_outputs = model.generate(\n",
        "    input_ids, \n",
        "    max_length=50, \n",
        "    num_beams=5, \n",
        "    no_repeat_ngram_size=2, \n",
        "    num_return_sequences=5, \n",
        "    early_stopping=True\n",
        ")\n",
        "\n",
        "# now we have 3 output sequences\n",
        "print(\"Output:\\n\" + 100 * '-')\n",
        "for i, beam_output in enumerate(beam_outputs):\n",
        "  print(\"{}: {}\".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))"
      ],
      "execution_count": 28,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
          ],
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": [
            "Output:\n",
            "----------------------------------------------------------------------------------------------------\n",
            "0: Największym polskim poetą był Julian Przyboś, który pisał wiersze dla dzieci i młodzieży, a także dla dorosłych, m.in. dla Jana Brzechwy, Juliana Tuwima, Marii Pawlikowskiej-Jasnorzewskiej, Bolesława Leśmiana,\n",
            "1: Największym polskim poetą był Julian Przyboś, który pisał wiersze dla dzieci i młodzieży, a także dla dorosłych, m.in. dla Jana Brzechwy, Juliana Tuwima, Marii Pawlikowskiej-Jasnorzewskiej, Jana Lechonia\n",
            "2: Największym polskim poetą był Julian Przyboś, który pisał wiersze dla dzieci i młodzieży, a także dla dorosłych, m.in. dla Jana Brzechwy, Juliana Tuwima, Marii Pawlikowskiej-Jasnorzewskiej, Czesława Janczarskiego\n",
            "3: Największym polskim poetą był Julian Przyboś, który pisał wiersze dla dzieci i młodzieży, a także dla dorosłych, m.in. dla Jana Brzechwy, Juliana Tuwima, Marii Pawlikowskiej-Jasnorzewskiej, Czesława Miłosza,\n",
            "4: Największym polskim poetą był Julian Przyboś, który pisał wiersze dla dzieci i młodzieży, a także dla dorosłych, m.in. dla Jana Brzechwy, Juliana Tuwima, Marii Pawlikowskiej-Jasnorzewskiej i wielu innych.\n",
            "\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SkAV930BO3Zz"
      },
      "source": [
        "#### Sampling\n",
        "\n",
        "To produce more interesting text, instead of picking the most likely choice, we can sample next token from the probability distribution learned by our model. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "4Yw7ZJi0AOa0",
        "outputId": "b249b80a-8108-4e06-dbfe-f1749862c6fd"
      },
      "source": [
        "# activate sampling and deactivate top_k by setting top_k sampling to 0\n",
        "sample_output = model.generate(\n",
        "    input_ids, \n",
        "    do_sample=True, \n",
        "    max_length=50, \n",
        "    top_k=0\n",
        ")\n",
        "\n",
        "print(\"Output:\\n\" + 100 * '-')\n",
        "print(tokenizer.decode(sample_output[0], skip_special_tokens=True))"
      ],
      "execution_count": 29,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
          ],
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": [
            "Output:\n",
            "----------------------------------------------------------------------------------------------------\n",
            "Największym polskim poetą był Paweł Jasienica, postać barwna, pełna temperamentów, jakże zacna kobieta, Brat naszego serca dziś utarte cyruliki, kulon, Kościuszko Juliusz Polski Prowuaja Kozacyczcyca\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "h7IlhqK1PGyr"
      },
      "source": [
        "#### Temperature scaling\n",
        "\n",
        "If the model picks a very low-probability token, this can lead to gibberish results. We can reduce this risk by sharpening the distribution with temperature. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "E-_lundzAfSc",
        "outputId": "8ef81b22-caa4-40a1-e935-aec0146d7ea5"
      },
      "source": [
        "# use temperature to decrease the sensitivity to low probability candidates\n",
        "sample_output = model.generate(\n",
        "    input_ids, \n",
        "    do_sample=True, \n",
        "    max_length=50, \n",
        "    top_k=0, \n",
        "    temperature=0.8\n",
        ")\n",
        "\n",
        "print(\"Output:\\n\" + 100 * '-')\n",
        "print(tokenizer.decode(sample_output[0], skip_special_tokens=True))"
      ],
      "execution_count": 31,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
          ],
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": [
            "Output:\n",
            "----------------------------------------------------------------------------------------------------\n",
            "Największym polskim poetą był Adam Zagajewski. Zdjęcie poniżej pochodzi z 2010 roku.\n",
            "W „Gazecie Wyborczej” ukazał się nowy tekst Adama Zagajewskiego. Piszemy w nim o… Bolku i Lolku z „Niedzieli”.\n",
            "ZW\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Gbe5_Z1kPUlH"
      },
      "source": [
        "#### Top-k Sampling\n",
        "\n",
        "We can also ask the model to only pick tokens from the list of k most probable tokens. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "6eMOD-VeAvlR",
        "outputId": "dd3257ac-713d-471d-e793-3e8dd11b47f3"
      },
      "source": [
        "# set top_k to 50\n",
        "sample_output = model.generate(\n",
        "    input_ids, \n",
        "    do_sample=True, \n",
        "    max_length=50, \n",
        "    top_k=50\n",
        ")\n",
        "\n",
        "print(\"Output:\\n\" + 100 * '-')\n",
        "print(tokenizer.decode(sample_output[0], skip_special_tokens=True))"
      ],
      "execution_count": 32,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
          ],
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": [
            "Output:\n",
            "----------------------------------------------------------------------------------------------------\n",
            "Największym polskim poetą był Stanisław Lem, który zasłynął z antyutopii, a także wielkim poczuciem humoru, wykazując się niezwykłą inteligencją. Poeci o jego twórczości mówią, że jest „żywym malarzem języka polskiego, a jednocześnie\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UrzIElatPkqW"
      },
      "source": [
        "#### Top-p Sampling\n",
        "\n",
        "Rather than picking among the k most probable tokens, we can decide to pick from the tokens that sum up to p probability. This way, we can give our text generation more freedom when many tokens are feasible, and narrow its focus when only a few options make sense.  We can also combine top-k and top-p sampling. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Sk_tAsLcA94W",
        "outputId": "22b86f18-c43d-4bf0-9ae1-24a970e3ed1a"
      },
      "source": [
        "# deactivate top_k sampling and sample only from 93% most likely words\n",
        "sample_output = model.generate(\n",
        "    input_ids, \n",
        "    do_sample=True, \n",
        "    max_length=50, \n",
        "    top_p=0.93, \n",
        "    top_k=0\n",
        ")\n",
        "\n",
        "print(\"Output:\\n\" + 100 * '-')\n",
        "print(tokenizer.decode(sample_output[0], skip_special_tokens=True))"
      ],
      "execution_count": 37,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
          ],
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": [
            "Output:\n",
            "----------------------------------------------------------------------------------------------------\n",
            "Największym polskim poetą był sobie Andrzej Poniedzielski, do którego wroc. to jako autor: Adrian Waksmundzki. Powstało 13 utworów poetyckich, przedstawionych w formie prozatorskiej, poetyckiej i scenicznej, jak\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "zo0irbRWBIOH",
        "outputId": "5d30d98c-5f7e-4392-d9d1-e5dcae91ae57"
      },
      "source": [
        "# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3\n",
        "sample_outputs = model.generate(\n",
        "    input_ids,\n",
        "    do_sample=True, \n",
        "    max_length=50, \n",
        "    top_k=50, \n",
        "    top_p=0.95, \n",
        "    num_return_sequences=3\n",
        ")\n",
        "\n",
        "print(\"Output:\\n\" + 100 * '-')\n",
        "for i, sample_output in enumerate(sample_outputs):\n",
        "  print(\"{}: {}\".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))"
      ],
      "execution_count": 38,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
          ],
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": [
            "Output:\n",
            "----------------------------------------------------------------------------------------------------\n",
            "0: Największym polskim poetą był Roman Ingarden. Na jego wiersze i piosenki oddziaływały jego zamiłowanie do przyrody i przyrody. Dlatego też jako poeta w czasie pracy nad utworami i wierszami z tych wierszy, a następnie z poezji własnej - pisał\n",
            "1: Największym polskim poetą był Julian Przyboś, którego poematem „Wierszyki dla dzieci”.\n",
            "W okresie międzywojennym, pod hasłem „Papież i nie tylko” Polska, jak większość krajów europejskich, była państwem faszystowskim.\n",
            "Prócz\n",
            "2: Największym polskim poetą był Bolesław Leśmian, który był jego tłumaczem, a jego poezja tłumaczyła na kilkanaście języków.\n",
            "W 1895 roku nakładem krakowskiego wydania \"Scientio\" ukazała się w języku polskim powieść W krainie kangurów\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cO2sDlX0QZ4N"
      },
      "source": [
        "## Avoiding Bad Words\n",
        "\n",
        "You may want to prevent certain words from occuring in the generated text. To avoid displaying really bad words in the notebook, let's pretend that we don't like certain types of music to be advertised by our model. The prompt says: *my favorite type of music is*. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Da2O9jNmQvie",
        "outputId": "a686c703-377e-4a3d-d557-59e061050ecb"
      },
      "source": [
        "# encode context the generation is conditioned on\n",
        "input_ids = tokenizer.encode('Mój ulubiony gatunek muzyki to', return_tensors='pt')\n",
        "\n",
        "sample_outputs = model.generate(\n",
        "    input_ids,\n",
        "    do_sample=True, \n",
        "    max_length=20, \n",
        "    top_k=50, \n",
        "    top_p=0.95, \n",
        "    num_return_sequences=5\n",
        ")\n",
        "\n",
        "print(\"Output:\\n\" + 100 * '-')\n",
        "for i, sample_output in enumerate(sample_outputs):\n",
        "  print(\"{}: {}\".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))"
      ],
      "execution_count": 49,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
          ],
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": [
            "Output:\n",
            "----------------------------------------------------------------------------------------------------\n",
            "0: Mój ulubiony gatunek muzyki to rock i pop. U nas bardzo, bardzo często króluje rock i pop.\n",
            "1: Mój ulubiony gatunek muzyki to disco, czyli tango, a od 10.05 także fokstro\n",
            "2: Mój ulubiony gatunek muzyki to soul i reggae. Kocham hiphop i ska, to są moi\n",
            "3: Mój ulubiony gatunek muzyki to hip hop i wszelkiego rodzaju metal, głównie industrialne brzmienia (metal,\n",
            "4: Mój ulubiony gatunek muzyki to oczywiście soul, do dzisiaj pamiętam swój zachwyt nad głosem Damiena Per\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hFnNWFkSYzOx"
      },
      "source": [
        "Now let's prevent the model from generating text containing these words: *disco, rock, pop, soul, reggae, hip-hop*. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "fcnODcEeBkGr"
      },
      "source": [
        "bad_words = [' disco', ' rock', ' pop', ' soul', ' reggae', ' hip-hop']\n",
        "bad_word_ids = []\n",
        "for bad_word in bad_words: \n",
        "  ids = tokenizer(bad_word).input_ids\n",
        "  bad_word_ids.append(ids)"
      ],
      "execution_count": 77,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "JAr0EmJwRmka",
        "outputId": "94c463ae-c269-4577-a1ba-74dc528732ba"
      },
      "source": [
        "sample_outputs = model.generate(\n",
        "    input_ids,\n",
        "    do_sample=True, \n",
        "    max_length=20, \n",
        "    top_k=50, \n",
        "    top_p=0.95, \n",
        "    num_return_sequences=5,\n",
        "    bad_words_ids=bad_word_ids\n",
        ")\n",
        "\n",
        "print(\"Output:\\n\" + 100 * '-')\n",
        "for i, sample_output in enumerate(sample_outputs):\n",
        "  print(\"{}: {}\".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))"
      ],
      "execution_count": 76,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
          ],
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": [
            "Output:\n",
            "----------------------------------------------------------------------------------------------------\n",
            "0: Mój ulubiony gatunek muzyki to muzyka klasyczna. Nie wiem, czy to kwestia sposobu, w jaki gramy,\n",
            "1: Mój ulubiony gatunek muzyki to reggea. Zachwycają mnie piosenki i piosenki muzyczne o ducho\n",
            "2: Mój ulubiony gatunek muzyki to rockabilly, ale nie lubię też punka. Moim ulubionym gatunkiem\n",
            "3: Mój ulubiony gatunek muzyki to rap, ale to raczej się nie zdarza w miejscach, gdzie nie chodzi\n",
            "4: Mój ulubiony gatunek muzyki to metal aranżeje nie mam pojęcia co mam robić. Co roku,\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "g080rafsZEqo"
      },
      "source": [
        "Ok, it seems this worked: we can see *classical music, rap, metal* among the outputs. Interestingly, *reggae* found a way through via a misspelling *reggea*. Take it as a caution to be careful with curating your bad word lists!"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nGzC7t6HaC4n"
      },
      "source": [
        "## Few Shot Learning\n",
        "\n",
        "Let's see now if our model is able to pick up training signal directly from a prompt, without any finetuning. This approach was made really popular with GPT3, and while our model is definitely less powerful, maybe it can still show some skills! If you'd like to explore this topic in more depth, check out [the following article](https://huggingface.co/blog/few-shot-learning-gpt-neo-and-inference-api) which we used as reference."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "WqAYyfWZaCBd"
      },
      "source": [
        "prompt = \"\"\"Tekst: \"Nienawidzę smerfów!\"\n",
        "Sentyment: Negatywny\n",
        "###\n",
        "Tekst: \"Jaki piękny dzień 👍\"\n",
        "Sentyment: Pozytywny\n",
        "###\n",
        "Tekst: \"Jutro idę do kina\"\n",
        "Sentyment: Neutralny\n",
        "###\n",
        "Tekst: \"Ten przepis jest świetny!\"\n",
        "Sentyment:\"\"\""
      ],
      "execution_count": 134,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "OXex5Zh8aSe2",
        "outputId": "2efcd460-fe1a-4d97-c740-d5d3a034fb20"
      },
      "source": [
        "res = generator(prompt, max_length=85, temperature=0.5, end_sequence='###', return_full_text=False, num_return_sequences=5,)\n",
        "for x in res: \n",
        "  print(res[i]['generated_text'].split(' ')[1])"
      ],
      "execution_count": 135,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
          ],
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": [
            "Pozytywny\n",
            "Pozytywny\n",
            "Pozytywny\n",
            "Pozytywny\n",
            "Pozytywny\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "mP-hSxPBb5ky"
      },
      "source": [
        "prompt = \"\"\"Tekst: \"Nienawidzę smerfów!\"\n",
        "Sentyment: Negatywny\n",
        "###\n",
        "Tekst: \"Jaki piękny dzień 👍\"\n",
        "Sentyment: Pozytywny\n",
        "###\n",
        "Tekst: \"Jutro idę do kina\"\n",
        "Sentyment: Neutralny\n",
        "###\n",
        "Tekst: \"No po prostu beznadzieja\"\n",
        "Sentyment:\"\"\""
      ],
      "execution_count": 136,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "wi5i1Dl5bemF",
        "outputId": "455e6602-03d0-480f-b306-e94a6022f403"
      },
      "source": [
        "res = generator(prompt, max_length=85, temperature=0.5, end_sequence='###', return_full_text=False, num_return_sequences=5,)\n",
        "for x in res: \n",
        "  print(res[i]['generated_text'].split(' ')[1])"
      ],
      "execution_count": 137,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
          ],
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": [
            "Negatywny\n",
            "Negatywny\n",
            "Negatywny\n",
            "Negatywny\n",
            "Negatywny\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "e96CRXtHcFfg"
      },
      "source": [
        "prompt = \"\"\"Tekst: \"Nienawidzę smerfów!\"\n",
        "Sentyment: Negatywny\n",
        "###\n",
        "Tekst: \"Jaki piękny dzień 👍\"\n",
        "Sentyment: Pozytywny\n",
        "###\n",
        "Tekst: \"Jutro idę do kina\"\n",
        "Sentyment: Neutralny\n",
        "###\n",
        "Tekst: \"Przyjechał wczoraj wieczorem.\"\n",
        "Sentyment:\"\"\""
      ],
      "execution_count": 140,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "FsCeE80QcNUY",
        "outputId": "ea6ff86b-8adb-4b5a-bcaa-8b893a825aa5"
      },
      "source": [
        "res = generator(prompt, max_length=85, temperature=0.5, end_sequence='###', return_full_text=False, num_return_sequences=5,)\n",
        "for x in res: \n",
        "  print(res[i]['generated_text'].split(' ')[1])"
      ],
      "execution_count": 141,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
          ],
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": [
            "Neutralny,\n",
            "Neutralny,\n",
            "Neutralny,\n",
            "Neutralny,\n",
            "Neutralny,\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "P6NJOgzwk-gz"
      },
      "source": [
        "It looks like our model is able to pick up some signal from the prompt. Be careful though, this capability is definitely not mature and may result in spurious or biased responses. "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "n5r8vnFVdHn-"
      },
      "source": [
        "## Zero-Shot Learning\n",
        "\n",
        "Large language models are known to store a lot of knowledge in its parameters. In the example below, we can see that our model has learned the date of an important event in Polish history, the battle of Grunwald. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "2lzoMNPic96F",
        "outputId": "88d5a77a-ec23-4c29-884e-0e51dd059b8f"
      },
      "source": [
        "prompt = \"Bitwa pod Grunwaldem miała miejsce w roku\"\n",
        "input_ids = tokenizer.encode(prompt, return_tensors='pt')\n",
        "# activate beam search and early_stopping\n",
        "beam_outputs = model.generate(\n",
        "    input_ids, \n",
        "    max_length=20, \n",
        "    num_beams=5, \n",
        "    early_stopping=True,\n",
        "    num_return_sequences=3\n",
        ")\n",
        "\n",
        "print(\"Output:\\n\" + 100 * '-')\n",
        "for i, sample_output in enumerate(beam_outputs):\n",
        "  print(\"{}: {}\".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))"
      ],
      "execution_count": 118,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.\n"
          ],
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": [
            "Output:\n",
            "----------------------------------------------------------------------------------------------------\n",
            "0: Bitwa pod Grunwaldem miała miejsce w roku 1410, kiedy to wojska polsko-litewskie pod\n",
            "1: Bitwa pod Grunwaldem miała miejsce w roku 1410, kiedy to wojska polsko-litewskie pokona\n",
            "2: Bitwa pod Grunwaldem miała miejsce w roku 1410, kiedy to wojska polsko-litewskie,\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "k_o4H2v1dWxV"
      },
      "source": [
        ""
      ],
      "execution_count": null,
      "outputs": []
    }
  ]
}