File size: 17,206 Bytes

{
  "cells": [
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Airline Sentiment Prediction using BERT\n",
        "\n",
        "### Approach\n",
        "First I analysed the data and I found that there was a huge imbalance in the dataset, to resolve this I used Textattack for augumentation of data.\n",
        "Before the augumenting the dataset I used the following techniques to clean the data & reduce the noise:\n",
        "- Removed the @usernames\n",
        "- Removed the URLs\n",
        "- Removed hashtags\n",
        "- Replacement of emojis with their meaning\n",
        "\n",
        "After cleaning the data I used EasyDataAugment of Textattack to augment the data, augmenting the data helped me to increase the accuracy of the model by more than 3%. I also tried using Clare(It replaces the words with their synonyms) but that was very resource intensive & it was taking very long to get output.\n",
        "\n",
        "### Model\n",
        "Since, this was a binary classification task I used BERT for training the model. I used the pretrained BERT model from Huggingface transformers library. I used the BERT model with the following parameters:\n",
        "- BERT-base-uncased\n",
        "- Max length of the input sequence: 128\n",
        "- Learning rate: 3e-5\n",
        "- Batch size: 32\n",
        "\n",
        "### Results\n",
        "The dataset was split into 80:20 ratio for training & validation.\n",
        "I got the following results after training the model:\n",
        "Training loss: 0.0137\n",
        "Validation loss: 0.1209\n",
        "Training accuracy: 0.9955\n",
        "Validation accuracy: 0.9794\n"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "========================================================================================================================================"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Install the required libraries"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "nwuC7017BwyD"
      },
      "outputs": [],
      "source": [
        "%pip install transformers\n",
        "%pip install emoji\n",
        "%pip install numpy pandas\n",
        "%pip install scikit-learn\n",
        "%pip install textattack"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Importing the libraries"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "iM2I9UEjm_pE"
      },
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "import pandas as pd\n",
        "from pprint import pprint"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Reading the data"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 676
        },
        "id": "mrnzcvkzm_pF",
        "outputId": "61550835-27fc-4049-f3a3-f9ab9e1ba1bb"
      },
      "outputs": [],
      "source": [
        "df = pd.read_csv(\"airline_sentiment_analysis.csv\")\n",
        "df.head(20)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Assigning 1 to positive sentiment and 0 to negative sentiment"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 676
        },
        "id": "Jbl-wjpWm_pG",
        "outputId": "19c8b0ca-4506-4960-d588-505feecf678e"
      },
      "outputs": [],
      "source": [
        "for label in df['airline_sentiment']:\n",
        "    if label == 'positive':\n",
        "        df['airline_sentiment'].replace(label, 1, inplace=True)\n",
        "    elif label == 'negative':\n",
        "        df['airline_sentiment'].replace(label, 0, inplace=True)\n",
        "df.head(20)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Remove the @usernames, URLs, hashtags & Replace the emojis with their meaning"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 676
        },
        "id": "ApZHkGw2m_pG",
        "outputId": "330f44ff-3314-4392-eef4-69f2994b1cae"
      },
      "outputs": [],
      "source": [
        "\n",
        "import emoji\n",
        "for i,r in df.iterrows():\n",
        "  \n",
        "  df.loc[i,\"text\"] = emoji.demojize(df.loc[i,\"text\"])\n",
        "  df.loc[i,\"text\"] = df.loc[i,\"text\"].replace(\":\",\" \")\n",
        "  df.loc[i,\"text\"] = ' '.join(df.loc[i,\"text\"].split())\n",
        "\n",
        "df['text'] = df['text'].str.replace(\"@[A-Za-z0-9]+\", \"\",regex=True)\n",
        "df['text'] = df['text'].str.replace(\"#\", \"\",regex=True)\n",
        "df['text'] = df['text'].str.replace(\"https?://[A-Za-z0-9./]+\", \"\",regex=True)\n",
        "df['text'] = df['text'].str.replace(\"[^a-zA-Z.!?']\", \" \",regex=True)\n",
        "\n",
        "\n",
        "df.head(20)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Augumenting Positive Sentiment using EasyDataAugment"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "lNr4OWDaGYww"
      },
      "outputs": [],
      "source": [
        "positive_feedback = (df.loc[df[\"airline_sentiment\"] == 1])[\"text\"]\n",
        "positive_feedback = positive_feedback.tolist()\n",
        "# pprint(positive_feedback)\n",
        "\n",
        "from textattack.augmentation import EasyDataAugmenter\n",
        "esy_aug = EasyDataAugmenter()\n",
        "aug_list = []\n",
        "for sen in positive_feedback:\n",
        "  aug_list.append(esy_aug.augment(sen))\n",
        "serial_list = []\n",
        "for l in aug_list:\n",
        "  for sen in l:\n",
        "    serial_list.append(sen)\n",
        "df = df.drop(df.columns[[0]],axis=1)\n",
        "\n",
        "df2 = pd.DataFrame(list(zip([1]*len(serial_list),serial_list)),columns=[\"airline_sentiment\",\"text\"])\n",
        "\n",
        "df = pd.concat([df,df2])\n",
        "\n",
        "df.to_csv(\"modified.csv\") #save the modified dataset\n",
        "df.head()"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Split dataset into train & validation in 80:20 ratio"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "MrynUQ9Xm_pG"
      },
      "outputs": [],
      "source": [
        "# split the data into train and test\n",
        "from sklearn.model_selection import train_test_split\n",
        "\n",
        "train, test = train_test_split(df, test_size=0.2, random_state=42)\n"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Initalise the BERT model & tokenizer"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "MambfTNXm_pG",
        "outputId": "f0e11223-8e74-445a-8cdc-cc8492f26b14"
      },
      "outputs": [],
      "source": [
        "from transformers import BertTokenizer, TFBertForSequenceClassification\n",
        "from transformers import InputExample, InputFeatures\n",
        "import tensorflow as tf\n",
        "\n",
        "model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')\n",
        "tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Utility function to convert the data into the format required by BERT"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "uxgZ7GsEm_pH"
      },
      "outputs": [],
      "source": [
        "def convert_data_to_examples(train, test, DATA_COLUMN, LABEL_COLUMN): \n",
        "  train_InputExamples = train.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case\n",
        "                                                          text_a = x[DATA_COLUMN], \n",
        "                                                          text_b = None,\n",
        "                                                          label = x[LABEL_COLUMN]), axis = 1)\n",
        "\n",
        "  validation_InputExamples = test.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case\n",
        "                                                          text_a = x[DATA_COLUMN], \n",
        "                                                          text_b = None,\n",
        "                                                          label = x[LABEL_COLUMN]), axis = 1)\n",
        "  \n",
        "  return train_InputExamples, validation_InputExamples\n",
        "\n",
        "  \n",
        "def convert_examples_to_tf_dataset(examples, tokenizer, max_length=128):\n",
        "    features = [] # -> will hold InputFeatures to be converted later\n",
        "\n",
        "    for e in examples:\n",
        "        # Documentation is really strong for this method, so please take a look at it\n",
        "        input_dict = tokenizer.encode_plus(\n",
        "            e.text_a,\n",
        "            add_special_tokens=True,\n",
        "            max_length=max_length, # truncates if len(s) > max_length\n",
        "            return_token_type_ids=True,\n",
        "            return_attention_mask=True,\n",
        "            pad_to_max_length=True, # pads to the right by default # CHECK THIS for pad_to_max_length\n",
        "            truncation=True\n",
        "        )\n",
        "\n",
        "        input_ids, token_type_ids, attention_mask = (input_dict[\"input_ids\"],\n",
        "            input_dict[\"token_type_ids\"], input_dict['attention_mask'])\n",
        "\n",
        "        features.append(\n",
        "            InputFeatures(\n",
        "                input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=e.label\n",
        "            )\n",
        "        )\n",
        "\n",
        "    def gen():\n",
        "        for f in features:\n",
        "            yield (\n",
        "                {\n",
        "                    \"input_ids\": f.input_ids,\n",
        "                    \"attention_mask\": f.attention_mask,\n",
        "                    \"token_type_ids\": f.token_type_ids,\n",
        "                },\n",
        "                f.label,\n",
        "            )\n",
        "\n",
        "    return tf.data.Dataset.from_generator(\n",
        "        gen,\n",
        "        ({\"input_ids\": tf.int32, \"attention_mask\": tf.int32, \"token_type_ids\": tf.int32}, tf.int64),\n",
        "        (\n",
        "            {\n",
        "                \"input_ids\": tf.TensorShape([None]),\n",
        "                \"attention_mask\": tf.TensorShape([None]),\n",
        "                \"token_type_ids\": tf.TensorShape([None]),\n",
        "            },\n",
        "            tf.TensorShape([]),\n",
        "        ),\n",
        "    )\n"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "BERT model for training"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "tPsHpWhJm_pH",
        "outputId": "a9f7b2b8-d0bb-474b-d91a-25f0c8a40905"
      },
      "outputs": [],
      "source": [
        "DATA_COLUMN = 'text'\n",
        "LABEL_COLUMN = 'airline_sentiment'\n",
        "\n",
        "\n",
        "train_InputExamples, validation_InputExamples = convert_data_to_examples(train, test, DATA_COLUMN, LABEL_COLUMN)\n",
        "\n",
        "train_data = convert_examples_to_tf_dataset(list(train_InputExamples), tokenizer)\n",
        "train_data = train_data.shuffle(100).batch(32).repeat(2)\n",
        "\n",
        "validation_data = convert_examples_to_tf_dataset(list(validation_InputExamples), tokenizer)\n",
        "validation_data = validation_data.batch(32)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "GDcgmUOCm_pI",
        "outputId": "2f262b78-65f4-4cfc-deb3-a51d2b499eab"
      },
      "outputs": [],
      "source": [
        "model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0), \n",
        "              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), \n",
        "              metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])\n",
        "\n",
        "model.fit(train_data, epochs=2, validation_data=validation_data)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Saving the trained weights"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "K3pzOJS8R1dx"
      },
      "outputs": [],
      "source": [
        "model.save_weights(\"weights.h5\")"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Inference: Predicting the sentiment of the tweet"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 143
        },
        "id": "BLcz_yKOr38C",
        "outputId": "a56245b6-395a-4ff1-aeba-ab30c98cedb1"
      },
      "outputs": [],
      "source": [
        "pred_data = [\"@abc The flight was great\", \"@abc ☹️\",\"🎊 it was bad experience\"]\n",
        "pred_data = pd.DataFrame(pred_data)\n",
        "\n",
        "\n",
        "for i,r in pred_data.iterrows():\n",
        "  pred_data.loc[i,0] = emoji.demojize(r[0])\n",
        "  pred_data.loc[i,0] = r[0].replace(\":\",\" \")\n",
        "  pred_data.loc[i,0] = ' '.join(r[0].split())\n",
        "\n",
        "\n",
        "pred_data[0] = pred_data[0].str.replace(\"@[A-Za-z0-9]+\", \"\",regex=True)\n",
        "pred_data[0] = pred_data[0].str.replace(\"#\", \"\",regex=True)\n",
        "pred_data[0] = pred_data[0].str.replace(\"https?://[A-Za-z0-9./]+\", \"\",regex=True)\n",
        "pred_data[0] = pred_data[0].str.replace(\"[^a-zA-Z.!?']\", \" \",regex=True)\n",
        "\n",
        "pred_data.head()\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "lrTfXZzLsKd6",
        "outputId": "0f2c6e24-bc62-45a6-bc71-9b6918ab961e"
      },
      "outputs": [],
      "source": [
        "pred_data = pred_data[0].values.tolist()\n",
        "print(pred_data)\n",
        "tf_batch = tokenizer(pred_data, max_length=128, padding=True, truncation=True, return_tensors='tf')\n",
        "tf_outputs = model(tf_batch)\n",
        "tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)\n",
        "labels = ['Negative','Positive']\n",
        "label = tf.argmax(tf_predictions, axis=1)\n",
        "label = label.numpy()\n",
        "for i in range(len(pred_data)):\n",
        "  print(pred_data[i], \": \\n\", labels[label[i]])"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "provenance": []
    },
    "gpuClass": "standard",
    "kernelspec": {
      "display_name": "venv",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.9 (tags/v3.10.9:1dd9be6, Dec  6 2022, 20:01:21) [MSC v.1934 64 bit (AMD64)]"
    },
    "vscode": {
      "interpreter": {
        "hash": "497fb9213e55408ad8b1ca9a37e341ac93888d86a532670599a03e0c8054f45a"
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}