## Data Collection 🛠

The SubjQA dataset is built from publicly available review datasets. The movies, books, electronics, and grocery categories use reviews from the Amazon Review dataset. The TripAdvisor category, as the name suggests, uses reviews from TripAdvisor, which can be found [here](link). Finally, the restaurants category uses the Yelp Dataset, which is also publicly available.

The process of constructing SubjQA is discussed in detail in our paper. In a nutshell, the dataset construction consists of the following steps:

1. First, all opinions expressed in reviews are extracted. In the pipeline, each opinion is modeled as a (modifier, aspect) pair, i.e., a pair of spans where the former describes the latter *(e.g., "good, hotel" and "terrible, acting" are extracted opinions)*.
2. Using Matrix Factorization techniques, implication relationships between different expressed opinions are mined. For instance, the system mines that "responsive keys" implies "good keyboard". In our pipeline, we refer to the conclusion of an implication (i.e., "good keyboard" in this example) as the query opinion, and we refer to the premise (i.e., "responsive keys") as its neighboring opinion.
3. Annotators are then asked to write a question based on query opinions. For instance, given "good keyboard" as the query opinion, they might write "Is this keyboard any good?"
4. Each question written based on a query opinion is then paired with a review that mentions its neighboring opinion. In our example, that would be a review that mentions "responsive keys".
5. The question and review pairs are presented to annotators to select the correct answer span, and rate the subjectivity level of the question as well as the subjectivity level of the highlighted answer span.
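
The pairing in steps 1 to 4 can be sketched in a few lines of Python. This is only an illustration: `Opinion`, `pair_questions_with_reviews`, and the toy dictionaries are hypothetical names that do not come from the SubjQA pipeline, and the implications are taken as given rather than mined with matrix factorization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    """A (modifier, aspect) pair, e.g. ("good", "keyboard")."""
    modifier: str
    aspect: str


def pair_questions_with_reviews(implications, questions, reviews):
    """Pair each question with reviews that mention its neighboring opinion.

    implications: dict mapping a neighboring opinion to the query opinion it implies
    questions:    dict mapping a query opinion to the questions written for it
    reviews:      dict mapping a review id to the set of opinions it mentions
    """
    pairs = []
    for review_id, opinions in reviews.items():
        for neighbor in opinions:
            query = implications.get(neighbor)
            if query is None:
                continue  # this opinion implies no known query opinion
            for question in questions.get(query, []):
                pairs.append((question, review_id))
    return pairs


responsive_keys = Opinion("responsive", "keys")
good_keyboard = Opinion("good", "keyboard")

implications = {responsive_keys: good_keyboard}
questions = {good_keyboard: ["Is this keyboard any good?"]}
reviews = {
    "r1": {responsive_keys},
    "r2": {Opinion("terrible", "acting")},
}

pairs = pair_questions_with_reviews(implications, questions, reviews)
print(pairs)
```

Only review `r1` mentions the neighboring opinion "responsive keys", so it is the only review paired with the keyboard question. Step 5 (span selection and subjectivity rating) is done by human annotators and is not modeled here.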

## Data Format 📊

All files are in standard CSV format, and they consist of the following columns:

- **domain**: The category/domain of the review (e.g., hotels, books, ...).
- **question**: The question (written based on a query opinion).
- **review**: The review (that mentions the neighboring opinion).
- **human_ans_spans**: The span labeled by annotators as the answer.
- **human_ans_indices**: The (character-level) start and end indices of the answer span highlighted by annotators.
- **question_subj_level**: The subjectivity level of the question (on a 1 to 5 scale with 1 being the most subjective).
- **ques_subj_score**: The subjectivity score of the question computed using the TextBlob package.
- **is_ques_subjective**: A boolean subjectivity label derived from question_subj_level (i.e., scores below 4 are considered as subjective).
- **answer_subj_level**: The subjectivity level of the answer span (on a 1 to 5 scale with 1 being the most subjective).
- **ans_subj_score**: The subjectivity score of the answer span computed using the TextBlob package.
- **is_ans_subjective**: A boolean subjectivity label derived from answer_subj_level (i.e., scores below 4 are considered as subjective).
- **nn_mod**: The modifier of the neighboring opinion (which appears in the review).
- **nn_asp**: The aspect of the neighboring opinion (which appears in the review).
- **query_mod**: The modifier of the query opinion (around which a question is manually written).
- **query_asp**: The aspect of the query opinion (around which a question is manually written).
- **item_id**: The id of the item/business discussed in the review.
- **review_id**: A unique id associated with the review.
- **q_review_id**: A unique id assigned to the question-review pair.
- **q_reviews_id**: A unique id assigned to all question-review pairs with a shared question.
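
As a rough illustration of how these columns relate, the sketch below reads one row, recovers the answer span from the character-level indices, and derives the boolean subjectivity label from the 1 to 5 level. The row itself is invented for the example and only a subset of the columns is included; the standard `csv` module is used to keep the snippet self-contained.

```python
import ast
import csv
import io

# Toy CSV with a subset of the columns described above (values are invented).
toy_csv = io.StringIO(
    'domain,question,review,human_ans_indices,question_subj_level\n'
    'electronics,Is this keyboard any good?,'
    '"The keys are responsive, so typing feels great.","(4, 23)",2\n'
)

row = next(csv.DictReader(toy_csv))

# human_ans_indices is stored as a string such as "(4, 23)".
start, end = ast.literal_eval(row["human_ans_indices"])
answer_span = row["review"][start:end]

# Levels below 4 are considered subjective.
is_subjective = int(row["question_subj_level"]) < 4

print(answer_span, is_subjective)
```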

### Citation
Johannes Bjerva, Nikita Bhutani, Behzad Golshan, Wang-Chiew Tan, and Isabelle Augenstein. (2020). SubjQA: A Dataset for Subjectivity and Review Comprehension. In *Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing*. Association for Computational Linguistics.

In [None]:
from google.colab import userdata

# Retrieve the value stored under the Colab secret named 'HuggingFace'
secret_name = userdata.get('HuggingFace')

# Set up Git configuration
!git config --global user.email "kagantimur@icloud.com"
!git config --global user.name "kgntmr"

# Log in to the Hugging Face Hub
!huggingface-cli login

In [14]:
from huggingface_hub import notebook_login

notebook_login()


In [15]:
from datasets import load_dataset
import datasets
from transformers import AutoTokenizer

In [16]:
model = "deepset/roberta-base-squad2"
tokenizer = AutoTokenizer.from_pretrained(model)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [17]:
# Check whether this is a "fast" tokenizer.
# Fast tokenizers are backed by the Rust `tokenizers` library and provide the
# character offset mapping that the preprocessing functions below rely on.
tokenizer.is_fast

True

In [19]:
# # Define the maximum length and stride parameters for tokenization
# max_length = 384 # 384 is commonly used as it is sufficient to cover a significant portion of most input
# stride = 128 # 128 is often used as it provides a good balance between capturing context and avoiding redundancy

# # Define a function to preprocess training samples
# def preprocess_training_samples(samples):
#     # Extract questions from the samples dictionary and strip whitespace
#     questions = [q.strip() for q in samples["question"]]

#     # Tokenize questions and contexts using the tokenizer
#     inputs = tokenizer(
#         questions,
#         samples["context"],
#         max_length=max_length,
#         truncation="only_second",
#         stride=stride,
#         return_overflowing_tokens=True,
#         return_offsets_mapping=True,
#         padding="max_length",
#     )

#     # Extract offset_mapping, sample_map, and answers from the tokenized inputs
#     offset_mapping = inputs.pop("offset_mapping")
#     sample_map = inputs.pop("overflow_to_sample_mapping")
#     answers = samples["answers"]

#     # Initialize lists to store start and end positions of answers
#     start_positions = []
#     end_positions = []

#     # Iterate over the offset_mapping to process each tokenized input
#     for i, offset in enumerate(offset_mapping):
#         # Get the sample index for the current tokenized input
#         sample_idx = sample_map[i]

#         # Get the answer text and start position from the answers dictionary
#         answer = answers[sample_idx]
#         start_char = answer["answer_start"][0]
#         end_char = answer["answer_start"][0] + len(answer["text"][0])

#         # Get the sequence_ids to identify the start and end of the context
#         sequence_ids = inputs.sequence_ids(i)

#         # Find the start and end token positions of the context
#         idx = 0
#         while sequence_ids[idx] != 1:
#             idx += 1
#         context_start = idx
#         while sequence_ids[idx] == 1:
#             idx += 1
#         context_end = idx - 1

#         # If the answer is not fully inside this chunk, label it (0, 0); otherwise find its token positions
#         if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
#             start_positions.append(0)
#             end_positions.append(0)
#         else:
#             idx = context_start
#             while idx <= context_end and offset[idx][0] <= start_char:
#                 idx += 1
#             start_positions.append(idx - 1)

#             idx = context_end
#             while idx >= context_start and offset[idx][1] >= end_char:
#                 idx -= 1
#             end_positions.append(idx + 1)

#     # Add start and end positions to the inputs dictionary
#     inputs["start_positions"] = start_positions
#     inputs["end_positions"] = end_positions

#     # Return the modified inputs dictionary
#     return inputs

In [22]:
import pandas as pd
df_train=pd.read_csv('/content/drive/MyDrive/subjqa-train.csv')
df_test=pd.read_csv('/content/drive/MyDrive/subjqa-test.csv')

In [20]:
# Define the maximum length and stride parameters for tokenization
max_length = 384  # Maximum number of tokens per feature (question + context chunk)
stride = 128  # Number of tokens shared by consecutive chunks of a long context

# Define a function to preprocess training samples
def preprocess_training_samples(samples):
    # Dataset.map(batched=True) already passes the samples in batches,
    # so no additional batching is needed inside the function
    questions = [q.strip() for q in samples["question"]]

    # Tokenize questions and contexts; long contexts overflow into several features
    inputs = tokenizer(
        questions,
        samples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # offset_mapping gives the character span of each token;
    # sample_map gives the index of the sample each feature came from
    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = samples["answers"]

    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        # Look up the answer of the sample this feature belongs to
        answer = answers[sample_map[i]]
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the first and last token of the context (sequence id 1)
        context_start = sequence_ids.index(1)
        context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

        # If the answer is not fully inside this chunk, label it (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise locate the tokens where the answer starts and ends
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    # One integer start/end token position per feature
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
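
To make the roles of `max_length` and `stride` more concrete, here is a simplified, tokenizer-free sketch of how a long context is cut into overlapping windows. The helper `chunk_spans` and the `capacity` value are hypothetical: `capacity` stands for the number of context tokens left in a 384-token feature, assuming about 14 tokens go to the question and special tokens. The real tokenizer may differ in details.

```python
def chunk_spans(num_context_tokens, capacity, stride):
    """Return (start, end) token spans, end exclusive, that cover the context.

    capacity: number of context tokens that fit into one feature
    stride:   number of tokens shared by two consecutive spans
    """
    if stride >= capacity:
        raise ValueError("stride must be smaller than capacity")
    spans = []
    start = 0
    while True:
        end = min(start + capacity, num_context_tokens)
        spans.append((start, end))
        if end == num_context_tokens:
            break
        # The next window starts `stride` tokens before the previous one ended.
        start = end - stride
    return spans


spans = chunk_spans(num_context_tokens=800, capacity=370, stride=128)
print(spans)  # [(0, 370), (242, 612), (484, 800)]
```

An 800-token review therefore yields three features under these assumptions, and each pair of neighboring windows shares 128 tokens, so an answer near a window boundary is likely to appear whole in at least one of them.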

In [23]:
df_train.head()

| | item_id | domain | nn_mod | nn_asp | query_mod | query_asp | q_review_id | q_reviews_id | question | question_subj_level | ques_subj_score | is_ques_subjective | review_id | review | human_ans_spans | human_ans_indices | answer_subj_level | ans_subj_score | is_ans_subjective |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B00BVMXBDO | movies | addictive | show | full | series | d9a9615d45df2f6e6108db4ca46bfded | 399f1046fe6bd97990107f9d7aa86f4a | Who is the author of this series? | 1 | 0.0 | False | 090671369dddfeb02db9bf7125a47c79 | Whether it be in her portrayal of a nerdy lesb... | ANSWERNOTFOUND | (251, 265) | 1 | 0.0 | False |
| 1 | 1404918051 | movies | enough simple | film | charming | movie | 06ffe37a8023636a3ce00b020a517e87 | 42d9dd5b0c67150cac1e13308811cbb5 | Can we enjoy the movie along with our family ? | 1 | 0.5 | False | a29821121e74d319cb93f77101e99c88 | An outstanding romantic comedy, 13 Going on 30... | ANSWERNOTFOUND | (1195, 1209) | 1 | 0.0 | False |
| 2 | B0000633ZP | movies | weak | plot | bad | one | 3b625c68e91b9e6987a08b84a9a9d234 | 32d06ccf2132cda644aea791fa688c53 | Does this one good? | 5 | 0.6 | True | 12a1b821f761bd19a75be7b16cef4a7c | To let the truth be known, I watched this movi... | ANSWERNOTFOUND | (1476, 1490) | 5 | 0.0 | False |
| 3 | B0000AQS0F | movies | outstanding | show | wonderful | series | f3abfa98b011127e7cb49bcd07f8deeb | e546636f0bb9f93d5f24b4ade9ebab45 | Is this series good and excelent? | 1 | 0.6 | True | cd0f92322e67cc9d70de6674caace78c | At the time of my review, there had been 910 c... | this show is OUTSTANDING | (296, 320) | 1 | 0.875 | True |
| 4 | B003Y5H5FG | movies | great | production design | great | costume design | 1b03744e764b257592c2c768345c14bc | a0a97e460a194bcb3286fe68d20aadc2 | How is the costume design? | 1 | 0.0 | False | f6b5024393ebc70287befdaf47a50b75 | "Fright Night" is great! This is how the story... | The costume design by Susan Matheson is great | (1254, 1299) | 1 | 0.75 | True |


In [29]:
df_train.info()

In [30]:
df_train.columns

Index(['item_id', 'domain', 'nn_mod', 'nn_asp', 'query_mod', 'query_asp',
       'q_review_id', 'q_reviews_id', 'question', 'question_subj_level',
       'ques_subj_score', 'is_ques_subjective', 'review_id', 'review',
       'human_ans_spans', 'human_ans_indices', 'answer_subj_level',
       'ans_subj_score', 'is_ans_subjective'],
      dtype='object')

In [34]:
df_train.iloc[0:10].question

0                    Who is the author of this series?
1       Can we enjoy the movie along with our family ?
2                                  Does this one good?
3                    Is this series good and excelent?
4                           How is the costume design?
5                         How are the special effects?
6                              Do you have any credit?
7                           How do you like the story?
8    What criticism deserves the movie Passion of C...
9             How much is missing from the collection?
Name: question, dtype: object

In [56]:
df_train.iloc[2].review

"To let the truth be known, I watched this movie with a mix of anticipation and fear. Being an avid Star Wars fan, I was excited to see any Star Wars movie, but I suspected this would be as disappointing as the Phantom Menace. WRONG! Although this doesn't even come close to the great casting and story lines and sheer art of the first three Star Wars series, it was WAY better than Phantom Menace for the following reasons: 1) This movie included LESS Jar-Jar, which, despite initial heavy marketing for the first movie, the character was found by the general consensus to be REALLY annoying. 2) This movie demonstrated some of the political turmoil behind the original Star Wars movies. 3) You get to see some of what led Anakin to turn over to the Dark Side. Finally, the special effects were really good!It was not 4 or 5 stars because the actors that were cast in this movie (as well as The Phantom Menace) are all well known for other cinematic accomplishments, and it was hard to believe that 

In [57]:
df_train.iloc[2].human_ans_indices

'(1476, 1490)'

In [59]:
df_train.iloc[2].review[1476:1490]

'ANSWERNOTFOUND'

In [60]:
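# Keep only the columns needed for question answering.
# .copy() makes independent DataFrames, which avoids SettingWithCopyWarning when new columns are added later.
df_train=df_train[['question','human_ans_indices','review','human_ans_spans']].copy()
df_test=df_test[['question','human_ans_indices','review','human_ans_spans']].copy()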

In [61]:
# Assign each row a sequential integer ID (0 to len - 1)
import numpy as np
df_train['id']=np.arange(len(df_train))
df_test['id']=np.arange(len(df_test))

# Convert the IDs to strings
df_train['id']=df_train['id'].astype(str)
df_test['id']=df_test['id'].astype(str)



In [63]:
df_train.info()

In [64]:
df_test.info()

In [68]:
int(df_train.iloc[0].human_ans_indices.split('(')[1].split(',')[0])

251

In [67]:
float(df_train.iloc[0].human_ans_indices.split('(')[1].split(',')[1].split(' ')[1].split(')')[0])

265.0
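
The chained `split` calls work, but they are easy to get wrong. As an alternative sketch, the standard library's `ast.literal_eval` can parse the `"(start, end)"` string directly. The helper names `parse_span` and `to_squad_answer` and the toy review are hypothetical and only illustrate the idea.

```python
import ast


def parse_span(indices):
    """Parse a string such as "(251, 265)" into a (start, end) tuple of ints."""
    start, end = ast.literal_eval(indices)
    return int(start), int(end)


def to_squad_answer(review, indices):
    """Build a SQuAD-style answers dict from a review and its index string."""
    start, end = parse_span(indices)
    return {"text": [review[start:end]], "answer_start": [start]}


span = parse_span("(251, 265)")
print(span)  # (251, 265)

# Toy review (invented) ending in the ANSWERNOTFOUND marker.
toy_review = "Great show. ANSWERNOTFOUND"
answer = to_squad_answer(toy_review, "(12, 26)")
print(answer)  # {'text': ['ANSWERNOTFOUND'], 'answer_start': [12]}
```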

In [70]:
# Create an 'answers' column, initialized with the annotated answer text.
# It is overwritten below with SQuAD-style dicts (answer text and start index).
df_train['answers']=df_train['human_ans_spans']
df_test['answers']=df_test['human_ans_spans']



In [71]:
# Build a SQuAD-style answers dict (answer text and character start index) for each training row
for i in range(0,len(df_train)):
  answer1={}
  si=int(df_train.iloc[i].human_ans_indices.split('(')[1].split(',')[0])
  ei=int(df_train.iloc[i].human_ans_indices.split('(')[1].split(',')[1].split(' ')[1].split(')')[0])
  answer1['text']=[df_train.iloc[i].review[si:ei]]
  answer1['answer_start']=[si]
  df_train.at[i, 'answers']=answer1

In [72]:
print(df_train.iloc[i].answers,df_train.iloc[i].human_ans_spans)

In [73]:
# Same for the test data
for i in range(0,len(df_test)):
  answer1={}
  si=int(df_test.iloc[i].human_ans_indices.split('(')[1].split(',')[0])
  ei=int(df_test.iloc[i].human_ans_indices.split('(')[1].split(',')[1].split(' ')[1].split(')')[0])
  answer1['text']=[df_test.iloc[i].review[si:ei]]
  answer1['answer_start']=[si]
  df_test.at[i, 'answers']=answer1

In [74]:
print(df_test.iloc[i].answers,df_test.iloc[i].human_ans_spans)

In [75]:
df_train.columns

Index(['question', 'human_ans_indices', 'review', 'human_ans_spans', 'id',
       'answers'],
      dtype='object')

In [76]:
# Rename 'review' to 'context' to match the field name used by the preprocessing functions
df_train=df_train.rename(columns={'review': 'context'})
df_test=df_test.rename(columns={'review': 'context'})

In [78]:
# Creating Datasets from Pandas DataFrames for Validation and Training
val_dataset2 = datasets.Dataset.from_pandas(df_test)
train_dataset2 = datasets.Dataset.from_pandas(df_train)

In [81]:
# Apply the preprocessing function to the training dataset with .map()
train_dataset = train_dataset2.map(
    preprocess_training_samples,
    batched=True,
    remove_columns=train_dataset2.column_names,
)
# Compare the number of original examples with the number of preprocessed features
len(train_dataset2), len(train_dataset)

Map:   0%|          | 0/2501 [00:00<?, ? examples/s]

(2501, 4862)

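The progress bar shows that all 2501 examples were processed in 10 seconds, at about 260.48 examples per second. The resulting dataset has 4862 features, more than the 2501 original examples, because any context that does not fit within `max_length` tokens is split into several overlapping chunks, and each chunk becomes its own training feature.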
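
On average that is just under two features per example (4862 / 2501). The link between features and examples is the `overflow_to_sample_mapping` list returned by the tokenizer. The sketch below uses a small invented mapping (not taken from the actual run) to show how the two counts relate.

```python
from collections import Counter

# Hypothetical overflow_to_sample_mapping for three examples:
# example 0 was split into 2 features, example 1 into 1, example 2 into 3.
sample_map = [0, 0, 1, 2, 2, 2]

features_per_example = Counter(sample_map)
num_examples = len(features_per_example)
num_features = len(sample_map)

print(num_examples, num_features)  # 3 6
print(dict(features_per_example))  # {0: 2, 1: 1, 2: 3}
```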

In [82]:
train_dataset2.shape

(2501, 6)

In [83]:
def preprocess_validation_examples(examples):
    questions = [q.strip() for q in examples["question"]] # Cleaning the questions by stripping leading and trailing whitespace for consistency
    inputs = tokenizer( # Tokenization; converting questions and contexts into numerical IDs, enabling the model to understand
        questions,
        examples["context"],
        max_length=max_length, # Total length of the input sequence
        truncation="only_second", # If the total length exceeds max_length, only the context will be truncated
        stride=stride, # Overlap between the chunks
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs
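
The validation features keep `offset_mapping` (with `None` for non-context tokens) and `example_id` so that predicted token positions can later be mapped back to text in the original review. The sketch below shows one simplified way this could be done, assuming a fine-tuned model has produced start and end logits for a feature. The function `best_answer`, the logits, and the offsets are all invented for illustration, and a real post-processing step would typically also combine features that share an `example_id`.

```python
def best_answer(context, offsets, start_logits, end_logits, max_answer_len=30):
    """Return the highest-scoring (text, score) span inside the context.

    offsets holds a (char_start, char_end) pair per token, or None for
    tokens outside the context (question, special tokens, padding).
    """
    best_score, best_text = float("-inf"), ""
    for s, s_off in enumerate(offsets):
        if s_off is None:
            continue
        # Only consider end tokens at or after the start token.
        for e in range(s, min(s + max_answer_len, len(offsets))):
            e_off = offsets[e]
            if e_off is None:
                continue
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best_score = score
                best_text = context[s_off[0]:e_off[1]]
    return best_text, best_score


context = "The keys are responsive."
# Toy feature: three non-context tokens, five context tokens, one closing special token.
offsets = [None, None, None, (0, 3), (4, 8), (9, 12), (13, 23), (23, 24), None]
start_logits = [0.1, 0.0, 0.0, 0.2, 2.5, 0.3, 0.4, 0.0, 0.0]
end_logits = [0.1, 0.0, 0.0, 0.1, 0.3, 0.2, 3.0, 0.5, 0.0]

text, score = best_answer(context, offsets, start_logits, end_logits)
print(text, score)  # keys are responsive 5.5
```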

In [84]:
!git init

In [85]:
!git add Capstone-1-SubjQATransformer.ipynb