The Entailment model is a pre-trained classifier that produces an entailment score for fact-verification purposes.

Specifically, we fine-tune NorBERT on a machine-translated version of the VitaminC dataset, which is designed to determine whether a piece of evidence supports a given claim and is therefore well suited for training a model to judge whether a given context entails the generated text. We then use the fine-tuned model as our Entailment model.
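The fine-tuning itself is standard three-way sequence classification. Below is a minimal sketch of such a run, not the authors' exact recipe: it assumes the machine-translated VitaminC data sits in a CSV with hypothetical "evidence", "claim", and integer "label" columns (0 = contradict, 1 = entail, 2 = neutral, matching the mapping used in the evaluation code further down), and the checkpoint id, file name, and hyperparameters are likewise assumptions.

import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)

base_model = "ltg/norbert"  # assumed NorBERT checkpoint id on the Hub
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = BertForSequenceClassification.from_pretrained(base_model, num_labels=3)

class PairDataset(Dataset):
    """Evidence/claim pairs joined with [SEP], mirroring the prompt format below."""
    def __init__(self, df):
        texts = [e + " [SEP] " + c for e, c in zip(df["evidence"], df["claim"])]
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=512)
        self.labels = df["label"].tolist()
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

train_ds = PairDataset(pd.read_csv("vitaminc_no_train.csv"))  # hypothetical path
args = TrainingArguments(output_dir="entailment-ft",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=train_ds).train()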

Prompt format:

{article}[SEP]{positive_sample}

Inference format:

{article}[SEP]{generated_text}
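In code, both formats amount to simple string concatenation. Illustrative only, assuming article, positive_sample, and generated_text are Python strings:

train_input = article + "[SEP]" + positive_sample    # fine-tuning: article paired with the gold reference
infer_input = article + "[SEP]" + generated_text     # inference: article paired with the model output

Note that the evaluation code below pads the separator with spaces (' [SEP] '); the BERT tokenizer recognizes [SEP] as a special token in either variant.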

Run the Model

import torch
from transformers import AutoTokenizer, BertForSequenceClassification

model_id = "NorGLM/Entailment"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

model = BertForSequenceClassification.from_pretrained(model_id)

# Move the model to GPU if one is available; `device` is reused in the
# evaluation loop below.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
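Before running the batched evaluation below, a quick sanity check on a single pair could look like this. It is a minimal sketch: the Norwegian example strings are made up, and the label mapping (1 = entailment, 0 = contradiction, 2 = neutral) follows the comment in the evaluation code.

article = "Regjeringen la fram statsbudsjettet i dag."    # "The government presented the state budget today."
generated_text = "Statsbudsjettet ble lagt fram i dag."   # "The state budget was presented today."

model.eval()
inputs = tokenizer(article + " [SEP] " + generated_text,
                   return_tensors="pt", truncation=True, max_length=512).to(device)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())  # 1 = entailment, 0 = contradiction, 2 = neutral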

Inference Example

import numpy as np
from torch.utils.data import TensorDataset, DataLoader

def entailment_score(texts, references, generated_texts):
    # Label mapping: Entailment = 1, Contradict = 0, Neutral = 2.
    # `references` is accepted for interface symmetry but not used in the score.
    # Concatenate news articles and generated summaries as model input.
    input_texts = [t + ' [SEP] ' + g for t, g in zip(texts, generated_texts)]
    # Maximum sequence length, following the NorBERT config.
    MAX_LEN = 512
    batch_size = 16

    test_inputs = tokenizer(text=input_texts, add_special_tokens=True,
                            return_attention_mask=True, return_tensors="pt",
                            padding=True, truncation=True, max_length=MAX_LEN)
    validation_data = TensorDataset(test_inputs['input_ids'], test_inputs['attention_mask'])
    validation_dataloader = DataLoader(validation_data, batch_size=batch_size)

    model.eval()

    results = []
    for batch in validation_dataloader:
        # Move the batch to the same device as the model.
        batch = tuple(t.to(device) for t in batch)
        # Unpack the inputs from the dataloader.
        b_input_ids, b_input_mask = batch
        # No gradients are needed at inference time; this saves memory and speeds up evaluation.
        with torch.no_grad():
            # Forward pass, calculate logit predictions.
            logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

        # Move logits to CPU and take the argmax as the predicted class.
        logits = logits[0].to('cpu').numpy()
        pred_flat = np.argmax(logits, axis=1).flatten()
        results.extend(pred_flat)

    ent_ratio = results.count(1) / float(len(results))
    neu_ratio = results.count(2) / float(len(results))
    con_ratio = results.count(0) / float(len(results))
    print("Entailment ratio: {}; Neutral ratio: {}; Contradict ratio: {}.".format(ent_ratio, neu_ratio, con_ratio))
    return ent_ratio, neu_ratio, con_ratio

import pandas as pd

# Load the evaluation texts.
eva_file_name = "<input csv file for evaluation>"  # placeholder: path to your evaluation CSV
eval_df = pd.read_csv(eva_file_name)

# Mask cells that only contain the stray tokenizer warning string, then drop incomplete rows.
remove_str = 'Token indices sequence length is longer than 2048.'
eval_df = eval_df[eval_df != remove_str]
eval_df = eval_df.dropna()

references = eval_df['positive_sample'].to_list()
hypo_list = eval_df['generated_text'].to_list()
articles = eval_df['article'].to_list()
ent_ratio, neu_ratio, con_ratio = entailment_score(articles, references, hypo_list)

Citation Information

If you find our work helpful, please cite our paper:

@article{liu2023nlebench+,
  title={NLEBench+ NorGLM: A Comprehensive Empirical Analysis and Benchmark Dataset for Generative Language Models in Norwegian},
  author={Liu, Peng and Zhang, Lemei and Farup, Terje Nissen and Lauvrak, Even W and Ingvaldsen, Jon Espen and Eide, Simen and Gulla, Jon Atle and Yang, Zhirong},
  journal={arXiv preprint arXiv:2312.01314},
  year={2023}
}