README.md · knowledgator/comprehend_it-base at ba57149003bbc48cb61fed923db1e3f2607770a1

metadata

license: apache-2.0
datasets:
  - multi_nli
  - xnli
  - dbpedia_14
  - SetFit/bbc-news
  - squad_v2
  - race
language:
  - en
metrics:
  - accuracy
  - f1
library_name: transformers
pipeline_tag: zero-shot-classification
tags:
  - classification
  - information-extraction
  - zero-shot

comprehend_it-base

This is a model based on DeBERTaV3-base that was trained on natural language inference datasets as well as on multiple text classification datasets.

It demonstrates better quality on the diverse set of text classification datasets in a zero-shot setting than Bart-large-mnli while being almost 3 times smaller.

Moreover, the model can be used for multiple information extraction tasks in zero-shot setting, including:

Named-entity recognition;
Relation extraction;
Entity linking;
Question-answering;

With the zero-shot classification pipeline

The model can be loaded with the zero-shot-classification pipeline like so:

from transformers import pipeline
classifier = pipeline("zero-shot-classification",
                      model="knowledgator/comprehend_it-base")

You can then use this pipeline to classify sequences into any of the class names you specify.

sequence_to_classify = "one day I will see the world"
candidate_labels = ['travel', 'cooking', 'dancing']
classifier(sequence_to_classify, candidate_labels)
#{'labels': ['travel', 'dancing', 'cooking'],
# 'scores': [0.9938651323318481, 0.0032737774308770895, 0.002861034357920289],
# 'sequence': 'one day I will see the world'}

If more than one candidate label can be correct, pass multi_label=True to calculate each class independently:

candidate_labels = ['travel', 'cooking', 'dancing', 'exploration']
classifier(sequence_to_classify, candidate_labels, multi_label=True)
#{'labels': ['travel', 'exploration', 'dancing', 'cooking'],
# 'scores': [0.9945111274719238,
#  0.9383890628814697,
#  0.0057061901316046715,
#  0.0018193122232332826],
# 'sequence': 'one day I will see the world'}

With manual PyTorch

# pose sequence as a NLI premise and label as a hypothesis
from transformers import AutoModelForSequenceClassification, AutoTokenizer
nli_model = AutoModelForSequenceClassification.from_pretrained('knowledgator/comprehend_it-base')
tokenizer = AutoTokenizer.from_pretrained('knowledgator/comprehend_it-base')

premise = sequence
hypothesis = f'This example is {label}.'

# run through model pre-trained on MNLI
x = tokenizer.encode(premise, hypothesis, return_tensors='pt',
                     truncation_strategy='only_first')
logits = nli_model(x.to(device))[0]

# we throw away "neutral" (dim 1) and take the probability of
# "entailment" (2) as the probability of the label being true 
entail_contradiction_logits = logits[:,[0,2]]
probs = entail_contradiction_logits.softmax(dim=1)
prob_label_is_true = probs[:,1]

Benchmarking

Below, you can see the F1 score on several text classification datasets. All tested models were not fine-tuned on those datasets and were tested in a zero-shot setting.

Model	IMDB	AG_NEWS	Emotions
Bart-large-mnli (407 M)	0.89	0.6887	0.3765
Deberta-base-v3 (184 M)	0.85	0.6455	0.5095
Comprehendo (184M)	0.90	0.7982	0.5660

Future reading

Check our blogpost - "The new milestone in zero-shot capabilities (it’s not Generative AI).", where we highlighted possible use-cases of the model and why next-token prediction is not the only way to achive amazing zero-shot capabilites. While most of the AI industry is focused on generative AI and decoder-based models, we are committed to developing encoder-based models. We aim to achieve the same level of generalization for such models as their decoder brothers. Encoders have several wonderful properties, such as bidirectional attention, and they are the best choice for many information extraction tasks in terms of efficiency and controllability.