pipeline_tag: zero-shot-classification
license: apache-2.0
language:
  - en
tags:
  - zero-shot
  - text-classification
  - science
  - mag
widget:
  - text: Leo Messi is the best player ever
    candidate_labels: politics, science, sports, environment
    multi_class: true

SCIroShot

Overview

  • Model type: Language Model
  • Architecture: RoBERTa-large
  • Language: English
  • License: Apache 2.0
  • Task: Zero-Shot Text Classification
  • Data: Microsoft Academic Graph
  • Additional Resources:
    • Paper (published at EACL 2023)
    • GitHub

Model description

SCIroShot is an entailment-based zero-shot text classification model fine-tuned on a purpose-built dataset of scientific articles from Microsoft Academic Graph (MAG). The resulting model achieves state-of-the-art performance in the scientific domain and very competitive results in other areas.

Intended Usage

This model is intended to be used for zero-shot text classification in English.

How to use

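from transformers import pipeline

# Load the model into a zero-shot classification pipeline
zstc = pipeline("zero-shot-classification", model="BSC-LT/sciroshot")

sentence = "Leo Messi is the best player ever."
candidate_labels = ["politics", "science", "sports", "environment"]
# Each candidate label is inserted into the template to form the hypothesis
template = "This example is {}"

# multi_label=False normalizes the scores across labels so that they sum to 1
output = zstc(sentence, candidate_labels, hypothesis_template=template, multi_label=False)

# Labels are returned sorted from highest to lowest score
print(output)
print(f'Predicted class: {output["labels"][0]}')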

Limitations and bias

No measures have been taken to estimate the bias and toxicity embedded in the model.

Even though the fine-tuning data (which is of a scientific nature) may seem harmless, it is important to note that the corpus used to pre-train the vanilla model is very likely to contain a lot of unfiltered content from the internet, as stated in the RoBERTa-large model card.

Training

Training data

Our data builds on scientific-domain annotated data from Microsoft Academic Graph (MAG). This database consists of a heterogeneous graph with billions of records from both scientific publications and patents, in addition to metadata such as authors, institutions, journals, conferences and their citation relationships. The documents are organized in a six-level hierarchical structure of scientific concepts, where the two top-most levels are manually curated in order to guarantee a high level of accuracy.

To create the training corpus, a random sample of scientific articles with a publication year between 2000 and 2021 was retrieved from MAG, together with their titles and abstracts in English. This yielded over 2M documents with their corresponding Field Of Study, which was obtained from the 1-level MAG taxonomy (292 possible classes, such as "Computational biology" or "Transport Engineering").

The fine-tuning dataset was constructed in a weakly supervised manner by converting text classification data to the entailment format. Using the relationship between scientific texts and their matching concepts in the 1-level MAG taxonomy, we generate the premise-hypothesis pairs corresponding to the entailment label. Conversely, we generate the pairs for the neutral label by discarding the actual relationship between the texts and their scientific concepts and pairing each text with concepts to which it is not matched.
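
The conversion described above can be illustrated with a minimal sketch. It is not the authors' actual pipeline: the function name `build_entailment_pairs`, the hypothesis template, the single label per document, the one-neutral-pair-per-document ratio, and the "Astrophysics" label are all assumptions made for illustration.

```python
import random


def build_entailment_pairs(docs, all_labels, template="This example is {}",
                           n_neutral=1, seed=0):
    """Turn (text, label) records into premise-hypothesis pairs.

    Each document yields one 'entailment' pair built from its matched label
    and `n_neutral` 'neutral' pairs built from labels it is NOT matched to.
    The template and the sampling ratio are illustrative assumptions.
    """
    rng = random.Random(seed)
    pairs = []
    for text, true_label in docs:
        # Positive pair: the text together with its matched concept.
        pairs.append({"premise": text,
                      "hypothesis": template.format(true_label),
                      "label": "entailment"})
        # Negative pairs: the same text with randomly drawn unmatched concepts.
        unmatched = [label for label in all_labels if label != true_label]
        for wrong in rng.sample(unmatched, k=min(n_neutral, len(unmatched))):
            pairs.append({"premise": text,
                          "hypothesis": template.format(wrong),
                          "label": "neutral"})
    return pairs


# Toy records; the texts are made up for this example.
docs = [
    ("We study protein folding with deep learning.", "Computational biology"),
    ("A model of urban traffic flow is proposed.", "Transport Engineering"),
]
labels = ["Computational biology", "Transport Engineering", "Astrophysics"]

pairs = build_entailment_pairs(docs, labels)
for pair in pairs:
    print(pair["label"], "|", pair["hypothesis"])
```

Each document contributes one entailment pair and one neutral pair here; the actual ratio of neutral to entailment pairs used for SCIroShot is not stated in this card.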

Training procedure

The newly created scientific dataset described in the previous section was used to fine-tune a 355M-parameter RoBERTa model on the entailment task. To classify a text, the model computes the entailment score between that text and every candidate label. The final prediction is the highest-scoring class in a single-label classification setup, or the N classes above a certain threshold in a multi-label scenario.
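
As a rough sketch of the decision rule described above, the snippet below turns per-label entailment scores into a prediction. The function name `predict`, the default threshold of 0.5, and the example scores are assumptions for illustration, not values taken from the paper.

```python
def predict(scores, multi_label=False, threshold=0.5):
    """Map per-label entailment scores to predicted labels.

    `scores` is a dict of candidate label -> entailment score in [0, 1].
    Single-label: return the highest-scoring label.
    Multi-label: return every label scoring at or above `threshold`,
    ordered from highest to lowest score.
    """
    if multi_label:
        kept = [label for label, score in scores.items() if score >= threshold]
        return sorted(kept, key=lambda label: scores[label], reverse=True)
    return [max(scores, key=scores.get)]


# Made-up scores for one input text.
scores = {"politics": 0.02, "science": 0.10, "sports": 0.97, "environment": 0.55}

print(predict(scores))                    # single-label setup
print(predict(scores, multi_label=True))  # multi-label setup
```

In the multi-label case the number of returned classes depends on the threshold, so it may be zero when no label is scored confidently.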

A subset of 52 labels from the training data was held out to serve as a development set of fully unseen classes. As a novelty, validation was not performed on the entailment task (which is used as a proxy) but directly on the target text classification task. This allows training to be stopped at the right time via early stopping, which prevents the model from "overfitting" to the training task. This method counteracts an effect discovered empirically during experimentation: after a certain point, the model can start to worsen on the target task (ZSTC) while still improving on the training task (RTE). Simply shortening the training time led to a boost in performance.
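
The idea of stopping on the target-task metric rather than on the proxy task can be sketched as follows. This is a generic early-stopping loop, not the authors' training code: the callables `train_one_epoch` and `evaluate_zstc`, the `patience` value, and the simulated score curve are all hypothetical.

```python
def train_with_early_stopping(train_one_epoch, evaluate_zstc,
                              max_epochs=10, patience=2):
    """Train until the target-task (ZSTC) dev score stops improving.

    `train_one_epoch(epoch)` runs one epoch on the entailment (RTE) data.
    `evaluate_zstc(epoch)` returns the ZSTC score on the held-out dev labels.
    Returns the best epoch and its score; a real setup would also save a
    checkpoint whenever a new best score is reached.
    """
    best_score, best_epoch, bad_epochs = float("-inf"), -1, 0
    for epoch in range(max_epochs):
        train_one_epoch(epoch)
        score = evaluate_zstc(epoch)
        if score > best_score:
            best_score, best_epoch, bad_epochs = score, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # target-task score has stopped improving
    return best_epoch, best_score


# Simulated ZSTC dev scores that peak and then degrade, mimicking the effect
# described above (the numbers are invented for this example).
zstc_curve = [0.40, 0.52, 0.58, 0.55, 0.51, 0.49]
epochs_run = []

best_epoch, best_score = train_with_early_stopping(
    train_one_epoch=epochs_run.append,          # stand-in for real training
    evaluate_zstc=lambda epoch: zstc_curve[epoch],
    max_epochs=len(zstc_curve),
)
print(best_epoch, best_score, len(epochs_run))
```

With this simulated curve, training halts two epochs after the peak instead of running to the end, even though the proxy-task loss (not modeled here) could still be decreasing.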

Read the paper for more details on the methodology and the analysis of RTE/ZSTC correlation.

Evaluation

Evaluation data

The model's performance was evaluated on a collection of text datasets labeled by discipline, both from the scientific domain (closer to the training data) and the general domain (to assess generalizability).

The following table provides an overview of the number of examples and labels for each dataset:

| Dataset | Labels | Size |
|---|---|---|
| arXiv | 11 | 3,838 |
| SciDocs-MeSH | 11 | 16,433 |
| SciDocs-MAG | 19 | 17,501 |
| Konstanz | 24 | 10,000 |
| Elsevier | 26 | 14,738 |
| PubMed | 109 | 5,000 |
| Topic Categorization (Yahoo! Answers) | 10 | 60,000 |
| Emotion Detection (UnifyEmotion) | 10 | 15,689 |
| Situation Frame Detection (Situation Typing) | 12 | 3,311 |

Please refer to the paper for further details on each particular dataset.

Evaluation results

These are the official results reported in the paper:

Scientific domain benchmark

| Model | arXiv | SciDocs-MeSH | SciDocs-MAG | Konstanz | Elsevier | PubMed |
|---|---|---|---|---|---|---|
| fb/bart-large-mnli | 33.28 | 66.18🔥 | 51.77 | 54.62 | 28.41 | 31.59🔥 |
| SCIroShot | 42.22🔥 | 59.34 | 69.86🔥 | 66.07🔥 | 54.42🔥 | 27.93 |

General domain benchmark

| Model | Topic | Emotion | Situation |
|---|---|---|---|
| RTE (Yin et al., 2019) | 43.8 | 12.6 | 37.2🔥 |
| FEVER (Yin et al., 2019) | 40.1 | 24.7 | 21.0 |
| MNLI (Yin et al., 2019) | 37.9 | 22.3 | 15.4 |
| NSP (Ma et al., 2021) | 50.6 | 16.5 | 25.8 |
| NSP-Reverse (Ma et al., 2021) | 53.1 | 16.1 | 19.9 |
| SCIroShot | 59.08🔥 | 24.94🔥 | 27.42 |

All numbers reported above are label-wise weighted F1 scores, except for the Topic classification dataset, which is evaluated in terms of accuracy, following Yin et al. (2019).
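
For readers unfamiliar with the two metrics, here is a small pure-Python sketch. It assumes that "label-wise weighted F1" means the usual support-weighted average of per-label F1 scores (what scikit-learn calls `average="weighted"`); the exact evaluation script used for the paper may differ, and the toy labels below are invented.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-label F1, weighting each label by its number of true examples."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_true - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += f1 * n_true / total
    return score


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy example with two labels and one misclassified document.
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]

print(round(weighted_f1(y_true, y_pred), 4))
print(accuracy(y_true, y_pred))
```

The tables above report these quantities as percentages, so the values returned here would be multiplied by 100 for comparison.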

Additional information

Authors

  • SIRIS Lab, Research Division of SIRIS Academic.
  • Language Technologies Unit, Barcelona Supercomputing Center.

Contact

For further information, send an email to either langtech@bsc.es or info@sirisacademic.com.

License

This work is distributed under the Apache License, Version 2.0.

Funding

This work was partially funded by two projects under the EU's H2020 Research and Innovation Programme:

  • INODE (grant agreement No 863410).
  • IntelComp (grant agreement No 101004870).

Citation

@inproceedings{pamies2023weakly,
  title={A weakly supervised textual entailment approach to zero-shot text classification},
  author={P{\`a}mies, Marc and Llop, Joan and Multari, Francesco and Duran-Silva, Nicolau and Parra-Rojas, C{\'e}sar and Gonz{\'a}lez-Agirre, Aitor and Massucci, Francesco Alessandro and Villegas, Marta},
  booktitle={Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics},
  pages={286--296},
  year={2023}
}

Disclaimer


The model published in this repository is intended for general-purpose use and is made available to third parties under the Apache License, Version 2.0.

Please keep in mind that the model may have bias and/or other undesirable distortions. When third parties deploy or provide systems and/or services to other parties using this model (or a system based on it), or become users of the model themselves, they should note that it is their responsibility to mitigate the risks arising from its use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owners and creators of the model be liable for any results arising from the use made by third parties.