---
base_model: jjzha/jobbert-base-cased
model-index:
- name: jobbert-base-cased-compdecs
  results: []
license: mit
language:
- en
metrics:
- accuracy
pipeline_tag: text-classification
widget:
- text: "You must be proficient in Excel."
- text: "Would you like to join a major manufacturing company?"
---

_Nesta, the UK's innovation agency, has been scraping online job adverts since 2021 and building algorithms to extract and structure information as part of the [Open Jobs Observatory](https://www.nesta.org.uk/project/open-jobs-observatory/) project._

_Although we are unable to share the raw data openly, we aim to open source **our models, algorithms and tools** so that anyone can use them for their own research and analysis._

## 🖊️ Model description

This model is a fine-tuned version of [jjzha/jobbert-base-cased](https://huggingface.co/jjzha/jobbert-base-cased). JobBERT is a bert-base-cased checkpoint continuously pre-trained on ~3.2M sentences from job postings. It has been fine-tuned with a classification head to make a binary prediction of whether a job advert sentence is a `company description` or not.

The model was trained on **486 manually labelled company description sentences** and **1,000 non-company-description sentences**, each under 250 characters in length. It achieves the following results on a held-out test set of 147 sentences:

- Accuracy: 0.92157

| Label | Precision | Recall | F1-score | Support |
| ----------------------- | -------- | -------- | -------- | ------- |
| not company description | 0.930693 | 0.959184 | 0.944724 | 98 |
| company description | 0.913043 | 0.857143 | 0.884211 | 49 |

The code for training the model is in our [ojd_daps_language_models repo](https://github.com/nestauk/ojd_daps_language_models), a central repository for fine-tuning transformer models on our database of scraped job adverts.

## 🖨️ Use

To use the model:

```
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model = AutoModelForSequenceClassification.from_pretrained("nestauk/jobbert-base-cased-compdecs")
tokenizer = AutoTokenizer.from_pretrained("nestauk/jobbert-base-cased-compdecs")

comp_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
```

An example use is as follows:

```
job_sent = "Would you like to join a major manufacturing company?"
comp_classifier(job_sent)

>> [{'label': 'LABEL_1', 'score': 0.9953641891479492}]
```

The intended use of this model is to extract company descriptions from online job adverts for downstream tasks such as mapping them to [Standardised Industrial Classification (SIC)](https://www.gov.uk/government/publications/standard-industrial-classification-of-economic-activities-sic) codes (a worked sketch appears at the end of this card).

### ⚖️ Training hyperparameters

The following hyperparameters were used during training (a reproduction sketch appears at the end of this card):

- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 10

### ⚖️ Training results

The fine-tuning metrics are as follows:

- eval_loss: 0.4622
- eval_runtime: 0.6293
- eval_samples_per_second: 233.582
- eval_steps_per_second: 15.89
- epoch: 10.0
- perplexity: 1.59

### ⚖️ Framework versions

- Transformers 4.32.0
- PyTorch 2.0.1+cu118
- Datasets 2.14.4
- Tokenizers 0.13.3
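
As a worked sketch of the intended downstream use described above, the snippet below classifies each sentence of a job advert and keeps only the company-description sentences. The advert sentences are hypothetical (sentence segmentation is assumed to happen upstream), and it assumes `LABEL_1` is the company-description class, as the example in the Use section suggests.

```
from transformers import pipeline

comp_classifier = pipeline("text-classification", model="nestauk/jobbert-base-cased-compdecs")

# Hypothetical, pre-split advert sentences.
advert_sentences = [
    "Would you like to join a major manufacturing company?",
    "You must be proficient in Excel.",
]

# Keep the sentences predicted to be company descriptions (LABEL_1).
company_descriptions = [
    sentence
    for sentence, prediction in zip(advert_sentences, comp_classifier(advert_sentences))
    if prediction["label"] == "LABEL_1"
]

print(company_descriptions)
# ['Would you like to join a major manufacturing company?']
```

The filtered sentences can then be passed to whatever SIC-mapping step you use downstream.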
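
And for reproduction, here is a minimal sketch of how the training hyperparameters listed above might map onto `transformers.TrainingArguments`. The toy dataset and output directory are stand-ins, not our actual training data or setup; the canonical training code is in the [ojd_daps_language_models repo](https://github.com/nestauk/ojd_daps_language_models).

```
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy stand-in for the 486 + 1,000 labelled sentences (1 = company description).
raw = Dataset.from_dict({
    "text": [
        "Would you like to join a major manufacturing company?",
        "You must be proficient in Excel.",
    ],
    "label": [1, 0],
})

# Start from the base checkpoint and add a fresh binary classification head.
tokenizer = AutoTokenizer.from_pretrained("jjzha/jobbert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("jjzha/jobbert-base-cased", num_labels=2)
tokenized = raw.map(lambda batch: tokenizer(batch["text"], truncation=True, padding=True), batched=True)

training_args = TrainingArguments(
    output_dir="jobbert-base-cased-compdecs",  # hypothetical output directory
    learning_rate=2e-05,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    adam_beta1=0.9,       # Adam with betas=(0.9, 0.999)
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    lr_scheduler_type="linear",
    num_train_epochs=10,
)

Trainer(model=model, args=training_args, train_dataset=tokenized).train()
```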