---
base_model: jjzha/jobbert-base-cased
model-index:
- name: jobbert-base-cased-compdecs
results: []
license: mit
language:
- en
metrics:
- accuracy
pipeline_tag: text-classification
widget:
- text: "You must be proficient in Excel."
- text: "Would you like to join a major manufacturing company?"
---
_Nesta, the UK's innovation agency, has been scraping online job adverts since 2021 and building algorithms to extract and structure information as part of the [Open Jobs Observatory](https://www.nesta.org.uk/project/open-jobs-observatory/) project._
_Although we are unable to share the raw data openly, we aim to open source **our models, algorithms and tools** so that anyone can use them for their own research and analysis._
## 🖊️ Model description
This model is a fine-tuned version of [jjzha/jobbert-base-cased](https://huggingface.co/jjzha/jobbert-base-cased). JobBERT is a bert-base-cased checkpoint that has been continuously pre-trained on ~3.2M sentences from job postings.
It has been fine-tuned with a classification head to perform binary classification of job advert sentences as being a `company description` or not.
The model was trained on **486 manually labelled company description sentences** and **1,000 non-company-description sentences, each less than 250 characters in length.**
It achieves the following results on a held-out test set of 147 sentences:
- Accuracy: 0.92157
| Label                   | precision | recall   | f1-score | support |
| ----------------------- | --------- | -------- | -------- | ------- |
| not company description | 0.930693  | 0.959184 | 0.944724 | 98      |
| company description     | 0.913043  | 0.857143 | 0.884211 | 49      |
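For reference, per-label metrics in this format can be produced with `scikit-learn`'s `classification_report`; a minimal sketch, where `y_true` and `y_pred` are illustrative placeholders for the gold and predicted labels of the 147 test sentences:

```python
from sklearn.metrics import classification_report

# Placeholder labels for illustration: 1 = company description, 0 = not.
y_true = [0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 1, 1]

print(classification_report(
    y_true,
    y_pred,
    target_names=["not company description", "company description"],
    digits=6,
))
```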
The code for training the model is in our [ojd_daps_language_models repo](https://github.com/nestauk/ojd_daps_language_models), a central repository for fine-tuning transformer models on our database of scraped job adverts.
## 🖨️ Use
To use the model:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub
model = AutoModelForSequenceClassification.from_pretrained("nestauk/jobbert-base-cased-compdecs")
tokenizer = AutoTokenizer.from_pretrained("nestauk/jobbert-base-cased-compdecs")

comp_classifier = pipeline('text-classification', model=model, tokenizer=tokenizer)
```
An example use is as follows:
```python
job_sent = "Would you like to join a major manufacturing company?"
comp_classifier(job_sent)
>> [{'label': 'LABEL_1', 'score': 0.9953641891479492}]
```
The intended use of this model is to extract company descriptions from online job adverts to use in downstream tasks such as mapping to [Standardised Industrial Classification (SIC)](https://www.gov.uk/government/publications/standard-industrial-classification-of-economic-activities-sic) codes.
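In this output, `LABEL_1` appears to correspond to the `company description` class and `LABEL_0` to the negative class. As a minimal sketch of that downstream use, the snippet below classifies every sentence of an advert and keeps the company descriptions; the example sentences and the 0.5 confidence threshold are illustrative, not part of the model:

```python
# Illustrative advert sentences; in practice these would come from
# sentence-splitting scraped job adverts.
advert_sentences = [
    "Would you like to join a major manufacturing company?",
    "You must be proficient in Excel.",
    "The role is hybrid, with two days a week in the office.",
]

# Keep sentences classified as company descriptions. LABEL_1 is assumed
# to be the company description class; 0.5 is an illustrative threshold.
company_descriptions = [
    sentence
    for sentence, pred in zip(advert_sentences, comp_classifier(advert_sentences))
    if pred["label"] == "LABEL_1" and pred["score"] > 0.5
]
print(company_descriptions)
```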
### ⚖️ Training hyperparameters
The following hyperparameters were used during training (a sketch of the corresponding `Trainer` configuration follows the list):
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 10
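These hyperparameters map directly onto the Hugging Face `Trainer` API; a minimal sketch of the corresponding configuration, assuming `Trainer` was used (the `output_dir` name is illustrative):

```python
from transformers import TrainingArguments

# Adam betas (0.9, 0.999) and epsilon 1e-08 are the transformers defaults,
# so they do not need to be set explicitly.
training_args = TrainingArguments(
    output_dir="jobbert-base-cased-compdecs",
    learning_rate=2e-05,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=10,
)
```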
### ⚖️ Training results
The fine-tuning metrics are as follows:
- eval_loss: 0.462236
- eval_runtime: 0.6293
- eval_samples_per_second: 233.582
- eval_steps_per_second: 15.89
- epoch: 10
- perplexity: 1.59
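The reported perplexity appears to be simply `exp(eval_loss)`; a quick check:

```python
import math

# exp(0.462236) ≈ 1.5876, consistent with the reported perplexity of ~1.59
print(math.exp(0.462236))
```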
### ⚖️ Framework versions
- Transformers 4.32.0
- Pytorch 2.0.1+cu118
- Datasets 2.14.4
- Tokenizers 0.13.3