Model Overview
This model is a fine-tuned version of cmarkea/distilcamembert-base-ner, adapted for Named Entity Recognition (NER) on French datasets. The base model is a lighter, distilled variant of CamemBERT, optimized for NER tasks involving entities such as locations, organizations, persons, and other miscellaneous entities in French text.
Model Type
- Architecture: CamembertForTokenClassification
- Base Model: DistilCamemBERT
- Number of Layers: 6 hidden layers, 12 attention heads
- Tokenizer: Based on CamemBERT's tokenizer
- Vocab Size: 32,005 tokens
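These details can be verified directly from the published configuration (a minimal sketch using the standard transformers config fields):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Crysy-rthomas/T-AIA-CamemBERT-NER-V2")
print(config.num_hidden_layers)    # expected: 6
print(config.num_attention_heads)  # expected: 12
print(config.vocab_size)           # expected: 32005
```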
Intended Use
This model is fine-tuned for Named Entity Recognition (NER) tasks, identifying and classifying entities such as:
- LOC (Location)
- PER (Person)
- ORG (Organization)
- MISC (Miscellaneous)

It can also identify the starting city and the ending city of a travel query.
Example Use Case:
Given a sentence such as "Je veux aller de Paris à Lyon" ("I want to go from Paris to Lyon"), the model will detect and label:
- Paris as the start location (B-START)
- Lyon as the end location (B-END)
Limitations:
- Language: The model is primarily designed for French texts.
- Performance: Results may degrade on non-French text or on tasks outside NER.
Labels and Tokens
The model uses the following entity labels:
- O: Outside any named entity
- B-START: Beginning of a named entity (start location)
- I-START: Inside a named entity (start location)
- B-END: Beginning of a named entity (end location)
- I-END: Inside a named entity (end location)
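For illustration, a word-level view of the example sentence under this scheme (a sketch; the actual tokenizer works on subword pieces, and how those pieces inherit tags depends on the training setup):

```python
words = ["Je", "veux", "aller", "de", "Paris",   "à", "Lyon"]
tags  = ["O",  "O",    "O",     "O",  "B-START", "O", "B-END"]
```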
Training Data
The model was fine-tuned using a French NER dataset of travel queries, including phrases like "Je veux aller de Paris à Lyon" to simulate common transportation-related interactions. The dataset contains named entity labels for city and station names.
Hyperparameters and Fine-Tuning:
- Learning Rate: 2e-5
- Batch Size: 16
- Epochs: 3
- Evaluation Strategy: Epoch-based
- Optimizer: AdamW
- Early Stopping: Used to prevent overfitting
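These settings map onto the transformers Trainer API roughly as follows (a sketch, not the original training script; the output directory, dataset variables, metric choice, and early-stopping patience are illustrative assumptions; AdamW is the Trainer default optimizer):

```python
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="camembert-ner-travel",   # hypothetical output path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",         # evaluate at the end of each epoch
    save_strategy="epoch",               # must match the evaluation strategy for early stopping
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",   # assumption: monitor validation loss
)

trainer = Trainer(
    model=model,                         # the token-classification model
    args=args,
    train_dataset=train_dataset,         # hypothetical: the travel-query NER dataset
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],  # patience is an assumption
)
trainer.train()
```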
Tokenizer
The tokenizer is the pre-trained CamemBERT tokenizer, adapted for the specific entity-labeling task. It uses SentencePiece subword tokenization, an extension of the BPE (Byte-Pair Encoding) approach, which splits words into smaller units.
Tokenizer special settings:
- Max Length: 128
- Padding: Right-padded to 128 tokens
- Truncation: Longest-first strategy, truncating tokens beyond 128.
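In code, these settings correspond to the standard tokenizer arguments (a minimal sketch):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Crysy-rthomas/T-AIA-CamemBERT-NER-V2")
encoded = tokenizer(
    "Je veux aller de Paris à Lyon",
    max_length=128,
    padding="max_length",        # pad on the right (the default side) to 128 tokens
    truncation="longest_first",  # drop tokens beyond the 128-token limit
    return_tensors="pt",
)
```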
How to Use
You can load and use this model with Hugging Face’s transformers library as follows:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Crysy-rthomas/T-AIA-CamemBERT-NER-V2")
model = AutoModelForTokenClassification.from_pretrained("Crysy-rthomas/T-AIA-CamemBERT-NER-V2")

text = "Je veux aller de Paris à Lyon"
tokens = tokenizer(text, return_tensors="pt")
outputs = model(**tokens)  # outputs.logits holds per-token label scores
```
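To turn the raw logits into label names, take the argmax over each token's scores and map the indices through the model's id2label table (a minimal sketch continuing the snippet above):

```python
# Map each token's highest-scoring logit to its label name
predictions = outputs.logits.argmax(dim=-1)
labels = [model.config.id2label[i.item()] for i in predictions[0]]
print(list(zip(tokens.tokens(), labels)))
```

For end-to-end decoding that merges subword pieces into whole entities, the transformers token-classification pipeline is a common alternative.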
Limitations and Bias
- The model may not generalize well beyond French texts.
- Results may be biased towards specific named entities frequently seen in the training data (such as city names).
License
This model is released under the Apache 2.0 License.