Questions and request
Hello Urchade,
Thanks for this incredible work. I have two questions and a request.
- How does this dataset compare to:
https://huggingface.co/datasets/ai4privacy/pii-masking-300k
- How does GLiNER PII compare to this approach in terms of performance (if I were to fine-tune GLiNER on the same dataset):
https://huggingface.co/Isotonic/deberta-v3-base_finetuned_ai4privacy_v2
- Is it possible to share the synthetic data generation script?
Thanks
Thank you for your interest in GLiNER :)
- I think the quality of my dataset is not great, as it is purely synthetic. The one you mentioned should be better.
- The model you mentioned should perform better, but GLiNER is not limited in the labels it can predict.
- I have provided a general example for synthetic data generation here (you can tailor it for PII extraction):
https://github.com/urchade/GLiNER/blob/main/examples/synthetic_data_generation.ipynb
You can join the GLiNER discussion server here, as I am not very active on HF: https://discord.gg/Y2yVxpSQnG
Great, thanks. I'll check out the script.
Hi! Where can I see the tuning script? I want to add data in other languages.
Hi,
Could you share some sample data for training the GLiNER model? I tried using a dataset in JSON format. Here is a sample; could you tell me whether it is okay or needs to be modified, and how to use it to fine-tune the model?
{"text": "Aadhaar is 437686033996 PAN is JRNPZ0751P Email is lakshitgulati@example.org Name is Purab Varghese Mobile is 910863034052 Age is 30 Credit Card is 2262854559438311 CVV is 961 Address is 51/138, Rastogi Nagar, Morena, Sikkim",
"entities": [{"entity": "AADHAAR", "start": 0, "end": 7, "value": "Aadhaar"}, {"entity": "AADHAAR_VALUE", "start": 11, "end": 23, "value": "437686033996"}, {"entity": "PAN", "start": 24, "end": 27, "value": "PAN"}, {"entity": "PAN_VALUE", "start": 31, "end": 41, "value": "JRNPZ0751P"}, {"entity": "EMAIL", "start": 42, "end": 47, "value": "Email"}, {"entity": "EMAIL_VALUE", "start": 51, "end": 76, "value": "lakshitgulati@example.org"}, {"entity": "NAME", "start": 77, "end": 81, "value": "Name"}, {"entity": "NAME_VALUE", "start": 85, "end": 99, "value": "Purab Varghese"}, {"entity": "MOBILE", "start": 100, "end": 106, "value": "Mobile"}, {"entity": "MOBILE_VALUE", "start": 110, "end": 122, "value": "910863034052"}, {"entity": "AGE", "start": 123, "end": 126, "value": "Age"}, {"entity": "AGE_VALUE", "start": 130, "end": 132, "value": "30"}, {"entity": "CREDIT CARD", "start": 133, "end": 144, "value": "Credit Card"}, {"entity": "CREDIT CARD_VALUE", "start": 148, "end": 164, "value": "2262854559438311"}, {"entity": "CVV", "start": 165, "end": 168, "value": "CVV"}, {"entity": "CVV_VALUE", "start": 172, "end": 175, "value": "961"}, {"entity": "ADDRESS", "start": 176, "end": 183, "value": "Address"}, {"entity": "ADDRESS_VALUE", "start": 187, "end": 224, "value": "51/138, Rastogi Nagar, Morena, Sikkim"}]}
Hi, it has been trained on https://huggingface.co/datasets/urchade/synthetic-pii-ner-mistral-v1
I suggest using lowercase entity names without "_".
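To illustrate that suggestion on the sample above, here is a minimal sketch that lowercases the labels, replaces "_" with spaces, and turns the character offsets into word-level spans. The output layout (`tokenized_text` plus `ner` entries of `[start_token, end_token, label]` with an inclusive end index) and the word-splitting regex are assumptions about what the GLiNER training examples expect, so please check them against the repository before training. `normalize_label` and `to_token_spans` are hypothetical helper names, not part of GLiNER.

```python
import re

# Assumption: a simple word/punctuation splitter; the tokenizer GLiNER
# actually uses may split some strings differently.
TOKEN_RE = re.compile(r"\w+(?:[-_]\w+)*|\S")


def normalize_label(label):
    """Lowercase the label and replace underscores with spaces."""
    return label.replace("_", " ").lower()


def to_token_spans(record):
    """Convert character-offset entities into word-level spans.

    Input entities use a start offset and an exclusive end offset,
    as in the sample posted above.
    """
    text = record["text"]
    tokens = [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]
    ner = []
    for ent in record["entities"]:
        # Indices of all tokens that overlap the entity's character range.
        hits = [
            i for i, (_, start, end) in enumerate(tokens)
            if start < ent["end"] and end > ent["start"]
        ]
        if not hits:
            continue  # offsets did not match any token; skip this entity
        ner.append([hits[0], hits[-1], normalize_label(ent["entity"])])
    return {"tokenized_text": [tok for tok, _, _ in tokens], "ner": ner}


# Shortened version of the sample from this thread.
sample = {
    "text": "Aadhaar is 437686033996 PAN is JRNPZ0751P",
    "entities": [
        {"entity": "AADHAAR_VALUE", "start": 11, "end": 23, "value": "437686033996"},
        {"entity": "PAN_VALUE", "start": 31, "end": 41, "value": "JRNPZ0751P"},
    ],
}

converted = to_token_spans(sample)
print(converted)
```

Labels such as "AADHAAR" that only mark the field name ("Aadhaar", "PAN", "Email") may not be needed if the goal is to extract the values themselves; that is a judgment call for your use case.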