---
inference: false
language: pt
datasets:
- lener_br
license: mit
pipeline_tag: token-classification
---

# DeBERTinha XSmall for NER

DeBERTinha XSmall fine-tuned for token classification (named entity recognition) in Brazilian Portuguese, using the LeNER-Br (`lener_br`) legal-domain NER dataset.

## Full Token Classification Example

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model_name = "sagui-nlp/debertinha-ptbr-xsmall-lenerbr"
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=13)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = "Acrescento que não há de se falar em violação do artigo 114, § 3º, da Constituição Federal, posto que referido dispositivo revela-se impertinente, tratando da possibilidade de ajuizamento de dissídio coletivo pelo Ministério Público do Trabalho nos casos de greve em atividade essencial."

inputs = tokenizer(input_text, max_length=512, truncation=True, return_tensors="pt")
tokens = inputs.tokens()

with torch.no_grad():
    logits = model(**inputs).logits
predictions = torch.argmax(logits, dim=2)

# Group sub-word pieces back into words: SentencePiece marks the start of a new
# word with "▁", so pieces are accumulated until the next word begins.
# The first and last positions ([CLS]/[SEP]) are skipped.
entities = []
current_entity = []
current_label = None
for token, prediction in zip(tokens[1:-1], predictions[0].numpy()[1:-1]):
    if not current_entity:
        current_entity.append(token)
        current_label = model.config.id2label[prediction]
    elif token.startswith("▁"):
        entities.append(("".join(current_entity), current_label))
        current_entity = [token]
        current_label = model.config.id2label[prediction]
    else:
        current_entity.append(token)
entities.append(("".join(current_entity), current_label))

# Keep only the words predicted as entities (label other than "O").
print(list(filter(lambda x: x[1] != "O", entities)))
```
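
The loop above merges SentencePiece pieces back into words by hand. As an alternative sketch (not part of the original card), the generic `transformers` token-classification pipeline can usually produce word-level entities directly via `aggregation_strategy`:

```python
from transformers import pipeline

# Word-level NER via the generic token-classification pipeline;
# aggregation_strategy="simple" merges sub-word pieces into whole-word entities.
ner = pipeline(
    "token-classification",
    model="sagui-nlp/debertinha-ptbr-xsmall-lenerbr",
    aggregation_strategy="simple",
)
print(ner("Acrescento que não há de se falar em violação do artigo 114, § 3º, da Constituição Federal."))
```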

## Training notes
During training, only the first sub-token of each word carries the word's label (`label_all_tokens = False`); the remaining sub-tokens and the special tokens are set to `-100` so the loss function ignores them.
```python
label_all_tokens = False
task="ner"

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True, max_length=512)

    labels = []
    for i, label in enumerate(examples[f"{task}_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

dataset = dataset.map(tokenize_and_align_labels, batched=True)
```
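
The rest of the training setup is not reproduced in this card. Assuming the standard `transformers` token-classification recipe, a minimal sketch of how the aligned dataset would be fed to the `Trainer` could look like the following; the base checkpoint name and all hyperparameters below are placeholders rather than the values actually used, and `tokenize_and_align_labels` is the function defined above:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

base_checkpoint = "sagui-nlp/debertinha-ptbr-xsmall"  # assumed base checkpoint name

dataset = load_dataset("lener_br")
tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(base_checkpoint, num_labels=13)

# Align word-level NER tags with sub-word tokens (function defined above).
dataset = dataset.map(tokenize_and_align_labels, batched=True)

# Pads input_ids and the aligned "labels" together, keeping -100 on padding positions.
data_collator = DataCollatorForTokenClassification(tokenizer)

# Placeholder hyperparameters, not the values used to train this model.
args = TrainingArguments(
    output_dir="debertinha-lenerbr-ner",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()
```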

## Citation 

```
@misc{campiotti2023debertinha,
      title={DeBERTinha: A Multistep Approach to Adapt DebertaV3 XSmall for Brazilian Portuguese Natural Language Processing Task}, 
      author={Israel Campiotti and Matheus Rodrigues and Yuri Albuquerque and Rafael Azevedo and Alyson Andrade},
      year={2023},
      eprint={2309.16844},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```