Special token IDs
Hi,
Apologies in advance if this is a silly question. How come the pretrained nb-bert-large tokenizer uses different special token IDs than most other BERT models?
nb-bert-large uses: [505, 504, 503, 501, 502]
whereas e.g. nb-bert-base uses: [100, 102, 0, 101, 103]
Furthermore, any suggestions on how to override the current mapping to the standard special token IDs for nb-bert-large?
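For reference, this is roughly how I'm reading those IDs out (the NbAiLab Hub identifiers below are my assumption):

```python
from transformers import AutoTokenizer

for name in ["NbAiLab/nb-bert-base", "NbAiLab/nb-bert-large"]:
    tok = AutoTokenizer.from_pretrained(name)
    # List the tokenizer's special tokens and the IDs they map to.
    print(name, tok.all_special_tokens, tok.all_special_ids)
```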
Hi!
Not silly at all. NB-BERT-base was trained by further pre-training from the multilingual BERT weights, which already came with their own tokenizer. NB-BERT-large was pre-trained from scratch, and a new tokenizer was built for it based on a different corpus (and different libraries) than the ones used for mBERT. Hence the discrepancy.
As for overriding, I'm not completely sure it's possible. The base version has a vocab size of 119,547 (because of its multilingual nature), while the large version has one of 50,000 (mostly Norwegian and Scandinavian based).
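If the goal is just to avoid hard-coding the "usual" BERT IDs in downstream code, a simpler workaround is to read the IDs from the tokenizer at runtime instead of trying to override them. A minimal sketch, assuming the NbAiLab Hub identifier:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("NbAiLab/nb-bert-large")

# Read the special token IDs the model was actually trained with,
# rather than assuming the usual BERT values (0, 100, 101, 102, 103).
special_ids = {
    "pad": tok.pad_token_id,
    "unk": tok.unk_token_id,
    "cls": tok.cls_token_id,
    "sep": tok.sep_token_id,
    "mask": tok.mask_token_id,
}
print(special_ids)
```

Actually remapping the tokenizer to the standard IDs would also require permuting the corresponding rows of the model's input embedding matrix (the IDs index into it), so it's rarely worth the trouble.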