---
license: mit
base_model:
- google-bert/bert-base-multilingual-uncased
tags:
- ner
- indonesian
- bert
language:
- id
library_name: transformers
---

# ner-bert-indonesian-v1

### Model Description

**ner-bert-indonesian-v1** is a fine-tuned version of **google-bert/bert-base-multilingual-uncased** for **named-entity recognition (NER)** in **Indonesian**.

**Version 1** recognizes the following 4 labels:
- O (label id 0): others, i.e. entities not yet recognized by the model (Lainnya)
- Person (Orang)
- Organisation (Organisasi)
- Place (Tempat/Lokasi)

### Usage

Using **pipelines**
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('wuriyanto/ner-bert-indonesian-v1')
model = AutoModelForTokenClassification.from_pretrained('wuriyanto/ner-bert-indonesian-v1')

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "OpenAI adalah laboratorium penelitan kecerdasan buatan yang terdiri atas perusahaan waralaba OpenAI LP dan perusahaan induk nirlabanya, OpenAI Inc. Para pendirinya (sam altman) terdorong oleh ketakutan mereka akan kemungkinan bahwa kecerdasan buatan dapat mengancam keberadaan manusia, perusahaan ini ada di amerika serikat. PT. Indodana , salah satu perusahann di Indonesia mulai mengadopsi teknologi ini."

# Each result is a dict describing one token and its predicted label
ner_results = nlp(example)
for n in ner_results:
    print(n)
```

Using **a custom parser**
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

id_to_label = {0: 'O', 1: 'Place', 2: 'Organisation', 3: 'Person'}

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('wuriyanto/ner-bert-indonesian-v1')
model = AutoModelForTokenClassification.from_pretrained('wuriyanto/ner-bert-indonesian-v1')

def tokenize_input(sentence):
    tokenized_input = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
    return tokenized_input

def predict_ner(sentence):
    inputs = tokenize_input(sentence)

    # Inference only, so gradients are not needed
    with torch.no_grad():
        outputs = model(**inputs)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=2)

    # Convert predictions and tokens back to readable format
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    predicted_labels = [id_to_label[p.item()] for p in predictions[0]]

    # Merge subwords and filter out special tokens
    merged_tokens, merged_labels = [], []
    current_token, current_label = "", None

    for token, label in zip(tokens, predicted_labels):
        # Debug output: raw word piece and its predicted label
        print(token, ' ', label)
        # Skip special tokens ([CLS], [SEP]) and commas or periods labelled "O"
        if token in ["[CLS]", "[SEP]"] or (label == "O" and token in [",", "."]):
            continue

        if token.startswith("##"):
            # Continuation piece: append it to the word being built
            current_token += token[2:]
            # Promote the word's label if its earlier pieces were tagged "O"
            if current_label == 'O':
                current_label = label
        else:
            # A new word starts: store the previous one first
            if current_token:
                merged_tokens.append(current_token)
                merged_labels.append(current_label)
            current_token = token
            current_label = label

    # Store the last word
    if current_token:
        merged_tokens.append(current_token)
        merged_labels.append(current_label)

    results = list(zip(merged_tokens, merged_labels))
    return results

sentence = "OpenAI adalah laboratorium penelitan kecerdasan buatan yang terdiri atas perusahaan waralaba OpenAI LP dan perusahaan induk nirlabanya, OpenAI Inc. Para pendirinya (sam altman) terdorong oleh ketakutan mereka akan kemungkinan bahwa kecerdasan buatan dapat mengancam keberadaan manusia, perusahaan ini ada di amerika serikat. PT. Indodana , salah satu perusahann di Indonesia mulai mengadopsi teknologi ini."

results = predict_ner(sentence)
print(results)

for token, label in results:
    print(f"{token}: {label}")
```

### Dataset and citation info
```
@article{DBLP:journals/corr/abs-1810-04805,
  author        = {Jacob Devlin and
                   Ming{-}Wei Chang and
                   Kenton Lee and
                   Kristina Toutanova},
  title         = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
                   Understanding},
  journal       = {CoRR},
  volume        = {abs/1810.04805},
  year          = {2018},
  url           = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint        = {1810.04805},
  timestamp     = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl        = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
```

* The DEE NER dataset: Ika Alfina, Ruli Manurung, and Mohamad Ivan Fanany, ["DBpedia Entities Expansion in Automatically Building Dataset for Indonesian NER"](https://ieeexplore.ieee.org/document/7872784), in Proceedings of the 8th International Conference on Advanced Computer Science and Information Systems 2016 (ICACSIS 2016).
* The MDEE and Singgalang NER dataset: Ika Alfina, Septiviana Savitri, and Mohamad Ivan Fanany, ["Modified DBpedia Entities Expansion for Tagging Automatically NER Dataset"](https://ieeexplore.ieee.org/document/8355036), in Proceedings of the 9th International Conference on Advanced Computer Science and Information Systems 2017 (ICACSIS 2017).
* The Gold Standard: Andry Luthfi, Bayu Distiawan, and Ruli Manurung, ["Building an Indonesian named entity recognizer using Wikipedia and DBPedia"](https://ieeexplore.ieee.org/document/6973520), in Proceedings of the 2014 International Conference on Asian Language Processing (IALP 2014).
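
### Grouping parser output into entity spans

The `predict_ner` function in the custom parser example under Usage returns one `(word, label)` pair per word, so a multi-word name such as "sam altman" comes back as two separate pairs. The sketch below shows one possible way to collapse consecutive words that share a label into a single span. It is an illustration only: `group_entities` and its `outside_label` parameter are hypothetical names, not part of the model or of `transformers`, and the sample input is hand-written rather than real model output. Because the label set has no begin/inside markers, two adjacent entities of the same type would be merged into one span.

```python
from itertools import groupby


def group_entities(pairs, outside_label="O"):
    """Collapse consecutive (word, label) pairs that share a label into spans.

    `pairs` is assumed to have the shape returned by predict_ner:
    [(word, label), ...]. Words tagged with `outside_label` are dropped.
    """
    spans = []
    # groupby yields one run per stretch of consecutive pairs with equal labels
    for label, run in groupby(pairs, key=lambda pair: pair[1]):
        if label == outside_label:
            continue
        text = " ".join(word for word, _ in run)
        spans.append((text, label))
    return spans


# Hand-written input for illustration only; this is NOT actual model output.
sample = [
    ("openai", "Organisation"),
    ("adalah", "O"),
    ("laboratorium", "O"),
    ("sam", "Person"),
    ("altman", "Person"),
    ("amerika", "Place"),
    ("serikat", "Place"),
]

for text, label in group_entities(sample):
    print(f"{text}: {label}")
```

With real predictions, the call would be `group_entities(predict_ner(sentence))`. Joining words with a single space is a simplification and does not restore the original spacing around punctuation.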