	---
	language: ja
	license: gpl-3.0
	widget:
	- text: 今回も意識⧫障害が出現し救急外来を受診した。
	---

	A model used to estimate the start and end of a Named Entity (NE) span based on a Point annotation, as used in the paper "Is boundary annotation necessary? Evaluating boundary-free approaches to improve clinical named entity annotation efficiency".

	The goal of this model is to convert a point annotation into the corresponding span annotation with the correct boundaries.

	The model locates an identifier token (⧫) and, based on its surrounding context, estimates where the NE concept starts and ends.

	The model is trained to estimate the spans of disease and symptom names in Japanese medical texts.

	If you want to re-train the model for a different language or domain, dataset preprocessing and training scripts are available [here](https://github.com/gabrielandrade2/Point-to-Span-estimation).

	## Concepts

	### Point annotation

	Unlike span-based paradigms, a point annotation consists of a single position within the NE span.
	It is a simple and fast way to annotate NEs, but it introduces ambiguity in the information captured by the annotation.

	In this repository's implementation, a point annotation is represented by a lozenge character (⧫).

	Example:
	```
	The patient has a history of dia⧫betes.
	```
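As a minimal sketch of how such a point annotation could be produced programmatically, the snippet below inserts the marker at a character offset. The helper name `insert_point` is hypothetical and is not part of this repository.

```python
POINT_MARKER = "⧫"  # identifier character described in this README


def insert_point(text: str, offset: int) -> str:
    """Return `text` with the point marker inserted before character `offset`.

    Hypothetical helper for illustration; not part of the repository code.
    """
    if not 0 <= offset <= len(text):
        raise ValueError("offset is outside the text")
    return text[:offset] + POINT_MARKER + text[offset:]


sentence = "The patient has a history of diabetes."
start = sentence.index("diabetes")
# Place the marker three characters into the entity
print(insert_point(sentence, start + 3))
# The patient has a history of dia⧫betes.
```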

	### Span annotation

	A span annotation consists of two markers that identify the start and end positions of the NE span.

	The implementation in this repository is based on the span annotation schema defined by [Yada et al. (2020)](https://aclanthology.org/2020.lrec-1.561/).

	Example:
	```
	The patient has a history of <C>diabetes</C>.
	```
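For comparison, a span annotation in this tag style could be rendered from a pair of character offsets as sketched below. The helper `wrap_span` and its exclusive-end convention are assumptions made for illustration, not the repository's API.

```python
def wrap_span(text: str, start: int, end: int, tag: str = "C") -> str:
    """Wrap text[start:end] in <tag>...</tag>; `end` is exclusive.

    Hypothetical helper for illustration; not part of the repository code.
    """
    if not 0 <= start < end <= len(text):
        raise ValueError("invalid span")
    return f"{text[:start]}<{tag}>{text[start:end]}</{tag}>{text[end:]}"


sentence = "The patient has a history of diabetes."
start = sentence.index("diabetes")
print(wrap_span(sentence, start, start + len("diabetes")))
# The patient has a history of <C>diabetes</C>.
```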

	## Model architecture

	This model was fine-tuned on top of [cl-tohoku/bert-base-japanese-char-v2](https://huggingface.co/cl-tohoku/bert-base-japanese-char-v2).

	The model architecture is the same as the original BERT base model: 12 layers, 768-dimensional hidden states, and 12 attention heads.

	Running this model requires the following dependencies:
	- fugashi
	- unidic-lite
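
Before loading the tokenizer, it can be useful to confirm that these packages are importable. The following is a small sketch using only the standard library; the helper name `missing_dependencies` is hypothetical.

```python
from importlib.util import find_spec

# pip package name -> import name (unidic-lite is imported as unidic_lite)
REQUIRED_MODULES = {"fugashi": "fugashi", "unidic-lite": "unidic_lite"}


def missing_dependencies(required=REQUIRED_MODULES):
    """Return the pip names of the required packages that cannot be imported."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if find_spec(module_name) is None
    ]


missing = missing_dependencies()
if missing:
    print("Install with: pip install " + " ".join(missing))
```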

	## Training data

	The model was fine-tuned on a dataset of Japanese medical texts (not publicly available) comprising 1027 synthetic medication history notes generated through crowd-sourcing.

	Ten experienced dispensing pharmacists were hired as writers to craft the corpus. Each writer was assigned one of 285 drug names and tasked with creating a "typical" clinical narrative. This corpus was later fully annotated for symptom and disease names.

	For each annotation, a ⧫ token was inserted at a position within its span sampled from a truncated normal distribution.

	The model was then trained to identify this token and output a span corresponding to the surrounding concept.
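
The point-placement step described above could be sketched roughly as follows. The distribution parameters are not stated in this README, so the mean (span midpoint), the standard deviation (`sigma_ratio` times the span length), and the function names are all assumptions; the actual preprocessing script in the linked repository may differ.

```python
import random

POINT_MARKER = "⧫"


def sample_point_offset(span_len: int, sigma_ratio: float = 0.25, rng=random) -> int:
    """Sample an offset in [0, span_len] from a truncated normal distribution.

    Assumed parameters: mean at the span midpoint, sigma = span_len * sigma_ratio.
    Truncation is done by rejecting draws that fall outside the span.
    Offsets 0 and span_len place the marker at the span edges.
    """
    if span_len < 1:
        raise ValueError("span must contain at least one character")
    mean = span_len / 2
    sigma = span_len * sigma_ratio
    while True:
        draw = rng.gauss(mean, sigma)
        if 0 <= draw <= span_len:
            return round(draw)


def add_point(text: str, start: int, end: int, rng=random) -> str:
    """Insert the marker inside the span text[start:end] (end exclusive)."""
    offset = start + sample_point_offset(end - start, rng=rng)
    return text[:offset] + POINT_MARKER + text[offset:]


rng = random.Random(0)  # seeded for reproducibility
text = "The patient has a history of diabetes."
start = text.index("diabetes")
print(add_point(text, start, start + len("diabetes"), rng=rng))
```

Rejection sampling keeps the sketch dependency-free; a library routine for truncated normals would work equally well.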

	## Usage

	The `requirements.txt` file contains all the dependencies needed to run the example code.

	```python
	import mojimoji
	import numpy as np
	from transformers import AutoTokenizer, AutoModelForTokenClassification

	import iob_util  # pip install git+https://github.com/gabrielandrade2/IOB-util.git

	model_name = "gabrielandrade2/point-to-span-estimation"

	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForTokenClassification.from_pretrained(model_name)

	# Point-annotated text
	text = (
	    "肥大型心⧫筋症、心房⧫細動に対してＷＦ投与が開始となった。"
	    "治療経過中に非持続性心⧫室頻拍が認められたためアミオダロンが併用となった。"
	)

	# Convert to zenkaku and tokenize
	text = mojimoji.han_to_zen(text)
	tokenized = tokenizer.tokenize(text)

	# Encode text
	input_ids = tokenizer.encode(text, return_tensors="pt")

	# Predict spans
	output = model(input_ids)
	logits = output.logits.detach().cpu().numpy()
	tags = np.argmax(logits, axis=2)[0].tolist()  # single sentence in the batch

	# Convert model output to IOB format
	id2label = model.config.id2label
	tags = [id2label[t] for t in tags]

	# Convert input_ids back to chars
	tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())

	# Remove model special tokens (CLS, SEP, PAD)
	tags = [y for x, y in zip(tokens, tags) if x not in ['[CLS]', '[SEP]', '[PAD]']]
	tokens = [x for x in tokens if x not in ['[CLS]', '[SEP]', '[PAD]']]

	# Convert from IOB to XML tag format
	xml_text = iob_util.convert_iob_to_xml(tokens, tags)
	xml_text = xml_text.replace('⧫', '')
	print(xml_text)
	```

	### Output
	```xml
	<C>肥大型心筋症</C>、<C>心房細動</C>に対してWF投与が開始となった。治療経過中に<C>非持続性心室頻拍</C>が認められたためアミオダロンが併用となった。
	```
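
If the entity strings themselves are needed rather than the tagged text, they could be pulled out of the XML-style output with a regular expression. This is a minimal sketch that assumes non-nested `<C>` tags as in the output above; `extract_entities` and `strip_tags` are hypothetical helpers.

```python
import re

SPAN_PATTERN = re.compile(r"<C>(.*?)</C>")


def extract_entities(xml_text: str) -> list[str]:
    """Return the text of every <C>...</C> span, in order of appearance."""
    return SPAN_PATTERN.findall(xml_text)


def strip_tags(xml_text: str) -> str:
    """Return the text with the <C> tags removed and their content kept."""
    return SPAN_PATTERN.sub(r"\1", xml_text)


xml_text = "<C>肥大型心筋症</C>、<C>心房細動</C>に対してWF投与が開始となった。"
print(extract_entities(xml_text))
# ['肥大型心筋症', '心房細動']
print(strip_tags(xml_text))
# 肥大型心筋症、心房細動に対してWF投与が開始となった。
```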