Why do some codes appear multiple times in the tokenizer dictionary?
When checking the tokenizer dictionary, I can see that some codes appear multiple times. Here is a list of the most common codes:
- LOINC/28570-0: 4700
- LOINC/11506-3: 3454
- LOINC/34748-4: 3354
- LOINC/8663-7: 2452
- LOINC/LP173418-7: 1585
I understand that some codes might appear more frequently in the dictionary (e.g., for measurement percentiles), but could you explain why the counts are so high?
The code I used to measure the frequency:
import msgpack
from collections import Counter
dictionary_path = 'dictionary.msgpack'
# Load the tokenizer dictionary from its msgpack file
with open(dictionary_path, "rb") as f:
    dictionary = msgpack.load(f)
# Count how many vocabulary entries share each code string
code_counts = Counter(item['code_string'] for item in dictionary['vocab'])
for code_string, count in code_counts.most_common(20):
    print(f"{code_string}: {count}")
Thanks for reaching out!
The counts are so high because we assign a token to each unique (code, value) pair. Each of these codes is related to notes and was therefore associated with thousands of unique textual values. Since each textual value got its own token, each of these LOINC codes ended up with many tokens in our vocabulary. Each of the top codes you've printed is a type of note:
- LOINC/28570-0 is a procedure note
- LOINC/11506-3 is a progress note
- LOINC/34748-4 is a telephone encounter note
- LOINC/8663-7 is self-reported smoking
- LOINC/LP173418-7 is a note
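To make the mechanism concrete, here is a toy sketch of assigning one token per unique (code, value) pair. It is only an illustration, not our actual tokenizer code: the `build_toy_vocab` helper, the event values, and the `EXAMPLE/code` entry are all made up for this example.

```python
from collections import Counter

# Hypothetical toy events as (code, value) pairs. The values are invented
# for illustration and do not come from the real dictionary.
events = [
    ("LOINC/28570-0", "note text A"),
    ("LOINC/28570-0", "note text B"),
    ("LOINC/28570-0", "note text A"),  # repeated pair, so no new token
    ("LOINC/28570-0", "note text C"),
    ("EXAMPLE/code", None),            # a code that carries no value
]

def build_toy_vocab(events):
    """Assign one token id per unique (code, value) pair, in first-seen order."""
    vocab = {}
    for pair in events:
        if pair not in vocab:
            vocab[pair] = len(vocab)
    return vocab

toy_vocab = build_toy_vocab(events)

# Count how many tokens each code ended up with
tokens_per_code = Counter(code for code, _ in toy_vocab)
print(tokens_per_code["LOINC/28570-0"])  # 3
print(tokens_per_code["EXAMPLE/code"])   # 1
```

Every distinct text value adds another token under the same code, which is how a note code can end up with thousands of entries.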
However, we had to remove all text values from our model's vocabulary for data privacy reasons. Thus, almost all of those repeated codes were rendered unusable when we published our model. We marked all unusable codes with type="unused" in their entry.
If you filter out the "unused" entries (which correspond to tokens representing textual values that were not allowed to be published), you'll see far less repetition per code. What remains comes from the decile bins for numerical values plus the plain code itself:
>>> codes = [ x for x in dictionary['vocab'] if x['code_string'] == 'LOINC/28570-0' ]
>>> len(codes)
4700
>>> codes = [ x for x in dictionary['vocab'] if x['code_string'] == 'LOINC/28570-0' and x['type'] != 'unused' ]
>>> len(codes)
11
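If you want the same filtered view for every code at once, something like the sketch below should work. It assumes each vocab entry is a dict with 'code_string' and 'type' keys, as in the snippets above. The `count_usable_tokens` helper and the sample entries are hypothetical, and the type values other than "unused" are placeholders, not necessarily the ones used in the real dictionary.

```python
from collections import Counter

def count_usable_tokens(vocab, unused_type="unused"):
    """Count vocabulary entries per code string, skipping entries marked unused."""
    return Counter(
        entry["code_string"]
        for entry in vocab
        if entry.get("type") != unused_type
    )

# Small made-up vocabulary so the example runs without the real file.
# "code" and "numeric" are placeholder type names for illustration only.
sample_vocab = [
    {"code_string": "LOINC/28570-0", "type": "unused"},
    {"code_string": "LOINC/28570-0", "type": "unused"},
    {"code_string": "LOINC/28570-0", "type": "code"},
    {"code_string": "LOINC/8663-7", "type": "numeric"},
    {"code_string": "LOINC/8663-7", "type": "unused"},
]

usable = count_usable_tokens(sample_vocab)
for code_string, count in usable.most_common(20):
    print(f"{code_string}: {count}")

# With the real dictionary loaded as in the question, the call would be:
# usable = count_usable_tokens(dictionary['vocab'])
```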