StanfordShahLab/clmbr-t-base · Why do some codes appear multiple times in the tokenizer dictionary?

19 days ago

When checking the tokenizer dictionary, I can see that some codes appear multiple times. Here are the five most frequent codes, each with the number of vocabulary entries that share it:

LOINC/28570-0: 4700
LOINC/11506-3: 3454
LOINC/34748-4: 3354
LOINC/8663-7: 2452
LOINC/LP173418-7: 1585

I understand that some codes might appear more frequently in the dictionary (e.g., for measurement percentiles), but could you explain why the counts are so high?

The code I used to measure the frequency:

import msgpack
from collections import Counter

dictionary_path = 'dictionary.msgpack'

# Load the tokenizer dictionary shipped with the model
with open(dictionary_path, "rb") as f:
    dictionary = msgpack.load(f)

# Count how many vocabulary entries share each code_string
code_counts = Counter(item['code_string'] for item in dictionary['vocab'])

for code_string, count in code_counts.most_common(20):
    print(f"{code_string}: {count}")

Miking98

Stanford Shah Lab org 19 days ago

Thanks for reaching out!

The counts are high because we assign each unique (code, value) pair its own token. Each of these codes carries free-text values and was associated with thousands of unique textual values. Since every distinct textual value got its own token, each of these LOINC codes ended up with many tokens in our vocabulary. Here is what each of the top codes you printed represents (most are note types):

LOINC/28570-0 is a procedure note
LOINC/11506-3 is a progress note
LOINC/34748-4 is a telephone encounter note
LOINC/8663-7 is self-reported smoking
LOINC/LP173418-7 is a note

However, we had to remove all text values from our model's vocabulary for data privacy reasons. As a result, almost all of those repeated entries became unusable when we published our model. We marked every unusable entry with type="unused".
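
As a toy illustration of the explanation above (not the actual tokenizer code), the sketch below shows how giving every distinct (code, value) pair its own entry multiplies the entries for a code with free-text values. The sample events, the helper build_toy_vocab, and the "code" type label are all made up for this example; only the code_string and type fields and the "unused" marker come from the published dictionary.

```python
from collections import Counter

# Hypothetical toy events as (code, value) pairs; None means "no value".
events = [
    ("LOINC/28570-0", None),
    ("LOINC/28570-0", "note text A"),
    ("LOINC/28570-0", "note text B"),
    ("LOINC/28570-0", "note text A"),  # repeated pair, adds no new entry
    ("LOINC/8663-7", "never smoker"),
]


def build_toy_vocab(events):
    """Give every distinct (code, value) pair its own vocabulary entry."""
    vocab = []
    seen = set()
    for code, value in events:
        if (code, value) in seen:
            continue
        seen.add((code, value))
        # Text-valued entries are marked "unused" and their text is dropped,
        # mimicking the privacy scrub. The "code" label is an assumption.
        entry_type = "unused" if isinstance(value, str) else "code"
        vocab.append({"code_string": code, "type": entry_type})
    return vocab


toy_vocab = build_toy_vocab(events)
per_code = Counter(entry["code_string"] for entry in toy_vocab)
print(per_code)
```

In this toy data, the procedure-note code ends up with three entries, two of which are "unused", which is the same pattern seen at a much larger scale in the real dictionary.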

If you filter out the "unused" entries (the tokens for textual values that we were not allowed to publish), you'll see a much more reasonable amount of repetition. What remains comes from deciling of numerical values (one token per decile) plus the bare code itself:

>>> codes = [ x for x in dictionary['vocab'] if x['code_string'] == 'LOINC/28570-0' ]
>>> len(codes)
4700
>>> codes = [ x for x in dictionary['vocab'] if x['code_string'] == 'LOINC/28570-0' and x['type'] != 'unused' ]
>>> len(codes)
11
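
The same filter can be applied to every code at once. The following is a minimal sketch, assuming only that each vocab entry has the code_string and type fields used above; the helper name summarize_vocab and the small sample vocabulary (including the EXAMPLE/other code) are hypothetical and exist only so the snippet runs without the real file.

```python
from collections import Counter


def summarize_vocab(vocab, top_n=20):
    """Return (code_string, total entries, usable entries) for the top codes."""
    total = Counter(entry["code_string"] for entry in vocab)
    usable = Counter(
        entry["code_string"] for entry in vocab if entry["type"] != "unused"
    )
    # A Counter returns 0 for codes that have no usable entries left.
    return [(code, n, usable[code]) for code, n in total.most_common(top_n)]


# Made-up sample; with the real file, pass dictionary['vocab'] instead.
sample_vocab = (
    [{"code_string": "LOINC/28570-0", "type": "unused"}] * 5
    + [{"code_string": "LOINC/28570-0", "type": "code"}]
    + [{"code_string": "EXAMPLE/other", "type": "code"}]
)

rows = summarize_vocab(sample_vocab)
for code, n_total, n_usable in rows:
    print(f"{code}: {n_total} total, {n_usable} usable")
```

With the real dictionary loaded as in the question, summarize_vocab(dictionary['vocab']) should show the total and usable counts side by side for each of the most frequent codes.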

Miking98 changed discussion status to closed 19 days ago