Why do some codes appear multiple times in the tokenizer dictionary?

#4
by akwasigroch - opened

When checking the tokenizer dictionary, I can see that some codes appear multiple times. Here is a list of the most common codes:

  • LOINC/28570-0: 4700
  • LOINC/11506-3: 3454
  • LOINC/34748-4: 3354
  • LOINC/8663-7: 2452
  • LOINC/LP173418-7: 1585

I understand that some codes might appear more frequently in the dictionary (e.g., for measurement percentiles), but could you explain why the counts are so high?

The code I used to measure the frequency:

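from collections import Counter

import msgpack

dictionary_path = 'dictionary.msgpack'

# Load the tokenizer dictionary (a msgpack file with a 'vocab' list of entries).
with open(dictionary_path, "rb") as f:
    dictionary = msgpack.load(f)

# Count how many vocabulary entries share each code_string and print the top 20.
code_counts = Counter(item['code_string'] for item in dictionary['vocab'])
for code_string, count in code_counts.most_common(20):
    print(f"{code_string}: {count}")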
Stanford Shah Lab org

Thanks for reaching out!

The counts are high because we assign a separate token to each unique (code, value) pair. Each of these codes is related to notes and was therefore associated with thousands of unique textual values. Since every textual value got its own token, each of these LOINC codes ended up with many tokens in our vocabulary. Each of the top codes you've printed is a type of note:

  • LOINC/28570-0 is a procedure note
  • LOINC/11506-3 is a progress note
  • LOINC/34748-4 is a telephone encounter note
  • LOINC/8663-7 is self-reported smoking
  • LOINC/LP173418-7 is a note
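
To make the (code, value) idea concrete, here is a minimal toy sketch. It is not the actual tokenizer code: the events, the `build_vocab` helper, and the `EXAMPLE/other-code` string are all made up for illustration, and the real pipeline may differ in many details.

```python
from collections import Counter

# Hypothetical toy events as (code, value) pairs; the real data looks different.
events = [
    ("LOINC/11506-3", "patient doing well"),
    ("LOINC/11506-3", "follow up in 2 weeks"),
    ("LOINC/11506-3", "patient doing well"),  # repeated pair, so no new token
    ("LOINC/11506-3", None),                  # the bare code with no value
    ("EXAMPLE/other-code", None),             # made-up code with no values
]


def build_vocab(events):
    """Assign one token id to each unique (code, value) pair, in first-seen order."""
    vocab = {}
    for pair in events:
        if pair not in vocab:
            vocab[pair] = len(vocab)
    return vocab


vocab = build_vocab(events)

# Number of tokens that share each code: codes with many distinct text
# values end up with many tokens.
tokens_per_code = Counter(code for code, _ in vocab)
print(tokens_per_code)
```

In this toy setup the note code gets three tokens (two distinct texts plus the bare code), while the code without values gets only one. With thousands of distinct note texts, the same effect would produce counts like the ones listed in the question.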

However, we had to remove all text values from our model's vocabulary for data privacy reasons. As a result, almost all of those repeated entries became unusable when we published our model. We marked every unusable entry with type="unused".

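If you filter out the "unused" entries (which correspond to tokens representing textual values that we were not allowed to publish), you'll get a much more reasonable amount of repetition. What remains comes from deciling of numerical values plus the bare code itself, which is consistent with the 11 entries below (ten decile buckets plus one token for the code on its own):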

>>> codes = [ x for x in dictionary['vocab'] if x['code_string'] == 'LOINC/28570-0' ]
>>> len(codes)
4700
>>> codes = [ x for x in dictionary['vocab'] if x['code_string'] == 'LOINC/28570-0' and x['type'] != 'unused' ]
>>> len(codes)
11
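
The same filter can be applied to every code at once. Below is a rough sketch, assuming each vocab entry is a dict with `code_string` and `type` keys as in the snippets above. The `count_codes` helper and the toy entries are hypothetical, and `"unused"` is the only type value confirmed in this thread; the other type strings are placeholders.

```python
from collections import Counter


def count_codes(vocab, skip_type="unused"):
    """Count vocab entries per code_string, skipping entries whose type equals skip_type."""
    return Counter(
        item["code_string"] for item in vocab if item.get("type") != skip_type
    )


# Toy stand-in for dictionary['vocab']. The type values other than "unused"
# are placeholders, not the real ones.
toy_vocab = [
    {"code_string": "LOINC/28570-0", "type": "unused"},
    {"code_string": "LOINC/28570-0", "type": "unused"},
    {"code_string": "LOINC/28570-0", "type": "placeholder_code"},
    {"code_string": "LOINC/8663-7", "type": "placeholder_numeric"},
]

usable_counts = count_codes(toy_vocab)
for code_string, count in usable_counts.most_common(20):
    print(f"{code_string}: {count}")
```

With the real file, passing `dictionary['vocab']` instead of `toy_vocab` should give the per-code counts with the unpublished text tokens excluded.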
Miking98 changed discussion status to closed
