Zero-Shot Approximation of Language Embeddings

This directory contains all scripts needed to reproduce the meta-learning for the zero-shot part of our system. These scripts let you predict representations of languages purely from the distances between them, as measured by a variety of linguistically informed metrics or, better yet, by a learned combination thereof.
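
To make the idea concrete, here is a minimal sketch of distance-based approximation: the embedding of an unseen language is estimated as a distance-weighted average of the embeddings of its nearest seen languages. The function and variable names are illustrative only and do not mirror the repository's API.

import numpy as np

def approximate_embedding(distances_to_seen, seen_embeddings, k=5):
    # distances_to_seen: dict mapping seen language codes to their distance
    #                    from the unseen target language
    # seen_embeddings:   dict mapping seen language codes to embedding vectors
    # pick the k seen languages closest to the target language
    nearest = sorted(distances_to_seen.items(), key=lambda kv: kv[1])[:k]
    # turn distances into normalized weights (closer neighbors weigh more)
    weights = np.array([1.0 / (dist + 1e-8) for _, dist in nearest])
    weights /= weights.sum()
    # weighted average of the neighbors' embeddings
    vectors = np.stack([seen_embeddings[lang] for lang, _ in nearest])
    return (weights[:, None] * vectors).sum(axis=0)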

Applying zero-shot approximation to a trained model

Use run_zero_shot_lang_emb_injection.py to update the language embeddings of a trained model for all languages that were not seen during training (by default, supervised_languages.json is used to determine which languages were seen). See the script for arguments that can be passed (e.g. to use a custom model path). Here is an example:

cd IMS-Toucan/
python run_zero_shot_lang_emb_injection.py -m <model_path> -d <distance_type> -k <number_of_nearest_neighbors>

By default, the updated model is saved with a modified filename in the same directory.
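
Conceptually, the injection step loads the checkpoint, overwrites the embedding rows of the unseen languages with their approximations, and saves the result under a modified filename. The sketch below illustrates this under the assumption that the checkpoint exposes its language embedding table under a single flat key; the actual key names and saving logic live in run_zero_shot_lang_emb_injection.py.

import torch

def inject_zero_shot_embeddings(model_path, approximated_embeddings, lang_to_id,
                                emb_key="language_embedding.weight"):
    # emb_key and the flat checkpoint layout are assumptions for illustration;
    # the real structure depends on the Toucan checkpoint format
    checkpoint = torch.load(model_path, map_location="cpu")
    emb_table = checkpoint[emb_key]
    # overwrite only the rows of languages that were not seen during training
    for lang, vector in approximated_embeddings.items():
        emb_table[lang_to_id[lang]] = torch.as_tensor(vector)
    # store the result next to the original model with a modified filename
    updated_path = model_path.replace(".pt", "_zero_shot.pt")
    torch.save(checkpoint, updated_path)
    return updated_path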

Cached distance lookups

Applying any form of zero-shot approximation requires cache files for the distance lookups.

The ASP lookup file (asp_dict.pkl) needs to be downloaded from the release page. All other cache files are automatically generated as required when running run_zero_shot_lang_emb_injection.py.
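
A quick way to verify that the downloaded ASP lookup is present and readable is to load it with pickle. The path below and the assumption about the file's internal layout are illustrative; inspect your downloaded file to confirm its structure.

import pickle

# adjust the path to wherever you placed the downloaded file
with open("asp_dict.pkl", "rb") as f:
    asp_dict = pickle.load(f)

print(f"ASP lookup contains {len(asp_dict)} entries")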

Note: While the map, tree, and inverse ASP distances are model-independent, the learned distance lookup applies only to the model it was trained on, i.e., different Toucan models require different learned-distance lookups. If you want to apply zero-shot approximation to a new model, make sure you are not reusing an outdated, pre-existing learned-distance lookup; instead, train a new learned distance metric.
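
For example, a simple guard along the following lines can delete a stale learned-distance cache before running the injection for a new model. The cache filename here is a placeholder, so check which file run_zero_shot_lang_emb_injection.py actually wrote for the previous model before deleting anything.

from pathlib import Path

def remove_stale_learned_distance_cache(cache_path="learned_dist_lookup.pkl"):
    # cache_path is hypothetical; point it at the learned-distance cache
    # generated for the previous model
    cache = Path(cache_path)
    if cache.exists():
        cache.unlink()
        print(f"Removed outdated learned-distance cache: {cache}")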