Error in overexpress in silico mode with the 95M pretrained model

#423
by ZYSK-huggingface - opened

Hi!

I encountered an error when I used your updated 20L-95M pretrained model to overexpress a specific gene.

My code is:

from geneformer import InSilicoPerturber

# overexpress ARID5A (ENSG00000196843) with the 20L-95M pretrained model
isp = InSilicoPerturber(
    perturb_type="overexpress",
    perturb_rank_shift=None,
    genes_to_perturb=["ENSG00000196843"],
    combos=0,
    anchor_gene=None,
    model_type="Pretrained",
    num_classes=0,
    emb_mode="cls_and_gene",
    max_ncells=None,
    emb_layer=0,
    forward_batch_size=64,
    nproc=24,
    clear_mem_ncells=64,
)
isp.perturb_data(
    "/home/Geneformer-2/gf-20L-95M-i4096/",
    "/data/02_Datasets/02_Geneformer/Homo_sapiens/HeartAtrium/update_test_Endo_95M.dataset/",
    "/data/03_Results/02_Geneformer/in_silico/overexpress_95M_pretrained-test-only/",
    "ARID5A_Endo_cls_and_gene",
)

And the errors:

[screenshot: error traceback]

I found that in delete mode, the cells detected are those that express the targeted gene; however, in overexpress mode, all cells are detected regardless of which genes they express. Maybe for the cells that do not express my targeted gene, overexpressing it results in this error? (A cosine calculation error because the original and perturbed data differ in size.)


Sadly, when using the 30M pretrained model to overexpress my targeted gene, I encountered a similar problem again:

[screenshot: error traceback]

My guess is that a cell row is hit where, after the overexpress operation (which essentially moves a gene token, whether or not it lies within the 2048 input, to the first position, with any overflow truncated), the perturbed cell contains a number of gene tokens (input_ids) equal to 2047, while the original cell is full at 2048, so this command

[screenshot: the failing line of code]

fails. But what I don't know is why, after this overexpression (essentially a change of position), this cell row or mini-batch loses one of its input_ids.

Thank you for your question! The two models have two different token dictionaries, which is likely the source of your error. You should therefore provide the appropriate token dictionary for each model to the in silico perturbation to ensure the correct gene is targeted. Your data should also be tokenized with the appropriate dictionary and gene median file, with the necessary special tokens when using the new model, and with the appropriate input size for the given model. Please let us know if that doesn't resolve the error.
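For example, a minimal sketch of pointing the perturber at a model-matched dictionary (the gc95M path is a placeholder; use the dictionary files distributed with the 95M model):

from geneformer import InSilicoPerturber

isp = InSilicoPerturber(
    perturb_type="overexpress",
    genes_to_perturb=["ENSG00000196843"],
    model_type="Pretrained",
    emb_mode="cls_and_gene",
    # must be the same dictionary used when tokenizing the input dataset
    token_dictionary_file="/home/Geneformer-2/geneformer/token_dictionary_gc95M.pkl",  # placeholder path
)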

ctheodoris changed discussion status to closed

Hi!
I found where the problem is after hours of thinking. For example, for my targeted gene ARID5A: not all cells express it, meaning its value is 0 in those cells, and according to your code those 0 values are excluded. So when the in silico perturber tries to overexpress ARID5A in cells without the ARID5A token (expression value 0), nothing can be ranked at the top, so these cells end up as tensors of size 2047, which do not match size 2048. Is there a good way to solve this? Or is the only option to filter for cells with nonzero ARID5A expression before tokenizing? (But that would not simulate real biological overexpression.) Strangely, I never encountered this error before, when I did not filter cell types, whereas this time I filtered for Endo cells, most of which lack ARID5A expression. So is it fine as long as most cells have nonzero expression?

There are a couple of things to keep in mind about the 0th position. If there is a CLS token as the 0th token, the overexpressed gene should move to the 1st position. This means it's important that the way you tokenized the data aligns with the model and with the dictionary/method used for the in silico perturbation. Also, if a gene is already at the 0th position, it is not overexpressed because there is no change from the original, so these cells are skipped. The issue may still arise from a mismatch of dictionaries. However, if you ensure they are aligned and still have an issue, please let us know which line of the code you are referring to.

Thanks for your answer! Let me show you my code step by step to see where the problems arise:
Ⅰ. 30M pretrained
Step ①: tokenize my loom file using the 30M dictionary:
from geneformer import TranscriptomeTokenizer

# tokenize with the 30M gene median and token dictionaries, no special tokens
tk = TranscriptomeTokenizer(
    {"age": "age", "age_group": "age_group", "Unified_sample": "sample",
     "condition": "condition", "gender": "gender", "species": "species",
     "tissue": "tissue", "CellType": "cell_type", "n_counts": "n_counts"},
    nproc=24,
    model_input_size=2048,
    special_token=False,
    gene_median_file="/home/Geneformer-2/geneformer/gene_median_dictionary_gc30M.pkl",
    token_dictionary_file="/home/Geneformer-2/geneformer/token_dictionary_gc30M.pkl",
)

tk.tokenize_data(
    "/data/02_Datasets/01_loom/Homo_sapiens/HeartAtrium/filtered/test+train_filter_endo_cells",
    "/data/02_Datasets/02_Geneformer/Homo_sapiens/HeartAtrium/",
    "update_test+train_Endo_30M.dataset",
)

Step ②: in silico perturbation
from geneformer import InSilicoPerturber

isp = InSilicoPerturber(
    perturb_type="overexpress",
    perturb_rank_shift=None,
    genes_to_perturb=["ENSG00000196843"],
    combos=0,
    anchor_gene=None,
    model_type="Pretrained",
    num_classes=0,
    emb_mode="cell_and_gene",
    max_ncells=None,
    emb_layer=0,
    forward_batch_size=64,
    nproc=24,
    token_dictionary_file="/home/Geneformer-2/geneformer/token_dictionary_gc30M.pkl",
    clear_mem_ncells=64,
)
isp.perturb_data(
    "/home/Geneformer-2/gf-6L-30M-i2048",
    "/data/02_Datasets/02_Geneformer/Homo_sapiens/HeartAtrium/update_test+train_Endo_30M.dataset/",
    "/data/03_Results/02_Geneformer/in_silico/e-overexpress_30M_pretrained-test+train/",
    "ARID5A_Endo_cell_and_gene",
)

from geneformer import InSilicoPerturberStats

ispstats = InSilicoPerturberStats(
    mode="aggregate_gene_shifts",
    genes_perturbed=["ENSG00000196843"],
    combos=0,
    anchor_gene=None,
    token_dictionary_file="/home/Geneformer-2/geneformer/token_dictionary_gc30M.pkl",
    gene_name_id_dictionary_file="/home/jiaming/Geneformer-2/geneformer/gene_name_id_dict_gc30M.pkl",
)
ispstats.get_stats(
    "/data/03_Results/02_Geneformer/in_silico/e-overexpress_30M_pretrained-test+train/",
    None,
    "/data/03_Results/02_Geneformer/in_silico_stats/e-overexpress_30M_finetuned-test+train/",
    "30M_pretrained_overexpress_ARID5A_Endo-test+train",
)

Ⅱ. 95M pretrained
The same as above, with every "30M" replaced by "95M", and with model_input_size=4096, special_token=True, emb_mode="cls_and_gene", and the model "/home/Geneformer-2/gf-20L-95M-i4096/".
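Concretely, the 95M tokenizer call would look roughly like this (the gc95M dictionary filenames here are placeholders following the gc30M naming pattern; adjust to your installation):

from geneformer import TranscriptomeTokenizer

tk = TranscriptomeTokenizer(
    {"CellType": "cell_type", "n_counts": "n_counts"},
    nproc=24,
    model_input_size=4096,  # 95M model input size
    special_token=True,     # add the CLS/EOS special tokens the 95M model expects
    gene_median_file="/home/Geneformer-2/geneformer/gene_median_dictionary_gc95M.pkl",  # placeholder name
    token_dictionary_file="/home/Geneformer-2/geneformer/token_dictionary_gc95M.pkl",   # placeholder name
)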

I also have some general questions about the perturbation:

For my target gene ARID5A (Ensembl ID: "ENSG00000196843"):
Ⅰ. For delete:
30M does not work because ARID5A ranks beyond 2048, while 95M works because ARID5A ranks within 4096.
But the question is, not all cells rank ARID5A beyond 2048 or within 4096, so isn't there a contradiction between different cells? Why does 30M always fail and 95M always work if this contradiction exists?

Ⅱ. For overexpress:
Both 30M and 95M give me the above errors, "size of tensor a (2047/4095) must match size of tensor b (2048/4096)". I initially thought they produced 2047/4095 because some cells that do not express ARID5A (whose 0 values are excluded) cannot find an ARID5A token to put at the front, so they lose one dimension (one gene token). If that is not the reason, I also wonder how you handle this situation.

Thank you for following up!

If you try "cell" (30M) and "cls" (95M) for the emb_mode options, does the same error occur? It would be helpful to isolate whether this comes from the cell embedding part, the gene embedding part, or both.

For your question about the ARID5A ranking, could you explain further what you mean? I could not understand the contradiction you are asking about, or what you meant by "always wrong" and "always work". In case this addresses your question: the ranking is based on scaled expression, so a given gene will be ranked at different positions in different cells' encodings. If ARID5A always falls between the ranks of 2048 and 4096, it will be included in the 4096 input size model but never in the 2048 input size model.

For your question about how we handle overexpression when the token is not in the cell: with overexpression of "all" genes, only genes expressed in the cell are overexpressed. For the single-gene case as you have set it up, the intended process is to delete the gene if it is present, insert it at the front regardless of whether it was there previously, truncate the cell back to the model's input size, and replace the last gene with the EOS token.
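In rough pseudocode, that intended process on a single cell's token list would be (a sketch for illustration only, not the actual implementation; cls_token and eos_token stand in for the special token IDs):

def overexpress_single_gene(input_ids, gene_token, model_input_size, cls_token, eos_token):
    # delete the gene if it is already present
    ids = [t for t in input_ids if t != gene_token]
    # insert the gene at the front (after CLS if special tokens are used)
    if ids and ids[0] == cls_token:
        ids = [cls_token, gene_token] + ids[1:]
    else:
        ids = [gene_token] + ids
    # truncate back to the model input size, ending with the EOS token
    if len(ids) > model_input_size:
        ids = ids[:model_input_size]
        ids[-1] = eos_token
    return ids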

Thank you so much for your patient answer!
For the first question, the contradiction I mean is this: in previous tests, when I tried to delete ARID5A with the 30M model, I got errors, and you said it was because this gene ranks beyond 2048. The contradiction is that I would guess at least a few cells rank this gene within 2048, but the error says 'cannot find gene to delete'; it seems an exaggeration to say that all the cells (around 200K) rank this gene beyond 2048.

For the second question, I think I finally understand: whether or not a cell contains this gene, the overexpress task is simply to put my target token at the first rank, right? Then I'm afraid my previous guess about why both 30M and 95M error is wrong. It is not because cells with 0 values of ARID5A are excluded; it may still be a dictionary problem, where my target gene token hits an error that finally results in 'losing one dimension' (2047 or 4095)?

And as for your suggestion of 'cell' (30M) and 'cls' (95M), that is exactly what I set from the beginning, so I see nowhere left to look for the source of the problem except the dictionaries or something I overlooked.

For the first question, thank you for clarifying. You can directly assess whether any cells have this gene within 2048 by checking the tokenized dataset to confirm whether the gene's token is present. The tokenization is a deterministic process.
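For example, a rough sketch of such a check (paths are placeholders; this assumes the tokenized dataset was saved with Hugging Face datasets and the token dictionary is a pickled dict mapping Ensembl IDs to token IDs):

import pickle
from datasets import load_from_disk

with open("token_dictionary_gc30M.pkl", "rb") as f:
    token_dict = pickle.load(f)
target_token = token_dict["ENSG00000196843"]  # ARID5A

ds = load_from_disk("update_test+train_Endo_30M.dataset")
# count cells whose 2048-token encoding contains the target token at all
n_present = sum(target_token in cell["input_ids"] for cell in ds)
print(f"{n_present} of {len(ds)} cells contain the target token")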

In your prior message, you wrote that you used "cls_and_gene". We were suggesting you try "cls" to isolate whether it is from the cls or gene embedding.

Hi, Christina! Sorry for my late reply after a busy period. I found where the overexpress errors arise. For example, still with the 95M pretrained model and the target gene ARID5A: when tokenizing, we set the max input to 4096, but when I checked the tokenizer's output (the .arrow dataset), I found that almost all cells' input_ids vectors are shorter than 4096. This means that when we overexpress ARID5A, for cells which did not originally contain ARID5A, simply adding the ARID5A token at the 0th position increases their dimension (one more than the original cell vector), and thus the errors happen:

[screenshot (2024-10-04): tensor size mismatch error]

For example, this means that while the 13th batch is being processed, a cell which originally has 3027 gene tokens gains the ARID5A token and becomes 3028-dimensional.

It seems the solution is either to cut one dimension from the perturbed cell or to add one dimension to the original cell; which do you think is better?

Thank you so much!

Thank you for your reply! Could you confirm if you are using the emb_mode of "cls" (not "cls_and_gene") when this occurs? Since the cls cosine similarity is calculated only on that single token, it shouldn't result in a mismatch. It would be helpful to know this information, which is what I had been asking about before. Also, please confirm you are using the most recent version of the code. Thank you!

Thank you for your patience! I have tried both 'cls' and 'cls_and_gene' with the 95M pretrained model on a dataset tokenized with cls added (special_token=True), and I have also tried 'cell_and_gene' with the 95M pretrained model on a dataset tokenized without cls (special_token=False). The 'cls' mode runs normally, but both 'cls_and_gene' and 'cell_and_gene' error out. Besides, I was using the latest version of the code, and I have tested several genes, including ARID5A (mentioned before) and some Endo-related genes. All of them except one gene (which is expressed at very low levels in these cells; n_detections = 24) failed to run through: they all got similar errors, getting stuck on some cell (as the progress bars show) with a size mismatch between the original and perturbed embeddings.

I carefully checked your code and found that there may be a logical problem in the 'get_embs()' function in 'perturber_utils.py'.

Let me use an example to clarify this:

In 'get_embs()', I noticed that all cells are padded with pad tokens so their token counts align for the model, giving us original_emb and full_perturbation_emb, but I did not find where these pad tokens get removed.

For cells which do not originally express ARID5A, the function 'pu.remove_perturbed_indices_set()' in 'in_silico_perturber.py' will not remove the ARID5A token embedding from original_emb; in contrast, the perturbed embedding has ARID5A removed by 'perturbation_emb = full_perturbation_emb[:, 1 + len(self.tokens_to_perturb) : -1, :]'.

① If the padding tokens are not removed, suppose that in the failing batch the max token count is 4000: all cells are padded with varying numbers of padding tokens to align to 4000, and in the end the perturbed embedding is one shorter than the original embedding.

② But if the cosine calculation ignores padding, or the padding is actually removed, then the logic should be right. For example, a cell originally has 3000 gene tokens and does not express ARID5A; after overexpression it has 3001 gene tokens. After 'get_embs()', the original cell does not need ARID5A removed, so it is still 3000, and the perturbed cell has it removed, so it is also 3000. In this situation the code logic is correct and the errors should not happen.

So I guess the error arises from situation ①, but as you can see, the message says 'tensor a 3027 ... tensor b 3028'. If situation ① were true, I would expect to find a batch whose max token count is 3027 or 3028, but I did not find one.
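To make this concrete, a small torch sketch (shapes and names are illustrative, not your actual variables) reproduces exactly this mismatch when the perturbed embedding carries one extra inserted token:

import torch

hidden = 8
orig_emb = torch.randn(1, 3027, hidden)  # cell that does not contain the target gene
pert_emb = torch.randn(1, 3028, hidden)  # one token longer after the overexpression insert

# broadcasting fails on the mismatched token dimension
try:
    torch.nn.functional.cosine_similarity(orig_emb, pert_emb, dim=2)
except RuntimeError as e:
    print(e)  # "The size of tensor a (3027) must match the size of tensor b (3028) ..."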

If you have any spare time, could you please look into this? Thank you so much!

I tested your code again, and I have confirmed that the 'delete' mode of both the 30M and 95M models works very well. Problems still come from the 'overexpress' mode in both models. Working through an example, I found the following:

If I use the 30M tokenized dataset as input and do not set 'token_dictionary_file' in 'InSilicoPerturber', your internal code logic runs the in silico perturbation according to the 95M token dictionary. That run does not hit errors, but in fact it overexpresses the wrong gene: I want to overexpress ARID5A, whose token is '16664' in the 30M dictionary and '16335' in the 95M dictionary, so a different gene is actually overexpressed, because '16335' in the 30M dictionary denotes the gene DUSP21.
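This is roughly how I compared the two dictionaries (paths and the gc95M filename are placeholders; both files are pickled dicts mapping Ensembl IDs to token IDs):

import pickle

def token_for(dict_path, ensembl_id):
    # look up the token ID assigned to an Ensembl gene ID in a given dictionary
    with open(dict_path, "rb") as f:
        return pickle.load(f)[ensembl_id]

print(token_for("token_dictionary_gc30M.pkl", "ENSG00000196843"))  # 16664 in the 30M dict
print(token_for("token_dictionary_gc95M.pkl", "ENSG00000196843"))  # 16335 in the 95M dict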

If I use the 30M dataset and set the 30M token dictionary, or if I use the 95M dataset and set the 95M token dictionary, so that the model, the tokenized dataset, and all the dictionaries match, both cases still produce the 'tensor size mismatch' errors.

As discussed above, I checked your code and found that there may be a logical problem in overexpress mode. I hope you could personally try overexpressing any single gene and see whether you get the same errors.

Thank you for your discussion - we believe this is due to the padding. We have staged a pull request to fix this and will merge it once we have completed testing it.

Thank you so much for your patience and contribution; I sincerely hope you can fix it smoothly!
