Errors of InSilicoPerturberStats

#436
by ZYSK-huggingface - opened

When I run the following code:

from geneformer import InSilicoPerturber

temp = ['ENSG00000196843']
for g in temp:
    isp = InSilicoPerturber(
        perturb_type="delete",
        perturb_rank_shift=None,
        genes_to_perturb=[g],
        combos=0,
        anchor_gene=None,
        model_type="Pretrained",
        num_classes=0,
        emb_mode="cls_and_gene",
        filter_data={"cell_type": ["Endo"]},
        cell_states_to_model={"state_key": "age_group", "start_state": "Y", "goal_state": "O", "alt_states": []},
        state_embs_dict=state_embs_dict,
        max_ncells=None,
        emb_layer=0,
        forward_batch_size=64,
        nproc=24,
        clear_mem_ncells=64)
    isp.perturb_data(
        "/home/Geneformer-2/gf-20L-95M-i4096/",  # gf-12L-95M-i4096
        "/data/02_Datasets/02_Geneformer/Homo_sapiens/HeartAtrium/cell_state.dataset",
        f"/data/03_Results/02_Geneformer/cell_state_perturbation/del_{g}",
        f"del_{g}_cell_state_perturbation")

from geneformer import InSilicoPerturberStats

for g in combined:
    ispstats = InSilicoPerturberStats(
        mode="goal_state_shift",
        genes_perturbed=[g],
        combos=0,
        anchor_gene=None,
        cell_states_to_model={"state_key": "age_group", "start_state": "Y", "goal_state": "O", "alt_states": []}
    )
    ispstats.get_stats(
        f"/data/03_Results/02_Geneformer/cell_state_perturbation/del_{g}/",
        None,
        f"/data/03_Results/02_Geneformer/cell_state_perturbation/del_{g}res/",
        f"del_{g}_cell_state_perturbation_res")

Errors occurred when running InSilicoPerturberStats:

Screenshot 2024-10-19 194155.png

I have already obtained results from in silico perturbation:

image.png

Could you please help me check this? Thank you so much!

For the mode="goal_state_shift", only the cell_emb values are relevant. Could you try to move the "gene_embs_dict" pickle file into a different folder and retry running the stats? If this fixes the issues, please let us know so we can adjust the code to account for this potential scenario.
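One way to try this suggestion is sketched below. It is only an illustration, not part of Geneformer: it assumes the gene embedding pickles can be recognized by "gene_embs" in their file names (by analogy with the "cell_embs" file names reported later in this thread), and the helper name `move_gene_emb_pickles` and the holding folder are hypothetical.

```python
import shutil
from pathlib import Path


def move_gene_emb_pickles(results_dir, holding_dir, marker="gene_embs"):
    """Move pickles whose names contain `marker` out of `results_dir`.

    Assumption: gene embedding outputs are identifiable by "gene_embs"
    in the file name; adjust `marker` if your files are named differently.
    Returns the names of the files that were moved.
    """
    results_dir = Path(results_dir)
    holding_dir = Path(holding_dir)
    holding_dir.mkdir(parents=True, exist_ok=True)

    moved = []
    # sorted() materializes the listing before any file is moved.
    for path in sorted(results_dir.glob("*.pickle")):
        if marker in path.name:
            shutil.move(str(path), str(holding_dir / path.name))
            moved.append(path.name)
    return moved
```

With the paths from the original post, this could be called on the `del_{g}` results folder with any separate folder of your choice as `holding_dir`, after which the stats can be rerun on the remaining files.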

Thanks for your advice! It works when the stats are run on only the 'cell_embs_dict' files. I have a question, however, about how to use the results. For example, I iterated over all genes to see which ones are more likely to shift cells toward the goal end. I find that many genes with higher shift_to_goal_end values have low N_detections. So what is a good standard for selecting important genes?

Interestingly, when I searched the literature for these genes one by one, some with low N_detections (even 1) but high shift values do have reported connections with my disease or cell states of interest, though often not in the currently filtered cell_type. For example, I filtered for Endo cells and got these gene results; some genes with N_detections of only 1 but a high shift value have reported connections with other cell types.

Finally, I would like to know what 'sig' means and how to interpret a negative value in 'shift_to_goal_end'.

image.png

Thank you so much !

Thank you! Please see the documentation for a detailed description of all column names: https://geneformer.readthedocs.io/en/latest/geneformer.in_silico_perturber_stats.html

We would recommend using statistical significance as the standard to select important genes. This takes into account both the magnitude of the shift and the number of observations that yield statistical power.
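As a rough sketch of how this recommendation might be applied to the stats output, the snippet below keeps only rows flagged as significant and then ranks them by shift. The column names `Sig`, `Shift_to_goal_end` and `N_Detections` and the encoding of `Sig` as 1/0 are assumptions to be checked against the documentation linked above; the helper name `rank_significant_genes`, the gene names and all values are made up purely for illustration.

```python
def rank_significant_genes(rows, sig_col="Sig", shift_col="Shift_to_goal_end",
                           n_col="N_Detections", min_detections=1):
    """Keep rows flagged significant, sorted by descending shift.

    `rows` is a list of dicts, e.g. from csv.DictReader on the stats CSV.
    Column names are assumptions; pass your own if they differ.
    """
    kept = []
    for row in rows:
        # Values read from a CSV are strings, so compare against string forms.
        is_sig = str(row[sig_col]).strip() in {"1", "1.0", "True"}
        if is_sig and int(float(row[n_col])) >= min_detections:
            kept.append(row)
    return sorted(kept, key=lambda r: float(r[shift_col]), reverse=True)


# Made-up rows for illustration only (not real results).
example_rows = [
    {"Gene_name": "GENE_A", "Shift_to_goal_end": "0.020", "N_Detections": "1", "Sig": "0"},
    {"Gene_name": "GENE_B", "Shift_to_goal_end": "0.005", "N_Detections": "850", "Sig": "1"},
    {"Gene_name": "GENE_C", "Shift_to_goal_end": "0.012", "N_Detections": "400", "Sig": "1"},
    {"Gene_name": "GENE_D", "Shift_to_goal_end": "-0.003", "N_Detections": "900", "Sig": "1"},
]
ranked = rank_significant_genes(example_rows)
print([row["Gene_name"] for row in ranked])
```

In this toy example the gene with the largest shift is dropped because it is not flagged as significant, which mirrors the point that a large shift seen in a single detection carries little statistical power.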

ctheodoris changed discussion status to closed

Thanks for your reply. But for simple single-gene or combined-gene perturbation without cell_state mode, is there any way to filter important genes as above? Maybe 'Cosine_sim_stdev', which considers both N_detections and the cos_sim?

Sorry to bother you again. I think the reason I got so many gene_embs_dict files is that I set emb_mode='cell_and_gene'. Since cell state perturbation only needs cell_emb in mode 'goal_state_shift', could I simply set emb_mode to 'cell'? Would this influence the statistical tests?

However, when using 'cell_and_gene', the output cell_embs_dict files include many meaningless files with no content, such as 'in_silico_delete_del_all_dict_cell_embs_0batch57_raw.pickle', while 'in_silico_delete_del_all_dict_cell_embs_0batch-1_raw.pickle' is meaningful. So I have to select only the files ending in 'batch-1_raw.pickle' to obtain the final csv.

In contrast, when using 'cell', there are more output cell_embs_dict files, and each file has meaningful content, whether or not it is '…batch-1_raw.pickle'. This makes me a little worried, but the 'cell' mode seems far quicker.
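To check which output files are empty in either mode, rather than relying on the file name alone, something like the following sketch could be used. It assumes that "no content" means either a zero-byte file or a pickle that loads to an empty container, and that the files match the naming seen above; the helper name `split_pickles_by_content` is hypothetical. Only load pickles you produced yourself, since unpickling untrusted files is unsafe.

```python
import pickle
from pathlib import Path


def split_pickles_by_content(results_dir, pattern="*cell_embs*_raw.pickle"):
    """Return (non_empty, empty) lists of pickle file names in `results_dir`.

    Assumption: a file counts as empty if it has zero bytes or if the
    unpickled object has length 0 (for example an empty dict).
    """
    non_empty, empty = [], []
    for path in sorted(Path(results_dir).glob(pattern)):
        if path.stat().st_size == 0:
            empty.append(path.name)
            continue
        with open(path, "rb") as handle:
            obj = pickle.load(handle)
        try:
            has_content = len(obj) > 0
        except TypeError:
            # Objects without a length are treated as content unless None.
            has_content = obj is not None
        (non_empty if has_content else empty).append(path.name)
    return non_empty, empty
```

Running this on the output folders from both emb_mode settings would show whether the two runs differ only in how the same content is split across files.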

Thank you again!
