ctheodoris/Geneformer · Disagreement between 30M and 95M cell state perturbation results

ZYSK-huggingface

6 days ago

edited 6 days ago

Hi Christina!

Here I want to show you the check I ran for discussion #451.

I used the load_from_disk function from the datasets library to inspect my tokenized datasets.

I use gene 'C7' as an example:

① In 30M, the token for 'C7' is 4115, and its N_detections in the 30M_cell_state_perturbation results is 3628:

I used load_from_disk to read the 30M tokenized dataset and counted how many cells express 'C7':

(Y30 is my state A cells, which I set as the start_state for the perturbation, so 'C7' is checked in these 10785 cells.)

After checking the cells one by one, the count for 'C7' is also 3628, which agrees with the perturbation result:
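The per-cell check described above can be sketched roughly as follows. This is only a minimal illustration, not my exact script: it assumes each cell is a record whose tokens are stored under an input_ids key, and the toy cells and the helper name count_cells_with_token are made up for the example.

```python
from typing import Iterable, Mapping, Sequence


def count_cells_with_token(
    cells: Iterable[Mapping[str, Sequence[int]]], token_id: int
) -> int:
    """Count the cells whose token list contains token_id."""
    # Assumption: each cell stores its tokens under the "input_ids" key.
    return sum(1 for cell in cells if token_id in cell["input_ids"])


# Toy stand-in for a tokenized dataset (hypothetical values, not real data).
toy_cells = [
    {"input_ids": [4115, 12, 907]},
    {"input_ids": [12, 907]},
    {"input_ids": [33, 4115]},
]

n_detected = count_cells_with_token(toy_cells, 4115)
print(n_detected)

# With a real dataset, the same helper could be applied to the object
# returned by datasets.load_from_disk (the path below is hypothetical):
#   from datasets import load_from_disk
#   ds = load_from_disk("path/to/tokenized.dataset")
#   count_cells_with_token(ds, 4115)
```

The same function can be run on the 30M and 95M datasets with their respective token IDs (4115 and 4086), so that both counts come from identical logic.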

② In 95M, however, a problem arises.
First, in 95M the token for gene 'C7' is 4086, and its N_detections in the 95M_cell_state_perturbation results is 142:

I used load_from_disk to read the 95M tokenized dataset and counted how many cells express 'C7':

(Here subset_Y_Endo has the same meaning as Y30: both contain the same 10785 cells, tokenized with different dictionaries.)

After checking the cells one by one, the count for 'C7' is 6239, which does NOT agree with the perturbation result:

I am sure that I used the correct dictionaries, datasets, and parameters, so I would now like to check the intermediate files as you suggested. Unfortunately, I am not sure how to inspect them to confirm how a gene such as 'C7' was handled during perturbation. Could you please give me some detailed instructions? I would be most grateful for your help. Thank you in advance for your time and consideration!

Below is a screenshot of my intermediate files (they are very large):

ctheodoris

Owner 6 days ago

Thank you for following up. Please feel free to follow up within the same discussion and reopen it so it is easier for others to track the discussion in the future. I will respond to this question in your initial discussion #451 since it's related.

ctheodoris changed discussion status to closed 6 days ago