Interpretation for mode "aggregate_gene_shifts"

#393

by ZYSK-huggingface - opened Aug 24

Aug 24

Thank you so much for your answer in discussion #391, Christina!

I still wonder what the relationship is between the "Affected cell embed" and the "Affected gene name" in each row. Does it mean that the affected gene is in that affected cell? I ask because I found no replicated cell embed for affected genes with n_detections > 1.

Besides, it seems that almost all the columns describe the affected gene, while the cell embed has only one column. Is my understanding correct? What confuses me further is that I selected only 1000 random cells, but the number of affected cell embed values is far over 1000: I got 16538 rows, every row has an affected cell embed, and each one is unique (no replicated cell embed).

Last but not least, I noticed some inconsistencies in these results:
① For the first row, the leftmost gene embed id is 16536, but the "affected gene name" and "affected ensemble id" columns have no value. Is that a mistake or something else?

② For the selected row, the leftmost gene embed id is identical to my perturbed gene id, but the affected gene name is totally different.

I'm sorry for asking so many questions, but I really appreciate your wonderful work. I hope to receive your answer. Thank you again!

ctheodoris

Owner Aug 24

edited Aug 24

Let’s say you have the following three cells:

Cell 1 has Gene A, B, C, D
Cell 2 has Gene A, B, C, Z
Cell 3 has Gene B, C, D, Z

And you want to understand the effect of deleting Gene A. First, Cell 3 will be filtered out because it doesn’t have Gene A. You will be left with 2 cells. If your max n_cells is more than 2, both will be included. Now if you select the mode cell_and_gene, you will get the shifts in the embeddings of both the cells and the individual genes in response to Gene A being deleted.

First, you compare the original Cell 1 to Cell 1 with Gene A deleted. You get the cosine similarity between the embedding of Gene B in the original cell and the embedding of Gene B in the simulated perturbed cell. You compute the same kind of value for Gene C and for Gene D. Then, if you are using the mean pooling version, you average the embeddings of Gene B, C, and D in the original cell and compare them to their average in the perturbed cell; this is the effect on the cell embedding.
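As a rough illustration of the comparison just described for Cell 1, here is a minimal sketch using NumPy. The embeddings are random toy vectors and the embedding size, the noise scale, and all names (`cosine_similarity`, `original`, `perturbed`) are assumptions made for the example; this is not the Geneformer implementation.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
dim = 8  # toy embedding size (assumption)

# Toy gene embeddings for Cell 1 before the perturbation.
original = {g: rng.normal(size=dim) for g in ["A", "B", "C", "D"]}

# Toy gene embeddings after deleting Gene A: the remaining genes move a little.
remaining = ["B", "C", "D"]
perturbed = {g: original[g] + 0.1 * rng.normal(size=dim) for g in remaining}

# Gene-level shift: one cosine similarity per remaining gene.
gene_shifts = {g: cosine_similarity(original[g], perturbed[g]) for g in remaining}

# Cell-level shift with mean pooling: average B, C, D in each state, then compare.
cell_original = np.mean([original[g] for g in remaining], axis=0)
cell_perturbed = np.mean([perturbed[g] for g in remaining], axis=0)
cell_shift = cosine_similarity(cell_original, cell_perturbed)

print(gene_shifts)
print(cell_shift)
```

Note that Gene A never appears in `gene_shifts`, since it has no embedding in the perturbed cell to compare against.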

Now you do the same for Cell 2 and you get a value for the cell embedding shift as well as the value for each of the gene shifts B, C, and Z.

In the output csv, the perturbed gene will be Gene A for every row if you provided it as the specific gene to perturb (rather than the “all” option). The average of the cell embedding shifts for Cell 1 and 2 will be annotated in the row where the Affected is cell_emb. Then there will be a row where the Affected is Gene B with 2 detections. There will also be a row where the Affected is Gene C with 2 detections, another row where the Affected is Gene D with 1 detection, and another row where the Affected is Gene Z with 1 detection.

In the end you will have 1 row indicating the average of the cell embedding shifts, which will have the same number of detections as the number of cells, since there is a cell embedding for every cell. The gene embedding rows will have variable detections, depending on the number of cells in which that gene is found. There will be no Gene A in the Affected column because you cannot calculate the effect of deleting Gene A on Gene A itself, as it is absent. You will get more rows than cells because not every cell has exactly the same genes.
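The three-cell example can be traced in a few lines of plain Python. This is only a sketch of the bookkeeping: the cosine similarity values in `toy_shifts` are made up, and the row keys (`Perturbed`, `Affected`, `Cosine_sim_mean`, `N_Detections`) are illustrative names, not a guaranteed match for the real output columns.

```python
from collections import defaultdict
from statistics import mean

cells = {
    "Cell 1": ["A", "B", "C", "D"],
    "Cell 2": ["A", "B", "C", "Z"],
    "Cell 3": ["B", "C", "D", "Z"],
}
perturbed_gene = "A"

# Step 1: keep only the cells that contain the gene being deleted.
kept = {name: genes for name, genes in cells.items() if perturbed_gene in genes}

# Made-up cosine similarities, keyed by (cell, affected entity).
toy_shifts = {
    ("Cell 1", "cell_emb"): 0.98, ("Cell 1", "B"): 0.99,
    ("Cell 1", "C"): 0.97, ("Cell 1", "D"): 0.95,
    ("Cell 2", "cell_emb"): 0.96, ("Cell 2", "B"): 0.97,
    ("Cell 2", "C"): 0.99, ("Cell 2", "Z"): 0.93,
}

# Step 2: collect one cell-level value per cell and one value per remaining gene.
collected = defaultdict(list)
for name, genes in kept.items():
    collected["cell_emb"].append(toy_shifts[(name, "cell_emb")])
    for gene in genes:
        if gene != perturbed_gene:  # the deleted gene cannot be compared with itself
            collected[gene].append(toy_shifts[(name, gene)])

# Step 3: one output row per affected entity, averaged over its detections.
rows = [
    {
        "Perturbed": perturbed_gene,
        "Affected": affected,
        "Cosine_sim_mean": mean(values),
        "N_Detections": len(values),
    }
    for affected, values in collected.items()
]

for row in rows:
    print(row["Affected"], row["N_Detections"])
```

With 2 cells kept, this produces 5 rows: cell_emb, B, and C with 2 detections each, and D and Z with 1 detection each.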

Hopefully that helps.

ctheodoris changed discussion status to closed Aug 24

ZYSK-huggingface

Aug 24

So, to my understanding, in this "cell_and_gene" mode there is no cosine similarity shift column for the cell embed, only one row that holds the average shift of all affected cell embeds (is that the first row?). From the second row on, all rows are affected gene embeds, and the "affected cell embed", which I thought was the cell embed column, is in fact the gene embed (that makes sense). Is my understanding correct?

For example, in the highlighted row of the picture below, gene KCNA11's embed is 1280 and it was detected in only one cell, so the cosine_sim_stdev is 0. But what does "12198" in the first column mean? Is it the original rank of gene KCNA11 or something else?

Thank you for your patience. I'm so grateful for your every reply!

ctheodoris

Owner Aug 24

There is no "affected cell embed" label that you mention. There is only “Affected”, which indicates whether the row corresponds to the cell embedding or to a gene. In all cases the number reflects the average cosine similarity across all cells. There are many different genes but just 1 value that corresponds to the cell embedding, which is why there are more rows for genes than cells. The number in the first column is just an index and can be ignored.
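For reading such a file, a small sketch using only the Python standard library is shown here. The CSV content is a toy stand-in, and the column names (`Perturbed`, `Affected`, `Cosine_sim_mean`, `Cosine_sim_stdev`, `N_Detections`) are assumptions for the example; check them against the header of your own output file.

```python
import csv
import io

# Toy CSV; the unnamed first column is the row index written when the file was saved.
toy_csv = """,Perturbed,Affected,Cosine_sim_mean,Cosine_sim_stdev,N_Detections
0,A,cell_emb,0.97,0.01,2
1,A,B,0.98,0.01,2
2,A,D,0.95,0.0,1
"""

rows = list(csv.DictReader(io.StringIO(toy_csv)))

# The index column (empty header) carries no biological meaning, so drop it.
for row in rows:
    row.pop("", None)

# Split by the Affected column: one cell embedding row, many gene rows.
cell_rows = [r for r in rows if r["Affected"] == "cell_emb"]
gene_rows = [r for r in rows if r["Affected"] != "cell_emb"]

print(len(cell_rows), len(gene_rows))
```

For a real file, `io.StringIO(toy_csv)` would be replaced by an opened file handle.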
