multitask_cell_classification vs cell_classification pipeline
Dear Dr. Theodoris,
Thank you for creating such a great tool! Using the example of task_columns = ["cell_type", "disease_state"] in the tutorial, I would like to ask for your advice on the following scenario.
Suppose I want to differentiate diseased astrocytes from healthy astrocytes. One approach would be to use the multitask_cell_classification pipeline. Another approach, I assume, would be to first identify the astrocytes, subset the data to those cells, and then distinguish between diseased and healthy states, using the cell_classification pipeline for both steps.
What are your thoughts or suggestions on these two different approaches?
Many thanks in advance!
Napert
Thank you for your question and interest in Geneformer! Both approaches are valid. If you are truly only interested in astrocytes, because the disease only affects astrocytes and they are otherwise homogeneous, then it should be sufficient to first filter the data down to astrocytes and perform single-task fine-tuning. If there are variable states within the astrocytes that you think are also important for the model to learn, such as early vs. late developmental stage or on vs. off treatment, then it may be useful to include those as additional tasks so the model learns the attributes jointly. This helps most when the attributes could be cross-informative and the disease mechanism could be context-specific. Multi-task learning is also a good fit if you'd like to jointly learn about the disease process in multiple cell types, for example in diseases with multicellular pathology.
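To make the difference between the two setups concrete, here is a minimal sketch in plain Python. It does not call Geneformer; the toy records and the helper names (build_label_map, single_task_examples, multitask_examples) are hypothetical and only illustrate how the labels would be organized in each case. The column names follow the task_columns example from the question.

```python
# Toy per-cell metadata. The records are made up for illustration only.
cells = [
    {"cell_id": "c1", "cell_type": "astrocyte", "disease_state": "diseased"},
    {"cell_id": "c2", "cell_type": "astrocyte", "disease_state": "healthy"},
    {"cell_id": "c3", "cell_type": "neuron", "disease_state": "diseased"},
    {"cell_id": "c4", "cell_type": "microglia", "disease_state": "healthy"},
]


def build_label_map(records, column):
    """Map each distinct value in `column` to an integer class id (sorted for determinism)."""
    return {value: idx for idx, value in enumerate(sorted({r[column] for r in records}))}


def single_task_examples(records, keep_cell_type, label_column):
    """Approach 1: subset to one cell type, then keep a single label per cell."""
    subset = [r for r in records if r["cell_type"] == keep_cell_type]
    label_map = build_label_map(subset, label_column)
    examples = [(r["cell_id"], label_map[r[label_column]]) for r in subset]
    return examples, label_map


def multitask_examples(records, task_columns):
    """Approach 2: keep all cells and attach one label per task column."""
    label_maps = {col: build_label_map(records, col) for col in task_columns}
    examples = [
        (r["cell_id"], {col: label_maps[col][r[col]] for col in task_columns})
        for r in records
    ]
    return examples, label_maps


single, single_map = single_task_examples(cells, "astrocyte", "disease_state")
multi, multi_maps = multitask_examples(cells, ["cell_type", "disease_state"])
```

In the single-task case only the two astrocytes remain and each carries one disease label. In the multi-task case all four cells are kept and each carries both a cell_type and a disease_state label, which is what lets the model share information across the tasks.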