How to preprocess the raw data?
Thanks a lot for your great work on Geneformer.
I have a question. If I download a .h5ad file from a gene database website and want to pretrain a new model with it, how should I preprocess it? Do I need to apply any preprocessing to the .h5ad file, such as log transformation, normalization, or filtering?
Or should I just convert the raw .h5ad data to the .loom format and then use TranscriptomeTokenizer to convert it to the .arrow format?
Thank you for your interest in Geneformer! The input is raw counts without any normalization or filtering. Please check out the instructions in the example notebook for tokenizing data for further details.
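As a quick sanity check before tokenizing, one can verify that an expression matrix still holds raw counts rather than normalized or log-transformed values. The following is a minimal sketch in plain Python; `looks_like_raw_counts` is a hypothetical helper, not part of Geneformer, and it assumes the matrix has already been loaded as rows of numbers (one row per cell).

```python
def looks_like_raw_counts(matrix):
    """Return True if every value is a non-negative whole number.

    Hypothetical helper, not part of Geneformer. Normalized or
    log-transformed data usually contains fractional values, so a
    single non-integer entry is treated as a sign of preprocessing.
    """
    for row in matrix:
        for value in row:
            if value < 0 or not float(value).is_integer():
                return False
    return True


raw = [[0, 3, 12], [5, 0, 1]]
normalized = [[0.0, 1.39, 2.56], [1.79, 0.0, 0.69]]

print(looks_like_raw_counts(raw))         # True
print(looks_like_raw_counts(normalized))  # False
```

A check like this only detects obvious transformations; it cannot tell whether cells or genes were filtered out upstream.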
Thanks for your fast reply.
However, while reading the paper, I found the following description of data processing:
"We excluded cells with high mutational burdens (for example, malignant cells and immortalized cell lines) that could lead to substantial network rewiring without companion genome sequencing to facilitate interpretation. We only included droplet-based sequencing platforms to assure expression value unit comparability. Overall, 561 datasets were included and stored as uniform files in the .loom HDF5 format including metadata from the original studies as row (feature) and column (cell) attributes described below.
Column cell attributes also included several calculated measurements for each cell: the total number of read counts, the percentage of mitochondrial reads, the number of genes Ensembl-annotated as protein-coding or miRNA genes and whether the cell passed the quality-control metrics we established for scalable filtering of the cells to exclude possible doublets and/or damaged cells. Only cells that passed these filtering metrics were used for downstream analyses in this work. Specifically, datasets were filtered to retain cells with total read counts within 3 s.d. of the mean within that dataset and mitochondrial reads within 3 s.d. of the mean within that dataset. Ensembl-annotated protein-coding and miRNA genes were used for downstream analysis. Cells with less than seven detected Ensembl-annotated protein-coding or miRNA genes were excluded as the 15% masking used for the pretraining learning objective would not reliably mask a gene in cells with fewer detected genes. Ultimately, 27.4 million (27,406,217) cells passed the defined quality filters."
Are these procedures, especially those in the second paragraph, part of the required data preprocessing? Could you please provide code?
The sections from the Methods that you referenced are related to our inclusion criteria for the pretraining corpus. Genecorpus-30M already reflects the cells that were included by these criteria. These criteria are not constraints for any new data presented to the model. For example, users can use plate-based sequencing platforms, they can use immortalized cell lines, and they can use any filtering criteria they deem appropriate. The example notebook for tokenizing data provides comprehensive instructions for any preprocessing that should (or more importantly should not) be done for data used with Geneformer.
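For readers who nonetheless want to apply filters like those quoted from the Methods to their own data, here is a minimal sketch in plain Python. It is not the authors' code: the function names are hypothetical, the per-cell values (total read counts, percentage of mitochondrial reads, number of detected genes) are assumed to have been computed beforehand for a single dataset, and the use of the population standard deviation is an assumption, since the quoted text does not say which estimator was used.

```python
from statistics import mean, pstdev

MIN_GENES = 7  # threshold quoted from the paper's Methods


def within_3sd(values):
    """Flag each value that lies within 3 s.d. of the mean of `values`."""
    mu = mean(values)
    sd = pstdev(values)  # assumption: population s.d.
    return [abs(v - mu) <= 3 * sd for v in values]


def passes_qc(total_counts, pct_mito, n_genes):
    """Return one boolean per cell combining the three quoted criteria.

    Hypothetical helper. Each argument is a per-cell list for one dataset.
    """
    counts_ok = within_3sd(total_counts)
    mito_ok = within_3sd(pct_mito)
    return [
        c and m and g >= MIN_GENES
        for c, m, g in zip(counts_ok, mito_ok, n_genes)
    ]


# Toy dataset of 20 cells: cell 18 has too few genes, cell 19 has extreme counts.
total_counts = [1000] * 19 + [100000]
pct_mito = [5.0] * 20
n_genes = [200] * 18 + [6, 200]

keep = passes_qc(total_counts, pct_mito, n_genes)
print(sum(keep))  # 18
```

The seven-gene threshold follows the reasoning in the quoted passage: 15% of six genes is 0.9, less than one gene, whereas 15% of seven genes is 1.05. As stated above, none of these filters is required for new data presented to the model.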