Possible bug in tokenizer (No objects to concatenate)

#449
by Arkimond9620 - opened

Hello,

First, I want to say that Geneformer is awesome and very interesting =). I was running the TranscriptomeTokenizer on a .h5ad dataset and got the following error:

File ~/miniconda/envs/genomics_env/lib/python3.10/site-packages/geneformer/tokenizer.py:453, in TranscriptomeTokenizer.tokenize_anndata(self, adata_file_path, target_sum)
    452 def tokenize_anndata(self, adata_file_path, target_sum=10_000):
--> 453     adata = sum_ensembl_ids(
    454         adata_file_path,
    455         self.collapse_gene_ids,
    456         self.gene_mapping_dict,
    457         self.gene_token_dict,
    458         file_format="h5ad",
    459         chunk_size=self.chunk_size,
    460     )
    462     if self.custom_attr_name_dict is not None:
    463         file_cell_metadata = {
    464             attr_key: [] for attr_key in self.custom_attr_name_dict.keys()
    465         }

File ~/miniconda/envs/genomics_env/lib/python3.10/site-packages/geneformer/tokenizer.py:259, in sum_ensembl_ids(data_directory, collapse_gene_ids, gene_mapping_dict, gene_token_dict, file_format, chunk_size)
    256         df_sum.index = data_dup_gene.obs.index
    257         processed_chunks.append(df_sum)
--> 259     processed_chunks = pd.concat(processed_chunks, axis=1)
    260     processed_genes.append(processed_chunks)
    261 processed_genes = pd.concat(processed_genes, axis=0)

File ~/miniconda/envs/genomics_env/lib/python3.10/site-packages/pandas/core/reshape/concat.py:382, in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    379 elif copy and using_copy_on_write():
    380     copy = False
--> 382 op = _Concatenator(
    383     objs,
    384     axis=axis,
    385     ignore_index=ignore_index,
    386     join=join,
    387     keys=keys,
    388     levels=levels,
    389     names=names,
    390     verify_integrity=verify_integrity,
    391     copy=copy,
    392     sort=sort,
    393 )
    395 return op.get_result()

File ~/miniconda/envs/genomics_env/lib/python3.10/site-packages/pandas/core/reshape/concat.py:445, in _Concatenator.__init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    442 self.verify_integrity = verify_integrity
    443 self.copy = copy
--> 445 objs, keys = self._clean_keys_and_objs(objs, keys)
    447 # figure out what our result ndim is going to be
    448 ndims = self._get_ndims(objs)

File ~/miniconda/envs/genomics_env/lib/python3.10/site-packages/pandas/core/reshape/concat.py:507, in _Concatenator._clean_keys_and_objs(self, objs, keys)
    504     objs_list = list(objs)
    506 if len(objs_list) == 0:
--> 507     raise ValueError("No objects to concatenate")
    509 if keys is None:
    510     objs_list = list(com.not_none(*objs_list))

ValueError: No objects to concatenate
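For context, this is roughly how I was invoking the tokenizer (the paths and the custom attribute below are placeholders, not my actual setup):

from geneformer import TranscriptomeTokenizer

# Placeholder paths and attribute names; the real run points at my .h5ad dataset.
tk = TranscriptomeTokenizer({"cell_type": "cell_type"}, nproc=4)
tk.tokenize_data(
    "data/",        # directory containing the .h5ad file
    "tokenized/",   # output directory
    "my_dataset",   # output file prefix
    file_format="h5ad",
)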

After looking through the code a bit, I found this piece of code:

gene_ids_in_dict = [
    gene for gene in data.var.ensembl_id if gene in gene_token_dict.keys()
]
if collapse_gene_ids is False:
    if len(gene_ids_in_dict) == len(set(gene_ids_in_dict)):
        return data
    else:
        raise ValueError("Error: data Ensembl IDs non-unique.")

# Check for when collapse_gene_ids is True
gene_ids_collapsed = [
    gene_mapping_dict.get(gene_id.upper()) for gene_id in data.var.ensembl_id
]
gene_ids_collapsed_in_dict = [
    gene for gene in gene_ids_collapsed if gene in gene_token_dict.keys()
]
if len(set(gene_ids_in_dict)) == len(set(gene_ids_collapsed_in_dict)):
    data.var["ensembl_id_collapsed"] = data.var.ensembl_id.map(gene_mapping_dict)
    return data
(https://huggingface.co/ctheodoris/Geneformer/blob/3a6866963d20ff5ce7c82d2c4544cb2b304017e9/geneformer/tokenizer.py#L209-L228)

Looking at the ensembl_mapping_dict_gc95M.json dictionary, I noticed that some Ensembl IDs map to a different Ensembl ID rather than to themselves.
For example:

ENSG00000274734: ENSG00000285077

and

ENSG00000004059: ENSG00000004059
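
A quick way to see both kinds of entries is to load the mapping file directly (the path below is a placeholder for wherever your local copy of the file lives):

import json

# Hypothetical local path to the mapping file from the Geneformer repo.
with open("ensembl_mapping_dict_gc95M.json") as f:
    gene_mapping_dict = json.load(f)

print(gene_mapping_dict["ENSG00000274734"])  # ENSG00000285077 (maps to a different ID)
print(gene_mapping_dict["ENSG00000004059"])  # ENSG00000004059 (maps to itself)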

What happens when you have a dataset in which every Ensembl ID maps to itself except one? Then every gene appears in both gene_ids_in_dict and gene_ids_collapsed_in_dict, with a single exception: the one that doesn't map to itself. If you have 100 genes, gene_ids_in_dict will have 99 elements and gene_ids_collapsed_in_dict will have 100 (because ENSG00000274734 is not in the token dictionary but ENSG00000285077 is). The set sizes now differ even though no gene is duplicated, so the equality check fails and execution falls through to the branch that assumes at least one duplicated gene. That branch finds nothing to collapse, so processed_chunks stays empty and pd.concat raises "No objects to concatenate".
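
To make the failure mode concrete, here is a minimal standalone sketch of that logic with toy data (the variable names mirror tokenizer.py, but this is my own illustration, not the library code):

import pandas as pd

# 99 made-up token IDs plus the collapsed target, which IS in the vocabulary.
gene_token_dict = {f"ENSG{i:011d}": i for i in range(99)}
gene_token_dict["ENSG00000285077"] = 99

# 99 genes map to themselves; exactly one maps to a different Ensembl ID.
gene_mapping_dict = {gene: gene for gene in list(gene_token_dict)[:99]}
gene_mapping_dict["ENSG00000274734"] = "ENSG00000285077"

ensembl_ids = list(gene_mapping_dict)  # 100 unique IDs in the dataset

gene_ids_in_dict = [g for g in ensembl_ids if g in gene_token_dict]
gene_ids_collapsed = [gene_mapping_dict.get(g.upper()) for g in ensembl_ids]
gene_ids_collapsed_in_dict = [g for g in gene_ids_collapsed if g in gene_token_dict]

# 99 vs. 100: the sets differ in size even though nothing is duplicated.
print(len(set(gene_ids_in_dict)), len(set(gene_ids_collapsed_in_dict)))

# The equality check fails, so execution falls through to the collapsing
# branch, which loops over duplicated genes, finds none, and ends up calling:
pd.concat([], axis=1)  # ValueError: No objects to concatenate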

Am I right?

Arkimond9620 changed discussion title from Possible bug in preprocessing (No objects to concatenate) to Possible bug in tokenizer (No objects to concatenate)

Thank you so much for pointing out this case! We just merged a change to resolve this.

ctheodoris changed discussion status to closed
