Possible bug in tokenizer (No objects to concatenate)
Hello,
First, I want to say that Geneformer is awesome and very interesting =). I was running the TranscriptomeTokenizer on a .h5ad dataset and got the following error:
File ~/miniconda/envs/genomics_env/lib/python3.10/site-packages/geneformer/tokenizer.py:453, in TranscriptomeTokenizer.tokenize_anndata(self, adata_file_path, target_sum)
452 def tokenize_anndata(self, adata_file_path, target_sum=10_000):
--> 453 adata = sum_ensembl_ids(
454 adata_file_path,
455 self.collapse_gene_ids,
456 self.gene_mapping_dict,
457 self.gene_token_dict,
458 file_format="h5ad",
459 chunk_size=self.chunk_size,
460 )
462 if self.custom_attr_name_dict is not None:
463 file_cell_metadata = {
464 attr_key: [] for attr_key in self.custom_attr_name_dict.keys()
465 }
File ~/miniconda/envs/genomics_env/lib/python3.10/site-packages/geneformer/tokenizer.py:259, in sum_ensembl_ids(data_directory, collapse_gene_ids, gene_mapping_dict, gene_token_dict, file_format, chunk_size)
256 df_sum.index = data_dup_gene.obs.index
257 processed_chunks.append(df_sum)
--> 259 processed_chunks = pd.concat(processed_chunks, axis=1)
260 processed_genes.append(processed_chunks)
261 processed_genes = pd.concat(processed_genes, axis=0)
File ~/miniconda/envs/genomics_env/lib/python3.10/site-packages/pandas/core/reshape/concat.py:382, in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
379 elif copy and using_copy_on_write():
380 copy = False
--> 382 op = _Concatenator(
383 objs,
384 axis=axis,
385 ignore_index=ignore_index,
386 join=join,
387 keys=keys,
388 levels=levels,
389 names=names,
390 verify_integrity=verify_integrity,
391 copy=copy,
392 sort=sort,
393 )
395 return op.get_result()
File ~/miniconda/envs/genomics_env/lib/python3.10/site-packages/pandas/core/reshape/concat.py:445, in _Concatenator.__init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
442 self.verify_integrity = verify_integrity
443 self.copy = copy
--> 445 objs, keys = self._clean_keys_and_objs(objs, keys)
447 # figure out what our result ndim is going to be
448 ndims = self._get_ndims(objs)
File ~/miniconda/envs/genomics_env/lib/python3.10/site-packages/pandas/core/reshape/concat.py:507, in _Concatenator._clean_keys_and_objs(self, objs, keys)
504 objs_list = list(objs)
506 if len(objs_list) == 0:
--> 507 raise ValueError("No objects to concatenate")
509 if keys is None:
510 objs_list = list(com.not_none(*objs_list))
ValueError: No objects to concatenate
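For context, this error is simply what pandas raises whenever `concat` receives an empty list, regardless of where that list came from. A minimal demonstration:

```python
import pandas as pd

# pd.concat with an empty list always raises this ValueError,
# so the question is why processed_chunks ends up empty.
try:
    pd.concat([], axis=1)
except ValueError as e:
    print(e)  # -> No objects to concatenate
```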
After looking through the code a bit, I found this piece:
gene_ids_in_dict = [
gene for gene in data.var.ensembl_id if gene in gene_token_dict.keys()
]
if collapse_gene_ids is False:
if len(gene_ids_in_dict) == len(set(gene_ids_in_dict)):
return data
else:
raise ValueError("Error: data Ensembl IDs non-unique.")
# Check for when if collapse_gene_ids is True
gene_ids_collapsed = [
gene_mapping_dict.get(gene_id.upper()) for gene_id in data.var.ensembl_id
]
gene_ids_collapsed_in_dict = [
gene for gene in gene_ids_collapsed if gene in gene_token_dict.keys()
]
if len(set(gene_ids_in_dict)) == len(set(gene_ids_collapsed_in_dict)):
data.var["ensembl_id_collapsed"] = data.var.ensembl_id.map(gene_mapping_dict)
return data
(https://huggingface.co/ctheodoris/Geneformer/blob/3a6866963d20ff5ce7c82d2c4544cb2b304017e9/geneformer/tokenizer.py#L209-L228)
I saw in the ensembl_mapping_dict_gc95M.json dictionary that some Ensembl IDs map to other Ensembl IDs rather than to themselves. For example:
ENSG00000274734 : ENSG00000285077
whereas most map to themselves, e.g.:
ENSG00000004059 : ENSG00000004059
What happens when you have a dataset where every Ensembl ID maps to itself except one? The genes will be present in both gene_ids_in_dict and gene_ids_collapsed_in_dict, with the single exception of the one that doesn't map to itself. With 100 genes, gene_ids_in_dict will have 99 elements and gene_ids_collapsed_in_dict will have 100 (because ENSG00000274734 is not in the token dictionary but ENSG00000285077 is). The set sizes now differ, yet there are no duplicate elements. So the early return is skipped and the code falls through to the collapsing branch, which assumes at least one duplicated gene; since there are none, processed_chunks stays empty and pd.concat raises the error above.
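To make that concrete, here is a small self-contained sketch of the length check, using toy dictionaries (not the real gc95M mapping/token dictionaries):

```python
# Toy stand-ins for gene_token_dict and gene_mapping_dict.
gene_token_dict = {f"G{i}": i for i in range(99)}  # tokens for G0..G98
gene_token_dict["G_NEW"] = 99                      # the collapsed target IS tokenized

# Every ID maps to itself except G99, which maps to G_NEW.
gene_mapping_dict = {f"G{i}": f"G{i}" for i in range(100)}
gene_mapping_dict["G99"] = "G_NEW"

ensembl_ids = [f"G{i}" for i in range(100)]  # 100 unique IDs, no duplicates

gene_ids_in_dict = [g for g in ensembl_ids if g in gene_token_dict]
gene_ids_collapsed = [gene_mapping_dict.get(g) for g in ensembl_ids]
gene_ids_collapsed_in_dict = [g for g in gene_ids_collapsed if g in gene_token_dict]

print(len(set(gene_ids_in_dict)))            # 99: G99 itself has no token
print(len(set(gene_ids_collapsed_in_dict)))  # 100: G_NEW does
# The set sizes differ, so the early return is skipped -- yet there are
# no duplicate IDs to collapse, so the dedup loop appends nothing and
# pd.concat later receives an empty list.
```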
Am I right?
Thank you so much for pointing out this case! We just merged a change to resolve this.