CellDiffusion: a generative model to annotate single-cell and spatial RNA-seq using bulk references

We introduce the first manuscript of our PhD student Xiaochen Zhang!

CellDiffusion bridges the gap between single-cell/spatial and bulk RNA-seq, using a diffusion-based generative model to create virtual cells that mimic the gene expression profiles of real observed cells.

CellDiffusion brings:
1 – Integration of multiple data resolutions: unifying bulk, single-cell and spatial RNA-seq to boost cell type annotation accuracy.
2 – Single-cell data augmentation: producing realistic virtual cells to enhance signals from rare and transitional cell states that tend to be overlooked in the original sparse data.
3 – Deeper biological insights: uncovering immune subtypes, tissue-specific markers and tumour microenvironments missed by standard tools.

CellDiffusion shows how generative AI can enrich (not replace) biological discovery, leveraging decades of bulk data to power single-cell studies.

Reference

CellDiffusion: a generative model to annotate single-cell and spatial RNA-seq using bulk references
Xiaochen Zhang, Jiadong Mao, Kim-Anh Lê Cao
Abstract

Annotating single-cell and spatial RNA-seq data can be greatly enhanced by leveraging bulk RNA-seq, which remains a cost-effective and well-established benchmark for characterising transcriptional activity in immune cell populations. However, a major technical hurdle lies in the contrasting properties of these data types: single-cell and spatial data are inherently sparse due to their cell-level sampling scheme, leading to much lower sequencing depth compared to bulk RNA-seq.

We developed CellDiffusion, a generative machine learning (ML) tool that bridges this gap. CellDiffusion generates realistic virtual cells to augment the sparse single-cell and spatial data, improving signals and the representation of rare cell types. The augmented data are more comparable to bulk references, increasing the accuracy of cell type annotation using bulk references and automated ML classifiers.

We benchmarked CellDiffusion on single-cell and spatial datasets from human peripheral blood samples, white adipose tissues, and breast tumours. Our method significantly outperforms state-of-the-art methods such as SingleR, Seurat, and scVI. In addition, CellDiffusion provides critical biological insights, including the identification of novel cell subtypes and their function during cell state transition; the discovery of new marker genes for tissue-resident immune cells, revealing their functional shifts in myeloid populations; and the accurate characterisation of cell subtypes in spatial transcriptomics to decipher the tumour microenvironment.
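
To make the sparsity hurdle described in the abstract more concrete, here is a small toy simulation in Python. It is not CellDiffusion and does not use its diffusion model: it only illustrates, under simplifying assumptions, why cells with low sequencing depth are harder to match to bulk references. The bulk profiles, the multinomial sampling of cells, the depths, and the correlation-based `annotate` function are all hypothetical choices made for this sketch.

```python
import numpy as np

def simulate_cells(bulk_ref, labels, depth, rng):
    """Draw `depth` counts per cell from the bulk profile of its cell type."""
    probs = bulk_ref / bulk_ref.sum(axis=1, keepdims=True)
    cells = [rng.multinomial(depth, probs[t]) for t in labels]
    return np.vstack(cells).astype(float)

def annotate(profiles, reference):
    """Label each profile with the reference row of highest Pearson correlation."""
    p = profiles - profiles.mean(axis=1, keepdims=True)
    r = reference - reference.mean(axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    r = r / np.linalg.norm(r, axis=1, keepdims=True)
    return (p @ r.T).argmax(axis=1)

rng = np.random.default_rng(0)
n_genes, n_types, cells_per_type = 500, 3, 100

# Hypothetical bulk references: one deeply sequenced profile per cell type.
bulk_ref = rng.gamma(shape=2.0, scale=1.0, size=(n_types, n_genes))
labels = np.repeat(np.arange(n_types), cells_per_type)

# Compare very sparse cells with progressively deeper ones.
results = {}
for depth in (10, 100, 2000):
    cells = simulate_cells(bulk_ref, labels, depth, rng)
    zero_fraction = float((cells == 0).mean())
    accuracy = float((annotate(cells, bulk_ref) == labels).mean())
    results[depth] = (zero_fraction, accuracy)
    print(f"depth={depth:5d}  zeros={zero_fraction:.2f}  accuracy={accuracy:.2f}")
```

In this toy setting, the shallowest cells are almost entirely zeros and are the hardest to assign to the right bulk profile. Real data are far more complex than this simulation, and the gap is closed here simply by sampling more counts, whereas CellDiffusion addresses it by generating virtual cells.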