Semisynthetic Simulation for Microbiome Data Analysis

Dr Kris Sankaran (University of Wisconsin) was one of our Belz recipients from the School of Maths & Stats and came for a 6-week visit to our lab. We worked on extending scDesign3 to simulate realistic microbiome data in several analytical scenarios, and tested our tutorials and resources in 3 face-to-face workshops hosted by the School.

  • Our framework starts from existing data, which we use as a template for realistic simulations
  • We illustrate different scenarios, including power calculation for univariate and multivariate analysis, network inference benchmarking, batch correction, and omics data integration
  • Our online tutorial is available at https://go.wisc.edu/8994yz

Semisynthetic Simulation for Microbiome Data Analysis. Kris Sankaran, Saritha Kodikara, Jingyi Jessica Li, Kim-Anh Lê Cao (2025).

 

Abstract. High-throughput sequencing data lie at the heart of modern microbiome research. Effective analysis of these data requires careful preprocessing, modeling, and interpretation to detect subtle signals and avoid spurious associations. In this review, we discuss how simulation can serve as a sandbox to test candidate approaches, creating a setting that mimics real data while providing ground truth. This is particularly valuable for power analysis, methods benchmarking, and reliability analysis. We explain the probability, multivariate analysis, and regression concepts behind modern simulators and how different implementations make trade-offs between generality, faithfulness, and controllability. Recognizing that all simulators only approximate reality, we review methods to evaluate how accurately they reflect key properties. We also present case studies demonstrating the value of simulation in differential abundance testing, dimensionality reduction, network analysis, and data integration. Code for these examples is available in an online tutorial (https://go.wisc.edu/8994yz) that can be easily adapted to new problem settings.
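To give a flavour of what "simulation as a sandbox" can look like for power analysis, here is a minimal Python sketch. It is not the scDesign3 workflow and is not taken from the online tutorial. It assumes a single taxon whose counts are well described by a negative binomial distribution, fits that distribution to a template by the method of moments, injects a known fold change, and counts how often a simple test detects it. The function names, the stand-in template, and the normal-approximation test on log-transformed counts are all illustrative assumptions.

```python
import math
import numpy as np


def fit_negbin_moments(counts):
    """Method-of-moments negative binomial fit; returns (mean, size)."""
    mean = counts.mean()
    var = counts.var(ddof=1)
    # The size (dispersion) parameter is finite only for overdispersed data;
    # otherwise fall back to a very large size, which is close to Poisson.
    size = mean**2 / (var - mean) if var > mean else 1e6
    return mean, size


def simulate_negbin(rng, mean, size, n):
    """Draw n counts with the given mean and size."""
    # NumPy parametrizes by (size, p), where p = size / (size + mean).
    p = size / (size + mean)
    return rng.negative_binomial(size, p, n)


def approx_pvalue(x, y):
    """Two-sided p-value for a difference in mean log1p counts.

    Uses a Welch-type statistic with a normal approximation (an assumption
    made to keep the sketch dependency-free).
    """
    x, y = np.log1p(x), np.log1p(y)
    se = math.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    if se == 0:
        return 1.0
    z = (x.mean() - y.mean()) / se
    return math.erfc(abs(z) / math.sqrt(2))


def estimate_power(template, fold_change, n_per_group,
                   n_sim=500, alpha=0.05, seed=0):
    """Fraction of simulated studies in which the known effect is detected."""
    rng = np.random.default_rng(seed)
    mean, size = fit_negbin_moments(template)
    rejections = 0
    for _ in range(n_sim):
        control = simulate_negbin(rng, mean, size, n_per_group)
        treated = simulate_negbin(rng, mean * fold_change, size, n_per_group)
        rejections += approx_pvalue(control, treated) < alpha
    return rejections / n_sim


# Hypothetical template: stand-in counts for one taxon. In practice this
# would be a column of a real abundance table.
template = np.random.default_rng(1).negative_binomial(2, 0.05, 200)

power_null = estimate_power(template, fold_change=1.0, n_per_group=30)
power_alt = estimate_power(template, fold_change=3.0, n_per_group=30)
print(f"rejection rate with no effect: {power_null:.2f}")
print(f"power at 3-fold change:        {power_alt:.2f}")
```

Because the effect is planted by the simulator, the rejection rate under `fold_change=1.0` estimates the false positive rate and the rate under a real fold change estimates power. Simulators of the kind reviewed in the paper go further than this sketch, for example by modelling covariates and dependence between taxa.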