Schedule for: 25w5324 - Novel Statistical Approaches for Studying Multi-omics Data

Beginning on Sunday, July 13 and ending Friday, July 18, 2025

All times in Banff, Alberta time, MDT (UTC-6).

Sunday, July 13
16:00 - 17:30 Check-in begins at 16:00 on Sunday and is open 24 hours (Front Desk - Professional Development Centre)
17:30 - 19:30 Dinner
A buffet dinner is served daily between 5:30pm and 7:30pm in Vistas Dining Room, top floor of the Sally Borden Building.
(Vistas Dining Room)
20:00 - 22:00 Informal gathering (TCPL Foyer)
Monday, July 14
07:00 - 08:45 Breakfast
Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
08:45 - 08:55 Introduction and Welcome by BIRS Staff
A brief introduction to BIRS with important logistical information, technology instruction, and opportunity for participants to ask questions.
(TCPL 201)
08:55 - 09:00 Tian Ge: Introduction by Session Leader (TCPL 201)
09:00 - 09:40 Nilanjan Chatterjee: Minicourse: Emerging Biobanks and Opportunities for Integrative Analysis
Recent advances in genomic technologies and the proliferation of large-scale, ancestrally diverse biobanks have dramatically accelerated the discovery of genetic determinants underlying complex traits and diseases. These rich resources offer unparalleled opportunities to enhance the development of polygenic risk scores (PGS), construct integrative predictive models combining genetic, environmental, and biomarker data, and rigorously assess causal relationships using Mendelian randomization methodologies. In this session, I will review opportunities and challenges for the analysis of biobank data and present recent methodological advancements from my research focused on integrative analysis within individual biobanks and across multiple biobanks and genome-wide association study (GWAS) summary statistics. Furthermore, I will discuss the unique potential of family-based biobank studies, highlighting how these studies can effectively address confounding due to population stratification, assortative mating, and indirect genetic effects.
(TCPL 201)
09:40 - 10:05 Rui Duan: Unsupervised Integration of Pre-trained Models for Genetic Risk Prediction
The increasing availability of pretrained models, particularly in genetic risk prediction, offers new opportunities to accelerate real-world applications without requiring extensive local training or labeled outcomes. However, effectively using these models in new target populations remains a major challenge due to limited generalizability, data heterogeneity, and the absence of observed phenotypes. We present a general unsupervised ensemble framework that combines multiple pretrained models without needing outcome data from the target population. The framework relies on prediction concordance and incorporates methods to handle variability across models and individuals, as well as the presence of misleading or low-quality models. Both theoretical analysis and empirical evaluations demonstrate that the framework delivers robust and scalable performance across diverse populations.
(TCPL 201)
10:05 - 10:30 Coffee Break (TCPL Foyer)
10:30 - 10:55 Rachel Kember: Transdiagnostic and Disorder-Level GWAS Enhance Precision of Substance Use and Psychiatric Genetic Risk Profiles in African and European Ancestries
Substance use disorders (SUDs) commonly co-occur with mood, anxiety, and psychotic disorders. We used genomic structural equation modeling (gSEM) and GWAS-by-subtraction to examine transdiagnostic and disorder-specific genetic risk across substance use, psychotic, mood, and anxiety disorders in African- (AFR) and European- (EUR) ancestry individuals. In AFR individuals, transdiagnostic genetic factors represented SUDs and psychiatric disorders. In EUR individuals, genetic factors represented SUDs, psychotic disorders, and mood/anxiety disorders. Second-order factor models showed phenotypic and genotypic associations with a broad range of physical and mental health traits. Genetic correlations and phenome-wide association studies (PheWAS) highlighted how common and independent genetic factors for SUD and psychotic disorders were differentially associated with psychiatric, sociodemographic, and medical phenotypes. For example, the component of genetic risk for tobacco use disorder (TUD) that operates through the SUD factor was associated with other psychiatric disorders, whereas the component that is independent of the SUD factor was associated with cardiometabolic traits. Combining transdiagnostic and disorder-level genetic approaches can improve our understanding of co-occurring conditions and increase the specificity of genetic discovery, which is critical for identifying more effective prevention and treatment strategies to reduce the burden of these disorders.
(TCPL 201)
10:55 - 11:20 Andy Dahl: Gene-environment interaction effects partly depend on the phenotype measurement scale
The tremendous sample sizes of modern GWAS have led to many discoveries of statistical interactions between genetic and environmental variables (GxE), the majority of which lack clear evidence of biological significance. We hypothesize that many published GxE can be eliminated by transforming the phenotype measurement scale, a well-known problem in statistics without established solutions. Here, we systematically study how two polygenic GxE tests depend on phenotype scale as defined by power transformations: a variance component model (GxE heritability) and a polygenic score model (PGSxE). Simulations confirm that our approach reliably differentiates scale-dependent and -independent interactions when the true simulated scale is a power transformation. We prove a form of GxE is scale-independent, which requires the sign of G's effect to depend on E. In UK Biobank, we surprisingly find GxSex interactions on height (p=0.0004 for GxE heritability, p=2e-35 for PGSxE), but our power transformation approach finds that the log transformation eliminates this interaction (p=0.62, 0.39). We also find GxSex effects on testosterone that, in contrast with height, cannot be entirely eliminated on the log scale; intriguingly, female-specific effects on testosterone become detectable on the log scale, demonstrating how choice of scale in GWAS implicitly prioritizes certain individuals. Finally, across 7 environments and 45 complex traits, we find that a phenotype's coefficient of variation strongly predicts its GxE scale dependence. While our results do not undermine GxE effects on principled measurement scales, they do demonstrate that default phenotype scales are liable to pervasive false positive GxE and provide a framework to learn an optimal phenotype scale.
(TCPL 201)
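The scale dependence described in the abstract above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the speaker's method: the genotype dosage, binary environment, effect sizes, and the helper `interaction_coefficient` are all hypothetical. The phenotype is additive on the log scale, so an ordinary least-squares fit shows a G-by-E term on the raw scale that vanishes after the log transformation.

```python
import numpy as np

def interaction_coefficient(y, g, e):
    """Return the OLS coefficient of the G*E term in y ~ 1 + G + E + G*E."""
    design = np.column_stack([np.ones_like(g, dtype=float), g, e, g * e])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[3]

rng = np.random.default_rng(0)
n = 20000
g = rng.binomial(2, 0.3, size=n).astype(float)  # hypothetical genotype dosage (0, 1, 2)
e = rng.binomial(1, 0.5, size=n).astype(float)  # hypothetical binary environment

# Purely additive effects on the log scale (no true G*E term there).
log_y = 1.0 + 0.2 * g + 0.5 * e + rng.normal(0.0, 0.1, size=n)
y = np.exp(log_y)  # the same phenotype measured on the raw scale

gxe_raw = interaction_coefficient(y, g, e)      # clearly non-zero
gxe_log = interaction_coefficient(log_y, g, e)  # close to zero

print(f"raw-scale GxE coefficient: {gxe_raw:.3f}")
print(f"log-scale GxE coefficient: {gxe_log:.3f}")
```

The raw-scale interaction here arises entirely from the exponential link, which is the kind of scale-dependent signal the abstract argues a transformation can remove.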
11:20 - 11:45 Xuanyao Liu: Improving GWAS Functional Interpretation with CACTI: A High-Power cQTL Mapping Framework
Mapping chromatin quantitative trait loci (cQTLs) is crucial for elucidating the regulatory mechanisms governing gene expression and complex traits. However, current cQTL mapping methods suffer from limited detection power, particularly at existing sample sizes, and are constrained by peak-calling accuracy. To address these limitations, we present CACTI, a novel method that improves cQTL mapping by leveraging correlations between neighboring regulatory elements. Across diverse histone marks (H3K4me1, H3K4me3, H3K27ac, H3K27me3 and H3K36me3) and cell types, CACTI identifies 51–255% more cQTL signals compared to conventional single-peak–based approaches. Using CACTI, we generate a comprehensive cQTL map for the five histone marks across multiple cell types and perform colocalization analyses with GWAS loci from 44 complex traits. Our results show that CACTI-identified cQTLs explain 6–47% of GWAS loci, which represents 60–535% more colocalization than single-peak–based cQTLs. Of the colocalized GWAS loci, 20–75% show no colocalization with eQTLs. This underscores CACTI’s unique ability to uncover regulatory mechanisms that would otherwise remain undetected by eQTL analysis alone, significantly improving the functional interpretation of GWAS findings.
(TCPL 201)
11:45 - 13:30 Lunch
Lunch is served daily between 11:30am and 1:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
13:40 - 13:45 Sarah Gagliano Taliun: Introduction by Session Leader (TCPL 201)
13:45 - 14:10 Wei Zhou: Efficient computational methods for genetic association studies of disease progression in large biobanks
Genome-wide association studies (GWAS) have been highly successful in identifying genomic loci associated with complex human diseases. However, most GWAS rely on case-control designs based on disease status at a single time point - typically diagnosis - limiting their ability to distinguish genetic factors that influence disease onset from those that affect progression over time. The emergence of large-scale biobanks worldwide, which link electronic health records (EHR) to genomic data for hundreds of thousands of individuals, provides unprecedented opportunities to investigate the genetics of disease trajectories using time-to-event (TTE) phenotypes. Yet, the scale and complexity of these datasets present analytical challenges for TTE-based GWAS. I will present the development of computational methods designed to enable efficient and scalable analysis of TTE phenotypes in biobank-scale data. I will also demonstrate their application within the Global Biobank Meta-analysis Initiative to uncover genetic risk factors of disease onset and progression across various populations.
(TCPL 201)
14:10 - 14:35 Jessica Dennis: Novel approaches to study DNA methylation as a marker of gene-environment interplay in population-based biobanks
DNA methylation (DNAm) is essential to the function of the genome and helps shape the development of many body systems as individuals age. However, it is more challenging to define the relationship between DNAm and disease risk, in part because of the dynamic patterns of DNAm in response to genotype and environmental exposures over the life course. In this talk, I will showcase two different strategies for detecting the effects of gene-environment interplay on DNAm in population-based studies. In the first approach, we investigated the ways in which participants’ genetic sensitivity to the environment moderated the influence of time-varying exposure to childhood maltreatment on DNAm from whole blood in late adolescence. We incorporated interaction terms in the structured life course modeling approach (SLCMA), a framework that systematically compares multiple theoretical models of time-varying exposure-outcome relationships. In the second approach, we detected variance methylation quantitative trait loci (var-mQTL) in umbilical cord blood, and related these to hundreds of prenatal exposures. Var-mQTL compare DNAm variance (as opposed to the mean) across genotype groups, and are uniquely suited to capturing gene-environment interplay. I will conclude by discussing the relative merits of each approach, and opportunities for application in other population-based biobanks.
(TCPL 201)
14:35 - 15:00 Qingling Duan: Integrative multi-omics analyses to uncover mechanisms of asthma and atopy in the CHILD Cohort Study
Early-life exposures such as human milk are known to impact the long-term health of children; however, the underlying biological mechanisms remain poorly understood. My ongoing research investigates the complex and dynamic interactions among mothers, breast milk and infants from the CHILD Cohort Study to determine how variations in each element influence the others. First, we identified maternal genomics and environmental factors (main and interaction effects) associated with inter-individual variations in bioactive components of breast milk (e.g., oligosaccharides, fatty acids and microbiota). Then, we observed that exposure to variable human milk components modulates risk of asthma and atopy in children differently depending on their polygenic risk, potentially through effects on the gut microbiota of the infants. Moreover, machine-learning models integrating maternal milk components with infant genomics and gut microbiota predicted respiratory and atopic outcomes during childhood. Finally, Mendelian randomization analysis suggests a causal link between exposure to variable human milk components and risk of childhood asthma and atopy. Our findings suggest potential mechanisms by which early life exposures such as human milk may impact childhood health, which could facilitate the development of personalized risk reduction profiles and intervention strategies.
(TCPL 201)
15:00 - 15:30 Coffee Break (TCPL Foyer)
15:30 - 15:55 Osvaldo Espin-Garcia: Leveraging the All of Us Research Platform for Genomic Analysis of Accelerometer-defined Physical Activity Levels
Osteoarthritis (OA) is a chronic disabling disease affecting around 500 million people worldwide. There is currently no cure for OA, and intervention strategies that mitigate disease burden are needed. Evidence suggests that increased levels of physical activity (PA) have beneficial effects for OA, and PA is often prescribed as a first-line treatment. Furthermore, it has been hypothesized that genetics may inform PA patterns and guide PA interventions. Thus, identifying genetic loci associated with PA patterns in OA populations (or those at risk of developing OA) can aid in customizing potential PA interventions. Traditionally, PA has been investigated using self-reported information, which may be affected by recall, perception, or desirability biases. To circumvent these issues, wearable devices such as accelerometers are increasingly used. Meanwhile, genome-wide association studies (GWAS) have allowed investigators to identify loci associated with a variety of traits. These investigations have particularly benefited from well-characterized large-scale studies. In this talk, I will describe my recent experience leveraging data from the All of Us Research Program, an initiative aiming to make advances in tailoring medical care to the individual's genetics. The main objective of this ongoing work is to identify associations between genomic data and accelerometer-defined PA levels.
(TCPL 201)
15:55 - 16:20 Linda Kachuri: Characterizing the genetic basis of protein biomarkers to improve cancer detection
Circulating proteins can provide a dynamic readout of health status and serve as indicators of disease susceptibility. Many plasma proteins are heritable, with large-scale genomic studies uncovering thousands of protein quantitative trait loci (pQTL). I will discuss two avenues for leveraging proteogenomic data: (1) genetic correction of non-causal disease biomarkers and (2) prioritization of etiologically relevant proteins and putative drug targets. Genetic sources of variation in non-causal biomarkers introduce noise that makes observed values less informative for early detection. Using prostate-specific antigen (PSA) as an example, I will show how polygenic scores (PGS) can be used as instruments for recalibrating individual PSA levels and improving the specificity of PSA as a cancer screening tool. Next, I will discuss how integration of transcriptome-wide association studies (TWAS) with pQTL-based analyses can strengthen causal inference and identify proteins implicated in glioma susceptibility.
(TCPL 201)
16:20 - 16:45 Daniel Taliun: Building and applying custom reference panels tailored to population-based biobanks
Genotype imputation is a well-established and cost-effective method for increasing genome coverage and statistical power in genome-wide association studies (GWAS). Today, researchers have at their disposal a series of large haplotype reference panels comprising hundreds of thousands of sequenced whole genomes, such as the NHLBI’s TOPMed or the UK Biobank, to impute their genotyping array or low-depth genome sequencing data. Although large, these reference panels may not be entirely representative of the study population, resulting in low imputation quality for population-specific genetic variants and a loss of statistical power. Using the CARTaGENE population-based biobank (N samples ~30,000) in Quebec, Canada, and whole-genome sequencing information from trios, we evaluate methods for creating a population-specific panel that complements existing ones and apply them to genome-wide association scans in 42 continuous traits. The population-specific panel enabled the imputation of ~180,000 additional variants per genome compared to TOPMed, enriching the dataset with population-specific variants. It resulted in more than 70 additional genome-wide statistically significant associations. We also showcase the construction and application of a population-specific HLA-imputation panel, which enabled us to quantify the pleiotropic effects of HLA-B*08:01 on thyroxine and blood cell levels, providing valuable insights into the molecular mechanisms of thyroid disease.
(TCPL 201)
17:30 - 19:30 Dinner
A buffet dinner is served daily between 5:30pm and 7:30pm in Vistas Dining Room, top floor of the Sally Borden Building.
(Vistas Dining Room)
Tuesday, July 15
07:00 - 08:45 Breakfast
Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
08:55 - 09:00 Jingjing Yang: Introduction by Session Leader (TCPL 201)
09:00 - 09:40 Michael Epstein: Minicourse: Identifying condition-related cell-cell communication events using supervised tensor analysis.
Numerous tools have been developed to infer active cell-cell communication (CCC) events, which are essential for understanding biological processes and diseases. However, existing downstream methods for assessing the relationships between CCC events and biological conditions lack clear interpretation, fail to adjust for confounders, and ignore dependencies among CCC events. To address these limitations, we introduce STACCato, a Supervised Tensor Analysis tool designed to identify Condition-related Cell-cell communication events. STACCato employs a tensor-based regression model to enable statistical inference related to the relationships between biological conditions (e.g., disease status, tissue types) and specific CCC events, while adjusting for confounders and CCC dependencies. Through extensive simulations and real-world applications on scRNA-seq datasets of lupus and autism, we demonstrate that STACCato consistently provides improved inference of condition-related CCC events compared to alternative methods. This is joint work with Drs. Qile Dai and Jingjing Yang.
(TCPL 201)
09:40 - 10:05 Yun Li: Investigating spatial omics data with StarTrail and STimage-1K4M
Spatial omics technologies revolutionize studies of tissue functions. However, existing methods fail to capture localized, sharp changes characteristic of critical events such as tumor development. We present StarTrail, a gradient-based method that powerfully defines rapidly changing regions and detects “cliff genes”, genes exhibiting drastic expression changes at highly localized or disjoint boundaries. StarTrail, filling important gaps in current literature, enables deeper insights into tissue spatial architecture. We also introduce STimage-1K4M, a comprehensive dataset designed to bridge this gap by providing transcriptomic features for sub-tile images. STimage-1K4M contains 1,149 images and 4,293,195 pairs of sub-tile images and gene expressions. STimage-1K4M offers unprecedented granularity, paving the way for a wide range of advanced research in multi-modal data analysis.
(TCPL 201)
10:05 - 10:30 Coffee Break (TCPL Foyer)
10:30 - 10:55 Jingshu Wang: Causal mediation analysis for time-varying heritable risk factors with Mendelian Randomization.
Understanding the causal mechanisms of diseases is crucial in clinical research. When randomized experiments are unavailable, Mendelian Randomization (MR) leverages genetic mutations to mitigate confounding. However, most MR analyses assume static risk factors, oversimplifying dynamic risk factor effects. The framework of life-course MR addresses this, but struggles with limited GWAS cohort sizes and correlations across time points. We propose FLOW-MR, a computational approach that estimates causal structural equations for temporally ordered traits using only GWAS summary statistics at multiple time points. FLOW-MR enables inference on direct, indirect, and path-wise causal effects, demonstrating superior efficiency and reliability, especially with noisy data. I'll also present the application of FLOW-MR to uncover a childhood-specific protective effect of BMI on breast cancer and to understand the evolving impacts of BMI, systolic blood pressure, and cholesterol on stroke risk.
(TCPL 201)
10:55 - 11:20 Nicholas Mancuso: Efficient count-based models improve power and robustness for large-scale single-cell eQTL mapping.
Population-scale single-cell transcriptomic technologies (scRNA-seq) enable characterizing variant effects on gene regulation at the cellular level (e.g., single-cell eQTLs; sc-eQTLs). However, existing sc-eQTL mapping approaches are either not designed for analyzing sparse counts in scRNA-seq data or can become intractable in extremely large datasets. Here, we propose jaxQTL, a flexible and efficient sc-eQTL mapping framework using highly efficient count-based models given pseudobulk data. Using extensive simulations, we demonstrated that jaxQTL with a negative binomial model outperformed other models in identifying sc-eQTLs, while maintaining a calibrated type I error. We applied jaxQTL across 14 cell types of OneK1K scRNA-seq data (N=982), and identified 11-16% more eGenes compared with existing approaches, primarily driven by jaxQTL's ability to identify lowly expressed eGenes. We observed that fine-mapped sc-eQTLs were further from the transcription start site (TSS) than fine-mapped eQTLs identified in all cells (bulk-eQTLs; P=1×10^-4) and more enriched in cell-type-specific enhancers (P=3×10^-10), suggesting that sc-eQTLs improve our ability to identify distal eQTLs that are missed in bulk tissues. Overall, the genetic effects of fine-mapped sc-eQTLs were largely shared across cell types, with cell-type-specificity increasing with distance to TSS. Lastly, we observed that sc-eQTLs explain more SNP-heritability (h^2) than bulk-eQTLs (9.90 ± 0.88% vs. 6.10 ± 0.76% when meta-analyzed across 16 blood and immune-related traits), improving but not closing the missing link between GWAS and eQTLs. As an example, we highlight that sc-eQTLs in T cells (unlike bulk-eQTLs) can successfully nominate IL6ST as a candidate gene for rheumatoid arthritis. Overall, jaxQTL provides an efficient and powerful approach using count-based models to identify missing disease-associated eQTLs.
(TCPL 201)
11:20 - 11:45 Qiongshi Lu: The blurry line between genes and environments: Insights from GWAS of family members’ phenotypes
Genome-wide association study (GWAS) methodologies have become quite standard for complex trait genetic research. Today, a modern GWAS typically correlates a phenotype with tens of millions of genetic variants in large cohorts of millions of individuals to reveal genotype-phenotype associations. However, this seemingly standard approach can give largely biased and/or confounded results in various applications. In this talk, I will discuss a new study design which associates genetic data of a cohort with their family members’ phenotypes. That is, the genotypic and phenotypic variables in the GWAS are collected from different individuals. Through three separate applications, focusing on offspring, parental, and spousal phenotypes, I will discuss several challenges and new insights in genetic nurture, ascertainment bias, and assortative mating. The phenotypes discussed in this talk will include socioeconomic outcomes, neurodegenerative disease risk, as well as human partner choice.
(TCPL 201)
11:45 - 12:00 Group Photo
Meet in foyer of TCPL to participate in the BIRS group photo. The photograph will be taken outdoors, so dress appropriately for the weather. Please don't be late, or you might not be in the official group photo!
(TCPL Foyer)
12:00 - 13:30 Lunch
Lunch is served daily between 11:30am and 1:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
13:40 - 13:45 Xiang Zhou: Introduction by Session Leader (TCPL 201)
13:45 - 14:10 Mengjie Chen: SEDA: A Sequence-Based Deep Learning Framework for RNA Epigenomic Data Analysis
We have developed SEDA, an explainable deep learning framework that leverages sequence information for RNA-omics data preprocessing. SEDA’s dual-branch architecture disentangles true biological signals from experimental noise by jointly optimizing signal shape (via Kullback-Leibler divergence) and intensity (via mean squared error). Using ARTR-seq data for RBFOX2 as a case study, SEDA achieved a median correlation of 0.68 on a held-out test set when predicting raw signal profiles. Feature attribution with Integrated Gradients revealed biologically meaningful sequence features—such as the TGCAGT motif—driving RBFOX2 binding, highlighting the model’s interpretability and analytical power. By replacing traditional preprocessing steps like peak calling, normalization, and batch effect correction with a streamlined, data-driven approach, SEDA offers a promising alternative for RNA epigenomics analysis.
(TCPL 201)
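The abstract above mentions jointly optimizing signal shape via Kullback-Leibler divergence and intensity via mean squared error. The sketch below shows one generic way such a two-part objective could be written; it is an assumption for illustration, not SEDA's implementation. The function name, the `weight` and `eps` parameters, and the choice to compare log total counts are all hypothetical.

```python
import numpy as np

def shape_and_intensity_loss(observed, predicted, weight=1.0, eps=1e-8):
    """Combine a KL term on normalized profiles with a squared error on log totals.

    Hypothetical illustration: 'shape' is the profile rescaled to sum to one,
    'intensity' is the log of the total signal in the window.
    """
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)

    # Shape term: KL(p || q) between the normalized observed and predicted profiles.
    p = (observed + eps) / (observed + eps).sum()
    q = (predicted + eps) / (predicted + eps).sum()
    kl = float(np.sum(p * np.log(p / q)))

    # Intensity term: squared difference of log(1 + total signal).
    intensity_error = float((np.log1p(observed.sum()) - np.log1p(predicted.sum())) ** 2)

    return kl + weight * intensity_error, kl, intensity_error

observed = [1.0, 4.0, 2.0, 0.0]
same_shape_double_height = [2.0, 8.0, 4.0, 0.0]
different_shape = [4.0, 1.0, 0.0, 2.0]

total_a, kl_a, mse_a = shape_and_intensity_loss(observed, same_shape_double_height)
total_b, kl_b, mse_b = shape_and_intensity_loss(observed, different_shape)
print(f"same shape, doubled height: KL={kl_a:.4f}, intensity error={mse_a:.4f}")
print(f"different shape, same total: KL={kl_b:.4f}, intensity error={mse_b:.4f}")
```

Separating the two terms means a prediction with the right profile but the wrong scale is penalized only through the intensity term, and vice versa.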
14:10 - 14:35 Wei Sun: Integrating scRNA-seq and scATAC-seq to infer gene regulatory network
Recent developments in gene knockout (KO) studies in human cells, such as those conducted by The Impact of Genomic Variation on Function (IGVF) Consortium and the Molecular Phenotypes of Null Alleles in Cells (MorPhiC) Consortium, promise large-scale perturbation studies. These studies utilize molecular read-outs, including single-cell RNA sequencing (scRNA-seq) and ATAC sequencing (bulk or single-cell ATAC-seq) data. As members of the MorPhiC consortium, we propose to infer gene regulatory networks by integrating scRNA-seq and ATAC-seq data from both wild-type and various KO genotypes. I will present our work in two directions. One is to mainly use gene expression data to construct Directed Acyclic Graphs (DAGs) while exploiting wild-type and KO samples. The other is to combine scRNA-seq and ATAC-seq to infer transcription factor activities and use such inferred activities to construct gene regulatory networks.
(TCPL 201)
14:35 - 15:00 Hongkai Ji: An analytical framework for high-plex multimodal epigenome profiling data
We developed Hi-Plex CUT&Tag, a technology for simultaneous profiling of genome-wide binding locations of epigenetic modulators, transcription factors, and chromatin-associated proteins. Hi-Plex CUT&Tag enables robust detection of protein co-localization events across over 600 pairs of proteins using 36 barcoded monoclonal antibodies – all within a single experiment. An analytical framework is developed to comprehensively detect protein colocalizations, analyze their combinatorial patterns, and model their relationship with gene expression. Using this framework, we identified numerous novel bivalent histone modification events, epigenetic-context-dependent transcriptional regulation, and specific chromatin mark combinations with significant impacts on gene regulation. Furthermore, single-cell Hi-Plex CUT&Tag enables the analysis of synergistic interactions between chromatin mark pairs at the single-cell level, providing unprecedented resolution for studying chromatin dynamics.
(TCPL 201)
15:00 - 15:30 Coffee Break (TCPL Foyer)
15:30 - 15:55 Claus Ekstrom: Data-driven causal pathway discovery for multi-omics data
Multi-omics analyses offer exciting opportunities to explore the complex molecular architecture of biological systems. Identifying multi-omics pathways may enable us to better understand causal omics-related risk factors and pleiotropic processes. However, most current analytical approaches are confirmatory in nature, relying on predefined models or known pathways, which limits their capacity to uncover novel biological mechanisms. I propose an exploratory, data-driven framework that integrates causal discovery algorithms with multi-omics data to infer hidden biological pathways without prior model specification. Building on the principles of constraint-based causal inference, we extend the PC algorithm to accommodate the layered and often temporally structured nature of multi-omics datasets. This approach enables the identification of putative causal relationships across genomics, transcriptomics, proteomics, and metabolomics, facilitating the reconstruction of mechanistic pathways directly from observational data by leveraging the central dogma of molecular biology. The resulting causal graphs serve as hypothesis-generating tools, guiding further experimental validation and model refinement. I will demonstrate the utility of our method using both simulated and publicly available datasets.
(TCPL 201)
15:55 - 16:20 Iuliana Ionita-Laza: Detecting context-dependent QTLs via whole-genome quantile regression
Genome-wide association studies (GWAS) for biomarkers and molecular phenotypes can lead to clinically relevant discoveries. Numerous lines of evidence from both model organisms and human studies suggest that genetic associations can be highly heterogeneous, dynamic and context dependent. Despite twenty years of GWAS, most studies are based on statistical models that fail to account for such heterogeneity. In this talk I will discuss alternative approaches based on quantile regression (QR) models that naturally extend linear regression models to the analysis of the entire conditional distribution of a phenotype of interest. I will introduce a novel and computationally efficient whole-genome quantile regression technique, Regenie.QRS, for biobank-scale GWAS data with genetic structure. I will show applications to biomarkers and molecular QTLs in eGTEx and UK Biobank. Time permitting, I will also discuss applications of QR to uncertainty quantification for polygenic risk score prediction.
(TCPL 201)
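As background to the quantile regression models mentioned in the abstract above, the sketch below shows the standard check (pinball) loss and that minimizing it over a constant recovers a sample quantile. It is a generic textbook illustration, not Regenie.QRS; the function name, the toy data, and the grid search are assumptions chosen for brevity.

```python
import numpy as np

def pinball_loss(y, prediction, tau):
    """Average check loss for quantile level tau (0 < tau < 1)."""
    residual = np.asarray(y, dtype=float) - prediction
    # Under-predictions are weighted by tau, over-predictions by (1 - tau).
    return float(np.mean(np.maximum(tau * residual, (tau - 1.0) * residual)))

# Toy phenotype values with one extreme observation.
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Grid search over constant predictions, purely for illustration.
candidates = np.linspace(0.0, 100.0, 1001)
best_median = candidates[np.argmin([pinball_loss(y, c, 0.5) for c in candidates])]
best_q90 = candidates[np.argmin([pinball_loss(y, c, 0.9) for c in candidates])]

print(f"tau=0.5 minimizer: {best_median:.1f}")
print(f"tau=0.9 minimizer: {best_q90:.1f}")
```

Replacing the constant with a linear function of covariates gives a quantile regression, which models different parts of the phenotype distribution at different values of tau rather than only the mean.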
16:20 - 16:45 Jean Yee Hwa Yang: Multi-omics and chimpanzee data reveal signatures of subclinical CAD and resilience
Cardiovascular diseases (CVDs) remain a leading cause of global morbidity and mortality, with an estimated half a billion people continuing to be affected. Disease prevention and early risk prediction are key strategies in reducing CVD prevalence. However, detecting associations between specific pathophysiological mechanisms and coronary plaque remains challenging due to individual diversity and varying risk factors. Here, we examine a cohort of multi-omics data, including layers of genomics, scRNA-seq, metabolomics, and lipidomics data, to identify subcohorts of individuals with subclinical CAD and distinct molecular signatures. We then demonstrate that utilising these subcohorts improves the prediction of subclinical CAD and aids in the discovery of novel biomarkers for early disease detection. Additionally, by integrating omics data from chimpanzees and examining interspecies differences, we can identify potential biological signatures associated with resilience to CAD.
(TCPL 201)
17:30 - 19:30 Dinner
A buffet dinner is served daily between 5:30pm and 7:30pm in Vistas Dining Room, top floor of the Sally Borden Building.
(Vistas Dining Room)
Wednesday, July 16
07:00 - 08:45 Breakfast
Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
08:55 - 09:00 Mingyao Li: Introduction by Session Leader (TCPL 201)
09:00 - 09:25 Jichun Xie: A new generative foundation model for harmonized, comprehensive analysis of single-cell and spatial transcriptomics
Massive single-cell and spatial transcriptomics datasets have been rapidly accumulating over the past decade. However, transcript distributions often vary across different platforms, gene panels, batches, tissues, and disease states. Some of these variations reflect genuine biological signals, while others represent technical noise. To effectively leverage these vast resources—retaining true biological variation while removing unwanted technical variability—we curated and integrated publicly available data, assembling over 80 million cells from diverse single-cell and spatial transcriptomics sources. Using this extensive dataset, we developed a generative foundation model that synergizes artificial intelligence with statistical modeling. This model projects mosaic-patterned cell data from various platforms and gene panels into a unified embedding space, enabling automatic cell annotation and facilitating multiple downstream analyses. One immediate and impactful application of the model is the imputation of genes not measured by spatial transcriptomics platforms with restricted gene panels, such as Xenium and MERFISH. Additionally, the model generates informative gene weights that can guide the optimization of future gene panel designs. The generative framework inherently quantifies uncertainty in transcript counts, providing a valuable measure that can extend to the quantification of uncertainty in subsequent analyses. Overall, this generative foundation model offers a robust and interpretable platform for harmonizing and comprehensively analyzing single-cell and spatial transcriptomics data.
(TCPL 201)
09:25 - 09:50 Kyle Coleman: Multi-modal spatial omics modeling at cellular resolution with MISO
TBA
(TCPL 201)
09:50 - 10:20 Coffee Break (TCPL Foyer)
10:20 - 10:45 Tae Hyun Hwang: AI-Driven Integration of 3D Pathology and Spatial Transcriptomics for Precision Oncology
TBA
(TCPL 201)
10:45 - 11:10 Linghua Wang: Unveiling Tumor Ecosystems with Integrative Spatial Transcriptomics and Digital Pathology
TBA
(TCPL 201)
11:10 - 11:35 Derek Oldridge: Spatialomics and AI-driven image analysis in translational pathology research
TBA
(TCPL 201)
11:35 - 13:30 Lunch
Lunch is served daily between 11:30am and 1:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
13:30 - 17:30 Free Afternoon (Banff National Park)
17:30 - 19:30 Dinner
A buffet dinner is served daily between 5:30pm and 7:30pm in Vistas Dining Room, top floor of the Sally Borden Building.
(Vistas Dining Room)
Thursday, July 17
07:00 - 08:45 Breakfast
Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
08:55 - 09:00 Christine Peterson: Introduction by Session Leader (TCPL 201)
09:00 - 09:40 Michael Wu: Minicourse: Microbiome data integration
TBA
(TCPL 201)
09:40 - 10:05 Jing Ma: Network-based integration of microbiome and metabolomic data
Correlation networks are commonly used to infer associations between microbes and metabolites. The resulting p-values are then corrected for multiple comparisons using existing methods such as the Benjamini & Hochberg (BH) procedure to control the false discovery rate (FDR). However, most existing methods for FDR control assume the p-values are weakly dependent. Consequently, they can have low power in recovering microbe-metabolite association networks that exhibit important topological features, such as the presence of densely associated modules. We propose a novel inference procedure that is both powerful for detecting significant associations in the microbe-metabolite network and capable of controlling the FDR. Power enhancement is achieved by modeling latent structures in the form of a bipartite stochastic block model. We develop a variational expectation-maximization (EM) algorithm to estimate the model parameters and incorporate the learned graph in the testing procedure. In addition to FDR control, this procedure provides a clustering of microbes and metabolites into modules, which is useful for interpretation. We demonstrate the merit of the proposed method in simulations and an application to bacterial vaginosis.
(TCPL 201)
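For readers unfamiliar with the Benjamini & Hochberg (BH) procedure named in the abstract above, the standard step-up rule can be sketched as follows. This is a minimal illustration of the classical baseline only, not the speaker's proposed method; the function name `benjamini_hochberg` and the example p-values are assumptions made for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Classical BH step-up rule (illustrative sketch).

    Returns a list of booleans, True where the hypothesis is rejected.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k such that p_(k) <= k * alpha / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, e.g. one per microbe-metabolite pair.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
decisions = benjamini_hochberg(example, alpha=0.05)
```

Because the rule is step-up, a p-value that misses its own threshold is still rejected whenever a larger p-value meets its threshold.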
10:05 - 10:30 Coffee Break (TCPL Foyer)
10:30 - 10:55 Kris Sankaran: Modular software for mediation analysis of microbiome data
Mediation analysis has emerged as a versatile tool for answering mechanistic questions in microbiome research because it provides a statistical framework for attributing treatment effects to alternative causal pathways. Using a series of linked regressions, this analysis quantifies how complementary data relate to one another and respond to treatments. Despite these advances, existing software’s rigid assumptions often result in users viewing mediation analysis as a black box. We designed the multimedia R package to make advanced mediation analysis techniques accessible, ensuring that statistical components are interpretable and adaptable. The package provides a uniform interface to direct and indirect effect estimation, synthetic null hypothesis testing, bootstrap confidence interval construction, and sensitivity analysis, enabling experimentation with various mediator and outcome models while maintaining a simple overall workflow. The software includes modules for regularized linear, compositional, random forest, hierarchical, and hurdle modeling, making it well-suited to microbiome data. Our case study revisits a study of the microbiome and metabolome of Inflammatory Bowel Disease patients, uncovering potential mechanistic interactions between the microbiome and disease-associated metabolites, not found in the original study. In addition to summarizing the package, we will explain the software design patterns that we drew inspiration from and how they could inform reproducible multi-omics integration more generally. A gallery of examples and reference page can be found at https://go.wisc.edu/830110.
(TCPL 201)
10:55 - 11:20 Jun Chen: Compositional sparse canonical correlation analysis for microbiome multi-omics data integration
Sparse canonical correlation analysis (sCCA) is a powerful approach for integrating high-dimensional datasets by identifying subsets of features that capture the strongest associations in the data. In microbiome studies, understanding the interactions between different layers of microbiome omics and their connections with host omics is a key research goal. However, traditional sCCA methods often overlook the compositional nature of microbiome data, making them less effective for this application. To address this gap, we propose a novel sCCA framework specifically designed for integrating microbiome data with other high-dimensional omics datasets. Our approach explicitly accounts for the compositional structure inherent to microbiome sequencing data and incorporates prior structural information, such as the grouping patterns among bacterial taxa. We demonstrate the effectiveness of our method by integrating taxonomic compositional data with metabolomics data from an adenoma microbiome study.
(TCPL 201)
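As background on the compositional nature of microbiome data mentioned in the abstract above, one widely used device is the centered log-ratio (CLR) transform. The sketch below is generic and is not claimed to be part of the speaker's sCCA framework; the function name `clr` and the pseudocount default are assumptions made for illustration.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample (illustrative sketch).

    A pseudocount is added so that zero counts have a finite logarithm.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log equals dividing by the geometric mean.
    return [v - mean_log for v in logs]


# Hypothetical taxon counts for a single sample.
sample = [120, 30, 0, 850]
transformed = clr(sample)
```

The transformed values sum to zero, and with a zero pseudocount the result does not change when all counts are multiplied by a common factor such as sequencing depth.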
11:20 - 11:45 Rebecca Deek: Benchmarking and improving statistical methods for microbiome multiomics integration
A growing number of epidemiological microbiome studies are adopting a multi-view perspective with additional sequencing of the host genome, transcriptome, metabolome, or proteome. These microbiome multiomics studies allow for a more in-depth understanding of the functional role the microbiome plays in human-host health, often in terms of their metabolomics and proteomics interplay and capacity. However, a statistical and computational bottleneck remains in analyzing data from such studies due to the limited number of specialized methodologies. Furthermore, little is known about the portability of general data integration methods to the multiomics setting. This work benchmarks several state-of-the-art methods to compare, associate, and integrate microbiome multiomics data. Best practices and the need for new microbiome-specific methodologies are discussed. Finally, a novel distance-based test for association designed for microbiome-metabolomics studies is introduced. Performance of the proposed method, compared to existing methods, is illustrated using simulations and real data sets.
(TCPL 201)
11:45 - 13:30 Lunch
Lunch is served daily between 11:30am and 1:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
13:40 - 13:45 Jingjing Yang: Introduction by Session Leader (TCPL 201)
13:45 - 14:10 Jingyi Jessica Li: scDesignPop: a flexible framework for generating realistic population-scale single-cell RNA-seq data
Single-cell RNA sequencing (scRNA-seq) combined with genotyping across hundreds of individuals has enabled the discovery of genetic associations at single-cell resolution for various human diseases. While this has driven the development of new computational methods for analyzing large-scale data, tools for evaluating these methods remain limited, and generating experimental data at scale is cost-prohibitive. To address this gap, we present scDesignPop, a flexible statistical framework for simulating realistic scRNA-seq data at the population scale. scDesignPop incorporates cell- and individual-level covariates, experimental conditions, and genetic effects from known expression quantitative trait loci (eQTLs). We evaluated scDesignPop using two large-scale studies—the OneK1K and CLUES cohorts—across several qualitative criteria and over a dozen quantitative metrics. Our results demonstrate that scDesignPop achieves higher simulation quality than splatPop, an existing population-scale simulator, particularly with the OneK1K cohort, while also supporting more complex modeling scenarios with the CLUES cohort. In addition to generating realistic gene expression and preserving gene-gene correlations within cell types, scDesignPop enables key applications including eQTL power analysis at cell-type resolution, privacy protection by mitigating eQTL-based re-identification via linking attacks, simulation of scRNA-seq data for novel individuals using real or synthetic genotypes, and the creation of positive and negative control datasets for cell-type-specific eQTL discovery.
(TCPL 201)
14:10 - 14:35 Ni Zhao: Identifying unmeasured heterogeneity in microbiome data via quantile thresholding (QuanT)
Background: Microbiome data, like other high-throughput data, suffer from technical heterogeneity stemming from differing experimental designs and processing. In addition to measured artifacts such as batch effects, there is heterogeneity due to unknown or unmeasured factors, which can lead to spurious conclusions if unaccounted for. With the advent of large-scale multi-center microbiome studies and the increasing availability of public datasets, this issue becomes more pronounced. Current approaches for addressing unmeasured heterogeneity in high-throughput data were developed for microarray and/or RNA sequencing data. They cannot accommodate the unique characteristics of microbiome data such as sparsity and over-dispersion. Results: Here, we introduce Quantile Thresholding (QuanT), a novel non-parametric approach for identifying unmeasured heterogeneity tailored to microbiome data. QuanT applies quantile regression across multiple quantile levels to threshold the microbiome abundance data and uncovers latent heterogeneity using thresholded binary residual matrices. We validated QuanT using both synthetic and real microbiome datasets, demonstrating its superiority in capturing and mitigating heterogeneity and improving the accuracy of downstream analyses, such as prediction analysis, differential abundance tests, and community-level diversity evaluations. Conclusions: We present QuanT, a novel tool for comprehensive identification of unmeasured heterogeneity in microbiome data. QuanT's distinct non-parametric method markedly enhances downstream analyses, serving as a valuable tool for data integration and comprehensive analysis in microbiome research.
(TCPL 201)
14:35 - 15:00 Danilo Bzdok: Designing machine learning paradigms towards multi-omics fusion
Increasing specialization in our societies has also pushed the sciences to segregate into ever more specific branches. Due to the advent of big data, many of these largely isolated silos have now independently become more data rich than ever before. Machine learning is an opportunity to build bridges between several levels of biology usually studied in isolation. For example, I will present first results on how i) macroscopic brain imaging, ii) cellular and even organelle gene expression from single cell genomics, as well as iii) molecular protein detection via proteomics can be brought to the same table for a target phenotype like Alzheimer’s disease. Importantly, we show that such data fusion is possible and valuable even if the different omics data come from nonoverlapping participants. Such model frameworks will be instrumental in fully integrating the amassing data troves in biology and medicine.
(TCPL 201)
15:00 - 15:30 Coffee Break (TCPL Foyer)
15:30 - 15:55 Sarah Gagliano Taliun: Exploring and visualizing stratified genome-wide association study summary statistics using PheWeb 2.0
As sample sizes increase, along with the diversity of cohort participants, stratified genome-wide association studies (GWAS) are becoming more commonplace. However, the lack of functionality in web-based tools for interacting with these results is currently hindering researchers from advancing knowledge of the effects of ancestry and sex on the genetics of complex human diseases and traits. Here we introduce PheWeb 2.0, a completely rewritten, enhanced version of our original interactive web browser, now with API functionality, which offers intuitive and efficient support for stratified GWAS results. Specifically, PheWeb 2.0 accepts GWAS summary statistics of the same trait from various stratifications (e.g. sex-stratified, ancestry-stratified) and allows for intuitive, side-by-side, user-determined visualization of the association results of two stratifications at a time through an interactive Miami plot. Additionally, PheWeb 2.0 now supports summary statistics with genetic variant by sex interaction information.
(TCPL 201)
15:55 - 16:20 Qihuang Zhang: Identifying cell density marker genes with DenMark: a Bayesian hierarchical marked point process model
Imaging-based spatial transcriptomics (ST) provides a unique opportunity to investigate the relationship between cell distribution and gene expression across tissues. However, rigorous statistical methods capturing this relationship remain underdeveloped due to the intrinsic randomness in cell locations and substantial computational demands. To address this gap, we introduce DenMark, a Bayesian hierarchical marked point process (MPP) model explicitly designed to jointly characterize cell intensity and associated gene expression, facilitating the identification of density-specific genes. Our model incorporates fixed effects to capture correlations between cell density and gene expression and employs latent random fields to account for residual spatial variation not explained by direct density-expression relationships. To mitigate computational challenges inherent in such complex spatial models, we implement the Hilbert space Gaussian process (HSGP) method. This approach provides an efficient low-rank approximation, significantly enhancing computational feasibility within the MPP framework. We demonstrate the utility of our model using ST data from mouse brain tissues, identifying several genes associated with cellular density across different cell types.
(TCPL 201)
17:30 - 19:30 Dinner
A buffet dinner is served daily between 5:30pm and 7:30pm in Vistas Dining Room, top floor of the Sally Borden Building.
(Vistas Dining Room)
Friday, July 18
07:00 - 08:45 Breakfast
Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building.
(Vistas Dining Room)
09:00 - 10:00 Group Discussion & Wrapping-up (TCPL 201)
10:00 - 10:30 Coffee Break (TCPL Foyer)
10:30 - 11:00 Checkout by 11AM
5-day workshop participants are welcome to use BIRS facilities (TCPL) until 3 pm on Friday, although participants are still required to check out of the guest rooms by 11AM.
(Front Desk - Professional Development Centre)
12:00 - 13:30 Lunch from 11:30 to 13:30 (Vistas Dining Room)