Tuesday, February 13 |
07:00 - 08:30 |
Breakfast ↓ Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building. (Vistas Dining Room) |
08:25 - 08:30 |
Theme of the day: Generative AI and Fairness (TCPL 201) |
08:30 - 09:30 |
Haoda Fu: Generative AI on Smooth Manifolds: A Tutorial ↓ Generative AI is a rapidly evolving technology that has garnered significant interest lately. In this presentation, we’ll discuss the latest approaches, organizing them within a cohesive framework using stochastic differential equations to understand complex, high-dimensional data distributions. We’ll highlight the necessity of studying generative models beyond Euclidean spaces, considering smooth manifolds essential in areas like robotics and medical imagery, and for leveraging symmetries in the de novo design of molecular structures. Our team’s recent advancements in this blossoming field, ripe with opportunities for academic and industrial collaborations, will also be showcased. (TCPL 201) |
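The SDE framing of diffusion-type generative models covered in this tutorial can be sketched in a few lines for the Euclidean, one-dimensional case, where the score of every marginal is available in closed form. Everything below is a generic illustration (not code from the talk); manifold versions replace these Euclidean Euler–Maruyama updates with geodesic or exponential-map steps.

```python
import math
import random

random.seed(0)

# Data distribution N(m, s^2); VP forward SDE dx = -x/2 dt + dW, so the
# score of the marginal p_t is available in closed form.
m, s = 2.0, 0.5
T, n_steps, n_paths = 5.0, 500, 2000
dt = T / n_steps

def score(x, t):
    """Closed-form score (gradient of log p_t) for Gaussian data."""
    a = math.exp(-t / 2.0)              # signal decay e^{-t/2}
    v = (s * a) ** 2 + 1.0 - a * a      # marginal variance at time t
    return -(x - m * a) / v

a_T = math.exp(-T / 2.0)
v_T = (s * a_T) ** 2 + 1.0 - a_T * a_T

samples = []
for _ in range(n_paths):
    # Start from the time-T marginal, then run the reverse-time SDE
    # dx = [-x/2 - score(x, t)] dt + dW backward with Euler-Maruyama.
    x = random.gauss(m * a_T, math.sqrt(v_T))
    t = T
    for _ in range(n_steps):
        x -= (-0.5 * x - score(x, t)) * dt
        x += random.gauss(0.0, math.sqrt(dt))
        t -= dt
    samples.append(x)

mean = sum(samples) / n_paths   # should approach m = 2.0
```

In practice the score is not known and is learned by a neural network; this sketch just shows why a reverse-time SDE driven by the score transports noise back to the data distribution.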
09:30 - 10:00 |
Bin Yu: What is uncertainty in today's practice of data science? ↓ Uncertainty quantification is central to statistics and a cornerstone for building trust in data conclusions for any real-world data problem. The current practice of statistics formally addresses uncertainty arising from sample-to-sample variability under a generative stochastic model, which is unfortunately often not model-checked enough in today's practice of statistics. In the data science life cycle (DSLC) that each data analysis goes through in practice, there are many other important sources of uncertainty. In this talk, we discuss uncertainty sources in a DSLC arising from human judgment calls, through the lens of the Predictability-Computability-Stability (PCS) framework and documentation for veridical (truthful) data science. In particular, we will formally address two additional sources, from data cleaning/preprocessing and from model/algorithm choices, so that more trustworthy and reproducible data-driven discoveries can be achieved. (Online) |
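The stability principle behind PCS can be illustrated with a toy simulation: perturb a single data-cleaning judgment call and report how much the conclusion moves. The data and cutoff values below are invented purely for illustration.

```python
import random
import statistics

random.seed(1)

# Toy data: a clean Gaussian signal plus a few heavy-tailed contaminants.
data = [random.gauss(10.0, 2.0) for _ in range(200)] + [60.0, 80.0, 95.0]

def clean(xs, cutoff):
    """One cleaning judgment call: drop values above a cutoff."""
    return [x for x in xs if x <= cutoff]

# Perturb the judgment call and record how much the conclusion moves:
# the spread across reasonable cutoffs is a stability-based uncertainty.
estimates = {c: statistics.mean(clean(data, c)) for c in (30.0, 70.0, 100.0)}
spread = max(estimates.values()) - min(estimates.values())
```

The spread across defensible cleaning choices is an uncertainty source that a model-based confidence interval on any single cleaned dataset would not capture.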
10:00 - 10:30 |
Coffee Break (TCPL Foyer) |
10:30 - 11:30 |
Lightning session: Lloyd Elliott; Bei Jiang; Wenlong Mou; Deshan Perera; Qingrun Zhang ↓ Five lightning talks; titles and abstracts appear under the individual time slots below. (TCPL 201) |
10:30 - 10:49 |
Lloyd Elliott: Teaching Machine Learning using Data for Good ↓ Interest in degrees and courses on data science, machine learning, and statistics has increased greatly over the past fifteen years. In addition to teaching technical and theoretical aspects, university events and programs support this interest through hackathons and case studies in which real datasets are examined (sometimes in collaboration with corporate, NGO, or charity co-organizers). This raises two questions about fairness: 1) In the case of corporate co-organizers, or real datasets derived from commercial data, where is the line between pedagogy and uncompensated student labour with respect to the insights and deliverables developed by students? 2) As teachers, how can we make a positive impact in the world through the case studies we set? For five years, I have taught "Learning from Data Science." I will discuss my thoughts on these questions and offer insights drawn from my own experience setting case studies. (TCPL 201) |
10:49 - 11:01 |
Bei Jiang: Online Local Differential Private Quantile Inference via Self-normalization ↓ Based on binary inquiries, we develop an algorithm to estimate population quantiles under Local Differential Privacy (LDP). By self-normalizing, our algorithm provides asymptotically normal estimates with valid inference, yielding tight confidence intervals without the need to estimate nuisance parameters. The proposed method can be run fully online, leading to high computational efficiency and minimal storage requirements with O(1) space. We also prove an optimality result via an elegant application of a central limit theorem for Gaussian Differential Privacy (GDP) for the frequently encountered median estimation problem. Through mathematical proof and extensive numerical testing, we demonstrate the validity of our algorithm both theoretically and experimentally. (TCPL 201) |
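A hedged sketch of the general idea: a binary inquiry at the current estimate, privatized by randomized response, drives an online stochastic-approximation update of the quantile. This is a generic illustration, not the authors' algorithm, and it omits the self-normalized confidence intervals that are the paper's main contribution.

```python
import math
import random

random.seed(0)

eps, tau = 1.0, 0.5                        # privacy budget; target quantile
p = math.exp(eps) / (1.0 + math.exp(eps))  # prob. a bit is reported truthfully

theta = 0.0                                # running quantile estimate
for t in range(1, 200001):
    x = random.gauss(3.0, 1.0)             # one private observation
    bit = 1 if x <= theta else 0           # binary inquiry at the estimate
    if random.random() > p:                # randomized-response privatization
        bit = 1 - bit
    unbiased = (bit - (1.0 - p)) / (2.0 * p - 1.0)  # debiased indicator
    theta -= (unbiased - tau) / t ** 0.7   # stochastic-approximation update

# theta should approach the true median of N(3, 1), i.e. 3.0
```

Only a single scalar is carried across iterations, which is the sense in which such online schemes need O(1) space.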
11:01 - 11:13 |
Wenlong Mou: A decorrelation method for general regression adjustment in randomized experiments ↓ Randomized experiments are the gold standard for estimating the effect of an intervention, and the efficiency of estimation can be further improved using regression adjustments. Standard regression adjustment, however, incurs bias due to sample re-use; this bias leads to behavior that is sub-optimal in the sample size and/or imposes restrictive assumptions. In this talk, I present a simple yet effective decorrelation method that circumvents these issues. Among other results, I will highlight sharp non-asymptotic guarantees satisfied by the estimator under very mild assumptions. (TCPL 201) |
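One standard decorrelation device is sample splitting: fit the adjustment model on one fold and apply it on the other, so the adjustment is independent of the outcomes it corrects. The sketch below illustrates that idea on a toy randomized experiment; it is not necessarily the estimator from the talk.

```python
import random

random.seed(0)

tau, n = 2.0, 4000                      # true treatment effect; sample size
X = [random.uniform(-1.0, 1.0) for _ in range(n)]
A = [random.randint(0, 1) for _ in range(n)]         # randomized assignment
Y = [3.0 * x + tau * a + random.gauss(0.0, 1.0) for x, a in zip(X, A)]

def fit_slope(xs, ys):
    """Least-squares slope of ys on xs (a simple outcome model)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / sum((x - mx) ** 2 for x in xs)

half, est = n // 2, []
folds = [(range(0, half), range(half, n)), (range(half, n), range(0, half))]
for train, evals in folds:
    # Fit the adjustment on one fold, apply it on the other, so the
    # adjustment is decorrelated from the outcomes it corrects.
    b = fit_slope([X[i] for i in train], [Y[i] for i in train])
    treat = [Y[i] - b * X[i] for i in evals if A[i] == 1]
    ctrl = [Y[i] - b * X[i] for i in evals if A[i] == 0]
    est.append(sum(treat) / len(treat) - sum(ctrl) / len(ctrl))

tau_hat = sum(est) / len(est)           # should be close to tau = 2.0
```

Adjusting with a model fit on the same sample would reintroduce the re-use bias that the abstract describes.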
11:13 - 11:20 |
Deshan Perera: CATE: An accelerated and scalable solution for large-scale genomic data processing through GPU and CPU-based parallelization ↓ The power of the statistical tests that quantify the evolution of a genome is strengthened by larger sample sizes. However, increased sample sizes place a significant demand on computational resources, resulting in longer compute times. Parallelization, especially using the Graphics Processing Unit (GPU), can alleviate this burden. NVIDIA's CUDA GPUs are becoming commonplace for accelerating genetic algorithms with the aim of reducing computational time, but this potential for large-scale parallelization has so far not been realized in molecular evolution analyses. CATE (CUDA Accelerated Testing of Evolution) is such a software solution. It is a scalable program built on NVIDIA's CUDA platform, together with an exclusive file hierarchy, to process six frequently used evolutionary tests, namely: Tajima's D; Fu and Li's D, D*, F, and F*; Fay and Wu's H and E; the McDonald–Kreitman test; the Fixation Index; and Extended Haplotype Homozygosity. CATE comprises two main innovations: a file organization system coupled with a novel multithreaded search algorithm called Compound Interpolation Search, and large-scale parallelization of the algorithms across the GPU, CPU, and SSD. Powered by these implementations, CATE is orders of magnitude faster than standard tools. For instance, CATE processes all 54,849 human genes across all 22 autosomal chromosomes for the five super populations in the 1000 Genomes Project in less than thirty minutes, while counterpart software took 3.62 days. This proven framework has the potential to be adapted for GPU-accelerated, large-scale parallel computation of many evolutionary and genomic analyses. GitHub repository: https://github.com/theLongLab/CATE GitHub Wiki: https://github.com/theLongLab/CATE/wiki Published in Methods in Ecology and Evolution: https://doi.org/10.1111/2041-210X.14168 (TCPL 201) |
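Of the six tests CATE accelerates, Tajima's D is the simplest to state: it compares mean pairwise differences against the number of segregating sites. A plain CPU reference implementation of the standard formula, unrelated to CATE's GPU code, might look like:

```python
import math
from itertools import combinations

def tajimas_d(seqs):
    """Tajima's D from a list of equal-length aligned sequences."""
    n, L = len(seqs), len(seqs[0])
    # S: number of segregating (polymorphic) sites.
    S = sum(1 for j in range(L) if len({s[j] for s in seqs}) > 1)
    # pi: mean number of pairwise differences between sequences.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    # Standard normalizing constants (Tajima, 1989).
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1, e2 = c1 / a1, c2 / (a1 ** 2 + a2)
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))

d = tajimas_d(["AATG", "AATG", "ACTG", "AATC"])
```

CATE's contribution is running such per-window statistics for thousands of genes in parallel; the arithmetic per window is exactly this kind of loop.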
11:20 - 11:33 |
Qingrun Zhang: eXplainable representation learning via Autoencoders revealing Critical genes ↓ Machine learning models are frequently used in transcriptome analyses. In particular, Representation Learning (RL) models, e.g., autoencoders, are effective at learning critical representations in noisy data. However, learned representations, e.g., the "latent variables" in an autoencoder, are difficult to interpret, let alone use to prioritize essential genes for functional follow-up. In contrast, traditional analyses identify important genes such as Differentially Expressed (DiffEx), Differentially Co-Expressed (DiffCoEx), and Hub genes. Intuitively, complex gene-gene interactions may be beyond the reach of marginal effects (DiffEx) or correlations (DiffCoEx and Hub), indicating the need for powerful RL models. However, the lack of interpretability and of individual target genes is an obstacle to RL's broad use in practice. To facilitate interpretable analysis and gene identification using RL, we propose "Critical genes", defined as genes that contribute highly to learned representations (e.g., latent variables in an autoencoder). As a proof of concept, supported by eXplainable Artificial Intelligence (XAI), we implemented the eXplainable Autoencoder for Critical genes (XA4C), which quantifies each gene's contribution to latent variables and prioritizes Critical genes accordingly. Applying XA4C to gene expression data in six cancers showed that Critical genes capture essential pathways underlying cancers. Remarkably, Critical genes have little overlap with Hub or DiffEx genes, yet show higher enrichment in a comprehensive disease gene database (DisGeNET) and a cancer-specific database (COSMIC), evidencing their potential to disclose substantial unknown biology. As an example, we discovered five Critical genes sitting at the center of the Lysine degradation (hsa00310) pathway, displaying distinct interaction patterns in tumor and normal tissues. In conclusion, XA4C facilitates explainable analysis using RL, and the Critical genes discovered by explainable RL empower the study of complex interactions. (TCPL 201) |
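The notion of a gene's contribution to latent variables has a simple special case when the encoder is linear: contributions can be read off the encoder weights. The sketch below uses made-up weights purely to illustrate the ranking step; XA4C itself applies XAI attribution to a nonlinear autoencoder.

```python
# Toy encoder weights of a trained *linear* autoencoder: rows are genes,
# columns are latent variables. The numbers are invented for illustration.
genes = ["G1", "G2", "G3", "G4"]
W = [[0.90, 0.10],
     [0.05, 0.80],
     [0.02, 0.03],
     [0.40, 0.50]]

# A gene's contribution score: total absolute weight into the latent space.
# "Critical genes" would then be the top-ranked entries.
contrib = {g: sum(abs(w) for w in row) for g, row in zip(genes, W)}
ranked = sorted(genes, key=lambda g: -contrib[g])
```

For a nonlinear encoder the weights alone are not sufficient, which is why attribution methods are needed to assign per-gene contributions.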
11:45 - 13:30 |
Lunch ↓ Lunch is served daily between 11:30am and 1:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building. (Vistas Dining Room) |
13:30 - 14:30 |
Sanmi Koyejo: Algorithmic Fairness: Why it's hard and why it's interesting (Tutorial) ↓ In only a few years, algorithmic fairness has grown from a niche topic to a major component of machine learning and artificial intelligence research and practice. As a field, we have had some embarrassing mistakes, yet our understanding of the core issues, potential impacts, and mitigation approaches has grown. This tutorial presents a range of recent findings, discussions, questions, and partial answers in the space of algorithmic fairness. While it will not attempt a comprehensive overview of this rich area, we aim to provide participants with tools and insights, and to explore connections between algorithmic fairness and a broad range of ongoing research efforts in the field. We will tackle some of the hard questions you may have about algorithmic fairness, and hopefully address some misconceptions that have become pervasive. (TCPL 201) |
14:30 - 15:00 |
Joshua Snoke: De-Biasing the Bias: Methods for Improving Disparity Assessments with Noisy Group Measurements ↓ Health care decisions are increasingly informed by clinical decision support algorithms, but concern exists that these algorithms, trained using machine learning, may perpetuate or increase racial and ethnic disparities in the administration of health care resources. Clinical data often lack racial/ethnic information entirely or contain erroneous, poor measures of race and ethnicity, which can lead to misleading or insufficient assessments of algorithmic bias in clinical settings. We present novel methods to assess and mitigate potential bias in machine learning models used to inform clinical decisions when race and ethnicity information is missing or poorly measured. We provide theoretical bounds on the statistical bias for a set of commonly used fairness metrics, and we show that these bounds can be estimated in practice and that they hold under a set of simple assumptions. Further, we provide a sensitivity-analysis method to estimate the range of potential disparities when the assumptions do not hold. We show that these methods for accurately estimating disparities can be extended to post-algorithm adjustments that enforce common definitions of fairness. We provide a case study using race and ethnicity inferred by the Bayesian Improved Surname Geocoding (BISG) algorithm to estimate disparities in a clinical algorithm used to inform osteoporosis treatment decisions. With these methods, a policy maker can understand the range of potential disparities resulting from the use of a given algorithm, even when race and ethnicity information is missing, and make informed decisions regarding the safe implementation of machine learning for supporting clinical decisions. (TCPL 201) |
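A toy simulation shows why naive use of BISG-style membership probabilities needs the kind of bias correction the abstract describes: probability-weighted group means attenuate the true disparity. All data-generating numbers below are invented, and the weighted estimator shown is a naive baseline, not the authors' method.

```python
import random

random.seed(0)

n = 20000
truth, prob, y = [], [], []
for _ in range(n):
    g = random.randint(0, 1)                       # true (unobserved) group
    # BISG-style membership probability: informative but imperfect.
    p = min(1.0, max(0.0, (0.75 if g else 0.25) + random.gauss(0.0, 0.05)))
    truth.append(g)
    prob.append(p)
    y.append(random.gauss(1.0 if g else 0.0, 1.0))  # true disparity = 1.0

# Naive probability-weighted group means.
w1 = sum(p * v for p, v in zip(prob, y)) / sum(prob)
w0 = sum((1 - p) * v for p, v in zip(prob, y)) / sum(1 - p for p in prob)
disparity_hat = w1 - w0                             # attenuated toward 0

g1 = [v for g, v in zip(truth, y) if g == 1]
g0 = [v for g, v in zip(truth, y) if g == 0]
true_disp = sum(g1) / len(g1) - sum(g0) / len(g0)   # close to 1.0
```

With these settings the weighted estimate recovers only about half the true disparity, which is the kind of statistical bias the paper's bounds and sensitivity analysis are designed to quantify.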
15:00 - 15:30 |
Coffee Break (TCPL Foyer) |
15:30 - 16:00 |
Giles Hooker: A Generic Approach to Stabilized Model Distillation ↓ Model distillation is a popular method for producing interpretable machine learning: an interpretable "student" model is trained to mimic the predictions of a black-box "teacher" model. However, when the student model is sensitive to the variability of the data sets used for training, even with the teacher fixed, the resulting interpretation is not reliable. Existing strategies stabilize model distillation by checking whether a large enough corpus of pseudo-data has been generated to reliably reproduce student models, but such methods have so far been developed only for specific student models. In this paper, we develop a generic approach to stable model distillation based on a central limit theorem for the average loss. We start with a collection of candidate student models and search for candidates that reasonably agree with the teacher. We then construct a multiple-testing framework to select a corpus size such that a consistent student model would be selected under different pseudo-samples. We demonstrate the approach on three commonly used intelligible models: decision trees, falling rule lists, and symbolic regression. Finally, we conduct simulation experiments on the Mammographic Mass and Breast Cancer datasets and illustrate the testing procedure through a theoretical analysis with a Markov process. (TCPL 201) |
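The stabilization idea can be illustrated with a minimal sketch: distill a black-box teacher into a one-split student on pseudo-data corpora of growing size, and check when independently fitted students agree. This stand-in checks raw agreement rather than the paper's multiple-testing procedure, and all functions are invented for illustration.

```python
import random

random.seed(0)

def teacher(x):
    """Stand-in black-box teacher: a fixed threshold rule."""
    return 1.0 if x > 0.37 else 0.0

def fit_stump(xs, ys):
    """Student: the single split threshold minimizing squared loss."""
    best_loss, best_cut = float("inf"), None
    for cut in sorted(set(xs)):
        loss = 0.0
        for side in ([y for x, y in zip(xs, ys) if x <= cut],
                     [y for x, y in zip(xs, ys) if x > cut]):
            if side:
                mu = sum(side) / len(side)
                loss += sum((y - mu) ** 2 for y in side)
        if loss < best_loss:
            best_loss, best_cut = loss, cut
    return best_cut

# Grow the pseudo-data corpus and measure how much independently
# fitted students disagree; pick a size at which they stabilize.
spreads = {}
for size in (50, 200, 600):
    cuts = []
    for _ in range(5):
        xs = [random.uniform(0.0, 1.0) for _ in range(size)]
        cuts.append(fit_stump(xs, [teacher(x) for x in xs]))
    spreads[size] = max(cuts) - min(cuts)   # disagreement across corpora
```

As the corpus grows, the fitted split concentrates near the teacher's threshold; the paper replaces this informal agreement check with a formal testing framework.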
16:00 - 16:30 |
Danica Sutherland: Conditional independence measures for fairer, more reliable models ↓ Several notions of algorithmic fairness, as well as techniques for out-of-distribution generalization, amount to enforcing the independence of model outputs Φ(X) from a protected attribute, domain identifier, or similar variable Z, conditional on the true label Y. Much work in this area assumes discrete Y and Z, and struggles to handle complex predictions (e.g., object localization from images) and/or complex conditioning (e.g., fairness with respect to a combination of many attributes). We present the Conditional Independence Regression CovariancE (CIRCE), a kernel-based technique for measuring conditional dependence with continuous Y and Z that is well-suited to learning complicated Φ(X) through stochastic gradient methods, both in settings where Y and Z are continuous but relatively "simple" and in settings where we must learn a structure on those variables as well. We will also discuss the use of this and related measures for statistical testing. (TCPL 201) |
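CIRCE is kernel-based, but its core quantity, a conditional covariance between Φ(X) and Z given Y, has a simple linear special case: residualize both on Y and take the covariance of the residuals. The toy sketch below illustrates only that linear special case, not CIRCE itself.

```python
import random

random.seed(0)

n = 5000
Y = [random.gauss(0.0, 1.0) for _ in range(n)]
Phi = [2.0 * v + random.gauss(0.0, 1.0) for v in Y]   # model output
Z = [v + random.gauss(0.0, 1.0) for v in Y]           # attribute, driven by Y only

def residualize(v, y):
    """Subtract from v its best linear predictor based on y."""
    my, mv = sum(y) / len(y), sum(v) / len(v)
    b = sum((a - my) * (c - mv) for a, c in zip(y, v)) / \
        sum((a - my) ** 2 for a in y)
    return [c - mv - b * (a - my) for a, c in zip(y, v)]

def cond_cov(phi, z, y):
    """Linear proxy for E[cov(phi, z | y)]: covariance of residuals."""
    rp, rz = residualize(phi, y), residualize(z, y)
    return sum(a * b for a, b in zip(rp, rz)) / len(y)

c_indep = cond_cov(Phi, Z, Y)                  # ~0: independent given Y
Z2 = [z + 0.5 * p for z, p in zip(Z, Phi)]     # add direct dependence on Phi
c_dep = cond_cov(Phi, Z2, Y)                   # clearly nonzero
```

Replacing the linear residualization with kernel regression features is, roughly, what turns this proxy into a measure that detects nonlinear conditional dependence.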
16:30 - 17:30 |
Hao Zhang: Group discussions (TCPL 201) |
17:30 - 19:30 |
Dinner ↓ A buffet dinner is served daily between 5:30pm and 7:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building. (Vistas Dining Room) |
19:30 - 21:00 |
Informal gathering (Other (See Description)) |