Statistics & Data Science Ph.D. Candidate
When
Noon – 2 p.m., April 15, 2026
Where
Title: Statistical Methods for Measuring Data Similarity and Transfer Learning
Abstract: This dissertation addresses the problem of learning from multiple datasets, where the fundamental challenge is to extract and transfer shared information across datasets to enhance predictive performance while avoiding the negative effects of dataset-specific differences. In particular, we tackle three critical questions: How do we measure similarity between datasets for supervised learning tasks? How do we optimally leverage similar datasets to improve learning? How do we control negative transfer when datasets exhibit substantial differences?
To answer these questions, we develop novel statistical learning frameworks with theoretical guarantees and efficient computational algorithms. Through comprehensive numerical experiments on both simulated and real-world datasets, we demonstrate the performance of the proposed methods and compare them with existing approaches. Under this unified theme, the dissertation consists of the following three major parts.
In the first part, we propose a new metric, the Cross-Learning Score (CLS), to measure similarity between datasets in supervised learning contexts. Unlike existing methods that compare only marginal feature distributions, CLS considers the conditional distribution of the response given the covariates and quantifies similarity by evaluating how well models trained on one dataset perform on the other. This provides a more meaningful measure of similarity for prediction tasks. We also establish theoretical results, develop practical estimation methods, and demonstrate the advantages of CLS over existing approaches through comprehensive numerical studies.
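The abstract does not give the exact definition of CLS, but the cross-evaluation idea it describes can be sketched as follows. This is an illustrative stand-in, not the dissertation's metric: the function `cross_learning_score`, the choice of linear models, and the use of symmetrized R-squared are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def cross_learning_score(Xa, ya, Xb, yb, make_model=LinearRegression):
    """Illustrative cross-learning similarity: fit a model on each dataset,
    score it on the other, and average the two out-of-dataset R^2 values.
    High values suggest the two datasets share the same conditional law
    of the response given the covariates."""
    m_ab = make_model().fit(Xa, ya)  # trained on A, evaluated on B
    m_ba = make_model().fit(Xb, yb)  # trained on B, evaluated on A
    return 0.5 * (r2_score(yb, m_ab.predict(Xb)) +
                  r2_score(ya, m_ba.predict(Xa)))

rng = np.random.default_rng(0)
beta = np.array([1.0, -2.0, 0.5])
X1 = rng.normal(size=(200, 3)); y1 = X1 @ beta + rng.normal(scale=0.1, size=200)
X2 = rng.normal(size=(200, 3)); y2 = X2 @ beta + rng.normal(scale=0.1, size=200)
X3 = rng.normal(size=(200, 3)); y3 = X3 @ (-beta) + rng.normal(scale=0.1, size=200)

s_sim = cross_learning_score(X1, y1, X2, y2)  # same conditional law: high
s_dis = cross_learning_score(X1, y1, X3, y3)  # reversed law: low/negative
print(s_sim, s_dis)
```

Note that the two pairs here have identical marginal feature distributions, so a marginal-comparison method would call them equally similar; only the cross-evaluation step detects the difference in the conditional distribution.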
In the second part, we develop a new multi-source transfer learning method based on CLS. Specifically, we propose ART-CLS, which assigns each source dataset an adaptive weight according to its CLS, that is, its measured similarity to the target data: the more similar a source dataset is to the target, the higher its weight, and the more dissimilar, the lower its weight. This approach naturally differentiates the contribution of each source dataset, enhancing knowledge transfer from similar datasets while minimizing negative transfer from dissimilar ones.
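The abstract specifies only that weights increase with similarity; the actual ART-CLS weighting scheme is not given here. As a minimal sketch of similarity-driven weighting, one could map CLS values to normalized weights with a softmax, so that strongly dissimilar sources are pushed toward zero weight. The function name and the softmax choice are assumptions for illustration only.

```python
import numpy as np

def adaptive_weights(cls_scores, temperature=1.0):
    """Illustrative similarity-based weighting: a softmax over CLS values.
    Higher CLS (more similar source) -> larger weight; very dissimilar
    sources receive weights near zero, limiting negative transfer.
    `temperature` controls how sharply weights concentrate on the
    most similar sources."""
    s = np.asarray(cls_scores, dtype=float) / temperature
    w = np.exp(s - s.max())  # subtract max for numerical stability
    return w / w.sum()

# Hypothetical CLS values of three source datasets relative to one target.
w = adaptive_weights([0.95, 0.60, -0.40])
print(np.round(w, 3))
```

The weighted sources could then be combined, for example, by averaging source-trained predictors with these weights, though how ART-CLS combines them is a detail of the dissertation, not of this sketch.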
In the third part, we consider the problem of creating comparable subsets from two heterogeneous subgroups within one dataset, which has broad applications in comparing average treatment effects across subgroups in clinical studies. We employ propensity scores to align subjects from different subgroups so that treatment effects can be properly evaluated and meaningfully compared. We apply the method to the VITAL trial, which evaluated the protective effect of omega-3 supplementation against heart attack. The findings suggest that omega-3 intake can significantly reduce the risk of heart attack in African American participants but not in other subgroups.
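The propensity-score alignment step can be illustrated with a toy example. This is not the dissertation's procedure or the VITAL data: the simulated covariates, the greedy one-to-one matching, and the caliper value are all assumptions chosen to show how matching on an estimated subgroup-membership probability yields comparable subsets.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Toy covariates: two subgroups with shifted covariate distributions.
X_a = rng.normal(0.0, 1.0, size=(300, 2))
X_b = rng.normal(0.5, 1.0, size=(300, 2))
X = np.vstack([X_a, X_b])
g = np.r_[np.zeros(300), np.ones(300)]  # subgroup indicator

# Propensity score here: estimated P(subgroup = 1 | covariates).
ps = LogisticRegression().fit(X, g).predict_proba(X)[:, 1]

# Greedy 1:1 nearest-neighbor matching on the propensity score, with a
# caliper so that subjects without a close counterpart are discarded;
# the retained pairs form the comparable subsets.
caliper = 0.05
unmatched_b = list(np.where(g == 1)[0])
pairs = []
for i in np.where(g == 0)[0]:
    j = min(unmatched_b, key=lambda k: abs(ps[i] - ps[k]))
    if abs(ps[i] - ps[j]) <= caliper:
        pairs.append((i, j))
        unmatched_b.remove(j)

print(f"{len(pairs)} matched pairs retained out of 300 possible")
```

Within each matched pair the two subjects have nearly identical estimated subgroup-membership probabilities, so differences in outcomes between the matched subsets are not driven by the covariate imbalance between subgroups.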
Overall, this dissertation demonstrates that carefully measuring and accounting for differences between datasets is essential for reliable data integration, transfer learning, and scientific discovery.