Statistics & Data Science Ph.D. Candidate
When
9:30 – 11:30 a.m., May 4, 2026
Where
Title: Statistical Methods for Nanopore Sequencing Data Analysis
Abstract: Nanopore Sequencing is a long-read sequencing technology that infers nucleotide sequences by measuring electrical signals generated during molecular translocation through a nanopore. It plays an increasingly important role in genomics and molecular biology through long-read sequencing, modification detection, and real-time analysis. Its data present distinctive statistical challenges due to complex mean structure, dependence, and measurement noise. This dissertation develops statistical methodology for analyzing Nanopore Sequencing data, with contributions spanning raw signal modeling, sequence decoding, and downstream genomic inference.
The first theme of this dissertation studies a statistical problem motivated by raw nanopore signals: how to reliably test and estimate serial dependence in time series with frequent mean shifts. We develop a novel and robust approach for testing autocorrelation under such settings, together with rate-optimal estimators for autocovariances. Theoretical guarantees, including asymptotic results and minimax risk bounds, are established for the proposed methods. Numerical studies and real-data analyses reveal statistically significant positive short-range dependence in Nanopore Sequencing data.
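As a rough illustration of why mean shifts complicate dependence testing, the sketch below simulates independent Gaussian noise around a piecewise-constant mean and compares two lag-1 autocorrelation estimates. It is not the dissertation's method. The simulation settings (segment length, shift size, seed) and the "oracle" estimator, which assumes the change points are known, are hypothetical choices made only for this example.

```python
import random


def lag1_autocorr(x):
    """Naive lag-1 sample autocorrelation using one global mean."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    num = sum(centered[i] * centered[i + 1] for i in range(n - 1))
    den = sum(c * c for c in centered)
    return num / den


def simulate(n_segments=40, seg_len=50, shift=5.0, seed=0):
    """Independent N(0, 1) noise around a mean alternating between 0 and `shift`."""
    rng = random.Random(seed)
    signal, segments = [], []
    for k in range(n_segments):
        level = shift * (k % 2)
        seg = [level + rng.gauss(0.0, 1.0) for _ in range(seg_len)]
        segments.append(seg)
        signal.extend(seg)
    return signal, segments


def oracle_lag1_autocorr(segments):
    """Lag-1 autocorrelation after removing each segment's own mean.

    Assumes the change points are KNOWN, which is not the case in practice.
    """
    num = 0.0
    den = 0.0
    for seg in segments:
        m = sum(seg) / len(seg)
        c = [v - m for v in seg]
        num += sum(c[i] * c[i + 1] for i in range(len(c) - 1))
        den += sum(v * v for v in c)
    return num / den


signal, segments = simulate()
naive = lag1_autocorr(signal)
oracle = oracle_lag1_autocorr(segments)
print(f"naive lag-1 autocorrelation:  {naive:.3f}")
print(f"oracle lag-1 autocorrelation: {oracle:.3f}")
```

Although the noise here is independent by construction, the naive estimate comes out strongly positive because the shifting mean is mistaken for serial dependence, while the oracle estimate stays near zero. A practical test cannot rely on known change points, which is the gap a robust procedure has to close.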
The second theme focuses on basecalling, the task of translating raw electrical signals into nucleotide sequences. We investigate how training data design influences deep learning–based basecallers in the presence of nucleotide modifications, and show that incorporating diverse modified signals improves generalization to previously unseen modifications. These findings highlight the importance of robust representation learning for accurate and adaptable basecalling.
The final theme considers downstream genomic inference for long-read sequencing data. We develop LRDE, a hurdle negative binomial framework for differential expression analysis that accounts for excess zeros and overdispersion while borrowing information across genes to improve estimation stability under limited replicates. Simulation studies and analyses of Oxford Nanopore Technologies (ONT) and Pacific Biosciences datasets demonstrate improved accuracy and robustness relative to existing methods.
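For readers unfamiliar with hurdle models, the sketch below shows the textbook hurdle negative binomial probability mass function: a point mass at zero combined with a zero-truncated negative binomial for positive counts. The abstract does not give LRDE's actual parametrization or estimation procedure, so the function names and the parameters `pi_zero`, `mu`, and `size` are assumptions made for illustration only.

```python
import math


def nb_log_pmf(y, mu, size):
    """Log-pmf of a negative binomial with mean `mu` and dispersion `size`.

    Under this parametrization the variance is mu + mu**2 / size,
    so smaller `size` means stronger overdispersion.
    """
    p = size / (size + mu)
    return (
        math.lgamma(y + size)
        - math.lgamma(size)
        - math.lgamma(y + 1)
        + size * math.log(p)
        + y * math.log(1.0 - p)
    )


def hurdle_nb_pmf(y, pi_zero, mu, size):
    """Hurdle negative binomial pmf.

    A count is zero with probability `pi_zero`; otherwise it follows a
    zero-truncated negative binomial with parameters `mu` and `size`.
    """
    if y == 0:
        return pi_zero
    p_zero_nb = math.exp(nb_log_pmf(0, mu, size))
    truncated = math.exp(nb_log_pmf(y, mu, size)) / (1.0 - p_zero_nb)
    return (1.0 - pi_zero) * truncated


# Hypothetical parameter values, chosen only to exercise the function.
total = sum(hurdle_nb_pmf(y, pi_zero=0.3, mu=5.0, size=2.0) for y in range(2000))
print(f"P(Y = 0) = {hurdle_nb_pmf(0, 0.3, 5.0, 2.0):.3f}")
print(f"sum of pmf over 0..1999 = {total:.6f}")
```

Separating the zero probability from the positive counts lets the zero rate vary independently of the mean expression level, which is one common way to handle excess zeros in count data.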
Zoom link: https://arizona.zoom.us/j/83393936618