
Colloquium Series

Spring 2020

Spring 2020 Colloquia will take place Mondays, 3:30pm in Math 501, unless otherwise noted.

  • February 3, 2020 - Speaker - Ian McKeague, Department of Biostatistics, Columbia University

Title:  Functional data analysis for activity profiles from wearable devices

Abstract: This talk introduces a nonparametric framework for analyzing physiological sensor data collected from wearable devices. The idea is to apply the stochastic process notion of occupation times to construct activity profiles that can be treated as monotonically decreasing functional data.

Whereas raw sensor data typically need to be pre-aligned before standard functional data methods are applicable, activity profiles are automatically aligned because they are indexed by activity level rather than by follow-up time. We introduce a nonparametric likelihood ratio approach that makes efficient use of the activity profiles to provide a simultaneous confidence band for their mean (as a function of activity level), along with an ANOVA type test. These procedures are calibrated using bootstrap resampling. Unlike many nonparametric functional data methods, smoothing techniques are not needed. Accelerometer data from subjects in a U.S. National Health and Nutrition Examination Survey (NHANES) are used to illustrate the approach. The talk is based on joint work with Hsin-wen Chang (Academia Sinica, Taipei).
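A minimal numpy sketch of the occupation-time construction (toy accelerometer counts; the function name, epoch length, and level grid are ours, not the speaker's):

```python
import numpy as np

def activity_profile(series, levels, dt=1.0):
    """Occupation-time activity profile: for each activity level a,
    the total time the series spends at or above a. By construction
    it is monotonically non-increasing in the level."""
    series = np.asarray(series, dtype=float)
    return np.array([dt * np.count_nonzero(series >= a) for a in levels])

# Toy minute-level accelerometer counts for one day (1440 one-minute epochs)
rng = np.random.default_rng(0)
counts = rng.gamma(shape=2.0, scale=50.0, size=1440)
levels = np.linspace(0.0, 300.0, 61)
profile = activity_profile(counts, levels)
```

Because each profile is indexed by activity level rather than follow-up time, profiles from different subjects can be averaged or compared pointwise without any pre-alignment.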

  • March 2, 2020 - Speaker - Yichuan Zhao, Professor, Department of Mathematics and Statistics, Georgia State University

Title:  Penalized Empirical Likelihood for the Sparse Cox Model

Abstract: The current penalized regression methods for selecting predictor variables and estimating the associated regression coefficients in the Cox model are mainly based on partial likelihood. In this paper, an empirical likelihood method is proposed for the Cox model in conjunction with appropriate penalty functions when the dimensionality of the data is high. Large-sample theoretical properties of the resulting estimator are established. Simulation studies suggest that empirical likelihood works better than partial likelihood in terms of selecting correct predictors without introducing additional model error. The well-known primary biliary cirrhosis data set is used to illustrate the proposed empirical likelihood method.

This is joint work with Dongliang Wang and Tong Tong Wu.

Fall 2019

Fall 2019 colloquia will take place Mondays, 3:30 to 4:30pm in MATH 501, unless otherwise noted.

  • September 9, 2019 - Speaker - Hui Zou, Department of Statistics, University of Minnesota

Title:  A nearly condition-free fast algorithm for Gaussian graphical model recovery

Abstract: Many methods have been proposed for estimating Gaussian graphical models. The most popular are the graphical lasso and neighborhood selection, because both are computationally very efficient and come with some theoretical guarantees. However, their theory for graph recovery requires a very stringent structural assumption (a.k.a. the irrepresentable condition). We argue that replacing the lasso penalty in these two methods with a non-convex penalty does not fundamentally remove this theoretical limitation, because another structural condition is still required. As an alternative, we propose a new algorithm for graph recovery that is very fast, easy to implement, and enjoys strong theoretical properties under basic sparsity assumptions.
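For readers unfamiliar with the neighborhood-selection baseline mentioned above, here is a self-contained sketch on a toy chain graph (the plain coordinate-descent lasso and the tuning value are illustrative choices of ours, not the speaker's new algorithm):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Plain coordinate-descent lasso (columns of X assumed standardized)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual
            rho = X[:, j] @ r_j
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return beta

def neighborhood_selection(X, lam):
    """Meinshausen-Buhlmann graph recovery: lasso-regress each node on the
    rest; join an edge when either coefficient is nonzero (the "OR" rule)."""
    n, p = X.shape
    adj = np.zeros((p, p), dtype=bool)
    for j in range(p):
        others = [k for k in range(p) if k != j]
        beta = lasso_cd(X[:, others], X[:, j], lam)
        for k, b in zip(others, beta):
            if abs(b) > 1e-8:
                adj[j, k] = True
    return adj | adj.T

# Toy chain graph x0 - x1 - x2, plus an isolated node x3
rng = np.random.default_rng(1)
n = 400
x0 = rng.normal(size=n)
x1 = 0.8 * x0 + rng.normal(scale=0.5, size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x0, x1, x2, x3])
X = (X - X.mean(0)) / X.std(0)

graph = neighborhood_selection(X, lam=0.2 * n)
```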

  • October 14, 2019 - Speaker - Jian Liu, Associate Professor, Systems and Industrial Engineering, University of Arizona

Title: Functional Data Analytics for Detecting Bursts in Water Distribution Systems

Abstract: Bursts in water distribution systems (WDSs) are a special type of short-term, high-flow water loss that can be a significant component of a system’s water balance. Since WDSs are usually deployed underground, bursts are difficult to detect before their catastrophic results are observed on the ground surface. Continuous hydraulic data streams collected from automatic meter reading and advanced metering infrastructure systems make it possible to detect bursts in a WDS through data analytics. Existing methods based on conventional statistical process control charts may not be effective, as the temporal correlations embedded in the data streams are not explicitly considered. In this seminar, new control charts for burst detection based on functional data analysis will be presented. Both Phase-I and Phase-II monitoring schemes are investigated. The temporal correlations are modeled from empirical data streams continuously collected from the same WDS, and their statistical properties are studied to reflect the inherent uncertainties induced by customers’ daily use in the absence of bursts. Bursts are then detected by comparing a new hydraulic data stream to these inherent uncertainties through statistical control charting. The new method significantly reduces the rates of false alarms and missed detections. Its effectiveness is demonstrated with a case study based on numerical simulation of a real-world WDS.
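A stripped-down illustration of the Phase-I/Phase-II control-charting idea (all numbers invented; note this toy deliberately ignores the temporal-correlation modeling that is the core contribution of the talk):

```python
import numpy as np

rng = np.random.default_rng(2)
hours = np.arange(24)
base = 10.0 + 5.0 * np.sin(2 * np.pi * (hours - 6) / 24)  # diurnal demand

# Phase I: burst-free historical days estimate the inherent uncertainty
history = base + rng.normal(scale=0.8, size=(60, 24))
mu = history.mean(axis=0)
sigma = history.std(axis=0, ddof=1)

def burst_alarm(day, k=4.0):
    """Phase II: flag hours whose flow exceeds the upper control limit
    mu + k*sigma (bursts inflate flow, so the chart is one-sided)."""
    return (day - mu) / sigma > k

normal_day = base + rng.normal(scale=0.8, size=24)
burst_day = normal_day.copy()
burst_day[3:6] += 8.0   # hypothetical three-hour burst
```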

  • November 12, 2019 - Speaker - Dennis Lin, Department of Statistics, Penn State University **NOTE** Tuesday at 1:00pm in Math 501

Title: Interval Data: Modeling and Visualization

Abstract: Interval-valued data are a special type of symbolic data composed of the lower and upper bounds of intervals. They can arise from climate change, fluctuations in stock prices, daily blood pressures, aggregation of large datasets, and many other situations. This type of data contains rich information useful for decision making. Predicting interval-valued data is a challenging task because the predicted lower bounds of intervals should not cross over the corresponding upper bounds. In this project, a regularized artificial neural network (RANN) is proposed to address this difficult problem. It provides a flexible trade-off between prediction accuracy and interval crossing. An empirical study indicates the usefulness and accuracy of the proposed method. The second portion of this project provides new insights for visualizing interval data. Two plots are proposed: the segment plot and the dandelion plot. The new approach complements existing visualization methods and provides much more information. Theorems have been established for reading these new plots, and examples are given for illustration.

  • December 2, 2019 - Speaker - Wenguang Sun, Associate Professor, Dept of Data Sciences and Operations, USC

Title: Large-Scale Estimation and Testing Under Heteroscedasticity

Abstract: The simultaneous inference of many parameters, based on a corresponding set of observations, is a key research problem that has received much attention in the high-dimensional setting. Many practical situations involve heterogeneous data where the most common setting involves unknown effect sizes observed with heteroscedastic errors. Effectively pooling information across samples while correctly accounting for heterogeneity presents a significant challenge in large-scale inference. The first part of my talk addresses the selection bias issue in large-scale estimation problem by introducing the “Nonparametric Empirical Bayes Smoothing Tweedie” (NEST) estimator, which efficiently estimates the unknown effect sizes and properly adjusts for heterogeneity via a generalized version of Tweedie’s formula. The second part of my talk focuses on a parallel issue in multiple testing. We show that there can be a significant loss in information from basing hypothesis tests on standardized statistics rather than the full data. We develop a new class of heteroscedasticity–adjusted ranking and thresholding (HART) rules that aim to improve existing methods by simultaneously exploiting commonalities and adjusting heterogeneities among the study units. The common message in both NEST and HART is that the variance structure, which is subsumed under standardized statistics, is highly informative and can be exploited to achieve higher power in both shrinkage estimation and multiple testing problems.
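Tweedie's formula, which NEST generalizes, can be sketched in a few lines for the homoscedastic case (the kernel bandwidth and simulation setup are ours; NEST itself handles heteroscedastic errors, which this sketch does not):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
mu = rng.normal(size=n)        # unknown effect sizes
z = mu + rng.normal(size=n)    # observed with unit-variance noise

def tweedie(z, sigma=1.0, h=0.3):
    """Tweedie's formula E[mu|z] = z + sigma^2 * (log f)'(z), with the
    marginal density f estimated by a Gaussian kernel (bandwidth h)."""
    diffs = z[:, None] - z[None, :]
    k = np.exp(-0.5 * (diffs / h) ** 2)
    f = k.sum(axis=1)
    fprime = (-diffs / h ** 2 * k).sum(axis=1)
    return z + sigma ** 2 * fprime / f

shrunk = tweedie(z)
mse_raw = np.mean((z - mu) ** 2)
mse_shrunk = np.mean((shrunk - mu) ** 2)
```

On this toy normal-normal example the empirical-Bayes shrinkage beats the raw observations in mean squared error, without ever modeling the prior explicitly.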

  • December 9, 2019 - Speaker - Vladimir Minin, Department of Statistics, UC Irvine

Title: Statistical challenges and opportunities in stochastic epidemic modeling

Abstract: Stochastic epidemic models describe how infectious diseases spread through populations. These models are constructed by first assigning individuals to compartments (e.g., susceptible, infectious, and removed) and then defining a stochastic process that governs the evolution of sizes of these compartments through time. Stochastic epidemic models and their deterministic counterparts are useful for evaluating strategies for controlling the infectious disease spread and for predicting the future course of a given epidemic. However, fitting these models to data turns out to be a challenging task, because even the most vigilant infectious disease surveillance programs offer only noisy snapshots of the number of infected individuals in the population. Such indirect observations of the infectious disease spread result in high dimensional missing data (e.g., number and times of infections) that needs to be accounted for during statistical inference. I will demonstrate that combining stochastic process approximation techniques with high dimensional Markov chain Monte Carlo algorithms makes Bayesian data augmentation for stochastic epidemic models computationally tractable. I will present examples of fitting stochastic epidemic models to  incidence time series data collected during outbreaks of Influenza and Ebola viruses.
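A minimal Gillespie-style simulation of the stochastic SIR model described above (parameter values are invented for illustration):

```python
import numpy as np

def gillespie_sir(n_pop, i0, beta, gamma, rng, t_max=500.0):
    """Exact stochastic simulation of an SIR epidemic (Gillespie algorithm).
    Events: infection at rate beta*S*I/N, removal at rate gamma*I."""
    s, i, r, t = n_pop - i0, i0, 0, 0.0
    times, infected = [t], [i]
    while i > 0 and t < t_max:
        rate_inf = beta * s * i / n_pop
        rate_rem = gamma * i
        total = rate_inf + rate_rem
        t += rng.exponential(1.0 / total)      # waiting time to next event
        if rng.random() * total < rate_inf:
            s, i = s - 1, i + 1                # infection event
        else:
            i, r = i - 1, r + 1                # removal event
        times.append(t)
        infected.append(i)
    return np.array(times), np.array(infected), r

rng = np.random.default_rng(4)
times, infected, removed = gillespie_sir(1000, 5, beta=0.4, gamma=0.1, rng=rng)
```

In practice only noisy, aggregated versions of such trajectories are observed, which is exactly the missing-data problem the talk addresses.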

Spring 2019

Spring 2019 colloquia will take place Mondays, 3:20 - 4:30pm, in ENR2 S395.

  • April 1, 2019 - Speaker - Yunpeng Zhao, School of Mathematical and Natural Sciences, ASU

Title:  Network Structure Inference from Grouped Data

Abstract: Statistical network analysis typically deals with inference concerning various parameters of an observed network. In several applications, especially in the social sciences, behavioral information concerning groups of subjects is observed. In such data sets, even though a network structure is present, it is not typically observed; these are referred to as implicit networks. In this presentation, we describe a model-based framework to uncover the implicit network structure and address related inferential questions. We also describe extensions of the methodology to time series of grouped observations.

  • April 15, 2019 - Speaker - Jianqiang Cheng, Assistant Professor, Systems and Industrial Engineering, University of Arizona

Title: Computationally Efficient Approximations for Distributionally Robust Optimization

Abstract: Distributionally robust optimization (DRO) has gained increasing popularity because it offers a way to overcome the conservativeness of robust optimization without requiring the specificity of stochastic optimization. On the computational side, many practical DRO instances can be equivalently (or approximately) formulated as semidefinite programming (SDP) problems via conic duality of the moment problem. However, despite being theoretically solvable in polynomial time, SDP problems in practice are computationally challenging and quickly become intractable with increasing problem size. In this talk, we adopt the principal component analysis (PCA) approach to solve DRO problems with different types of ambiguity sets. We show that the PCA approximation yields a relaxation of the original problem and derive theoretical bounds on the gap between the original problem and its PCA approximation. Furthermore, an extensive numerical study shows the strength of the proposed approximation method in terms of solution quality and runtime.

  • April 29, 2019 - Speaker - Yves Lussier, Department of Medicine, University of Arizona

  • March 18, 2019 - Speaker - Anthony Howell, Associate Professor, School of Economics, Peking University

Professor Howell is a candidate for the Assistant Professor faculty position in Spatial Statistics and Quantitative Methods.

Fall 2018

Fall 2018 colloquia will take place Mondays, 3:30 - 4:30pm, in ENR2 S395.

  • December 3, 2018 - Speaker - Hidehiko Ichimura, Professor of Economics, University of Arizona

Title: Locally Robust Semiparametric Estimation

Abstract: We give a general construction of debiased/locally robust/orthogonal (LR) moment functions for GMM, where the derivative with respect to first-step nonparametric estimation is zero and, equivalently, first-step estimation has no effect on the influence function. This construction consists of adding an estimator of the influence function adjustment term for first-step nonparametric estimation to the identifying or original moment conditions. We also give numerical methods for estimating LR moment functions that do not require an explicit formula for the adjustment term.

LR moment conditions have reduced bias and so are important when the first step is machine learning. We derive LR moment conditions for dynamic discrete choice based on first step machine learning estimators of conditional choice probabilities.

We provide simple and general asymptotic theory for LR estimators based on sample splitting. This theory uses the additive decomposition of LR moment conditions into an identifying condition and a first step influence adjustment. Our conditions require only mean square consistency and a few (generally either one or two) readily interpretable rate conditions.

LR moment functions have the advantage of being less sensitive to first step estimation. Some LR moment functions are also doubly robust meaning they hold if one first step is incorrect. We give novel classes of doubly robust moment functions and characterize double robustness. For doubly robust estimators our asymptotic theory only requires one rate condition.
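As a concrete instance of a doubly robust moment function, here is a sketch of the classic augmented inverse-propensity-weighted (AIPW) estimator of an average treatment effect; the simulation design is ours, and the talk's construction is far more general:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 4000
x = rng.normal(size=n)
ps_true = 1.0 / (1.0 + np.exp(-x))         # propensity score
d = (rng.random(n) < ps_true).astype(float)
y = 2.0 * d + x + rng.normal(size=n)       # true average treatment effect = 2

def aipw(y, d, mu1, mu0, ps):
    """Doubly robust (AIPW) moment: identifying part mu1 - mu0 plus the
    influence-function adjustment for the first-step estimates."""
    adj = d * (y - mu1) / ps - (1.0 - d) * (y - mu0) / (1.0 - ps)
    return float(np.mean(mu1 - mu0 + adj))

# Correct propensity, deliberately wrong outcome model:
# the adjustment term restores consistency (double robustness).
ate_bad_outcome = aipw(y, d, mu1=np.zeros(n), mu0=np.zeros(n), ps=ps_true)

# Correct outcome model, deliberately wrong propensity:
ate_bad_ps = aipw(y, d, mu1=2.0 + x, mu0=x, ps=np.full(n, 0.5))
```

Either misspecified first step still yields an estimate near the true effect of 2, illustrating why only one rate condition is needed for such estimators.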

  • November 5, 2018 - Speaker - Huiyan Sang, Associate Professor, Department of Statistics, Texas A&M University

Title: Spatial Homogeneity Pursuit of Regression Coefficients for Large Datasets

Abstract: Spatial regression models have been widely used to describe the relationship between a response variable and explanatory variables over a region of interest, under the assumption that the responses are spatially correlated. Nearly all existing work assumes the regression coefficients are constant or smoothly varying over the region. In this article, we propose a spatially clustered coefficient regression model to capture spatially varying patterns, especially clustering patterns, in the effects of explanatory variables. In many applications, especially with large spatial datasets, it is of great interest to practitioners to identify such clusters, which allow straightforward interpretation of local associations among variables. The method incorporates spatial neighboring information through a carefully constructed regularization to automatically detect change points in space and to achieve computational scalability. Numerical studies show that it is very effective in capturing not only clustered coefficients but also smoothly varying coefficients, because of its strong local adaptivity. This flexibility allows researchers to explore various spatial structures in regression coefficients. Theoretical properties of the new estimator are also established. The method is applied to explore the relationship between temperature and salinity of sea water in the Atlantic basin, which provides insightful information on the evolution of individual water masses and on the pathway and strength of the meridional overturning circulation in oceanography.

Bio: Dr. Huiyan Sang received her Ph.D in Statistics from Duke University in 2008, and B.S. in Mathematics and Applied Mathematics from Peking University, China, in 2004.  She joined the faculty at Texas A&M University in 2008, where she is currently an Associate Professor in the Department of Statistics. Her research focuses on spatial statistics, extreme values and computational methods for large datasets.

  • October 1, 2018 - Speaker - Ning Hao, Assistant Professor, Department of Mathematics, University of Arizona

Title: A super scalable algorithm for short segment detection

Abstract: In many applications such as copy number variant detection, the goal is to identify short segments on which the observations have different means or medians from the background. Those segments are usually short and hidden in a long sequence, and hence are very challenging to find.  In this talk, we will introduce a super scalable short segment detection algorithm. This nonparametric method clusters the locations where the observations exceed a threshold for segment detection. It is computationally efficient and does not rely on Gaussian noise assumption. Moreover, we propose a framework to assign significance levels for detected segments. We demonstrate the advantages of our proposed method by theoretical, simulation, and real data studies. This talk is based on a work joint with Yue Niu, Feifei Xiao and Heping Zhang.
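The clustering-of-exceedances idea can be sketched directly (the threshold, gap, and cluster-size values are illustrative, not the calibrated choices from the talk):

```python
import numpy as np

def detect_segments(x, threshold, gap=2, min_count=4):
    """Cluster the locations where |x| exceeds the threshold: exceedances
    at most `gap` apart join one cluster; clusters with fewer than
    `min_count` exceedances are discarded as noise."""
    idx = np.flatnonzero(np.abs(x) > threshold)
    segments, cluster = [], []
    for i in idx:
        if cluster and i - cluster[-1] > gap:
            if len(cluster) >= min_count:
                segments.append((cluster[0], cluster[-1]))
            cluster = []
        cluster.append(i)
    if len(cluster) >= min_count:
        segments.append((cluster[0], cluster[-1]))
    return segments

rng = np.random.default_rng(5)
x = rng.normal(size=2000)
x[400:420] += 5.0       # two short hidden segments with elevated means
x[1500:1510] += 5.0
found = detect_segments(x, threshold=2.5)
```

No Gaussian assumption enters the detection step itself: only a threshold and a clustering rule, which is what makes the approach so scalable.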

Bio: Ning Hao is an assistant professor in Mathematics Department and Statistics GIDP.  He received his B.S. in Mathematics from Peking University, and Ph.D. from the Department of Mathematics at Stony Brook University. He spent one year as a research associate at the statistics lab, Princeton University before joining University of Arizona. His research interests include high dimensional statistical learning, change-point detection and bioinformatics.


  • September 10, 2018 - Speaker - Xiaoxiao Sun, Assistant Professor in the Department of Epidemiology and Biostatistics, University of Arizona

Title:  Optimal Penalized Function-on-Function Regression

Abstract: Many scientific studies collect data where the response and predictor variables are both functions of time, location, or some other covariate. Understanding the relationship between these functional variables is a common goal in such studies. Motivated by two real-life examples, we present a function-on-function regression model that can be used to analyze this kind of functional data. Our estimator of the 2D coefficient function is the optimizer of a form of penalized least squares where the penalty enforces a certain level of smoothness on the estimator. Our first result is a representer theorem, which states that the exact optimizer of the penalized least squares actually resides in a data-adaptive finite-dimensional subspace, although the optimization problem is defined on a function space of infinite dimensions. This theorem allows us to easily incorporate Gaussian quadrature into the optimization of the penalized least squares, which can be carried out through standard numerical procedures. We also show that our estimator achieves the minimax convergence rate in mean prediction under the framework of function-on-function regression. Extensive simulation studies demonstrate the numerical advantages of our method over existing ones, and a sparse functional data extension is also introduced. The proposed method is then applied to our motivating examples: the benchmark Canadian weather data and a histone regulation study.

BIO:  Xiaoxiao Sun, Ph.D., is an Assistant Professor in the Department of Epidemiology and Biostatistics at the Mel and Enid Zuckerman College of Public Health. His research focus is developing theoretically justifiable and computationally efficient methods for complex and big data arising in data-rich areas, such as genomics, social media, and neuroscience. In his previous research, he has developed several statistical methods for time course omics data. He is currently focusing on building the novel computational framework to analyze super-large datasets efficiently. He earned his Ph.D. in Statistics from the University of Georgia in 2018. Prior to receiving his Ph.D. degree, he obtained BS and MS in Statistics from the Central University of Finance and Economics in Beijing.


Spring 2018

Spring 2018 colloquia will take place Mondays, 3pm - 4pm, in ENR2 S395.

The dates for the 2018 colloquia are:

  • February 5 - Yuhong Yang, Faculty Member in the School of Statistics, University of Minnesota
  • March 12 - Jacob Bien, Assistant Professor of Data Sciences and Operations, USC
  • April 2 - Xiaoming Huo, Professor at the Stewart School of Industrial & Systems Engineering at Georgia Institute of Technology
  • April 30 - Qiang Zhou, Assistant Professor, Systems & Industrial Engineering, University of Arizona
  • September 3
  • October 1
  • November 5
  • December 3

More details will be posted in the coming weeks.

Statistics GIDP Colloquium: Monday, April 30, 2018

Speaker: Qiang Zhou, University of Arizona

Dr. Qiang Zhou is an Assistant Professor in the Department of Systems and Industrial Engineering, University of Arizona, and a faculty member of the UA Statistics GIDP. Before moving to UA, he was an Assistant Professor in the Department of Systems Engineering and Engineering Management, City University of Hong Kong, from 2012 to 2016. Dr. Zhou received his B.S. degree in Automotive Engineering and M.S. degree in Mechanical Engineering from Tsinghua University, Beijing, China, in 2005 and 2007, and his M.S. degree in Statistics and Ph.D. degree in Industrial Engineering from the University of Wisconsin-Madison, in 2010 and 2011. His research focuses on advanced industrial data analytics for engineering decision making and system performance improvement. Application areas of his research include energy storage systems, semiconductor manufacturing, nanomaterial fabrication, and telecommunications.

Title: Pairwise Metamodeling of Multivariate Output Computer Models

Abstract:  Gaussian process (GP) is a popular method for emulating deterministic computer simulation models. Its natural extension to computer models with multivariate outputs employs a multivariate Gaussian process (MGP) framework. Nevertheless, with significant increase in the number of design points and the number of model parameters, building an MGP model is a very challenging task. Under a general MGP model framework with nonseparable covariance functions, we propose an efficient meta-modeling approach featuring a pairwise model building scheme. The proposed method has excellent scalability even for a large number of output levels. Some properties of the proposed method have been investigated and its performance has been demonstrated through several numerical examples. Experimental designs for building such metamodels will also be briefly discussed.

Statistics GIDP Colloquium: Monday, April 2, 2018

Speaker: Xiaoming Huo, Georgia Tech

Xiaoming Huo is a professor at the Stewart School of Industrial & Systems Engineering at Georgia Tech. Huo's research interests include statistics, computing, and data science. He has made numerous contributions on topics such as sparse representation, wavelets, and statistical problems in detectability. He has been a senior member of IEEE since May 2004 and won the Georgia Tech Sigma Xi Young Faculty Award in 2005. His work led to an interview by Emerging Research Fronts in June 2006 in the field of Mathematics, in which one paper is selected every two months. Huo is a fellow of the ASA and an associate editor for Technometrics. He is the executive director of TRIAD (the Transdisciplinary Research Institute for Advancing Data Science) and an associate director of the Master of Science in Analytics program.

Title: Statistical and Numerical Efficiency

Abstract: I will describe two recent projects; both integrate statistics and computing. In the first project, we study how to construct a statistical inference procedure that is both computationally efficient and backed by theoretical guarantees on its statistical performance. Testing independence plays a fundamental role in many statistical techniques. Among the nonparametric approaches, distance-based methods (such as distance correlation based hypothesis testing for independence) have numerous advantages compared with many alternatives. A known limitation of distance-based methods is their computational complexity: when the sample size is n, a distance-based method, which typically requires computing all pairwise distances, has complexity of order O(n^2). Recent advances have shown that in the univariate case, a fast method with O(n log n) computational complexity and O(n) memory requirement exists. In this talk, I introduce a test of independence based on random projection and distance correlation, which achieves nearly the same power as the state-of-the-art distance-based approach, works in the multivariate case, and enjoys O(n K log n) computational complexity and O(max{n, K}) memory requirement, where K is the number of random projections. Note that savings are achieved when K < n / log n. We name our method Randomly Projected Distance Covariance (RPDC). The theoretical analysis takes advantage of techniques on random projection rooted in contemporary machine learning. Numerical experiments demonstrate the efficiency of the proposed method relative to several competitors.
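A sketch of the random-projection idea behind RPDC, using the O(n^2) textbook distance-correlation estimator in place of the fast O(n log n) univariate algorithm the talk relies on (the projection count and toy data are ours):

```python
import numpy as np

def dcor(x, y):
    """Sample distance correlation via double-centered distance matrices
    (the O(n^2) textbook estimator, for illustration only)."""
    def centered(v):
        d = np.abs(v[:, None] - v[None, :])
        return d - d.mean(0) - d.mean(1)[:, None] + d.mean()
    A, B = centered(x), centered(y)
    dcov2 = (A * B).mean()
    dvar = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(max(dcov2, 0.0) / dvar)

def rpdc(X, Y, n_proj=20, rng=None):
    """Randomly projected distance covariance (sketch): average univariate
    distance correlations over random one-dimensional projections."""
    rng = np.random.default_rng() if rng is None else rng
    vals = []
    for _ in range(n_proj):
        u = rng.normal(size=X.shape[1]); u /= np.linalg.norm(u)
        v = rng.normal(size=Y.shape[1]); v /= np.linalg.norm(v)
        vals.append(dcor(X @ u, Y @ v))
    return float(np.mean(vals))

rng = np.random.default_rng(6)
n = 300
X = rng.normal(size=(n, 3))
Y_dep = X ** 2 + 0.1 * rng.normal(size=(n, 3))   # dependent but uncorrelated
Y_ind = rng.normal(size=(n, 3))

stat_dep = rpdc(X, Y_dep, rng=rng)
stat_ind = rpdc(X, Y_ind, rng=rng)
```

Even though X and X^2 are uncorrelated for symmetric X, the distance-based statistic separates the dependent pair from the independent one, which is the kind of nonlinear dependence correlation tests miss.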

In the second project, we study how the non-convex penalization can be unified under the framework of difference-of-convex (DC), which is a subfield of optimization. We then argue that many existing statistical procedures can be treated as special cases of DC. Theory on both the numerical efficiency and the statistical optimality can be derived.

Statistics GIDP Colloquium: Monday, March 12, 2018

Speaker: Jacob Bien, USC

Title: High-Dimensional Variable Selection When Features are Sparse

Abstract: It is common in modern prediction problems for many predictor variables to be counts of rarely occurring events.  This leads to design matrices in which a large number of columns are highly sparse. The challenge posed by such “rare features” has received little attention despite its prevalence in diverse areas, ranging from biology (e.g., rare species) to natural language processing (e.g., rare words).  We show, both theoretically and empirically, that not explicitly accounting for the rareness of features can greatly reduce the effectiveness of an analysis.  We next propose a framework for aggregating rare features into denser features in a flexible manner that creates better predictors of the response.  An application to online hotel reviews demonstrates the gain in accuracy achievable by proper treatment of rare words.  This is joint work with Xiaohan Yan.
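A crude stand-in for the aggregation idea: collapse columns whose nonzero rate falls below a cutoff into one aggregated count (the talk's actual framework aggregates flexibly along a tree structure; the cutoff here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(10)
n = 500
# 5 common count features and 25 rare ones (e.g., rare words)
X = np.hstack([
    rng.poisson(1.0, size=(n, 5)),
    rng.poisson(0.02, size=(n, 25)),
])

def aggregate_rare(X, min_rate=0.05):
    """Collapse columns whose nonzero rate is below min_rate into a
    single aggregated count column."""
    nonzero_rate = (X > 0).mean(axis=0)
    rare = nonzero_rate < min_rate
    agg = X[:, rare].sum(axis=1, keepdims=True)
    return np.hstack([X[:, ~rare], agg]), rare

X_agg, rare = aggregate_rare(X)
```

The aggregated column is far denser than any of its rare constituents, so a downstream regression can actually estimate a coefficient for it.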

Statistics GIDP Colloquium: Monday, February 5, 2018

Speaker: Yuhong Yang, University of Minnesota

Title: Treatment Allocations Based on Multi-Armed Bandit Strategies

Abstract: In practice of medicine, multiple treatments are often available to treat individual patients. The task of identifying the best treatment for a specific patient is very challenging due to patient inhomogeneity. Multi-armed bandit with covariates provides a framework for designing effective treatment allocation rules in a way that integrates the learning from experimentation with maximizing the benefits to the patients along the process.

In this talk, we present new strategies to achieve asymptotically efficient or minimax optimal treatment allocations. Since many nonparametric and parametric methods in supervised learning may be applied to estimating the mean treatment outcome functions (in terms of the covariates) but guidance on how to choose among them is generally unavailable, we propose a model combining allocation strategy for adaptive performance and show its strong consistency. When the mean treatment outcome functions are smooth, rates of convergence can be studied to quantify the effectiveness of a treatment allocation rule in terms of the overall benefits the patients have received.  A multi-stage randomized allocation with arm elimination algorithm is proposed to combine the flexibility in treatment outcome function modeling and a theoretical guarantee of the overall treatment benefits. Numerical results are given to demonstrate the performance of the new strategies.

The talk is based on joint work with Wei Qian.
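A toy epsilon-greedy allocation with covariate binning conveys the explore/exploit trade-off discussed above (all names and parameter values are ours; the talk's strategies are more sophisticated and come with optimality guarantees):

```python
import numpy as np

rng = np.random.default_rng(7)

def true_mean(arm, x):
    # Hypothetical outcomes: arm 0 suits low covariate values, arm 1 high ones
    return 1.0 - x if arm == 0 else x

n_rounds, n_bins, eps = 5000, 5, 0.1
counts = np.zeros((2, n_bins))
sums = np.zeros((2, n_bins))
correct = 0

for t in range(n_rounds):
    x = rng.random()                        # patient covariate
    b = min(int(x * n_bins), n_bins - 1)    # crude covariate binning
    if rng.random() < eps or counts[:, b].min() == 0:
        arm = int(rng.integers(2))          # explore
    else:
        arm = int(np.argmax(sums[:, b] / counts[:, b]))  # exploit
    reward = true_mean(arm, x) + rng.normal(scale=0.3)
    counts[arm, b] += 1
    sums[arm, b] += reward
    if t >= n_rounds // 2:                  # judge allocations after burn-in
        correct += int(arm == (1 if x > 0.5 else 0))

accuracy = correct / (n_rounds - n_rounds // 2)
```

After a burn-in, most patients receive the arm that is truly better for their covariate value; errors concentrate in the bin straddling the crossover point, which is exactly where smarter function estimation pays off.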

Fall 2017

Fall 2017 colloquia will take place Mondays, 3pm - 4pm, in ENR2 S395.

Statistics GIDP Colloquium: Monday, September 11, 2017

Speaker: Yue (Selena) Niu, University of Arizona 

Title: Reduced Rank Linear Discriminant Analysis

Abstract: Many high dimensional classification techniques have been developed recently. However, most work focuses only on the binary classification problem, and available classification tools for multi-class cases are either based on over-simplified covariance structures or are computationally complicated. In this talk, following the idea of reduced rank linear discriminant analysis, we introduce a new dimension reduction tool with the flavor of supervised principal component analysis. The proposed method is computationally efficient and can incorporate the correlation structure among the features. We illustrate our methods with simulated and real data examples.

Statistics GIDP Colloquium: Monday, October 2, 2017

Speaker: Yehua Li, Iowa State University

Title: Nested Hierarchical Functional Data Modeling and Inference for the Analysis of Functional Plant Phenotypes

Abstract:  In a plant science Root Image Study, the process of seedling roots bending in response to gravity is recorded using digital cameras, and the bending rates are modeled as functional plant phenotype data. The functional phenotypes are collected from seeds representing a large variety of genotypes and have a three-level nested hierarchical structure, with seeds nested in groups nested in genotypes. The seeds are imaged on different days of the lunar cycle, and an important scientific question is whether there are lunar effects on root bending. We allow the mean function of the bending rate to depend on the lunar day and model the phenotypic variation between genotypes, groups of seeds imaged together, and individual seeds by hierarchical functional random effects. We estimate the covariance functions of the functional random effects by a fast penalized tensor product spline approach, perform multi-level functional principal component analysis (FPCA) using the best linear unbiased predictor of the principal component scores, and improve the efficiency of mean estimation by iterative decorrelation. We choose the number of principal components using a conditional Akaike Information Criterion and test the lunar day effect using generalized likelihood ratio test statistics based on the marginal and conditional likelihoods. We also propose a permutation procedure to evaluate the null distribution of the test statistics. Our simulation studies show that our model selection criterion selects the correct number of principal components with remarkably high frequency, and the likelihood-based tests based on FPCA have higher power than a test based on working independence.

Statistics GIDP Colloquium: Monday, November 6, 2017 CANCELLED

Statistics GIDP Colloquium: Monday, November 20, 2017

Speaker: Xiaotong Shen, University of Minnesota

Title: Personalized Prediction and Recommender Systems

Abstract: Personalized prediction predicts a user's preference for a large number of items through user-specific as well as content-specific information, based on a very small amount of observed preference scores. In a sense, predictive accuracy depends on how to pool the information from similar users and items. Two major approaches are collaborative filtering and content-based filtering. Whereas the former utilizes information on users that think alike for a specific item, the latter acts on characteristics of the items that a user prefers; the recommender systems Grooveshark and Pandora are built on these two approaches, respectively. In this talk, I will review some recent advances in latent factor modeling and discuss various issues as well as scalable strategies based on a "divide-and-conquer" algorithm.
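A minimal latent factor model fit by stochastic gradient descent illustrates the collaborative-filtering setting (simulated low-rank preferences; the learning rate and regularization values are ad hoc):

```python
import numpy as np

rng = np.random.default_rng(8)
n_users, n_items, rank = 30, 40, 3

# Simulate a low-rank preference matrix; observe only a small fraction of it
U_true = rng.normal(size=(n_users, rank))
V_true = rng.normal(size=(n_items, rank))
R = U_true @ V_true.T
mask = rng.random(R.shape) < 0.3          # ~30% of scores observed

# Fit user and item factors by SGD on the observed entries only
U = 0.1 * rng.normal(size=(n_users, rank))
V = 0.1 * rng.normal(size=(n_items, rank))
lr, reg = 0.05, 0.01
obs = np.argwhere(mask)
for epoch in range(200):
    for i, j in obs:
        err = R[i, j] - U[i] @ V[j]
        U[i] += lr * (err * V[j] - reg * U[i])
        V[j] += lr * (err * U[i] - reg * V[j])

pred = U @ V.T
rmse_obs = np.sqrt(np.mean((R[mask] - pred[mask]) ** 2))
rmse_new = np.sqrt(np.mean((R[~mask] - pred[~mask]) ** 2))
```

The unobserved entries are predicted by pooling information across users and items through the shared factors, which is the core of the latent factor approach the talk reviews.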


Statistics GIDP Colloquium: Monday, December 4, 2017

Speaker: Edward Bedrick, University of Arizona

Title: Data Reduction Prior to Inference: Is it Sensible to Use Principal Component Scores to Make Group Comparisons in a Student's t-test or ANOVA?

Abstract: There has been significant recent development of statistical methods for inference with high-dimensional data.  Despite these developments, which include research by faculty at the UofA, biomedical researchers and computational scientists often use a simple two-step process to analyze high-dimensional data. First, the dimensionality is reduced using a standard technique such as principal component analysis; this is followed by a group comparison using a t-test or analysis of variance. In this talk, I will try to untangle a number of issues associated with this approach, starting with the simplest but most vexing question (since it is usually left unstated): what hypothesis is being tested? I will use a combination of approaches, including asymptotics, analytical construction of worst-case scenarios, and simulation based on actual data, to address whether this approach is sensible. Although the asymptotics consider a non-sparse setting, some discussion of the implications for sparse problems will be given.
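The two-step process under discussion can be sketched in a few lines; the following is a minimal illustration on simulated two-group data (the dimensions, effect size, and significance cutoff are all hypothetical). Note that the resulting t-test concerns the projection onto the first principal component, not the original variables, which is precisely the kind of ambiguity the talk untangles.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated high-dimensional data: 20 subjects per group, 50 variables,
# with a mean shift in every variable for group 2 (hypothetical setup).
n, p = 20, 50
group1 = rng.normal(size=(n, p))
group2 = rng.normal(size=(n, p)) + 0.8

X = np.vstack([group1, group2])

# Step 1: reduce dimension with PCA (via SVD of the centered data matrix).
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Xc @ Vt[0]          # scores on the first principal component

# Step 2: compare groups on the PC1 scores with a two-sample t-statistic.
a, b = pc1[:n], pc1[n:]
se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
t_stat = (a.mean() - b.mean()) / se
print(abs(t_stat) > 2)    # crude significance check on the scores
```

With the group shift spread across all 50 variables, PC1 aligns with the mean-difference direction and the t-statistic on the scores is large, even though no hypothesis about any individual original variable has been tested.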

Spring 2017


Statistics GIDP Colloquium: Monday, May 1, 2017.

Speaker: Ming Hu, Lerner Research Institute, Cleveland Clinic; 

Title: Statistical Methods, Computational Tools and Visualization of Hi-C Data​

Abstract: Harnessing the power of high-throughput chromatin conformation capture (3C) based technologies, we have recently generated a compendium of datasets to characterize chromatin organization across human cell lines and primary tissues. Knowledge revealed from these data facilitates deeper understanding of long range chromatin interactions (i.e., peaks) and their functional implications on transcription regulation and genetic mechanisms underlying complex human diseases and traits. However, various layers of uncertainties and complex dependency structure complicate the analysis and interpretation of these data. We have proposed hidden Markov random field (HMRF) based statistical methods, which properly address the complicated dependency issue in Hi-C data, and further leverage such dependency by borrowing information from neighboring pairs of loci, for more powerful and more reproducible peak detection. Through extensive simulations and real data analysis, we demonstrate the power of our methods over existing peak callers. We have applied our methods to the compendium of Hi-C data from 21 human cell lines and tissues, and have further developed an online visualization tool to facilitate identification of potential target gene(s) for the vast majority of non-coding variants identified from the recent waves of genome-wide association studies.

3:00 pm - 4:00 pm, Mathematics Building, room 501.


Statistics GIDP Colloquium: Friday, April 7, 2017.

Speaker: Sunder Sethuraman, Department of Mathematics, University of Arizona;

Title:  Consistency of modularity clustering and Kelvin's tiling problem

Abstract:  Given a graph, the popular `modularity' clustering method specifies a partition of the vertex set as the solution of a certain optimization problem.  In this talk, we will discuss consistency properties, or scaling limits, of this method with respect to random geometric graphs constructed from n i.i.d. points, V_n = {X_1, X_2, ..., X_n}, distributed according to a probability measure supported on a bounded domain in R^d. A main result is the following: suppose the number of clusters, or partitioning sets of V_n, is bounded above; then the discrete optimal modularity clusterings converge in a specific sense to a continuum partition of the underlying domain, characterized as the solution of a `soap bubble', or `Kelvin'-type, shape optimization problem.

3:00 pm - 4:00 pm, Mathematics Building, room 501.


Statistics GIDP Colloquium: Friday, March 3, 2017.

Speaker: Ming-Hung (Jason) Kao, Associate Professor, School of Mathematical and Statistical Sciences, Arizona State University;

Title: Experimental Designs for Functional Brain Imaging with fMRI

Abstract: Functional magnetic resonance imaging (fMRI) experiments are widely conducted in many fields for studying functions of the brain. One of the important first steps of such experiments is to select a good experimental design to allow for a valid and precise statistical inference. However, the identification and construction of high-quality fMRI designs can be quite challenging. In this talk, we introduce some methods for constructing good fMRI designs, and discuss the statistical optimality of these designs.

3:00 pm - 4:00 pm, Mathematics Building, room 501.


Statistics GIDP Colloquium: Friday, February 3, 2017.

Speaker: Yong Ge, Management Information Systems, University of Arizona;

Title: Point-of-Interest Recommendations in Location-based Social Networks

Abstract: With the rapid development of Location-based Social Network (LBSN) services, a large number of Points-Of-Interest (POIs) have become available, which consequently raises a great demand for building personalized POI recommender systems. A personalized POI recommender system can significantly assist users in finding their preferred POIs and help POI owners attract more customers. However, it is very challenging to develop a personalized POI recommender system because a user's check-in decision making process is very complex and can be influenced by many factors such as social network, geographical position, and the dynamics of user preferences. In the first part of this talk, we propose to divide the whole recommendation space into two parts, a social friend space and a user interest space, and we develop models for each space for generating recommendations. In the second part of this talk, we introduce a new ranking-based method for implicit-feedback recommendation. To evaluate the proposed methods, we conduct extensive experiments against many state-of-the-art baseline methods and evaluation metrics on real-world data sets.

Bio. Dr. Yong Ge is an assistant professor in the MIS Department at the University of Arizona. He received his Ph.D. in Information Technology from Rutgers, The State University of New Jersey in 2013, the M.S. degree in Signal and Information Processing from the University of Science and Technology of China (USTC) in 2008, and the B.E. degree in Information Engineering from Xi'an Jiaotong University in 2005. He received the ICDM-2011 Best Research Paper Award, Excellence in Academic Research (one per school) at Rutgers Business School in 2013, and the Dissertation Fellowship at Rutgers University in 2012. He has published prolifically in refereed journals and conference proceedings, such as IEEE TKDE, ACM TOIS, ACM TKDD, ACM TIST, ACM SIGKDD, and IEEE ICDM. His work has been supported by the UofA, NSF, and NIH.

3:00 pm - 4:00 pm, Mathematics Building, room 501.

Fall 2016

Statistics GIDP Colloquium: Wednesday, September 7, 2016.

  • Speaker: Matti Morzfeld, Department of Mathematics, University of Arizona
  • Title: U2 can UQ -- Projects and Life in Uncertainty
  • Abstract: 
    I will give an overview of the mathematical and computational problems I face when combining numerical models and data. I will first review basic tools such as Bayes' rule and importance sampling, then explain what difficulties arise when using these tools, and finally present two specific applications.
    The first application uses low-dimensional models to describe and predict reversals of the geomagnetic dipole; the second uses adaptive importance sampling to solve a parameter estimation problem in combustion modeling, leveraging the parallelism of DOE's supercomputers.
  • 3:00 pm - 4:00 pm, Mathematics Building, room 402.


  • Statistics GIDP Colloquium: Wednesday, October 5, 2016.
  • Speaker: Han Xiao, Dept of Statistics and Biostatistics, Rutgers University
  • Title: On the maximum cross correlations under high dimension
  • Abstract: Multiple time series often exhibit lead-lag relationships among their component series. It is very challenging to identify this type of relationship when the number of series is large. We study the lead-lag relationship in the high-dimensional context, using the maximum cross-correlations and some other variants. Asymptotic distributions are obtained. We also use the moving blocks bootstrap to improve the finite-sample performance.
  • 3:00 pm - 4:00 pm, Mathematics Building, room 501.
  • This talk will be preceded by a graduate student lunch - contact Kristina Souders for information.
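For readers unfamiliar with the basic statistic behind the talk, the maximum cross-correlation between a single pair of series can be sketched as follows. The simulated series and the planted lag are purely hypothetical, and this illustrates only the pairwise statistic, not the high-dimensional theory or the moving blocks bootstrap calibration discussed in the abstract.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two series with a known lead-lag relationship: y follows x with a
# delay of 3 steps, plus noise (a hypothetical setup for illustration).
T, true_lag = 500, 3
x = rng.normal(size=T)
y = np.roll(x, true_lag) + 0.5 * rng.normal(size=T)

def cross_corr(x, y, k):
    """Sample cross-correlation between x_t and y_{t+k}."""
    if k >= 0:
        a, b = x[:T - k], y[k:]
    else:
        a, b = x[-k:], y[:T + k]
    return np.corrcoef(a, b)[0, 1]

# Scan a window of lags and keep the one with the largest |correlation|.
lags = range(-10, 11)
ccs = [cross_corr(x, y, k) for k in lags]
best = max(zip(lags, ccs), key=lambda kv: abs(kv[1]))
print(best[0])  # estimated lag
```

With many component series, this maximization runs over all pairs and all lags simultaneously, which is what makes the null distribution of the overall maximum nontrivial in high dimensions.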


  • Statistics GIDP Colloquium: Wednesday, November 2, 2016.
  • Speaker: Haiquan Li, Assistant Professor, Director for Translational Bioinformatics, Department of Medicine, University of Arizona; 
  • Title: Scattered disease-linked variants and convergent functions: discovery from big data integration
  • Abstract: Genome-wide association studies (GWAS) have identified thousands of disease-linked single nucleotide polymorphisms (SNPs) in the human genome. Most of them have a small effect size (OR<1.4) and are located independently across multiple chromosomes. Owing to the issue of missing heritability, it remains unclear how they collectively cause disease. Classic tests of genetic interactions suffer from insufficient power. Here, we will present an integrative approach that leverages several omics datasets to obtain additional information beyond genotypes, thus reducing the number of hypotheses. We combine traditional semantic similarity for genes’ functions with very deep network permutations (100K times) to quantify the empirical significance of the downstream functional similarity of any pair of SNPs. This approach enabled us to discover a fundamental biological mechanism for complex diseases: SNPs associated with the same disease are more likely to associate with the same downstream genes, or functionally similar genes, than SNPs associated with unrelated diseases (OR>12). We also found that 40-50% of prioritized SNP pairs have significant genetic interactions in three independent GWAS datasets. These results provide new biological interpretations of genetic interactions and a “roadmap” of disease mechanisms emerging from GWAS SNPs, especially those outside coding regions.
  • 3:00 pm - 4:00 pm, Mathematics Building, room 501.


  • Statistics GIDP Colloquium: Wednesday, December 7, 2016.
  • Speaker: Timothy Hanson, Professor, Department of Statistics, University of South Carolina
  • Title: A unified framework for fitting Bayesian semiparametric models to arbitrarily censored spatial survival data
  • Abstract: A comprehensive, unified approach to modeling arbitrarily censored spatial survival data is presented for the three most commonly-used semiparametric models: proportional hazards, proportional odds, and accelerated failure time. Unlike many other approaches, all manner of censored survival times are simultaneously accommodated including uncensored, interval censored, current-status, left and right censored, and mixtures of these. Left truncated data are also accommodated leading to models for time-dependent covariates.  Both georeferenced (location observed exactly) and areally observed (location known up to a geographic unit such as a county) spatial locations are handled. Variable selection is also incorporated.  Model fit is assessed with conditional Cox-Snell residuals, and model choice carried out via LPML and DIC.  Baseline survival is modeled with a novel transformed Bernstein polynomial prior. All models are fit via new functions which call efficient compiled C++ in the R package spBayesSurv. The methodology is broadly illustrated with simulations and real data applications.  An important finding is that proportional odds and accelerated failure time models often fit significantly better than the commonly-used proportional hazards model.
  • 3:00 pm - 4:00 pm, Mathematics Building, room 501.


Spring 2016

  • Statistics GIDP Colloquium: Wednesday, February 3, 2016.
  • Speaker: Walt Piegorsch, PhD, University of Arizona, GIDP;
  • Title: Model uncertainty in environmental risk assessment
  • Abstract: Estimation of low-dose ‘benchmark’ points in environmental risk analysis is discussed. Focus is on the increasing recognition that model uncertainty and misspecification can drastically affect point estimators and confidence limits built from limited dose-response data, which in turn can lead to imprecise risk assessments with uncertain, even dangerous, policy implications. Some possible remedies are mentioned, including use of parametric (frequentist) model averaging over a suite of potential dose-response models, and nonparametric dose-response analysis via isotonic regression.  An example on formaldehyde toxicity illustrates the calculations.
  • 12:00 pm - 1:00 pm, Physics and Atmospheric Sciences Building, room 314.


  • Statistics GIDP Colloquium: Wednesday, March 2, 2016. 
  • Speaker: Zhaoxia Yu, PhD, University of California Irvine, Dept of Statistics;
  • Title: Strategies on Identifying Gene-Gene Interactions
  • Abstract: Characterizing gene-gene interactions is of fundamental importance in unraveling the etiology of complex human diseases. However, due to the ultra-high-dimensional nature of the problem, the degree to which genes jointly affect disease risk is largely unknown. Two major obstacles toward this goal are the enormous computational effort and the heavy burden of multiple testing involved in testing gene-gene interactions. In this talk I will discuss several strategies using three examples. In the first example, we derived closed-form and consistent estimates of interaction parameters for case-control data. The derived Wald tests gave very similar results to the gold standard but were ten times faster. In a study of multiple sclerosis, we identified interactions within the major histocompatibility complex region. In the second example, we used information that is independent of interaction testing to prioritize gene-gene pairs for the case-parents design. The application of this strategy provided suggestive evidence for interactions between two genomic regions: the major histocompatibility complex region on chromosome 6 and the killer-cell immunoglobulin-like receptor region on chromosome 19. In the last example, we borrowed information across distinct but similar diseases. We found that genes interacting in multiple sclerosis also interacted with each other in type 1 diabetes.
  • 12:00 pm - 1:00 pm, Physics and Atmospheric Sciences Building, room 314.


  • Statistics GIDP Colloquium: Wednesday, April 6, 2016. 
  • Speaker: Jie Chen, PhD, Georgia Regents University, Dept of Biostatistics & Epidemiology;
  • Title: Change point models in the Bayesian Perspective and their applications in CNV study
  • Abstract: Biomedical researchers now use advanced technologies, such as comparative genomic hybridization (CGH), array-based comparative genomic hybridization (aCGH), and high-throughput next-generation sequencing (NGS), to conduct DNA copy number experiments for detecting DNA copy number variations (CNVs), as cancer development, genetic disorders, and many other diseases are usually related to CNVs on the genome. Identifying the boundaries of CNV regions on a chromosome or a genome can be viewed as a change point problem of detecting signal/intensity changes present in the genomic data. The analysis of high-throughput genomic data for possible changes has become one of the most recent viable applications of statistical change point analysis. In this talk, I present several change point models suitable for formulating the different data types resulting from the aCGH and NGS technologies and provide Bayesian solutions to these models. Applications of these methods to tumor cell line data will also be given.
  • 12:00 pm - 1:00 pm, Physics and Atmospheric Sciences Building, room 314.


  • Statistics GIDP Colloquium: Monday, May 2, 2016. 
  • Speaker: Bikas Sinha, PhD, Retired Faculty, Indian Statistical Institute, Kolkata, India
  • Title: Mixture Experiments: Theory and Applications
  • Abstract: This is a review talk dealing briefly with mixture models, standard mixture designs and optimal mixture experiments. Some application areas will be highlighted. 
  • 12:00 pm - 1:00 pm, Physics and Atmospheric Sciences Building, room 314.



Fall 2015

  • Statistics GIDP Colloquium: Wednesday, September 2, 2015. Speaker: Neng Fan, Assistant Professor, Systems and Industrial Engineering Department, University of Arizona. Title: Learning from Data with Uncertainties via Data-Driven Optimization
  • 12:00 pm - 1:00 pm, Saguaro Hall 114.
  • Abstract: In the last several decades, many advanced technologies have been developed to collect and store data continuously, and data and decisions are more strongly linked together than ever before. In most cases, the data include many uncertainties, such as missing or incomplete information, measurement errors, and noise. Traditional machine learning methods for decisions deal with exact information in the data. Only to a limited extent has data uncertainty, modeled by support sets or mean or moment values, been considered for robust decisions. In this talk, we discuss statistical models for data uncertainties and data-driven optimization approaches for decision-making under uncertainty, especially in the case of big data. Some robust and chance-constrained optimization models and algorithms for support vector machines will be introduced, and numerical experiments will be performed to validate the proposed approaches.


  • Statistics GIDP Colloquium: Wednesday, September 30, 2015. Speaker: Clayton Morrison, Associate Professor, School of Information, University of Arizona.
  • Title: Finding Structure in Time: Inferring Structured Latent Sequences and Activity Descriptions
  • 12:00 pm - 1:00 pm, Modern Languages Building 410.
  • Abstract: Humans excel at understanding complex dynamic histories, recognizing relevant context and using that context to interpret events that are sometimes hierarchically and recursively structured.  Our research group has found the tools of Bayesian nonparametric modeling and inference well suited for approaching several aspects of these problems.  In this talk I present ongoing work on two applications that require methods for inferring structurally rich representations of time series: identifying context relevant to interpreting biochemical reactions described in cancer biology research papers, and constructing descriptions of coordinated activities from observations in video.


  • Statistics GIDP Colloquium: Wednesday, November 4, 2015. Speaker: Professor Avelino Arellano, Jr., Dept of Atmospheric Science, University of Arizona. 
  • Title: Towards Seamless Prediction of Chemical Weather
  • 12:00 pm - 1:00 pm, Modern Languages Building 410.
  • Abstract


  • Statistics GIDP Colloquium: Wednesday, December 2, 2015. Speaker: Professor Gen Li, Department of Biostatistics, Mailman School of Public Health, Columbia University.  
  • Title: Supervised Principal Component Analysis and Extensions

  • Abstract: It is increasingly common to have heterogeneous data sets measured on the same set of samples. Integrative analysis of multi-source data promises to reveal a more comprehensive picture of the underlying truth than individual analysis. In this talk, I will introduce a novel integrative dimension reduction framework called the Supervised Principal Component Analysis (SupPCA). The research is motivated by applications where people are interested in the low rank structure of some primary data while auxiliary variables are also available on the same set of samples. The proposed method can make use of the extra information in the auxiliary data to accurately extract underlying structures that are more interpretable. The model is formulated in a hierarchical fashion using latent variables, and subsumes many existing models as special cases. The asymptotic properties of parameter estimation are derived. We also extend the framework to accommodate special features, such as high-dimensional data, functional data, and multi-modal data. Applications to bioinformatics and business analytics problems demonstrate the advantage of the proposed methodology. 

  • 12:00 pm - 1:00 pm, Modern Languages Building 410.

Last updated 23 Jan 2020