Is the Statistics GIDP the Right Program For You?

This may be the most exciting time ever to choose a career path in Statistics. You are about to decide whether graduate education in statistics is part of your career path. On these pages, we will help you answer an important question, “What program is right for me?”, by addressing the following questions about Statistics and the University of Arizona.

In the information age, we have seen ever-growing sources of Big Data alongside an ever-increasing number of researchers engaged in Small Data projects: health informatics, agricultural and environmental genomics, satellite and telescope observations, medical imaging, crowdsourcing from online communities, and business intelligence, among myriad other sources of data for analysis. With the demand for more and better data, Data Science and Statistics are critical to advancing every aspect of human activity and research. The ease of data gathering has produced massive stockpiles of publicly accessible data in commerce and research across every area of study, including, for example, genomic and other omic analyses, personalized medicine, astronomy and cosmology, planetary sciences, climate modeling, politics and legal policy, environmental impacts, economics and finance, traffic engineering, materials analysis, image processing, resource optimization, and every form of informatics. The list is endless. Data Science and Statistics form the common core underpinning all of these areas, providing the standard of validity by which progress is measured in every research and commercial field.

The centrality of statistical thinking is now accepted by the public, by government, by industry, and by researchers across a broad range of endeavors. This viewpoint has become common wisdom in a world of big data. Indeed, following the 2013 International Year of Statistics, several statistics professional societies convened the Future of the Statistical Sciences Workshop.

The report from this meeting makes the following observations.

Statistics can be most succinctly described as the science of uncertainty. While the words “statistics” and “data” are often used interchangeably by the public, statistics actually goes far beyond the mere accumulation of data. The role of a statistician is:

  • To design the acquisition of data in a way that minimizes bias and confounding factors and maximizes information content
  • To verify the quality of the data after it is collected
  • To analyze data in a way that produces insight or information to support decision-making

These processes always take into explicit account the stochastic uncertainties present in any real-world measuring process, as well as the systematic uncertainties that may be introduced by the experimental design. This recognition is an inherent characteristic of statistics, and this is why we describe it as the “science of uncertainty,” rather than the “science of data.”

This discussion and others highlight that the challenges in both data management and data analysis have required statisticians to embrace new approaches to meet the broadening reliance on those with advanced statistical skills. The necessity for such skills within a collaborative environment is well summarized in the introduction to Frontiers in Massive Data Analysis.

Inference is the problem of turning data into knowledge, where knowledge often is expressed in terms of variables (e.g., a patient’s general state of health, or a shopper’s tendency to buy) that are not present in the data per se, but are present in models that one uses to interpret the data. Statistical principles are needed to justify the inferential leap from data to knowledge, and many difficulties arise in attempting to bring these principles to bear on massive data. Operating in the absence of these principles may yield results that are not useful at best or harmful at worst. In any discussion of massive data and inference, it is essential to be aware that it is quite possible to turn data into something resembling knowledge but which actually is not. Moreover, it can be quite difficult to know that this has happened.

Even when the initial data set is big, it is important to acknowledge that the inferential questions rest on the frequently arduous task of refining and cleaning data, so that in the end inference, be it hypothesis testing, estimation, prediction, or classification, is performed on precisely constructed small data sets that answer targeted research questions. In these circumstances, choosing the best available method (parametric versus nonparametric methods, Bayesian versus classical approaches, supervised versus unsupervised learning, survival analysis versus logistic regression, or fixed effects versus random effects versus mixed effects models) can be “make or break” for a project.
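As a small illustration of how one such methodological choice plays out in practice, the sketch below (using simulated data and the standard `scipy` library; the group names and effect size are invented for the example) contrasts a parametric two-sample t-test with its nonparametric counterpart, the Mann-Whitney U test, on heavy-tailed data that strains the t-test's normality assumption:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated responses for two hypothetical groups; the second group is
# shifted upward, and both are heavy-tailed (t distribution with 3 df),
# which violates the approximate-normality assumption of the t-test.
control = rng.standard_t(df=3, size=40)
treated = rng.standard_t(df=3, size=40) + 0.8

# Parametric choice: Welch's t-test (assumes roughly normal groups,
# but does not assume equal variances).
t_stat, t_p = stats.ttest_ind(control, treated, equal_var=False)

# Nonparametric choice: Mann-Whitney U test (compares ranks, so it
# needs no normality assumption).
u_stat, u_p = stats.mannwhitneyu(control, treated)

print(f"Welch t-test p-value:   {t_p:.4f}")
print(f"Mann-Whitney p-value:   {u_p:.4f}")
```

With heavy-tailed data like this, the rank-based test typically retains more power; which test is appropriate is exactly the kind of judgment a statistician is trained to make.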

It is for you to decide what areas of human endeavor are right for you and how the knowledge you gain during your graduate education will bring you closer to that goal. Enjoy exploring these pages and send your questions to the Program Coordinator, Melanie Bowman at