“End? No, the journey doesn’t end here.”
— J.R.R. Tolkien
After reading this chapter you will be able to:
- Understand the roadmap to continued education about models and the
So you’ve completed STAT 420, where do you go from here? Now that you understand the basics of linear modeling, there is a wide world of applied statistics waiting to be explored. We’ll briefly detail some resources and discuss how they relate to what you have learned in STAT 420.
RStudio has recently released version 1.0! This is exciting for a number of reasons, especially the release of
R Notebooks combine the RMarkdown you have already learned with the ability to work interactively.
In this textbook, much of the data we have seen has been nice and tidy. It was rectangular where each row was an observation and each column was a variable. This is not always the case! Many packages have been developed to deal with data, and force it into a nice format, which is called tidy data, that we can then use for modeling. Often during analysis, this is where a large portion of your time will be spent.
R community has started to call this collection of packages the Tidyverse. It was once called the Hadleyverse, as Hadley Wickham has authored so many of the packages. Hadley is writing a book called
R for Data Science which describes the use of many of these packages. (And also how to use some to make the modeling process better!) This book is a great starting point for diving deeper into the
R community. The two main packages are
tidyr both of which are used internally in RStudio.
In this course, we have mostly used the base plotting methods in
R. When working with tidy data, many users prefer to use the
ggplot2 package, also developed by Hadley Wickham. RStudio provides a rather detailed “cheat sheet” for working with
ggplot2. The community maintains a graph gallery of examples.
Use of the
manipulate package with RStudio gives the ability to quickly change a static graphic to become interactive.
RStudio has made it incredibly easy to create data products through the use of Shiny, which allows for the creation of web applications with
R. RStudio maintains an ever-growing tutorial and gallery of examples.
In the ANOVA chapter, we briefly discussed experimental design. This topic could easily be its own class, and is currently an area of revitalized interest with the rise of A/B testing. Two more classic statistical references include Statistics for Experimenters by Box, Hunter, and Hunter as well as Design and Analysis of Experiments by Douglas Montgomery. There are several
R packages for design of experiments, listed in the CRAN Task View.
Using models for prediction is the key focus of machine learning. There are many methods, each with its own package, however
R has a wonderful package called
caret, Classification And REgression Training, which provides a unified interface to training these models. It also contains various utilities for data processing and visualization that are useful for predictive modeling.
Applied Predictive Modeling by Max Kuhn, the author of the
caret package, is a good general resource for predictive modeling, which obviously utilizes
R. An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani is a gentle introduction to machine learning from a statistical perspective which uses
R and picks up right where this course stops. This is based on the often referenced The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman. Both are freely available online.
While it probably isn’t the best tool for the job,
R now has the ability to train deep neural networks via TensorFlow.
In this class we have only considered independent data. What if data is dependent? Time Series is the area of statistics which deals with this issue, and could easily span multiple courses.
The primary textbook for STAT 429: Time Series Analysis at the University of Illinois that is free is:
- Time Series Analysis and Its Applications: With R Examples by Shumway and Stoffer
When performing time series analysis in
R you should be aware of the many packages that are useful for analysis. It should be hard to avoid the
zoo packages. Often the most difficult part will be dealing with time and date data. Make sure you are utilizing one of the many packages that help with this.
In this class, we have worked within the frequentist view of statistics. There is an entire alternative universe of Bayesian statistics.
Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan by John Kruschke is a great introduction to the topic. It introduces the world of probabilistic programming, in particular Stan, which can be used in both
R and Python.
R will be called a “slow” language, for two reasons. One, because many do not understand
R. Two, because sometimes it really is. Luckily, it is easy to extend
R via the
Rcpp package to allow for faster code. Many modern
R packages utilize
Rcpp to achieve better performance.
Also, don’t forget that previously in this book we have outlined a large number of
R resources. Now that you’ve gotten started with
R many of these will be much more useful.
If any of these topics interest you, and you would like more information, please don’t hesitate to start a discussion on the forums!