Data is meaningless without analysis. With today’s explosive growth of data, scientists must be able to tame large swaths of information to transform raw bytes into meaningful, actionable research artifacts and insights.
Enter the R software environment. R is an open source environment for statistical computing and data exploration. It has become the de facto standard for statistical analysis, gaining a substantial following since its inception. In fact, R is so popular that universities have begun building statistics courses around it. As a motivating example, Western Michigan University requires computer science undergraduates to take “Introduction to Statistics using R” as a requirement for graduation.
While R is seen as a Swiss Army knife for statistical analysis, it remains the user’s responsibility to check the underlying assumptions of a statistical routine. In other words, R does not prevent users from blindly applying functions without verifying assumptions such as normality. With a little understanding of statistics, R becomes a powerful tool for data analysis.
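As a small illustration of this point, consider the listing below. The data is simulated and the numbers are placeholders; the point is that R will happily run a t-test whether or not its normality assumption holds, so it falls to the user to check first.

set.seed(1)
x <- rnorm(30, mean = 10, sd = 2)   # simulated measurements

# Check normality before reaching for a test that assumes it.
shapiro.test(x)                     # a small p-value casts doubt on normality

# Only then apply a test that assumes normality, such as the t-test.
t.test(x, mu = 10)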
Peter Gustafson, assistant professor of Mechanical and Aeronautical Engineering at Western Michigan University, is a strong proponent of using such statistical software packages in research. Gustafson observes that “many published papers with statistical analyses do not present enough information to be convincing.” He further notes that “it would require a little more effort [with R] to provide enough information to make these much more credible.”
In short, Gustafson provides a simple guideline for publishing statistically meaningful results. His seven-step guideline is summarized as follows.
1. Select a model and collect pre-experiment information. A mathematical model is constructed for measuring a particular phenomenon; an example model could be linear or quadratic. Pre-experiment information might consist of a small number of data points from previous experiments.
2. Select a hypothesis and choose an alpha level. A hypothesis must be selected before the data is examined. The alpha level bounds the rate of type I errors (false positives); e.g., with an alpha level of 0.05 and a true null hypothesis, only about one in twenty independent replications of the experiment would be expected to produce a false positive.
3. Design an experiment, setting a beta level and choosing n. The beta level corresponds to type II errors, while n is the sample size. A type II error is a false negative, so beta is typically minimized (equivalently, the power, 1 − beta, is maximized). An experimenter usually wants n to be large, but experimental costs often prohibit collecting large samples. A sketch of this sample-size calculation in R follows the list.
4. Collect the data. Data must be collected in a controlled manner, and the experimenter must note any abnormal behaviors.
5. Run the analysis. The power of R shines in this stage, as the experimenter has several tools at their disposal. Gustafson recommends using multiple statistical procedures to test the same hypothesis; when these tests disagree with each other, either the data violates model assumptions or a type I or type II error may have occurred, and further study is recommended. Two such procedures appear in the second sketch following this list.
6. Check the model assumptions. This step is critical in evaluating the integrity of any statistical procedure. Gustafson advises the experimenter to “check that errors are normally distributed, symmetrically distributed, or have no outliers.”
7. Report the results, with supporting evidence. Finally, having followed the preceding six steps, the experimenter can be confident in the results, with conclusions justified through sound statistical analysis.
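For step 3, base R provides power.t.test() to compute the required sample size directly. In this minimal sketch, the effect size, standard deviation, and power are assumed values chosen only for illustration.

power.t.test(delta = 0.5,       # smallest effect worth detecting (assumed)
             sd = 1,            # expected standard deviation (assumed)
             sig.level = 0.05,  # the alpha level from step 2
             power = 0.80)      # power = 1 - beta
# The output reports n, the required sample size per group.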
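Steps 5 and 6 might look like the following sketch. The data here is simulated and the group names are our own; the idea is to test one hypothesis with both a parametric and a nonparametric procedure, then check the normality assumption behind the parametric test.

set.seed(42)
control   <- rnorm(30, mean = 10, sd = 2)   # simulated control group
treatment <- rnorm(30, mean = 11, sd = 2)   # simulated treatment group

# Step 5: test the same hypothesis with two procedures; the parametric
# t-test and the nonparametric Wilcoxon rank-sum test should broadly agree.
t.test(control, treatment)
wilcox.test(control, treatment)

# Step 6: check the normality assumption behind the t-test.
shapiro.test(control)
shapiro.test(treatment)
qqnorm(control); qqline(control)            # visual check for normality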
For more information, Gustafson provides a motivating example, complete with code listings, that uses R to work through a simple regression.
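Those listings are not reproduced here, but a minimal sketch of a simple regression in R, on simulated data with hypothetical variable names, might look like this:

set.seed(7)
x <- 1:50
y <- 2.5 * x + rnorm(50, sd = 5)   # a linear trend plus noise

fit <- lm(y ~ x)                   # fit the linear model from step 1
summary(fit)                       # coefficients, R-squared, p-values
confint(fit)                       # confidence intervals for slope and intercept

# Check the model assumptions on the residuals (step 6).
qqnorm(resid(fit)); qqline(resid(fit))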
R is available as source or as a precompiled binary for many systems, covering Microsoft Windows, Apple OS X, and UNIX. R is released under the GPLv2 license, so users are free to modify the code as long as the licensing remains intact.