Statistical methods are used at each step in an applied machine learning project.
This means it is important to have a strong grasp of the key findings from statistics and a working knowledge of relevant statistical methods.
Unfortunately, statistics is not covered in many computer science and software engineering degree programs. Even if it is, it may be taught in a bottom-up, theory-first manner, making it unclear which parts are relevant on a given project.
In this post, you will discover some top introductory books to statistics that I recommend if you are looking to jump-start your understanding of applied statistics.
I own copies of all of these books, but I don’t recommend you buy and read them all. As a start, pick one book and really read it.
Let’s get started.
This section is divided into 3 parts; they are:
- Popular Science
- Statistics Textbooks
- Statistical Research Methods
Popular science books on statistics are those books that wrap up the important findings from statistics, like the normal distribution and the central limit theorem, in stories and anecdotes.
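To see why a finding like the central limit theorem is worth a whole chapter of anecdotes, you can demonstrate it yourself in a few lines. This is a minimal sketch using only the Python standard library: means of samples drawn from a heavily skewed distribution still cluster tightly around the population mean, in an approximately normal bell shape.

```python
import random
import statistics

random.seed(42)

# Draw 10,000 samples of size 50 from a skewed (exponential) distribution
# and record the mean of each sample.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(50))
    for _ in range(10_000)
]

# The central limit theorem says these means are approximately normal,
# centered on the population mean (1.0 for this exponential) with a
# spread of about 1.0 / sqrt(50).
print(round(statistics.fmean(sample_means), 2))
print(round(statistics.stdev(sample_means), 2))
```

Plotting a histogram of `sample_means` makes the bell shape obvious, even though the underlying exponential distribution is nothing like a bell.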
Do not overlook these types of books.
I read them all the time, even though I have pored over statistics textbooks. The reasons I recommend them are:
- They’re quick and fun to read.
- They often give a fresh perspective on dry material.
- They’re written for a lay audience.
They will help show you why a working knowledge of statistics is important in a way that you will be able to connect to your specific needs in applied machine learning.
There are many great popular science books on statistics; the three I would recommend are:
Written by Charles Wheelan.
For those who slept through Stats 101, this book is a lifesaver. Wheelan strips away the arcane and technical details and focuses on the underlying intuition that drives statistical analysis. He clarifies key concepts such as inference, correlation, and regression analysis, reveals how biased or careless parties can manipulate or misrepresent data, and shows us how brilliant and creative researchers are exploiting the valuable data from natural experiments to tackle thorny questions.
Written by Leonard Mlodinow.
With the born storyteller’s command of narrative and imaginative approach, Leonard Mlodinow vividly demonstrates how our lives are profoundly informed by chance and randomness and how everything from wine ratings and corporate success to school grades and political polls is less reliable than we believe.
Written by Nate Silver.
Drawing on his own groundbreaking work, Silver examines the world of prediction, investigating how we can distinguish a true signal from a universe of noisy data. Most predictions fail, often at great cost to society, because most of us have a poor understanding of probability and uncertainty. Both experts and laypeople mistake more confident predictions for more accurate ones. But overconfidence is often the reason for failure. If our appreciation of uncertainty improves, our predictions can get better too. This is the “prediction paradox”: The more humility we have about our ability to make predictions, the more successful we can be in planning for the future.
Do you have a favorite popular science book on statistics?
Let me know in the comments below.
Statistics Textbooks
You need a solid reference text.
A textbook contains the theory, the explanations, and the equations for the methods you need to know.
Do not read these books cover to cover; rather, once you know what you need, dip into these books to learn about those methods.
In this section, I have included a mixture of books including (in order) a proper statistics textbook, a text for those with a non-math background, and a book for those with a programming background.
Pick one book that suits your background.
Written by Larry Wasserman.
The book includes modern topics like non-parametric curve estimation, bootstrapping, and classification, topics that are usually relegated to follow-up courses. The reader is presumed to know calculus and a little linear algebra. No previous knowledge of probability and statistics is required. Statistics, data mining, and machine learning are all concerned with collecting and analysing data.
Written by Timothy C. Urdan.
This introductory textbook provides an inexpensive, brief overview of statistics to help readers gain a better understanding of how statistics work and how to interpret them correctly. Each chapter describes a different statistical technique, ranging from basic concepts like central tendency and describing distributions to more advanced concepts such as t tests, regression, repeated measures ANOVA, and factor analysis. Each chapter begins with a short description of the statistic and when it should be used. This is followed by a more in-depth explanation of how the statistic works. Finally, each chapter ends with an example of the statistic in use, and a sample of how the results of analyses using the statistic might be written up for publication. A glossary of statistical terms and symbols is also included. Using the author’s own data and examples from published research and the popular media, the book is a straightforward and accessible guide to statistics.
Written by Peter Bruce and Andrew Bruce.
Statistical methods are a key part of data science, yet very few data scientists have any formal statistics training. Courses and books on basic statistics rarely cover the topic from a data science perspective. This practical guide explains how to apply various statistical methods to data science, tells you how to avoid their misuse, and gives you advice on what’s important and what’s not.
Many data science resources incorporate statistical methods but lack a deeper statistical perspective. If you’re familiar with the R programming language, and have some exposure to statistics, this quick reference bridges the gap in an accessible, readable format.
What is your favorite statistics textbook?
Let me know in the comments below.
Statistical Research Methods
Once you have the foundations under control, you need to know what statistical methods to use in different circumstances.
A lot of applied machine learning involves designing and executing experiments, and statistical methods are required for effectively designing those experiments and interpreting the results.
This means that you require a solid grasp of statistical methods in a research context.
This section provides a few key books on this topic.
It is hard to find good books on this topic that are not too theoretical or focused on the proprietary SPSS software platform. The first book is highly recommended and general, the second uses the free R platform, and the last is a classic textbook on the topic.
Written by Paul R. Cohen.
Computer science and artificial intelligence in particular have no curriculum in research methods, as other sciences do. This book presents empirical methods for studying complex computer programs: exploratory tools to help find patterns in data, experiment designs and hypothesis-testing tools to help data speak convincingly, and modeling tools to help explain data. Although many of these techniques are statistical, the book discusses statistics in the context of the broader empirical enterprise. The first three chapters introduce empirical questions, exploratory data analysis, and experiment design. The blunt interrogation of statistical hypothesis testing is postponed until chapters 4 and 5, which present classical parametric methods and computer-intensive (Monte Carlo) resampling methods, respectively. This is one of few books to present these new, flexible resampling techniques in an accurate, accessible manner.
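The computer-intensive resampling methods Cohen covers are easy to experiment with. As an illustrative sketch (not taken from the book), here is a bootstrap percentile confidence interval for a mean, using only the Python standard library and made-up data:

```python
import random
import statistics

random.seed(1)

# A small sample of observed values (made up for illustration).
data = [12.1, 9.8, 11.4, 10.2, 13.5, 9.1, 10.8, 12.9, 11.0, 10.5]

# Bootstrap: resample the data with replacement many times,
# recording the mean of each resample.
boot_means = sorted(
    statistics.fmean(random.choices(data, k=len(data)))
    for _ in range(5_000)
)

# A 95% percentile confidence interval for the mean.
low = boot_means[int(0.025 * len(boot_means))]
high = boot_means[int(0.975 * len(boot_means))]
print(f"mean={statistics.fmean(data):.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

The appeal of the bootstrap is exactly what Cohen emphasizes: no normality assumption is needed, just repeated resampling of the data you actually have.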
Written by Roy Sabo and Edward Boone.
This textbook will help graduate students in non-statistics disciplines, advanced undergraduate researchers, and research faculty in the health sciences to learn, use and communicate results from many commonly used statistical methods. The material covered, and the manner in which it is presented, describe the entire data analysis process from hypothesis generation to writing the results in a manuscript. Chapters cover, among other topics: one and two-sample proportions, multi-category data, one and two-sample means, analysis of variance, and regression. Throughout the text, the authors explain statistical procedures and concepts using a non-statistical language. This accessible approach is complete with real-world examples and sample write-ups for the Methods and Results sections of scholarly papers. The text also allows for the concurrent use of the programming language R, which is an open-source program created, maintained and updated by the statistical community. R is freely available and easy to download.
Written by George E. P. Box, J. Stuart Hunter, and William G. Hunter.
Rewritten and updated, this new edition of Statistics for Experimenters adopts the same approaches as the landmark First Edition by teaching with examples, readily understood graphics, and the appropriate use of computers. Catalyzing innovation, problem solving, and discovery, the Second Edition provides experimenters with the scientific and statistical tools needed to maximize the knowledge gained from research data, illustrating how these tools may best be utilized during all stages of the investigative process. The authors’ practical approach starts with a problem that needs to be solved and then examines the appropriate statistical methods of design and analysis.
Do you have a favorite book on statistical research methods?
Let me know in the comments below.
You need to have a grounding in statistics to be effective at applied machine learning.
This grounding does not have to come first, but it needs to happen at some point on your journey.
I think your path through statistics should start with a book, but it really must involve a lot of practice. It is an applied field. I recommend developing code examples for every key concept that you learn along the way.
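For example, after reading about the standard error of the mean, you might check the formula empirically with a short script. This is a sketch using only the Python standard library; the sample size and population parameters are arbitrary choices for the demonstration:

```python
import random
import statistics

random.seed(7)

n = 40
population_sd = 2.0

# Theory: the standard error of the sample mean is sd / sqrt(n).
theoretical_se = population_sd / n ** 0.5

# Practice: draw many samples of size n and measure the spread
# of their means directly.
means = [
    statistics.fmean(random.gauss(0.0, population_sd) for _ in range(n))
    for _ in range(5_000)
]
empirical_se = statistics.stdev(means)

print(f"theory={theoretical_se:.3f}, simulation={empirical_se:.3f}")
```

Writing these small checks turns each formula from something you memorize into something you have verified, which is the kind of practice I mean.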
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Have you read any great books on statistics?
Let me know in the comments below.