Think Stats: Probability and Statistics for Programmers
Think Stats is an introduction to probability and statistics for Python programmers. It emphasizes simple techniques you can use to explore real datasets and answer interesting questions. The book presents a case study using data from the National Institutes of Health, and readers are encouraged to work on a project with real datasets of their own.
If you have basic skills in Python, you can use them to learn concepts in probability and statistics. Think Stats is based on a Python library for probability distributions (PMFs and CDFs). Many of the exercises use short programs to run experiments and help readers develop understanding.
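To make the terms concrete: a PMF (probability mass function) maps each value to its probability, and a CDF (cumulative distribution function) maps each value to the probability of seeing that value or less. The sketch below uses plain dictionaries to show the idea; it is not the book's library, and the names `make_pmf` and `make_cdf` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def make_pmf(values):
    """Map each distinct value to its relative frequency in `values`."""
    counts = Counter(values)
    n = len(values)
    return {x: c / n for x, c in sorted(counts.items())}


def make_cdf(pmf):
    """Map each value x to the total probability of values <= x."""
    cdf = {}
    total = 0.0
    for x in sorted(pmf):
        total += pmf[x]
        cdf[x] = total
    return cdf


# A tiny made-up sample, used only to exercise the functions.
rolls = [1, 2, 2, 3, 3, 3]
pmf = make_pmf(rolls)
cdf = make_cdf(pmf)

for x in pmf:
    print(x, pmf[x], cdf[x])
```

Representing a distribution as a mapping from values to probabilities is what makes the later Bayesian examples tractable: updating a belief becomes a loop over the keys.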
Most introductory books don’t cover Bayesian statistics, but Think Stats is based on the idea that Bayesian methods are too important to postpone. By taking advantage of the PMF and CDF libraries, it is possible for beginners to learn the concepts and solve challenging problems.
Think Stats takes a computational approach, which has several advantages:
- Students write programs as a way of developing and testing their understanding. For example, they write functions to compute a least squares fit, residuals, and the coefficient of determination. Writing and testing this code requires them to understand the concepts and implicitly corrects misunderstandings.
- Students run experiments to test statistical behavior. For example, they explore the Central Limit Theorem (CLT) by generating samples from several distributions. When they see that the sum of values from a Pareto distribution doesn’t converge to a normal distribution, they remember the assumptions the CLT is based on.
- Some ideas that are hard to grasp mathematically are easy to understand by simulation. For example, we approximate p-values by running Monte Carlo simulations, which reinforces the meaning of the p-value.
- Using discrete distributions and computation makes it possible to present topics like Bayesian estimation that are not usually covered in an introductory class. For example, one exercise asks students to compute the posterior distribution for the ‘German tank problem,’ which is difficult analytically but surprisingly easy computationally.
- Because students work in a general-purpose programming language (Python), they are able to import data from almost any source. They are not limited to data that has been cleaned and formatted for a particular statistics tool.
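As an illustration of the computational style described above, here is a minimal sketch of the German tank problem with a discrete posterior. It is not the book's solution. It assumes a uniform prior over 1 to `max_tanks` total tanks, a single observed serial number, and that each serial number from 1 to N is equally likely to be seen; the function names and the values 60 and 1000 are hypothetical, chosen only for the example.

```python
def german_tank_posterior(observed_serial, max_tanks=1000):
    """Posterior over the total number of tanks N after seeing one serial.

    Assumptions (for illustration only): uniform prior on 1..max_tanks,
    and each serial 1..N is equally likely to be observed.
    """
    prior = 1.0 / max_tanks
    unnormalized = {}
    for n in range(1, max_tanks + 1):
        # Seeing serial k is impossible if there are fewer than k tanks;
        # otherwise it has probability 1/n.
        likelihood = 1.0 / n if n >= observed_serial else 0.0
        unnormalized[n] = prior * likelihood

    total = sum(unnormalized.values())
    return {n: p / total for n, p in unnormalized.items()}


def posterior_mean(posterior):
    """Expected value of N under the posterior."""
    return sum(n * p for n, p in posterior.items())


posterior = german_tank_posterior(observed_serial=60, max_tanks=1000)
most_likely = max(posterior, key=posterior.get)
print("most likely N:", most_likely)
print("posterior mean of N:", posterior_mean(posterior))
```

The most likely value is the observed serial itself, while the posterior mean is considerably larger and depends on the upper bound chosen for the prior, which is a useful point for students to discover by changing `max_tanks`.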