Choon Yong-Meaning From Data

Author of Book: Professor Michael Starbird
Date Read: October 12, 2024

Book Report

Book Report # 43 – Meaning From Data: Statistics Made Clear
Begin: 9/2/2024
Finish: 10/12/2024
Title: Meaning From Data: Statistics Made Clear
Author: Professor Michael Starbird
University of Texas at Austin.

Why I chose to read this book:
To learn more about statistics so I can tutor inmate students who are taking the Merced College Statistics Class.

What I learned from this book:
The trouble with data is that data do not arrive with meaning. Data are value-free and useless, or even misleading, until we learn to interpret them appropriately. Statistics provides the conceptual and procedural tools for drawing meaning from data. One of the great ideas of modern quantitative analysis of our world is that the uncertain and the unknown can be described quantitatively. Two major challenges of statistics are: 1) How can we describe and draw meaning from a collection of data when we know all the pertinent data? 2) How can we infer information about the whole population when we know data about only some of the population (a sample)? Statistics is becoming increasingly important as technological advances continue to bring large data sets and more detailed techniques of analysis within the range of practicality.
Describing Data and Inferring Meaning:
Articles about politics, elections, world conflict, economics, and business all centrally involve data and the interpretation of data. The fundamental challenge for statistics is to assemble data and to interpret them to provide meaning. Statistics and data share a grammatical issue: are they singular or plural? Data is the plural of datum, a single piece of information. Statistics can be singular or plural, depending on the meaning: statistics is the study of data, but statistics are bits of information.
Data and Distributions:
It is common to summarize a collection of data with a single number. The mean is obtained by adding up all the numbers and dividing the sum by the number of data points. The median is the middle value of the ordered list. Five values (minimum, first quartile, median, third quartile, and maximum) give a five-number summary of the data. Data items that lie far outside the range between the first and third quartiles are called outliers. Graphical representations of data can help us see the distribution of the data and the patterns in it. They can show us three aspects of a distribution: shape, center, and spread. We can use box plots, histograms, and scatter plots to see distributions and correlations.
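The summary numbers described above can be sketched in a few lines of Python. This is only an illustration: the scores are made-up data, the function names are hypothetical, and the 1.5 × IQR cutoff for outliers is a common convention assumed here, not something stated in the course. Textbooks also differ slightly in how they compute quartiles, so hand calculations may not match exactly.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # quantiles(n=4) returns the three cut points that split the data into quarters.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

def flag_outliers(values, k=1.5):
    """Flag values more than k * IQR beyond the quartiles (assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1  # interquartile range: the spread of the middle half of the data
    return [x for x in values if x < q1 - k * iqr or x > q3 + k * iqr]

scores = [2, 4, 4, 5, 7, 9, 10, 12, 40]  # made-up data with one unusually large value

print("mean:", statistics.mean(scores))
print("median:", statistics.median(scores))
print("five-number summary:", five_number_summary(scores))
print("outliers:", flag_outliers(scores))
```

In this made-up list the single large value pulls the mean up to about 10.3 while the median stays at 7.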
Inference – How Close? How Confident?
The principle of statistical inference is how we use information about just some members of a population to infer information about the whole population. Analyses of randomness and probability allow us to quantify our confidence in extrapolating from some of the data to the whole population. Randomness and probability are the cornerstones of all methods of hypothesis testing.
Describing Dispersion or Measuring Spread:
To describe a set of data, we have to confront the challenge of taking a list of numbers and putting some structure on them through which we can garner meaning. The mean and the median are both measures of central tendency. The median (the middle number of the ordered list) is not affected by outliers. The mean is affected significantly by outliers. The most common measure of dispersion or spread of data is the Standard Deviation (SD). The SD is the square root of the average squared distance of the data points from the mean. Like the mean, the SD is affected significantly by outliers. A histogram gives us a good visual sense of the distribution, including how spread out the data are. The five-number summary (minimum, first quartile, median, third quartile, and maximum) and the associated box plot give some sense of how the data are spread out. The SD is a numerical measure of roughly how far the data are, on average, from the mean.
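The definition of the SD given above can be followed step by step in Python. This is a minimal sketch with made-up numbers and a hypothetical function name. It divides by the number of data points, which matches the "average squared distance" wording; many textbooks divide by one less than that number when the data are a sample.

```python
import math

def population_sd(values):
    """Square root of the average squared distance of the data from the mean."""
    mean = sum(values) / len(values)
    squared_distances = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_distances) / len(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data with mean 5
with_outlier = data + [50]        # the same data plus one extreme value

print("SD without outlier:", population_sd(data))
print("SD with outlier:   ", population_sd(with_outlier))
```

Adding the one extreme value makes the SD much larger, which illustrates why the SD is said to be sensitive to outliers.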
Model of Distributions:
The big picture of how to describe a distribution is to describe three things: the shape, the center, and the spread. The simplest shape that a distribution can have is a flat line. Such a distribution is called a uniform distribution, f(x) = c, where c is a constant. The histogram of a Poisson distribution is skewed right. Specific families of distributions are modeled by formulas, including the uniform, Poisson, exponential, and binomial distributions. The basic strategy for describing the shape of a set of data is to find a mathematical model that approximates the histogram of the data we have.
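As an illustration of a distribution modeled by a formula, the sketch below uses the standard Poisson formula P(k) = rate^k · e^(−rate) / k! and prints a rough text histogram. The rate of 2 is an arbitrary choice made for this example, and the function name is hypothetical.

```python
import math

def poisson_pmf(k, rate):
    """Probability of seeing exactly k events when the average count is `rate`."""
    return rate ** k * math.exp(-rate) / math.factorial(k)

rate = 2.0  # assumed average number of events, chosen only for illustration
probabilities = [poisson_pmf(k, rate) for k in range(11)]

# One '#' per percentage point gives a crude histogram of the distribution.
for k, p in enumerate(probabilities):
    print(f"{k:2d} {'#' * round(p * 100)}")
```

The bars pile up on the left and trail off slowly to the right, which is the right-skewed shape the paragraph describes.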
The Bell Curve:
The most famous shape of a distribution is the bell-shaped curve, which is called the Gaussian or Normal distribution. No matter what the shape of the population data with which we start, the distribution of the sample means will converge to a normal curve as the sample size grows. This observation is known as the Central Limit Theorem. The old name for the Gaussian distribution is the Error Distribution. Gaussian distributions can differ in their spread: a curve can be very tight (the SD is small) or spread out (the SD is large). Setting the mean = 0 and the SD = 1 gives the Standard Normal curve. The number of SDs a value lies away from the mean is called its Z-score. If we know the mean and SD of a normal distribution, then we know that about 68% of the data will be within one SD of the mean, about 95% within two SDs of the mean, and about 99.7% within three SDs of the mean.
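The percentages quoted above can be checked numerically. The sketch below relies on a standard fact about the normal curve: the proportion within k SDs of the mean equals erf(k / √2). The function names and the example score, mean, and SD are made up for illustration.

```python
import math

def proportion_within(k):
    """Proportion of a normal distribution lying within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))

def z_score(value, mean, sd):
    """Number of SDs that `value` lies above (+) or below (-) the mean."""
    return (value - mean) / sd

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {proportion_within(k):.4f}")

# Hypothetical example: a score of 85 on a test with mean 70 and SD 10.
print("z-score:", z_score(85, 70, 10))
```

The three printed proportions round to the familiar 68%, 95%, and 99.7%.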
Correlation and Regression:
A scatter plot gives a visual sense of the relationship between two variables; the correlation gives a quantitative measure of the strength of the linear relationship. Each data point pairs two numbers, and there may be a relation between the attribute measured by the first number and the attribute measured by the second number. If they move together, they are correlated. A statistical correlation indicates an association but does not prove that there is a causal relationship between the variables. One of the common misuses of statistical information is to mistakenly infer cause and effect from correlation.
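One common quantitative measure of linear relationship is the Pearson correlation coefficient, which ranges from −1 to 1. The sketch below computes it by hand, assuming that this is the measure meant here; the study-hours and test-score numbers are invented and the function name is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much the two variables move together around their means.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / math.sqrt(spread_x * spread_y)

hours_studied = [1, 2, 3, 4, 5]       # made-up data
test_scores = [52, 55, 61, 70, 72]    # made-up data

print("r =", round(pearson_r(hours_studied, test_scores), 3))
```

A value near 1 says the points fall close to a rising straight line. It does not say that studying caused the higher scores.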
Probability: Workhorse for Inference:
Probability is the bridge between the two tasks of statistics: 1) describing data when we know all the data, and 2) inferring characteristics of the whole population from a sample. Probability is the study of randomness.
Samples – The Few, The Chosen:
There are several potential sampling pitfalls, including bias, sample sizes that are too small, and dishonest responses to questions that may be controversial or sensitive. Randomness is the key component of obtaining a representative sample. The term population refers to the whole collection of people or things being considered. A sample is a subset of the total population that we are investigating. The central characteristic of good sampling is randomness rather than intent. The basic purpose of getting data from a sample is to infer information about the whole population.
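A simple random sample can be sketched with Python's random module. Everything here is hypothetical: the population is 10,000 simulated test scores, the sample size is 100, and the seed is fixed only so the example gives the same result each time it runs.

```python
import random
import statistics

rng = random.Random(43)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated scores centered near 70.
population = [rng.gauss(70, 10) for _ in range(10_000)]

# Every member has the same chance of being chosen, and no one is picked twice.
sample = rng.sample(population, 100)

print("population mean:", round(statistics.mean(population), 2))
print("sample mean:    ", round(statistics.mean(sample), 2))
```

Because chance does the choosing, the sample mean usually lands close to the population mean even though only 1% of the population was measured.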
Hypothesis Testing:
Hypothesis testing is one of the workhorses of statistical inference. The phrase "reject the null hypothesis" is a way of saying that the data we gathered would be so unlikely under the hypothesis that we conclude the hypothesis about the world is probably not true. Comparing a hypothesized state of the world with the experimental data that we gather is a fundamental strategy of statistical inference.
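A small worked example may make the idea concrete. Suppose, hypothetically, that a coin lands heads 62 times in 100 flips, and the null hypothesis is that the coin is fair. The sketch below uses the normal approximation and the conventional 5% cutoff; both are assumptions of this example, as are the function name and the numbers.

```python
import math

def two_sided_p_value(heads, flips, null_p=0.5):
    """Z statistic and two-sided p-value for a proportion (normal approximation)."""
    observed = heads / flips
    standard_error = math.sqrt(null_p * (1 - null_p) / flips)
    z = (observed - null_p) / standard_error
    # Chance of a result at least this far from null_p, in either direction.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

z, p_value = two_sided_p_value(heads=62, flips=100)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")

if p_value < 0.05:  # conventional 5% significance level (assumed)
    print("Reject the null hypothesis that the coin is fair.")
else:
    print("The data are consistent with a fair coin.")
```

Here the result is 2.4 standard errors away from what a fair coin predicts, and a gap that large would occur by chance less than 2% of the time.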
Confidence Intervals – How Close? How Sure?:
A key to both confidence intervals and hypothesis testing comes from the Central Limit Theorem, which tells us how likely it is that the mean of the sample is close to the mean of the population.
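A confidence interval for a mean can be sketched as the sample mean plus or minus a margin of error. The example assumes a sample large enough for the normal curve to apply, which is where the Central Limit Theorem comes in; the sample mean of 70, SD of 10, and size of 100 are made-up numbers, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Approximate confidence interval for a population mean (large-sample sketch)."""
    # Number of standard errors needed to cover the middle `confidence` share
    # of the standard normal curve (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

low, high = mean_confidence_interval(sample_mean=70, sample_sd=10, n=100)
print(f"95% confidence interval: {low:.2f} to {high:.2f}")
```

A larger sample shrinks the margin of error, which answers "how close?", while the confidence level answers "how sure?".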
Design of Experiments – Thinking Ahead:
The challenge of experimental design is to make sure that we gather the data in such a way that we can draw meaning from them: that we can use the techniques of hypothesis testing and confidence intervals to make mathematical deductions that make logical sense and that allow us to actually infer from the data the ideas of interest.

How will the Book contribute to my success upon my release:
This book on statistics will allow me to tutor my statistics students. Statistics will allow me to use my analytical and critical thinking skills to identify correlations in data and make sense of statistical information. I will share this knowledge with the communities where I hope to volunteer my services.