Statistics, quite simply, is about learning from sample data.[1] You face a group of individuals – perhaps people, but maybe cans of tomatoes, or automobiles, or fish in a lake, or even something as nebulous as calendar weeks. This group is the population of interest to you. There is something you would like to know about this population: How likely are the people to try a new product you are thinking of bringing to the market? Are the cans properly sealed? What determines the cost of keeping the cars in working condition? How many fish are there? What will demand for your product be in the weeks to come? The answer to your question will guide you in making a decision.
If you could simply collect data from all the members of your population, you would know what you need to know. However, there can be many reasons why this might not be possible. It might be too expensive: If the potential purchasers of your product are all the adult consumers in the United States, the sheer size of the population makes contacting every individual prohibitively costly. It may be that collecting data does direct damage: If you open all the cans of tomatoes to test the contents, you have nothing left to sell. More subtly, the population is often somewhat ill-defined. If you manage a fleet of automobiles, you might consider the population of interest to be cars actually in your fleet in recent months, together with cars potentially in your fleet in the near future. In this case, some members of the population are not directly accessible to you.
For any of these reasons, you might find yourself unable to examine all members of the population directly. So, you content yourself with collecting data from a sample of individuals drawn from the population. Your hope is that the sample is representative of the population as a whole, and therefore anything learned from the sample will give you information concerning the entire population, and will consequently help you make your decisions.
All statistical studies are carried out by following some statistical procedure, and every statistical procedure has three elements: You must specify how the sample data will be collected, how much data will be collected, and what you’ll do with the data once it’s in hand.
As a simple example, consider the following estimation procedure: To learn about the mean income of a population, select a random sample of individuals from the population, record the income of each individual sampled, and report the mean income of the sample as your estimate of the population mean income.
If you were facing a decision problem in which the “best” decision depended on the population mean income, you might now use your estimate to guide your decision.
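The fragment below is a minimal sketch of such a procedure in Python. The incomes and the sample size are invented purely for illustration; only the logic (draw a random sample and average it) comes from the procedure described above.

```python
import random

# Hypothetical incomes for the full population; in practice we could not observe all of these.
population_incomes = [41_000, 52_500, 38_200, 67_000, 95_300, 44_800, 58_100, 73_400]

# The estimation procedure: draw a simple random sample of a fixed size,
# then use the sample mean as the estimate of the population mean.
sample_size = 4
sample = random.sample(population_incomes, sample_size)
estimate = sum(sample) / len(sample)

print(f"Sample drawn: {sample}")
print(f"Estimated mean income: ${estimate:,.2f}")
```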
In the example above, the estimate we obtain might – if we are incredibly lucky – be precisely equal to the population mean. However, it will probably be the case that the sample is not perfectly representative of the population, and our sample mean is somewhat different from the true population mean. The possibility that the sample might fail to be perfectly representative is called exposure to sampling error. How far off is our estimate from the truth? Using only the data at hand, we can’t say. (If we could, we’d simply correct the estimate!) Instead, we focus our attention on the procedure used to make the estimate, and we determine how exposed to sampling error we were in using that procedure.
When we make an estimate, we summarize our exposure to sampling error by using a standard “language” to report our result. We say:
“I conducted a study to estimate {something} about {some population}. My estimate is {some value}. The way I went about making this estimate, I had {a large chance} of ending up with an estimate within {some small amount} of the truth.”
For example, “I conducted a study to estimate the mean gross income over the past year of current subscribers to our monthly magazine. My estimate is $65,230. The way I went about making this estimate, I had a 95% chance of ending up with an estimate within $1,500 of the truth.”
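The margin in such a statement is typically a multiple of the standard error of the estimate. As a rough sketch, the Python fragment below computes a 95% margin of error for a sample mean using the common large-sample rule of roughly 1.96 standard errors. The data are invented; the article does not say how the $1,500 figure was actually obtained.

```python
import math
import statistics

# Hypothetical sample of subscriber incomes, invented for illustration.
sample_incomes = [58_200, 61_750, 70_400, 66_900, 63_300, 72_100, 59_800, 68_450]

n = len(sample_incomes)
sample_mean = statistics.mean(sample_incomes)
sample_sd = statistics.stdev(sample_incomes)     # sample standard deviation
standard_error = sample_sd / math.sqrt(n)        # estimated spread of the sample mean

# For a reasonably large sample, roughly 95% of estimates made this way
# land within about 1.96 standard errors of the true population mean.
margin_of_error = 1.96 * standard_error

print(f"Estimate of the population mean income: ${sample_mean:,.2f}")
print(f"95% margin of error: ±${margin_of_error:,.2f}")
```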
Notice how parsimonious this language is. We don’t bore the listener with unnecessary detail concerning the data-collection procedure. Instead, we cut directly to the important issue: How exposed to sampling error were we when we carried out the procedure? How much can our procedure (and, perforce, the estimate we derived using it) be trusted?
Sometimes we wish to examine statistical evidence, and determine whether it supports or contradicts a claim that has been made (or that we might wish to make) concerning the population. This is done in a somewhat asymmetric fashion, analogous to the approach taken in the British system of criminal justice (adopted throughout most of the modern world): We take a statement, presume it to be “innocent,” i.e., true, and ask how strongly the evidence contradicts our initial assumption.
Typically, we only do this if some evidence weighs against the statement, and our statistical analysis determines how strongly the evidence contradicts the original statement. This is done by calculating the probability that the procedure we carried out would – in a world where the statement really is true – provide evidence at least as contradictory as that actually observed, purely due to sampling error.
This probability, called the significance level of the sample data with respect to the statement, is then interpreted. If it is relatively large, then we conclude that the evidence against the statement is weak, since we must acknowledge that, in a presumed world in which the statement is true, our studies would frequently still provide such evidence purely due to our exposure to sampling error. However, if this probability is small, we conclude that the evidence at hand is quite different from that we would expect to see if the statement were true, i.e., we conclude that the evidence strongly argues against the statement’s truth, and we lean towards finding the statement “guilty.”
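As a sketch of how such a significance level might be computed, suppose the statement on trial is that a population mean income equals some claimed value. The fragment below, using invented data and a normal approximation purely for illustration, asks how often sampling error alone would produce a sample mean at least as far from the claimed value as the one observed.

```python
import math
import statistics
from statistics import NormalDist

# Statement on trial (hypothetical): the population mean income is $60,000.
claimed_mean = 60_000

# Hypothetical sample data, invented for illustration.
sample = [63_400, 58_900, 66_200, 61_700, 64_800, 67_300, 59_500, 65_100]

n = len(sample)
sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / math.sqrt(n)

# How many standard errors does the sample mean lie from the claimed value?
z = (sample_mean - claimed_mean) / standard_error

# Significance level: the chance, in a world where the statement is true, of seeing
# evidence at least this contradictory purely due to sampling error (two-sided).
significance_level = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"Sample mean: ${sample_mean:,.2f}  (z = {z:.2f})")
print(f"Significance level: {significance_level:.4f}")
```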
Just as in a criminal trial, we never conclude that the statement is “innocent” – at most, we find it “not guilty.” In other words, our analysis leaves us in one of two camps: Either we have strong evidence that the statement on trial is false, or we do not have such evidence. Therefore, if we wish to make an affirmative case for a claim, we are forced to take the opposite of that claim as the statement we put on trial. Only in this way might we conclude, at the end, that the data – if it provides strong evidence against the claim on trial – serves to support our original claim.
With the two languages of statistics in place, one is ready to use the most important (to managers) of statistical “tools,” regression analysis. This does what its name implies: It “goes back” from something of primary interest – for example, the market value of a piece of residential real estate – to a set of determining factors – such as lot size, interior living space, number of bedrooms, and (of course) location. It untangles the effects of the various factors, providing a clear view of the role of each in the overall relationship.
One reason for performing a regression analysis is to be able to make predictions in individual cases. A county tax assessor, or a real estate agency, may wish to appraise the market value of a home that has not changed hands for many years.
Another reason is to estimate the impact of some particular factor in the overall relationship. For example, one might wonder what the addition of an exercise room, or a living-room fireplace, typically adds to the resale value of one’s property.[2]
In either of these applications, the language of estimation comes into play, being used to say how much a prediction, or an estimate of the impact of some factor, can be trusted. As well, the language of hypothesis testing is used when investigating whether there is evidence that some factor truly has an effect – for example, that after differences in experience and work assignments are taken into account, there is still evidence that a firm’s compensation scheme rewards one demographic group differently from another.
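The following is a minimal sketch of such a regression, fitting an ordinary least squares model with numpy on made-up home-sale data. The factors (lot size, living space, bedrooms) echo the example above, but every number, including the home being appraised at the end, is invented for illustration.

```python
import numpy as np

# Hypothetical past sales: lot size (sq ft), living space (sq ft), bedrooms, sale price ($).
lot_size    = np.array([6000, 7500, 5200, 9000, 8200, 6800, 7100, 9500])
living_area = np.array([1400, 1850, 1200, 2300, 2100, 1600, 1750, 2500])
bedrooms    = np.array([   3,    4,    2,    4,    4,    3,    3,    5])
price       = np.array([215_000, 279_000, 189_000, 342_000, 318_000, 238_000, 259_000, 371_000])

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(len(price)), lot_size, living_area, bedrooms])

# Ordinary least squares: coefficients that minimize the squared prediction errors.
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, b_lot, b_area, b_bed = coef

# The coefficient on a factor estimates its impact, holding the other factors fixed.
print(f"Estimated value of one extra square foot of living space: ${b_area:,.2f}")

# Prediction for a home that has not changed hands recently (invented characteristics).
new_home = np.array([1.0, 7000, 1800, 3])
print(f"Predicted market value: ${new_home @ coef:,.2f}")
```

In practice, each coefficient and each prediction would also be reported with a measure of its reliability, so that the languages of estimation and hypothesis testing described above can be applied to the regression results.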
[1] Actually, this is the focus of what is sometimes called “inferential statistics.” In contrast, “descriptive statistics” refers to useful ways of summarizing data at hand without direct reference to an underlying population.
[2] Regression analysis cannot, by itself, demonstrate that an added feature causes an increase in resale value. However, it can estimate the typical difference in resale value of comparable homes that differ only in the presence or absence of the feature. We could then use our judgment and experience to assert that the resale price difference is due to the feature.