When a researcher is designing an upcoming study involving human subjects, the number of subjects to be included in the study is one of the primary considerations. At one extreme, the researcher might decide to include only one subject in their study. Such a simple dataset would have the great benefit of being very easy to understand and summarize; after all, there is only one data point. However, it would be difficult for the researcher to be confident that the subject in the study was a good representative of all possible subjects. There is a reasonable chance that the chosen subject is exceptional in some way and therefore not very representative of the overall population. In other words, any results from this single-subject study would be difficult to generalize to a larger population of people. At the other extreme, the experiment designer might decide to include every member of the large population in their study. This would have the great benefit of being perfectly generalizable, as every member of the population was considered in the study.

However, this approach suffers from two major drawbacks. First, such a large dataset would be difficult to describe and understand in its entirety. Second, and perhaps more importantly, collecting data from an entire population of individuals almost always requires too much time and money to be practical. In fact, it may not even be possible to access all of the members of the population that the researcher is interested in. There is a tradeoff when choosing the number of subjects in your study: generalizability is traded against the ease of describing and collecting data. As a consequence, researchers almost always choose a number of subjects that represents the best compromise for their particular situation. Because the number of subjects lies between the two extremes, the resulting datasets are neither easily describable nor perfectly generalizable.

The sole purpose of statistics is to resolve these difficulties, hopefully allowing the researcher the best of both worlds: easy and accurate description as well as generalizable conclusions. Statistics is the practice of using defensible techniques to either describe or make inferences about a set of data. Both of these needs arise from the unfortunate fact that datasets are almost always of a size that is incompatible with our needs as researchers. The need for descriptive statistics arises when a dataset is too big to be easily described and understood. In this case, we look for ways to summarize the dataset that preserve the important characteristics of the dataset. In a sense, the data become more portable through this process. We need inferential statistics when we are trying to use our data to come to conclusions. Unfortunately, we rarely want to come to conclusions about the individuals that provided the dataset in the first place. Instead, we typically want to come to more general conclusions about a larger group of people, often an entire population of individuals.

Samples vs. Populations, and Statistics vs. Parameters

A population is a very large (in fact, effectively infinite in size) group of individuals that share one or more important characteristics. For example, males in the United States, females in Great Britain, elephants, and incandescent light bulbs are each populations. In the large majority of cases, researchers want to describe or make inferences about populations. An autism researcher, for example, might want to determine the proportion of US males that lie on the autism spectrum. This population is enormous and unwieldy, though, which makes it difficult to study directly. The autism researcher doesn’t have the time or money to test the entire population of US males, and therefore is forced to test a subset of the population instead. This subset is called a sample and will serve as a substitute for the population.

Characteristics of the sample (for example, the proportion of the sample that lies on the autism spectrum) are called statistics. Characteristics of the population (for example, the proportion of the population of US males that lies on the autism spectrum) are called parameters. We use descriptive and inferential statistics to describe and make inferences about population parameters. For example, if we observed that 56% of our sample lies on the autism spectrum, we might conclude that approximately 56% of the population of US males lies on the autism spectrum. In other words, we might use a descriptive statistic to estimate a population parameter.
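The relationship between a statistic and a parameter can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the population is a made-up finite list of 100,000 individuals, exactly 30% of whom have some trait, and the name `sample_proportion` is a hypothetical helper, not part of any library. It draws simple random samples of several sizes and compares each sample proportion (the statistic) with the known population proportion (the parameter).

```python
import random


def sample_proportion(population, n, rng):
    """Draw a simple random sample of size n (without replacement)
    and return the proportion of sampled individuals with the trait."""
    sample = rng.sample(population, n)
    return sum(sample) / n


# Assumption: a hypothetical finite stand-in for a population.
# 1 means the individual has the trait, 0 means they do not.
POPULATION_SIZE = 100_000
TRAIT_COUNT = 30_000  # made-up value chosen only for illustration
population = [1] * TRAIT_COUNT + [0] * (POPULATION_SIZE - TRAIT_COUNT)

# The parameter: a characteristic of the whole population.
parameter = sum(population) / len(population)

rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (10, 100, 1_000, 10_000):
    # The statistic: the same characteristic, computed on a sample.
    statistic = sample_proportion(population, n, rng)
    print(f"n = {n:>6}: statistic = {statistic:.3f}, parameter = {parameter:.3f}")
```

In runs like this, the statistic from a small sample can land noticeably far from the parameter, while larger samples tend to land closer. That is the tradeoff between generalizability and the cost of data collection described earlier.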

Sample statistics are usually represented with letters from the Roman alphabet ($\overline{X}$, $s$, etc.), and population parameters are represented with Greek letters ($\mu$, $\sigma$, etc.).
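For reference, one common way to write the quantities behind these symbols is sketched below. The text above does not define them, so the following are assumptions: $X_1, \dots, X_n$ denotes a sample of size $n$, $X$ denotes a randomly chosen member of the population, and the sample standard deviation uses the $n - 1$ divisor.

$$
\overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_i,
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \overline{X}\right)^2}
$$

$$
\mu = \mathrm{E}[X],
\qquad
\sigma = \sqrt{\mathrm{E}\left[(X - \mu)^2\right]}
$$

The first pair are statistics, computed from the sample. The second pair are parameters, which describe the population and are usually unknown.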
