Population vs Sample | Definitions, Comparison & Formulas
In research, the terms population and sample describe who you are studying. A population is the entire group you care about. A sample is a smaller group of individuals who you actually collect data from to make inferences, or educated guesses, about the population.
Before an election, pollsters can’t survey the whole population (every registered voter). Instead, they collect data from a sample (a small group from the population) to predict the election outcome.
If the sample is representative, it can provide a good guess of how the population will vote. If it’s biased (maybe only certain types of people respond to polls), the sample may give inaccurate results.
Population and sample definitions
Population and sample both refer to who or what you are studying. However, they’re not quite the same—the population is every possible individual you could include in your study, whereas the sample is the smaller group you actually collect data from.
Population
When you are conducting research, the population is all individuals or entities you care about. A population can be broad (e.g., all adult men in North America) or narrow (e.g., all geophysics majors currently enrolled at Stanford University). A population can also comprise nonhuman things, like all the products manufactured on an assembly line or wheat crops in Saskatchewan, Canada.
How you define your population will depend on who and what you want to study. Ensuring a representative sample of this population will help produce results that have external validity and can be generalized.
Sample
The goal of collecting a sample is to make educated guesses (or inferences) about a population. Collecting data from every individual in a population is often impractical or impossible, so researchers generally collect data from a subgroup of individuals called a sample. This sample is meant to represent the population.
How a researcher selects a sample from a population is a nontrivial decision. The researcher must balance practicality with how representative their sample is. If people with certain characteristics are more likely to be sampled than others, the sample may not reflect the population of interest—this is known as sampling bias. Different sampling methods exist that vary in their convenience and susceptibility to sampling bias.
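A small simulation can make sampling bias concrete. The sketch below is illustrative only: the population, the group sizes, the "weekly exercise hours" values, and the 5-to-1 selection weights are all made-up assumptions, not data from any study. It compares a simple random sample with a sample in which one group is far more likely to be selected.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 people, 30% of them runners.
# Each person has a made-up "weekly exercise hours" value.
population = (
    [("runner", random.gauss(6.0, 1.5)) for _ in range(3000)]
    + [("non-runner", random.gauss(2.0, 1.0)) for _ in range(7000)]
)

def mean_hours(people):
    """Average exercise hours for a list of (group, hours) pairs."""
    return statistics.fmean(hours for _, hours in people)

# Simple random sample: every person is equally likely to be picked.
random_sample = random.sample(population, 500)

# Biased sample: runners are 5 times as likely to be picked.
# Note: random.choices draws with replacement, which is fine for a sketch.
weights = [5 if group == "runner" else 1 for group, _ in population]
biased_sample = random.choices(population, weights=weights, k=500)

population_mean = mean_hours(population)
random_mean = mean_hours(random_sample)
biased_mean = mean_hours(biased_sample)

print(f"Population mean:    {population_mean:.2f}")
print(f"Random sample mean: {random_mean:.2f}")
print(f"Biased sample mean: {biased_mean:.2f}")
```

Under these assumptions, the random sample's mean should land close to the population mean, while the biased sample's mean is pulled toward the overrepresented group.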
Scenario 1
A doctor wants to know if one of their patients has low iron levels. They draw a small blood sample from the patient and analyze it to draw conclusions about all the blood in the patient’s body.
- Population: all the blood in the patient’s body
- Sample: the small amount of blood drawn
Scenario 2
A researcher wants to compare iron levels in adult runners and non-runners. They collect blood samples from people in each group and compare the results.
- Population: the blood of all adult runners and non-runners
- Sample: the blood samples from individuals tested in each group
These examples show how the population and sample shift depending on the research question.
When to use a sample vs a population
Whether you collect data from a sample or from an entire population depends on your research goals, available resources, and the size of the group you’re studying.
When to use population data
If the group you want to study is small or if complete data are readily available, collecting data from the entire population can eliminate sampling bias and provide accurate results.
- National census: Governments aim to collect data from every resident to obtain demographic information, allocate resources, and plan services.
- Standardized academic testing: A school board may administer a standardized test to all students to assess overall performance or identify learning gaps.
- Quality control: For critical products like medical devices, every unit might be inspected to ensure safety and accuracy.
- Employee surveys: In a smaller company, all employees might be surveyed to gauge employee satisfaction.
When to use sample data
If the group you want to study is large, geographically dispersed, or hard to reach, you might instead collect data from a representative sample. If your sample is chosen carefully, the associated data can be used to make accurate inferences about the population the sample represents.
- Public opinion polls: Rather than survey every citizen, pollsters collect data from a representative sample to estimate approval ratings or voting intentions.
- Marketing research: Companies collect sample data from their target demographic to understand preferences or predict product performance.
- Clinical trials: Researchers test a new drug on a sample of patients to evaluate its efficacy and safety.
- Empirical research: Researchers across a broad range of fields (e.g., psychology, ecology, sociology) often collect data from a sample to explore real-world phenomena.
When analyzing data, it is important to distinguish between population and sample, as populations are described by parameters whereas samples are described by statistics.
Population parameters vs sample statistics and formulas
Once you’ve collected data from your sample or population, you can calculate various measures to describe these groups. We use different terms to refer to measures that describe a population vs. a sample.
- A parameter is a measure calculated from population data.
- A statistic is a measure calculated from a sample.
Statistics are used to make educated guesses (or inferences) about parameters. Different symbols (and in some cases different formulas) are used for measures like size, mean, standard deviation, and variance.
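Measure | Definition | Population Parameter | Sample Statistic |
---|---|---|---|
Size | How many individuals are included | $N$ | $n$ |
Mean | The average of all measurements | $\mu = \frac{\sum x_i}{N}$ | $\bar{x} = \frac{\sum x_i}{n}$ |
Standard deviation | How spread out the data are from the mean, in the original data units | $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$ | $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$ |
Variance | How spread out the data are from the mean, in original data units squared | $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$ | $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ |
Standard error | How precisely a sample statistic estimates a population parameter | N/A | $SE = \frac{\sigma}{\sqrt{n}}$, or $SE = \frac{s}{\sqrt{n}}$ if you don’t have $\sigma$ |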
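As a rough illustration of these measures, the sketch below uses Python's standard `statistics` module, where `pstdev` and `pvariance` divide by the population size and `stdev` and `variance` divide by the sample size minus one. The data values are invented for the example, and the variable names are assumptions chosen to echo the usual symbols. The standard error shown is the standard error of the sample mean.

```python
import math
import statistics

# Hypothetical data: treat these 8 values as an entire (tiny) population.
population = [12.0, 15.0, 11.0, 14.0, 18.0, 13.0, 16.0, 13.0]
# Hypothetical sample: 4 of those values.
sample = [12.0, 14.0, 18.0, 13.0]

# Population parameters (the spread measures divide by N).
N = len(population)
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)
sigma_sq = statistics.pvariance(population)

# Sample statistics (the spread measures divide by n - 1).
n = len(sample)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)
s_sq = statistics.variance(sample)

# Standard error of the sample mean.
se_known_sigma = sigma / math.sqrt(n)  # when the population SD is known
se_estimated = s / math.sqrt(n)        # when only the sample SD is available

print(f"Population: N={N}, mean={mu:.2f}, SD={sigma:.2f}, variance={sigma_sq:.2f}")
print(f"Sample:     n={n}, mean={x_bar:.2f}, SD={s:.2f}, variance={s_sq:.2f}")
print(f"Standard error: {se_known_sigma:.2f} (known SD), {se_estimated:.2f} (estimated)")
```

Choosing between the `p`-prefixed functions and their sample counterparts is the code-level version of deciding whether your data are a population or a sample.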
Standard deviation and variance both describe the spread of the data, or how much the values are spread out or clustered around the mean. When the sum of squared deviations is divided by n, the spread calculated from a sample tends to be smaller than that of the population, because the deviations are measured from the sample mean, which sits closer to the sampled values than the population mean does. Sample standard deviation and variance calculated this way are therefore biased estimates of population standard deviation and variance. In other words, these statistics systematically underestimate the population parameters.
Dividing by “n – 1” instead of “n” corrects for this underestimation, because dividing by a smaller number makes the result bigger.
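The effect of the correction can be checked with a quick simulation. This is a minimal sketch under assumed conditions: a made-up, roughly normal population of 10,000 values, samples of size 5, and 10,000 repetitions. It averages the two versions of the sample variance across many samples and compares each average with the population variance.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 values.
population = [random.gauss(50.0, 10.0) for _ in range(10_000)]
true_variance = statistics.pvariance(population)

n = 5            # small samples make the bias easy to see
trials = 10_000  # number of repeated samples

divide_by_n = []
divide_by_n_minus_1 = []
for _ in range(trials):
    sample = random.sample(population, n)
    # pvariance divides the sum of squared deviations by n.
    divide_by_n.append(statistics.pvariance(sample))
    # variance divides the same sum by n - 1.
    divide_by_n_minus_1.append(statistics.variance(sample))

avg_biased = statistics.fmean(divide_by_n)
avg_corrected = statistics.fmean(divide_by_n_minus_1)

print(f"Population variance:        {true_variance:.1f}")
print(f"Average when dividing by n: {avg_biased:.1f}")
print(f"Average with n - 1:         {avg_corrected:.1f}")
```

With samples this small, the average of the divide-by-n version should fall noticeably below the population variance, while the n - 1 version should average out close to it.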
No correction is required for the mean because the sample mean is already an unbiased estimate of the population mean: it does not consistently over- or underestimate the population mean, whatever the sample size. If you repeatedly draw samples and calculate the mean of each, the mean of this sampling distribution equals the population mean. A related result, the central limit theorem, states that as long as the sample size is large enough (typically n ≥ 30), the sampling distribution of the mean is approximately normal, even if the population itself is not normally distributed.
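Repeated sampling is easy to imitate in code. The sketch below assumes a made-up, right-skewed population and draws 5,000 samples of size 30. It then compares the mean of the sample means with the population mean, and the spread of the sample means with the population standard deviation divided by the square root of n.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical skewed population (exponential, mean of about 10).
population = [random.expovariate(1 / 10) for _ in range(20_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

n = 30
# Draw many samples and record each sample's mean.
sample_means = [
    statistics.fmean(random.sample(population, n)) for _ in range(5_000)
]

mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.pstdev(sample_means)
predicted_se = sigma / math.sqrt(n)

print(f"Population mean:        {mu:.2f}")
print(f"Mean of sample means:   {mean_of_means:.2f}")
print(f"Spread of sample means: {spread_of_means:.2f}")
print(f"sigma / sqrt(n):        {predicted_se:.2f}")
```

The first two printed values should nearly match, which is what it means for the sample mean to be unbiased. The last two should also be close, which is why the standard error is a useful measure of precision.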
Frequently asked questions about population vs. sample
- What are sample statistics vs population parameters?
In statistics, population parameters are characteristics that describe a population (such as mean, standard deviation, and variance). They are calculated using the data from every member of the group you want to learn about, so they provide a completely accurate description of that population.
Sample statistics, on the other hand, are calculated from a sample (a subset of the population). Sample statistics provide an estimate of population parameters, but because they do not include data from every member of the population, they may be biased or inaccurate.
- What is sampling?
Sampling is the process of selecting a subset of individuals (a sample) from a larger population.
Because it’s often not feasible to collect data from every individual in a population, researchers study a sample instead. The goal is to use this sample to make predictions (or inferences) about the broader population.
For example, if you want to study consumer attitudes towards a brand, you might survey a subset of customers rather than every single one.
There are different sampling methods that can be used to select a sample.
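One of the simplest methods is a simple random sample, in which every individual has the same chance of selection. The sketch below assumes a hypothetical list of 1,000 customer IDs (the ID format and the sample size of 50 are invented for illustration) and uses `random.sample` from Python's standard library to draw customers without replacement.

```python
import random

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical sampling frame: 1,000 made-up customer IDs.
customers = [f"customer_{i:04d}" for i in range(1, 1001)]

sample_size = 50
# random.sample draws without replacement, so no customer appears twice.
survey_sample = random.sample(customers, sample_size)

# In a simple random sample, every customer has the same chance of selection.
inclusion_probability = sample_size / len(customers)

print(f"Surveying {len(survey_sample)} of {len(customers)} customers")
print(f"Chance that any one customer is selected: {inclusion_probability:.0%}")
```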
- What is sampling bias?
Sampling bias occurs when some individuals in the population are more likely to be included in the sample than others. This can limit how well results generalize to the broader population.
Sampling methods like probability sampling help reduce sampling bias because every individual in the population has a known, non-zero chance of being included in the sample. However, it’s difficult to eliminate sampling bias entirely, so results from a sample should always be interpreted with caution.
- How do I know if my data are from a population or a sample?
Knowing whether your data are from a population or a sample is key to properly analyzing or interpreting your results.
If your data are from a subset of the group you are studying, your data represent a sample. If instead your data have been collected from every single individual you are interested in studying, your data are from a population.
If you are analyzing data you did not collect yourself, consider how likely it is that the researchers who collected these data gathered measurements from every single individual they were interested in studying.
Researchers generally collect data from a smaller group and use the results to make inferences about the population, so there’s a good chance that these data are from a sample rather than a population.