Understanding Statistical Terminology
Using statistics in any way requires some level of familiarity and understanding of its terminology. Below we have compiled a glossary of common statistical terms that you are likely to come across in your research or reading. If you have suggestions of additional terms to add, please feel free to reach us at email@example.com.
Aggregate data: statistical summaries of data, meaning that the data have been analyzed in some way. Example: the number of unemployment claims filed in Colorado in a given week is a total or aggregate of individual claims filed in that week.
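To see how individual records become aggregate data, here is a minimal Python sketch. The claim records, week labels, and claimant IDs are made up for illustration; they are not real Colorado figures.

```python
from collections import Counter

# Hypothetical individual records: one (week filed, claimant ID) pair per claim.
claims = [
    ("2024-W01", "A101"),
    ("2024-W01", "A102"),
    ("2024-W02", "A103"),
    ("2024-W01", "A104"),
    ("2024-W02", "A105"),
]

# Aggregate: count the individual claims filed in each week.
claims_per_week = Counter(week for week, _claimant in claims)

for week, total in sorted(claims_per_week.items()):
    print(week, total)
```

The weekly totals are the aggregate data; the individual records they were built from are no longer visible in the summary.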
Association: the concept that there is a relationship between two or more variables that can be defined statistically.
Big data: a popular term used across academia, industry, and other arenas to describe the increased availability of all types of data. Big data is typically described as being huge in volume, high in velocity (how fast it is created), and diverse in variety.
Categorical variable: an observable characteristic that describes subjects by categories. Also called a discrete or nominal variable. Example: in Female vs. Male Infant Mortality Rate of Belarus, the mortality rate is a numerical variable and gender is a categorical variable.
Causality: generally, the concept that outcomes are the direct result of certain events or actions that have taken place. It is often discussed alongside "association" and "correlation," which are used to describe whether there is a relationship between variables and the strength of that relationship. For further explanation on how all three terms relate to one another, check out the video below!
Source: (2018). How do I test for causality? [Video]. SAGE Research Methods Video: Practical Research and Academic Skills https://www.doi.org/10.4135/9781526443090
Correlation: the degree to which two variables demonstrate a linear relationship with one another (e.g., positive and negative correlation). Correlation is a form of association. The term is often used very casually, but it is important to note that correlation between two variables does not mean that the presence of one causes a change in the other. Repeat after us: correlation does not equal causation.
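One common way to put a number on a linear relationship is the Pearson correlation coefficient, which ranges from -1 (perfect negative) to +1 (perfect positive). The sketch below is one way to compute it in Python; the function name pearson_r and the sample numbers are our own illustration, not data from any source.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has no variation.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: hours studied and exam scores for five students.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 58, 65, 70, 79]

r = pearson_r(hours_studied, exam_score)
print(round(r, 3))
```

A value close to +1 here would only tell us that the two variables rise together in this sample; it would not show that studying caused the higher scores.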
Data: fundamentally, data = information. We typically use the term to refer to numeric files that are created and organized for analysis. There are two types of data: aggregate and microdata.
Data aggregation: a collection of datapoints and datasets.
Data analytics: generally used to refer to the techniques and tools required to analyze massive amounts of data.
Database: a collection of data organized for research and retrieval. Example: American Community Survey by the Census Bureau.
Data point or datum: singular of data, generally refers to a single data value. Example: 25,114 billion BTU of aviation gasoline was consumed by the transportation sector in the US in 2012.
Dataset: a collection of related data items. This term is used very loosely. For instance, the entire Census 2010 can be considered a dataset, as can any individual table published within the Census 2010, such as Table P20, "Households by Presence of People Under 18 Years by Household Type by Age of People Under 18 Years."
Derived statistics: statistics calculated on the basis of other statistics. Example: the crime rate is a derived statistic based on the number of crimes committed in relation to the population of the area under investigation.
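As a rough illustration of a derived statistic, the Python sketch below turns two existing statistics (a crime count and a population count) into a rate per 100,000 residents. The function name and both figures are hypothetical, and per-100,000 is just one common convention for expressing such rates.

```python
def rate_per_100k(event_count, population):
    """Derive a rate per 100,000 people from a count and a population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return event_count * 100_000 / population

# Hypothetical inputs: both are statistics in their own right.
crimes_reported = 1_250
area_population = 500_000

crime_rate = rate_per_100k(crimes_reported, area_population)
print(crime_rate)  # 250.0 crimes per 100,000 residents
```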
Descriptive statistics: counts, averages (means), percentages, and so on that summarize the quantitative information obtained during the data collection effort. These simplify raw, observed data points into understandable and meaningful information, but do not state anything beyond the observed data points. Descriptive statistics are usually contrasted with inferential statistics. Check out the video below to learn how the different types of statistics relate to one another!
Source: (2020). Understanding Statistics [Video]. SAGE Publications
Indicator: typically used as a synonym for statistics that describe something about the socioeconomic environment of a society, such as per capita income, unemployment rate, or median years of education.
Inferential statistics: statistics used to draw inferences about a population based on information collected on a sample of the population. Example: the Census Bureau collects data from a sample of the US population in conducting the American Community Survey, and then uses a series of statistical tests to create estimates of what the statistic would be for the entire nation. Information on the calculations used (t-tests, ANOVA, regression, etc.) is typically found in the technical documentation on the survey methodology.
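To give a flavor of what "drawing inferences from a sample" can look like, here is a minimal Python sketch that estimates a population mean from a small sample and attaches an approximate 95% confidence interval. It uses a simple normal approximation that we chose for illustration; it is not the Census Bureau's methodology, and for a sample this small a t-based interval would usually be more appropriate. The function name and the commute times are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for a population mean.

    Assumes a simple random sample and uses the normal approximation.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    sample_mean = mean(sample)
    standard_error = stdev(sample) / sqrt(n)
    # Critical value from the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin

# Hypothetical sample: commute times in minutes for ten survey respondents.
commute_minutes = [23, 27, 31, 25, 29, 35, 22, 28, 30, 26]

low, high = mean_confidence_interval(commute_minutes)
print(f"sample mean: {mean(commute_minutes):.1f}")
print(f"approximate 95% interval: {low:.1f} to {high:.1f}")
```

The sample mean is a descriptive statistic; the interval around it is the inferential step, because it makes a statement about the wider population that was never fully observed.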
Median value: the "middle" value of a dataset when all values are sorted from lowest to highest. When the dataset has an even number of values, the median is the average of the two middle values. The median is valuable because, unlike the mean (average), it is barely affected by very low or very high outliers.
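The short Python sketch below shows how a single extreme value can pull the mean far away from the median. The income figures are invented for illustration.

```python
from statistics import mean, median

# Hypothetical annual incomes; the last value is a deliberate outlier.
incomes = [30_000, 32_000, 35_000, 38_000, 1_000_000]

income_mean = mean(incomes)
income_median = median(incomes)

print(income_mean)    # 227000
print(income_median)  # 35000
```

Four of the five incomes are below 40,000, so the median describes a typical value here far better than the mean does.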
Microdata: individual response data obtained in surveys and censuses—these are data points directly observed or collected from a specific unit of observation. Also known as raw data. ICPSR is an excellent resource for obtaining microdata files. Watch this video on microdata from the US Census Bureau to learn more:
Source: (2020). DATA GEMS: What is Microdata and Why Should I Use It? [Video]. U.S. Census Bureau https://www.youtube.com/watch?v=5MzRsT9Ofug
Percent: a proportional measure that compares a portion of a total to the total itself, expressed per hundred. This helps you understand the relationship of a slice to the whole. It is often used when you want to assess how significant a portion or amount is relative to an established total. Example: Percentage of high school students by grade that have ever drunk alcohol.
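The arithmetic behind a percentage is simply the part divided by the whole, multiplied by 100. Here is a small Python sketch; the function name and the student counts are hypothetical.

```python
def percent_of_total(part, total):
    """Express part as a percentage of total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return part * 100 / total

# Hypothetical counts: 45 of 180 surveyed students answered "yes".
students_yes = 45
students_surveyed = 180

share = percent_of_total(students_yes, students_surveyed)
print(f"{share:.1f}%")  # 25.0%
```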
Quantitative data/variables: information that can be handled numerically. Example: the number of US consumers who purchased personal care products and services. Virtually all the data in Data Planet would be considered quantitative.
Qualitative data/variables: information that refers to the quality of something. Ethnographic research, participant observation, open-ended interviews, and so on, may collect qualitative data. However, often there is some element of the results obtained via qualitative research that can be handled numerically (e.g., how many observations, number of interviews conducted).
Ratio: a proportional measure that compares the size of one quantity to another, typically counts drawn from different categories. It is often used when you want to understand and compare the relationship between two distinct groups. Example: males outnumber females at least 2:1 in the US Senate.
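A ratio is usually reported in its simplest form, which you get by dividing both counts by their greatest common divisor. The Python sketch below shows one way to do that; the function name and the group counts are hypothetical and are not actual Senate figures.

```python
from math import gcd

def simplify_ratio(a, b):
    """Reduce the ratio a:b to lowest terms."""
    if a <= 0 or b <= 0:
        raise ValueError("both counts must be positive")
    divisor = gcd(a, b)
    return a // divisor, b // divisor

# Hypothetical counts for two groups.
group_a = 75
group_b = 25

left, right = simplify_ratio(group_a, group_b)
print(f"{left}:{right}")  # 3:1
```

Unlike a percentage, which compares a part to the whole (75 of 100 is 75%), the ratio compares the two groups directly to each other.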
Sample: a slice of a population that is being observed, surveyed, or studied to estimate data and information about the entire population. To learn more about what a sample is and the various terms related to it, check out our video below!
Source: (2018). What is a sample? [Video]. SAGE Research Methods Video: Practical Research and Academic Skills https://www.doi.org/10.4135/9781526443168
Secondary data: information or data collected for others to analyze and use for their own research purposes. This is the most common form of data people encounter or use in their day-to-day lives. For instance, the Bureau of Economic Analysis (BEA) provides a wealth of secondary data for researchers and the public to analyze local economies and help inform decision-making.
Statistic: a numerical summary of data that has been analyzed in some way to describe some characteristic, or status, of a variable, such as a count or a percentage. Example: total nonfarm job openings over time.
Survey: a data collection method in which a group of people is studied or interviewed at a particular point in time for the purpose of making inferences or drawing conclusions about a population. The most well-known example in the US is the decennial US census, which the federal government is legally required to administer under the US Constitution and has conducted since 1790.
Time series data: data points recorded in chronological order. Example: Gross Domestic Product of Lebanon, 2005-2020.
Variable: a characteristic that can change or vary. Examples include anything that can be measured, such as the number of mining operations in Alabama.