Last Updated on September 28, 2021
Note: this post is a work in progress.
The purpose of this post is to provide some guidelines to aid in inferring causation in social science research. At the moment, the post just defines confounding variables, distinguishes them from other third variables (e.g. mediators and colliders), and provides some examples of statistical techniques that can be used to control for confounding variables. In the future, I hope to cover in more detail regression analysis, causal diagrams, multicollinearity, and other concepts that are important for inferring causation in the social sciences.
Causation and Confounding
Most people are familiar with the phrase “correlation does not imply causation.” This phrase is true because, for any two correlated variables X and Y, there are three possible explanations of the correlation (assuming the association isn’t due to pure randomness):
- X causes Y.
- Y causes X.
- Z causes X and Z causes Y.
Note that these explanations are not mutually exclusive. Because the association between X and Y can be explained without X having a causal influence (if explanation 2 or 3 is true), we cannot reasonably infer that X is causal merely from the fact that X is correlated with Y. In order to show that X has a causal influence on Y, there are more conditions that one must demonstrate before the causal inference can be made. One must show the following three conditions (page 146):
- There is an empirical association between X and Y.
- X occurs before Y.
- The association between X and Y is not spurious.
These requirements roughly correspond to the three possible explanations I mentioned earlier. If one fulfills requirement 2 (showing that X occurs before Y), then that rules out possible explanation 2 (that Y causes X). And if one fulfills requirement 3 (showing that the X-Y association is not spurious), then that rules out possible explanation 3 (that Z causes X and Z causes Y). Thus, once an association has been established (requirement 1), fulfilling requirements 2 and 3 is sufficient evidence that explanation 1 is true, i.e. that X causes Y.
The first two requirements are fairly easy to demonstrate. The third is more difficult. To show that the correlation between two variables X and Y is not spurious, one must show that the correlation is not the result of confounding. That is, one must show that the association between X and Y is not the result of a third variable (called a “confounder” variable) Z which causes both X and Y.
Before addressing the methods for ruling out confounding explanations of an association, we first need to know how to identify a possible confounder variable. How do we know whether a given third variable, Z, is a confounder of the association between X and Y? One might think that this can be answered by determining whether the association between X and Y persists after controlling for Z. If there is a lingering association, then Z does not confound the association (at least the lingering portion of the association). But if there is no lingering association, then Z is said to “account” (in a statistical sense) for the association between X and Y and is therefore a confounder.
This approach is only half right. If the association between X and Y persists after controlling for Z, then that’s evidence that Z does not confound the association between X and Y (at least not all of the association). But if the association between X and Y does not persist after controlling for Z, then that may not be sufficient evidence to infer that Z does confound the association. The problem with this inference is that Z may statistically account for the association between X and Y without being a confounder. For example, Z might statistically account for the association because it’s a mediator instead of a confounder (explained below). We cannot determine whether a given third variable is a confounder or mediator merely via statistical analysis of a set of correlations. Instead, to know whether a variable is a confounder or mediator, we must appeal to prior domain knowledge about the causal relationship between this variable and the two variables under consideration (“prior” in the sense that the knowledge is independent of the set of correlations we have before us). This knowledge may come from the most prominent theoretical models in the field or from an understanding of the mechanisms of the variables under study.
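To make this concrete, here is a minimal simulation sketch (my own illustration, not taken from any of the cited papers). It assumes simple linear relationships with made-up coefficients and uses NumPy. In scenario A, Z is a confounder and X has no effect on Y; in scenario B, Z is a mediator and X affects Y entirely through Z. The helper `partial_corr` is a hypothetical name for a function that correlates X and Y after linearly removing Z from both.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after linearly removing z from both."""
    design = np.column_stack([np.ones_like(z), z])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]

rng = np.random.default_rng(0)
n = 100_000

# Scenario A: Z is a confounder (Z -> X, Z -> Y); X has no effect on Y.
z_a = rng.normal(size=n)
x_a = z_a + rng.normal(size=n)
y_a = z_a + rng.normal(size=n)

# Scenario B: Z is a mediator (X -> Z -> Y); X affects Y only through Z.
x_b = rng.normal(size=n)
z_b = x_b + rng.normal(size=n)
y_b = z_b + rng.normal(size=n)

raw_a = np.corrcoef(x_a, y_a)[0, 1]
raw_b = np.corrcoef(x_b, y_b)[0, 1]
adj_a = partial_corr(x_a, y_a, z_a)
adj_b = partial_corr(x_b, y_b, z_b)

print(f"confounder: raw r = {raw_a:.2f}, after controlling for Z = {adj_a:.2f}")
print(f"mediator:   raw r = {raw_b:.2f}, after controlling for Z = {adj_b:.2f}")
```

In both scenarios the raw correlation is substantial and the correlation after controlling for Z is close to zero, even though X is causally inert in the first and fully causal in the second. The correlations alone cannot tell the two apart.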
So we need a way to know whether a given variable is a plausible confounder of a relationship between two other variables, and we cannot determine this via correlations. Ananth et al. (2017) [archived] give some requirements to determine whether a third variable confounds an association:
Confounding bias occurs when there is a failure to adjust for common causes of both the exposure and the outcome. The criteria for confounding are that the third variable (the confounder, C) should be causally associated with both the exposure (X) and the outcome (Y), and C is not on the causal pathway between X and Y.
In other words, there are three requirements in order for a third variable C to be a confounder of the association between X and Y:
- C is causally associated with X.
- C is causally associated with Y.
- C is not on the causal pathway between X and Y.
Requirement (3) states that C cannot be “an intermediate variable that lies on a causal pathway from exposure to outcome”. In other words, C must not be an “intermediate variable” (or mediator variable, as described below) from X to Y. Jager et al. (2008) [archived] also give three requirements for a variable to be a “potential confounder”. Because they give conditions for a potential, rather than actual, confounder, their three conditions are similar to the three conditions above, with the only difference being that the first two conditions require mere statistical association rather than causal association.
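The three requirements can be checked mechanically once we have written down an assumed causal diagram. The sketch below is my own simplification, not a procedure from Ananth et al.: it encodes a diagram as a plain dictionary of arrows and reads "causally associated with" as "is a cause of". The names `descendants` and `classify` are hypothetical. Note that the diagram itself has to come from domain knowledge; the code only applies the definitions to it.

```python
def descendants(graph, node, skip=None):
    """All nodes reachable from `node` by following arrows, never entering `skip`."""
    seen, stack = set(), [node]
    while stack:
        for child in graph.get(stack.pop(), ()):
            if child != skip and child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def classify(graph, c, x, y):
    """Classify third variable `c` relative to the X -> Y question in an assumed DAG."""
    causes_x = x in descendants(graph, c)
    # C must reach Y by a route that does not pass through X itself.
    causes_y = y in descendants(graph, c, skip=x)
    caused_by_x = c in descendants(graph, x)
    caused_by_y = c in descendants(graph, y)
    if causes_x and causes_y and not caused_by_x:
        return "confounder"
    if caused_by_x and y in descendants(graph, c):
        return "mediator"
    if caused_by_x and caused_by_y:
        return "collider"
    return "other"

# Each dict maps a variable to the variables it directly causes (assumed diagrams).
confounded = {"Z": ["X", "Y"], "X": ["Y"]}
mediated = {"X": ["Z"], "Z": ["Y"]}
collided = {"X": ["Z"], "Y": ["Z"]}
upstream_only = {"Z": ["X"], "X": ["Y"]}

for name, g in [("confounded", confounded), ("mediated", mediated),
                ("collided", collided), ("upstream_only", upstream_only)]:
    print(name, "->", classify(g, "Z", "X", "Y"))
```

The last diagram shows why the check on Y skips paths through X: a variable that influences Y only by way of X satisfies the first requirement but is not a common cause of X and Y.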
As stated earlier, just because a third variable Z statistically accounts for an association between X and Y does not imply that Z is a confounder. Therefore, if one attempts to control for confounding by controlling for all available third variables, this approach can introduce bias if the third variable is a mediator or collider rather than a confounder. These kinds of variables are described as follows:
- If Z is a mediator variable, then controlling for Z may result in an underestimate of the causal effect of X on Y. In other words, if the causal path from X to Y involves Z as an intermediate variable, then controlling for Z is not appropriate because it will block the very causal effect that we intend to estimate. As Gaskell et al. (2020) note, “if putative confounders are in fact mediators or colliders, such control will instead introduce bias”.
- If Z is a collider variable, then controlling for Z may result in an overestimate of the causal effect of X on Y. In other words, if Z is an effect of both X and Y, then controlling for Z can produce an association between X and Y even if X has no effect on Y. For a brief overview of collider bias with examples, see here [archived].
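Both kinds of bias are easy to reproduce in a small simulation. The following is a sketch under assumptions of my own choosing (linear effects, standard normal noise, invented coefficients), using NumPy least squares; `ols_slopes` is a hypothetical helper name. In the mediator example the total effect of X on Y is 1.5 (a direct effect of 0.5 plus an indirect effect of 1.0 through M). In the collider example X has no effect on Y at all.

```python
import numpy as np

def ols_slopes(y, *predictors):
    """Least-squares slope for each predictor (the intercept is fitted but dropped)."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    return np.linalg.lstsq(design, y, rcond=None)[0][1:]

rng = np.random.default_rng(1)
n = 100_000

# Mediator: X -> M -> Y plus a direct X -> Y path.
x = rng.normal(size=n)
m = x + rng.normal(size=n)
y = 0.5 * x + m + rng.normal(size=n)
total_effect = ols_slopes(y, x)[0]          # M left alone
after_mediator = ols_slopes(y, x, m)[0]     # M controlled

# Collider: X and Y are independent, and both cause C.
x2 = rng.normal(size=n)
y2 = rng.normal(size=n)
c = x2 + y2 + rng.normal(size=n)
no_control = ols_slopes(y2, x2)[0]          # C left alone
after_collider = ols_slopes(y2, x2, c)[0]   # C controlled

print(f"mediator: slope of X = {total_effect:.2f}, controlling for M = {after_mediator:.2f}")
print(f"collider: slope of X = {no_control:.2f}, controlling for C = {after_collider:.2f}")
```

Controlling for the mediator shrinks the estimate from roughly the total effect to roughly the direct effect, and controlling for the collider manufactures a nonzero slope where the true effect is zero.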
Mediator variables may go by different names. Some researchers refer to them as “intermediate variables” (Ananth et al. 2017) or mechanisms. Either way, the advice is the same: do not control for mediator variables if the goal is to quantify the total causal influence of one variable on another variable. Controlling for mediators may still be useful if the goal is to separate the direct and indirect causal influence of a variable, i.e. to decompose the causal influence into pathways mediated by a proposed intermediate variable and pathways not mediated by that variable. For a good, fairly accessible overview of confounder, mediator, and collider variables, see this introduction to causal diagrams by Gaskell et al. (2020) [archived]. It provides many examples of the types of variables and how controlling for mediators and colliders can introduce bias. For more examples of good and bad controls during causal inference, see Cinelli et al. (2020) [archived]. For a primer on causal inference in statistics, see this short book [archived] by Judea Pearl.
Understanding the differences between confounders and mediators is essential to drawing causal inferences from statistical data reported by studies. See Fergusson et al. (2005) [doi] for a study that illustrates the importance of recognizing the distinction between these two concepts. In this study, researchers found that, prior to controlling for various risk factors, IQ measured at ages 8-9 was significantly related to adulthood personal adjustment problems. However, after controlling for a number of risk factors (e.g., childhood conduct problems, attentional problems, and socioeconomic disadvantage), the association between childhood IQ and these outcomes was no longer statistically significant. So one might infer that IQ was not causal after all, i.e. one might conclude that the associations with IQ were confounded by the risk factors. However, IQ may actually cause the risk factors that the researchers controlled for, so the reduction in the association between childhood IQ and these adulthood outcomes may not imply a comparable reduction in the causal impact of IQ. The researchers express this point in their discussion (page 856):
[T]hese results raise important questions about the processes which link early conduct problems and IQ. Three explanations seem possible…If the association is explained by common genetic, social, family and related factors then the association between IQ and later adjustment is non-causal and reflects the consequences of common factors. If IQ, in some way, influences predisposition to conduct problems then the effects of IQ on later adjustment are causal and are mediated via early adjustment. Finally, if the association between IQ and conduct problems arises because conduct problems lead in some way to a lower measured IQ, the association between IQ and later adjustment is non-causal and reflects the common influence of early adjustment on both later adjustment and measured IQ.
The gold standard for addressing confounding and demonstrating causation in science is the randomized controlled trial. However, randomized controlled trials are fairly rare in social science for technical and ethical reasons. Instead, social science tends to rely on other methods to make causal inferences (e.g., “natural experiments”). Many of these methods are mentioned in a paper on causal inference by Antonakis et al. (2010). In this section, I’ll review some of the common methods used to address confounding in social science research.
The most common method to address confounding (in cognitive ability research) involves statistically controlling (or “adjusting” or “conditioning”) for certain confounders. There are a number of methods to statistically control for confounders. Two common methods involve regression analysis and stratification, both of which are mentioned in two articles about addressing confounding (Normand et al. 2005; McDermott and Miller 2008). Normand et al. (2005) [archived] briefly describe regression analyses as follows:
Regression uses the data to estimate how confounders are related to the outcome and produces an adjusted estimate of the intervention effect. It is the most commonly used method for reducing confounding in cohort studies. The outcome of interest is the dependent variable, and the measures of baseline characteristics (such as age and sex) and the intervention are independent variables. The choice of method of regression analysis (linear, logistic, proportional hazards, etc) is dictated by the type of dependent variable. For example, if the outcome is binary (such as occurrence of hip fracture), a logistic regression model would be appropriate; in contrast, if the outcome is time to an event (such as time to hip fracture) a proportional hazards model is appropriate.
Regression analyses estimate the association of each independent variable with the dependent variable after adjusting for the effects of all the other variables. Because the estimated association between the intervention and outcome variables adjusts for the effects of all the measured baseline characteristics, the resulting estimate is called the adjusted effect. For example, regression could be used to control for differences in age and sex between two groups and to estimate the intervention effect adjusted for age and sex differences.
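As a rough illustration of the adjusted effect described in the quoted passage, here is a sketch with simulated data. Everything in it is an assumption for demonstration: the variable names (`age`, `sex`, `treated`, `outcome`), the coefficients, and the choice of a continuous outcome, which is why it uses ordinary linear regression via NumPy rather than a logistic or proportional hazards model. Older people are more likely to receive the intervention and also have higher outcomes, so age confounds the comparison. The true intervention effect is set to 2.0.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Baseline characteristics (assumed confounders).
age = rng.uniform(20, 70, size=n)
sex = rng.integers(0, 2, size=n)

# Older people are more likely to receive the intervention.
p_treat = 1 / (1 + np.exp(-(age - 45) / 10))
treated = rng.binomial(1, p_treat)

# Outcome depends on the intervention (true effect 2.0), age, and sex.
outcome = 2.0 * treated + 0.1 * age + 1.0 * sex + rng.normal(size=n)

# Unadjusted estimate: simple difference in group means.
unadjusted = outcome[treated == 1].mean() - outcome[treated == 0].mean()

# Adjusted estimate: coefficient on `treated` with age and sex in the model.
design = np.column_stack([np.ones(n), treated, age, sex])
coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
adjusted = coef[1]

print(f"unadjusted difference: {unadjusted:.2f}")
print(f"adjusted effect:       {adjusted:.2f}")
```

The unadjusted difference overstates the effect because the treated group is older, while the adjusted coefficient lands close to 2.0. This only works here because the model matches how the data were generated and every confounder was measured.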
The article notes that regression analyses, like all methods to reduce confounding, rest on a number of assumptions that must hold for the results to be valid. The second method described is stratification, which is described as follows:
Stratification is a process in which the sample is divided into subgroups or strata on the basis of characteristics that are believed to confound the analysis. The effects of the intervention are then measured within each subgroup. The goal of stratification is to create subgroups that are more balanced in terms of confounders. If age and sex were confounders, then strata based on age and sex could be used to control for confounding. The intervention effect is calculated by working out the difference in average outcomes between the intervention and comparison groups within each stratum. It is important to determine whether the relation between the intervention and outcome differs across strata. If the effect estimates are the same across strata, a summary estimate can be calculated by pooling the individual estimates. However, substantial differences in estimates across strata suggest effect modification, and a summary estimate should not be calculated.
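The stratification procedure in the quoted passage can be sketched as follows. The data are simulated and the names (`age_group`, `sex`, `treated`, `outcome`) and coefficients are my own assumptions. Pooling by stratum size is just one simple weighting choice, not necessarily the one Normand et al. have in mind. The true intervention effect is 2.0 in every stratum, so there is no effect modification by construction.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60_000

# Categorical confounders: three age bands and sex.
age_group = rng.integers(0, 3, size=n)
sex = rng.integers(0, 2, size=n)

# Probability of receiving the intervention rises with age band: 0.2, 0.5, 0.8.
treated = rng.binomial(1, 0.2 + 0.3 * age_group)

# Outcome depends on the intervention (true effect 2.0), age band, and sex.
outcome = 2.0 * treated + 1.5 * age_group + 1.0 * sex + rng.normal(size=n)

# Crude estimate ignoring the confounders.
crude = outcome[treated == 1].mean() - outcome[treated == 0].mean()

# Within each age-by-sex stratum, compare intervention and comparison groups.
estimates, weights = [], []
for a in range(3):
    for s in range(2):
        stratum = (age_group == a) & (sex == s)
        treated_mean = outcome[stratum & (treated == 1)].mean()
        control_mean = outcome[stratum & (treated == 0)].mean()
        estimates.append(treated_mean - control_mean)
        weights.append(stratum.sum())

# Pool the stratum estimates, weighting each by its number of observations.
pooled = np.average(estimates, weights=weights)

print(f"crude difference: {crude:.2f}")
print("stratum estimates:", np.round(estimates, 2))
print(f"pooled estimate:  {pooled:.2f}")
```

The stratum estimates all sit near 2.0, which is what justifies pooling them. If they differed substantially across strata, that would point to effect modification and the pooled number would not be meaningful.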
One limitation of statistical controls is that we only have access to a finite number of variables that we can control for in a regression or stratification analysis. We can never be certain that we’ve controlled for all confounding variables. As McDermott and Miller (2008) note, even “when we’ve controlled for a long list of confounds, we can never be certain that the association between variables is causal in nature. There’s always the possibility of other confounds that we haven’t considered” (page 140). While statistical controls can never prove that an association is not confounded by third variables, these methods are still useful because they can reasonably raise our confidence that an association is causal by showing that the association is robust against controls for a growing number of plausible candidate confounders. Our confidence can be reasonably raised even more as the findings from statistical controls are corroborated with other methods (see below).
Note: I haven’t read most of these sources, but I’ve skimmed through all of them and they all seem useful. As I flesh out this post, I’ll read more from these sources and remove the ones that seem redundant.
- Greenland et al. (2010). “Causal Diagrams for Epidemiologic Research”
- Williamson et al. (2014). “Introduction to causal diagrams for confounder selection”
- Gaskell et al. (2020). “An Introduction to Causal Diagrams for Anesthesiology Research”
- Cinelli et al. (2021). “A Crash Course in Good and Bad Controls”
- Morgan and Winship (2007). Counterfactuals and Causal Inference: Methods and Principles for Social Research
- Pearl et al. (2016). Causal Inference in Statistics: A Primer
- Hernán and Robins (2020). Causal Inference: What If
- Antonakis et al. (2010). “On making causal claims: A review and recommendations”
- Ananth et al. (2017). “Confounding, Causality and Confusion: The Role of Intermediate Variables in Interpreting Observational Studies in Obstetrics”