Interpreting Effect Sizes

Last Updated on October 12, 2023

Note: this post is a work in progress.

The effect size is a statistic that quantifies the magnitude of the association between two variables. It is arguably the most important statistic reported in any study attempting to quantify the relationship between different variables. The purpose of this post is to aid in interpreting effect sizes, particularly from a social science perspective. Several different kinds of statistics are used to report effect sizes, such as Pearson correlation coefficients, regression coefficients, and standardized mean differences. Currently, this post focuses on providing information to help with interpreting the magnitude of Pearson correlation coefficients. I may add sections on other effect size statistics in the future.

Correlation Coefficients


Many studies quantify the association between two variables by reporting the Pearson correlation coefficient (r) between them. The correlation coefficient always takes on values from −1 to 1, with positive correlations indicating a positive relationship (i.e. as the value of one variable increases, the value of the other variable also increases) and negative correlations indicating an inverse relationship. Coefficients with greater absolute values indicate stronger relationships.

In this section, I won’t explain how to interpret the meaning of correlation coefficients on a conceptual level. There are plenty of resources that adequately do this both mathematically and geometrically (e.g., see the Wikipedia article). Instead, I’m going to provide some information that I believe will be useful for interpreting the magnitude of different correlation coefficients, particularly from a social science perspective. The goal is to provide information that allows us to have a better grasp of when a given correlation coefficient is weak, medium, or large. I begin by noting a common method that should not be used to interpret the magnitude of a correlation coefficient.

Standard deviations

A correlation coefficient between two variables indicates the change in one variable (in terms of standard deviations) that is linearly associated with a 1 standard deviation increase in the other variable. For example, if the correlation between X and Y is r = 0.3 and the relationship between X and Y is modelled linearly, then a 1 SD increase in X is associated with a 0.3 SD increase in Y.
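To make this concrete, here is a minimal numpy sketch (the correlation of 0.3 and the simulated data are invented for illustration) showing that the least-squares slope between two standardized variables equals their correlation coefficient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate two variables with a true correlation of about 0.3
# (an arbitrary illustrative value).
n = 100_000
x = rng.standard_normal(n)
y = 0.3 * x + np.sqrt(1 - 0.3**2) * rng.standard_normal(n)

r = np.corrcoef(x, y)[0, 1]

# After standardizing both variables, the least-squares slope
# equals the correlation coefficient.
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()
slope = np.polyfit(zx, zy, 1)[0]

print(f"r = {r:.3f}, standardized slope = {slope:.3f}")  # both ~0.300
```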

Percent of variance explained

One common method used to interpret the magnitude of a correlation coefficient involves squaring the correlation coefficient to get the “percentage of variance explained” (or the coefficient of determination) by the predictor variable. This approach is misleading and confusing for a number of reasons. One minor problem is the word “explained”, which implies that changes in one of the variables are causally responsible for changes in the other variable, but this need not be the case at all. A non-zero correlation coefficient merely implies a statistical association between the two variables; there need not be any corresponding causal association.

A more important problem with using the percent of variance explained is that it uses units that are difficult to interpret. The units of variance are the square of the units of the variable in question. Consider height as an example. The mean height for adult men in the United States is about 70 inches, and the standard deviation is about 3 inches, so the variance of height is about 9 square inches. But what is a square inch of height? It’s unclear how this should be interpreted, or why we should care about it outside of technical contexts. This is why studies often report the standard deviation (which is expressed in the same units as the variable) rather than the variance to describe the spread of a distribution. Because the units of variance are difficult to interpret, effect sizes reported as percent of variance explained are difficult to interpret for the very same reason. Funder and Ozer (2019) mention this problem in a paper on evaluating effect sizes in psychological research (page 158):

The computation of variance involves squaring the deviations of a variable from its mean. However, squared deviations produce squared units that are less interpretable than raw units (e.g., squared conscientiousness units). As a consequence, r^2 is also less interpretable than r because it reflects the proportion of variance in one variable accounted for by another. One can search statistics textbook after textbook without finding any attempt to explain why (as opposed to assert that) r^2 is an appropriate effect-size measure. Although r^2 has some utility as a measure for model fit and model comparison, the original, unsquared r is the equivalent of a regression slope when both variables are standardized, and this slope is like a z score, in standard-deviation units instead of squared units.

Another problem with using percent of variance explained is that it can inflate the apparent gap in predictive validity between two variables when their r^2 values are compared. Funder and Ozer (2019) show this quite clearly when considering a case where one’s payoff (the outcome variable) is the result of two predictor variables: the result of a nickel toss and the result of a dime toss (page 158):

Consider the difference in value between nickels and dimes. An example introduced by Darlington (1990) shows how this difference can be distorted by traditional analyses. Imagine a coin-tossing game in which one flips a nickel and then a dime, and receives a 5¢ or 10¢ payoff (respectively) if the coin comes up heads.

Table 1

| Result of nickel toss | Result of dime toss | Total payoff |
|-----------------------|---------------------|--------------|
| 1 | 1 | 15¢ |
| 1 | 0 | 5¢ |
| 0 | 1 | 10¢ |
| 0 | 0 | 0¢ |

From the payoff matrix in Table 1, correlations can be calculated between the nickel column and the payoff column (r = .4472) and between the dime column and the payoff column (r = .8944). If one squares these correlations to calculate the traditional percentage of variance explained, the result is that nickels explain exactly 20% of the variance in payoff, and dimes explain 80%. And indeed, these two numbers do sum neatly to 100%, which helps to explain the attractiveness of this method in certain analytic contexts. But if they lead to the conclusion that dimes matter 4 times as much as nickels, these numbers have obviously been misleading. The two rs afford a more informative comparison, as .8944 is exactly twice as much as .4472. Similarly, a correlation of .4 reveals an effect twice as large as a correlation of .2; moreover, half of a perfect association is .5, not .707 (Ozer, 1985, 2007). Squaring the r is not merely uninformative; for purposes of evaluating effect size, the practice is actively misleading.
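The payoff matrix makes this easy to verify directly. A short numpy sketch reproducing the quoted correlations by enumerating the four equally likely outcomes of the game:

```python
import numpy as np

# The four equally likely outcomes of the nickel/dime game.
nickel = np.array([1, 1, 0, 0])
dime   = np.array([1, 0, 1, 0])
payoff = 5 * nickel + 10 * dime  # payoff in cents

r_nickel = np.corrcoef(nickel, payoff)[0, 1]
r_dime   = np.corrcoef(dime, payoff)[0, 1]

print(f"r(nickel, payoff) = {r_nickel:.4f}")                  # 0.4472
print(f"r(dime, payoff)   = {r_dime:.4f}")                    # 0.8944
print(f"ratio of r values:   {r_dime / r_nickel:.1f}")        # 2.0
print(f"ratio of r^2 values: {r_dime**2 / r_nickel**2:.1f}")  # 4.0
```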

Finally, expressing effect sizes using percent of variance explained is misleading because it produces a figure that downplays the magnitude of correlations. For example, let’s say that there is a correlation of r = .30 between a predictor variable and an outcome variable. This might seem like a small effect size since the predictor variable “only” explains .30 × .30 = 9% of the total variance. But, as Sackett et al. (2008) note, a correlation of r = .30 can have rather significant impacts even though “only” 9% of variance is explained. They illustrate the importance of this effect size by calculating the odds of having an above-average value for the outcome variable conditional on a certain value (or range of values) of the predictor variable (page 216):

As long ago as 1928, Hull criticized the small percentage of variance accounted for by commonly used tests. In response, a number of scholars developed alternate metrics designed to be more readily interpretable than “percentage of variance accounted for” (Lawshe, Bolda, & Auclair, 1958; Taylor & Russell, 1939). Lawshe et al. (1958) tabled the percentage of test takers in each test score quintile (e.g., top 20%, next 20%, etc.) who met a set standard of success (e.g., being an above-average performer on the job or in school). A test correlating .30 with performance can be expected to result in 67% of those in the top test quintile being above-average performers (i.e., 2 to 1 odds of success) and 33% of those in the bottom quintile being above-average performers (i.e., 1 to 2 odds of success). Converting correlations to differences in odds of success results both in a readily interpretable metric and in a positive picture of the value of a test that “only” accounts for 9% of the variance in performance.
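The quoted quintile figures are easy to check by simulation. A minimal Monte Carlo sketch, assuming test scores and performance are bivariate normal with r = .30 (the normality assumption is mine, not Sackett et al.’s):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate test scores and performance correlated at r = .30.
n = 1_000_000
r = 0.30
test = rng.standard_normal(n)
perf = r * test + np.sqrt(1 - r**2) * rng.standard_normal(n)

top    = test >= np.quantile(test, 0.8)  # top test quintile
bottom = test <= np.quantile(test, 0.2)  # bottom test quintile

# Share of each quintile that performs above average (perf > 0).
print(f"top quintile:    {(perf[top] > 0).mean():.2f}")     # ~0.67
print(f"bottom quintile: {(perf[bottom] > 0).mean():.2f}")  # ~0.33
```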

For these reasons, interpreting the magnitude of a correlation using the percent of variance explained is incredibly confusing and misleading. Instead, correlation magnitudes should be interpreted by directly referencing the correlation coefficient itself.

Now, how should we interpret the magnitude of a correlation coefficient? How do we determine when a correlation coefficient is small, medium, or large? Whether a correlation coefficient is small or large will likely depend on the context of the analysis and the field of study. A given correlation coefficient might be large in one context or field, but small in another, so it’s difficult to give a single answer. Nevertheless, we will have a much better grasp of the magnitude of a given correlation coefficient if we have a set of known correlations that can serve as benchmarks. So it will be useful to know the typical correlations reported in different scientific fields and the correlations between variables that we are already familiar with. The rest of this post will be focused on providing this information.

Distributions of correlation coefficients

Gignac and Szodorai (2016) collected a large sample of meta-analytically derived correlations published in the field of individual differences: a total of 708 observed correlations gathered from 87 meta-analyses. They found that the 25th, 50th, and 75th percentiles corresponded to correlations of 0.11, 0.19, and 0.29, respectively. Only about 10% of the correlations exceeded 0.40 (their Table 1), and only about 2.7% exceeded 0.50 (page 75). On the basis of these findings, the authors recommended normative guidelines of 0.10, 0.20, and 0.30 for small, medium, and large correlations, respectively.

| Percentile | r | ρ |
|------------|-----|-----|
| 10 | .05 | .08 |
| 20 | .10 | .14 |
| 30 | .13 | .18 |
| 40 | .17 | .21 |
| 50 | .19 | .25 |
| 60 | .23 | .30 |
| 70 | .27 | .35 |
| 80 | .31 | .43 |
| 90 | .41 | .52 |

Note: the r values are the observed Pearson correlation coefficients, and the ρ values are the true score correlations, i.e. the correlations corrected for unreliability.

Lovakov and Agadullina (2021) examined 12,170 correlation coefficients and 6,447 Cohen’s d statistics from 134 meta-analyses to characterize the typical effect sizes reported in social psychology research. They found that the 25th, 50th, and 75th percentiles corresponded to correlation coefficients of 0.12, 0.24, and 0.41, respectively.

| Percentile | r |
|------------|-----|
| 10 | .04 |
| 20 | .10 |
| 30 | .14 |
| 40 | .20 |
| 50 | .24 |
| 60 | .30 |
| 70 | .37 |
| 80 | .45 |
| 90 | .57 |

On the basis of these results, I treat small, medium, and large correlations as correlation coefficients in the ranges r < .15, .15 < r < .30, and r > .30, respectively. The precise boundaries are somewhat arbitrary, but I think the guidelines are roughly accurate enough to aid in quickly interpreting the magnitude of a given correlation coefficient in social science. These ranges are also the guidelines proposed by Hemphill (2003), who found that they represented the bottom, middle, and upper thirds (respectively) of correlation coefficients reported in a couple of large meta-analytic reviews of psychological assessment and treatment research. A minimal helper that applies these cutoffs is sketched below.
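As a sketch of how these benchmark cutoffs might be applied in practice (the function and its name are my own, purely illustrative):

```python
def interpret_r(r: float) -> str:
    """Classify a correlation using the rough social-science benchmarks
    above (cutoffs at .15 and .30 are the guidelines proposed in this
    post, not universal standards)."""
    magnitude = abs(r)
    if magnitude < 0.15:
        return "small"
    if magnitude < 0.30:
        return "medium"
    return "large"

# Example: the Poropat (2009) conscientiousness-grades correlation of .22
print(interpret_r(0.22))  # "medium"
```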

Correlation coefficients of common variables

In this section, I report the correlation coefficients of some common variables that most people are familiar with.

Small Correlations (|r| < .15)

Bottom third of correlations.

| Variables | r | Source |
|-----------|-----|--------|
| Emotional stability and grades | .02 | Poropat (2009), Table 1 |
| Education effects and crime | .04 | Pratt et al. (2005), Table 2 |
| IQ and happiness | .05 | Strenze (2015), Table 25.1 |
| Agreeableness and grades | .07 | Poropat (2009), Table 1 |
| Academic performance and income | .09 | Strenze (2007), Table 1 |
| Years of education and job performance | .10 | Schmidt and Hunter (1998), Table 1 |
| Lead and crime | .11 | Higney et al. (2021) |
| Openness to new experience and grades | .12 | Poropat (2009), Table 1 |
| Socioeconomic status and crime | −.13 | Pratt et al. (2005), Table 2 |

Medium Correlations (.15 < |r| < .30)

Middle third of correlations.

| Variables | r | Source |
|-----------|-----|--------|
| SES and college GPA | .15 | Richardson et al. (2012), Table 5 |
| Years of job experience and job performance | .18 | Schmidt and Hunter (1998), Table 1 |
| Parental income and offspring income | .20 | Strenze (2007), Table 1 |
| Inequality and crime | .21 | Pratt et al. (2005), Table 2 |
| IQ and college GPA | .21 | Richardson et al. (2012), Table 5 |
| SES and grades (K-12) | .22 | Harwell et al. (2016) |
| Conscientiousness and grades | .22 | Poropat (2009), Table 1 |
| Youth IQ and income | .23 | Strenze (2007), Table 1 |
| Poverty and crime | .25 | Pratt et al. (2005), Table 2 |
| Family disruption and crime | .26 | Pratt et al. (2005), Table 2 |

Large Correlations (.30 < |r| < .45)

Top third of correlations.

| Variables | r | Source |
|-----------|-----|--------|
| Parental income and offspring cognitive ability | .30 | Rindermann and Ceci (2018), Table 3 |
| Conscientiousness and job performance | .31 | Schmidt and Hunter (1998), Table 1 |
| SAT score and college GPA | .33 | Richardson et al. (2012), Table 5 |
| College GPA and job performance | .36 | Roth et al. (1996), Table 2 |
| Academic performance and occupational status | .37 | Strenze (2007), Table 1 |
| IQ and skill acquisition in work training | .38 | Strenze (2015), Table 25.1 |
| IQ and job performance (work sample tests) | .38 | Roth et al. (2005), Table 4 |
| Parental SES and offspring occupational status | .38 | Strenze (2007), Table 1 |
| County-level unemployment rate and violent crime | .39 | Beaver and Wright (2011), Table 1 |
| ACT score and college GPA | .40 | Richardson et al. (2012), Table 5 |
| High school GPA and college GPA | .41 | Richardson et al. (2012), Table 5 |
| Age-2 and age-17 IQ | .43 | Yu et al. (2018), Table 2 |
| IQ and non-military job performance (work sample tests) | .44 | Roth et al. (2005), Table 4 |
| Youth IQ and occupational status | .45 | Strenze (2007), Table 1 |
| Parental education and offspring cognitive ability | .45 | Rindermann and Ceci (2018), Table 3 |

Really Large Correlations (.45 < |r| < .60)

Top 10-20% of correlations.

| Variables | r | Source |
|-----------|-----|--------|
| Age-12 BMI and age-40 BMI | .49 | Hulens et al. (2001) |
| Peer ratings and job performance | .49 | Schmidt and Hunter (1998), Table 1 |
| Structured employment interviews and job performance | .51 | Schmidt and Hunter (1998), Table 1 |
| IQ and job performance | .53 | Schmidt and Hunter (1998), Table 1 |
| Youth academic performance and educational attainment | .53 | Strenze (2007), Table 1 |
| IQ and grades (K-12) | .54 | Roth et al. (2015) |
| Parental SES and educational attainment | .55 | Strenze (2007), Table 1 |
| Youth IQ and educational attainment | .56 | Strenze (2007), Table 1 |
| Age-17 BMI and age-40 BMI | .56 | Hulens et al. (2001) |
| County-level IQ and violent crime | −.58 | Beaver and Wright (2011), Table 1 |

Really, Really Large Correlations (|r| > .60)

Top 5% of correlation coefficients. Correlations typically only become this large when measuring the same construct at different times or when comparing different measures of the same construct.

| Variables | r | Source |
|-----------|-----|--------|
| Age-6 and age-17 IQ | .67 | Yu et al. (2018), Table 2 |
| Age-11 and age-80 IQ | .73 | Deary et al. (2004), page 134 |
| SAT verbal and ACT verbal | .74 | Koenig et al. (2008), Table 2 |
| SAT math and SAT verbal | .75 | Koenig et al. (2008), Table 2 |
| County-level female-headed household rate and violent crime | .76 | Beaver and Wright (2011), Table 1 |
| Height at ages 5-11 and height at age 45 | .77 | Richmond-Rakerd et al. (2021), page 4 |
| Age-13 BMI and age-18 BMI | .77 | Hulens et al. (2001) |
| IQ and ACT total | .77 | Koenig et al. (2008), Table 2 |
| IQ and SAT total | .82 | Koenig et al. (2008), Table 2 |
| Age-12 and age-17 IQ | .82 | Yu et al. (2018), Table 2 |
| Age-20 and age-38 IQ | .85 | Larsen et al. (2008) |
| SAT math and ACT math | .86 | Koenig et al. (2008), Table 2 |
| SAT total and ACT total | .87 | Koenig et al. (2008), Table 2 |
| Age-30 BMI and age-40 BMI | .91 | Hulens et al. (2001) |
| Self-reported height and actual height | .95 | Hodge et al. (2020) |

Regression coefficients


If we are interested in comparing the effect sizes of different predictors of an outcome, it is problematic to rely solely on correlation coefficients, because the predictive validity of one predictor may be due to confounding with another predictor.

For example, imagine that parental income and offspring academic achievement correlate at r = 0.25, whereas parental education and offspring academic achievement correlate at r = 0.30. This may give the impression that parental income and parental education are about equally important in predicting offspring achievement. However, it could be the case that parental income correlates with achievement only because parental income also correlates with parental education, which is what actually affects offspring achievement. It may turn out that, between families with similar levels of parental education, there is no association between parental income and offspring achievement (many studies have shown this, in fact). If so, we would probably want to say that parental education is a more important predictor of offspring achievement than parental income, even though comparing correlation coefficients would give a different impression.

To avoid this problem, we cannot rely solely on correlation coefficients. We also want to quantify the independent associations of different predictor variables, i.e. the associations of each predictor while holding the other predictors fixed. This is where regression coefficients come in.
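Here is a minimal simulation sketch of the confounding scenario described above (all coefficients are invented for illustration): parental education drives both parental income and offspring achievement, income has no direct effect, and a multiple regression recovers this even though the marginal correlation does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Education drives both income and achievement; income is omitted
# from the achievement equation on purpose (no direct effect).
n = 100_000
educ    = rng.standard_normal(n)
income  = 0.6 * educ + 0.8 * rng.standard_normal(n)
achieve = 0.5 * educ + rng.standard_normal(n)

# Marginally, income looks predictive of achievement...
print(f"r(income, achievement) = {np.corrcoef(income, achieve)[0, 1]:.2f}")  # ~0.27

# ...but regressing achievement on both predictors shows income's
# independent association is ~0 once education is held fixed.
X = np.column_stack([np.ones(n), educ, income])
coefs = np.linalg.lstsq(X, achieve, rcond=None)[0]
print(f"coef(educ) = {coefs[1]:.2f}, coef(income) = {coefs[2]:.2f}")  # ~0.50, ~0.00
```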

Of course, if one is merely interested in the raw statistical relationship between two variables, correlation coefficients are fine. But we are usually interested in the relationship between two variables while holding other potentially confounding variables fixed, since we are typically interested in causal inference, which requires ruling out confounding.

Unstandardized coefficients

To quantify the independent associations of different independent variables with an outcome, many studies perform a regression analysis. In a regression, the dependent variable (in the running example, academic achievement) is modeled as a function of one or more independent variables simultaneously. Each independent variable is assigned a regression coefficient, which quantifies the statistical association between that independent variable and the dependent variable while holding all other independent variables fixed.

One concern is that regression coefficients are often unstandardized. An unstandardized regression coefficient indicates the change in the dependent variable associated with a one-unit increase in the independent variable. This makes comparisons between coefficients meaningless, because different coefficients have different units. For example, suppose we have a regression model with SAT score as the dependent variable and the following two independent variables: parental education (in years of schooling) and parental time at work (in hours per year). Suppose the unstandardized regression coefficient for both independent variables is 5. This would indicate that an additional year of parental education is associated with an additional 5 SAT points, and that an additional hour of work per year is also associated with an additional 5 SAT points.

In this highly unrealistic hypothetical, it is obvious that parental time spent working has a far greater association with SAT scores, even though both independent variables have an unstandardized coefficient of 5. The reason is that there is far more variation in hours worked per year than in years of parental education. In other words, a one-hour difference in work time is far more common than a one-year difference in education (e.g., it’s very common for a random pair of individuals to differ in hours worked per year by 10 hours, but rare for a random pair to differ in years of education by 10 years, because most people have somewhere between 12 and 16 years of education). So the typical variation in parental time at work is associated with a far greater change in SAT scores than the typical variation in parental education. Yet this difference is not apparent from comparing unstandardized coefficients.

Another way to illustrate why it’s meaningless to compare unstandardized coefficients is to note that the same independent variable can be expressed in different units. Different units entail different coefficients, even though the association between the independent variable and the outcome variable is the same. For example, when studies measure income, they can code the income variable in dollars or in units of $10,000. A coefficient under the latter coding is 10,000 times larger than under the former, even though the choice obviously has no bearing on the statistical relationship between income and any of the other variables in the model.

Standardized coefficients

It is clear that comparing unstandardized regression coefficients is meaningless. But we also cannot rely on comparing correlation coefficients, for the reasons given earlier. Therefore, to compare the independent effects of different predictors, one can compare standardized regression coefficients. Standardized regression coefficients quantify associations not in the original units of either variable but in terms of standard deviations: a standardized regression coefficient indicates the change (in standard deviations) in the dependent variable that is associated with a one-standard-deviation increase in the independent variable. Recall that this is also how a correlation coefficient can be read (i.e. if the correlation between X and Y is r = 0.3, then a 1 SD increase in X is linearly associated with a 0.3 SD increase in Y). In fact, in a linear regression with just one independent variable, the standardized coefficient is identical to the correlation coefficient between the independent and dependent variables, and so it falls between −1 and +1 like a correlation coefficient (in a multiple regression with correlated predictors, standardized coefficients can occasionally fall outside this range). The identity with the correlation coefficient also holds in a multivariate linear regression, so long as the independent variables are all orthogonal. As Wikipedia notes, “For simple linear regression with orthogonal predictors, the standardized regression coefficient equals the correlation between the independent and dependent variables”.

For example, a standardized coefficient of 0.3 indicates that a 1 standard deviation increase in the independent variable is associated with a 0.3 standard deviation increase in the dependent variable. This avoids the problem with unstandardized regression coefficients mentioned earlier. Return to the example of parental time at work vs parental education from above. The key problem with comparing unstandardized coefficients for parental time at work (in hours per year) and parental education (in years) was that there is far more variation in the units of the former than the latter. This is not a problem when we use standard deviations instead of the original units: whereas a one-hour difference in parental work time per year is far more common than a one-year difference in parental education, a one-standard-deviation difference in parental work time is just as common as a one-standard-deviation difference in parental education (assuming that both variables are roughly normally distributed; if not, comparing standardized coefficients may be misleading depending on the distributions). A minimal simulation of this hypothetical appears below.
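In the sketch below (all means, SDs, and the noise level are made up for illustration), both predictors get the same unstandardized coefficient of 5, yet standardizing reveals that hours worked dominates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hours worked varies far more (SD = 300) than education (SD = 2).
n = 50_000
educ  = rng.normal(14, 2, n)      # years of schooling
hours = rng.normal(1800, 300, n)  # hours worked per year
sat   = 5 * educ + 5 * hours + rng.normal(0, 200, n)

# Unstandardized fit: the two coefficients look identical (~5).
X = np.column_stack([np.ones(n), educ, hours])
b = np.linalg.lstsq(X, sat, rcond=None)[0]
print(f"unstandardized: educ = {b[1]:.2f}, hours = {b[2]:.2f}")

# Standardize all variables and refit: coefficients are now in
# SD units and become directly comparable.
def z(v):
    return (v - v.mean()) / v.std()

Xz = np.column_stack([z(educ), z(hours)])
bz = np.linalg.lstsq(Xz, z(sat), rcond=None)[0]
print(f"standardized:   educ = {bz[0]:.3f}, hours = {bz[1]:.3f}")  # ~0.007 vs ~0.99
```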

Todo


I’ll probably add more sections to this post in the future. Planned topics include other methods to interpret the magnitude of correlation coefficients (e.g., using consequences for personnel selection rather than benchmarks) and alternative effect size statistics.
