Interpreting Effect Sizes

Last Updated on September 28, 2021

Note: this post is a work in progress.

The effect size is a statistic that quantifies the magnitude of the association between two variables. It is arguably the most important statistic reported in any study that examines the relationship between variables. The purpose of this post is to aid in interpreting effect sizes, particularly from a social science perspective. Several different statistics are used to report effect sizes, such as Pearson correlation coefficients, regression coefficients, and standardized mean differences. Currently, this post focuses on interpreting the magnitude of Pearson correlation coefficients. I may add sections on other effect size statistics in the future.

Correlation Coefficients


Many studies quantify the association between two variables by reporting the Pearson correlation coefficient (r) between the two variables. The correlation coefficient always takes on values from −1 to 1, with positive correlations indicating a positive relationship (i.e., as the value of one variable increases, the value of the other variable also tends to increase) and negative correlations indicating an inverse relationship. Coefficients with greater absolute values indicate stronger relationships.

In this post, I won’t explain how to interpret the meaning of correlation coefficients on a conceptual level. There are plenty of resources that adequately explain the meaning of correlation coefficients, both mathematically and geometrically (e.g., see the Wikipedia article). Instead, I’m going to provide some information that I believe will be useful for interpreting the magnitude of different correlation coefficients, particularly from a social science perspective. The goal is to provide information that gives us a better grasp of when a given correlation coefficient is small, medium, or large. I begin by noting a common method that should not be used to interpret the magnitude of a correlation coefficient.

Percent of variance explained

One common method used to interpret the magnitude of a correlation coefficient involves squaring the correlation coefficient to get the “percentage of variance explained” (or the coefficient of determination) by the predictor variable. This approach is misleading and confusing for a number of reasons. One minor problem with this approach is that it uses the word “explained”, which implies that changes in one of the variables are causally responsible for changes in the other variable, but this need not be the case at all. If two variables have a non-zero correlation coefficient, this merely implies a statistical association between the two variables; there need not be any corresponding causal association between the variables.

A more important problem with using the percent of variance explained is that it uses units that are difficult to interpret. The units of variance are the square of the units of the variable in question. Consider height as an example. The mean height for adult men in the United States is about 70 inches, and the standard deviation of height is about 3 inches. The variance of height is therefore about 9 squared inches. But what is a squared inch? It’s unclear how this should be interpreted, or whether we should even care about it for non-technical purposes. This is why studies often report the standard deviation (which is expressed in the same units as the variable) instead of the variance to describe the spread of a distribution. Because the units of variance are difficult to interpret, using the percent of variance explained to report effect sizes is difficult to interpret for the very same reasons. Funder and Ozer (2019) have also mentioned this problem in a paper that addresses evaluating effect sizes in psychological research (page 158):

The computation of variance involves squaring the deviations of a variable from its mean. However, squared deviations produce squared units that are less interpretable than raw units (e.g., squared conscientiousness units). As a consequence, r^2 is also less interpretable than r because it reflects the proportion of variance in one variable accounted for by another. One can search statistics textbook after textbook without finding any attempt to explain why (as opposed to assert that) r^2 is an appropriate effect-size measure. Although r^2 has some utility as a measure for model fit and model comparison, the original, unsquared r is the equivalent of a regression slope when both variables are standardized, and this slope is like a z score, in standard-deviation units instead of squared units.
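The claim in the quote that r equals the regression slope when both variables are standardized can be checked numerically. The following is a minimal sketch, not anything taken from Funder and Ozer's paper; the `hours` and `score` data and all function names are made up for illustration.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def standardize(values):
    """Convert values to z scores using the population standard deviation."""
    m = mean(values)
    sd = sqrt(mean([(v - m) ** 2 for v in values]))
    return [(v - m) / sd for v in values]


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mx, my = mean(xs), mean(ys)
    numerator = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denominator = sum((x - mx) ** 2 for x in xs)
    return numerator / denominator


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data, invented purely for illustration.
hours = [2, 4, 5, 7, 9, 10]
score = [55, 60, 58, 72, 70, 85]

raw_slope = ols_slope(hours, score)  # in score points per hour
std_slope = ols_slope(standardize(hours), standardize(score))  # in SD units
r = pearson_r(hours, score)

print(f"raw slope:          {raw_slope:.4f}")
print(f"standardized slope: {std_slope:.4f}")
print(f"Pearson r:          {r:.4f}")
```

The raw slope depends on the units of both variables, whereas the standardized slope matches r: a one-standard-deviation increase in the predictor goes with an r-standard-deviation change in the predicted outcome.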

Another problem with using percent of variance explained is that it can inflate the gap in predictive validity between two variables when their r^2 values are compared. Funder and Ozer (2019) show this quite clearly by considering a case where one’s payoff (the outcome variable) is determined by two predictor variables: the result of a nickel toss and the result of a dime toss (page 158):

Consider the difference in value between nickels and dimes. An example introduced by Darlington (1990) shows how this difference can be distorted by traditional analyses. Imagine a coin-tossing game in which one flips a nickel and then a dime, and receives a 5¢ or 10¢ payoff (respectively) if the coin comes up heads.

Table 1

| Result of nickel toss | Result of dime toss | Total payoff |
| --- | --- | --- |
| 1 | 1 | 15¢ |
| 1 | 0 | 5¢ |
| 0 | 1 | 10¢ |
| 0 | 0 | 0¢ |

From the payoff matrix in Table 1, correlations can be calculated between the nickel column and the payoff column (r = .4472) and between the dime column and the payoff column (r = .8944). If one squares these correlations to calculate the traditional percentage of variance explained, the result is that nickels explain exactly 20% of the variance in payoff, and dimes explain 80%. And indeed, these two numbers do sum neatly to 100%, which helps to explain the attractiveness of this method in certain analytic contexts. But if they lead to the conclusion that dimes matter 4 times as much as nickels, these numbers have obviously been misleading. The two rs afford a more informative comparison, as .8944 is exactly twice as much as .4472. Similarly, a correlation of .4 reveals an effect twice as large as a correlation of .2; moreover, half of a perfect association is .5, not .707 (Ozer, 1985, 2007). Squaring the r is not merely uninformative; for purposes of evaluating effect size, the practice is actively misleading.
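The figures in the quote are easy to reproduce. The sketch below rebuilds the four equally likely outcomes of the coin game from the rules stated in the quote (5¢ for a nickel head, 10¢ for a dime head) and computes both r and r^2; the function and variable names are my own.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# The four equally likely outcomes (1 = heads, 0 = tails).
nickel = [1, 1, 0, 0]
dime = [1, 0, 1, 0]
payoff = [5 * n + 10 * d for n, d in zip(nickel, dime)]  # in cents

r_nickel = pearson_r(nickel, payoff)
r_dime = pearson_r(dime, payoff)

print(f"nickel: r = {r_nickel:.4f}, r^2 = {r_nickel ** 2:.2f}")
print(f"dime:   r = {r_dime:.4f}, r^2 = {r_dime ** 2:.2f}")
print(f"ratio of r values:   {r_dime / r_nickel:.1f}")
print(f"ratio of r^2 values: {r_dime ** 2 / r_nickel ** 2:.1f}")
```

The ratio of the two r values is 2, matching the 2-to-1 ratio of the coins' values, while the ratio of the r^2 values is 4.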

Finally, expressing effect sizes using percent of variance explained is misleading because it produces a figure that downplays the magnitude of correlations. For example, let’s say that there is a correlation of r = .30 between a predictor variable and an outcome variable. This might seem like a small effect size since the predictor variable “only” explains 0.3 * 0.3 = 9% of the total variance. But, as Sackett et al. (2008) note, a correlation of r = .30 can have substantial practical impacts even though “only” 9% of variance is explained. They illustrate the importance of this effect size by calculating the odds of having an above-average value for the outcome variable conditional on a certain value (or range of values) of the predictor variable (page 216):

As long ago as 1928, Hull criticized the small percentage of variance accounted for by commonly used tests. In response, a number of scholars developed alternate metrics designed to be more readily interpretable than “percentage of variance accounted for” (Lawshe, Bolda, & Auclair, 1958; Taylor & Russell, 1939). Lawshe et al. (1958) tabled the percentage of test takers in each test score quintile (e.g., top 20%, next 20%, etc.) who met a set standard of success (e.g., being an above-average performer on the job or in school). A test correlating .30 with performance can be expected to result in 67% of those in the top test quintile being above-average performers (i.e., 2 to 1 odds of success) and 33% of those in the bottom quintile being above-average performers (i.e., 1 to 2 odds of success). Converting correlations to differences in odds of success results both in a readily interpretable metric and in a positive picture of the value of a test that “only” accounts for 9% of the variance in performance.
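The quintile figures in the quote can be approximated with a short calculation. The sketch below assumes that test scores and performance follow a bivariate normal distribution, which is my assumption for illustration and not necessarily the exact procedure behind the Lawshe et al. tables. The function name and the numerical integration scheme are also my own.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def prob_above_average(r, lower_pct, upper_pct, steps=20_000):
    """P(outcome above its mean | predictor between two percentiles).

    Assumes predictor and outcome are bivariate normal with correlation r.
    Percentiles are fractions, e.g. 0.8 and 1.0 for the top quintile.
    """
    # Given a standardized predictor value x, the standardized outcome is
    # normal with mean r * x and standard deviation sqrt(1 - r^2).
    residual_sd = (1 - r ** 2) ** 0.5
    width = (upper_pct - lower_pct) / steps
    total = 0.0
    for i in range(steps):
        # Midpoint rule over equally probable slices of the predictor.
        p = lower_pct + (i + 0.5) * width
        x = STD_NORMAL.inv_cdf(p)
        total += STD_NORMAL.cdf(r * x / residual_sd)
    return total / steps


top_quintile = prob_above_average(0.30, 0.8, 1.0)
bottom_quintile = prob_above_average(0.30, 0.0, 0.2)

print(f"top quintile above average:    {top_quintile:.1%}")
print(f"bottom quintile above average: {bottom_quintile:.1%}")
```

Under this assumption the results land close to the 67% and 33% figures in the quote, which is roughly the difference between 2-to-1 odds and 1-to-2 odds of being an above-average performer.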

For these reasons, interpreting the magnitude of a correlation using the percent of variance explained is incredibly confusing and misleading. Instead, correlation magnitudes should be interpreted by directly referencing the correlation coefficient itself.

Now, how should we interpret the magnitude of a correlation coefficient? How do we determine when a correlation coefficient is small, medium, or large? Whether a correlation coefficient is small or large will likely depend on the context of the analysis and the field of study. A given correlation coefficient might be large in one context or field, but small in another, so it’s difficult to give a single answer. Nevertheless, we will have a much better grasp of the magnitude of a given correlation coefficient if we have a set of known correlations that can serve as benchmarks. So it will be useful to know the typical correlations reported in different scientific fields and the correlations between variables that we are already familiar with. The rest of this post will be focused on providing this information.

Distributions of correlation coefficients

Gignac and Szodorai (2016) collected a large sample of meta-analytically derived correlations published in the field of individual differences. The authors gathered a total of 708 observed correlations from a sample of 87 meta-analyses. They found that the 25th, 50th, and 75th percentiles corresponded to correlations of 0.11, 0.19, and 0.29, respectively. Only about 10% of the correlations exceeded 0.40 (Table 1), and only about 2.7% of correlations exceeded 0.50 (page 75). Because of these findings, the authors recommended that the normative guidelines for small, medium, and large correlations should be 0.10, 0.20, and 0.30, respectively.

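| Percentile | r | ρ |
| --- | --- | --- |
| 10 | .05 | .08 |
| 20 | .10 | .14 |
| 30 | .13 | .18 |
| 40 | .17 | .21 |
| 50 | .19 | .25 |
| 60 | .23 | .30 |
| 70 | .27 | .35 |
| 80 | .31 | .43 |
| 90 | .41 | .52 |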

Note that the r values are the observed Pearson correlation coefficients, and the ρ values are the true score correlations, i.e., the correlations corrected for unreliability.
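To give a sense of what correcting for unreliability does, here is a minimal sketch of the standard correction for attenuation (Spearman's formula), which divides the observed correlation by the square root of the product of the two measures' reliabilities. Whether the meta-analyses in this sample applied exactly this formula is an assumption on my part, and the reliability values below are hypothetical.

```python
from math import sqrt


def disattenuate(observed_r, reliability_x, reliability_y):
    """Spearman's correction for attenuation.

    Estimates the true score correlation from an observed correlation
    and the reliabilities of the two measures.
    """
    if not (0 < reliability_x <= 1 and 0 < reliability_y <= 1):
        raise ValueError("reliabilities must be in the interval (0, 1]")
    return observed_r / sqrt(reliability_x * reliability_y)


# Hypothetical values, chosen only for illustration.
observed_r = 0.30
true_score_r = disattenuate(observed_r, reliability_x=0.80, reliability_y=0.70)

print(f"observed r:   {observed_r:.2f}")
print(f"true score ρ: {true_score_r:.2f}")
```

Because reliabilities cannot exceed 1, the corrected correlation is never smaller in magnitude than the observed one, which is consistent with the ρ column being larger than the r column at every percentile.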

Lovakov and Agadullina (2021) examined 12,170 correlation coefficients and 6,447 Cohen’s d statistics from 134 meta-analyses to document the typical effect sizes reported in social psychology research. They found that the 25th, 50th, and 75th percentiles corresponded to correlation coefficient values of 0.12, 0.24, and 0.41, respectively.

| Percentile | r |
| --- | --- |
| 10 | .04 |
| 20 | .10 |
| 30 | .14 |
| 40 | .20 |
| 50 | .24 |
| 60 | .30 |
| 70 | .37 |
| 80 | .45 |
| 90 | .57 |

On the basis of these results, I treat small, medium, and large correlations as correlation coefficients in the ranges of |r| < .15, .15 < |r| < .30, and |r| > .30, respectively. The precise boundaries of the ranges are somewhat arbitrary, but I think the guidelines are accurate enough to aid in quickly interpreting the magnitude of a given correlation coefficient in social science. These ranges are also the guidelines proposed by Hemphill et al. (2003), who found that they represented the bottom, middle, and upper thirds (respectively) of correlation coefficients reported in a couple of meta-analyses in psychological assessment and treatment.
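These guidelines can be written as a small helper for quick labeling. This is only a sketch: the function name is hypothetical, and because the ranges are stated with strict inequalities, the treatment of values exactly at .15 and .30 is my assumption (they are assigned to the higher category here, which matches how the tables later in this post place them).

```python
def classify_correlation(r):
    """Label a correlation as small, medium, or large by its absolute value.

    Boundary handling is an assumption: exactly .15 counts as medium
    and exactly .30 counts as large.
    """
    magnitude = abs(r)
    if magnitude > 1:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    if magnitude < 0.15:
        return "small"
    if magnitude < 0.30:
        return "medium"
    return "large"


for value in (0.05, -0.13, 0.22, 0.30, -0.58):
    print(f"r = {value:+.2f}: {classify_correlation(value)}")
```

The sign is ignored on purpose, since the labels describe the strength of a relationship and not its direction.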

Correlation coefficients of common variables

In this section, I report the correlation coefficients of some common variables that most people are familiar with.

Small Correlations (|r| < .15)

Bottom third of correlations.

| Variables | r | Source |
| --- | --- | --- |
| Emotional stability and grades | .02 | Poropat (2009), Table 1 |
| Education effects and crime | .04 | Pratt et al. (2005), Table 2 |
| IQ and happiness | .05 | Strenze (2015), Table 25.1 |
| Agreeableness and grades | .07 | Poropat (2009), Table 1 |
| Academic performance and income | .09 | Strenze (2007), Table 1 |
| Years of education and job performance | .10 | Schmidt and Hunter (1998), Table 1 |
| Lead and crime | .11 | Higney et al. (2021) |
| Openness to new experience and grades | .12 | Poropat (2009), Table 1 |
| Socioeconomic status and crime | −.13 | Pratt et al. (2005), Table 2 |

Medium Correlations (.15 < |r| < .30)

Middle third of correlations.

| Variables | r | Source |
| --- | --- | --- |
| SES and college GPA | .15 | Richardson et al. (2012), Table 5 |
| Years of job experience and job performance | .18 | Schmidt and Hunter (1998), Table 1 |
| Parental income and offspring income | .20 | Strenze (2007), Table 1 |
| Inequality and crime | .21 | Pratt et al. (2005), Table 2 |
| IQ and college GPA | .21 | Richardson et al. (2012), Table 5 |
| SES and grades (K-12) | .22 | Harwell et al. (2016) |
| Conscientiousness and grades | .22 | Poropat (2009), Table 1 |
| Youth IQ and income | .23 | Strenze (2007), Table 1 |
| Poverty and crime | .25 | Pratt et al. (2005), Table 2 |
| Family disruption and crime | .26 | Pratt et al. (2005), Table 2 |

Large Correlations (.30 < |r| < .45)

Top third of correlations.

| Variables | r | Source |
| --- | --- | --- |
| Parental income and offspring cognitive ability | .30 | Rindermann and Ceci (2018), Table 3 |
| Conscientiousness and job performance | .31 | Schmidt and Hunter (1998), Table 1 |
| SAT score and college GPA | .33 | Richardson et al. (2012), Table 5 |
| College GPA and job performance | .36 | Roth et al. (1996), Table 2 |
| Academic performance and occupational status | .37 | Strenze (2007), Table 1 |
| IQ and skill acquisition in work training | .38 | Strenze (2015), Table 25.1 |
| IQ and job performance (work sample tests) | .38 | Roth et al. (2005), Table 4 |
| Parental SES and offspring occupational status | .38 | Strenze (2007), Table 1 |
| County-level unemployment rate and violent crime | .39 | Beaver and Wright (2011), Table 1 |
| ACT score and college GPA | .40 | Richardson et al. (2012), Table 5 |
| High school GPA and college GPA | .41 | Richardson et al. (2012), Table 5 |
| Age-2 and age-17 IQ | .43 | Yu et al. (2018), Table 2 |
| IQ and non-military job performance (work sample tests) | .44 | Roth et al. (2005), Table 4 |
| Youth IQ and occupational status | .45 | Strenze (2007), Table 1 |
| Parental education and offspring cognitive ability | .45 | Rindermann and Ceci (2018), Table 3 |

Really Large Correlations (.45 < |r| < .60)

Top 10-20% of correlations.

| Variables | r | Source |
| --- | --- | --- |
| Age-12 BMI and age-40 BMI | .49 | Hulens et al. (2001) |
| Peer ratings and job performance | .49 | Schmidt and Hunter (1998), Table 1 |
| Structured employment interviews and job performance | .51 | Schmidt and Hunter (1998), Table 1 |
| IQ and job performance | .53 | Schmidt and Hunter (1998), Table 1 |
| Youth academic performance and educational attainment | .53 | Strenze (2007), Table 1 |
| IQ and grades (K-12) | .54 | Roth et al. (2015) |
| Parental SES and educational attainment | .55 | Strenze (2007), Table 1 |
| Youth IQ and educational attainment | .56 | Strenze (2007), Table 1 |
| Age-17 BMI and age-40 BMI | .56 | Hulens et al. (2001) |
| County-level IQ and violent crime | −.58 | Beaver and Wright (2011), Table 1 |

Really, Really, Large Correlations (|r| > .60)

Top 5% of correlation coefficients. Correlations typically become this large only when the same construct is measured at different times, or when different measures of the same construct are compared.

| Variables | r | Source |
| --- | --- | --- |
| Age-6 and age-17 IQ | .67 | Yu et al. (2018), Table 2 |
| Age-11 and age-80 IQ | .73 | Deary et al. (2004), page 134 |
| SAT verbal and ACT verbal | .74 | Koenig et al. (2008), Table 2 |
| SAT math and SAT verbal | .75 | Koenig et al. (2008), Table 2 |
| County-level female-headed household rate and violent crime | .76 | Beaver and Wright (2011), Table 1 |
| Height at ages 5-11 and height at age 45 | .77 | Richmond-Rakerd et al. (2021), page 4 |
| Age-13 BMI and age-18 BMI | .77 | Hulens et al. (2001) |
| IQ and ACT total | .77 | Koenig et al. (2008), Table 2 |
| IQ and SAT total | .82 | Koenig et al. (2008), Table 2 |
| Age-12 and age-17 IQ | .82 | Yu et al. (2018), Table 2 |
| Age-20 and age-38 IQ | .85 | Larsen et al. (2008) |
| SAT math and ACT math | .86 | Koenig et al. (2008), Table 2 |
| SAT total and ACT total | .87 | Koenig et al. (2008), Table 2 |
| Age-30 BMI and age-40 BMI | .91 | Hulens et al. (2001) |
| Self-reported height and actual height | .95 | Hodge et al. (2020) |

I’ll probably add more sections to this post in the future. Planned topics include other methods to interpret the magnitude of correlation coefficients (e.g., using consequences on personnel selection rather than using benchmarks) and alternative effect size statistics (e.g., regression coefficients).
