Wednesday, July 14, 2010

The China Study: With a large enough sample, anything is significant

There have been many references recently on diet and lifestyle blogs to the China Study. Except that they are not really references to the China Study, but to a blog post by Denise Minger. This post is indeed excellent, and brilliant, and likely to keep Denise from “having a life” for a while. That it caused so much interest is a testament to the effect that a single brilliant post can have on the Internet. Many thought that the Internet would lead to a depersonalization and de-individualization of communication. Yet, most people are referring to Denise’s post, rather than to “a great post written by someone on a blog.”

Anyway, I will not repeat here what Denise said in her post. My goal with this post is a bit more general, and applies to the interpretation of quantitative research results as a whole. This post is a warning regarding “large” studies. These are studies whose main claim to credibility is that they are based on a very large sample. The China Study is a good example. It prominently claims to have covered 2,400 counties and 880 million people.

There are many different statistical analysis techniques that are used in quantitative analyses of associations between variables, where the variables can be things like dietary intakes of certain nutrients and incidence of disease. Generally speaking, statistical analyses yield two main types of results: (a) coefficients of association (e.g., correlations); and (b) P values (which are measures of statistical significance). Of course there is much more to statistical analyses than these two types of numbers, but these two are usually the most important ones when it comes to creating or testing a hypothesis. The P values, in particular, are often used as a basis for claims of significant associations. P values lower than 0.05 are normally considered low enough to support those claims.

In analyses of pairs of variables (usually known as “bivariate” analyses, and sometimes loosely called “univariate”), the coefficients of association give an idea of how strongly the variables are associated. The higher these coefficients are, the more strongly the variables are associated. The P values tell us how likely it is that an association at least as strong as the one observed would show up by chance alone, given a particular sample, if the variables were not actually associated. For example, if a P value is 0.05, or 5 percent, there is a 5 percent probability of seeing an association that strong by chance when no real association exists. Some people like to say that, in a case like this, one has a 95 percent confidence that the association is real, although that is a loose reading of what the P value measures.

One thing that many people do not realize is that P values are very sensitive to sample size. For example, with a sample of 50 individuals, a correlation of 0.6 may be statistically significant at the 0.01 level (i.e., its P value is lower than 0.01). With a sample of 50,000 individuals, a much smaller correlation of 0.06 may be statistically significant at the same level. Both correlations may be used by a researcher to claim that there is a significant association between two variables, even though the first association (correlation = 0.6) is 10 times stronger than the second (correlation = 0.06).
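The sensitivity of P values to sample size can be checked with a few lines of Python. This is only a rough sketch: it uses the Fisher z-transformation with a normal approximation rather than the exact test a statistics package would run, and the function name `approx_p_value` is made up for this illustration.

```python
import math


def approx_p_value(r, n):
    """Approximate two-sided P value for a Pearson correlation r and sample size n.

    Assumption: Fisher z-transformation with a normal approximation,
    which requires n > 3 and |r| < 1. It is not the exact t-test.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    # Two-sided tail probability of a standard normal variable
    return math.erfc(abs(z) / math.sqrt(2))


# Same correlations and sample sizes as in the text, plus r = 0.06 with n = 50
for r, n in [(0.6, 50), (0.06, 50), (0.06, 50_000)]:
    print(f"r = {r:<4}  n = {n:>6}  approximate P = {approx_p_value(r, n):.3g}")
```

Under this approximation, a correlation of 0.06 is nowhere near significant with 50 individuals, yet it passes the 0.01 threshold easily with 50,000, even though the strength of the association has not changed at all.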

So, with very large samples, cherry-picking results is very easy. It has been argued sometimes that this is not technically lying, since one is reporting associations that are indeed statistically significant. But, by doing this, one may be omitting other associations, which may be much stronger. This type of practice is sometimes referred to as “lying with statistics”.

With a large enough sample one can easily “show” that drinking water causes cancer.

This is why I often like to see the coefficients of association together with the P values. For simple variable-pair correlations, I generally consider a correlation around 0.3 to be indicative of a reasonable association, and a correlation at or above 0.6 to be indicative of a strong association. I reach these conclusions regardless of the P value. Whether these would indicate causation is another story; one has to use common sense and good theory.

If you take my weight from 1 to 20 years of age, and the price of gasoline in the US during that period, you will find that they are highly correlated. But common sense tells me that there is no causation whatsoever between these two variables.
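The weight and gasoline example can be mimicked with two made-up series that simply both trend upward over 20 years. The numbers below are hypothetical and chosen only for illustration; they are not real weight or price data, and the `pearson` helper is a plain textbook implementation written for this sketch.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


years = range(1, 21)

# Hypothetical series: each is an upward trend plus a small unrelated wiggle.
weight = [10 + 3.0 * t + 2.0 * math.sin(t) for t in years]
gas_price = [0.5 + 0.05 * t + 0.05 * math.cos(2 * t) for t in years]

r = pearson(weight, gas_price)
print(f"correlation between the two trending series: {r:.3f}")
```

The two series share nothing except the passage of time, yet the correlation comes out very high. Any pair of variables that both drift in the same direction over a period will tend to behave this way.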

There are a number of other issues to consider which I am not going to cover here. For example, relationships may be nonlinear, and standard correlation-based analyses are “blind” to nonlinearity. This is true even for advanced correlation-based statistical techniques such as multiple regression analysis, which control for competing effects of several variables on one main dependent variable. Ignoring nonlinearity may lead to misleading interpretations of associations, such as the association between total cholesterol and cardiovascular disease.
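A small sketch of that “blindness”, using a made-up U-shaped relationship in which y is completely determined by x: the linear correlation is still zero. The data and the `pearson` helper are assumptions made for this illustration and are not taken from any study.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical U-shaped relationship: y depends on x exactly, but not linearly.
x = list(range(-5, 6))
y = [v * v for v in x]

r_u_shape = pearson(x, y)
print(f"correlation for the U-shaped relationship: {r_u_shape:.3f}")
```

A correlation near zero here does not mean that x and y are unrelated. It only means that a straight line is the wrong shape for the relationship.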

Note that this post is not an indictment of quantitative analyses in general. I am not saying “ignore numbers”. Denise’s blog post in fact uses careful quantitative analyses, with good ol’ common sense, to debunk several claims based on, well, quantitative analyses. If you are interested in this and other more advanced statistical analysis issues, I invite you to take a look at my other blog. It focuses on WarpPLS-based robust nonlinear data analysis.