Tuesday, April 13, 2021

The mismeasure of genetic differentiation


Red Tree, Piet Mondrian (1908-10)

If we look at SNP alleles associated with educational attainment, we see differences between Europeans and sub-Saharan Africans. Is genetic drift the cause? Or natural selection?



IQ has long been the yardstick of cognitive ability. As such, it describes phenotype, not genotype: it measures how your inborn potential has developed in your environment. Genotype is the inborn component of IQ. It can be inferred from twin studies, family studies, and adoption studies, but those approaches are indirect and far from perfect.


To measure genotype directly, we need to identify the alleles that affect the development of cognitive ability. We also need to measure the size of each allele’s effect. Recently, much progress has been made. By using genome-wide association studies (GWAS), researchers have identified many alleles that are associated with educational attainment (EA). EA is not quite the same as IQ—it also includes things like sitting still in class and brownnosing the teacher—but it's a good approximation.


In the most recent study of this sort, Lee et al. (2018) identified 1,271 single-nucleotide polymorphisms (SNPs) that are significantly associated with high EA in a sample of over one million people of European ancestry. Together, the SNPs can explain 11-13% of the variance in EA among individuals. When a person's alleles at these SNPs are weighted by effect size and summed, the result is a new yardstick: the "polygenic score."


The polygenic score is more accurate for populations than for individuals. If we compare the mean polygenic score of a population and its mean IQ, the correlation is 0.9 (Piffer 2019). This high correlation is due to the logic of sampling: to estimate the mean cognitive ability of a population, we don't have to identify all of the relevant SNPs, just a large enough sample.


Like mean IQ, the mean polygenic score differs among human populations. It seems to have increased during the northward spread of modern humans out of Africa and into the temperate zone of Europe and Asia, with East Asians having the highest scores. This geographic pattern is in line with IQ data. The mean polygenic score is also very high among Ashkenazi Jews and Finns, again in line with IQ data (Piffer 2019).



Kevin Bird’s paper


The above findings have been disputed by the American researcher Kevin Bird in a recent paper (Bird 2021). Although Europeans and sub-Saharan Africans differ in allele frequencies at SNPs associated with educational attainment, he argues that these differences correspond to small differences in cognitive ability. In fact, he argues, they are more consistent with genetic drift than with natural selection.


To support his argument, he performed two analyses of the data: an Fst calculation and a test for polygenic selection. In my opinion, both analyses have serious problems.


The Fst


This is the most common measure of genetic differentiation. If the Fst is low, differentiation is trivial and consistent with genetic drift. If it is high, differentiation is significant and consistent with natural selection.
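
For readers who want the arithmetic: in its simplest form, Fst compares the heterozygosity expected in the pooled population (Ht) with the mean heterozygosity expected within each subpopulation (Hs), as Fst = (Ht - Hs) / Ht. Here is a minimal Python sketch for a single biallelic SNP and two equally weighted populations. The function name is my own, and this is the textbook definition, not necessarily the estimator used in any of the papers discussed here.

```python
def fst_two_populations(p1: float, p2: float) -> float:
    """Fst for one biallelic SNP, given an allele's frequency in two populations.

    Assumes the two populations are weighted equally (an assumption of this sketch).
    """
    p_bar = (p1 + p2) / 2                       # pooled allele frequency
    h_total = 2 * p_bar * (1 - p_bar)           # Ht: expected heterozygosity, pooled
    if h_total == 0:
        return 0.0                              # allele fixed or absent in both groups
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # Hs: mean within-group
    return (h_total - h_within) / h_total


# Identical frequencies give no differentiation; opposite fixation gives complete.
print(fst_two_populations(0.5, 0.5))
print(fst_two_populations(0.0, 1.0))
```

Published estimates usually average over many SNPs and correct for sample size, so real values will not match this toy calculation exactly.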


For SNPs associated with EA, Kevin Bird reports an Fst of 0.111. Is that low or high? When Sewall Wright (1978, pp. 82-85) created this measure, he defined four categories of differentiation:


0 to 0.05 - little genetic differentiation

0.05 to 0.15 - moderate genetic differentiation

0.15 to 0.25 - great genetic differentiation

0.25 to 1 - very great genetic differentiation
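
Those four categories can be written as a simple lookup. This is only a sketch: the function name is hypothetical, and because the ranges as listed share their endpoints, I have assumed that a boundary value (0.05, 0.15, 0.25) falls into the higher category.

```python
def wright_category(fst: float) -> str:
    """Map an Fst value to the category of differentiation listed above.

    Assumption of this sketch: boundary values go to the higher category.
    """
    if not 0.0 <= fst <= 1.0:
        raise ValueError("Fst must lie between 0 and 1")
    if fst < 0.05:
        return "little"
    if fst < 0.15:
        return "moderate"
    if fst < 0.25:
        return "great"
    return "very great"


print(wright_category(0.111))
```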


Those categories are widely cited in the literature. A search in Google Scholar for "moderate genetic differentiation" and "0.05 - 0.15" shows over two hundred papers.


So does an Fst of 0.111 mean moderate genetic differentiation? Not according to Kevin Bird, who sees nothing at all below a benchmark of 0.118. That benchmark may be valid, but it cannot be easily verified and does not appear elsewhere in the literature. Nor does Kevin explain why it is better than the ones put forward by Sewall Wright. In fact, he makes no reference to them.


One may also question the Fst of 0.111. For the data source, the reader is referred to Lee et al. (2018), but that study was done only with European subjects. Moreover, Kevin Bird used 1,259 SNPs to calculate that Fst, even though he found only 685 SNPs that had data on both Africans and Europeans.


The Fst of 0.111 seems to measure the differentiation of those SNPs within Europeans. That value is about what one would expect, but it says nothing about differentiation between Europeans and sub-Saharan Africans.


The polygenic selection analysis


The other analysis is more on topic. Kevin Bird compared European data with African data as follows:


1. First, he looked through the 1000 Genomes Project for SNP data on Europeans and sub-Saharan Africans. He found data on five European-descended populations (Utah residents, Tuscans, Finns, British, Iberians) and five African populations (Yoruba, Luhya, Gambians, Mende, Esan). The two datasets had information on 685 of the 1,271 SNPs associated with educational attainment.


2. For each SNP, he noted the allele frequencies in Europeans and the allele frequencies in sub-Saharan Africans.


3. He calculated the differences in allele frequencies between the two groups. He then weighted the differences for the allele's effect size (its estimated positive or negative effect on educational attainment). For each allele, he used two different estimates of effect size: one from between-family data and the other from within-family data.


4. Alongside this list of weighted alleles, he created a second list to simulate genetic drift by randomly flipping the sign of each effect size in each of 10,000 permutations.


5. When effect size was calculated from between-family data, the two lists clearly differed from each other. When it was calculated from within-family data, the overall difference was much smaller and easily explained by genetic drift.
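
The five steps above can be sketched in a few lines of Python. This is my own reconstruction from the description, not Kevin Bird's code: the function name, the toy numbers, the two-sided comparison, and the "+1" correction in the p-value are all assumptions.

```python
import random


def sign_flip_test(freq_diffs, effect_sizes, n_permutations=10_000, seed=0):
    """Compare the observed weighted difference with sign-flipped versions of itself.

    freq_diffs:   per-SNP allele frequency difference between the two groups
    effect_sizes: per-SNP estimated effect on educational attainment
    Returns the observed statistic and a two-sided permutation p-value.
    """
    rng = random.Random(seed)
    weighted = [d * b for d, b in zip(freq_diffs, effect_sizes)]
    observed = sum(weighted)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly flip the sign of each SNP's effect to mimic undirected drift.
        permuted = sum(w * rng.choice((-1, 1)) for w in weighted)
        if abs(permuted) >= abs(observed):
            extreme += 1
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical toy data: four SNPs, not taken from any real dataset.
toy_diffs = [0.10, -0.05, 0.20, 0.02]
toy_effects = [0.02, 0.01, 0.015, -0.01]
observed, p_value = sign_flip_test(toy_diffs, toy_effects, n_permutations=1000)
print(observed, p_value)
```

If the weighted differences all point in the same direction, few sign-flipped lists will match the observed total and the p-value will be small. If they cancel out, almost every permutation will do as well and the p-value will approach 1.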


Bird (2021) prefers the second dataset to the first, whereas Piffer (2019) prefers the first. Who is right? All things being equal, data should come from within families. There is less statistical noise because siblings have similar upbringings. With less noise, group differences can more easily be identified.


Yet, here, we have the opposite. We see a significant difference between Europeans and Africans in the between-family data, but not in the within-family data. Why? The reason is that the between-family data came from over a million subjects whereas the within-family data came from 20,000 sibling pairs. Being smaller, the second dataset had a lot more noise. Sure, there should have been less noise, all things being equal. But some things weren't.
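
A back-of-the-envelope calculation shows how much the sample sizes matter. As a rough assumption, the standard error of an effect estimate shrinks with the square root of the sample size. I treat each sibling pair as one observation and take the figure of 1.1 million from the title of Lee et al. (2018). Real within-family estimates have their own variance properties, so this is only an order-of-magnitude sketch.

```python
import math


def relative_standard_error(n_large: int, n_small: int) -> float:
    """How many times larger the standard error is in the smaller sample,

    assuming standard error is proportional to 1 / sqrt(n).
    """
    return math.sqrt(n_large / n_small)


ratio = relative_standard_error(1_100_000, 20_000)
print(round(ratio, 1))
```

Under these assumptions, the within-family effect sizes would be roughly seven times noisier than the between-family ones.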



Doing the comparison again but better


I suspect Kevin Bird still prefers within-family data. Fine. Let's repeat the comparison with a much larger sample of sibling pairs. There would then be less noise and probably a significant difference between African and European alleles in their effect on educational attainment. Kevin seems to anticipate this eventuality:


While the results presented here are more consistent with neutral evolution rather than divergent natural selection, it is not possible to rule out that data sets with more power could present different results. Additionally, although within-family effect sizes are recommended over between-family effect sizes, if the within-family effect sizes are re-estimated for SNPs ascertained by a between-family GWAS, there is still likely to be some level of confounding from population structure. (Bird 2021, p. 7)


He elaborates on the last point:


[...] the [polygenic] scores might be biased by a variety of factors, including the nonrandom ways that society is geographically structured [...]. For instance, Black people in the US, for reasons unrelated to genetics, live in areas with poorer air quality and more exposure to environmental toxins (Bird 2021, p. 8)


Yet, as he notes further on, these SNP alleles were identified only in European subjects, and their effects on educational attainment were estimated only from European data. So how could different alleles among Europeans be spuriously associated with differences in educational attainment among Europeans because of socioeconomic deprivation among Black Americans? Where and when do the latter come into this presumably spurious association?


Kevin Bird is right to point out that the allele effects were calculated from European data and may be less applicable to people of other origins. In fact, there is growing evidence that the genetic architecture of cognition is different in sub-Saharan Africans (Frost 2019). By ignoring that factor, however, we introduce even more noise into the data and muddle even more any differences that may exist between Africans and Europeans. The data may indeed be of low quality, but that shortcoming would, if anything, obscure group differences. Again, Kevin is making a coherent point within an incoherent argument.



Other ways?


There are other ways to distinguish between genetic drift and natural selection. One way is to measure the ratio of nonsynonymous alleles to synonymous alleles. If a trait has little functional value and is thus vulnerable to genetic drift, nonsynonymous alleles will tend to proliferate and become as numerous as synonymous alleles (Ohta 1995). Of course, if nonsynonymous alleles greatly outnumber synonymous alleles, there may be natural selection for diversity (Rana et al. 1999).


An SNP, by its very nature, has alleles that differ from each other by only one base substitution, and this fact limits our ability to distinguish between genetic drift and natural selection. It would thus be interesting to identify genetic polymorphisms that are associated with educational attainment but span several nucleotides.


If such a polymorphism is undergoing genetic drift, the most frequent alleles will be the ancestral allele and those that differ from it by one base substitution. The less frequent ones will be those that differ by two or more base substitutions. In short, the frequency of an allele will be inversely related to the number of base substitutions that separate it from the ancestral allele.


The picture is different with natural selection. The most frequent alleles will not necessarily be the ones that differ the least from the ancestral allele. If allele frequency is graphed as a function of base substitutions, the result will not be a smoothly decreasing exponential curve. The most successful allele may differ from the ancestral one by several base substitutions.
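
The contrast between the two pictures can be put as a crude check: does mean allele frequency fall steadily as the number of base substitutions from the ancestral allele rises? The sketch below is only an illustration. The function names and the toy alleles are hypothetical, and a real analysis would need a proper statistical test, not a yes-or-no answer.

```python
def mean_frequency_by_distance(alleles):
    """alleles: list of (substitutions_from_ancestral, frequency) pairs.

    Returns the mean allele frequency at each substitution distance, in order.
    """
    grouped = {}
    for distance, freq in alleles:
        grouped.setdefault(distance, []).append(freq)
    return {d: sum(f) / len(f) for d, f in sorted(grouped.items())}


def decreases_with_distance(alleles) -> bool:
    """True if mean frequency never rises as substitution distance increases."""
    means = list(mean_frequency_by_distance(alleles).values())
    return all(a >= b for a, b in zip(means, means[1:]))


# Hypothetical toy alleles, not real data.
drift_like = [(0, 0.60), (1, 0.15), (1, 0.15), (2, 0.06), (3, 0.04)]
selection_like = [(0, 0.20), (1, 0.10), (3, 0.70)]
print(decreases_with_distance(drift_like))
print(decreases_with_distance(selection_like))
```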





Bird, K.A. (2021). No support for the hereditarian hypothesis of the Black-White achievement gap using polygenic scores and tests for divergent selection. American Journal of Physical Anthropology. Feb. 1-12, DOI: 10.1002/ajpa.24216.



Frost, P. (2019). Differences in the genetic architecture of cognition? Evo and Proud, September 25.



Lee, J.J., Wedow, R., Okbay, A., Kong, E., Maghzian, O., Zacher, M., et al. (2018). Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nature Genetics 50(8): 1112-1121.



Ohta, T. (1995). Synonymous and nonsynonymous substitutions in mammalian genes and the nearly neutral theory. Journal of Molecular Evolution 40(1): 56-63.


Piffer, D. (2019). Evidence for Recent Polygenic Selection on Educational Attainment and Intelligence Inferred from GWAS Hits: A Replication of Previous Findings Using Recent Data. Psych 1(1): 55-75. https://doi.org/10.3390/psych1010005


Rana, B.K., D. Hewett-Emmett, L. Jin, B.H.J. Chang, N. Sambuughin, M. Lin, et al. (1999). High polymorphism at the human melanocortin 1 receptor locus. Genetics 151(4): 1547-1557.



Wright, S. (1978). Evolution and the Genetics of Populations, Volume 4. University of Chicago Press, Chicago, IL.

1 comment:

Sean said...

This is an interesting post, but quite technical so I hesitated to comment on it. Would consanguineous marriage affect the result of a test for drift?