Saturday, November 4, 2017

Very Preliminary Quantitation of Human Genetic Structure

Genetic differentiation increases with higher levels of genetic integration.

Ted Sallis

Introduction

I have finally performed some preliminary analyses of genetic structure – which I (predominantly) define as the association of alleles at different loci, an association that differs between individuals, between families, and between ethnies. The lack of genetic structure calculations is one of the two major genetics-based weaknesses of On Genetic Interests, the other being the reliance on Fst - which is not a real measure of genetic differentiation - rather than on genetic kinship data. I’m not going to directly get into genetic kinship here (but note that the “genepool” level of analysis of DifferInt does give sort of a measure of genetic kinship, if the numbers are “crunched” appropriately), but since I’ve been discussing genetic structure for so long, here I present a minimal proof-of-principle of genetic structure quantitation with some human SNP data. This is not an optimal study, which needs to be performed by those with the time, expertise, databases, and computational resources to do it well and efficiently (the same goes for global genetic kinship assays). Also, the methodology itself is not optimal and doesn’t cover the entirety of the genetic structure concept, but it does at least cover the underlying core principle.

Methods

The DifferInt program dealing with genetic integration (1-3) – based on the work of Gillet and Gregorius on “genetic integration” (2) - was utilized, as well as some lists of human SNPs and publicly available HapMap population SNP frequency data. Thus, HapMap populations were analyzed. Europeans (EURO) included CEU (Utah residents of Northern and Western European ancestry) and TSI Tuscans; East Asians (EASIA) included CHB and CHD Chinese and JPT Japanese, as well as a separate set of Chinese samples previously named HCB (instead of CHB); South Asians (SOUTH ASIAN) included GIH Gujarati Indians; Africans (AFRICA) included YRI Nigerians, ASW (African ancestry in the southwestern USA), and LWK and MKK Kenyans; and there was also MEX (MEXICAN: Mexican ancestry). I also produced a CEU-YRI hybrid by taking ~ ½ the alleles from CEU and ~ ½ from YRI – obviously, this is NOT how real admixture would take place (there would be mixing of both alleles at single loci as well as multiple loci, as well as other important differences consequent to sexual reproduction) – this is merely a very crude proof-of-principle.
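The construction of the crude hybrid can be sketched as follows. This is only an illustration under stated assumptions: the function name `make_hybrid`, the toy SNP identifiers, and the toy genotypes are hypothetical, and the text does not say which loci were taken from which parent, only the split reported in the Results (26 loci from CEU, 25 from YRI).

```python
def make_hybrid(parent_a, parent_b, loci_from_a):
    """Build a crude 'hybrid' consensus set.

    Whole-locus genotypes are copied from parent_a for the loci listed in
    loci_from_a and from parent_b for every other locus. No mixing of
    alleles within a locus takes place, which is why this is not a model
    of real admixture.
    """
    chosen = set(loci_from_a)
    return {
        locus: (parent_a[locus] if locus in chosen else parent_b[locus])
        for locus in parent_a
    }


# Toy data (hypothetical, five loci instead of 51).
parent_a = {"snp1": "AA", "snp2": "CC", "snp3": "GG", "snp4": "TT", "snp5": "AG"}
parent_b = {"snp1": "GG", "snp2": "CT", "snp3": "GG", "snp4": "CC", "snp5": "AA"}

# Assumption: take the first three loci from parent_a, the rest from parent_b.
hybrid = make_hybrid(parent_a, parent_b, ["snp1", "snp2", "snp3"])
print(hybrid)
```

Which loci come from which parent is passed in explicitly, so any selection rule (first half, alternating, random) can be tried without changing the function.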

Ideally, DifferInt populations would be ethnic groups and within each population there would be the individuals of that population, each with their distinct genotypes. Due to the limitations of this study, the design was somewhat different and at a broader level of analysis. Here, the populations are continental population groups (races) and the “individuals” within the populations are the ethnic groups themselves – actually the consensus genotypes at each locus for that ethnic group. Therefore, the entire set of consensus genotypes for an ethnic group is what is being called a single “individual” here. The consensus genotypes are such that for each gene locus, the most frequent genotype at that locus for the ethnic group was chosen. So, for example, if a locus has AA – 0.2, AG – 0.3, GG – 0.5, then GG was the genotype chosen in this case. This results in a “model” individual of a consensus ethnic genotype set. This is sub-optimal for three related reasons: it doesn’t cover the intra-ethnic group variation; it doesn’t cover the frequency distributions of genotype per locus that are, of course, very important; and there are cases where the most frequent genotype is only slightly more frequent than the second most frequent genotype. SNPs used are those for which I found genotype data for all twelve ethnic groups evaluated; the SNPs were taken from publicly available information sources. 51 SNPs of my initial list fit the requirements.
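The consensus-genotype step described above amounts to taking the most frequent genotype at each locus. A minimal sketch, assuming genotype frequencies are held in plain dictionaries (the function names and the second, made-up locus are hypothetical; the first locus is the example from the text):

```python
def consensus_genotype(freqs):
    """Return the most frequent genotype at one locus.

    freqs maps a genotype label (e.g. "AG") to its frequency in one group.
    """
    return max(freqs, key=freqs.get)


def top_two_margin(freqs):
    """Frequency gap between the most and second most frequent genotypes."""
    ranked = sorted(freqs.values(), reverse=True)
    return ranked[0] - ranked[1]


def consensus_set(group_freqs):
    """Build the 'model individual': one consensus genotype per locus."""
    return {locus: consensus_genotype(f) for locus, f in group_freqs.items()}


# First locus: the example from the text. Second locus: hypothetical values.
group = {
    "snp1": {"AA": 0.2, "AG": 0.3, "GG": 0.5},
    "snp2": {"CC": 0.45, "CT": 0.44, "TT": 0.11},
}
print(consensus_set(group))              # {'snp1': 'GG', 'snp2': 'CC'}
print(round(top_two_margin(group["snp2"]), 2))
```

The margin helper makes the third weakness visible: at the hypothetical second locus the consensus genotype wins by only 0.01, yet it is recorded as if it were fixed.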

Whenever there were two genotypes listed as being of equal frequency at a given locus for any group, I chose the genotype that was the same as that of the majority of the other groups. In other words, I was conservative, and when there was a choice, I always chose the option that minimized differences between the greatest number of groups. That serves two purposes: first, to ensure that whatever differences are observed are definitive, and not merely in part the result of cherry picking of genotypes; second, to obviate claims of a “racist agenda” in attempting to maximize group differences.
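The tie-breaking rule can be sketched like this. It is an illustration only: the function name and toy values are hypothetical, and it assumes ties are exact (the frequencies are listed as equal), with the other groups' chosen genotypes at the same locus supplied as a list.

```python
from collections import Counter


def consensus_with_tiebreak(freqs, other_group_choices):
    """Pick the most frequent genotype at a locus for one group.

    If several genotypes share the top frequency, prefer the tied genotype
    that the largest number of other groups chose at this locus, which
    minimizes the differences recorded between groups.
    """
    top = max(freqs.values())
    tied = [g for g, f in freqs.items() if f == top]
    if len(tied) == 1:
        return tied[0]
    counts = Counter(other_group_choices)
    # If the other groups do not separate the tied genotypes either,
    # max() falls back to the first tied genotype in dictionary order.
    return max(tied, key=lambda g: counts[g])


# Hypothetical locus where AA and AG are listed with equal frequency.
freqs = {"AA": 0.4, "AG": 0.4, "GG": 0.2}
others = ["AG", "AG", "AA", "GG"]
print(consensus_with_tiebreak(freqs, others))  # AG
```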

The three levels of analysis are the genepool (i.e., individual allele “bean-bag” genetics), single locus genotypes (association of alleles at one gene locus - i.e., from the two homologous chromosomes), and, most importantly and consistent with my general basic idea about genetic structure, the multilocus genotypes (the association of all the different single locus genotypes together, how genetic variants at multiple loci are associated with each other).  

Each of these levels can be analyzed with “elementary genic differences” or “neglecting elementary genic differences.”  Considering elementary genic differences is an analysis of the number of individual genes that differ in the types of alleles; from the DifferInt manual: “The genic difference between genetic types at the same level of integration is basically determined by the number of their individual genes that differ in allelic type.”

Neglecting elementary genic differences gives a discrete differentiation in which 0 means identity of all alleles at all loci and 1.0 means that the types “differ by at least one allele at one locus” - also from the manual: “By replacing the elementary genic difference between genetic types by the discrete difference, the measures…are based only on relative frequencies of the genetic types of the individuals in the population.” Differentiation is higher when measured with the second, discrete analysis as compared to the first one. Keep in mind that in my crude model the “individuals” are consensus genotypes based on SNP frequency data from ethnic data sets; thus it would make sense that measuring the “discrete difference” would work best for such coarse-grained, “single-point” distinct and discrete pooled samples. Just measuring the numbers of individual genes that differ by allelic type (elementary genic differences) is not measuring (in my opinion) genetic structure (as I define it) per se; measuring the relative frequencies (neglecting elementary genic differences) is somewhat closer to my conception, so I used that for my analysis.

Differentiation is at a scale of 0 (exactly alike, no differentiation) to 1.0 (completely differentiated).
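To make the three levels concrete, here is a simplified sketch for the special case of two single consensus “individuals.” It is not DifferInt's algorithm, which measures compositional differentiation among whole populations; the function names and toy genotypes are hypothetical. It only illustrates how the same pair of genotype sets yields a larger value as the unit of comparison moves from single genes, to single-locus genotypes, to the whole multilocus genotype.

```python
from collections import Counter


def allele_difference(g1, g2):
    """Fraction of the two genes at a locus that differ between genotypes.

    Genotypes are two-character strings such as "AG"; allele order is ignored.
    """
    shared = sum((Counter(g1) & Counter(g2)).values())
    return (2 - shared) / 2


def compare(ind1, ind2):
    """Return (genepool, single_locus, multilocus) differences.

    Assumes both individuals are dicts with the same set of loci.
    """
    loci = sorted(ind1)
    n = len(loci)
    # Gene level: average fraction of differing genes per locus.
    genepool = sum(allele_difference(ind1[l], ind2[l]) for l in loci) / n
    # Single-locus level, discrete: a locus counts fully if genotypes differ at all.
    single = sum(sorted(ind1[l]) != sorted(ind2[l]) for l in loci) / n
    # Multilocus level, discrete: 1 if the full genotype differs anywhere.
    multi = 1.0 if single > 0 else 0.0
    return genepool, single, multi


# Toy data (hypothetical): four loci, two of which differ.
ind1 = {"s1": "AA", "s2": "CT", "s3": "GG", "s4": "TT"}
ind2 = {"s1": "AG", "s2": "TC", "s3": "GG", "s4": "CC"}
print(compare(ind1, ind2))  # (0.375, 0.5, 1.0)
```

In this toy case the homozygote/heterozygote difference at `s1` counts as half a locus at the gene level but as a full locus at the genotype level, and any difference at all is enough to give 1.0 at the discrete multilocus level.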

A major flaw in my study is using consensus genotypes, as opposed to actually listing all the individual samples or being able to use allele frequency data (which DifferInt does not do) since, ultimately, we want a range of ethny-specific genotypes characteristic of each group; it would not be a single, fixed consensus genotype.  Using fixed consensus genotypes also makes it even more imperative to look at the discrete DifferInt metrics that neglect the “elementary genic differences.”

Results

(w/o EGD = without [neglecting] elementary genic differences – see above)

Genepool:

EURO/EASIA: 0.3603, EURO/AFRICA: 0.4779, EURO/SOUTH ASIAN: 0.1765, EURO/MEXICAN: 0.1863, EASIA/AFRICA: 0.4240, EASIA/SOUTH ASIAN: 0.2868, EASIA/MEXICAN: 0.2475, AFRICA/SOUTH ASIAN: 0.3922, AFRICA/MEXICAN: 0.4265, SOUTH ASIAN/MEXICAN: 0.2157

Note that the relative differentiation between groups at the genepool level is consistent with what is expected from standard population genetics studies.

Single-locus (w/o EGD):

EURO/EASIA: 0.5784, EURO/AFRICA: 0.8235, EURO/SOUTH ASIAN: 0.3039, EURO/MEXICAN: 0.3529, EASIA/AFRICA: 0.7026, EASIA/SOUTH ASIAN: 0.4951, EASIA/MEXICAN: 0.4461, AFRICA/SOUTH ASIAN: 0.6765, AFRICA/MEXICAN: 0.7108, SOUTH ASIAN/MEXICAN: 0.4118

There is a considerable increase in differentiation considering association of alleles at single loci.  This makes sense, particularly since in many cases differences between ethnies are at the level of whether alleles at the relevant loci are homozygous or heterozygous (which would also have obvious implications for traits affected in a dominant/recessive fashion by the SNP differences, or by gene sequences linked to such differences).

Multiple-locus (w/o EGD):

Was 1.0000 for all comparisons: complete differentiation.

That is not surprising, as combinations of alleles are going to be relatively specific in an ethny-dependent fashion, and the more loci looked at the greater the proneness to distinctiveness.  Of course, with the relatively blunt instrument of combining DifferInt with consensus genotype data, one would expect complete differentiation (with enough loci looked at) at almost any level of genetic difference. The problem here is that while this is informative in a qualitative sense, it doesn’t help gauge relative differences in the degree of “complete differentiation.”  For example, the “complete differentiation” comparing Europeans and South Asians when considering multiple loci is expected to be less than that between, say, Europeans and Africans.  The closer two groups are at the genepool level, the less “complete differentiation” should be expected at the multiple-locus level.  Note that single-locus differences (above) track well with the genepool differences, so the same should be expected at the multiple-locus level if a more scalable metric could be designed.

This lack of scalability at the multiple-locus level may be due to DifferInt itself and/or the type of crude, consensus, discrete SNP data I am using. If it were possible to include allele frequency data – which could be done with this program by actually separately listing each individual with their own genotype rather than a consensus – that would likely help. It would also help if the program itself were changed so that one could just directly include the frequency data for each allelic type rather than having to actually enter each individual as such (with the proper computational resources and programs I presume listing the individuals would be easy, but I formatted everything by hand, which was tedious). Alternatively, one could look at relative genetic structure by looking at SNP permutations (not the same type of permutation analysis that DifferInt can do). One could ask, to what degree are different permutations of allelic types more similar or different? That would be very informative for EGI purposes, if properly designed.

In any case, at least for the data used here, DifferInt was reasonably quantitatively scalable for genepool and single-locus analyses, while multiple-locus analyses were more qualitative.

Also let us look at the CEU/TSI intra-EURO comparison:

Genepool: 0.0392, Single-locus (w/o EGD): 0.0784, Multiple-locus (w/o EGD): 1.0000

Not surprisingly, the intra-European comparison exhibits little differentiation at the genepool level, which is doubled for single-locus integration. Multiple-locus again shows complete differentiation. On the one hand, this multiple-locus finding is expected, and makes sense, since the overall genetic structures of CEU and TSI are different. However, we once again observe the problem of scalability. EURO/AFRICA and CEU/TSI both exhibit complete differentiation at the multiple-locus level, but the two are obviously not equivalent. The combinations of alleles at multiple loci for CEU vs. TSI are going to be more similar than those for EURO vs. AFRICA, even if both cases exhibit complete differentiation. Again, this is a problem with the type of data I used as input, but I suspect as well it is a feature of the program itself. Consider that EURO/AFRICA differentiation at the genepool level was already at the level of 0.4779 and the maximum possible is 1.0000. So, it is obvious that the differences are not properly scalable, and likely would not be even with optimal data. In a properly scalable analytical system, the maximal possible differentiation with multiple-locus analysis should be many-fold greater than that of genepool (and associated with the number of loci examined). It is at the multiple-locus level that I find this program weakest, which is unfortunate since that is the most important level of analysis.

What the program considers is not perfectly aligned with my conception of genetic structure, but it is not orthogonal either.  There is considerable conceptual overlap; thus utilizing the program at least for a proof-of-principle demonstration is useful.  

The hybrid model (26 loci from CEU, 25 from YRI) is below.  This is, admittedly, highly artificial and not biologically realistic, but makes the general point (real admixture actually would be expected to cause even more differentiation than shown here):

Genepool: 

CEU/YRI: 0.5090, CEU/Hybrid: 0.2640, YRI/Hybrid: 0.2450

As CEU would be expected to be a bit more differentiated from YRI (and other Africans) than are TSI, the CEU/YRI genepool differentiation is slightly higher than the more general EURO/AFRICA, although another possibility is that the non-YRI Africans are closer to Europeans than are YRI. Hybrid values are in between CEU and YRI.

Single-locus (w/o EGD): 

CEU/YRI: 0.8341, CEU/Hybrid: 0.4510, YRI/Hybrid: 0.3922

This increases as expected.

Multiple-locus (w/o EGD): 1.0000 for all comparisons.

Complete differentiation, as expected, but again flawed by lack of scale. The “complete differentiation” between CEU/YRI would be expected to be larger than that between CEU/Hybrid, but that cannot be distinguished in this analysis. Nevertheless, this shows that merely increasing production of hybrid offspring cannot compensate for foregone parental kinship at the multiple-locus level.

Discussion

The findings (even with the limitations of the analysis) strongly support and extend the EGI concept; ethnies are more genetically differentiated at the level of higher genetic integration than at the mere “beanbag” genepool approach of individual alleles.

However, the gulf between family and ethny is also likely to be increased when genetic structure is taken into account, so one must be prudent in balancing investments.  However, keep in mind two things.  First, the typical ethny is larger than the typical extended family by five to eight orders of magnitude, so the ethny-ethny differences are of greater relative import than the family-ethny differences.  Second, differences will be expected to increase with genetic integration at every level of genetic interest – not only ethny-ethny and family-ethny, but also, for example, between self and family. But the family is needed for the self to have genetic continuity (although one can argue that the larger extended family could be dispensed with as long as the nuclear family is intact, or even that a human male just “spreads his seed” sans family structures), and one can argue that the family needs some sort of ethny, some sort of national culture, for secure familial genetic continuity.  Families mixed beyond wide racial lines are characterized by a deficit of genetic interests for the divergent members of such families, so the fact that those families are less dependent on national ethnies need not concern us, in any reasonable quest to maximize net genetic interests. So, in summary, when all is said and done, the findings here actually INCREASE the validity of ethnic genetic interests (with “ethnic” meaning ethny, which can include race). 

In the future, I may perform some additional analyses with this program and with these (and other) data, but the main point has already been established. Better yet would be if I can think of other methods of analyzing the data that yield more useful results. It would be helpful if others, with more time and computational resources (including better data sets), could generate additional DifferInt data as well as design better methods for assaying genetic structure (or find other existing programs; I will search for such as well).

This was a crude analysis, yet very useful I think to “break the ice” on the topic, especially since I can’t help but notice that no one else has been doing it (insofar as I know).  Do you have the time and resources to do better?  Great: Go to it.

Final Conclusions

1. Although the analysis has limitations, it demonstrates that human genetic differentiation increases as genetic structure is considered.

2. A considerable amount of this increase in genetic differentiation is at the single-locus level, which I had not previously considered as being that important.

3. Most importantly, the multiple-locus analysis shows complete differentiation.

4. A problem in this analysis is with the scalability of the multiple-locus determinations, and the program is unable to evaluate the entire genetic structure concept; better methods, analyzed with better data, are required.  In the meantime, it would be useful to even just have more in-depth analyses using DifferInt.

5. When all is said and done, this analysis, even with its limitations, extends the EGI concept.

References

1. https://www.uni-goettingen.de/en/124871.html

2. Gillet, E.M., Gregorius, H.-R. (2008) Measuring differentiation among populations at different levels of genetic integration. BMC Genetics 9, 60. http://dx.doi.org/10.1186/1471-2156-9-60

3. Gillet, E.M. (2013) DifferInt: Compositional differentiation among populations at three levels of genetic integration. Molecular Ecology Resources 13, 953-964. http://dx.doi.org/10.1111/1755-0998.12145