Monday, 7 October 2013

Group differences and within group variation

Consider the following problem. You are searching the Atlantic Ocean for a nuclear submarine which is about to launch 16 ballistic missiles, capable of destroying 32 cities. There is nothing to be seen on the surface of the ocean other than waves. However, the fast-moving submarine creates a tiny pressure wave which can be detected by satellite sensing. It shows locations in which you are very likely to find the submarine. Would you turn down this information because the waves caused by the submarine are an infinitesimal fraction of the total number of waves on the ocean?

Now consider the human genome. You are searching for your close cousins, by which you mean all those who are your first and second cousins. You assume that such persons will all have the usual apportionment of legs and arms and digestive processes and inner organs, but you are mostly interested in those aspects of character which may make them distinctive as individuals, and somewhat like you: personality, attitudes, intellect. Only some parts of the genome will be of interest to you. A few hundred genetic variants, or a few thousand at most, in particular combinations (the correlation of correlations) suffice for you to track down your cousins. You can confirm their identity through the usual documentary channels of birth certificates and surnames. By this genetic analysis you may be able to recognise even those relatives who have lost their official papers. Should you turn this down because only a small part of the genetic code was required to confirm their identity? I think most people would say that if the technique works accurately enough then those small signals are worth studying for their discriminative value.

Now consider a less happy scenario. You come upon a grisly murder scene which contains some scraps of human flesh. There is very little other material which might lead you to guess the identity of the victim. You extract DNA from the mortal remains, and find different DNA on the paper in which the flesh was wrapped. Would you like to know as much as possible about the victim and the putative murderer? Would it help to track down the missing person, and the last person who touched them, if you knew the races of the persons concerned?

The first scenario is entirely hypothetical, or at least, I assume it is. The second is where we are currently with population genetics. The third is where we are with genetic forensics. In that last case, finding the general geographic origin of the missing person or perpetrator is relatively easy. More detailed work allows guesses about membership of more precise geographic subgroupings, but with a higher error term. Population genetics allows the reconstruction of a genetic tree and can identify who your first and second cousins are.

So, that’s it in practice. We are able to use principal component analysis, factor analysis, cluster analysis and discriminant function analysis on genetic data. We can classify people by their relatedness, and determine group membership from fragments of DNA. Now, as the French say, we have to see if it works in theory.
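For readers who like to see the arithmetic, here is a minimal sketch of why many small signals can add up to a reliable classification. It is not one of the methods named above: it uses a simple sum of per-marker log-likelihood ratios (a naive Bayes style classifier) on entirely simulated data. The two groups, the marker frequencies, the 0.1 frequency gap, the independence of the markers, and the function name `simulate_accuracy` are all assumptions made up for illustration.

```python
import math
import random


def simulate_accuracy(n_markers, n_people=200, freq_gap=0.1, seed=0):
    """Fraction of simulated people assigned to the correct one of two groups.

    Everything here is hypothetical: markers are binary and independent,
    and the classifier is handed the true group frequencies.
    """
    rng = random.Random(seed)
    # Each marker's frequency differs between the groups by only freq_gap.
    freq_a = [rng.uniform(0.3, 0.6) for _ in range(n_markers)]
    freq_b = [p + freq_gap * rng.choice((-1, 1)) for p in freq_a]

    correct = 0
    for i in range(n_people):
        true_group = "A" if i % 2 == 0 else "B"
        freqs = freq_a if true_group == "A" else freq_b
        # Draw this person's markers from their own group's frequencies.
        markers = [1 if rng.random() < p else 0 for p in freqs]

        # Add up the weak evidence from every marker.
        llr = 0.0
        for m, pa, pb in zip(markers, freq_a, freq_b):
            llr += math.log(pa / pb) if m else math.log((1 - pa) / (1 - pb))

        guess = "A" if llr > 0 else "B"
        correct += guess == true_group
    return correct / n_people


if __name__ == "__main__":
    for n in (1, 50, 500):
        print(n, "markers: accuracy", simulate_accuracy(n))
```

Under these assumptions a single marker should do little better than a coin toss, while a few hundred markers, each individually almost uninformative, should classify nearly everyone correctly. Real genetic data are messier, since markers are correlated and frequencies must be estimated, so treat this as a toy and nothing more.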


  1. "the correlation of correlations" - nice phrase. Even better concept. Is it yours?

  2. As far as I know the phrase is mine. I have aphoristic aspirations. On the other hand, I suppose that the concept is very general, because it simply describes in statistical terms what is obvious from ordinary observation, which is that some things go together, and the presence of one partly predicts the others. Latent factors line up with these correlations of correlations. For example, if you were to entirely remove all skin pigmentation (as in African albinos) then the correlations between nose shapes, lip shapes, skull shapes, types of hair etc would still remain, plus the links which are being claimed for some gene clusters, though those need to be replicated. The same process leads to a well-specified diagnosis, and where you do not get such correlations of correlations, then you doubt the nosological status of the presumed disease entity.