Tuesday, 28 May 2013

ORIGINAL PAPER: A high-quality replication of Galton’s study one century later: Wilkinson & Allison (1989)

Michael A. Woodley, Jan te Nijenhuis & Raegan Murphy

In Woodley, te Nijenhuis, and Murphy (2013, in press) we argue that intelligence has declined substantially since Victorian times, based on a meta-analysis of simple reaction time. An exchange of ideas started at several blogs. We hereby reply to the blogposts of Scott Alexander and HBD Chick, reacting to an earlier post made by us.

A paper has come to our attention that provides strong evidence against the supposed representativeness problem across cohorts (e.g. Alexander, 2013). The study in question is that of Wilkinson and Allison (1989) using a sample of 5,324 visitors to the London Science Museum, which is situated at the exact site of Galton’s 19th century Anthropometric Laboratory in South Kensington.  All visitors undertook psychophysical testing on a simple reaction time-measuring apparatus, just as the people in Galton’s study did. Of these mixed-sex participants 1,189 were aged between 20 and 29, and are thus highly similar to the age range employed in our own study. Their simple RT mean was substantially slower than the weighted 1889 RT mean (245 ms vs. 194.06 ms), and furthermore the mean of this sample falls very close to the meta-regression-estimated mean across studies for the late 1980s (approximately 250 ms, see: Figure 1 in Woodley, te Nijenhuis & Murphy, 2013). The remarkable features of this study are the ways in which it replicates virtually every significant demographic aspect of Galton’s study.

There is the issue of a participation fee. Galton is known to have requested a participation fee of 3 pennies (approximately £5 in modern UK currency). The London Science Museum required the payment of an admissions fee right up until December 2001. Furthermore it still requires the payment of fees of £6 to £10 for access to some special exhibitions (London Science Museum, 2013a). The Wilkinson and Allison (1989) study was in fact conducted as part of a special exhibition entitled Medicines for Man, which was hosted by the Museum from the early 1980s (Medicines for Man Organizing Committee, 1980). Therefore participation fees were employed in the case of both studies.

There is strong evidence for the demographic convergence between the two studies. Johnson et al. (1985) indicate that whilst Galton’s sample included persons from all occupational and socioeconomic groups in Victorian London, it was nonetheless skewed towards students and professionals, and both groups could fairly be described as solidly White and middle class. In the last decades of the 20th century, museum attendance in the UK exhibited precisely the same skew in terms of sociodemography. Eckstein and Feist (1992) for example noted that most UK museum visitors are drawn from White and upper-middle-class populations. Furthermore Hooper-Greenhill (1994) observed that the largest minority ethnic groups in the UK (i.e. Asians and Afro-Caribbeans) are underrepresented amongst museum visitors. In acknowledging this issue, a House of Commons report in 2002 stated that free admission to museums would be unlikely to ‘… be effective in attracting significant numbers of new visitors from the widest range of socio-economic and ethnic groups’ (House of Commons report, 2002, p. 23).

The presence of this self-selection amongst visitors strongly harmonizes the studies of Galton and of Wilkinson and Allison. Add to this the fact that participation fees were employed in both cases, the fact that the geographical locations were exactly the same, and finally the fact that the age demographic of interest (i.e. twenty-somethings) was intensively sampled in both cases (i.e. 3,410 in the case of Silverman’s subset of Galton’s sample and 1,189 in the case of Wilkinson and Allison). The net result is that the studies become even more strongly convergent in terms of comparing like with like. Thus the argument that more heterogeneous samples visited museums in the 1980s, compared with more restricted samples in the 1880s, is critically weakened. The principal objections that can be leveled against this are as follows.

Firstly there is the issue of tourism. Most tourists to the UK are from the US and Europe (Tourism 3B), meaning that they are likely to be both ethnically and socioeconomically matched to the majority of the participants in this study (i.e. UK citizens). In fact, international arrivals in the United Kingdom in 1990 show that of the 439 million inbound tourists, 60% were European in origin and 21% emanated from the Americas. Hence, 81% of the tourist population came from groups which are highly ethnically similar to the British. Only 12% came from Asia and the Pacific with a meager 3% coming from the Middle East and 2% from Africa (Tourism 3B). In sum, it is unlikely that tourists being tested in the 1989 study were substantially ethnically different from the typical UK museum visitor. Based on current statistics from the Science Museum, the preponderance of visitors hail from the UK (69%) and the preponderance of those are from Greater London (44%; London Science Museum, 2013b). Historically, especially prior to the 1990s this figure would have been much higher, owing to far lower levels of tourism to the UK (in 1990 international tourism levels were less than half the current levels,  >940 million per year, BBC, 2013). This means that in all likelihood well over 70% of the participants in Wilkinson and Allison’s study would have been British, and the overwhelming majority of these would have been White, upper middle-class and from London. The overwhelming majority of the international visitors would have been ethnically and broadly socioeconomically matched to the British visitors.

Secondly there is the issue of instrumentation. Galton utilized a pendulum chronoscope with a temporal resolution of around a centisecond (i.e. 1/100th of a second, or 0.01 seconds). The electronic apparatus employed by Wilkinson and Allison in all likelihood had a higher resolution (post-1908 chronoscopy at least had the potential to be accurate to a single millisecond; Haupt, 2001). However, a resolution of only a centisecond in Galton’s apparatus cannot account for the substantial discrepancies between these two studies.

Thirdly, Galton’s sample was single person-single trial, whereas Wilkinson and Allison’s study employed two practice trials followed by 10 trials per person for the purposes of averaging. This protocol would almost certainly have enhanced the reliability of Wilkinson and Allison’s data relative to Galton’s (Jensen, 1980); however in both cases we are dealing with aggregates. Strong biases (i.e. jumping the gun vs. slow to start) have the potential to cancel each other out when employing these sorts of very large datasets, as these sources of error are distributed in a Gaussian fashion. This means that aggregate-level comparisons of means are appropriate between datasets that exhibit different coefficients of reliability, provided the Ns are very large.
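The cancelling-out argument can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true mean, the between-person spread and the trial-to-trial noise are all hypothetical values chosen for illustration, and the error is assumed to be symmetric around zero. A systematic bias in one apparatus or protocol would not cancel in this way.

```python
import random
import statistics

def simulate_sample(n_people, trials_per_person, rng,
                    true_mean=200.0, between_sd=20.0, trial_sd=40.0):
    """Return (mean, sd) of per-person scores, where each score is the
    average of `trials_per_person` noisy trials. All parameters are
    hypothetical and in milliseconds."""
    scores = []
    for _ in range(n_people):
        person_rt = rng.gauss(true_mean, between_sd)  # the person's "true" RT
        trials = [rng.gauss(person_rt, trial_sd) for _ in range(trials_per_person)]
        scores.append(statistics.fmean(trials))
    return statistics.fmean(scores), statistics.stdev(scores)

rng = random.Random(1889)  # fixed seed so the run is repeatable

# One trial per person (Galton-style) vs. ten averaged trials per person.
single_mean, single_sd = simulate_sample(3410, 1, rng)
averaged_mean, averaged_sd = simulate_sample(1189, 10, rng)

print(f"single trial : mean {single_mean:.1f} ms, sd {single_sd:.1f} ms")
print(f"ten trials   : mean {averaged_mean:.1f} ms, sd {averaged_sd:.1f} ms")
```

Under these assumptions the single-trial scores are much noisier person by person (a larger standard deviation), yet both sample means land close to the same underlying value, which is the point being made about aggregates.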

On this basis Wilkinson and Allison’s (1989) study must be considered an excellent replication of Galton’s study. Its mean reaction time for the relevant age cohort is almost precisely where our meta-regression predicts it should be. This is clearly strong supporting evidence for the robustness of the increase in simple RT latency produced to date and so puts even more nails in the coffin of those who argue that the trend can be accounted for by lack of representativeness across cohorts.

Alexander, S. S. (2013). The wisdom of the ancients. Slate Star Codex. URL: http://slatestarcodex.com/2013/05/22/the-wisdom-of-the-ancients/ [retrieved on 24/05/13]
BBC. (2013). GCSE Bitesize. Geography tourism trends. http://www.bbc.co.uk/schools/gcsebitesize/geography/tourism/tourism_trends_rev1.shtml
Eckstein, J. & Feist, A. (1992). Cultural Trends, 1991. London, Policy Studies Institute.
Haupt, E. J. (2001). Laboratories for experimental psychology: Gottingen’s ascendancy over Leipzig in the 1890s. In: Rieber, R. W., & Robinson, D. K. (Eds.), Wilhelm Wundt in History. The Making of a Scientific Psychology. (pp. 205-250). New York: Kluwer Academic.
Hooper-Greenhill, E. (1994). Museums and their Visitors. London, Routledge.  
House of Commons, Culture, Media and Sport Committee (2002). National Museums and Galleries: Funding and free admission. House of Commons, United Kingdom.
Jensen, A. R. (1980). Bias in Mental Testing. New York: Free Press.
Johnson, R. C., McClearn, G., Yuen, S., Nagoshi, C. T., Ahern, F. M., & Cole, R. E. (1985). Galton's data a century later. American Psychologist, 40, 875–892.
Medicines for Man Organizing Committee. (1980). Medicines for Man: A Booklet Based on an Exhibition at the Science Museum about Medicines - how They are Discovered and how They Work, how They are Made and Tested, how They are Prescribed and Dispensed, and how Laws Control Their Use. London, Science Museum.
No author (no date). Tourism 3 SB. Oxford University Press
London Science Museum. (2013a). http://www.sciencemuseum.org.uk/visitmuseum/prices.aspx [retrieved on 27/05/2013]
London Science Museum. (2013b). http://www.sciencemuseum.org.uk/about_us/history/facts_and_figures.aspx [retrieved on 27/05/2013]
Wilkinson, R. T., & Allison, S. (1989). Age and simple reaction time: Decade differences for 5,324 subjects. Journal of Gerontology, 44, 29–35.
Woodley, M. A., te Nijenhuis, J., & Murphy, R. (2013). Were the Victorians cleverer than us? The decline in general intelligence estimated from a meta-analysis of the slowing of simple reaction time. Intelligence. doi:10.1016/j.intell.2013.04.006

Give me a very bright child until he is 38 and I will give you civilization

While some researchers toil over birth cohorts, diligently tracking every child born in a particular week, others go searching for exceptional children and track them instead. More fun, I suppose. They do so to answer the question: is intelligence all it’s cracked up to be? Even more pedantically: is there any real difference between those who get a high score on an intelligence test compared with those who get an extremely high score, or is being reasonably bright good enough for most purposes in ordinary life?

Kell, Lubinski and Benbow (2013) “Who Rises to the Top? Early Indicators” Psychological Science 2013 24: 648 published online 26 March 2013
Youth identified before age 13 (N = 320) as having profound mathematical or verbal reasoning abilities (top 1 in 10,000) were tracked for nearly three decades. Their awards and creative accomplishments by age 38, in combination with specific details about their occupational responsibilities, illuminate the magnitude of their contribution and professional stature. Many have been entrusted with obligations and resources for making critical decisions about individual and organizational well-being. Their leadership positions in business, health care, law, the professoriate, and STEM (science, technology, engineering, and mathematics) suggest that many are outstanding creators of modern culture, constituting a precious human-capital resource. Identifying truly profound human potential, and forecasting differential development within such populations, requires assessing multiple cognitive abilities and using atypical measurement procedures. This study illustrates how ultimate criteria may be aggregated and longitudinally sequenced to validate such measures.

The Lubinski and Benbow gang have been tracking very bright kids for ages, and the results are clear: being brighter than 99.25 % of the general population, whilst all very well in itself and an almost guaranteed passport to a productive and happy life, doesn’t amount to all that much. Such people have a modest sufficiency of intellect, but no more. For a real impact, you have to be brighter than 99.75% of humanity. Those in the latter category have four times the impact of their less able colleagues. They publish more, have more doctorates, register more patents, and have more impact on their disciplines. How can such a small margin make such a difference? Well, once you are that far out on the right tail of the normal distribution you move quickly from being 1 in 1000 to being 1 in 10,000. Galton referred to those in the last category as having achieved “eminence”. These are “scary bright” minds.
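As a rough check on the arithmetic of the right tail, a percentile converts to a "1 in N" rarity as 1 / (1 - percentile). The sketch below is only an illustration: the helper names are invented here, and the IQ equivalents assume a normal distribution with mean 100 and standard deviation 15, which is an assumption and not something stated above.

```python
from statistics import NormalDist

def one_in_n(percentile):
    """Rarity implied by a percentile, e.g. 99.0 -> roughly 1 in 100."""
    return 1.0 / (1.0 - percentile / 100.0)

def iq_at_percentile(percentile, mean=100.0, sd=15.0):
    """IQ score at a given percentile, ASSUMING a normal distribution
    with the (hypothetical) mean and sd given."""
    return NormalDist(mean, sd).inv_cdf(percentile / 100.0)

for pct in (99.25, 99.75, 99.9, 99.99):
    print(f"{pct}th percentile: about 1 in {one_in_n(pct):,.0f}, "
          f"IQ around {iq_at_percentile(pct):.0f}")
```

On this arithmetic the 99.25th and 99.75th percentiles correspond to roughly 1 in 133 and 1 in 400, while 1 in 1,000 and 1 in 10,000 correspond to the 99.9th and 99.99th percentiles. Each further step out along the tail makes such people several times rarer.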

Earlier work looked at some key early academic achievements, but now the research has looked at their worldly successes in mid-career. The authors give a long list of what their prodigies have achieved, and it is clear that they have been busy, successful, and are well plugged into the commanding heights of American academia and industry.

So, once you achieve the eminence of being 1 in 10,000 are these paragons of intellect on a level playing field, comrades of equal brilliance? No. A little extra something is required, and those that have it shoot ahead of even this advanced class. A few of them take most of the prizes: one or two raise disproportionate amounts of research money, and account for several major advances. There are degrees of brilliance among this cognitive elite. So, what is it like to scoop the top prizes? Although the paper does not report any personal testimonies or self-evaluations (these may follow in a later paper, perhaps) it would not be surprising to find that many of them are very happy with their achievements. In my view, some aspects of academic life make for perpetual uncertainty, as if the peer review process can never be turned off.

A physicist of my acquaintance who made a habit of inviting Nobel Laureates to his departmental seminars found that they often doubted that they had deserved the accolade, and feared that the assembled physicists would spot the error of attribution the moment they began their lecture. Driving them to the department, he had to do his best to calm their nerves. So, even among this select crew, there are orders of precedence. Mercifully, as a visiting psychologist following on from the Nobel Laureates I was spared any such harsh evaluation, and accepted for my peripheral entertainment value.

So, to summarise the results of this rich source of research results on intelligence, if you get asked: “What does IQ mean, really?” you will find that Lubinski and Benbow have many of the answers.

Just for the record, by the age of 38 I had achieved (at the most generous and inclusive count) 28 rather slight publications, tenure, and one promotion, but few citations, no patents, no companies founded, and no perceptible impact either on my discipline or the course of Western civilization. All that may change soon, with any luck, but if you have passed 38, then look back and check your comparative achievements, and if you are approaching 38, remember that the clock is ticking.

Pereunt et imputantur.

Practice makes (one third) perfect, the other two thirds require talent

It has become part of popular wisdom that expertise requires 10,000 hours of practice. Some sages further imply that if you have the discipline to put in those ten thousand hours you can achieve success in any calling. This notion fits in nicely with the general nostrum that “you can be anything you want to be”.

The basis for this work (Ericsson, 1993) was to identify student musicians of high, middling and low achievements and ask them to look back at their careers to estimate how long they had practiced. The same approach was used to look at chess players. The results were written up to suggest that the hours of practice were the main causal variable. Other possible causal variables were not investigated in significant depth.

However, there were one or two problems with this notion. First, there was plentiful evidence that very many people tried to practice for many hours, and then gave up. In my own case, when I gave up trying to learn to play the piano after several years, my classmates who had been listening against their will on the other side of the wall while doing their homework were universally grateful. My fruitless hunt for the correct note had been pure torture for them. I had the wish, but not the wherewithal. Second, if you study those who are forced to practice by pushy parents (generally lauded by journalists for having trained their children to excel in chess) you find two things: the best chess playing children in the family have studied for much longer (one standard deviation more), but they have also spent many more hours in practice than grandmasters who have no difficulty beating them soundly in chess competitions. Forced drill and endless practice pay relatively meagre results, while talent and substantial practice soar ahead.

David Z. Hambrick et al. (2013) Deliberate practice: Is that all it takes to become an expert? http://dx.doi.org/10.1016/j.intell.2013.04.001

The authors have taken a more open-minded approach to the subject. They have looked at practice in student musicians and chess players, and their best estimate is that practice accounts for one third of the variance. The other two thirds are up for grabs. It is likely, but not directly measured, that talent contributes to the remaining two thirds. There is plentiful data showing that good musicians and chess players are much brighter than average, and we know from more mundane activities in the Army that intelligence is associated with learning things faster and with performing to a higher level, particularly when the task becomes more complicated and when you have to apply general principles rather than follow a checklist. Linda Gottfredson has assembled a lot of data on the importance of g in everyday life. http://www.udel.edu/educ/gottfredson/reprints/1997whygmatters.pdf
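To put "one third of the variance" into more familiar terms, it can be converted into a correlation. The sketch below assumes the figure is meant in the usual R-squared sense of a linear relationship between practice and performance; that reading, and the function name, are assumptions made here for illustration.

```python
import math

def correlation_from_variance_explained(r_squared):
    """Size of the correlation implied by a proportion of variance explained."""
    return math.sqrt(r_squared)

practice_share = 1 / 3                  # share of variance attributed to practice
remaining_share = 1 - practice_share    # everything else, including measurement error

r = correlation_from_variance_explained(practice_share)
print(f"practice: {practice_share:.0%} of variance, correlation about {r:.2f}")
print(f"remaining: {remaining_share:.0%} of variance")
```

On that reading, practice correlates with performance at about 0.58: substantial, but a long way from determining the outcome. The remaining two thirds is not all talent, since it also contains measurement error and anything else left unmeasured.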

Have a look at the Hambrick et al. paper, which also includes discussions about the contribution of personality  (not much, once you have measured hours of practice) and genes directly involved in particular skills.

Bottom line from the authors: 

1) The evidence is quite clear that some people do reach an elite level of performance without copious practice, while other people fail to do so despite copious practice.

2) Ten thousand hours are not required. Some chess players take 26 years of practice to make Master level, while others achieve that in less than 2 years.

Now, about your piano playing….. please consider your neighbours.

Sunday, 26 May 2013

A response to two critical commentaries on Woodley, te Nijenhuis & Murphy (2013)

Michael A. Woodley, Jan te Nijenhuis, & Raegan Murphy

Our study on the lowering of intelligence has drawn massive attention from the media, with headlines from Brazil to Vietnam. Also thousands of reactions were posted on blogs, including two highly relevant critical comments on the blogs of Scott Alexander and HBD Chick. We give a response in this post. We are also pleased that our paper in Intelligence is starting a scientific discussion on the lowering of intelligence.

Alexander (2013) advances the argument that Galton’s sample is unrepresentative of the population of Victorian London, and may be heavily skewed towards those with high IQ and faster reaction times (RTs), owing in part to the fact that Galton charged a small fee to those wishing to participate in his data collection exercise. Hence, these studies should not be used as the basis for comparison with more modern studies, which, it has been argued, are in many cases far more representative of the populations from which they are drawn. We show here that this argument is wrong.

HBD Chick (2013) has advanced a second argument to the effect that Galton’s sample, and other contemporaneous studies (i.e. Ladd & Woodworth, 1911; Thompson, 1903), represent ethnically homogeneous samples in comparison with more modern ones, which are obviously less homogeneous. Given the existence of ethnic-group differences in reaction time (RT) means (e.g. Jensen, 1998), this is proposed as a cause of the substantially depressed means in current-era studies, thereby undercutting our conclusion that RT has become slower for the general population (HBD Chick, 2013). We show here that this second argument is wrong in as much as changing population composition cannot account for the preponderance of the observed secular trend.

In addressing the first argument, the seminal paper of Johnson et al. (1985), which constitutes the source of Galton’s simple visual RT data employed in both our study and that of Silverman (2010), contains excellent data on the socio-economic and occupational diversity of the relevant subset of Galton’s exceptionally large sample (N around 17,000 individuals, 4838 [or 30%] of whom were included in Johnson et al.’s study). The paper states that “… a sizable portion of Galton’s sample consists of professionals, semi-professionals, and students. However … all socioeconomic strata were represented” (p. 876). As can be seen in Tables 10 and 11 (pp. 890-891), the male cohort could be split into seven socioeconomic groups (Professional, Semi-professional, Merchant/Tradesman, Clerical/Semiskilled, Unskilled, Gentlemen [aristocracy] and Student or Scholar). For females, there were six socioeconomic groups represented in the data (Professional, Semi-professional, Clerical/Semiskilled, Unskilled, Lady [aristocracy] and Student or Scholar). In both the male and female sample the modal group appears to be the Student or Scholar category; in both cases these groups exhibit the largest Ns – 1657 in the case of 14-25 year old males, and 297 in the case of equivalently aged females. The second- and third-largest groups amongst the males of equivalent age were Clerical/Semiskilled (N=425) and Semi-professional (N=414). This is basically true of the female sample also, with Semi-professional being the next largest group after Student or Scholar (N=104) and Clerical/Semiskilled comprising the third largest group (N=47). Whilst it is obviously true that the sample is skewed towards Students or Scholars in both cases, individuals from these lower-middle/upper-working class occupations combined (see p. 888 in Johnson et al., 1985, for a full description of how these occupational categorizations correspond to employment type) make up a respectable proportion of the 14-25 year old samples also (>30% in the case of both the males and the females). It is important to note that according to Johnson et al. (1985) many of the students would have been pupils at schools accompanied by teachers on day-trips to Galton’s laboratory at the Kensington Museum. However, a fundamental point is that Silverman’s (2010) study uses only data for those aged 18-30 (see Table 1, p. 41 in Silverman [2010] for full details of this subsample), and hence is quite unlikely to have been nearly as skewed towards school-aged students as the sample as a whole, which included a much larger range of ages.

A careful reading of Silverman (2010) will reveal that he was cognizant of precisely how much socioeconomic diversity was present in Galton’s dataset. Accordingly he was very careful to include only samples that would broadly match one or more of the categories in Galton’s dataset (see: Silverman, 2010, Table 2, pp. 42-43 for full disclosure of the sample background characteristics). One advantage of Silverman’s care and meticulous attention to detail is that it permits us to make like for like comparisons with specific socioeconomic and occupational groups in Galton’s data, thus we can directly test the claims of Alexander (2013). Concerning the post-Galton studies Silverman included five student samples, two of which date from the 1940s (Seashore et al. 1941), and the remaining three of which date from the 1970s to the 2000s (mean testing year = 1993; Brice & Smith, 2002; Lefcourt & Siegel, 1970; Reed et al., 2004). These can be compared with the combined Galton and Thompson 19th-century student data in a three-way comparison as follows:          

Comparison involving male students               Difference in N-weighted RT means
19th-century students vs. 1940s-era students     +16.8 ms (183.2 vs. 200 ms)
19th-century students vs. ‘modern’ students      +74.2 ms (183.2 vs. 257.4 ms)
1940s-era students vs. ‘modern’ students         +57.4 ms (200 vs. 257.4 ms)

The difference between the 19th-century and the ‘modern’ male students is very similar to the meta-regression-weighted increase in RT latency between 1889 and 2004, estimated on the basis of all samples included in the meta-analysis (81.41 ms). Silverman also included data from other socioeconomic groups. For example the study of Anger et al. (1993) included a combined male + female sample of 220 postal, hospital and insurance workers from three different US cities. These occupations clearly fall into the Clerical/Semiskilled and Semi-professional groups identified in Galton’s study. For both males and females in Galton’s data, the N-weighted RT mean for these two groups is 185.7 ms, whereas the N-weighted average amongst the participants in the study of Anger et al. (1993) was 275.9 ms. This equates to a difference of 90.2 ms between the 19th century and 1993. Again, this is not dissimilar to our meta-regression-weighted estimate of the cross-study increase in RT latency (81.41 ms).
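The matched comparisons above are simple arithmetic on N-weighted means, which can be sketched as follows. The means are the ones reported in the text; the helper function and the sample sizes used to demonstrate it are hypothetical, since the per-study Ns are not reproduced here.

```python
def n_weighted_mean(samples):
    """N-weighted mean of (n, mean_rt_ms) pairs."""
    total_n = sum(n for n, _ in samples)
    return sum(n * mean for n, mean in samples) / total_n

# Hypothetical illustration only: two made-up samples of 100 and 300 people.
print(n_weighted_mean([(100, 180.0), (300, 200.0)]))

# N-weighted RT means (ms) as reported in the text.
students = {"19th-century": 183.2, "1940s-era": 200.0, "modern": 257.4}
workers_19th_century, workers_1993 = 185.7, 275.9
meta_regression_increase = 81.41  # estimated increase, 1889 to 2004

differences = {
    "19th-century vs 1940s-era": students["1940s-era"] - students["19th-century"],
    "19th-century vs modern": students["modern"] - students["19th-century"],
    "1940s-era vs modern": students["modern"] - students["1940s-era"],
    "workers, 19th century vs 1993": workers_1993 - workers_19th_century,
}
for label, diff in differences.items():
    gap = diff - meta_regression_increase
    print(f"{label}: +{diff:.1f} ms ({gap:+.1f} ms relative to the meta-regression)")
```

Only the two comparisons that span the whole period (students at +74.2 ms and workers at +90.2 ms) are directly comparable with the 81.41 ms estimate; they fall a little below and a little above it respectively.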

The results of these broadly socioeconomically- and occupationally-matched study comparisons therefore imply an additional degree of robustness to the findings of our more statistically involved analysis of the overall secular trend. Furthermore, this evidences Silverman’s contention that as an aggregate, the ‘modern’ studies have broadly equivalent representativeness to the subset of Galton’s data employed in his and our own analyses. Alternatively we could state that neither Galton’s nor Silverman’s data are truly fully representative of any population, however they are both ‘biased’ in their sampling towards broadly similar groups.

We continue with the second concern, i.e. the lack of a strict ethnic-matching criterion, hypothesized to lead to substantially depressed RT means in current-era studies. Ethnic-group differences in performance on various elementary cognitive tasks have been documented and are to be expected (e.g. Jensen, 1998). Substantial changes in the ethnic composition of test-takers would however be needed in order for the magnitude of change to be solely or even substantially a consequence of this process. This is assuming of course that within and between ethnic-group comparisons in terms of RT produce proportional results.

RT is related to g via mutation load (as measured using fluctuating asymmetry; Thoma et al., 2006). Mutation load is therefore likely to be a general source of individual differences in cognitive functioning within populations (Miller, 2000), but not between them (e.g. Rindermann, Woodley & Stratford, 2012), hence there is no good reason to expect ethnic-group differences in RT means to be meaningfully comparable to within-group differences in terms of proportionality (consistent with this is the observation that on simple RT these differences, whilst present, are actually quite small; Jensen, 1993; Lynn & Vanhanen, 2002, pp. 66-67). So, indeed, ethnically heterogeneous samples will exhibit slightly slower or even faster reaction times (depending on the populations and proportions involved). However, the current proportions of groups exhibiting slower simple RT means than Whites in Western countries are simply too small, and the group differences too slight, to have had a substantial effect.

It is also worth noting that the weighted mean of our modern (post-1970) aggregated estimate (264.1 ms) is actually less than Jensen’s (1993) finding of a 347.4 ms mean of simple visual RT amongst a sample of 582 White US pupils described as being of European descent, and also Chan and Lynn’s (1989) finding of a 371 ms simple RT mean for over 1000 White British school children in Hong Kong. It must be noted however that these studies were conducted on young children. Simple RT shortens until the late 20s, when full neurological maturation is achieved (e.g. Der & Deary, 2006), hence Jensen’s and Chan and Lynn’s estimates are likely to be overestimates of the adult simple RT means of these Whites, which may in actuality be somewhat closer to our sample mean of ‘modern’ (mostly White) populations.

We would like to thank Scott Alexander and HBD Chick for their interest in our study, and for their commentaries. However, the counter-arguments, whilst thought-provoking, do not appear to withstand scrutiny. We must therefore conclude that the secular slowing of simple reaction time between the closing decades of the 19th century and the opening decade of the 21st has had little to do with sampling issues.


Alexander, S. S. (2013). The wisdom of the ancients. Slate Star Codex. URL: http://slatestarcodex.com/2013/05/22/the-wisdom-of-the-ancients/ [retrieved on 24/05/13]

Anger, W. K., Cassitto, M. G., Liang, Y.-X., Amador, R., Hooisma, J., Chrislip, D. W., et al. (1993). Comparison of performance from three continents on the WHO-recommended Neurobehavioral Core Test Battery (NCTB). Environmental Research, 62, 125–147.

Brice, C. F., & Smith, A. P. (2002). Effects of caffeine on mood and performance: A study of realistic consumption. Psychopharmacology, 164, 188–192.

Chan, J., & Lynn, R. (1989). The intelligence of six year-olds in Hong Kong. Journal of Biosocial Science, 21, 461-464.

Der, G., & Deary, I. J. (2006). Age and sex differences in reaction time in adulthood: Results from the United Kingdom Health Lifestyle Survey. Psychology and Aging, 21, 62–73.

HBD Chick. (2013). We’re dumber than the Victorians. HBD Chick. URL: http://hbdchick.wordpress.com/2013/05/22/were-dumber-than-the-victorians/ [retrieved on 24/05/13]

Jensen, A. R. (1993). Spearman’s hypothesis tested with chronometric information-processing tasks. Intelligence, 17, 47-77.

Jensen, A. R. (1998). The g factor: The science of mental ability. Westport, CT:

Johnson, R. C., McClearn, G., Yuen, S., Nagoshi, C. T., Ahern, F. M., & Cole, R. E. (1985). Galton's data a century later. American Psychologist, 40, 875–892.

Ladd, G. T., & Woodworth, R. S. (1911). Physiological psychology. New York, NY: Scribner.

Lynn, R., & Vanhanen, T. (2002). IQ and the Wealth of Nations. Westport, CT: Praeger.

Miller, G. F. (2000). Mental traits as fitness indicators: Expanding evolutionary psychology’s adaptationism. Annals of the New York Academy of Sciences, 907, 62–74. 

Reed, T. E., Vernon, P. A., & Johnson, A. M. (2004). Sex difference in brain nerve conduction velocity in normal humans. Neuropsychologia, 42, 1709–1714.

Rindermann, H., Woodley, M. A., & Stratford, J. (2012). Haplogroups as evolutionary markers of cognitive ability. Intelligence, 40, 362-375.

Seashore, R. H., Starmann, R., Kendall, W. E., & Helmick, J. S. (1941). Group factors in simple and discrimination reaction times. Journal of Experimental Psychology, 29, 346–394.

Silverman, I. W. (2010). Simple reaction time: It is not what it used to be. The American Journal of Psychology, 123, 39–50.

Thoma, R. J., Yeo, R. A., Gangestad, S., Halgren, E., Davis, J., Paulson, K. M., & Lewine, J. D. (2006). Developmental instability and the neural dynamics of the speed-intelligence relationship. Neuroimage, 32, 1456-1464.

Thompson, H. B. (1903). The mental traits of sex. An experimental investigation of the normal mind in men and women. Chicago, IL: The University of Chicago Press.

Woodley, M. A., te Nijenhuis, J., & Murphy, R. (2013). Were the Victorians cleverer than us? The decline in general intelligence estimated from a meta-analysis of the slowing of simple reaction time. Intelligence. doi:10.1016/j.intell.2013.04.006 

Thursday, 23 May 2013

He was a bit of a laugh

To those who knew him to any extent, he was seen as pretty normal. Above average at school, fond of football, a fan of Tottenham Hotspur and an amateur centre field player. Most of the accounts initially were reasonably positive. By the following day the story was a bit more nuanced. He had punched a girl in the face 10 years ago. He was aggressive. He was a big guy and you wouldn’t want to get on the wrong side of him. He had a very difficult father, who contributed nothing much to his upbringing. No-one had any inkling he would do a thing like that, though he had become a preacher, though not about Jesus.

With a confederate he rammed a car into an off duty soldier, cut his head off with a meat cleaver, dismembered his body with knives and machetes, and then for 20 minutes posed for photographs and gave interviews in which he gave his justifications and issued threats. Some extremely brave women passers-by (one getting off her passing bus for the express purpose of helping the fallen man) engaged the murderers in conversation. They reported that the main protagonist was apparently neither drunk nor on drugs, just an angry guy.

Murder is too rare a behaviour for the average citizen to pick up warning signals. Too rare for the security services as well, who knew of the lead protagonist, but not of his most recent Jihadist plans. With any luck a fuller picture will emerge, though probably never a way of distinguishing the noise of the average angry guy who converts to Islam and preaches anger from the signal that two butchers are hunting for an English soldier walking one afternoon on an English street.

Monday, 20 May 2013

Give me a child until he is seven, and I will give you the man

 (This nostrum, attributed to St. Francis Xavier, also works for girls and women, though separate equations are required, because of interrupted careers).

In popular culture, in academic debate, and in the nitty-gritty of medico-legal battles about the bright future which might otherwise have been enjoyed by a damaged child seeking compensation, there is much interest in what one can predict about a person’s future given knowledge of their social class, circumstances, school performance and intelligence at age 7. In medieval times it was only at age 7 that it seemed pragmatic to recognise that the infant had survived the very high early life death rates, and could be welcomed as a human being. In these gentler times parents have no compunction about photographing their infant, secure in its survival. It is not bad luck to register, name, photograph, film, record and display the vulnerable neonate to the world.

A recent study has added some evidence to these discussions, finding that maths and reading make an additional contribution to later success in life, over and above the general factor of intelligence. Stuart Ritchie and Tim Bates have written an elegant paper in Psychological Science “Enduring Links from Childhood Mathematics and Reading Achievement to Adult Socioeconomic Status”. http://pss.sagepub.com/content/early/2013/05/02/0956797612466268

Using the population born in a single week in 1958 (National Child Development Study data held by Institute of Education, and in my view “gold dust” for proper research) they got the data on social class of origin, maths, reading, intelligence, academic motivation, duration of education and attained social class.
In a nutshell, mathematics and reading achievement at age 7 have an effect on attained socio-economic status (SES) by age 42. Mathematics and reading ability both had substantial positive associations with adult SES, above and beyond the effects of SES at birth and of other important factors, such as intelligence. Achievement in mathematics and reading was also significantly associated with intelligence scores, academic motivation, and duration of education. These findings suggest effects of improved early mathematics and reading on SES attainment across the life span.

Of course, readers of this blog will know the standard lament by now: many causes interact with each other, and teasing them apart is difficult, but not impossible.  For example, in the original study the social class of origin of the children was noted, but the intelligence of the parents was not measured. So, we cannot assume the “influence of social class” is from social class advantage per se. It will be a blend of material advantage and genetic advantage, of unknown proportions.  The explanatory model probably should say “a class and genetic mixture”.

In ancient times the data would be presented in terms of means, standard deviations, a correlation matrix, and then perhaps a multiple regression equation. A useful and familiar progression, but not without interpretive problems. Ritchie and Bates are made of brighter stuff, and use an OpenMx magic box http://openmx.psyc.virginia.edu/ to generate their structural equations.

Personally, I approach structural equation modelling with some trepidation, fearing a magic lantern show which will convince me of anything, but Tim Bates thunders: “SEM exposes all assumptions, claims, and lacuna ruthlessly: it should be ubiquitous.” The (complicated) story is shown in their Figure 2, which traces direct and indirect coefficients on final achieved social status. From this it is possible to argue that, although intelligence has a strong causal effect, there is an additional direct contribution from Maths, with a lower direct effect from Reading. There is therefore a case for improving the teaching of these skills so as to make an independent additional contribution to life successes. Intelligence leads to motivation, which leads to years in education, which leads to attained socio-economic status. The latter leads into log income at the very end, which may be a relief to those who value cash over social approval.

A few points: once you put in social class of origin and housing tenure, the number of rooms in the parental home has no effect. All other things being equal, the “bedroom tax” is unlikely to diminish social mobility in a generation’s time.

I should like to have been able to give you a much more detailed statistical analysis but I was not taught maths properly when I was seven. At about that age, or slightly older, I announced to my grandfather, an Edinburgh engineer: “I know my 12 times table”.  He looked at me with a dour expression, and replied: “When I was a wee lad I knew my 20 times table”.

Edinburgh has much to answer for.

Sunday, 19 May 2013

ORIGINAL PAPER: "A response to Prof Rabbitt – The Victorians were still cleverer than us" by Woodley, te Nijenhuis and Murphy

A response to Prof Rabbitt – The Victorians were still cleverer than us
By Michael Woodley, Jan te Nijenhuis, and Raegan Murphy

Professor Rabbitt has reacted to our interpretation of the secular trend in simple reaction time speeds first detected by Silverman (2010), and validated by us (Woodley, te Nijenhuis & Murphy, 2013). We would like to thank Professor Rabbitt for his interest in our work and for being one of the first to substantially contribute to the scientific discussion that was started by our paper. Rabbitt makes several interesting points of criticism – here we will show, however, that these do not constitute sufficient grounds to reject the reality of the secular slowing of simple reaction time.

Firstly, Rabbitt argues that the level of inaccuracy in instrumentation designed to measure simple reaction time was historically quite high, especially in the pre-1970’s era where he argues that it was on the order of 100 or so ms. Rabbitt then goes on to state paradoxically that a reading of 200 ms might therefore fall between 200 and 299 ms, which assumes a bias of 99 rather than 100 ms, and also that the instrumentation would consistently ‘round down’ reaction time estimates. In actuality a bias of 100 or so ms would yield an average bias of 50 ms either way, assuming that the error due to bias was normally distributed, and that there was no tendency for biases to be skewed in one direction rather than in the other. Rabbitt does not provide any evidence for such a tendency towards rounding down – he merely states this as a fact apparently based on personal experience with pre and post-1970’s instrumentation. 
Secondly, Rabbitt argues that method variance across studies employing different instrumentation makes direct mean-wise comparison of results problematic. He illustrates this via reference to the use of warning signals along with the signal intensities, durations and rise-times of different light sources (such as bulbs, fluorescent tubes, LEDs, computer monitors, etc), and also with respect to response keys that might have been non-uniformly ‘sticky’ across different apparatus.

Thirdly, Rabbitt argues that the presence of only two data points from the Victorian era in our studies means that we can “… leave aside an important question whether there is any sound evidence that creativity and intellectual achievements have declined since the Great Victorian Flowering”.  

In addressing the first of Rabbitt’s claims, we are skeptical about the suggested level of inaccuracy in pre-70’s era instrumentation (such as Galton’s apparatus and the electro-mechanical Hipp chronoscope). True millisecond resolution in measurement had been achieved far earlier than Rabbitt claims, namely in 1908 (Haupt, 2001), with instruments prior to that being typically accurate to at least a hundredth of a second. It is not obvious why decent resolution (perhaps on the order of a hundredth of a second) would not have been within the grasp of someone of Galton’s mental stature and notoriously obsessive attention to detail (Rose & Rose, 2011). His apparatus was described in an 1889 paper and employed a half-second pendulum, whose duration could be estimated using very basic mathematics. Its release occurred concomitantly with the concealing of a white paper disk, which functioned as the stimulus - depressing a key facilitated its capture, registering the reaction-time score. Similarly the much more sophisticated Hipp chronoscope, with its electro-mechanical clutch-based mechanism, was capable of true millisecond resolution (Haupt, 2001). The issue of true millisecond resolution is at any rate rendered moot in light of the fact that we are dealing with the means of a large number of individuals measured by Galton and others in multi-trial type experiments. Resolutions of hundredths of a second would seem to suffice in such samples (Haupt, 2001).
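The point about averaging can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population mean, spread, and sample size below are hypothetical values chosen for illustration, not figures from any of the studies, and the instrument is assumed to round to the nearest hundredth of a second rather than systematically up or down.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# All four values are hypothetical, chosen only to illustrate the argument.
TRUE_MEAN_MS = 190.0   # assumed population mean reaction time
TRUE_SD_MS = 30.0      # assumed spread between individuals
RESOLUTION_MS = 10.0   # instrument resolution: a hundredth of a second
N = 5000               # assumed number of people measured

# "True" reaction times, then what a 10 ms instrument would record,
# assuming it rounds to the nearest step (no directional bias).
true_times = [random.gauss(TRUE_MEAN_MS, TRUE_SD_MS) for _ in range(N)]
recorded = [round(t / RESOLUTION_MS) * RESOLUTION_MS for t in true_times]

# Gap between the mean of the coarse readings and the mean of the true times.
gap = abs(statistics.mean(recorded) - statistics.mean(true_times))
print(f"gap between means: {gap:.2f} ms")
```

Under these assumptions the rounding errors largely cancel in the mean, so the gap is a small fraction of the 10 ms step. The sketch says nothing about an instrument that rounds consistently in one direction, which is the separate question raised by Rabbitt.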

These observations aside, there is a far more substantive problem with Rabbitt’s primary claim, namely that, even assuming a normally distributed 100 ms level of inaccuracy, the preponderance of pre-1970 studies still reveal upper bound means for simple reaction time that are shorter in duration than the sample size weighted ‘true millisecond resolution’ mean of post-1970 studies.  
Table 1
Reaction time means for five pre-1970 studies used in Woodley et al. (2013), along with estimates of error due to sub-100 ms measurement imprecision

Study | Reported mean (combined and N-weighted for the sexes where available) | Error range assuming 50 ms either way
Galton (1890’s) | 184.3 ms | 134.3-234.3 ms
Thompson (1903) | 208 ms | 158-258 ms
Seashore et al. (1941) | 197 ms | 147-247 ms
Seashore et al. (1941) | 203 ms | 153-253 ms
Forbes (1945) | 286 ms | 236-336 ms

Weighted mean of post-1970 studies = 264.1 ms

Based on Table 1, assuming a normally distributed 100 ms inaccuracy, the upper estimate falls below the post-1970 ‘true millisecond resolution’ mean in four out of five cases (the exception being the study of Forbes, 1945). The cumulative odds of this being a chance result can easily be calculated. Let us assume a 50% chance that the instruments would produce a mean value whose upper-bound estimate falls above the mean of the post-1970 studies. The odds of four studies all producing means whose upper bounds are lower are equal to 0.5*0.5*0.5*0.5, or 6.25%. In other words, the probability that this is a chance finding is small. If we add to this the systematic review of Ladd and Woodworth (1911), which found a mean for 19th- and early 20th-century samples of 192 ms, and whose hypothetical upper mean (242 ms) also falls below the weighted post-1970 mean, the cumulative odds of this being a chance finding fall to 3.125%.
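The arithmetic in this paragraph can be checked with a short script. It is a minimal sketch that takes the figures from Table 1 and the 50% assumption stated in the text; the variable and function names are illustrative only.

```python
# Reported means (ms) from Table 1. The two Seashore et al. entries are
# labelled "first" and "second" here only to keep the keys distinct.
reported_means_ms = {
    "Galton (1890s)": 184.3,
    "Thompson (1903)": 208.0,
    "Seashore et al. (1941), first mean": 197.0,
    "Seashore et al. (1941), second mean": 203.0,
    "Forbes (1945)": 286.0,
}
POST_1970_MEAN_MS = 264.1  # weighted mean of post-1970 studies (Table 1)
HALF_WIDTH_MS = 50.0       # the text's assumption: 50 ms either way


def upper_bound(mean_ms, half_width_ms=HALF_WIDTH_MS):
    """Upper end of the error range around a reported mean."""
    return mean_ms + half_width_ms


# Studies whose upper bound still falls below the post-1970 mean.
below = [
    name
    for name, mean_ms in reported_means_ms.items()
    if upper_bound(mean_ms) < POST_1970_MEAN_MS
]


def chance_probability(n_studies, p=0.5):
    """Probability that n independent studies all fall below, each with chance p."""
    return p ** n_studies


print(len(below), "of", len(reported_means_ms), "upper bounds fall below the post-1970 mean")
print(chance_probability(4))  # 0.0625, i.e. 6.25%
print(chance_probability(5))  # 0.03125, i.e. 3.125%
```

The calculation treats the studies as independent and each outcome as a coin flip, which is the simplifying assumption made in the text.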
Secondly, and again assuming high inaccuracy, why are the results of the pre-1970's studies likely to be overestimates rather than underestimates of the true values? Let’s look at the sources of bias that Rabbitt describes. Sticky keys might require more force in order to register a result. This was more likely to have been a problem in the case of earlier studies employing cruder instruments, such as mechanical or hybrid electro-mechanical apparatuses, rather than computer-based ones, for example. This suggests that the bias would have been in the opposite direction for earlier studies to that described by Rabbitt. Sticky keys would necessarily lengthen rather than shorten reaction time estimates. Long-duration visual signals, and also ones that are more intense and exhibit rapid rise-times, typically elicit faster (or maximal) reaction times (Kosinski, 2012). Galton’s apparatus used a purely mechanical signal in the form of a paper disk, which could be made to disappear via the operation of levers, thus triggering the subject to depress a key and halt the swing of a half-second pendulum. The signal duration was therefore indefinite – persisting until the point at which the apparatus would be reset. It is hard to argue against the high visibility of such a signal either, assuming a well-lit laboratory. Subsequent studies employing the Hipp chronoscope such as Thompson (1903) and the studies described in Ladd and Woodworth (1911) would have employed light sources. Thompson (1903) for example employed a Geissler tube suspended against a black background which was reported as producing a “flash of pale purple light” that was “thrown out sharply” (p. 8). Geissler tubes are plasma-discharge or fluorescence-based illumination sources. Fluorescent light sources exhibit extremely rapid rise-times compared to filament-based incandescent bulbs, for example (Sivak, Flannagan, Sato, Traube & Aoki, 1993).

Whilst the issue of signal duration in these early studies employing light sources as stimuli is indeed problematic, the suboptimal tendency is towards shorter duration signals (i.e. brief flashes), which would lengthen rather than shorten reaction time estimates. It is long-duration visual signals that permit the recovery of accurate maximal reaction time latencies (Kosinski, 2012). Once again, any measurement error in these earlier instruments would tend to skew the estimates towards higher rather than lower latencies.

What of the issue of warning signals? As Silverman (2010, p. 41) reports, there is very little evidence that warning signals actually make a difference to recorded reaction time latencies, especially when the ensuing stimulus is unpredictable, as was the case in all studies employed in our and Silverman’s analyses. It is unlikely that Galton utilized a warning system in his single person-single trial study. Thompson (1903), however, did use an audio warning system in her study involving multiple trials per person. The difference in the means between the two studies is extremely small (18.7 ms), and in the opposite direction to that predicted by the theory that the presence of a warning signal reduces the latency of reaction time means. This strengthens Silverman’s conclusion that employing warning signals makes little difference.

We agree with Rabbitt, and also Jensen (2011), who both argue that method variance can be a substantial problem when it comes to comparing different studies, especially those using different instrumentation. However, Rabbitt seems to have missed the point of the meta-analytic nature of our own and Silverman’s study. Indeed, the study of Silverman (2010) set out to explicitly address the issue of method variance using a stringent set of seven inclusion rules (p. 41) coupled with a detailed meta-analytic search. The rules were selected on the basis that all studies included in the comparison set should be as closely matched to Galton’s study on as many dimensions as possible. The stringency of these rules means that method variance across studies is substantially reduced; the trade-off, however, is that the number of potentially usable studies is also massively reduced. Our meta-regression ultimately demonstrates the power of a properly conducted meta-analysis in this regard, as we found no significant role for moderators in explaining the secular trend towards increasingly slow simple reaction times. There is scatter around the regression line, but that is exactly what meta-analytical theory predicts. All data points being on or very close to the regression line is an extremely unlikely outcome for a meta-analysis (see Hunter & Schmidt, 2004).

Finally, what of the issue of sound evidence for the greater accomplishments of 19th-century Western populations relative to contemporary ones? This is an important issue that has been addressed quantitatively using historiometry, which is the historical study of human progress or individual personal characteristics, using statistics to analyze references to geniuses, their statements, behavior and discoveries in relatively neutral texts (Simonton, 1984). Historiometric research into innovation rates and the lives and accomplishments of eminent individuals (geniuses) has shown that the per capita rate (i.e. events per billion of the population per year) of significant innovation and also geniuses in science and technology peaked in the late 19th century, after a long period of increase. Throughout the 20th century there was a decline (Huebner, 2005; Murray, 2003).

What is a significant innovation? It is simply one that is conspicuously different from anything that came before – so much so that multiple encyclopedists and compilers of inventories of innovation are likely to independently note it. Examples include the development of the plough, the steam engine, splitting the atom and putting a man on the moon. The iPhone 5, by contrast, is not a significant innovation in comparison with its earlier incarnations, and is unlikely to be considered as such by contemporary historians of science and technology. Similarly geniuses can be rated via the degree to which these same sources reference them. The use of a ‘convergence’ criterion based on prominence across encyclopedias not only allows us to reasonably quantify the frequencies of significant innovation and geniuses throughout the history of civilization, but it also allows us to rank those same innovations and individuals in terms of importance. This historiometric technique, like many extremely useful ideas, has its origins in the writings of Galton (1869).

In conclusion, whilst Rabbitt’s criticisms are interesting, they are clearly insufficient grounds for rejecting the central claims made in our paper – namely that the secular trend in increasing simple reaction time latency is robust and translates into a decline of 1.23 IQ points per decade, or 14.1 points since Victorian times.

Forbes, G. (1945). The effect of certain variables on visual and auditory reaction times. Journal of Experimental Psychology, 35, 153–162.
Galton, F. (1869). Hereditary genius. London, UK: Macmillan Everyman's Library.
Galton, F. (1889). An instrument for measuring reaction time. Report of the British Association for the Advancement of Science, 59, 784–785.
Haupt, E. J. (2001). Laboratories for experimental psychology: Gottingen’s ascendancy over Leipzig in the 1890s. In: Rieber, R. W., & Robinson, D. K. (Eds.), Wilhelm Wundt in history. The making of a scientific psychology. (pp. 205-250). New York, NY: Kluwer Academic. 
Huebner, J. (2005). A possible declining trend for worldwide innovation. Technological Forecasting and Social Change, 72, 980–986.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis (2nd Ed.): Correcting error and bias in research findings. Thousand Oaks, CA: Sage.
Jensen, A. R. (2011). The theory of intelligence and its measurement. Intelligence, 39, 171–177.
Kosinski, R. J. (2012). A literature review on reaction time. http://biae.clemson.edu/bpc/bp/lab/110/reaction.htm
Ladd, G. T., & Woodworth, R. S. (1911). Physiological psychology. New York, NY: Scribner.
Murray, C. (2003). Human accomplishment: The pursuit of excellence in the arts and sciences, 800 BC to 1950. New York, NY: Harper Collins.
Rose, H., & Rose, S. (2011). The legacies of Francis Galton. The Lancet, 377, 1397.
Simonton, D. K. (1984). Genius, creativity and leadership: Historiometric inquiries. Cambridge, MA: Harvard University Press.
Seashore, R. H., Starmann, R., Kendall, W. E., & Helmick, J. S. (1941). Group factors in simple and discrimination reaction times. Journal of Experimental Psychology, 29, 346–394.
Silverman, I. W. (2010). Simple reaction time: It is not what it used to be. The American Journal of Psychology, 123, 39–50.
Simonton, D. K. (1984). Genius, creativity and leadership: Historiometric inquiries. Cambridge, MA: Harvard University Press.
Sivak, M., Flannagan, M. J., Sato, T., Traube, E. C., & Aoki, M. (1993). Reaction times to neon, LED, and fast incandescent brake lamps. The University of Michigan Transportation Research Institute, Report No. UMTRI-93-37.
Thompson, H. B. (1903). The mental traits of sex. An experimental investigation of the normal mind in men and women. Chicago, IL: The University of Chicago Press.
Woodley, M. A., te Nijenhuis, J., & Murphy, R. (2013). Were the Victorians cleverer than us? The decline in general intelligence estimated from a meta-analysis of the slowing of simple reaction time. Intelligence. doi:10.1016/j.intell.2013.04.006

Thursday, 16 May 2013

Angelina Jolie and prophylactic double mastectomies

Angelina Jolie, who has the BRCA1 mutation, has undergone a prophylactic double mastectomy and says that she has thus reduced her risk of getting cancer from “87% to 5%”. She was faced with a most dreadful dilemma, and has been praised for her courage and for her willingness to make her story public. It is likely that her example will lead to more women with BRCA mutations having both breasts removed.

Should they?

If I had a similar mutation and was offered the prospect of reducing my cancer risk from 87% to 5% by having my testicles removed, and I, like Angelina, had lost family members to cancer, I might go ahead. However, I would first spend some time checking the statistics, and re-reading Gerd Gigerenzer’s masterly “Reckoning with Risk” and his more recent “Calculated Risks: How to know when numbers deceive you”.

The reason for my caution is that:
a) Most of us have difficulty with statistics
b) Most of us have a particular difficulty with percentages and
c) Most doctors have as much difficulty with numeracy as the rest of us, but are more likely to be over-confident.

One of the main problems is that doctors and journalists concentrate on relative risk reduction and not absolute risk reduction. The first compares two procedures in terms of their relative effectiveness at reducing risk, the second shows you the overall reduction in risk. It is not much use to reduce your relative risk if the absolute level remains very much the same.

Consider the results from an earlier retrospective study by Hartmann et al. (1999), which gives deaths per 100 women in the high risk group:

Prophylactic mastectomy        1
Control (no mastectomy)        5

You can see that the rate of death in this high risk group (with BRCA mutations) in the women without mastectomies is higher than in those who had the double mastectomy.

The Relative Risk Reduction is 80% (4 women have been saved, and 4 divided by 5 is 80%).

The Absolute Risk Reduction is 4% (prophylactic mastectomy reduces the number of women who die from 5 to 1 in 100, a saving of 4 women per hundred).
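The two figures can be reproduced with a few lines of code. This is a minimal sketch using the per-100 counts quoted above; the function name and argument names are illustrative, not taken from any of the cited papers.

```python
def risk_reduction(control_events, control_n, treated_events, treated_n):
    """Return (absolute risk reduction, relative risk reduction) as fractions."""
    control_risk = control_events / control_n
    treated_risk = treated_events / treated_n
    arr = control_risk - treated_risk  # difference between the two risks
    rrr = arr / control_risk           # that difference as a share of the control risk
    return arr, rrr


# Deaths per 100 women: 5 without mastectomy, 1 with (figures quoted above).
arr, rrr = risk_reduction(control_events=5, control_n=100,
                          treated_events=1, treated_n=100)
print(f"Absolute risk reduction: {arr:.0%}")  # 4%
print(f"Relative risk reduction: {rrr:.0%}")  # 80%
```

The same 80% relative reduction would be reported if the risks were 0.05% and 0.01%, which is why the absolute figure is needed alongside it.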

Now you can see why clinicians, researchers, drug companies and journalists prefer relative risk reduction percentages to absolute risk reduction figures. They usually look more dramatic, and make better headlines.

However, research has moved on, and Rebbeck et al. (2004) report on 483 women with BRCA mutations, 105 of whom had double mastectomies at average age 38, with a mean follow-up of 6.4 years.  http://jco.ascopubs.org/content/22/6/1055.long

Results: Breast cancer was diagnosed in two (1.9%) of 105 women who had bilateral prophylactic mastectomy and in 184 (48.7%) of 378 matched controls who did not have the procedure, with a mean follow-up of 6.4 years. Bilateral prophylactic mastectomy reduced the risk of breast cancer by approximately 95% in women with prior or concurrent bilateral prophylactic oophorectomy and by approximately 90% in women with intact ovaries.
Conclusion: Bilateral prophylactic mastectomy reduces the risk of breast cancer in women with BRCA1/2 mutations by approximately 90%.

Prophylactic mastectomy        2         
Control (no mastectomy)      49

The relative risk reduction is 47/49, or about 96%.
The absolute risk reduction is 49 - 2 = 47% (47 fewer diagnoses per hundred women).
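For comparison, the same calculation can be run on the unrounded counts quoted from the Rebbeck abstract (2 of 105 versus 184 of 378). This is only a sketch of the crude proportions; it is not the adjusted analysis behind the paper's own estimate of approximately 90%.

```python
# Counts quoted from the Rebbeck et al. (2004) abstract.
treated_cases, treated_n = 2, 105     # had bilateral prophylactic mastectomy
control_cases, control_n = 184, 378   # matched controls without the procedure

treated_risk = treated_cases / treated_n   # about 1.9%
control_risk = control_cases / control_n   # about 48.7%

absolute_reduction = control_risk - treated_risk        # in probability units
relative_reduction = absolute_reduction / control_risk  # share of control risk removed

print(f"Risk with mastectomy:    {treated_risk:.1%}")
print(f"Risk without mastectomy: {control_risk:.1%}")
print(f"Absolute risk reduction: {absolute_reduction:.1%}")
print(f"Relative risk reduction: {relative_reduction:.1%}")
```

These are diagnoses over a mean follow-up of 6.4 years, not deaths, so the figures are not directly comparable with the mortality table from the earlier study.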

A clear result, wouldn’t you say? This study is in line with previous studies such as Hartmann et al. (1999), who evaluated the efficacy of bilateral prophylactic mastectomy in a retrospective cohort analysis of 639 moderate- and high-risk women who had bilateral prophylactic mastectomy at the Mayo Clinic between 1960 and 1993. Data from this study suggest that bilateral prophylactic mastectomy is associated with a 90% reduction in breast cancer incidence and mortality in women at high risk of breast cancer. In the only other study of BRCA1/2 mutation carriers to date, Meijers-Heijboer et al. reported no postbilateral prophylactic mastectomy breast cancers in 76 BRCA1/2 mutation carriers after 2.9 years of follow-up, compared with eight breast cancers in 63 mutation carriers who did not undergo bilateral prophylactic mastectomy (P = .003).

Well, now we can make a number of points. Most studies concentrate on whether a woman is diagnosed with cancer. However, cancers are increasingly treatable, and although the medication is thoroughly draining and unpleasant, so is a double mastectomy, and the latter is permanent. Furthermore, for those with BRCA cancer risks there are prophylactic medications available. Getting a diagnosis of cancer is not identical with dying of cancer.

The Rebbeck paper does not report on mortality figures. These are theoretically calculated for the next 30 years, but we do not know what improvements we may get in cancer treatment over three decades. On current trends it should improve significantly. Survival rates are 93% if it’s caught at the earliest stages and 88% at stage 1.

A Cochrane review in 2004 concluded: Bilateral Prophylactic Mastectomy should be considered only among those at very high risk of disease.

What Angelina Jolie appears to have done is reduce her chance of getting cancer by half, a very significant reduction, but at the cost of both breasts. She was understandably frightened of getting cancer, but she was not doomed, and other treatments are available.  

There is always a celebrity effect, but any woman considering a prophylactic mastectomy should look at the data carefully, and look at the human costs and benefits of all treatment options. Modern medicine is saving more of us from cancer, for longer than ever before, but it still throws up the most awful dilemmas.