Psychological comments: What makes problems difficult?

Sunday 20 March 2016

What makes problems difficult?

Psychologists have been better at measuring intelligence than explaining how they do so. “The indifference of the indicator” is all very well, but this dictum has been met with public indifference and incomprehension. This is because psychometricians keep saying that intelligence matters, but then put their foot in it by saying “but how you test it doesn’t matter”. Technically, this is correct: it does not matter precisely what the test is, so long as it has sufficient difficulty to stretch minds and grade them. In that sense the actual indicator of intelligence is a matter of indifference, but only so long as it has the necessary psychometric properties.

I try to get round this problem of understanding by giving the example of digit span: remembering digits forward is easy (and only weakly predictive of general ability) but remembering digits backwards is harder (and more strongly predictive of general ability). In that difference lies the essence of difficulty.

http://drjamesthompson.blogspot.co.uk/2014/03/digit-span-modest-little-bombshell.html

It then gets rather technical. Some tests are good indicators at the lower end of ability, others at the higher end. They all have characteristics and quirks. Hence the reification of intelligence test results into g, which satisfies most researchers but bemuses the general public.

Compare this with the forced expiratory volume test.

Forced expiratory volume (FEV) measures how much air a person can exhale during a forced breath. The amount of air exhaled may be measured during the first (FEV1), second (FEV2), and/or third seconds (FEV3) of the forced breath. Forced vital capacity (FVC) is the total amount of air exhaled during the FEV test.

Neat, isn’t it? (You can then study whether 30 mins of aerobic exercise over 8 weeks raises the volumes. It does, a bit.) Can psychometrics define an intelligence measure in as simple a way?

Diego Blum, Heinz Holling, Maria Silvia Galibert, Boris Forthmann. Task difficulty prediction of figural analogies. doi:10.1016/j.intell.2016.03.001

The purpose of this psychometric study is to explain performance on cognitive tasks pertaining Analogical Reasoning that were taken into consideration during the construction of a Test of Figural Analogies. For this purpose, a general Linear Logistic Test Model (LLTM) was mainly used for data analysis. A 30-itemed Test of Figural Analogies was administered to a sample of 422 students from Argentina, and eight of these items were administered along with a Matrices Test to 84 participants mostly from Germany. Women represented 77% and 76% of each respective sample. Indicators of validity and reliability show acceptable results. Item difficulties can be predicted by a set of nine Cognitive Operations to a satisfactory extent, as the Pearson correlation between the Rasch model and the LLTM item difficulty parameters r = .89, the mean prediction error is slightly different between the two models, and there is an overall effect of the number of combined rules on item difficulty (F_(3,23) = 15.16, p < .001) with an effect size η² = .66 (large effect). Results suggest that almost all rotation rules are highly influential on item difficulty. (my emphasis).
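For readers who have not met the LLTM before, the idea can be sketched in a few lines. In the Rasch model the chance of solving an item depends only on the gap between a person's ability and the item's difficulty; the LLTM adds the assumption that each item's difficulty is a weighted sum of the rules it contains. The rule names, weights, intercept and items below are invented purely for illustration and are not the paper's estimates.

```python
import math

# Hypothetical rule weights: NOT the estimates reported in the paper.
RULE_WEIGHTS = {
    "rotation": 1.2,
    "reflection": 0.8,
    "subtraction": 0.3,
}
INTERCEPT = -1.0  # hypothetical normalisation constant

# Made-up items, each listed with the rules needed to solve it
# (this plays the role of the LLTM design matrix).
ITEMS = {
    "item_a": ["subtraction"],
    "item_b": ["reflection"],
    "item_c": ["rotation", "subtraction"],
    "item_d": ["rotation", "reflection", "subtraction"],
}


def lltm_difficulty(rules):
    """LLTM assumption: difficulty is the intercept plus the sum of rule weights."""
    return INTERCEPT + sum(RULE_WEIGHTS[rule] for rule in rules)


def rasch_probability(ability, difficulty):
    """Rasch model: probability of a correct answer given ability and difficulty."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


difficulties = {name: lltm_difficulty(rules) for name, rules in ITEMS.items()}

for name, beta in difficulties.items():
    p = rasch_probability(0.0, beta)
    print(f"{name}: difficulty={beta:+.2f}, P(correct | ability=0)={p:.2f}")
```

With these invented weights the three-rule item comes out hardest, which is the pattern the abstract describes: more combined rules, and rotation in particular, push the predicted difficulty up.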

Figural matrices are a good test of intelligence. Raven dreamed his up from logical principles, using patterns he had seen on pottery in the British Museum. His test works very well, even though one difficult item among the 60 is placed a little too early in the B sequence. Incidentally, to my mind this placing error is one of the proofs that the test is reasonably culture fair, in that all racial groups find it difficult, without having to confer across continents about it.

Tests of this sort are known as the A:B::C:D analogies (A is to B as C is to D). When a problem is based on finding the missing element D of the analogy (i.e., A:B::C:?), then C:D becomes the target analog and A:B becomes the source analog. What needs to be extrapolated from one domain to the other is the compound of structural relations that binds these two entities, and not just superficial data (Gentner, 1983). The basic problem A:B::C:? can be applied to different types of contents, namely: verbal, pictorial and figural (Wolf Nelson & Gillespie, 1991).

How does one describe the difficulty level of each item? Mulholland, Pellegrino, and Glaser (1980) studied the causes of item difficulty in geometric analogy problems, and concluded that the number of item elements, as well as the number of transformations, had a significant effect on error rates.

These authors decided to build a test with designed levels of item difficulty, and chose to keep the same standard figures in all items, so as to reduce surface complexity and concentrate on underlying operational differences between items. They used 9 main rules to build the items, rotating the figures by 45, 90 and 180 degrees, using X and Y axis reflections, line subtractions and dot movements. You can call this: “How to build your own IQ test” and the supplementary material shows you how to do this. Note that certain rule combinations lead to some imprecisions and, therefore, the process of rule-based item generation should not be considered a pure-mechanical procedure. As a consequence, the authors have further explanations about their design guidelines which need to be understood.
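To give a flavour of what rule-based item generation involves, here is a minimal sketch of two of the rule families mentioned above, rotation and axis reflection, applied to a figure stored as a list of corner points. The trapezium coordinates, the function names and the rule chain are my own assumptions for illustration; the authors' actual procedure is in their supplementary material and, as they warn, is not purely mechanical.

```python
import math


def rotate(points, degrees):
    """Rotate points about the origin; positive angles are counter-clockwise."""
    rad = math.radians(degrees)
    c, s = math.cos(rad), math.sin(rad)
    # Round to hide floating-point noise; "+ 0.0" turns -0.0 into 0.0.
    return [(round(x * c - y * s, 6) + 0.0, round(x * s + y * c, 6) + 0.0)
            for x, y in points]


def reflect_x(points):
    """Reflect across the X axis (flip the sign of y)."""
    return [(x, -y) for x, y in points]


def reflect_y(points):
    """Reflect across the Y axis (flip the sign of x)."""
    return [(-x, y) for x, y in points]


def apply_rules(points, rules):
    """Apply a chain of rules in order; each rule maps points to points."""
    for rule in rules:
        points = rule(points)
    return points


# Hypothetical trapezium used as the "C" figure of an A:B::C:? item.
trapezium = [(0, 0), (4, 0), (3, 2), (1, 2)]

# A two-rule item: a 90-degree clockwise rotation, then a Y-axis reflection.
rules = [lambda pts: rotate(pts, -90), reflect_y]

answer_d = apply_rules(trapezium, rules)
print(answer_d)
```

The same chain of rules that turns A into B is applied to C to produce the correct answer D; distractors could be built by dropping or swapping one rule in the chain.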

Based on the data provided in Table 2, specific rule-based contributions to item difficulty can be interpreted. The short clockwise main shape rotation, the subtraction and the dot movement rules make some contributions in this regard. Most interestingly, the best predictors of item difficulty are all the other rotation rules (i.e., both counter clockwise rotations, both long rotations, and the short clockwise trapezium rotation), followed by the reflection rule. Special mention must be given to the long clockwise trapezium rotation, which has the biggest influence on item difficulty. In other words, people found it most difficult to manipulate rotations during task resolution. In fact, the two easiest items according to the Rasch model (items 2 and 4) do not comprise rotation rules, nor does item 25 which is the 7th easiest item. Also, combining rules within a single item has an impact on item difficulty by itself, since both the ANOVA results and the Box Plot show that the higher the number of combined rules, the greater the item difficulties.
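As a quick sanity check on the ANOVA figures quoted in the abstract, the effect size can be recovered from the F statistic and its degrees of freedom. This assumes a one-way ANOVA, where eta squared equals (F × df_effect) / (F × df_effect + df_error); the function name is mine.

```python
def eta_squared_from_f(f_value, df_effect, df_error):
    """Eta squared for a one-way ANOVA, recovered from F and its degrees of freedom."""
    effect = f_value * df_effect  # proportional to the between-groups sum of squares
    return effect / (effect + df_error)


# Values quoted in the abstract: F(3, 23) = 15.16
eta_sq = eta_squared_from_f(15.16, 3, 23)
print(round(eta_sq, 2))  # prints 0.66
```

The result agrees with the reported η² = .66, so the F value, the degrees of freedom and the effect size are mutually consistent.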

I am aware that some of this has been done before, if only because I attended conferences years ago showing that an intelligence test could be constructed out of general principles of learning, and that it had good predictive value.

I think that this is a good paper which should be mentioned whenever critics assume that test material is arbitrary and unrepresentative in some way. This work establishes that rules of design complexity are strongly associated with the ease or difficulty human subjects experience when they solve problems.

One fly in the ointment: it seems that psychology is now 76% a girly subject and women are less good at mental rotation of shapes, so it might be good to check this with boys studying something other than psychology.

The authors found that this test works well at low as well as high levels of ability, which is particularly useful.

A high positive correlation (r = .89) reveals that item difficulties are strongly associated with the predicted difficulties of each rule, and these item difficulties remain practically unchanged in a further study.

By way of comparison only, the test-retest correlation of the Wechsler after 6 months is 0.93, so the above correlation of 0.89 is a very strong endorsement of the design principles of the test created by the authors.

Perhaps we have taken a step towards finding out what makes problems difficult.

Take a closer look at the paper here:

https://drive.google.com/file/d/0B3c4TxciNeJZcWZiUmpfRm50VlE/view?usp=sharing

12 comments:

Santoculto, 22 March 2016 at 15:22
Psychology of worker/semi-slaves* all the time,

just do not see because you don't want.
Santoculto, 22 March 2016 at 15:22
https://upload.wikimedia.org/wikipedia/commons/b/b8/Camp_ArbeitMachtFrei.JPG
dearieme, 22 March 2016 at 16:28
"A 30-itemed Test of Figural Analogies was administered": God, that must be boring to do. How does anyone retain the interest to bother finishing it?
Santoculto, 23 March 2016 at 22:37
Seems there are three types of ADHD

first type, the classical: combo mental+physical hyperactivity und attention deficit,

second type, the potential sportsman: physical hyperactivity und attention deficit,

third type, the potential creative genius: (only) mental hyperactivity und attention deficit.

mental hyperactivity is not exactly the same than neuroticism because neuroticism tend to imply in higher density of negative thinking while mental hyperactivity would be more heterogeneous, with both, highly density of positive, neutral and negative thinkings.

or just other bullshit...
Santoculto, 24 March 2016 at 12:11
ADHD is just the type who are in the dead end of ''tolerance for school regulaments'' and other ''obligatoriness.

Adhd is a archaic version of humankind because they (i'm little bit like tha') to do what they want, similar with non-human animals, it's not a offense.

Domesticated people internalize civic and not-so-civic regulaments while adhd on average, seems, to do what they want to do and specially during the school, in other words, they (und me) are LESS domesticated.
Anonymous, 26 March 2016 at 20:27
For those interested in Rasch measures, which put item difficulties and test-taker abilities on the same scale, allow all arithmetic operations (*, /, +, -, rather than at most + and - for IQ) and form the basis for item response theory (IRT) in general, this textbook is the best free online resource that I have found:
Measurement Essentials

Here is a practical slideshow walk-through showing how Rasch measures ("W score") were used in the development of the Woodcock-Johnson IQ test:
Applied Psych Test Design: Part C - Use of Rasch scaling technology. The Stanford-Binet, also published by Riverside, uses the same scale ("change-sensitive" score or scale "CSS"), which has as its only arbitrary choice setting the CSS for an average 10-year old equal to 500. Since division is a valid operation on scores on this scale, one can say that in an absolute sense, the average adult with a score of 510 to 515 is only 2 or 3% more intelligent than the average 10 year-old, and less than 10% smarter than the average 5 year old with a score of 470.

Unfortunately Riverside seems reluctant to publish the average age- vs. CSS or W-score graphs for the full test, let alone for different standard deviations, but slide 19 of the slideshow does just that for the WJ block rotation sub-test, which is likely pretty close to the full-scale results, though likely with smaller variance than the full test. (Assessment Service Bulletin Number 3: Use of the SB5 in the Assessment of High Abilities, p.11 of the PDF says the top full-scale score observed on the SB5 was 592, whereas the block-rotation subtest (BR) has an adult mean of about 508 and a s.d. of about 8.5, which would put a 592 nearly 10 s.d. out, which is rather unlikely, so the actual s.d. for the full scale must be larger. The distribution is also likely log-normal or otherwise fatter-tailed than normal.) So using BR as a proxy likely understates the differences between subjects on the full test, but even so, the difference between +3 s.d. and average adults is larger than between average adults and average 5 year olds. The BR score of a +3 s.d. 5 year old is about the same as a +0.5 s.d. 22 year-old, which would be about what I would expect the typical graduating psychology major to score. There are many other such comparisons; I have enjoyed hours playing with that chart. (I have an improved .png version rescaled in years rather than months, with a background grid, and also a Paint.NET version separated into layers for easier analysis, if anybody wants it.)

I'd be very interested in finding similar charts for a full-scale test, fluid / crystallized scales or any other sub-tests.

-EH / savantissimo
(FYI, Wordpress is acting even more messed-up than usual.)
E. Harris, 28 March 2016 at 20:29
Here is that updated graph with what I hope is a more coherent explanation: Converting IQ at a Given Age to an Absoloute (Rasch) Measure of Intelligence