Wold & Wennerås solve 2 unknown variables with 1 equation and receive honorary degree from Chalmers

by Ola Hansson

Introduction

“Shameful”

That was the headline in The Economist on May 22, 1997. “Women really do have to be at least twice as good as men to succeed”, the introduction continued.

When the immunologists Agnes Wold and Christine Wennerås had their study Nepotism and sexism in peer-review published in the prestigious journal Nature in 1997 (Vol. 387/22), they attracted great international attention. They showed that a woman had to be 2.6 times as productive as a man to receive a postdoctoral fellowship from the Swedish Medical Research Council of that time. After a long struggle, Wold and Wennerås were now met with almost unanimous approval from journalists, politicians and scientists - all unusually willing to credit the irrepressible authority of the “cold” and “clear” figures.

Of course, the same journalists, politicians and scientists soon stretched the study’s limited claim, that it had revealed discrimination in one research council, into a statement about women’s conditions in general. And, watching a TV debate more than 10 years later, you can be certain that it is Wold and Wennerås that Gudrun Schyman is referring to when she says that women in politics must be twice as capable as men to be regarded as equal: “It is scientifically proven!”

It is this routine misuse of science that I will criticize first. But I also have a more interesting and fundamentally very important objection to the study itself – an objection that I have, surprisingly, not been able to find anywhere else, in spite of having run across quite a few attempts to have a go at Wold’s and Wennerås’ study.

1 research council ≠ ∑scientific referees

By comparing a research council’s judgments of applications for postdoctoral fellowships with the applicants’ scientific productivity and qualifications (measured by number of articles and weighted according to the publications’ “impact factor”), Wold and Wennerås claim to have proven the occurrence of gender discrimination within Swedish science. Maybe they are right. But is this gender discrimination representative of the academic world of Sweden as a whole, or is it specific to appointments to postdoctoral fellowships and only within medical science? We don’t know. To decide this we would need, for example:

more studies to compare with, i.e. also studies of other kinds of appointments and nominations, together with studies of other scientific disciplines as well (critics seem to demand these)
reasoning about what kind of mechanism may cause the discrimination, together with reasoning and comparisons with other social phenomena that support the occurrence of this mechanism (feminists seem to be content with this).

But wouldn’t it be reasonable to assume that the procedure is the same everywhere? Why should the Medical Research Council specifically distinguish itself as especially discriminating?

The answer to this question depends on why they specifically chose to scrutinize the Medical Research Council. Was it chosen by pure chance – well, then they would have good reasons to continue their investigation on the hunch that they were on the track of something significant. If they, on the other hand, chose it for some other reason, e.g. that it had already distinguished itself as being unfair – well, then it is of course less likely that the Medical Research Council is typical of the Swedish research climate and unlikely that they have found evidence of a general phenomenon.

Wold and Wennerås state that they chose to scrutinize the Medical Research Council because of their suspicion that women were treated unfairly. (The fact is that both of them were rejected by the council the very same year they chose to examine.)

“Our investigation was prompted by the fact that the success rate of female scientists applying for postdoctoral fellowships at the MRC during the 1990s has been less than half that of male applicants.”

Since, moreover, the difference in “success rate” was even greater during the examined year, 1995 (women 8 %, men 26 % – hence less than a third) there is a risk that the alleged discrimination isn’t even representative of the MRC. In any case, they examined a year when the women – maybe by pure coincidence – were less fortunate than normal.
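
The arithmetic behind “less than half” and “less than a third” is easy to check. A minimal sketch, using only the two percentages quoted above (the variable names are mine):

```python
# Success rates for 1995 as quoted in the text (percent of applicants funded).
women_rate = 8
men_rate = 26

# Women's success rate relative to men's.
ratio = women_rate / men_rate

print(f"women/men success ratio: {ratio:.2f}")
print("less than a half: ", ratio < 1 / 2)
print("less than a third:", ratio < 1 / 3)
```

The ratio says nothing in itself about why the rates differ; it only confirms that 1995 was a more extreme year than the “less than half” reported for the 1990s as a whole.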

When, for example, the journal “Ny Teknik” in 2006 claimed that it is now “scientifically demonstrated why there are so few women in high positions in the academic world”, it made a typical journalistic leap into pure speculation, taking off from Wold’s and Wennerås’ study as if it were automatically generally applicable merely because it was carried out in a scientific manner. It is, as a matter of fact, quite possible that other researchers would find discrimination against men if they copied the methodology to the letter but chose to examine another research council in another scientific field and at levels other than postdoc.

I’d like to emphasize that this in itself is not a reason to criticize Wold and Wennerås. They have chosen a fully legitimate research object and they handle (at least in the study; in the interviews the tone is different) their conclusions with care. They present their message as a reflection and not as a fact:

“If gender discrimination of the magnitude we have observed is operative in the peer-review systems of other research councils and grant-awarding organizations, and in countries other than Sweden, this could entirely account for the lower success rate of female as compared with male researchers in attaining high academic rank”

And even if this should prove to be a wholly mistaken notion, there is of course an inherent value in revealing discrimination in a research council, no matter how deviating or specific it is. I’d hesitate to call it research though – it seems to me that investigation would be a more apt term. But no matter the choice of words, Wold’s and Wennerås’ effort led to a replacement of the members of the Medical Research Council, the number of female experts increased, and stricter regulations were introduced. A well-accomplished feminist mission, in other words. Provided that Wold and Wennerås were right, it should please anyone who cares about justice and objectivity in the world of research.

Were they right? Did such a severe depreciation of female researchers’ competence occur as the study is claimed to have demonstrated? It is very possible, maybe even likely, but it certainly is contestable – and it should be contested. Wold’s and Wennerås’ analysis depends on a very problematic assumption that they don’t even seem to have reflected upon. But before I present this, my second and more interesting objection, we have to toil through some technicalities.

Sense < Figures < Multiple linear regression analysis

That Wold and Wennerås attracted such vast international attention was undoubtedly due to the great authority of statistics and figures. Quantifiable facts tend to appear objective no matter how subjectively the variables and initial values are chosen. If, moreover, the statistical analysis is performed with multiple linear regression, most remaining doubters will be efficiently silenced. Few people understand regression analysis and those who do will nevertheless face a distinctly nontrivial task if they wish to criticize the model on which the analysis is based.

Let us take a quick step-by-step tour of Wold’s and Wennerås’ procedure:

Each application to the Medical Research Council consisted of a curriculum vitae, a bibliography and a proposal for a research project. From these, the council examiners, independently of each other, awarded three marks on a scale from 0 to 4, one each for:

the applicant’s scientific competence
the relevance of the proposed research project
the quality of the suggested research methodology.

Women received on average a somewhat lower mark than men in each category. The mark on the applicant’s scientific competence showed the greatest difference (2.21 against 2.46). Wold and Wennerås chose this parameter as a starting point for their study.

They had thus defined their research object: “the Medical Research Council’s judgment of scientific competence”. To determine whether the council judged women and men under the same conditions, Wold and Wennerås now needed some kind of measure. To get this they made the assumption that a researcher’s scientific competence is linearly correlated with the number and quality of the researcher’s published articles in scientific journals.

This is neither obvious nor unproblematic. Nevertheless I would guess that many researchers regard it as a reasonable, albeit rough, assumption. “There are of course quite a lot of exceptions”, someone will say; “and they increase in numbers the further from the average we get”, someone else might add; “the correlation is hardly linear – the difference between 0 and 1 article is not the same as the difference between 19 and 20 articles”, a third one points out.

Also, having published many articles above a certain level is not a measure of how far above the level you have reached. An exceptionally skilled researcher can easily get a highly impressive “publication record” but will rarely wish to be associated with so-called salami publications (research results cut into many but thin slices). Having a high “publication record” does not necessarily imply that your publications are of the best quality, but at least there is a guarantee that you are not worthless.

Well, let us nevertheless accept the assumption. But how do we quantify “numbers and quality”? Wold and Wennerås proceeded quite thoroughly. They proposed 6 different methods of measurement that we can call parameters of productivity:

total number of publications ( = publication record)
total number of first-author publications
total citations
total impact measure
first-author impact measure
first-author citations.

Of these, especially the parameter “total impact measure” was found to correlate with gender in “the Medical Research Council’s judgment of scientific competence”. To determine whether there could be other explanations than discrimination for this correlation, Wold and Wennerås received assistance from statisticians to perform multiple linear regression analysis on their data. With this technique, one can eliminate parameters that seem to cause a result but in reality only reflect a dependence on those other parameters that are the real causes.

Wold and Wennerås performed regression analysis on each of the six parameters of productivity. This is however not a routine task that simply takes care of itself. Initially you have to create a model and to do this you have to choose the factors the model will consist of. These choices are of course anything but obvious, especially when it concerns models that should handle diffuse notions such as discrimination and scientific competence. The choices are in practice subjective and are limited by the researchers’ imagination as well as by the data they have at their disposal.

But the alternative is to give up. Wold and Wennerås were, like most researchers, forced to make both subjective choices and more or less well-founded assumptions. As far as I can judge from their report, they made reasonable assumptions and created their models with conscientiousness rather than with laziness.

The result of the regression analysis showed that the model with “total impact measure” was most capable of predicting “the Medical Research Council’s judgment of scientific competence” (r²=0.47). In addition, two other factors in the model of the regression analysis correlated with the council’s judgment: the gender of the applicant and the applicant’s connection to a member of the council. According to the model, a woman needed 2.6 times higher “total impact measure” than a man to be judged as equally scientifically competent.
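
To see how a figure like “2.6 times” can fall out of a regression model, here is a minimal sketch. It assumes a plain linear model with only an impact term and a gender term; the study’s model also contained other factors, such as the connection to a council member. The symbols S (competence score), x (total impact measure), g (1 for a man, 0 for a woman) and the coefficients β are my own notation, not the authors’, and the text above does not say whether the 2.6 was computed in exactly this way.

```latex
\begin{align*}
S &= \beta_0 + \beta_1 x + \beta_2 g \\
\beta_0 + \beta_1 x_f &= \beta_0 + \beta_1 x_m + \beta_2
  \quad\text{(equal scores for a woman, } g = 0\text{, and a man, } g = 1\text{)} \\
x_f &= x_m + \frac{\beta_2}{\beta_1} \\
\frac{x_f}{x_m} &= 1 + \frac{\beta_2}{\beta_1 x_m}
\end{align*}
```

In a model of this form the gender effect is a fixed additive amount of impact. Expressing it as a multiple therefore requires choosing a reference level x_m for the man, and the multiple would come out differently at another reference level.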

What, then, is “total impact measure”? Well, it is the number of articles (the parameter “total number of publications”) weighted by the “impact factor” of the journals in which they have been published. The impact factor is a measure of how frequently the journal’s articles are cited in other scientific journals and is therefore to some extent an expression of the journal’s significance.
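
As a rough illustration of what such a weighting does, the sketch below assumes that the measure is simply the sum, over a researcher’s articles, of the impact factor of the journal each article appeared in. That reading, the function name and all the numbers are my own assumptions for illustration; none of them is taken from the study.

```python
def total_impact(journal_impact_factors):
    """Sum the journal impact factors, one list entry per published article."""
    return sum(journal_impact_factors)


# Hypothetical applicants; the figures are made up for illustration only.
applicant_a = [2.0, 2.0, 2.0, 2.0, 2.0]  # five articles in modest journals
applicant_b = [10.0]                      # one article in a high-impact journal

for name, articles in (("A", applicant_a), ("B", applicant_b)):
    print(f"applicant {name}: {len(articles)} articles, "
          f"total impact {total_impact(articles)}")
```

Under this reading, five articles in modest journals and one article in a prominent journal give the same score, which shows how much the measure depends on where, rather than what, a researcher has published.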

The impact factor is a much less valid measure of the significance of the individual articles in the journal, though. One reason for this is that citations are not normally distributed among the articles – a small number of articles account for the majority of the citations. A measure of how much each scientist’s specific articles have been cited is consequently fairer. And, actually, two of Wold’s and Wennerås’ six parameters of productivity did measure this:
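
The effect of such a skewed distribution can be shown with a few lines of code. The citation counts below are invented for a hypothetical journal with ten articles; only the shape of the distribution matters, not the numbers themselves.

```python
from statistics import mean, median

# Hypothetical citation counts for ten articles in one journal (made-up data).
citations = [0, 0, 1, 1, 2, 2, 3, 4, 7, 80]

journal_average = mean(citations)    # an impact-factor-like figure
typical_article = median(citations)  # what a "middle" article actually gets
top_share = max(citations) / sum(citations)
below_average = sum(1 for c in citations if c < journal_average)

print("average citations per article:", journal_average)
print("median citations per article: ", typical_article)
print(f"share of citations from the top article: {top_share:.0%}")
print("articles below the journal average:", below_average, "of", len(citations))
```

In this made-up journal a single article accounts for most of the citations, so crediting every author with the journal’s average flatters nine of the ten articles.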

“Total citations” measured how frequently the researcher’s articles had been cited during 1994. This parameter, however, did not give any statistically significant results in the regression analysis.

“First-author citations” measured how frequently articles where the researcher was the main author were cited during 1994. This parameter gave statistically significant results in the regression analysis (r²=0.41), though the fit was somewhat weaker than for the parameter “total impact measure” chosen by Wold and Wennerås.

Would it have been more honest to choose “first-author citations”? Does the higher relevance compensate for the lower correlation in the regression analysis? I can’t answer this, but I note that both parameters in any case are more or less direct measures of the number of citations, and that the number of citations in itself is a dubious measure of scientific significance.

Why is an article cited? Did it bring research forward? Was it theoretically interesting but stained by methodological errors? Was the article a lightweight with a striking conclusion and hence a gratifying source for routine citing? Is it perhaps even a general punching bag? Well, that is rare but it does occur. That cliques of colleagues would rather cite each other than an isolated but equally prominent researcher is, on the other hand, so obvious that it should need no mention.

Another problem concerning the impact factor is that older journals have the unfair advantage that individual classic articles which date back decades may still be frequently cited while the journals’ current work is more anonymous.

To sum up: the models and the reasoning of the study are founded on layer after layer of subjective assumptions, each of which may have affected the result. But assumptions must be made, and Wold and Wennerås, at least according to their own report of the procedure, seem to have gone about it honestly and ambitiously.

None of this is particularly surprising. Most research results are much less solid than they claim to be. A few are totally worthless; others are extremely dubious, but may nevertheless contribute a piece of the puzzle that a later study will solve.

I don’t expect that my reasoning above will shake any feminist’s confidence in Wold’s and Wennerås’ conclusion, though. On the contrary, I now count upon facing the half rhetorical, half indignant question:

“But of course … a fortunate feminist coincidence! Why else should all these layers of subjective assumptions lead to the result that particularly women (surprise, surprise …) are discriminated against?”

Bearing in mind what has been said so far, I can empathize with that question. And I actually can’t answer it. What I can do, however – and this is the main point of this article – is to show that it is a meaningless question. In fact, Wold’s and Wennerås’ result says nothing about which gender is discriminated against.

Peer vs. Peer = Pears vs. Oranges

Wold and Wennerås base their entire study on the assumption that there is a correlation between the variables productivity and competence. We concluded that while it certainly was a problematic assumption it still was reasonable, if crude. Their mistake is of a more fundamental nature. Wold and Wennerås, you see, take it for granted that the one variable is objective and fair while the other is subjective and unfair. This is not a reasonable assumption. It is simply wrong.

The fact is that Wold’s and Wennerås’ study depends on the false premise that articles for scientific journals are evaluated and selected by neutral persons using objective criteria, while scientific competence is evaluated by biased persons using subjective criteria. This is not the case. The editors of a scientific journal always know who the author is, and in 1995 (more rarely nowadays) the same went for most of the expert referees who were engaged to examine the articles. These persons are neither more nor less human than the members of scientific councils. Often they are the same people.

What Wold and Wennerås actually do is to compare one kind of peer review (a research council’s assessment of applications for postdoc fellowships) with another kind of peer review (assessment of articles for publication in scientific journals). Consequently their result might just as well be interpreted the other way round. As well as saying that a woman must publish 2.6 times more than a man to be judged as equally competent, one can say that a man has to be 2.6 times as competent as a woman to get published as much. Conclusion: Scientific journals discriminate against men. Where are the headlines?

Instead of regarding women as being discriminated against by the members of the Medical Research Council one can thus regard them as being favored by the editors and the referees that evaluate their articles. Why? Maybe the journals want to appear “equal”, emphasize “the female perspective”, etc. There are probably just as many conceivable reasons as there are conceivable reasons for the Medical Research Council to favor men, but that does not mean that we have to believe in any of them. As a matter of fact there might be fully legitimate and objective reasons why the council’s and the journals’ evaluations turned out to be so different, the main one being that they simply evaluated different things. To compare the council with the journals is consequently to compare pears with oranges.

Journals

To begin with: journals evaluate neither persons nor projects – they evaluate submitted articles. More precisely: the journals select their referees for a clearly defined task: to evaluate an article as an individual effort – not to compare it with another article. An editor has an extensive list of available referees at his disposal. He can therefore choose experts with a very specific competence. In most cases he chooses a referee who is an expert in exactly the same subject as the article deals with.

Thus the journal’s referees evaluate the articles from a very narrow perspective. They are able to examine details, discover errors and find out whether the article adds any new knowledge to its specific subject. On the other hand, articles containing really innovative research risk a less fair judgment, especially in a large and established field, since the referees then tend to be stuck in a groove and are skeptical of anything contrary to conventional wisdom.

It is also more difficult to find referees to evaluate an innovative article. The article is likely to be shelved or refused due to insufficient scientific competence of the examiner – not of the examined! On the other hand, it is comparatively easy to find referees for articles concerning routine research within the current boundaries. In this way the journals encourage “more of the same”-research.

Since the evaluating referee is devoted to precisely the same research field as the one he examines, there is also a risk that the dispassionate pocket-lens inspection changes into a frantic search for errors if the results are opposed to his own. An interesting insect to inspect turns into a competitive cockroach to crush (or, inversely, a beautiful butterfly to buttress).

It is also easy to understand that referees within marginalized or very narrow fields may be unreasonably positive toward all articles – good or bad – that can put their field in the spotlight.

Favoritism, discrimination and subjective motives consequently occur in journals as well as in research councils. Bearing this in mind we can note that Wold and Wennerås examined applications for postdoctoral fellowships and that researchers at this level have rarely had time to publish much. What they have published they have often published in collaboration with teams including more established researchers (another reason for Wold and Wennerås to choose “first-author citations” instead of “total impact measure”), and if you have already published a lot, it would indicate that you may be a pet protégé of established colleagues – equivalent to the circumstances that the Medical Research Council was accused of, in other words.

Research councils

Unlike the referees of a journal, the members of the Medical Research Council were not hand-picked for a specific task. They, like members of other research councils and expert panels, were chosen before the applications were received, and their task was to compare all the applications within a general field to each other.

Just like the referees of a journal, the members of a research council are highly qualified researchers. But instead of searching for errors in a specific article they are assigned to judge the applicants’ competence and the potential of their projects. And because they evaluate projects that often are quite different from each other, they must have a wider perspective than is required of the referees of the journals. Because of this, they are capable of judging whether a project could be relevant even outside of the applicant’s field. That a research project is innovative and cross-boundary is in this case therefore an asset and not, as in the journals, a burden. Other factors, such as the calculated cost of a project in material, persons and time, are relevant to the council as well, but totally irrelevant to journals dealing with already accomplished projects.

Research councils and other expert panels that nominate candidates for posts must, unlike the journals, of course also take into consideration personal qualities and demands that have little to do with scientific competence. On the other hand, the wider perspective reduces the risk of vested interests in the research field and consequently also personal competition between examiner and examined.

To sum up: research councils and journals evaluate different things and employ different kinds of experts. Researchers of international eminence are to be found on both sides, but are asked to perform different tasks. That is, journals do not employ more qualified researchers than those who sit in research councils. If I have emphasized the negative aspects of peer review “the journal way” and the positive aspects of peer review “the research council way”, it is because Wold and Wennerås, along with both their followers and their detractors, completely ignore these. The journal naturalSCIENCE (volume 1, Article 7, 1997), for instance, concludes a cheerleading chorus with the following totally unfounded conclusion:

“The paper concludes with an appeal for the development of a peer review system with built-in resistance to the weaknesses of human nature. Surprisingly, though, it does not discuss what is perhaps the most striking implication of the study; namely, that in the evaluation of scientific competence, bibliometric measurement alone may provide the least biased form of peer review, and certainly the most democratic. Further, in comparison with other forms of peer evaluation, appropriate bibliographic analysis may prove both more realistic, because it depends solely on past achievement rather than the assessment of future promise.”

But whether my account of the positive and negative aspects of the two kinds of peer review is fair or not does not really matter. The basic flaw of Wold’s and Wennerås’ study is that they shut their eyes to the existence of two variables on the one side of the equation. Positive or negative – they are subjective anyway. No matter how sophisticated the statistical tools you apply, two subjective parameters will never provide an objective end result.

Actually, Wold’s and Wennerås’ study does not even exclude the possibility that it was in fact men who were discriminated against by both council and editors. It is of course not likely, but the result is consistent with a reality where men have to be 10 times as competent (according to an all-knowing God’s judgment) to get a fellowship from the Medical Research Council, and 26 times as competent to get published in journals. Wold’s and Wennerås’ figures merely describe a relation between two variables, not absolute values of human abilities.
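
The point that one observed ratio cannot pin down two unknown biases can be made concrete with a small sketch. It assumes, purely for illustration, that both the council’s score and the publication output are proportional to “true” competence divided by a gender handicap; the function, the scenario labels and the handicap values (apart from the 10 and 26 above and the observed 2.6) are my own constructions, not anything from the study.

```python
OBSERVED = 2.6  # the study's figure: women's productivity relative to men's at equal score


def observed_ratio(council_handicap, journal_handicap):
    """Women/men publication ratio among applicants given the same council score.

    Each handicap is the factor by which a man must exceed a woman in true
    competence to obtain the same outcome from that reviewer
    (1.0 = unbiased, below 1.0 = men are favoured).
    """
    return journal_handicap / council_handicap


# Three hypothetical worlds: (council handicap on men, journal handicap on men).
scenarios = {
    "council favours men, journals neutral": (1 / 2.6, 1.0),
    "council neutral, journals favour women": (1.0, 2.6),
    "both disfavour men (10x and 26x)": (10.0, 26.0),
}

for label, (council, journal) in scenarios.items():
    print(f"{label}: observed ratio {observed_ratio(council, journal):.1f}")
```

All three worlds produce the same observed 2.6. The data fix only the quotient of the two handicaps, which is one equation with two unknowns.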

Such an exhaustive skepticism is of course not very constructive. It is necessary to choose some kind of measure for competence in order to do research at all. My point is that the measures themselves are of little value without further studies and analyses of the mechanisms, causes and effects.

Wold’s and Wennerås’ conclusion can be compared with a woman who is trying to lose weight and does not like what the scales tell her: “65 kilos? But I have been exercising every morning for a week!” Over to the spare bathroom and scale number two: “62 kilos … just what I suspected, the first scale is defective!”

In this example, it is easy to see how wishful thinking makes you jump to conclusions. We realize immediately that it might just as well be scale number two that displays the wrong value; or that both display too large a value; or even that both display too small a value. That so many readers seem blind to the falsity of Wold’s and Wennerås’ conclusion is probably due to the actual existence of objective measures for mass while there is none for competence.

For that matter, Wold’s and Wennerås’ study is of course far from worthless. It is not sufficient to establish any conclusions about discrimination, but it nevertheless provides interesting information and can serve as a first step in the exploration of equality in the world of research.

Athletic Design, 8 March 2009
English translation 2 July 2009