Friday, 21 March 2014

Rating scientific articles: why and how?

Assuming we need to assess the quality of individual scientific articles, there are two broad approaches:
  • the quantitative approach of counting citations and/or computing alternative metrics,
  • the qualitative approach of conducting peer review, and giving quality tokens to articles. A quality token can be publication in a journal, or a grade.
The quantitative approach has the fundamental flaw of being an indirect way of assessing quality. And it has undesirable side effects, such as the misuse of citations for assigning credit, rather than for helping readers. In this post I will deal with the qualitative approach.

Giving grades, implicit or explicit.

Assuming some form of peer review is conducted, how can the results be concisely summarized?

Judging articles by the journals which publish them is widely recognized as bad practice. This practice is nevertheless very widespread, perhaps because it has the advantage of simplicity. In each field of research, journals are grouped into three or four tiers according to an informal but universally accepted hierarchy, so peer review implicitly results in giving grades which take three or four values -- something very helpful to impatient evaluators. But these grades are assigned in a rather contorted way, since reviewers are asked not to assign grades, but to judge whether articles are suitable for publication in a given journal.

It would therefore be useful to rate articles in a more direct and meaningful way. This has been attempted by some innovators, most recently by the open peer reviewing platform Publons. Publons asks reviewers to give two scores to each article: one score for quality, one score for significance. Each score is a number from 1 to 10, "where 10 is amazing, 5 is average, and 1 is poor". The most obvious problem with this system is that the meaning of each possible score is not clearly defined: in which cases should you give a score of 6 rather than 7?

Principles of rating.

For a rating system to be useful and meaningful, some principles must be obeyed. I propose that the most important principles are:
  • Clarity: Each possible grade should have an explicit meaning. In particular, grades should not be numbers, since adding or averaging them would destroy their meaning.
  • Transparency: Each grade should come with all the underlying arguments and considerations. Grades are meaningful only if the full story is available. 
  • Mutability: Grades can evolve when the articles are modified, or when later research sheds more light on the subject. 
These features are lacking to various degrees in the "rating by journal" system, and in all extant alternatives that I know of. Rating by journal has almost no mutability, as publication in a journal is almost always final. Rating by journal is not transparent when reviewers' reports are not published. And rating by journal lacks clarity, as articles can get published for a number of reasons: journalistic impact, new ideas, technical achievements, etc.

Sketching a rating system.

Let me sketch a system which would follow the above principles. Each article could have three grades, one in each of the following three categories:
  • Interest
      A - major discovery
      B - strong new results
      C - incremental progress
      D - not worth reading
  • Correctness
      A - proved beyond reasonable doubt
      B - strong evidence
      C - some evidence, but important doubts remain
      D - main claims are unsupported
  • Clarity
      A - very satisfactory
      B - some polishing would be welcome
      C - important clarifications are needed
      D - please rewrite article 
Each reviewer would give a grade in one or more categories -- but there is no need to ask each reviewer to deal with all aspects of the article. If several reviewers give grades in the same category, and the grades differ, there should be a discussion leading to a consensus, or a way to decide which grade is the most appropriate, by reviewing the reviews. Grades could evolve if the article is modified. Extra categories could be added depending on the particular field of research, but the three categories above should probably always be present.
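To make the proposed scheme concrete, here is a minimal sketch of how grades could be recorded and reconciled. This is only an illustration: the names (`Article`, `Review`, `consensus`) and the simplistic rule that reviewers either agree or trigger a discussion are my assumptions, not part of any existing platform.

```python
from dataclasses import dataclass, field

# Hypothetical categories and grades from the proposed system.
CATEGORIES = ("interest", "correctness", "clarity")
GRADES = ("A", "B", "C", "D")

@dataclass
class Review:
    reviewer: str
    category: str
    grade: str

@dataclass
class Article:
    title: str
    reviews: list = field(default_factory=list)

    def add_review(self, reviewer, category, grade):
        # Grades are mutable: a reviewer may re-grade a revised article,
        # so we simply append and let consensus() look at the latest state.
        assert category in CATEGORIES and grade in GRADES
        self.reviews.append(Review(reviewer, category, grade))

    def consensus(self, category):
        """Return the agreed grade for a category, or None when reviewers
        disagree (or have not graded it) and a discussion is still needed."""
        grades = {r.grade for r in self.reviews if r.category == category}
        return grades.pop() if len(grades) == 1 else None

article = Article("Example article")
article.add_review("reviewer1", "correctness", "B")
article.add_review("reviewer2", "correctness", "B")
article.add_review("reviewer1", "clarity", "C")
print(article.consensus("correctness"))  # B
print(article.consensus("interest"))     # None: no grades yet
```

Note that no reviewer is required to cover all three categories, matching the point above that each reviewer may grade only some aspects of the article.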

Authors could be asked to rate their own articles. Their grades would not have the same status as grades from independent reviewers, but could still be helpful indications, especially when compared with authors' and reviewers' grades for other articles by the same authors.

The real problem.

Grades are only one aspect of a new scientific publishing model. For innovative platforms such as Publons or the Selected Papers Network, the crucial issue is probably to get people to publicly comment on each other's articles -- "publicly" meaning that the comments should be made public, not that their authors should always renounce anonymity. To solve this issue, it is necessary, but far from sufficient, to avoid making mistakes in the relatively simple matter of grades.