What can we do about dishonest judges?
Dirk L. Schaeffer
Vancouver, B.C., Canada
1. Introduction
In an earlier paper (“Why do we have judges at figure skating events?”) I posed
several apparently simple-minded questions (why do we have judges, what happens
to the minority when majorities rule?) and found they led to some unexpected
conclusions. The most notable of these was that there were many, many ways in
which the simple averaging of raw scores was a far preferable method of
evaluating skating performances than the majority-of-judges systems now being
used. I will not repeat the list
here. Rather, I want to look now at
the obvious questions this analysis raises:
why do we keep using majority-based systems, and what would happen if we
replaced them with simple raw scores?
But let me start with a little bit of history, specifically that of majority-based
scoring in figure skating. I wish I
could tell you how our present system, characterized by multiple judges and rank
placements for each judge, as well as majority rules, got started, but I
haven’t been able to find any details on the actual history.
That doesn’t matter much, however, since it is easy to create
reasonable scenarios, and the bottom line – during most of the last century
– was that it worked. Both the
ability to reach agreement among judges of often very differing minds, and the
relative ease of scoring would have made it appear as a quite suitable, if not
ideal, method for determining competitions.
(Although, of course, it had its problems, too,
resulting in the banning of several judges and, in one case, of the whole
Russian judging panel on grounds of inappropriate marking.)
What seems most important now, however, is that two major changes in the sport have
occurred since then. First,
computers have made all forms of scoring easy and instantaneous.
And second, the stakes – limited for most of the last century to
whatever pride home countries or coaches or competitors could take in winning
– have become much higher. Which raises the question of cheating.
Before I turn to that, however, I want to clear away one possible source of confusion:
the question of trimmed means.
2. Trimmed means
Inevitably, when we talk about the scoring of figure skating events, we come up against
suspicions of bias on the part of the judges.
And one of the first fears we have is that of giving any single judge,
who may cast an extreme and unreasonable vote, too much power.
Whatever other reasons may account for their development, majority-based
systems were clearly designed to inhibit that power; and most of us seem to feel
that any alternative system should also contain some checks against it.
Hence, even when we talk of means-based alternatives, we tend to think of
trimmed means, rather than the average of the full panel. Indeed, I’ve more or less done the same myself, as recently
as six months ago.
But it turns out that trimmed means are not a useful alternative, for several
reasons. A small one is that
agreement percentages for trimmed mean data are as poor as they are for
ordinals. Another small one is that
trimmed means deprive us of the opportunity of monitoring individual judges.
(This is not immediately obvious, but the proof is too long to go into
here, since the issue is fairly minor anyway.)
The big reason is the dreaded “mean shift”.
The mean shift is a phenomenon that can occur when one judge uses a marking scale
that is measurably higher or lower than that used by the other judges. This can lead to a situation where all judges agree in the
ranking of a group of skaters, but that ranking gets overturned or jumbled when
trimmed means are used. Edmund
Russell laid this out very neatly in his 1995 paper “Amateur figure skating:
is the scoring system outmoded?” and most of us have encountered his, or
similar demonstrations since then.
It’s easy to over-estimate the significance of this effect, since the conditions
under which it can arise at all are quite rare.
But this doesn’t seem to matter as much as the fact that, rare or not,
it CAN occur, and can lead to some extraordinarily distorted standings when it
does – as distorted, in fact, as those that can occur with majority-based
systems. And since I have argued
that the deficiencies of the latter are sufficient to require alternatives, I
cannot endorse trimmed means as a serious contender for that role.
Thus, when I discuss means-based systems in this paper, I shall be referring to just
that, the plain average of the full panel, and not to any trimmed or otherwise diddled-with alternative.
3. Majority-based systems and bias
If majority-based systems are not as good at achieving interjudge agreement as are
means-based systems, if they are prone to giving absurd conclusions, and if they
are sufficiently complex to alienate large numbers of potential fans, what
justification is there for our continued use of them?
One answer that is often proposed is simple habit.
Our judges have all been trained under these systems, it is argued, and
it would be virtually impossible to retrain them all. But that argument is totally nullified by the data presented
in my earlier paper: judges trained
to use ordinal and majority-based systems actually do better when their
decisions are re-scaled as means-based judgments.
Which is to say, they don’t need retraining: the existing system was too complex for them to learn in the
first place.
Another answer – as I implied earlier – is that they appear to guard us against
bias, dishonesty, and cheating. (And
that, by the way, is the only other justification for their continued use that I
have been able to find.) But
let’s look at this more closely.
First and most fundamentally, we can ask the question of whether this is something
that scoring or judging systems should be doing in the first – or any other
– place. We do not, after all,
expect time-clocks, measuring tapes, or weigh scales to guard against the use of
steroids; and judges in figure skating perform the same role as those
instruments in the more objectively measured sports. All we ask of the instruments is that they be reliable,
consistent, and accurate – which is, I think, all we should ask of judges, and
which is what the rigorous training schedules we subject them to attempt to
ensure. (And does, quite
astoundingly: the measures of
interjudge agreement presented in my earlier paper, and in several of Ed
Russell’s papers for the American Statistical Association, indicate that
figure skating judges are just about as reliable and consistent as it is humanly
possible to be.)
Of course, asking our clocks and tape measures to be accurate and consistent also
implies that they are free of bias, and bias is more likely in human judges. But
when we suspect that they are not, we send them out for repair, rather than try
to devise some kind of majority system of multiple clocks to compensate for that
suspicion.
But what are we actually dealing with when we talk about “bias”? My best guess is that it comes in about five different forms.
First, there is simple inattention or distraction:
for whatever reason, some judges may miss some aspects of a skater’s
performance. While this problem is
real, the solution is simple and already in place: use more than one judge.
Second, there is the question of differing judgmental abilities: some judges may simply be better at what they do than others.
This problem is less real (ultimately, we have no way of telling who is
“really” a better judge since we have no way of telling what is “really”
a better performance), but the solution is again in place: rigorous training
procedures.
Third, there is what might be called “principled” bias: the tendency of some judges
to prefer certain types of skating or certain approaches to the discipline; the
way Russian dance judges are said to prefer “expression,” while British
judges prefer “technique”. Some
years ago, for example, I was able to show that it is possible to identify
judges who seemed to prefer “artistic” skaters, on the one hand, or
“technical” skaters, on the other. While
these preferences would color their judgment to a measurable extent, I concluded
that it was far too small to be worth worrying about.
Beyond that, this sort of “bias” is, to some extent, deliberately –
and reasonably – built into our scoring systems, which evaluate competitors
separately in each of these areas. All
of this seems reasonable, principled, and fair: as a form of bias, this is
wholly benign.
Fourth, there is “home country” bias: the tendency, actively encouraged by many
national skating organizations, of judges to rate their own country’s
competitors more highly than judges of other countries do.
Again, I examined this issue statistically some years ago, with somewhat
mixed findings. Briefly, some 75%
(as compared to the 50% to be expected if there were no bias) of the judges I
examined – using all data, including qualifying, short, and long programs for
the men’s, ladies’, and pairs’ events, and all four dance events, at the
1994 Olympics and World’s competitions – showed some evidence of this form
of bias. I considered this less
significant than the fact that a full 25% did not show such bias.
Virtually all of this bias was at levels that were either statistically
or substantively – or both – insignificant.
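
One way to make the 50% baseline concrete is sketched below in Python. It is an assumed method for illustration, not necessarily the procedure used in the analysis described above: for every mark a judge gives to a skater from the judge's own country, it checks whether that mark exceeds the mean of the other judges' marks. With no bias, this should happen about half the time. The function name, data layout, and all marks are hypothetical.

```python
from statistics import mean

def home_lean(marks, judge_country, skater_country, judge):
    """Fraction of a judge's home-country marks that exceed the mean
    of the other judges' marks for the same skater.

    marks: {skater: {judge: mark}}  (hypothetical layout)
    Returns None if the judge marked no home-country skaters.
    """
    above = total = 0
    for skater, by_judge in marks.items():
        if skater_country[skater] != judge_country[judge]:
            continue  # only home-country skaters are of interest here
        others = [m for j, m in by_judge.items() if j != judge]
        if not others:
            continue
        total += 1
        if by_judge[judge] > mean(others):
            above += 1
    return above / total if total else None

# Invented illustration data, not real competition marks.
marks = {
    "skater_a": {"j1": 5.8, "j2": 5.5, "j3": 5.6},
    "skater_b": {"j1": 5.4, "j2": 5.3, "j3": 5.3},
    "skater_c": {"j1": 5.7, "j2": 5.4, "j3": 5.5},
}
judge_country = {"j1": "X", "j2": "Y", "j3": "Z"}
skater_country = {"skater_a": "X", "skater_b": "Y", "skater_c": "X"}

lean_j1 = home_lean(marks, judge_country, skater_country, "j1")
lean_j2 = home_lean(marks, judge_country, skater_country, "j2")
lean_j3 = home_lean(marks, judge_country, skater_country, "j3")
```

A single event gives each judge only a handful of home-country marks, so a fraction like this means little until it is pooled over many programs and tested for statistical significance.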
One instance, however, had serious repercussions.
Here, a judge shown to be biased on purely statistical grounds cast the
deciding vote giving Oksana Baiul the gold medal at the Olympics, and two
deciding votes that kept Olympic gold medalist Alexei Urmanov out of third place
in both the short and long programs at the World’s. (The offending judge,
Alfred Korytek, has more recently been suspended for more blatant cheating.)
I have no idea of how common such extreme home-country bias is today, as compared
to six years ago. The occasional data I’ve looked at – such as the Men’s
and Ladies’ long programs of the 2000 World’s – suggest that it is in fact
declining, rather than increasing as the stakes in figure skating competitions
get higher. If that is true, it is
probably only because it is so obvious and easy to detect. Korytek, who played the game skillfully in 1994 – including
rating some (low-ranking) Ukrainian skaters far lower than the rest of the panel
did, to make his over-all record look less biased – has moved on to less
easily identified forms of chicanery, such as collusion.
These cases then move into the final category.
Before we get to that, however, I should note that our majority-based systems appear,
indeed, to be effective at limiting the impact of this form of blatant
home-country bias. But this success
may be more apparent than real, as Korytek’s 1994 record demonstrates.
The final form of bias, then, is one that might be called simply “venal”, or in
common terms, outright cheating. Trading
favors, bribery, and any other form of advance rigging of the results all fall
into this category, and the recent boasts by members of the French figure
skating association testifying to their sophistication at “political”
approaches to judging decisions would appear to be at least a borderline
instance.
I doubt that anything can be done about this at present. Certainly the present
scoring systems do nothing to make it any less likely than any other system
would. Indeed, I am coming to the
tentative conclusion that the devious scoring methods currently in place make it
easier to disguise such chicanery than the more open and direct methods of raw
score judgment – but that is, at this point, at best an educated guess on my
part.
All in all, then, this consideration of the sources and identification of bias and
other threats to the validity of judges’ ratings gives very little support to
those who are content with the present systems.
They don’t seem to be particularly effective at stopping anything but
the most blatant forms of home country bias, and all rational considerations
indicate that more open scoring systems would at least allow us to be able to
identify bias or cheating when it does occur, better than do the current
systems.
4. Judges who cheat
In the discussions of cheating judges that I have encountered so far, we have
invariably talked about dishonest judges as if they existed in some kind of
vacuum inhabited only by judges and performing skaters.
So let’s pose another one of those simple-minded questions to see where
it will lead us: why do judges
cheat?
The typical answer is: to enhance the standings of their countries’ athletes, or
something like that. But think
about that for a moment. It’s
clear, for example, that by themselves judges have no reason – other perhaps
than patriotism – to cheat, and much to lose by doing so, if they get caught. And that means that if they do cheat, it is because someone
else – a skating association, an over-zealous coach, a gambler – wants them
to and is able to bribe or coerce them into doing so.
My point is that judges, both honest and dishonest, are only a part of a
very large web, involving many players, each with their own motives.
That large web then contains not just the judges, but also their FSAs, the referees
of the events, the disciplinary committees, and the ISU; as well, perhaps, as any
number of outside parties, such as the friends of Tonya Harding.
To some extent, all or most of these individuals or agencies are involved
whenever cheating occurs, either by making it easier or more difficult for the
judges to act in an improper manner.
It seems clear that at the present time, many aspects of this larger web not only
do little to stop cheating but actually make it much easier.
The primary agent – if only because it ultimately has all the power – is, of
course, the International Skating Union, an organization whose dedication to
secrecy appears to be matched only by its devotion to avoiding anything that
might look like blame. Certainly
its response to Jean Senft’s attempts, last season, to bring to light what
appeared to her to be a potential instance of collusion among judges – by
first ignoring, and then suspending her, rather than investigating the alleged
colluders – is worthy only of the Spanish Inquisition, and sends a loud and
clear message to all would-be cheaters that they have little to fear from their
governing body.
I suspect that referees and the referee system are equally conducive to dishonest
practices, but that may be only because I simply do not understand them.
The nub here is the matter of post-event reviews of the judges’
ratings, performed by the referees, acting – as near as I can tell – more or
less on their own best judgment. Here, judges’ marks for all (or perhaps only the top five
or ten) skaters are reviewed, and judges are asked to justify any out-of-line
marks they may have awarded. In
theory, this sounds like a reasonable system, but despite having discussed, or
attempted to discuss, this with a number of judges, I have been unable to
discover any principle or rule that would determine which judges or judgments
are “out of line” and which are not. At
least in their training sessions judges are instructed as to the degree of
agreement that is required of them in order to pass on to the next level of
certification; once all training is past, however, not even that much exists in
the way of criteria for acceptable performance. Again, whether intentional or not, such vagueness sets a
fertile field for chicanery to flourish in.
What all this leaves us with, then, is the awareness that if cheating exists in
figure skating at all, judges are only the visible tip of a very large iceberg
which may encourage or condone dishonest activities – by looking the other
way, and avoiding publicity as far as this is possible.
To look for a method of judging that would fix this system is roughly the
equivalent of looking for improved automobile speedometers as the solution to
drunk driving.
All of this discussion has, however, been quite hypothetical: none of us knows more about bias or cheating than that it has
occurred in the past often enough to get the entire Russian
panel of judges suspended for a year, and to lead to some suspensions this year.
Since the ISU neither publicizes such suspensions nor publishes any
guidelines to indicate the conditions under which suspensions occur, it is
impossible to tell how much more often suspensions occur or – more importantly
– should occur but do not (as with the off-again-on-again waltz involving
Senft’s accusations last season).
What seems very clear, however, is that judges, and the marks they give, are not the
best way to attack this problem. Indeed,
virtually the only advantage they may seem to promise is that of nipping some
bias in the bud: that is, allowing
a responsible majority to disenfranchise a biased judge prevents any damage that
bias may do before it can occur. This
may seem to be a far more powerful tool for ensuring the legitimacy of marks
than just identifying biased or dishonest judges for later punishment or other
action, which would seem to be the best that means-based systems can do.
But these are just “seemses” and a quite different scenario can be laid out for
the use of means-based scoring. That
scenario answers our final question: how
would a means-based system work, in actual practice?
5. O brave new world
So what would happen if we switched to a simple means-based scoring system?
How would it work, how would it address the challenges allegedly
addressed by BOM and OBO, and what problems would it raise?
First, of course, we would reap all the benefits of a means-based system, as compared
to those we now have. For starters,
these include:
Means-based scores are easier to understand;
Means-based scores are comparable from event to event;
Means-based scores never lead to reversals of position during a competition;
Means-based scores involve no distortions of judges’ decisions;
Means-based scores treat all judges equally; and
Means-based scores do not favor performance scores over technical.
In actual practice, means-based scores would work much as scoring does now, only
far more simply. That is, judges
would assign technical and performance scores to each skater; these would be
added or averaged across all judges, weighted appropriately for the program
(one-half for the short program in the men’s, ladies’, and pairs’ events,
and so on), and carried over into the next program.
At the end of the competition, the highest score wins, and so on down.
Should ties occur, sensible tie-breakers (discussed in my other posting)
would be invoked – say, allowing them to stand for marks below fifth place,
and holding a tie-breaker competition for marks between first and fifth places.
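
The procedure just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a specification: the one-half weight for the short program comes from the text, while the weight of 1.0 for the long program, the function names, the data layout, and all marks are invented for the example. Tie-breaking is left out.

```python
from statistics import mean

# Program weights: the text gives one-half for the short program;
# the 1.0 for the long program is an assumption for illustration.
WEIGHTS = {"short": 0.5, "long": 1.0}

def program_score(judge_marks):
    """Average (technical + performance) over the full panel, untrimmed."""
    return mean(tech + perf for tech, perf in judge_marks)

def final_standings(competition):
    """competition: {skater: {program: [(technical, performance), ...]}}"""
    totals = {
        skater: sum(WEIGHTS[prog] * program_score(panel_marks)
                    for prog, panel_marks in programs.items())
        for skater, programs in competition.items()
    }
    # Highest total wins, and so on down.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Invented marks from a three-judge panel.
competition = {
    "skater_a": {"short": [(5.5, 5.6), (5.4, 5.5), (5.6, 5.6)],
                 "long":  [(5.7, 5.8), (5.6, 5.7), (5.8, 5.8)]},
    "skater_b": {"short": [(5.7, 5.7), (5.6, 5.8), (5.7, 5.6)],
                 "long":  [(5.5, 5.6), (5.6, 5.5), (5.5, 5.7)]},
}
standings = final_standings(competition)
```

In this invented example skater_b leads after the short program but skater_a finishes first overall, because the scores simply accumulate from one program to the next and no ordinals or majorities are ever computed.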
What problems would this present? There
appears to be only one (other than the non-issue of judges allegedly having to
learn a new scoring system): dishonesty.
How well can majority-based systems handle this?
Well, judging by nothing more than Korytek’s success at the 1994
Olympics and Worlds, not very well. Indeed, the only support for the proposition
that majority-based systems are up to this task at all comes from Ed Russell’s
Monte Carlo studies of the effects of bias in several different judging systems,
including best-of-majority, Borda and trimmed Borda counts (these are simply
summed, or averaged, ranks) and trimmed means.
(Monte Carlo demonstrations, by the way, are large sets of hypothetical
data, using random numbers generated according to specific rules, to indicate
what might happen in the “long run”. They
have nothing to do with actual events whatsoever.)
But these show only that under the somewhat circular reasoning he used
(defining majority-based judgments as “correct”) even the majority systems
failed as much as 20% of the time under the mildest bias conditions, and about
90% of the time under the most severe.
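
For readers unfamiliar with the technique, the toy Python sketch below shows what a Monte Carlo demonstration of this kind looks like. It is not a reconstruction of Russell's studies: the scoring rule (a plain average of raw marks), the "true" marks, the noise level, the size of the bias, and the definition of failure are all assumptions chosen for illustration, and simulated marks are not clamped to the 6.0 scale.

```python
import random

def failure_rate(bias, trials=2000, judges=9, noise=0.1, seed=1):
    """Share of simulated events in which the weaker skater wins on raw means.

    All parameters are invented for illustration: skater A's "true" mark
    is 4.5 and skater B's is 4.4, every judge adds Gaussian noise, and
    judge 0 adds `bias` to skater B's mark.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    failures = 0
    for _ in range(trials):
        a = [4.5 + rng.gauss(0, noise) for _ in range(judges)]
        b = [4.4 + rng.gauss(0, noise) for _ in range(judges)]
        b[0] += bias
        if sum(b) > sum(a):  # same panel size, so sums compare like means
            failures += 1
    return failures / trials

no_bias = failure_rate(0.0)
heavy_bias = failure_rate(1.5)
```

The output depends entirely on the rules fed into the simulation, which is why such results describe the assumptions rather than any actual event.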
Once we recognize, then, that majority-based systems have only the limited virtue of
being better than some other alternatives, it becomes more reasonable to phrase
this issue in simple cost-benefit terms. How
much do we stand to gain by any set of procedures we adopt, and how much will
that cost us? And this makes it
possible to look at a wide range of alternative strategies.
Two that have been suggested recently, for example, are those of hiring an
independent pool of ISU (rather than national) judges, and of restricting the
selection of judges in various ways. The
first of these seems reasonable, particularly when accompanied by close
monitoring – more on that below – and the threat of loss of a fairly
prestigious job. It seems doubtful
that the existing power groupings within the ISU – many of which are quite
content with the present arrangements, because they actually leave more, rather
than less, room for chicanery than the more transparent means-based systems
would – will accept this. But
then, it seems doubtful that they will accept any change more drastic than the
cosmetic shift from BOM to OBO.
On the other hand, the ISU appears to have accepted at least a limited form of
panel adjustment by moving to random draws immediately prior to the competition.
But this too seems fairly minor. More
reasonably, we might ask that any country with a competitor seeded in the top
five positions be ineligible for that judging panel, although this would
probably permanently disenfranchise Russian and American judges.
(And, predictably, this new system does just the opposite, by restricting
the judges’ pool to those countries actually represented at the event.
Doesn’t this appear to endorse home country bias as something judges
are expected to engage in? What earthly justification can be offered for this
restriction?)
Any change, however, should be accompanied by better monitoring and evaluation of
the judges’ evaluations, and much of this can be done on-line.
I can even envisage a system that works more or less as follows.
Our computers now allow us to monitor at least the most egregious errors on the
judges’ part as they occur. These
are home-country-bias and deviation from the majority (as measured by the sort
of correlations described above). Such
monitoring could result in either or both of two actions.
First, any infraction (error) on the judge’s part would be assigned a pre-determined
number of penalty points, which would work much as driver’s points do now.
Collect enough, and you’re suspended; collect too many, and you’re
banned.
(Just how many are enough or too many is still to be determined, as are the actual
criteria for the errors.) As with
driver’s points, some of these would continue to accumulate over time; others
would be erased after some length of time.
Second, for the most serious infractions there would be the option of simply removing
that judge from the panel entirely. Recalculated numbers based on the eight- (or
six-) judge panel remaining can be made available as quickly as results are now:
the necessary calculations can be done in nanoseconds. (And if throwing
out a judge in the middle of a competition seems draconian, please note that the
judging system currently in use for artistic gymnastics allows just that.)
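
Both actions are simple to express in code. The Python sketch below is only an illustration: the thresholds and penalty values are hypothetical (the text leaves them undetermined), the class and function names are invented, and the expiry of points over time is not modelled.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical thresholds; the real values are still to be determined.
SUSPEND_AT = 6
BAN_AT = 12

@dataclass
class JudgeRecord:
    name: str
    points: int = 0

    def add_infraction(self, penalty):
        """Add the pre-determined penalty for one infraction."""
        self.points += penalty

    @property
    def status(self):
        if self.points >= BAN_AT:
            return "banned"
        if self.points >= SUSPEND_AT:
            return "suspended"
        return "active"

def panel_mean(marks, removed=()):
    """Mean of the panel's marks, skipping any judges removed mid-event."""
    return mean(m for judge, m in marks.items() if judge not in removed)

record = JudgeRecord("j9")
record.add_infraction(4)
record.add_infraction(3)

# Invented marks for one skater from a nine-judge panel.
marks = {"j1": 5.6, "j2": 5.5, "j3": 5.7, "j4": 5.6, "j5": 5.5,
         "j6": 5.6, "j7": 5.7, "j8": 5.6, "j9": 4.2}
full_panel = panel_mean(marks)
without_j9 = panel_mean(marks, removed={"j9"})
```

Recalculating on the remaining eight judges is a single pass over the marks, which is the sense in which the revised result can be available as quickly as results are now.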
Any of these systems, backed by the sort of strong enforcement outlined above, would
seem to address most of the issues of potential chicanery more directly, and at
far less cost, than our present attempts do.
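
The monitoring side can be sketched just as briefly. The Python example below assumes one plausible measure of deviation from the rest of the panel: the Pearson correlation between a judge's marks and the mean of the other judges' marks, taken across the skaters in an event. The measure, the 0.5 cut-off, the names, and the marks are all assumptions for illustration, not criteria proposed in the text.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length lists (non-constant inputs)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def agreement_with_panel(panel):
    """panel: {judge: [mark per skater, same skater order for every judge]}
    Returns {judge: correlation with the mean of the other judges}."""
    result = {}
    for judge, own in panel.items():
        others = [m for j, m in panel.items() if j != judge]
        rest_mean = [mean(col) for col in zip(*others)]
        result[judge] = pearson(own, rest_mean)
    return result

# Invented marks for four skaters.
panel = {
    "j1": [5.8, 5.6, 5.4, 5.2],
    "j2": [5.7, 5.6, 5.3, 5.1],
    "j3": [5.9, 5.5, 5.4, 5.0],
    "j4": [5.2, 5.3, 5.8, 5.7],  # marks against the grain of the panel
}
agreement = agreement_with_panel(panel)
flagged = [j for j, r in agreement.items() if r < 0.5]  # arbitrary cut-off
```

A flag of this kind says only that a judge disagreed with the others, not why; it is a trigger for review rather than proof of bias.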
What they can’t immediately deal with, however, is collusion. But then neither can majority-based systems, which fail about
50% of the time when two judges show only the mildest bias, and at least 70% of the
time when three or more judges are biased, according to Russell’s simulations.
But, strange as it seems, raw score systems may be better able to handle
this, even today, than are our majority systems.
The main reason for this is that at present it takes only
a small nudge (scoring a performance 5.3 for presentation and 5.2 for
technique in the long program, rather than 5.2 and 5.3) to bring about a change
in ranking, although this would not have any effect at all in a means-based
system. It follows that in order to
cheat in a means-based system, somewhat more blatant procedures have to be used;
and these will be that much easier to identify.
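
The nudge can be shown with a few lines of Python. The sketch assumes that a judge's placement of two skaters in the long program is decided by the total of the two marks, with equal totals broken in favor of the higher presentation mark; that rule, the function name, and the rival's marks are assumptions made for the example.

```python
def judge_prefers(a, b):
    """Which skater one judge places higher in the long program.

    Marks are (technical, presentation) pairs. Assumption: equal totals
    are broken in favor of the higher presentation mark.
    Totals are compared in whole tenths to avoid floating-point noise.
    """
    total_a = round((a[0] + a[1]) * 10)
    total_b = round((b[0] + b[1]) * 10)
    if total_a != total_b:
        return "a" if total_a > total_b else "b"
    if a[1] != b[1]:
        return "a" if a[1] > b[1] else "b"
    return "tie"

rival = (5.3, 5.2)    # invented marks for a competing skater
honest = (5.3, 5.2)   # technique 5.3, presentation 5.2
nudged = (5.2, 5.3)   # the same two marks, swapped

before = judge_prefers(honest, rival)
after = judge_prefers(nudged, rival)
same_total = round(sum(honest) * 10) == round(sum(nudged) * 10)
```

Swapping the two marks turns a dead heat into an outright placement above the rival, yet the total, and therefore the skater's contribution to any means-based score, is exactly the same.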
6. Summing up
The answer to the question posed in the title of this posting is then quite simple,
although not terribly satisfying: what
we can do about dishonest judges is, first and foremost, identify them.
Once that’s accomplished, the rest may or may not be quite
straightforward, depending on whether our governing organizations continue to
value their status more than the sport.
And this identification would be far easier if we switched to a means-based scoring
system, because such systems, being more transparent, both allow clear rules and
criteria to be drawn, and allow more direct analysis of the judges’ actual
behavior. Thus, I am currently
working on techniques that will, in fact, be able to statistically demonstrate
collusion when it occurs. I’m
persuaded that this can be done by the simple recognition that since collusion
must necessarily involve the manipulation of numbers, it must therefore be
visible, in some way, in the numbers themselves.
“Seeing” these patterns is somewhat simpler when the numbers involved
possess basic mathematical properties – which means and raw scores do, and
rankings don’t. And that
reasoning applies at all times, again arguing strongly for a means-based system
as the best procedure for identifying collusion, when telephone tapes and
surreptitious foot-signals are not available to provide actual smoking guns.
On balance, then, I would add the notion that means-based systems are better able
to identify chicanery and – given enforcement such as on-line rejection of
judges – better able to deal with it than are majority-based systems, to the
list of reasons for preferring this system to anything we now have.
Entire contents of this site copyright by Dirk L. Schaeffer