Why do we have judges at figure skating events?
Dirk L. Schaeffer
Vancouver, B.C.
1. Introduction
A
few months ago I became intrigued with the results of the 2000 Men's World's
Figure Skating Championships. At
this competition, in the Free Skate (long program) event, one skater was ranked
fourth by one judge and fifteenth by another, while another was ranked fifth and
sixteenth by two different judges, and a third eighth and seventeenth.
Since such discrepancies appeared excessive by any standards, I wanted to
examine these scores more closely to see if I could find an explanation.
That
explanation – or part of it – turned out to be easy enough to find, but in
the process of looking for it, I stumbled across some much more interesting and
significant numbers, and it is these that I want to report on now.
But I don’t just want to tell you what they are, since that would
reduce them to simple facts, for you to believe or not, according to whim.
They’re important enough, I think, for me to want you to believe and
understand them, and the only way I can hope to achieve that is to take you with
me on the journey that led to them. This
may take some time; I hope I can make it entertaining enough to hold your
attention. But it is a long road to
travel, and we shall have to start it at the very beginning – with the
question posed in the title: why do we have judges?
We
all know the answer: because we
want a system that will fairly evaluate skaters’ abilities, and we do not
have sufficient objective standards – such as height, speed, number of
rotations – to measure the full range of skaters’ programs. Hence, we select and train experts – judges – to make
these evaluations for us.
But
this simple answer has at least one ramification that may not be as obvious.
Just as we have no objective criteria to evaluate skaters, we can have no
objective standards to evaluate judges either.
And this means that the only criterion for judges’ abilities that we
can devise is that of their agreement with other judges.
It also means that if we are concerned about scoring systems – which is
what I was concerned about when I started questioning the discrepant judgments
at this event – the best scoring system is the one that results in the
greatest agreement among the judges. Our
present judging systems (OBO – one-by-one – or its uncle, BOM –
best-of-majority) were never designed with this question in mind, however: they
strive not for agreement among all judges, but only among the most judges,
rejecting – skater by skater – those that do not agree with the majority.
I
know of only two alternative scoring systems that could be used, each in two
forms: Borda counts, and average
marks. Borda counts are the sums or
averages of the ranks assigned to each skater by the set of judges; their
variant form is the trimmed Borda, which throws out the highest and lowest ranks
in each set. Average marks are
essentially just that: the sum or
average of all the raw marks assigned by the set of judges; the variant form is
the trimmed mean, again throwing out the highest and lowest.
(Another possible variant here would be that of differentially weighting
the technical and performance scores before calculating the sum of the two; but
I don’t know of any attempts to implement this, or even suggest what
weightings to use.)
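As a rough illustration of the four alternatives just described, here is a minimal Python sketch. The panel of five judges and four skaters is entirely hypothetical (it is not data from any event discussed here), and the helper names are invented for the example. Note that for the Borda figures a lower number is better, while for the marks a higher number is better.

```python
from statistics import mean

# Hypothetical panel (not data from any real event): one row per judge,
# one summed mark per skater, in the order A, B, C, D.
skaters = ["A", "B", "C", "D"]
judge_marks = [
    [11.4, 11.2, 10.8, 10.4],
    [11.2, 11.4, 10.6, 10.8],
    [11.6, 11.0, 11.2, 10.6],
    [11.0, 11.2, 11.4, 10.2],
    [11.4, 10.2, 10.8, 10.6],
]

def ranks(row):
    """Rank one judge's marks, 1 = best; tied marks would share the better rank."""
    return [1 + sum(other > m for other in row) for m in row]

def trimmed_mean(values):
    """Mean after discarding the single highest and single lowest value."""
    return mean(sorted(values)[1:-1])

def by_skater(rows):
    """Turn judge-by-skater rows into one list of values per skater."""
    return dict(zip(skaters, (list(col) for col in zip(*rows))))

rank_lists = by_skater([ranks(row) for row in judge_marks])
mark_lists = by_skater(judge_marks)

borda = {s: mean(v) for s, v in rank_lists.items()}                  # average rank
trimmed_borda = {s: trimmed_mean(v) for s, v in rank_lists.items()}  # trimmed Borda
mean_mark = {s: mean(v) for s, v in mark_lists.items()}              # average mark
trimmed_mark = {s: trimmed_mean(v) for s, v in mark_lists.items()}   # trimmed mean

for s in skaters:
    print(s, borda[s], round(trimmed_borda[s], 2),
          round(mean_mark[s], 2), round(trimmed_mark[s], 2))
```

In this made-up panel, skaters B and C tie on the Borda count, but the average mark separates them, which is one way the rank-based and mark-based approaches can diverge.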
Now
in most cases these approaches may differ very little in the final rankings
they produce for the skaters at any event; but that's
not the point. The point is that if
one system achieves more agreement among the judges than the others do, that is
the system that we should be using, if we are at all serious about this business
of judging. Indeed, and
particularly after the intense discussion occurring when the International
Skating Union switched from BOM to OBO a few years ago, we have every right to
ask that proponents of majority systems demonstrate that they are in fact better
at achieving agreement among judges than any other system.
Of
course, nobody has offered any such demonstrations.
And that became the more interesting question that the data of the 2000
men’s competition raised for me. Let
me confess, however, that I instantly rejected Borda counts as a working
alternative, for reasons of the system’s known weaknesses, as well as many others
that became apparent as I worked my way through the data.
For all practical purposes, I became concerned here only with the
distinctions between equality-based and majority-based scoring systems.
Let
me now describe what I did and, more importantly, why I did it, in analyzing
these numbers: taking us into the arcana of inferential statistics.
(By
way of introduction, you should understand that there are actually two kinds of
statistics, called “descriptive” and “inferential”. Descriptive statistics are fairly easy: they are merely numbers that summarize a set of numbers –
for example, a baseball player’s batting average, or the incidence of HIV in a
population. Inferential statistics
are tougher: these procedures relate statistics to the laws of probability - the
dreaded “bell-shaped curve” - in order to allow us to make predictions from
them. For example, while a baseball
player’s batting average tells us little about how well others may do,
correlating batting averages with such factors as home-vs-away, distance to the
right field wall, or handedness of the pitchers, will allow far more accurate
predictions. Similarly, the incidence of
HIV in any given location will tell us little about its incidence in other
locations, until we correlate it with factors such as poverty, drug use, and
sexual practices. Inferential
statistics can do this, when they are successful, because they tap into the
underlying principles that affect or control the numbers in question; and it is
these underlying principles, rather than any specific predictions, which were of
concern to me.)
2. Measures a
The
statistic that directly measures the degree of agreement between two sets of
numbers - say, the marks given to the same skaters by two judges, or the marks
given by one judge and some other measure, such as final standing - is called
the "co-efficient of correlation," or, more simply “correlation”.
It is currently one of the most commonly used statistics in health, social
sciences, biology, and most other disciplines.
It’s a number that ranges from +1.00 to -1.00, and the closer it falls
to either extreme, the stronger the relationship between the two sets (values
near -1.00 would mean that the two orderings are nearly reversed).
"Percent agreement" - the statistic used throughout these
analyses - is simply the square of the correlation. (There are reasons for
squaring this number, but they are relatively arcane, and do not make any
difference here.)
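For readers who want to see the arithmetic, the following is a minimal sketch of the correlation and of "percent agreement" as its square. It assumes the ordinary Pearson product-moment correlation; the two judges' marks are hypothetical, and the function names are invented for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def percent_agreement(x, y):
    """Square of the correlation, expressed as a percentage."""
    return 100 * pearson(x, y) ** 2

# Hypothetical summed marks given by two judges to the same five skaters.
judge_1 = [11.4, 11.2, 10.8, 10.4, 10.0]
judge_2 = [11.2, 11.4, 10.6, 10.8, 10.0]

print(round(pearson(judge_1, judge_2), 3))
print(round(percent_agreement(judge_1, judge_2), 1))
```

Because the correlation is squared, a correlation of 0.9 corresponds to 81 percent agreement, and a perfectly reversed ordering (a correlation of -1.00) would also score 100 percent, which is worth keeping in mind when reading the figures.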
However,
correlations apply only to two sets of numbers, and in the case of these skating
data I had nine judges to deal with in order to get an overall index of
agreement for any measure. How was
I to handle this?
Two
methods presented themselves: one
was to correlate the judgments of each judge with all eight others - resulting
in a total of 36 correlations among the nine judges - and average those.
The other was to correlate each judge's marks or ranks with the average
of all eight other judges, and then average those nine correlations to get the
single index.
I
chose the second method, for a variety of reasons, two of which seemed fairly
important. Consider, for example,
the case of a single judge who is far, far out of line with all others.
Under the first system, his aberrant scores will pull eight of the
thirty-six correlations - almost one quarter - down.
Under the second, it is only one of nine correlations that will be thrown
off. The second reason was that as
long as the same method was being applied in all cases, it didn't really make
much difference which I chose, and the average-of-eight-others method was by far
the cleaner.
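The two candidate methods can be sketched as follows. This is only an illustration under stated assumptions: the four-judge panel is hypothetical, the function names are invented, and the sketch squares each correlation before averaging, which is one possible reading of the procedure (the text does not say whether squaring comes before or after averaging).

```python
from itertools import combinations
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical marks: one row per judge, one summed mark per skater.
panel = [
    [11.4, 11.2, 10.8, 10.4, 10.0],
    [11.2, 11.4, 10.6, 10.8, 10.2],
    [11.6, 11.0, 11.2, 10.6, 10.0],
    [10.8, 10.2, 11.4, 10.0, 11.0],  # a judge out of line with the rest
]

def pairwise_agreement(panel):
    """Method 1: average percent agreement over every pair of judges."""
    return mean(100 * pearson(a, b) ** 2 for a, b in combinations(panel, 2))

def leave_one_out_agreement(panel):
    """Method 2: each judge against the average marks of all the other judges."""
    scores = []
    for i, judge in enumerate(panel):
        others = [row for j, row in enumerate(panel) if j != i]
        others_mean = [mean(col) for col in zip(*others)]
        scores.append(100 * pearson(judge, others_mean) ** 2)
    return mean(scores)

# With nine judges there are 36 pairs, and any one judge appears in 8 of them.
pairs = list(combinations(range(9), 2))
pairs_with_judge_0 = [p for p in pairs if 0 in p]

print(len(pairs), len(pairs_with_judge_0))
print(round(pairwise_agreement(panel), 1), round(leave_one_out_agreement(panel), 1))
```

The pair counts confirm the arithmetic in the text: a single aberrant judge touches 8 of the 36 pairwise correlations, but only one of the nine judge-versus-others correlations is that judge's own.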
It
was also apparent at the start of this investigation that I would not get very
far in my search for what went wrong in the judging of the men's competition if
I did not have some other, more reasonable, set of numbers to compare these data
to. For this, the Ladies’ long
program of the 2000 World's seemed the most obvious choice:
although a few extreme rankings could be observed here, in general this
appeared a reasonably uncomplicated, "normal" event.
The results for this competition are therefore included throughout.
3. Preliminary findings
But
what can such statistics tell us? For
starters, let's look at the results for the presentation and technique scores,
and their sum, for the two sets of skaters.
(Here, and throughout, the numbers presented were found by first
calculating the correlation of the judgments of the entire set of skaters made
by any one judge with some other measure (such as the average judgments of all
eight other judges, or final standing), and then averaging the correlations
describing all nine judges, to get one single “percentage agreement” number
to describe the entire set.)
Percent agreement
Men's    Ladies   Source
87.6%    92.7%    Presentation
92.7%    95.1%    Technique
92.5%    93.5%    Sum
There
are several things we can learn from even these very preliminary numbers.
First,
and not too surprisingly, they are all extraordinarily high. Correlations of
this magnitude are found only in the rarest of cases, where - as in this case -
the judges or raters have had extensive experience in attending to common cues
and reaching shared decisions. By
comparison, correlations indicating agreement levels of 10% or less are often
sufficient to achieve statistical significance and are then used to underpin
legislation, determine the direction of political and advertising campaigns, and
the like.
Second,
agreement is consistently higher among the judges of the ladies' event than
among those of the men's. Of
course, we suspected that anyway, given the number of widely discrepant
judgments in the men's event. But
looked at in this context, the numbers also suggest a simple explanation for
some of the difficulties there: the men's competition may simply have been
harder to judge, in the sense that there were more skaters of nearly equal
ability and talent in that event than in the ladies'.
Third,
technique is more reliably judged, in both cases, than presentation.
We could have guessed that too, just by considering our own abilities and
discussions in newsgroups and the media. But in this case the difference is far
more pronounced in the men’s event than in the ladies’, and thus more
clearly locates the source of the problems found there.
None
of these conclusions are particularly surprising.
But they are quite gratifying in suggesting that the percentage agreement
statistic is a realistic one, which can provide us with useful summaries -
particularly, in this case, in highlighting the differences between the men's
and ladies' judges - of judges' behavior.
But
a fourth conclusion from these numbers may be a little more interesting: namely
the fact that agreement on technique is higher, in both groups, than agreement
on the summed scores. There are
several things to be noted about that.
For
example, if one were to be very zealous in insisting that agreement among judges
is the only thing that matters, it might seem that we could improve things
simply by not trying to judge presentation.
But of course we can't do that, since there is general agreement that
both elements are essential for a skating performance.
Only later, after both elements have been summed and become subject to a
variety of actuarial manipulations, can we argue that if two procedures lead to
differing amounts of agreement, the one that does better should be accepted.
But at this level, that argument cannot be made.
More
importantly, the decline in agreement from technique to summed scores suggests
- but only suggests - that one aspect of what I (and most others) assume to be
the judges' thought processes may not work as well as we thought.
The
reasoning goes something like this: judges of varying backgrounds may value
either technique or artistry more highly than the other (as Russian dance
judges, for example, were said to value expression while the classical British
tradition favored technique). Assigning
marks to both these, and simply summing them - which has the effect of weighting
them equally - could provide an opportunity for these summed scores to cancel
out such differences, theoretically allowing agreement on summed scores to be
higher than agreement on either element. But this has not happened here.
But
this is a minor issue in the scoring procedure, since summing the raw
presentation and technique scores is only the third step in the judges' marking
processes. They are also required
to translate these raw scores into ordinal ranks, and to adjust for ties by
assigning a higher value to the presentation marks than the technical.
Naturally, we would like to know what the effect of these transformations
is.
4. Ranking and tie-breaking
This
ordinal rank, the argument goes, is what the judge actually has in mind when she
assigns marks to a skating performance, using her rating of technique and
presentation merely as numbers which, when transformed, will lead to the desired
rank. (And which also serve as
convenient reminders of the performance of a prior skater.)
But
as I said, this move from raw sums to ordinals involves two different
activities: one is tie-breaking, and the second is ranking. The first of these is a non-arithmetic transformation and
cannot be modeled by any arithmetic system, but the second can easily be
duplicated. That is, we can simply
translate the raw scores into ranks, leaving ties as they occur, and compare
agreement values under these conditions to the agreement values based on the
ordinals. This will give us an
indication of the effect of ranking alone, as compared to the joint effect of
ranking and tie-breaking, so that subtracting the first of these from the second
will allow us to estimate the effect of tie-breaking in isolation.
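Ranking raw scores while leaving ties in place can be sketched like this. The sketch assumes the common convention of giving tied scores the average of the ranks they occupy; the text does not say which convention was actually used, and the function name and sample marks are hypothetical.

```python
def rank_with_ties(scores):
    """Rank scores with 1 = highest; tied scores share the average of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # Extend the run while the next score equals the current one.
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

# Hypothetical summed marks from one judge for four skaters.
print(rank_with_ties([11.0, 10.7, 11.0, 10.2]))
```

Agreement computed on ranks like these reflects ranking alone. Agreement computed on the judges' published ordinals reflects ranking plus tie-breaking, so the difference between the two estimates the effect of tie-breaking.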
Let's
look at those numbers.
Men's    Ladies   Source
92.7%    95.6%    Ordinals
92.9%    95.6%    Ranked marks
And
what do these tell us? Well for
starters, they're at least as high as any of the corresponding numbers in the preceding set,
indicating that ranking can improve agreement.
But unfortunately, when we subtract the second number from the first, we
get negative scores in the men’s event and no difference in the ladies’,
meaning very simply that the tie-breaking procedures currently used either don't
do us any good, or actually do us bad: that
is, they decrease agreement among judges.
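The subtraction is simple enough to check by hand, but for completeness here it is as a short sketch, using the four percentages reported above. The variable names are invented for the example.

```python
# Percent agreement figures reported above for the two events.
ordinals = {"men": 92.7, "ladies": 95.6}      # ranking plus tie-breaking
ranked_marks = {"men": 92.9, "ladies": 95.6}  # ranking only, ties left in place

# Effect of tie-breaking alone: ordinals minus ranked marks, in percentage points.
tie_break_effect = {
    event: round(ordinals[event] - ranked_marks[event], 1)
    for event in ordinals
}
print(tie_break_effect)
```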
(When
this became apparent to me, I began looking at the issue of tie-breakers more
generally, and after a while that sort of got out of hand.
My conclusions on that topic then formed the basis for a separate paper,
since most of them are not directly relevant to the issues of concern here.
In that other paper, I argue that tie-breaking procedures are stupid,
irrelevant, ineffectual, biased and counter-productive; with the numbers given
here used to shore up the "ineffectual" part of the argument.)
But
what, generally, does ranking do to any set of real numbers?
Well, it sort of depends on the numbers themselves.
If they are fairly evenly spaced along their line, ranking will smooth
out minor differences and lead to slightly higher agreement statistics.
Of course, these don't testify to more real agreement - the real numbers
don't change, but they look neater. If,
on the other hand, the raw numbers are bunched on their scale, several crowding
close together while leaving large gaps in other places, ranking can
artificially create differences where none, or few, exist.
We
can see the first alternative at work in the results for the ladies'
competition, where ranking alone creates a jump in agreement of about two
percentage points (from 93.5% to 95.6%).
By all indications, this was a straightforward competition where judges
may have had some disagreements, but no major differences.
But
the numbers look far different for the men's competition, edging up less than
one percentage point as a result of the ranking procedure (from 92.5% to 92.9%),
and falling back half that way again (to 92.7%) when tie-breaking is added.
Let
me try to illustrate this by use of some admittedly extreme cases.
You'll remember that in the judging of the men's event Vincent
Restancourt was ranked fourth by one judge, fifteenth by another.
This certainly appears to indicate clear disagreement, about as bad as it
can get. But just looking at the
summed scores for these two judgments shows an entirely different picture: the
one judge had scored him at 11.0, the other at 10.7.
Similarly, two judges placed Anthony Liu at eighth and seventeenth place,
a nine-place spread. But their raw
marks place him at 10.9 and 9.7 - again a much smaller difference.
So
what accounts for the discrepancy in these numbers that first started this
investigation? Mostly, it seems, a
very confusing scoring system which led to judges providing rankings that they
probably never intended; and partly, a tough competition, where the bulk of the
skaters, in the middle range, were of virtually equal ability.
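To see how a small difference in marks can turn into a large difference in rank when the field is bunched, consider this hypothetical sketch. Only the marks 11.0 and 10.7 and the placements of fourth and fifteenth come from the example above; the marks for the rest of the field are invented, and for simplicity both judges are assumed to have given the other skaters identical marks.

```python
# Invented marks for the rest of the field: three clear leaders, a tight
# middle group between 10.7 and 11.0, and a few skaters further back.
others = [11.6, 11.4, 11.2] + [10.9] * 5 + [10.8] * 6 + [10.6, 10.4, 10.2]

def place(mark, field):
    """Placement implied by a mark: one more than the number of higher marks."""
    return 1 + sum(other > mark for other in field)

print(place(11.0, others))  # the judge who gave 11.0
print(place(10.7, others))  # the judge who gave 10.7
```

With eleven skaters packed into the gap between the two marks, a difference of three tenths of a point is enough to move a skater from fourth place to fifteenth.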
But
we're getting ahead of our story here, because our investigations have only
taken us as far as the judges' rankings. There's
still the OBO transformation, which combines the scores of all judges (or a
shifting majority of them) into a single final standing, to be considered.
5. OBO and direct ranking
In
going from raw summed scores to final placements, one-by-one or BOM systems
twice invoke ranking and tie-breaking procedures: first at the level of
individual judges to determine ordinals, and then at the level of the judging
panel as a whole to translate these ordinals into majority judgments, which are
again ranked to determine the final placements.
How can we duplicate these procedures without using non-arithmetic
transformations such as tie-breakers and majority judgments?
Just
as we used simple ranks to compare to ordinals in the last set of analyses, we
can use the averages of these ranks as a straightforward comparison to the OBO
standing. This leads to our final
set of numbers here - which now compare each judge with an outside criterion
(OBO standing or average of the nine ranks) rather than compare each judge with
the average of the eight others.
Men's    Ladies   Source
93.7%    96.4%    OBO final standing
95.5%    96.4%    Mean scores
Very
briefly, for the ladies’ event, both OBO and averaged scores produce the same
amount of agreement; for the men’s event, OBO does markedly worse.
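Comparing each judge with an outside criterion can be sketched as below. The three-judge panel of ranks is hypothetical, the function names are invented, and the criterion shown is simply the mean of the judges' ranks; an OBO or BOM final standing could be passed in its place, but computing such a standing is not attempted here.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def agreement_with_criterion(judge_ranks, criterion):
    """Average percent agreement of each judge's ranks with one shared criterion."""
    return mean(100 * pearson(ranks, criterion) ** 2 for ranks in judge_ranks)

# Hypothetical ranks: one row per judge, one rank per skater.
judge_ranks = [
    [1, 2, 3, 4, 5],
    [2, 1, 3, 5, 4],
    [1, 3, 2, 4, 5],
]

# Criterion: the mean rank each skater received across the whole panel.
mean_rank = [mean(col) for col in zip(*judge_ranks)]
print(round(agreement_with_criterion(judge_ranks, mean_rank), 1))
```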
But
let’s put these numbers in context. Remember,
all these judges were trained and tested and re-trained and examined under a
majority-agreement system; all of them were taught that their major task was
that of ranking skaters, not of marking them on an absolute scale.
But despite all that training and practice, it turns out that the
majority system does no better than an equality-based approach under normal
circumstances, and does markedly worse (!) when the task becomes more difficult.
But
this is clearly heresy. And even I
wasn’t about to believe it at first blush.
6. Replication
When
a researcher is faced with data and findings that contradict what everyone knows
to be true, the first thing to do - if he is serious - is to question his data.
And the first question to ask is that of their representativeness: could
the same results occur for another data set, or were these, for whatever reason,
unique to the 2000 World's Championships? So
I scoured my files for more data, noting as I did so that the complete set of
the 1994 World's and Olympics results, which I had used to analyze home-country
bias some years ago, was useless here. At
that time, I believed, as we all did, that it was in fact only the judges'
ordinals, and not the raw marks, that properly reflected their judgments, so I
confined myself to that set of numbers.
Fortunately,
however, I did manage to turn up a complete set of data for the long program of
the Pairs competition at the 1997 World's Championships, which Deb England had
sent me apropos of some conversation at that time, and which seemed to meet my
needs perfectly. Like those I'd
been using, it was a World Championship event, but represented a different
discipline and was scored according to a slightly different system, the BOM,
rather than the OBO procedures in place now.
So
I redid most of the calculations above - except those for the technical and
performance scores - using this new data set.
But even before I did so, some of the weirdnesses of majority-based
systems became quite apparent. For
example, first place at this event was won by Woetzel & Steuer of Germany:
but they were actually ranked second by a majority of the judges (six, in fact).
Eltsova & Bushkov of Russia came in second, with four first place
marks, compared to the three garnered by Woetzel & Steuer.
Now
I know you understand how this happened, but try to put yourself in the place of
somebody who doesn't fully understand how these systems work, and explain to him
that, yes, in figure skating the final standings are determined by the majority
of judges, and yes, Woetzel and Steuer were ranked second by the majority of the
judges, and yes, they finished first.
For
the rest though, the judging seemed fairly straightforward, so that it was not
until you got down to Berezhnaya & Sikharulidze, in twelfth place, that real
disagreement occurred, with one judge placing them tenth, another eighteenth.
Re-running
the same analyses as on the 2000 data here, then, produced essentially the same
results. The within-judges (each
judge correlated with the average of the other eight) correlation for ranked sum
scores yielded an agreement percentage of 95.3 – slightly higher than the
93.5% found for the 2000 Ladies’ event. And
again, it slipped (by almost two percentage points, to 93.7%) when ordinals were
used.
Similarly,
the correlation of the ordinals with the BOM final standing was 94.5%, almost a
full percentage point lower than that of the mean mark with the individual marks
(95.3%). It seems, then, that if
there was anything atypical about the 2000 World’s, it was the relatively good
performance of the OBO data in the ladies’ event.
So
it appears that these findings are fairly robust, as statisticians like to say.
So robust, in fact, that the burden of proof is now on those who wish to
defend the existing systems.
7. But does it really matter?
I
think I've shown, in the preceding discussion, that majority-based scoring
systems produce less agreement among judges than do means-based systems.
But how much difference does our choice of scoring system make, anyway?
That is, given the high degree of agreement among judges that we have
noted throughout, how different can the final standings become if one system is
used rather than another?
These
questions are easily answered with the data at hand. Thus, if the three competitions here considered had been
decided solely on the basis of the total score achieved by each skater, summed
or averaged across the nine judges, the following changes over OBO or BOM
standings would have resulted:
In
the Men's Competition, Michael Weiss would have nosed out Elvis Stojko for the
Silver (11.29 to 11.28); Stanick Jeannette and Zhengxin Guo would have reversed
their positions at eighth and ninth places, as would Vitali Danilchenko and
Alexander Abt, in eleventh and twelfth place; as well as Ivan Dinev and Sergei
Rylov in twentieth and twenty-first positions.
In
the Ladies' Competition, seventh placed Viktoria Volchkova would have dropped to
ninth, with Mikkeline Kierkegaard and Julia Sebestyen moving up a place each to
seventh and eighth, respectively. Zoya Douchine and Tatiana Malinina would have
switched places at eighteenth and nineteenth.
In
the 1997 Pairs' Competition, Woetzel & Steuer's first place finish would
have been straightforward. And
Berezhnaya & Sikharulidze and Savard-Gagnon & Bradet would have reversed
their positions at twelfth and fourteenth respectively.
I doubt that anyone could mount a successful argument for any of these patterns in preference to the other, on the basis of the performances actually given at these events. But it is apparent that the fundamental approach used here - seeking the most agreement among all judges, rather than absolute agreement among the most judges - does not lead to conclusions far different from those that would have occurred under the majority-based systems.
8. More absurdities of majority-based systems
The
situation of awarding first place to competitors ranked in second position by a
majority of judges is only one of the logical self-contradictions that arise
with the use of majority-based scoring systems.
Two other fairly glaring problems can also be demonstrated quite easily.
These are the lack of pairwise consistency of the systems, and their
selective disenfranchisement of judges.
Pairwise
consistency means, in a simple example, that if three skaters - call them A, B,
and C - finish in first, second, and third place as determined by BOM or OBO
scoring, then A should be rated higher than B by majorities of judges, and B
rated higher than C. This seems
straightforward. Unfortunately,
Edmund Russell has shown (“Amateur figure skating: is the ranking system out
of date?” 1995 Proceedings of the American Statistical Association) that it is
possible for these skaters to have been rated in such a manner that a majority
of judges rate A over B, B over C -- and C over A!
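A small sketch shows how such a cycle can arise. The three-judge panel below is the textbook voting-cycle arrangement, used here as an assumption for illustration; it is not necessarily the example Russell gives. Repeating each of the three orderings three times would produce the same cycle on a nine-judge panel.

```python
# Hypothetical panel: each list is one judge's ordering, best skater first.
panel = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

def majority_prefers(x, y, panel):
    """True if more than half of the judges place skater x ahead of skater y."""
    votes = sum(order.index(x) < order.index(y) for order in panel)
    return votes > len(panel) / 2

for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(x, "over", y, majority_prefers(x, y, panel))
```

Each of the three comparisons is carried by two judges out of three, so every skater is beaten by some other skater on a majority vote.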
This
is, of course, wholly absurd; and, more trenchantly, the decision to declare one
of these a "winner" on the basis of nothing more than an artifact of
the scoring procedure seems fundamentally unjust.
(It is also, by the way, only a formal version of the fairly common
problem of skaters’ relative rankings changing following the performances of
subsequent skaters – the issue that OBO was supposed to fix.)
Selective
disenfranchisement of judges is also a common issue, but one which has not, as
far as I know, ever been seriously examined.
It is, of course, obvious that disenfranchisement occurs – when it is the
majority that decides, those who are not in the majority have no say.
But the full ramifications of this fact may not be immediately apparent.
For
example, if majority rule was instituted only to guard against minor errors –
momentary inattention, say, or home-country bias – one would expect each of
the nine judges on the panel to be a member of the disenfranchised minority
approximately the same number of times at any event.
But this doesn’t seem to happen. At
the 1997 Pairs, for example, one judge was rejected from the majority only twice
in her 23 judgments, while two others were rejected eight and nine times each.
Similarly, at the 2000 Men’s (scored by OBO, rather than BOM, but with
the same effects), two judges were rejected nine times each, and a third eight
times; and at the Ladies’ competition, two judges were rejected eight times
each. And what does this tell us?
Quite
simply, that to the extent that these three competitions are representative of
international competitions generally, our scoring systems indicate that more
than one quarter of our judges are so bad at their jobs that we cannot trust
more than two out of every three judgments they make!
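The arithmetic behind these two fractions can be checked with a few lines. The counts are those reported above; the sketch assumes nine judges on each of the three panels, and it uses the 23 judgments of the 1997 Pairs event for the per-judge rate, since the number of judgments per judge at the other two events is not stated here.

```python
# Judges rejected eight or more times at each event, as reported above.
heavily_rejected = {"1997 Pairs": 2, "2000 Men's": 3, "2000 Ladies'": 2}
judges_per_panel = 9  # assumption: nine judges on every panel

total_judges = judges_per_panel * len(heavily_rejected)
share_of_judges = sum(heavily_rejected.values()) / total_judges

# A judge rejected 8 times in 23 judgments (the 1997 Pairs figures).
kept_rate = 1 - 8 / 23

print(round(share_of_judges, 3))  # fraction of judges affected
print(round(kept_rate, 3))        # fraction of that judge's judgments that counted
```

Seven judges out of twenty-seven is just over one quarter, and a judge rejected eight times in twenty-three has only about two of every three judgments counted.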
Does
anyone believe that? I certainly
don’t, since all the data I’ve seen and cited here indicate that they appear
amazingly good at what they do. I
doubt the ISU believes it, because it would give them serious reason to
reconsider not only their training programs, but also the whole idea of using
judges. It is, in fact, patently
absurd. But it is undeniably what
our scoring systems are telling us, and what they are doing to the judges.
9. A summary
Let
me pull together now all the material that has been documented above, in the
simplest way I can.
Briefly,
beginning with the assumption that interjudge agreement is the only meaningful
criterion for deciding between scoring systems in figure skating competitions,
I’ve shown that majority-based systems such as BOM and OBO yield less
agreement than do simple means.
I’ve
indicated that in actual practice, using simple mean scores produces results
that are practically indistinguishable in their final effects from those
produced by majority-based systems. If there was a meaningful distinction to be
made, it seemed to favor the mean-based system.
It’s
as if, where other sports use standardized rulers, figure skating uses an
elastic one. Under majority
systems, it will measure eleven inches for some skaters, up to thirteen for
others. With means-based systems,
it measures between eleven-and-one-half and twelve-and-one-half inches.
While this may not seem like a very large difference, there does not seem
to be any excuse for not using the best system we have.
And
I’ve shown repeatedly that BOM and OBO lead inevitably to absurd situations
(ranging from the assignment of ranks different from the majority rating in a
supposedly majority-based system, to the unnecessary disenfranchisement of more
than a quarter of the judges).
And
I have barely mentioned, so far, any of the other benefits of mean-based
scoring. These include:
Means-based scores are easier to understand;
Means-based scores are comparable from event to event;
Means-based scores never lead to reversals of position during a competition;
Means-based scores involve no distortions of judges’ decisions;
Means-based scores treat all judges equally, as they should; and
Means-based scores do not favor performance over technique.
(Does
this last need explanation? Very
simply, the tiebreaker in the long program is the performance score, while in
the short program it is the technical score.
Since the long program is weighted about twice as heavily as the short in
most competitions, when tie-breakers are used, they tend on average to favor
performance.)
Given
all these advantages, it seems fair to ask why we continue to use scoring
systems that deprive us of their benefits.
Rationally, the only answer that seems available is that means-based
systems also have disadvantages that majority-based systems do not, or that
majority-based systems have advantages that means-based systems do not, and that
– for one reason or another – the balance favors majority-based systems.
Less rationally, other explanations – habit or venality, for example
– may also play a role.
All
these issues, as well as those of the effects of abandoning our majority-based
systems, need further exploration. But
that’s primarily a matter of weighing different arguments and suppositions,
alternatives and possibilities: empirical
data, drawn from actual competitions, cannot help us make any of these
decisions.
The
material I have presented so far, on the other hand, has been fundamentally
empirical in nature: real scores
taken from real events. I’d
prefer to leave it that way for now. An
attempt to deal with the hypothetical issues follows in what is, essentially, a
sequel to this paper, under the title of “What can we do about cheating
judges?”
Entire contents of this site copyright by Dirk L. Schaeffer