
 

 

Why do we have judges at figure skating events?

 

 

Dirk L. Schaeffer

Vancouver, B.C.

 

 

1. Introduction

 

A few months ago I became intrigued with the results of the 2000 Men's World's Figure Skating Championships.  At this competition, in the Free Skate (long program) event, one skater was ranked fourth by one judge and fifteenth by another, while another was ranked fifth and sixteenth by two different judges, and a third eighth and seventeenth.  Since such discrepancies appeared excessive by any standards, I wanted to examine these scores more closely to see if I could find an explanation.

 

That explanation – or part of it – turned out to be easy enough to find, but in the process of looking for it, I stumbled across some much more interesting and significant numbers, and it is these that I want to report on now.  But I don’t just want to tell you what they are, since that would reduce them to simple facts, for you to believe or not, according to whim.  They’re important enough, I think, for me to want you to believe and understand them, and the only way I can hope to achieve that is to take you with me on the journey that led to them.  This may take some time; I hope I can make it entertaining enough to hold your attention.  But it is a long road to travel, and we shall have to start it at the very beginning – with the question posed in the title: why do we have judges?

 

We all know the answer:  because we want a system that will fairly evaluate skaters’ abilities, and we do not have sufficient objective standards – such as height, speed, number of rotations – to measure the full range of skaters’ programs.  Hence, we select and train experts – judges – to make these evaluations for us.

 

But this simple answer has at least one ramification that may not be as obvious.  Just as we have no objective criteria to evaluate skaters, we can have no objective standards to evaluate judges either.  And this means that the only criterion for judges’ abilities that we can devise is that of their agreement with other judges.  It also means that if we are concerned about scoring systems – which is what I was concerned about when I started questioning the discrepant judgments at this event – the best scoring system is the one that results in the greatest agreement among the judges.  Our present judging systems (OBO – one-by-one – or its uncle, BOM – best-of-majority) were never designed with this question in mind, however: they strive not for agreement among all judges, but only among the most judges, rejecting – skater by skater – those that do not agree with the majority.

 

I know of only two alternative scoring systems that could be used, each in two forms:  Borda counts, and average marks.  Borda counts are the sums or averages of the ranks assigned to each skater by the set of judges; their variant form is the trimmed Borda, which throws out the highest and lowest ranks in each set.  Average marks are essentially just that:  the sum or average of all the raw marks assigned by the set of judges; the variant form is the trimmed mean, again throwing out the highest and lowest.  (Another possible variant here would be that of differentially weighting the technical and performance scores before calculating the sum of the two; but I don’t know of any attempts to implement this, or even suggest what weightings to use.)
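As a rough illustration of the four alternatives just described, the sketch below computes each of them for a single skater. The ordinals and marks are invented for the example, and the function names are hypothetical labels rather than part of any official scoring software.

```python
def mean(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


def trimmed_mean(values):
    """Mean after discarding one highest and one lowest value."""
    if len(values) < 3:
        raise ValueError("need at least three values to trim")
    return mean(sorted(values)[1:-1])


# Hypothetical ordinals (ranks) and summed marks given to one skater by five judges.
ordinals = [2, 3, 2, 7, 1]
marks = [11.2, 11.0, 11.3, 10.4, 11.5]

borda = mean(ordinals)                  # Borda count, expressed as an average of ranks
trimmed_borda = trimmed_mean(ordinals)  # drops the 7 and the 1 before averaging
average_mark = mean(marks)              # average of the raw marks
trimmed_mark = trimmed_mean(marks)      # drops 11.5 and 10.4 before averaging

print(borda, round(trimmed_borda, 2), round(average_mark, 2), round(trimmed_mark, 2))
```

In this made-up case the one outlying judge (ordinal 7, mark 10.4) moves the untrimmed figures noticeably and the trimmed ones hardly at all, which is the whole purpose of trimming.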

 

Now in most cases there may be very little difference among these approaches in the final rankings they produce for a set of skaters at any event; but that's not the point.  The point is that if one system achieves more agreement among the judges than the others do, that is the system that we should be using, if we are at all serious about this business of judging.  Indeed, and particularly after the intense discussion occurring when the International Skating Union switched from BOM to OBO a few years ago, we have every right to ask that proponents of majority systems demonstrate that they are in fact better at achieving agreement among judges than any other system.

 

Of course, nobody has offered any such demonstrations.  And that became the more interesting question that the data of the 2000 men’s competition raised for me.  Let me confess, however, that I instantly rejected Borda counts as a working alternative, for reasons of the system’s known weaknesses, as well as many others that became apparent as I worked my way through the data.  For all practical purposes, I became concerned here only with the distinctions between equality-based and majority-based scoring systems.

 

Let me now describe what I did and, more importantly, why I did it, in analyzing these numbers: taking us into the arcana of inferential statistics.

 

(By way of introduction, you should understand that there are actually two kinds of statistics, called “descriptive” and “inferential”.  Descriptive statistics are fairly easy:  they are merely numbers that summarize a set of numbers – for example, a baseball player’s batting average, or the incidence of HIV in a population.  Inferential statistics are tougher: these procedures relate statistics to the laws of probability - the dreaded “bell-shaped curve” - in order to allow us to make predictions from them.  For example, while a baseball player’s batting average tells us little about how well others may do, correlating batting averages with such factors as home-vs-away, distance to the right field wall, or handedness of the pitchers, will allow far more accurate predictions.  Similarly, the incidence of HIV in any given location will tell us little about its incidence in other locations, until we correlate it with factors such as poverty, drug use, and sexual practices.  Inferential statistics can do this, when they are successful, because they tap into the underlying principles that affect or control the numbers in question; and it is these underlying principles, rather than any specific predictions, which were of concern to me.)

 

 

2. Measures and methods

 

The statistic that directly measures the degree of agreement between two sets of numbers - say, the marks given to the same skaters by two judges, or the marks given by one judge and some other measure, such as final standing - is called the "coefficient of correlation," or, more simply, "correlation".  It is currently one of the most commonly used statistics in health, social sciences, biology, and most other disciplines.  It’s a number that ranges from +1.00 to -1.00: the closer it falls to +1.00, the higher the agreement between the two sets, while values near -1.00 mean that one set orders the skaters in nearly the reverse of the other.  "Percent agreement" - the statistic used throughout these analyses - is simply the square of the correlation, expressed as a percentage. (There are reasons for squaring this number, but they are relatively arcane, and do not make any difference here.)
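For readers who like to see the arithmetic, here is a minimal sketch of the correlation and of "percent agreement" as its square. It uses the ordinary Pearson formula; the two judges' marks are hypothetical, and the function names are labels chosen for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, at least two items each")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def percent_agreement(xs, ys):
    """Square of the correlation, expressed as a percentage."""
    r = pearson(xs, ys)
    return 100 * r * r


# Hypothetical marks given to the same five skaters by two judges.
judge_a = [5.8, 5.6, 5.4, 5.1, 4.9]
judge_b = [5.7, 5.7, 5.2, 5.3, 4.8]

print(round(pearson(judge_a, judge_b), 3))
print(round(percent_agreement(judge_a, judge_b), 1))
```

Note that squaring discards the sign, so a correlation of -0.9 and one of +0.9 would both come out as 81%. With judges marking the same skaters, negative correlations are not a practical concern.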

 

However, correlations apply only to two sets of numbers, and in the case of these skating data I had nine judges to deal with in order to get an overall index of agreement for any measure.  How was I to handle this?

 

Two methods presented themselves:  one was to correlate the judgments of each judge with all eight others - resulting in a total of 36 correlations among the nine judges - and average those.  The other was to correlate each judge's marks or ranks with the average of all eight other judges, and then average those nine correlations to get the single index.

 

I chose the second method, for a variety of reasons, two of which seemed fairly important.  Consider, for example, the case of a single judge who is far, far out of line with all others.  Under the first system, his aberrant scores will pull eight of the thirty-six correlations - almost one quarter - down.  Under the second, it is only one of nine correlations that will be thrown off.  The second reason was that as long as the same method was being applied in all cases, it didn't really make much difference which I chose, and the average-of-eight-others method was by far the cleaner.
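The two candidate methods can be sketched side by side as follows. The panel of marks is hypothetical (four judges rather than nine, to keep it short), and the sketch assumes that the correlations are averaged first and the average then squared; squaring each correlation before averaging would be an equally defensible reading.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def pairwise_agreement(panel):
    """Method one: average the correlations of every pair of judges."""
    rs = [pearson(a, b) for a, b in combinations(panel, 2)]
    return 100 * (sum(rs) / len(rs)) ** 2


def leave_one_out_agreement(panel):
    """Method two: correlate each judge with the average of all the others."""
    rs = []
    for i, judge in enumerate(panel):
        others = [row for j, row in enumerate(panel) if j != i]
        others_mean = [sum(col) / len(col) for col in zip(*others)]
        rs.append(pearson(judge, others_mean))
    return 100 * (sum(rs) / len(rs)) ** 2


# Hypothetical summed marks: one row per judge, one column per skater.
panel = [
    [11.4, 11.0, 10.6, 10.9, 9.8],
    [11.5, 10.8, 10.7, 10.6, 9.9],
    [11.2, 11.1, 10.4, 10.8, 10.0],
    [11.3, 10.9, 10.9, 10.5, 9.6],
]

print(round(pairwise_agreement(panel), 1))
print(round(leave_one_out_agreement(panel), 1))
```

With nine judges the first function would average 36 correlations and the second only nine, which is the bookkeeping behind the "eight of thirty-six" versus "one of nine" comparison above.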

 

It was also apparent at the start of this investigation that I would not get very far in my search for what went wrong in the judging of the men's competition if I did not have some other, more reasonable, set of numbers to compare these data to.  For this, the Ladies’ long program of the 2000 World's seemed the most obvious choice:  although a few extreme rankings could be observed here, in general this appeared a reasonably uncomplicated, "normal" event.  The results for this competition are therefore included throughout. 

 

3.  Preliminary findings

 

But what can such statistics tell us?  For starters, let's look at the results for the presentation and technique scores, and their sum, for the two sets of skaters.  Here, and throughout, the numbers presented were found by first calculating the correlation of the judgments of the entire set of skaters made by any one judge with some other measure (such as the average judgments of all eight other judges, or final standing), and then averaging the correlations describing all nine judges, to get one single “percentage agreement” number to describe the entire set.

 

 

 

Percent agreement

Men's     Ladies    Source 

 

87.6%     92.7%     Presentation

92.7%     95.1%     Technique

92.5%     93.5%     Sum

 

 

There are several things we can learn from even these, very preliminary numbers.

 

First, and not too surprisingly, they are all extraordinarily high. Correlations of this magnitude are found only in the rarest of cases, where - as in this case - the judges or raters have had extensive experience in attending to common cues and reaching shared decisions.  By comparison, correlations indicating agreement levels of 10% or less are often sufficient to achieve statistical significance and are then used to underpin legislation, determine the direction of political and advertising campaigns, and the like.

 

Second, agreement is consistently higher among the judges of the ladies' event than among those of the men's.  Of course, we suspected that anyway, given the number of widely discrepant judgments in the men's event.  But looked at in this context, the numbers also suggest a simple explanation for some of the difficulties there: the men's competition may simply have been harder to judge, in the sense that there were more skaters of nearly equal ability and talent in that event than in the ladies'.

 

Third, technique is more reliably judged, in both cases, than presentation.  We could have guessed that too, just by considering our own abilities and discussions in newsgroups and the media. But in this case the difference is far more pronounced in the men’s event than in the ladies’, and thus more clearly locates the source of the problems found there.

 

None of these conclusions are particularly surprising.  But they are quite gratifying in suggesting that the percentage agreement statistic is a realistic one, which can provide us with useful summaries - particularly, in this case, in highlighting the differences between the men's and ladies' judges - of judges' behavior.

 

But a fourth conclusion from these numbers may be a little more interesting: namely the fact that agreement on technique is higher, in both groups, than agreement on the summed scores.  There are several things to be noted about that.

 

For example, if one were to be very zealous in insisting that agreement among judges is the only thing that matters, it might seem that we could improve things simply by not trying to judge presentation.  But of course we can't do that, since there is general agreement that both elements are essential for a skating performance.  Only later, after both elements have been summed and become subject to a variety of actuarial manipulations, can we argue that if two procedures lead to differing amounts of agreement, the one that does better should be accepted.  But at this level, that argument cannot be made.

 

More importantly, the decline in agreement from technique to summed scores suggests - but only suggests - that one aspect of what I (and most others) assume to be the judges' thought processes may not work as well as we thought.

 

The reasoning goes something like this: judges of varying backgrounds may value either technique or artistry more highly than the other (as Russian dance judges, for example, were said to value expression while the classical British tradition favored technique).  Assigning marks to both these, and simply summing them - which has the effect of weighting them equally - could provide an opportunity for these summed scores to cancel out such differences, theoretically allowing agreement on summed scores to be higher than agreement on either element. But this has not happened here.  

 

But this is a minor issue in the scoring procedure, since summing the raw presentation and technique scores is only the third step in the judges' marking processes.  They are also required to translate these raw scores into ordinal ranks, and to adjust for ties by assigning a higher value to the presentation marks than the technical.  Naturally, we would like to know what the effect of these transformations is.

 

4.  Ranking and tie-breaking

 

This ordinal rank, the argument goes, is what the judge actually has in mind when she assigns marks to a skating performance, using her rating of technique and presentation merely as numbers which, when transformed, will lead to the desired rank.  (And which also serve as convenient reminders of the performance of a prior skater.)

 

But as I said, this move from raw sums to ordinals involves two different activities: one is tie-breaking, and the second is ranking.  The first of these is a non-arithmetic transformation and cannot be modeled by any arithmetic system, but the second can easily be duplicated.  That is, we can simply translate the raw scores into ranks, leaving ties as they occur, and compare agreement values under these conditions to the agreement values based on the ordinals.  This will give us an indication of the effect of ranking alone, as compared to the joint effect of ranking and tie-breaking, so that subtracting the first of these from the second will allow us to estimate the effect of tie-breaking in isolation.
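Translating raw scores into ranks while "leaving ties as they occur" can be sketched as below. The sketch assumes the common convention that tied scores share the average of the ranks they would jointly occupy; the text does not say which convention was used, so treat that choice, and the function name, as assumptions.

```python
def rank_with_ties(scores):
    """Rank scores from best (1) to worst; tied scores share an average rank."""
    # Indices of the scores, ordered from highest score to lowest.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # Extend 'end' over the run of scores equal to the one at 'pos'.
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


# Hypothetical summed marks from one judge for four skaters; two are tied at 11.0.
print(rank_with_ties([11.0, 10.7, 11.0, 9.9]))
```

The two skaters tied at 11.0 each receive rank 1.5, where a tie-breaker would force one of them into first place and the other into second. Agreement computed on ranks of this kind is what the "ranked marks" figures are meant to capture.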

 

Let's look at those numbers.

 

Men's  Ladies  Source

 

92.7%     95.6%     Ordinals

92.9%     95.6%     Ranked marks

 

And what do these tell us?  Well for starters, they're all at least as high as any of the numbers in the preceding set, and higher than the corresponding summed scores, indicating that ranking can improve agreement.  But unfortunately, when we subtract the second number from the first, we get a negative difference in the men’s event and no difference in the ladies’, meaning very simply that the tie-breaking procedures currently used either don't do us any good, or actually do us bad:  that is, they decrease agreement among judges.

 

(When this became apparent to me, I began looking at the issue of tie-breakers more generally, and after a while that sort of got out of hand.  My conclusions on that topic then formed the basis for a separate paper, since most of them are not directly relevant to the issues of concern here.  In that other paper, I argue that tie-breaking procedures are stupid, irrelevant, ineffectual, biased and counter-productive; with the numbers given here used to shore up the "ineffectual" part of the argument.)

 

But what, generally, does ranking do to any set of real numbers?  Well, it sort of depends on the numbers themselves.  If they are fairly evenly spaced along their line, ranking will smooth out minor differences and lead to slightly higher agreement statistics.  Of course, these don't testify to more real agreement - the real numbers don't change, but they look neater.  If, on the other hand, the raw numbers are bunched on their scale, several crowding close together while leaving large gaps in other places, ranking can artificially create differences where none, or few, exist.

 

We can see the first alternative at work in the results for the ladies' competition, where ranking alone creates a jump in agreement of about two percentage points (from 93.5% to 95.6%).   By all indications, this was a straightforward competition where judges may have had some disagreements, but no major differences.

 

But the numbers look far different for the men's competition, edging up less than one percentage point as a result of the ranking procedure (from 92.5% to 92.9%), and falling back half that way again (to 92.7%) when tie-breaking is added. 

 

Let me try to illustrate this by use of some admittedly extreme cases.  You'll remember that in the judging of the men's event Vincent Restancourt was ranked fourth by one judge, fifteenth by another.  This certainly appears to indicate clear disagreement, about as bad as it can get.  But just looking at the summed scores for these two judgments shows an entirely different picture: the one judge had scored him at 11.0, the other at 10.7.  Similarly, two judges placed Anthony Liu at eighth and seventeenth place, a nine-place spread.  But their raw marks place him at 10.9 and 9.7 - again a much smaller difference.
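A toy calculation shows how a bunched field can turn a gap of 0.3 in the summed mark into a gap of eleven places. The marks for the other nineteen skaters are invented, and for simplicity a single set of marks stands in for both judges; only the 11.0 and 10.7 come from the case just described.

```python
def placement(score, other_scores):
    """Place implied by a summed mark: one more than the number of higher marks."""
    return 1 + sum(1 for s in other_scores if s > score)


# Hypothetical summed marks for the 19 other skaters: three clear leaders,
# eleven skaters crowded between 10.8 and 10.9, and five trailing the field.
others = [11.6, 11.4, 11.2] + [10.9] * 5 + [10.8] * 6 + [10.3, 10.1, 9.8, 9.6, 9.2]

print(placement(11.0, others))  # place implied by the higher mark
print(placement(10.7, others))  # place implied by the lower mark
```

With these invented numbers the mark of 11.0 lands in fourth place and the mark of 10.7 in fifteenth, simply because eleven skaters sit in the narrow band between them.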

 

So what accounts for the discrepancy in these numbers that first started this investigation?  Mostly, it seems, a very confusing scoring system which led to judges providing rankings that they probably never intended; and partly, a tough competition, where the bulk of the skaters, in the middle range, were of virtually equal ability.

 

But we're getting ahead of our story here, because our investigations have only taken us as far as the judges' rankings.  There's still the OBO transformation, which combines the scores of all judges (or a shifting majority of them) into a single final standing, to be considered.

 

5.  OBO and direct ranking

 

In going from raw summed scores to final placements, one-by-one or BOM systems twice invoke ranking and tie-breaking procedures: first at the level of individual judges to determine ordinals, and then at the level of the judging panel as a whole to translate these ordinals into majority judgments, which are again ranked to determine the final placements.   How can we duplicate these procedures without using non-arithmetic transformations such as tie-breakers and majority judgments?

 

Just as we used simple ranks to compare to ordinals in the last set of analyses, we can use the averages of these ranks as a straightforward comparison to the OBO standing.  This leads to our final set of numbers here - which now compare each judge with an outside criterion (OBO standing or average of the nine ranks) rather than compare each judge with the average of the eight others.

 

Men's    Ladies    Source

 

93.7%    96.4%     OBO final standing

95.5%    96.4%     Mean scores

 

Very briefly, for the ladies’ event, both OBO and averaged scores produce the same amount of agreement; for the men’s event, OBO does markedly worse.
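The comparison against an outside criterion can be sketched in the same way as the earlier within-panel index. Here the criterion is the average of the judges' ranks; the ranks are hypothetical (three judges, five skaters), and the sketch again assumes that correlations are averaged before squaring.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def agreement_with_criterion(panel, criterion):
    """Average each judge's correlation with one shared criterion, then square."""
    rs = [pearson(judge, criterion) for judge in panel]
    return 100 * (sum(rs) / len(rs)) ** 2


# Hypothetical ranks: one row per judge, one column per skater.
panel = [
    [1, 2, 3, 4, 5],
    [2, 1, 3, 5, 4],
    [1, 3, 2, 4, 5],
]

# Criterion: the average rank each skater received across the whole panel.
# Unlike the leave-one-out index, it includes the judge's own ranks.
mean_ranks = [sum(col) / len(col) for col in zip(*panel)]

print(round(agreement_with_criterion(panel, mean_ranks), 1))
```

To compare against a majority-based final standing instead, one would pass that standing (as a list of final places, one per skater) in place of `mean_ranks`.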

 

But let’s put these numbers in context.  Remember, all these judges were trained and tested and re-trained and examined under a majority-agreement system; all of them were taught that their major task was that of ranking skaters, not of marking them on an absolute scale.  But despite all that training and practice, it turns out that the majority system does no better than an equality-based approach under normal circumstances, and does markedly worse (!) when the task becomes more difficult.

 

But this is clearly heresy.  And even I wasn’t about to believe it at first blush.

 

 

6.  Replication

 

When a researcher is faced with data and findings that contradict what everyone knows to be true, the first thing to do - if he is serious - is to question his data.  And the first question to ask is that of their representativeness: could the same results occur for another data set, or were these, for whatever reason, unique to the 2000 World's Championships?  So I scoured my files for more data, noting as I did so that the complete set of the 1994 World's and Olympics results, which I had used to analyze home-country bias some years ago, was useless here.  At that time, I believed, as we all did, that it was in fact only the judges' ordinals, and not the raw marks, that properly reflected their judgments, so I confined myself to that set of numbers.

 

Fortunately, however, I did manage to turn up a complete set of data for the long program of the Pairs competition at the 1997 World's Championships, which Deb England had sent me apropos of some conversation at that time, and which seemed to meet my needs perfectly.  Like those I'd been using, it was a World Championship event, but represented a different discipline and was scored according to a slightly different system, the BOM, rather than the OBO procedures in place now.

 

So I redid most of the calculations above - except those for the technical and performance scores - using this new data set.  But even before I did so, some of the weirdnesses of majority-based systems became quite apparent.  For example, first place at this event was won by Woetzel & Steuer of Germany: but they were actually ranked second by a majority of the judges (six, in fact).  Eltsova & Bushkov of Russia came in second, with four first place marks, compared to the three garnered by Woetzel & Steuer.

 

Now I know you understand how this happened, but try to put yourself in the place of somebody who doesn't fully understand how these systems work, and explain to him that, yes, in figure skating the final standings are determined by the majority of judges, and yes, Woetzel and Steuer were ranked second by the majority of the judges, and yes, they finished first.

 

For the rest though, the judging seemed fairly straightforward, so that it was not until you got down to Berezhnaya & Sikharulidze, in twelfth place, that real disagreement occurred, with one judge placing them tenth, another eighteenth.

 

Re-running the same analyses as on the 2000 data here, then, produced essentially the same results.  The within-judges (each judge correlated with the average of the other eight) correlation for ranked sum scores yielded an agreement percentage of 95.3% – slightly higher than the 93.5% found for the 2000 Ladies’ event.  And again, it slipped (by almost two percentage points, to 93.7%) when ordinals were used.

 

Similarly, the correlation of the ordinals with the BOM final standing was 94.5%, almost a full percentage point lower than that of the mean mark with the individual marks (95.3%).  It seems, then, that if there was anything atypical about the 2000 World’s, it was the relatively good performance of the OBO data in the ladies’ event.

 

So it appears that these findings are fairly robust, as statisticians like to say.  So robust, in fact, that the burden of proof is now on those who wish to defend the existing systems.

 

 

7.  But does it really matter?

 

I think I've shown, in the preceding discussion, that majority-based scoring systems produce less agreement among judges than do means-based systems.  But how much difference does our choice of scoring system make, anyway?  That is, given the high degree of agreement among judges that we have noted throughout, how different can the final standings become if one system is used rather than another?

 

These questions are easily answered with the data at hand.  Thus, if the three competitions here considered had been decided solely on the basis of the total score achieved by each skater, summed or averaged across the nine judges, the following changes over OBO or BOM standings would have resulted:

 

In the Men's Competition, Michael Weiss would have nosed out Elvis Stojko for the Silver (11.29 to 11.28); Stanick Jeannette and Zhengxin Guo would have reversed their positions at eighth and ninth places, as would Vitali Danilchenko and Alexander Abt, in eleventh and twelfth place; as well as Ivan Dinev and Sergei Rylov in twentieth and twenty-first positions.

 

In the Ladies' Competition, seventh placed Viktoria Volchkova would have dropped to ninth, with Mikkeline Kierkegaard and Julia Sebestyen moving up a place each to seventh and eighth, respectively. Zoya Douchine and Tatiana Malinina would have switched places at eighteenth and nineteenth.

 

In the 1997 Pairs' Competition, Woetzel & Steuer's first place finish would have been straightforward.   And Berezhnaya & Sikharulidze and Savard-Gagnon & Bradet would have reversed their positions at twelfth and fourteenth respectively.

 

I doubt that anyone could mount a successful argument for any of these patterns in preference to the other, on the basis of the performances actually given at these events.  But it is apparent that the fundamental approach used here - seeking the most agreement among all judges, rather than absolute agreement among the most judges - does not lead to conclusions far different from those that would have occurred under the majority-based systems.
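The re-scoring described in this section amounts to averaging each skater's summed marks across the panel, sorting, and listing who changes place relative to the official result. A minimal sketch follows, using invented skaters, marks and an invented majority-based standing; none of the figures come from the actual events.

```python
def standings_by_mean(marks_by_skater):
    """Order skaters by their mean summed mark, highest first."""
    means = {name: sum(m) / len(m) for name, m in marks_by_skater.items()}
    order = sorted(means, key=lambda name: means[name], reverse=True)
    return order, means


def changed_places(new_order, old_order):
    """List (skater, old place, new place) for every skater whose place differs."""
    old_place = {name: i + 1 for i, name in enumerate(old_order)}
    return [
        (name, old_place[name], i + 1)
        for i, name in enumerate(new_order)
        if old_place[name] != i + 1
    ]


# Hypothetical summed marks from three judges for three skaters.
marks = {
    "Skater A": [11.4, 11.2, 11.3],
    "Skater B": [11.3, 11.4, 11.1],
    "Skater C": [10.9, 11.0, 10.8],
}
# Hypothetical official standing produced by a majority-based system.
majority_standing = ["Skater B", "Skater A", "Skater C"]

order, means = standings_by_mean(marks)
print(order)
print(changed_places(order, majority_standing))
```

A real analysis would also need a rule for skaters whose means are exactly equal; this sketch simply leaves them in input order.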

 

8.  More absurdities of majority-based systems

 

The situation of awarding first place to competitors ranked in second position by a majority of judges is only one of the logical self-contradictions that arise with the use of majority-based scoring systems.  Two other fairly glaring problems can also be demonstrated quite easily.  These are the lack of pairwise consistency of the systems, and their selective disenfranchisement of judges.

 

Pairwise consistency means, in a simple example, that if three skaters - call them A, B, and C - finish in first, second, and third place as determined by BOM or OBO scoring, then A should be rated higher than B by majorities of judges, and B rated higher than C.  This seems straightforward.  Unfortunately, Edmund Russell has shown (“Amateur figure skating: is the ranking system out of date?” 1995 Proceedings of the American Statistical Association) that it is possible for these skaters to have been rated in such a manner that a majority of judges rate A over B, B over C -- and C over A!

 

This is, of course, wholly absurd; and, more trenchantly, the decision to declare one of these a "winner" on the basis of nothing more than an artifact of the scoring procedure seems fundamentally unjust.  (It is also, by the way, only a formal version of the fairly common problem of skaters’ relative rankings changing following the performances of subsequent skaters – the issue that OBO was supposed to fix.)
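The cycle is easy to reproduce with the smallest possible panel. The three judges and their ordinals below are the classic textbook construction, not necessarily the example Russell used.

```python
# Ordinals from three hypothetical judges (1 = best) for skaters A, B and C.
ordinals = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 3, "B": 1, "C": 2},
    {"A": 2, "B": 3, "C": 1},
]


def majority_prefers(ordinals, x, y):
    """True if more than half of the judges place skater x ahead of skater y."""
    votes = sum(1 for judge in ordinals if judge[x] < judge[y])
    return votes > len(ordinals) / 2


for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(x, "over", y, majority_prefers(ordinals, x, y))
```

Each of the three comparisons is carried by two judges to one, so no ordering of A, B and C can respect all three majorities at once.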

 

Selective disenfranchisement of judges is also a common issue, but one which has not, as far as I know, ever been seriously examined.  It is, of course, obvious that disenfranchisement occurs – when it is the majority that decides, those who are not in the majority have no say.  But the full ramifications of this fact may not be immediately apparent.

 

For example, if majority rule was instituted only to guard against minor errors – momentary inattention, say, or home-country bias – one would expect each of the nine judges on the panel to fall into the disenfranchised minority approximately the same number of times at any event.  But this doesn’t seem to happen.  At the 1997 Pairs, for example, one judge was rejected from the majority only twice in her 23 judgments, while two others were rejected eight and nine times each.  Similarly, at the 2000 Men’s (scored by OBO, rather than BOM, but with the same effects), two judges were rejected nine times each, and a third eight times; and at the Ladies’ competition, two judges were rejected eight times each.  And what does this tell us?

 

Quite simply, that to the extent that these three competitions are representative of international competitions generally, our scoring systems indicate that more than one quarter of our judges are so bad at their jobs that we cannot trust more than two out of every three judgments they make!

 

Does anyone believe that?  I certainly don’t, since all the data I’ve seen and cited here indicate that they appear amazingly good at what they do.  I doubt the ISU believes it, because it would give them serious reason to reconsider not only their training programs, but also the whole idea of using judges.  It is, in fact, patently absurd.  But it is undeniably what our scoring systems are telling us, and what they are doing to the judges.
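Counts of this kind can be tallied mechanically once "rejected" is pinned down. The sketch below uses one possible reading, which is an assumption: a skater's majority place is the best place that more than half the panel gave or bettered, and a judge is rejected for that skater when her ordinal is worse than that place. The panel of ordinals is hypothetical, apart from the three firsts and six seconds reported above for Woetzel & Steuer.

```python
def majority_place(ordinals):
    """Best place p such that more than half the judges ranked the skater p or better."""
    need = len(ordinals) // 2 + 1
    for place in range(1, max(ordinals) + 1):
        if sum(1 for o in ordinals if o <= place) >= need:
            return place


def rejection_counts(panel):
    """For each judge, count the skaters on whom she fell outside the majority."""
    n_judges = len(next(iter(panel.values())))
    counts = [0] * n_judges
    for ordinals in panel.values():
        place = majority_place(ordinals)
        for judge, ordinal in enumerate(ordinals):
            if ordinal > place:  # worse than the place the majority agreed on
                counts[judge] += 1
    return counts


# Hypothetical ordinals from five judges (one list entry per judge) for three skaters.
panel = {
    "Skater X": [1, 1, 2, 1, 3],
    "Skater Y": [2, 3, 1, 2, 1],
    "Skater Z": [3, 2, 3, 3, 2],
}

print(rejection_counts(panel))
# Three firsts and six seconds, as reported for Woetzel & Steuer at the 1997 Pairs.
print(majority_place([1, 1, 1, 2, 2, 2, 2, 2, 2]))
```

Under this reading a skater with three firsts and six seconds has a majority place of second, matching the description of that result given earlier. Other definitions of rejection, particularly under OBO, would give different tallies.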

 

 

9.  A summary

 

Let me pull together now all the material that has been documented above, in the simplest way I can.

 

Briefly, beginning with the assumption that interjudge agreement is the only meaningful criterion for deciding between scoring systems in figure skating competitions, I’ve shown that majority-based systems such as BOM and OBO yield less agreement than do simple means.

 

I’ve indicated that in actual practice, using simple mean scores produces results that are practically indistinguishable in their final effects from those produced by majority-based systems. If there was a meaningful distinction to be made, it seemed to favor the means-based system.

 

It’s as if, where other sports use standardized rulers, figure skating uses an elastic one.  Under majority systems, it will measure eleven inches for some skaters, up to thirteen for others.  With means-based systems, it measures between eleven-and-one-half and twelve-and-one-half inches.  While this may not seem like a very large difference, there does not seem to be any excuse for not using the best system we have.

 

 

And I’ve shown repeatedly that BOM and OBO lead inevitably to absurd situations (ranging from the assignment of ranks different from the majority rating in a supposedly majority-based system, to the unnecessary disenfranchisement of more than a quarter of the judges).

 

And I have barely mentioned, so far, any of the other benefits of means-based scoring.  These include:

 

Means-based scores are easier to understand;

 

Means-based scores are comparable from event to event;

 

Means-based scores never lead to reversals of position during a competition;

 

Means-based scores involve no distortions of judges’ decisions;

 

Means-based scores treat all judges equally, as they should; and

 

Means-based scores do not favor performance over technique.

 

(Does this last need explanation?  Very simply, the tiebreaker in the long program is the performance score, while in the short program it is the technical score.  Since the long program is weighted about twice as heavily as the short in most competitions, when tie-breakers are used, they tend on average to favor performance.)

 

Given all these advantages, it seems fair to ask why we continue to use scoring systems that deprive us of their benefits.  Rationally, the only answer that seems available is that means-based systems also have disadvantages that majority-based systems do not, or that majority-based systems have advantages that means-based systems do not, and that – for one reason or another – the balance favors majority-based systems.  Less rationally, other explanations – habit or venality, for example – may also play a role.

 

All these issues, as well as those of the effects of abandoning our majority-based systems, need further exploration.  But that’s primarily a matter of weighing different arguments and suppositions, alternatives and possibilities:  empirical data, drawn from actual competitions, cannot help us make any of these decisions.

 

The material I have presented so far, on the other hand, has been fundamentally empirical in nature:  real scores taken from real events.  I’d prefer to leave it that way for now.  An attempt to deal with the hypothetical issues follows in what is, essentially, a sequel to this paper, under the title of “What can we do about cheating judges?” 

 

 

 

Entire contents of this site copyright by Dirk L. Schaeffer