
 

 

What can we do about dishonest judges?

 

Dirk L. Schaeffer

Vancouver, B.C., Canada

 

 

1. Introduction

 

 

In an earlier paper (“Why do we have judges at figure skating events?”) I posed several apparently simple-minded questions (why do we have judges? what happens to the minority when majorities rule?) and found they led to some unexpected conclusions. The most notable of these was that, in many ways, the simple averaging of raw scores is a far better method of evaluating skating performances than the majority-of-judges systems now being used.  I will not repeat the list here.  Rather, I want to look now at the obvious questions this analysis raises:  why do we keep using majority-based systems, and what would happen if we replaced them with simple raw scores?

 

But let me start with a little bit of history, specifically that of majority-based scoring in figure skating.  I wish I could tell you how our present system, characterized by multiple judges and rank placements for each judge, as well as majority rules, got started, but I haven’t been able to find any details on the actual history.  That doesn’t matter much, however, since it is easy to create reasonable scenarios, and the bottom line – during most of the last century – was that it worked.  Both the ability to reach agreement among judges of often very differing minds, and the relative ease of scoring would have made it appear as a quite suitable, if not ideal, method for determining competitions.  (Although, of course, it had its problems, too:  resulting in the banning of several judges and, in one case, the whole Russian judging panel on grounds of inappropriate marking.)

 

What seems most important now, however, is that two major changes in the sport have occurred since then.  First, computers have made all forms of scoring easy and instantaneous.  And second, the stakes – limited for most of the last century to whatever pride home countries or coaches or competitors could take in winning – have become much higher.  Which raises the question of cheating.

 

Before I turn to that, however, I want to clear away one possible source of confusion: the question of trimmed means.

 

2. Trimmed means  

 

Inevitably, when we talk about the scoring of figure skating events, we come up against suspicions of bias on the part of the judges.  And one of the first fears we have is that of giving any single judge, who may cast an extreme and unreasonable vote, too much power.  Whatever other reasons may account for their development, majority-based systems were clearly designed to inhibit that power; and most of us seem to feel that any alternative system should also contain some checks against it.  Hence, even when we talk of means-based alternatives, we tend to think of trimmed means, rather than the average of the full panel.  Indeed, I’ve more or less done the same myself, as recently as six months ago.

 

But it turns out that trimmed means are not a useful alternative, for several reasons.  A small one is that agreement percentages for trimmed mean data are as poor as they are for ordinals.  Another small one is that trimmed means deprive us of the opportunity of monitoring individual judges.  (This is not immediately obvious, but the proof is too long to go into here, since the issue is fairly minor anyway.)  The big reason is the dreaded “mean shift”. 

 

 

The mean shift is a phenomenon that can occur when one judge uses a marking scale that is measurably higher or lower than that of the other judges.  This can lead to a situation where all judges agree in the ranking of a group of skaters, but that ranking gets overturned or jumbled when trimmed means are used.  Edmund Russell laid this out very neatly in his 1995 paper “Amateur figure skating: is the scoring system outmoded?” and most of us have encountered his, or similar, demonstrations since then.

 

It’s easy to over-estimate the significance of this effect, since the conditions under which it can arise at all are quite rare.  But this doesn’t seem to matter as much as the fact that, rare or not, it CAN occur, and can lead to some extraordinarily distorted standings when it does – as distorted, in fact, as those that can occur with majority-based systems.  And since I have argued that the deficiencies of the latter are sufficient to require alternatives, I cannot endorse trimmed means as a serious contender for that role.

 

Thus, when I discuss means-based systems in this paper, I shall be referring just to that, and not any trimmed or otherwise diddled-with alternative.

 

3. Majority-based systems and bias

 

 

If majority-based systems are not as good at achieving interjudge agreement as are means-based systems, if they are prone to giving absurd conclusions, and if they are sufficiently complex to alienate large numbers of potential fans, what justification is there for our continued use of them?

 

One answer that is often proposed is simple habit.  Our judges have all been trained under these systems, it is argued, and it would be virtually impossible to retrain them all.  But that argument is totally nullified by the data presented in my earlier paper:  judges trained to use ordinal and majority-based systems actually do better when their decisions are re-scaled as means-based judgments.  Which is to say, they don’t need retraining:  the existing system was too complex for them to learn in the first place.

 

Another answer – as I implied earlier – is that they appear to guard us against bias, dishonesty, and cheating.  (And that, by the way, is the only other justification for their continued use that I have been able to find.)  But let’s look at this more closely.

 

First and most fundamentally, we can ask the question of whether this is something that scoring or judging systems should be doing in the first – or any other – place.  We do not, after all, expect time-clocks, measuring tapes, or weigh scales to guard against the use of steroids; and judges in figure skating perform the same role as those instruments in the more objectively measured sports.  All we ask of the instruments is that they be reliable, consistent, and accurate – which is, I think, all we should ask of judges, and which is what the rigorous training schedules we subject them to attempt to ensure.  (And do, quite astoundingly:  the measures of interjudge agreement presented in my earlier paper, and in several of Ed Russell’s papers for the American Statistical Association, indicate that figure skating judges are just about as reliable and consistent as it is humanly possible to be.)

Of course, asking our clocks and tape measures to be accurate and consistent also implies that they are free of bias, which is more likely in human judges. But when we suspect that they are not, we send them out for repair, rather than try to devise some kind of majority system of multiple clocks to compensate for that suspicion.

 

But what are we actually dealing with when we talk about “bias”?  My best guess is that it comes in about five different forms.

 

First, there is simple inattention or distraction:  for whatever reason, some judges may miss some aspects of a skater’s performance.  While this problem is real, the solution is simple and already in place:  use more than one judge.

 

Second, there is the question of differing judgmental abilities:  some judges may simply be better at what they do than others.  This problem is less real (ultimately, we have no way of telling who is “really” a better judge since we have no way of telling what is “really” a better performance), but the solution is again in place: rigorous training procedures.

 

Third, there is what might be called “principled” bias: the tendency of some judges to prefer certain types of skating or certain approaches to the discipline; the way Russian dance judges are said to prefer “expression,” while British judges prefer “technique”.  Some years ago, for example, I was able to show that it is possible to identify judges who seemed to prefer “artistic” skaters, on the one hand, or “technical” skaters, on the other.  While these preferences would color their judgment to a measurable extent, I concluded that the effect was far too small to be worth worrying about.  Beyond that, this sort of “bias” is, to some extent, deliberately – and reasonably – built into our scoring systems, which evaluate competitors separately in each of these areas.  All of this seems reasonable, principled, and fair: as a form of bias, this is wholly benign.

 

Fourth, there is “home country” bias: the tendency, actively encouraged by many national skating organizations, of judges to rate their own country’s competitors more highly than judges of other countries do.  Again, I examined this issue statistically some years ago, with somewhat mixed findings.  Briefly, some 75% (as compared to the 50% to be expected if there were no bias) of the judges I examined – using all data, including qualifying, short, and long programs for the men’s, ladies’, and pairs’ events, and all four dance events, at the 1994 Olympics and World’s competitions – showed some evidence of this form of bias.  I considered this less significant than the fact that a full 25% did not show such bias.  Virtually all of this bias was at levels that were either statistically or substantively – or both – insignificant.

 

One instance, however, had serious repercussions.  Here, a judge shown to be biased on purely statistical grounds cast the deciding vote giving Oksana Baiul the gold medal at the Olympics, and two deciding votes that kept Olympic gold medalist Alexei Urmanov out of third place in both the short and long programs at the World’s. (The offending judge, Alfred Korytek, has more recently been suspended for more blatant cheating.)

 

I have no idea of how common such extreme home-country bias is today, as compared to six years ago. The occasional data I’ve looked at – such as the Men’s and Ladies’ long programs of the 2000 World’s – suggest that it is in fact declining, rather than increasing as the stakes in figure skating competitions get higher.  If that is true, it is probably only because it is so obvious and easy to detect.  Korytek, who played the game skillfully in 1994 – including rating some (low-ranking) Ukrainian skaters far lower than the rest of the panel did, to make his over-all record look less biased – has moved on to less easily identified forms of chicanery, such as collusion.  These cases then move into the final category.

 

Before we get to that, however, I should note that our majority-based systems appear, indeed, to be effective at limiting the impact of this form of blatant home-country bias.  (But this success may be more apparent than real, as Korytek’s 1994 record demonstrates.)

 

The final form of bias, then, is one that might be called simply “venal”, or in common terms, outright cheating.  Trading favors, bribery, and any other form of advance rigging of the results all fall into this category, and the recent boasts by members of the French figure skating association testifying to their sophistication at “political” approaches to judging decisions would appear to be at least a borderline instance. 

 

I doubt that anything can be done about this at present. Certainly the present scoring systems do nothing to make it any less likely than any other system would.  Indeed, I am coming to the tentative conclusion that the devious scoring methods currently in place make it easier to disguise such chicanery than the more open and direct methods of raw score judgment – but that is, at this point, at best an educated guess on my part.

 

All in all, then, this consideration of the sources and identification of bias and other threats to the validity of judges’ ratings gives very little support to those who are content with the present systems.  They don’t seem to be particularly effective at stopping anything but the most blatant forms of home country bias, and all rational considerations indicate that more open scoring systems would at least allow us to identify bias or cheating when it does occur, better than do the current systems.

 

 

4. Judges who cheat

 

In the discussions of cheating judges that I have encountered so far, we have invariably talked about dishonest judges as if they existed in some kind of vacuum inhabited only by judges and performing skaters.  So let’s pose another one of those simple-minded questions to see where it will lead us:  why do judges cheat?

 

The typical answer is: to enhance the standings of their countries’ athletes, or something like that.  But think about that for a moment.  It’s clear, for example, that by themselves judges have no reason – other perhaps than patriotism – to cheat, and much to lose by doing so, if they get caught.  And that means that if they do cheat, it is because someone else – a skating association, an over-zealous coach, a gambler – wants them to and is able to bribe or coerce them into doing so.  My point is that judges, both honest and dishonest, are only a part of a very large web, involving many players, each with their own motives.

 

That large web then contains not just the judges, but also their FSAs, the referees of the events, the disciplinary committees, and the ISU; as well, perhaps, as any number of outside parties, such as the friends of Tonya Harding.  To some extent, all or most of these individuals or agencies are involved whenever cheating occurs, by making it either easier or more difficult for the judges to act in an improper manner.

 

It seems clear that at the present time, many aspects of this larger web not only do little to stop cheating but actually make it much easier.

 

The primary agent – if only because it ultimately has all the power – is, of course, the International Skating Union, an organization whose dedication to secrecy appears to be matched only by its devotion to avoiding anything that might look like blame.  Certainly its response to Jean Senft’s attempts, last season, to bring to light what appeared to her to be a potential instance of collusion among judges – by first ignoring, and then suspending her, rather than investigating the alleged colluders – is worthy only of the Spanish Inquisition, and sends a loud and clear message to all would-be cheaters that they have little to fear from their governing body.

 

I suspect that referees and the referee system are equally conducive to dishonest practices, but that may be only because I simply do not understand them.  The nub here is the matter of post-event reviews of the judges’ ratings, performed by the referees, acting – as near as I can tell – more or less on their own best judgment.  Here, judges’ marks for all (or perhaps only the top five or ten) skaters are reviewed, and judges are asked to justify any out-of-line marks they may have awarded.  In theory, this sounds like a reasonable system, but despite having discussed, or attempted to discuss, this with a number of judges, I have been unable to discover any principle or rule that would determine which judges or judgments are “out of line” and which are not.  At least in their training sessions judges are instructed as to the degree of agreement that is required of them in order to pass on to the next level of certification; once all training is past, however, not even that much exists in the way of criteria for acceptable performance.  Again, whether intentional or not, such vagueness sets a fertile field for chicanery to flourish in.

 

What all this leaves us with, then, is the awareness that if cheating exists in figure skating at all, judges are only the visible tip of a very large iceberg which may encourage or condone dishonest activities – by looking the other way, and avoiding publicity as far as this is possible.  To look for a method of judging that would fix this system is roughly the equivalent of looking for improved automobile speedometers as the solution to drunk driving.

 

All of this discussion has, however, been quite hypothetical:  none of us know more about bias or cheating than that it has occurred in the past in sufficiently high numbers to get the entire Russian panel of judges suspended for a year, and to lead to some suspensions this year.  Since the ISU neither publicizes such suspensions nor publishes any guidelines to indicate the conditions under which suspensions occur, it is impossible to tell how much more often suspensions occur or – more importantly – should occur but do not (as with the off-again-on-again waltz involving Senft’s accusations last season).

 

What seems very clear, however, is that judges, and the marks they give, are not the best way to attack this problem.  Indeed, virtually the only advantage they may seem to promise is that of nipping some bias in the bud:  that is, allowing a responsible majority to disenfranchise a biased judge prevents any damage that bias may do before it can occur.  This may seem to be a far more powerful tool for ensuring the legitimacy of marks than just identifying biased or dishonest judges for later punishment or other action, which would seem to be the best that means-based systems can do.

 

But these are just “seemses” and a quite different scenario can be laid out for the use of means-based scoring.  That scenario answers our final question:  how would a means-based system work, in actual practice?

 

5. O brave new world

 

So what would happen if we switched to a simple means-based scoring system?  How would it work, how would it address the challenges allegedly addressed by BOM (best-of-majority) and OBO (one-by-one), and what problems would it raise?

 

First, of course, we would reap all the benefits of a means-based system, as compared to those we now have.  For starters, these include:

 

Means-based scores are easier to understand;

 

Means-based scores are comparable from event to event;

 

Means-based scores never lead to reversals of position during a competition;

 

Means-based scores involve no distortions of judges’ decisions;

 

Means-based scores treat all judges equally; and

 

Means-based scores do not favor performance scores over technical.

 

In actual practice, means-based scores would work much as scoring does now, only far more simply.  That is, judges would assign technical and performance scores to each skater; these would be added or averaged across all judges, weighted appropriately for the program (one-half for the short program in the men’s, ladies’, and pairs’ events, and so on), and carried over into the next program.  At the end of the competition, the highest score wins, and so on down.  Should ties occur, sensible tie-breakers (discussed in my other posting) would be invoked – say, allowing them to stand for marks below fifth place, and holding a tie-breaker competition for marks between first and fifth places.
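The procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-half weight for the short program comes from the text, but the weight of 1.0 for the long program, the function names, and the sample marks are all hypothetical.

```python
from statistics import mean

# The text gives one-half for the short program; the 1.0 weight for the
# long program is an assumption made for this sketch.
PROGRAM_WEIGHTS = {"short": 0.5, "long": 1.0}


def program_score(marks):
    """Average of (technical + performance) over every judge on the panel.

    marks: list of (technical, performance) pairs, one pair per judge.
    """
    return mean(tech + perf for tech, perf in marks)


def competition_totals(results):
    """Weighted sum of program scores for each skater.

    results: {skater: {program_name: [(technical, performance), ...]}}
    """
    return {
        skater: sum(PROGRAM_WEIGHTS[name] * program_score(marks)
                    for name, marks in programs.items())
        for skater, programs in results.items()
    }


def standings(results):
    """Skaters ordered from highest to lowest total (ties not resolved here)."""
    totals = competition_totals(results)
    return sorted(totals, key=totals.get, reverse=True)


# Made-up marks from a two-judge panel, purely for illustration.
example = {
    "A": {"short": [(5.0, 5.0), (5.0, 5.0)], "long": [(5.5, 5.5), (5.5, 5.5)]},
    "B": {"short": [(5.5, 5.5), (5.5, 5.5)], "long": [(5.0, 5.0), (5.0, 5.0)]},
}
print(standings(example))
```

Tie-breaking is deliberately left out, since the text defers that to a separate posting.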

 

What problems would this present?  There appears to be only one (other than the non-issue of judges allegedly having to learn a new scoring system): dishonesty.

 

How well can majority-based systems handle this?  Well, judging by nothing more than Korytek’s success at the 1994 Olympics and Worlds, not very well. Indeed, the only support for the proposition that majority-based systems are up to this task at all comes from Ed Russell’s Monte Carlo studies of the effects of bias in several different judging systems, including best-of-majority, Borda and trimmed Borda counts (these are simply summed, or averaged, ranks) and trimmed means.  (Monte Carlo demonstrations, by the way, are large sets of hypothetical data, using random numbers generated according to specific rules, to indicate what might happen in the “long run”.  They have nothing to do with actual events whatsoever.)  But these show only that under the somewhat circular reasoning he used (defining majority-based judgments as “correct”) even the majority systems failed as much as 20% of the time under the mildest bias conditions, and about 90% of the time under the most severe.
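For readers unfamiliar with the technique, a Monte Carlo demonstration of this general kind can be sketched as follows. This is emphatically not Russell's design: the two-skater setup, the normal noise model, the panel size, and every parameter value are assumptions chosen only to show the mechanics of generating hypothetical marks and counting how often one biased judge changes a majority decision.

```python
import random


def majority_pick(marks_a, marks_b):
    """Skater preferred by more judges; 'tie' if the panel splits evenly."""
    a_votes = sum(a > b for a, b in zip(marks_a, marks_b))
    b_votes = sum(b > a for a, b in zip(marks_a, marks_b))
    if a_votes > b_votes:
        return "A"
    if b_votes > a_votes:
        return "B"
    return "tie"


def flip_rate(bias, trials=10_000, judges=9, noise=0.1, gap=0.05, seed=0):
    """Fraction of simulated events in which the majority result changes
    when the first judge adds `bias` to skater B's mark.

    All parameters are hypothetical: skater A is 'truly' better by `gap`,
    and each judge's mark is perturbed by normal noise.
    """
    rng = random.Random(seed)
    flips = 0
    for _ in range(trials):
        a = [5.5 + rng.gauss(0, noise) for _ in range(judges)]
        b = [5.5 - gap + rng.gauss(0, noise) for _ in range(judges)]
        biased_b = [b[0] + bias] + b[1:]
        if majority_pick(a, b) != majority_pick(a, biased_b):
            flips += 1
    return flips / trials


for bias in (0.0, 0.1, 0.3):
    print(bias, flip_rate(bias))
```

The printed rates depend entirely on the assumed rules, which is exactly the point made above: such numbers describe the model, not actual events.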

 

Once we recognize, then, that majority-based systems have only the limited virtue of being better than some other alternatives, it becomes more reasonable to phrase this issue in simple cost-benefit terms.  How much do we stand to gain by any set of procedures we adopt, and how much will that cost us?  And this makes it possible to look at a wide range of alternative strategies.

 

Two that have been suggested recently, for example, are those of hiring an independent pool of ISU (rather than national) judges, and of restricting the selection of judges in various ways.  The first of these seems reasonable, particularly when accompanied by close monitoring – more on that below – and the threat of loss of a fairly prestigious job.  It seems doubtful that the existing power groupings within the ISU – many of which are quite content with the present arrangements, because they actually leave more, rather than less, room for chicanery than the more transparent means-based systems would – will accept this.  But then, it seems doubtful that they will accept any change more drastic than the cosmetic shift from BOM to OBO.

 

On the other hand, the ISU appears to have accepted at least a limited form of panel adjustment by moving to random draws immediately prior to the competition.  But this too seems fairly minor.  More reasonably, we might ask that any country with a competitor seeded in the top five positions be ineligible for that judging panel, although this would probably permanently disenfranchise Russian and American judges.  (And, predictably, this new system does just the opposite, by restricting the judges’ pool to those countries actually represented at the event.  Doesn’t this appear to endorse home country bias as something judges are expected to engage in?  What earthly justification can be offered for this restriction?)

 

Any change, however, should be accompanied by better monitoring and evaluation of the judges’ evaluations, and much of this can be done on-line.  I can even envisage a system that works more or less as follows. 

 

Our computers now allow us to monitor at least the most egregious errors on the judges’ part as they occur.  These are home-country bias and deviation from the majority (as measured by the sort of correlations described above).  Such monitoring could result in either or both of two actions.

 

First, any infraction (error) on the judge’s part would be assigned a pre-determined number of penalty points, which would work much as driver’s points do now.  Collect enough, and you’re suspended; collect too many, and you’re banned.  (Just how many are enough or too many is still to be determined, as are the actual criteria for the errors.)  As with driver’s points, some of these would continue to accumulate over time; others would be erased after some length of time.
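A penalty-point ledger of this kind might look roughly like the sketch below. Since the thresholds, expiry period, and criteria are explicitly left open in the text, every number and name here (SUSPEND_AT, BAN_AT, EXPIRY_SEASONS, JudgeRecord) is a placeholder invented for illustration.

```python
from dataclasses import dataclass, field

# Placeholder values only: the text leaves all of these undetermined.
SUSPEND_AT = 6       # points at which a judge is suspended
BAN_AT = 12          # points at which a judge is banned
EXPIRY_SEASONS = 3   # seasons after which non-permanent points lapse


@dataclass
class Infraction:
    points: int
    season: int
    permanent: bool = False  # permanent points keep accumulating


@dataclass
class JudgeRecord:
    name: str
    infractions: list = field(default_factory=list)

    def add(self, points, season, permanent=False):
        """Record a new infraction against this judge."""
        self.infractions.append(Infraction(points, season, permanent))

    def active_points(self, current_season):
        """Sum of points that are permanent or have not yet expired."""
        return sum(
            i.points for i in self.infractions
            if i.permanent or current_season - i.season < EXPIRY_SEASONS
        )

    def status(self, current_season):
        """Standing implied by the judge's active points."""
        points = self.active_points(current_season)
        if points >= BAN_AT:
            return "banned"
        if points >= SUSPEND_AT:
            return "suspended"
        return "in good standing"


judge = JudgeRecord("Judge X")
judge.add(4, season=2000)                  # lapses after three seasons
judge.add(3, season=2001, permanent=True)  # never lapses
print(judge.status(2001), "/", judge.status(2004))
```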

 

Second, for the most serious infractions there would be the option of simply removing that judge from the panel entirely. Recalculated numbers based on the eight-(or six-)judge panel remaining can be made available as quickly as results are now:  the necessary calculations can be done in nanoseconds. (And if throwing out a judge in the middle of a competition seems draconian, please note that the judging system currently in use for artistic gymnastics allows just that.)
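Recalculating results without a removed judge is indeed a trivial computation under a means-based system, as this minimal sketch suggests. The data layout, the function name panel_means, and the sample marks are assumptions made for illustration.

```python
from statistics import mean


def panel_means(marks, excluded=()):
    """Mean total mark per skater, ignoring any judges listed in `excluded`.

    marks: {judge: {skater: total_mark}}
    """
    active = {j: m for j, m in marks.items() if j not in excluded}
    skaters = next(iter(active.values())).keys()
    return {s: mean(m[s] for m in active.values()) for s in skaters}


# Made-up marks from a three-judge panel.
marks = {
    "J1": {"A": 10.0, "B": 11.0},
    "J2": {"A": 10.5, "B": 10.5},
    "J3": {"A": 12.0, "B": 10.0},  # suppose J3 is removed from the panel
}
print(panel_means(marks))
print(panel_means(marks, excluded={"J3"}))
```

In this invented example the full panel puts A ahead, while the panel without J3 puts B ahead, so the removal changes the result as soon as the means are recomputed.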

 

Any of these systems, backed by the sort of strong enforcement outlined above, would seem to address most of the issues of potential chicanery more directly, and at far less cost, than our present attempts do.
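The two monitored quantities named earlier, deviation from the rest of the panel and home-country bias, could be computed along the following lines. This is a sketch under assumptions: the specific measures (a Pearson correlation with the mean of the other judges, and the difference between a judge's average over-marking of home and non-home skaters) are stand-ins chosen for illustration, not the measures used in the earlier paper.

```python
from statistics import mean


def pearson(x, y):
    """Plain Pearson correlation (assumes neither list is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def rest_of_panel_mean(marks, judge):
    """Per-skater mean of every judge's marks except `judge`."""
    others = [m for j, m in marks.items() if j != judge]
    return [mean(column) for column in zip(*others)]


def agreement_with_panel(marks, judge):
    """Correlation between one judge's marks and the rest of the panel."""
    return pearson(marks[judge], rest_of_panel_mean(marks, judge))


def home_country_excess(marks, judge, skater_countries, judge_country):
    """How much more the judge over-marks home skaters than other skaters.

    Assumes the judge marked at least one home and one non-home skater.
    """
    rest = rest_of_panel_mean(marks, judge)
    diffs = [own - r for own, r in zip(marks[judge], rest)]
    home = [d for d, c in zip(diffs, skater_countries) if c == judge_country]
    away = [d for d, c in zip(diffs, skater_countries) if c != judge_country]
    return mean(home) - mean(away)


# Made-up marks: {judge: [mark for skater 1, skater 2, ...]}
marks = {
    "J1": [5.0, 5.2, 5.4, 5.6],
    "J2": [5.1, 5.3, 5.5, 5.7],
    "J3": [5.0, 5.2, 5.4, 5.9],
}
countries = ["X", "X", "X", "Y"]  # country of each skater
print(agreement_with_panel(marks, "J1"))
print(home_country_excess(marks, "J3", countries, "Y"))
```

What size of excess or how low a correlation should count as an infraction is a policy question that the sketch leaves open.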

 

What they can’t immediately deal with, however, is collusion.  But then neither can majority-based systems, which, according to Russell’s simulations, fail about 50% of the time when two judges show only the mildest bias, and at least 70% of the time when three or more judges are biased.  But, strange as it seems, raw score systems may be better able to handle this, even today, than are our majority systems.  The main reason for this is that at present it takes only a small nudge (scoring a performance 5.3 for presentation and 5.2 for technique in the long program, rather than 5.2 and 5.3) to bring about a change in ranking, although this would not have any effect at all in a means-based system.  It follows that in order to cheat in a means-based system, somewhat more blatant procedures have to be used; and these will be that much easier to identify.
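The effect of such a nudge can be illustrated with a small sketch. It rests on several assumptions that are not spelled out above: that each judge ranks skaters by the sum of the two marks, that a tied sum in the long program is broken by the presentation mark, that the winner is the skater placed first by most judges, and that the third judge swaps the two marks for both skaters. The three-judge panel and all marks are invented.

```python
from statistics import mean


def judge_order(marks):
    """One judge's ranking: total mark first, presentation breaks ties.

    marks: {skater: (technical, presentation)}
    """
    return sorted(
        marks,
        key=lambda s: (round(sum(marks[s]), 1), marks[s][1]),
        reverse=True,
    )


def majority_winner(panel):
    """Skater placed first by the most judges (assumes no split panel)."""
    firsts = [judge_order(m)[0] for m in panel]
    return max(set(firsts), key=firsts.count)


def mean_totals(panel):
    """Mean of (technical + presentation) across the panel, per skater."""
    return {s: mean(sum(m[s]) for m in panel) for s in panel[0]}


j1 = {"A": (5.5, 5.4), "B": (5.3, 5.3)}  # places A first
j2 = {"A": (5.3, 5.3), "B": (5.4, 5.4)}  # places B first

# The third judge gives both skaters a total of 10.5 either way;
# only the split between technique and presentation differs.
j3_before = {"A": (5.3, 5.2), "B": (5.2, 5.3)}
j3_after = {"A": (5.2, 5.3), "B": (5.3, 5.2)}

before = [j1, j2, j3_before]
after = [j1, j2, j3_after]

print(majority_winner(before), majority_winner(after))
print(mean_totals(before) == mean_totals(after))
```

In this made-up panel the majority result moves from B to A when the third judge swaps the two marks, while the panel means are identical before and after.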

 

6. Summing up

 

The answer to the question posed in the title of this posting is then quite simple, although not terribly satisfying:  what we can do about dishonest judges is, first and foremost, identify them.  Once that’s accomplished, the rest may or may not be quite straightforward, depending on whether our governing organizations continue to value their status more than the sport.

 

And this identification would be far easier if we switched to a means-based scoring system, because such systems, being more transparent, both allow clear rules and criteria to be drawn, and allow more direct analysis of the judges’ actual behavior.  Thus, I am currently working on techniques that will, in fact, be able to statistically demonstrate collusion when it occurs.  I’m persuaded that this can be done by the simple recognition that since collusion must necessarily involve the manipulation of numbers, it must therefore be visible, in some way, in the numbers themselves.  “Seeing” these patterns is somewhat simpler when the numbers involved possess basic mathematical properties – which means and raw scores do, and rankings don’t.  And that reasoning applies at all times, again arguing strongly for a means-based system as the best procedure for identifying collusion, when telephone tapes and surreptitious foot-signals are not available to provide actual smoking guns.

 

On balance, then, I would add the notion that means-based systems are better able to identify chicanery and – given enforcement such as on-line rejection of judges – better able to deal with it than are majority-based systems, to the list of reasons for preferring this system to anything we now have.

  

 

 

 Entire contents of this site copyright by Dirk L. Schaeffer