College Policy Debate Forums
Author Topic: Speaker Points at Wake and elsewhere  (Read 9983 times)
glarson
Sr. Member
Posts: 477


« on: November 01, 2013, 12:00:20 PM »

I want to begin by strongly affirming Wake's plan to create a description of "community" standards for the assignment of points.  It is ultimately a human activity and community-based human solutions are always best. 

That said, it has been very easy to see in the last few days why speaker point inflation occurs.  Independent of organized efforts to change the assignment of points, each individual judge faces pressure to ensure that their points aren't "lower" than current norms and that the speakers they believe are the "best" are appropriately rewarded.  So both individuals and groups of individuals appear to react to what they perceive as inflation with more inflation of their own, and so it goes inexorably higher.  Unfortunately, any time a scale becomes unsettled, it also becomes unreliable when aggregations of scores are used for breaking, for seeding, and for speaker awards.  This doesn't just happen when tournaments like Kentucky and Harvard observe changes in points awarded.  It occurred when we moved to tenths with no ties, when we experimented with 100-point ballots, when folks adopted half-points, and so on.  Unless individual judges have substantially the same strategy or method for assigning points, aggregating them is very difficult.

As I noted above, I prefer a human solution to a human problem.  But debate is not the first or the last activity to have disruptive moments and inflationary moments in the assignment of scores to performances.  Every activity that uses subjective human scoring confronts the same issue.  And a ratchet that responds to someone else's presumed inflation with a readjustment of one's own scores is not a good solution at all.

While I hesitate to suggest it, there are statistical methods to address the sampling error and measurement variance that occur.  It is not as if I haven't foisted on the community technological solutions to our very real problems in the past.  Another court will have to decide whether that has been good or bad.  But here goes:

32 years ago I strongly suspect that I was the first to introduce the z-score transformation into the list of speaker point calculations.  I did it when I had to do it by hand with a calculator for a set of awards that we annually offered at Wheaton - "I coulda, I woulda, I shoulda won a speaker award."  I did it because folks complained that they were observing very different scoring strategies within the CEDA community of the time.  Indeed, my own boss was one who prided himself on maintaining a traditional scale where 25 was great and scores in the teens weren't unheard of.  In the intervening years, the z-score, or its cousin, judge variance, has remained the third or fourth or fifth tiebreaker to break those seemingly unbreakable ties.  But it has never done much more than that.  There's a good reason I've never promoted it more heavily.  While it answers some problems, it fails to answer others and may even create its own.
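For anyone who wants to see the mechanics, a minimal per-judge z-score looks like the Python sketch below.  The data and names are made up for illustration; this is the textbook transform, not the exact code any tab room runs.

[code]
from statistics import mean, pstdev

def judge_z_scores(points_by_judge):
    """Convert each judge's raw speaker points into z-scores
    (points minus that judge's mean, divided by that judge's
    standard deviation) so judges on different scales can be
    aggregated on a common footing."""
    z = {}
    for judge, points in points_by_judge.items():
        mu = mean(points)
        sigma = pstdev(points) or 1.0  # guard against a judge who gives identical points
        z[judge] = [round((p - mu) / sigma, 2) for p in points]
    return z

# Hypothetical data: one judge on a compressed modern scale,
# one on an older, wider scale.
raw = {
    "judge_a": [28.5, 28.7, 29.0, 28.3],
    "judge_b": [24.0, 26.0, 22.0, 25.0],
}
print(judge_z_scores(raw))
[/code]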

But in the last few days, I've begun testing a second-order transformation that addresses the weaknesses of the z-score and potentially addresses the kinds of anomalies that we are confronting (at least until we can return to something resembling a community consensus).  I suggested it as an option to Wake, but given its complexity they wisely suggested that I illustrate how it works with real data sets before anyone is tempted to use it in tournament conditions.  So over the next two weeks, I will recompute Kentucky, Harvard and then the Wake data to illustrate how a statistical transform would create different seeds and speaker awards.  I should say up front that while the results would be "different," it is not really possible to argue a priori or even a posteriori that they are best or that they are "correct."  That's the problem with all of our statistical modeling.  While we want to aggregate or compare across rounds, teams, judges, etc., it is really the case that every debate round is statistically an n of 1.  During a tournament, we are not really getting 8 successive approximations or measurements of a debater's abstract "skill."  We are scoring 8 different performances and trying to figure out how we can fairly aggregate them together to decide who DID IT best - not who IS best.

So here's the issue in a nutshell (or at least one of the issues).   When a judge assigns a set of points to speakers in a round, he/she does that as part of two subsamples of the overall population, both of which can create skews.  First, we are all aware that any given team faces a decidedly non-random sample of judges in the tournament, who may or may not give points in accordance with the overall distribution at the tournament.  This becomes more poignant if we fear that some judges will intentionally award points that are not consistent with the overall distribution (for whatever reason).  That is one reason why z-scores are used in statistics – to control for observer variance.  BUT – and this is a big BUT – if we were to decide to normalize a judge’s points, we need to also be concerned that just as a team didn’t get a random sample of judges, each judge doesn’t get a random sample of teams.  So a judge that has an unusually high average “might” be a point fairy and deserve to have the impact of their high points reduced OR they might have simply judged better rounds with better speakers.  The typical z-score not only doesn’t control for this but actually inappropriately depresses the scores of judges who judge a stronger sample of teams. 
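A toy illustration of that last point, with invented numbers and a plain per-judge z-score:

[code]
import numpy as np

# Invented numbers.  Judge A happens to draw the tournament's strongest
# speakers; Judge B draws an average pool.  Both use the same internal
# scale, so Judge A's higher mean reflects the pool, not point fairying.
judge_a = np.array([29.3, 29.1, 29.4, 29.0])   # strong pool
judge_b = np.array([28.6, 28.2, 28.9, 28.4])   # average pool

def z(points):
    return (points - points.mean()) / points.std()

# A plain per-judge z-score forces both pools onto the same mean and
# spread, so Judge A's genuinely stronger speakers get pulled down to
# the same range as Judge B's.
print(z(judge_a).round(2))
print(z(judge_b).round(2))
[/code]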

That's why we would have to do a second normalization of the judge scores to control for differences in the sample of teams that they judged.  Fortunately, such a set of tools is available.  I will begin with the historical data sets from Kentucky, which I already have in a form I can manipulate, and Harvard, which I anticipate getting in the next week or so.  In the meantime, my hope is that community norms will begin to re-emerge at Wake, which would be demonstrated by the transformations becoming unnecessary.  If that doesn't appear to happen, we'll have to discuss as a community how we can go forward. 
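To make the "second normalization" idea concrete, one generic member of that family of tools alternately estimates a speaker effect and a judge effect, so that a judge's adjustment is measured against how their speakers fared with their other judges rather than against the raw tournament average.  The sketch below is an illustrative additive two-way model with invented data - it is not the actual transform I will run on the Kentucky, Harvard and Wake sets.

[code]
import numpy as np

def doubly_normalized(points, iterations=50):
    """points: a (speakers x judges) array of raw speaker points,
    with np.nan wherever a judge did not see that speaker.

    Alternately estimate a per-speaker effect and a per-judge effect
    around the overall mean, then return the points with only the
    judge effect removed.  The speaker effect soaks up differences in
    the pools judges happened to see, so a judge is not penalized
    merely for having judged better debaters."""
    overall = np.nanmean(points)
    speaker_eff = np.zeros(points.shape[0])
    judge_eff = np.zeros(points.shape[1])
    for _ in range(iterations):
        judge_eff = np.nanmean(points - overall - speaker_eff[:, None], axis=0)
        speaker_eff = np.nanmean(points - overall - judge_eff[None, :], axis=1)
    return points - judge_eff[None, :]

# Tiny invented example: 3 speakers, 3 judges, nan where a judge
# never saw that speaker.
raw = np.array([
    [28.9, np.nan, 28.6],
    [28.4, 29.2,  np.nan],
    [np.nan, 28.8, 28.1],
])
print(doubly_normalized(raw).round(2))
[/code]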
lgarrett
Jr. Member
Posts: 54


« Reply #1 on: November 01, 2013, 12:30:32 PM »

Gary, thank you for doing this. You provide invaluable and under-appreciated contributions to the community.