College Policy Debate Forums
Author Topic: Speaker Points at Wake Forest  (Read 7018 times)

« on: November 21, 2013, 12:00:43 PM »

We first want to congratulate all of those who won speaker awards at the Shirley.  When one enters a competition, pours their energies into success, and then achieves it, that should be congratulated.  I personally attended the impromptu speaker awards at Kentucky, clapped for those at Harvard, and then felt quite privileged to call out the names of very talented speakers at our tournament.  Regardless of what we think about speaker points in general, the accomplishments of debaters who win the awards deserve credit; we wish for that message to continue.

A secret meeting wasn’t the only thing that happened this weekend that was not in the competitive or educational Spirit of the Shirley.  Our community has been battling about speaker points all year, from GSU through Kentucky and at Harvard, so we made a concerted effort to publish and promote a speaker point scale that we hoped would return some reliability and fairness to the speaker point system.  If judges promote a specific agenda, whether stylistic or competitive, with speaker points it will be difficult for anyone to view the competition, especially the determination of the lower parts of the elim bracket, as fair to competitors or representative of their achievements.  When there is speaker point inflation, some students will feel cheated and those that do clear will have their achievements viewed by many as hollow.  Either condition puts our community in jeopardy, where arms racing is the best case scenario and ultimate incompatibility is the worst.

It was clear from the Shirley that a large number of judges followed the suggested scale published by the tournament.  In fact, there were very few outliers.  Perhaps this is an oversimplification, but we think fewer than 10 judges had a very substantial impact upon which 5-3 teams cleared.  Motivations aside, this was similar to what happened at Harvard and arguably Kentucky.  Those who did follow their own scale, especially those who adopted a 29.3-30 scale, had a statistically significant impact upon which teams broke.  In fact, if a team (due to their pref sheet, their opponents' pref sheets, and random luck) got two judges who followed that scale, their chances of clearing and receiving a speaker award were much higher than those of their competitors who did not.

Dallas Perkins said it well at the Harvard awards assembly, “any judge that uses speaker points to achieve an agenda besides identifying the best speakers does an injustice to our activity” (paraphrased).  The idea that “I will give higher points to people who pref me” seems fundamentally unfair.  Unfair to whom?  Unfair to the entire pool of debaters and fellow judges.  Even though two teams from the same school might have the exact same pref sheet, their opponents, judge conflicts, judge commitments, and the availability of judges all factor into who gets placed.  Unfair about what? The entire activity… If all debate is to be cherished, there must be an equal opportunity for all debaters, no matter their style, to make it to the elimination rounds and achieve speaker awards.

Are there historical injustices that need the attention of tabrooms? Yes.  We felt like our tabroom worked in the innovative Spirit of the Shirley.  We eliminated extra rounds in the prelims.  Gary ensured that all judges received “in-it” debates.  We ensured proportional representation across the elims.  These are not THE solution.  Dialogue must continue on these questions.  But it is hard to deny that ANY judge who follows their own scale poisons the well of debate for all.

It should be obvious to everyone by now that Wake Forest is not interested in choosing a single style of debate.  We have a squad that has attempted to be as diverse as the debate community we participate in and to be a leader in diversity at our home institution.  We challenge squads with the resources to do the same to follow suit.  Shout out to Liberty, Wayne, and many others… We think that we all get better by having debates with both like-minded and different-minded approaches to debate.  The team that won our tournament has done a great job of adapting to engage others on their merits in the vast majority of instances.

Attached you will find the data from the Wake Tournament.  Gary Larson’s words: “The data I’m giving you give the average and stdev for each judge (against a population avg of 28.46, median of 28.4, and stdev of .51 – excluding all of the penalty zeroes and bye points).  The next columns show the average and stdev that the students they judged received in all of their rounds.  This permits you to determine whether they saw stronger or weaker students.  While this 2nd order comparison is the one I think is most valid, it should be noted that it is subject to “collusion” effects.  An increase in points by several judges who may all end up judging the same subpopulation increases the average for that group thereby blunting the second order z-score by making the group seem stronger across all of their rounds.”

- Focus on the difference in points (the last column), not the average.  Two examples...
While Shanara Reid-Brinkley has an extremely high average, one can also note she was statistically in line with the points given to the debaters she watched.  In other words, she judged a really good debate (one of the teams she judged in her only round, won the tournament).  Similarly, Edmond Zagorin’s average points were extremely low, but others agreed with the points he gave the speakers.  Despite high or low averages, these judges appeared to have followed the scale.

- Some of the data is thrown off by those who were judged by multiple critics who used a scale at one extreme or the other.  If a debater had two judges who used high points, their average was higher, thus the difference for any individual outlier seems smaller.
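The "difference" column Gary describes (a judge's own average minus the average that the same debaters earned across ALL of their rounds) can be sketched in a few lines.  This is only our illustration of the first-order calculation; the record layout and function names are hypothetical, and it does not reproduce Gary's z-score machinery:

```python
from collections import defaultdict

def judge_differences(records):
    """records: list of (judge, debater, points) tuples, one per prelim speech.

    Returns each judge's average minus the 'sample average' -- the mean that
    the debaters they saw earned across all of their rounds, from all judges.
    A value near zero means the judge was in line with the rest of the pool.
    """
    # Average points each debater received over all rounds, all judges.
    by_debater = defaultdict(list)
    for _, debater, pts in records:
        by_debater[debater].append(pts)
    debater_avg = {d: sum(v) / len(v) for d, v in by_debater.items()}

    # For each judge: own average vs. the sample average of who they saw.
    by_judge = defaultdict(list)
    for judge, debater, pts in records:
        by_judge[judge].append((debater, pts))
    diffs = {}
    for judge, seen in by_judge.items():
        own_avg = sum(p for _, p in seen) / len(seen)
        sample_avg = sum(debater_avg[d] for d, _ in seen) / len(seen)
        diffs[judge] = own_avg - sample_avg
    return diffs
```

Note how this illustrates the "collusion" caveat in Gary's quote: if several inflating judges all see the same subpopulation, they raise that group's `debater_avg`, which shrinks each individual judge's apparent difference.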

A note on our delay on this question….Gary Larson has indicated he is working on some conglomerated data on several tournaments this semester.  Our hope was to be able to wait until it was completely ready.  Current events have caused us to release a statement now.  Expect more data from Gary to follow.

One major problem with speaker points aside from ideology questions….Speaker points use a scale typically reserved for artistic events to measure a competitive enterprise.  Discrepancies involving judges are inherent in any event where people have differing perceptions.

What will not work… current speaker points as usual.

A couple of recommendations to others running tournaments:
1 - Margin of Victory as the first tie-breaker - This measurement of success utilizes a universal scale.  While it is relatively new, it is the tie-breaker used at most tournaments that utilize more than 1 judge (round-robins, the NDT, various district tournaments).  We employed a (3,2,1) version, asking the judge whether they saw the debate as a 3-0, a 2-1, or a 1-2 with themselves on the bottom.  Others could employ a (5,4,3) version, asking the judge whether they saw the debate as a 5-0, a 4-1, or a 3-2.  There are coding/electronic ballot issues at play, though.

2 - Second order z-scores - Gary will write more about this.  My second-hand understanding is that this would compare a judge's speaker points against the average and overall rankings of the debaters.  Please, please wait for Gary on what this means.

3 - Still publish a speaker point scale - The vast majority follow it.  Using second order z-scores does account for outliers.  But having a scale still helps those scores be more accurate.
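The (3,2,1) scheme in recommendation 1 is simple enough to sketch.  This is our illustration only (the dictionary and function names are ours, not any tab room's software):

```python
# (3,2,1) margin-of-victory tie-breaker: each judge reports whether they
# saw the round as a 3-0, a 2-1, or a 1-2 with themselves on the bottom;
# the winning team banks the corresponding points for tie-breaking.
MOV_SCALE = {"3-0": 3, "2-1": 2, "1-2": 1}

def margin_of_victory(ballots):
    """ballots: one reported margin per prelim win, e.g. ["3-0", "2-1"]."""
    return sum(MOV_SCALE[b] for b in ballots)
```

A team whose wins were reported as a 3-0, a 2-1, and a 1-2 would carry 6 margin points into the tie-break; a (5,4,3) variant would just swap the dictionary values.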

One additional note... Some will say to use opp wins.  This does not work with the STA/CAT algorithm, which already includes opposition strength in the prelims.  Opp wins have plusses and minuses when pairing uses only speaker points.  But clearing on them under the CAT/STA would inherently involve a question of luck about whether or not one was pulled up or down.

Finally, we issue a challenge to all judges to follow the scale given by a tournament.  A public acknowledgement from those who have inflated or deflated, whether at Harvard, Wake, or any other tournament (some of you for the second time), might create confidence that a split is not necessary or inevitable.

Justin Green on behalf of Len, Jarrod and Justin

* Points at Wake.xlsx (30.38 KB - downloaded 705 times.)
« Last Edit: November 21, 2013, 12:49:57 PM by JustinGreen »

« Reply #1 on: November 21, 2013, 12:35:23 PM »

Very useful chart. Would it be fair to say that a judge who is trying to avoid having their assignment to judge either help or hurt the debaters they judge would endeavor for something close to a zero in column 7?

I noticed two sets of judges were highlighted based on having the same average points. Some of these judges were part of the Harvard speaker point discussion, but several were not. Is the highlighting intended to show medians, modes, the "clearing line," or something else?

« Reply #2 on: November 21, 2013, 12:40:59 PM »

If folks want to sort by difference rather than average speaker point (which, as Justin says, is potentially more useful), you can:

Select columns 1-8 by left-clicking on the '1' at the top of column 1 and dragging over to the '8' until all 8 columns are fully highlighted.

Select "Sort & Filter" in the top right of the Home ribbon (it has an A-Z picture next to a funnel and is second from the right).

Select "Custom Sort" from the drop-down menu of options.

Select "Difference" from the column on the left labeled "Sort by".

Sorry if this is obvious, it just took me a depressingly long time to learn this on my own and figure a few others may also be in the dark.

« Reply #3 on: November 21, 2013, 12:52:54 PM »

The highlighted rows were the medians - they had nothing to do with Harvard or any other tournament.  Anything there is pure coincidence.  I don't think they are that relevant to a larger discussion.  I uploaded a new version without the highlighting to avoid confusion.  Yes, the ideal judge has a score in the difference column as close to zero as possible.  But outliers have an effect on the reliability of that score as well.

« Reply #4 on: November 21, 2013, 01:22:46 PM »

I sorted the data on the 'Diff' column (which is Average minus Sample Average) and logged the rounds for the 9 judges with the largest differential (between 1.20 and 0.40).  I compared that log with the 12 5-3 teams that broke.  Having done so, it is hard to claim that any one of those 12 teams was materially impacted by this 9-judge group.

The results show that 5 teams received 1 of these 9 judges on 1 occasion (and for 2 of the 5, that occasion was a head-to-head match-up between ultimate 5-3 teams).  No team was judged by 1 of the 9 on more than 1 occasion.
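The cross-check described above (counting how often each clearing 5-3 team drew one of the outlier judges) amounts to a simple tally.  A sketch with hypothetical names; the real log would come from the tournament's round reports:

```python
from collections import Counter

def outlier_exposure(round_log, outlier_judges, teams):
    """round_log: (judge, team) pairings from the prelims.
    Counts how many times each team of interest (e.g. the 5-3 teams
    that broke) was judged by someone in the outlier group."""
    return Counter(team for judge, team in round_log
                   if judge in outlier_judges and team in teams)
```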