College Policy Debate Forums
Author Topic: What Are We Trying to Do With The 100 Point Scale?  (Read 8437 times)
Ryan Galloway
« on: November 09, 2009, 10:20:10 PM »

This is a follow-up on Stefan's rubric for evaluating the 100 point scale.  My basic thesis is that the 100 point scale is causing wide variations in points, and it is having the greatest impact not at the top of the scale (those winning speaker awards), but rather in providing highly inconsistent information for coaches who are attempting to bolster the debating skills of younger/improving debaters.  I would favor a switch back to 30, with .1 increments.

Stefan has identified the goals of the 100 point scale well.  I'll first note the upsides he identified, then talk about the downsides.

"I think it is fair to say that the move to a 100 point scale was established with two goals in mind:

1)   Increasing variation in the awarding/distribution of speaker points
2)   Reducing point inflation"

These goals have been accomplished.  I feel the benefits of goal 2 are over-stated, however.  I worry more about point conflation than point inflation.  I don't think judges were being very careful with the old 30 point scale, and too many ballots would come in with the points the same for every debater in the round.

Stefan continued:  "That said, there seems to be two downsides associated with the scale:

1)   Inconsistent variation
2)   Some judges assigning points that are so low and off the scale that people now (want to) strike them solely because their points are inconsistent with the scale."

This first argument should be carefully evaluated, and I feel it is such a significant downside that it may require moving away from the scale altogether, in favor of the 30 point scale with .1 increments.

My primary argument is that points are frequently a benchmark that debaters and coaches use to measure improvement.  They also point to "problem signs" in a given round where a debater might need to give a rebuttal re-do, or get some extra help.  They can also be used to determine whether a debater is in the right division, or to better pair debaters on a squad.

I am far less concerned about who lands in the top 10 speakers (my read is that the speakers are relatively consistent, but may get jumbled around in the top 10/20 a bit).  My argument is that for debaters not in the top of that pack, the information coaches can glean from points is being scrambled to such a degree as to be almost completely useless.  I'll cite the narrative example, and then some specifics.

One of my favorite debaters on my current squad is Jayme Cloninger.  Jayme debated in homeschool leagues in high school, which is pretty different from college debate.  Her goals aren't to win speaker awards, but to get better at debate to where maybe she can someday clear consistently at Varsity regional tournaments and potentially qualify for the NDT.  Jayme has used points as a benchmark in the past.

We've looked at her points from last year, and she decided early on that improving from about a 27.5 average to a 27.75 average would be a good goal.  Knowing the 30 point scale the way I do, at an 8 round tourney, I can look at "total points."  I know a 27.5 average is 220, a 28 is 224, a 28.5 is 228.  It's pretty easy to look at those points for debaters setting benchmarks.
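The benchmark arithmetic here is simple enough to sketch: on the 30 point scale, an 8-round total is just the per-round average times eight.

```python
# Illustrative sketch of the benchmark arithmetic described above: an
# 8-round total is the per-round average times the number of rounds.
def total_for_average(avg, rounds=8):
    """Total speaker points implied by a per-round average."""
    return avg * rounds

for avg in (27.5, 28.0, 28.5):
    print(avg, "->", total_for_average(avg))
# 27.5 -> 220.0, 28.0 -> 224.0, 28.5 -> 228.0
```

This is why totals like 220, 224, and 228 carry instant meaning for coaches used to the old scale.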

Further, the old 30 point scale is pretty standardized.  I've seen many tournaments where a debater got the same points in every single round.  When I was at Augie, a debater on the rise for my squad got 27's in every round at Kentucky, save round 7, where she got a 27.5.  We had prepped carefully for that debate against the Catholic strategy we heard in round 5, and she executed better in that debate (I watched both debates).  It was fair to say that "whatever a 27 is" Corrina was a 27 at that time.

We could use that information to measure and evaluate improvement.  We could set benchmarks as to how to get better. 

Getting back to Jayme, she got the same points in every round at districts last year save 1.  Here were her points from Liberty, using the 100 point scale. 

Round 1:  72
Round 2:  87
Round 3:  75
Round 4:  86
Round 5:  91
Round 6:  82

This information is basically useless to me.  I get the sense that rounds 1 & 3 weren't the best, but the swing is 19 points.  She goes from a C- to an A- at the tournament.  Maybe she is wildly inconsistent (last year leads me to believe she isn't).  Or maybe the new scale is just broken--it isn't providing us the information coaches need to help debaters improve. 

None of this should be viewed as an indictment of any of the judges Jayme had--she had some EXCELLENT critics this weekend.  It is merely that the standard I can use to evaluate success in a given round is basically out the window as we all adapt to that scale.  I'm frustrated that the ability to measure the goals we set together for her improvement for half her junior year is basically gone while we experiment with the speaker point scale.

After GSU, Jayme was eager to check out her points (she was excited to be 3-4 at the tourney, as that was her goal for the tournament).  Instead of being able to look and see "220" "222" "223.5"--numbers that would have instant meaning to me, I saw 596.  This number is useless to me.  She asked me what that meant, and I just said words to the effect of, I have no idea, it's on the 100 point scale.

I heard a judge this weekend say they were using "75" as an average.  A couple of posts on edebate convinced several young coaches to deviate from the "87" as average from GSU, even though folks like me were fine with that, because any norm is better than not having one at all.  Some are using "85" as an average.  Some are making up their own scale. 

The truth is that no matter how hard we try, many judges just aren't going to "get the memo" as to how to use the 100 point scale.  John Wilkerson asked me about 100 points this weekend, because he'd never judged on it.  A couple of edebate posts convinced some judges, but not others.  We are in a world of pretty unstable points right now, and that is hurting the ability of coaches to help their debaters.

Many have argued that "we'll eventually figure the new scale out."  My argument is we already figured the old scale out.  Veteran coaches "know a 28 when we see it," even if we can't explain exactly why.  This information helps us help our younger debaters.  The status quo instability could take years to settle into a shared meaning.  And then, once we've solidified the scale, aren't we back to where we started?  Why not mildly tinker with the old system to help us out?

I'm also intrigued that Harvard's "solution" to the problem was to essentially use the old scale with .1 increments, and then "translate" that to the 100 point scale.  So people are using the old scale to create the new scale.  It seems like we are taking an unnecessary step of translation.  Why not just directly allow .1 increments?  Why go from English to Spanish to German and then back to English?  The .1 increments seem to solve the wild instability problem while allowing for meaningful deviations on the scale.

I would also encourage tournaments using the 100 point scale to use a different metric for clearing on points in the interim.  I'm unconvinced that speaker points are a very meaningful measure of very much right now.  I have opposed "opp wins" in the past, but it may be a far better metric than speaker points until the scale calms down.

I frankly think a lot of damage is being done to a method of evaluation for younger debaters in the name of minor improvements in our speaker point scale.  I would greatly welcome a tournament today using the 30 point scale--either with the old .5's or the .1's.

Perhaps others do not use points the way I do.  But I seriously used to evaluate "220" as a critical benchmark of improvement for a debater--it meant you averaged 27.5--which most considered to be average.  If you were getting to "224" then you and your partner would probably clear or barely miss on points.  "228" meant speaker award range.  These numbers had meaning, but the almost random number I'll see at the end of a tourney now means almost nothing.  It's not really even worth looking at it.

Experimentation is good.  The 100 point scale effort is a noble one.  However, after seeing the "trial run" at Wake last year and seeing several tournaments on it this year, I think 30 with .1's is a better option.  For better or worse, judges, coaches, and most debaters know what 30 points means, and it helps measure improvement.  Adding in the .1's allows us to measure improvement while providing for some variation on the scale.

Thanks for reading, and much love to everyone trying to make debate a better place.


hansonjb
« Reply #1 on: November 09, 2009, 10:58:12 PM »

i absolutely agree with ryan.

i don't get the 100 point scale. sure, we want more DIFFERENTIATION but we don't want RANDOM VARIATION based on what a critic thinks the 100 point scale means.

the good old 30 point scale with .1 increments would give differentiation while avoiding rampant random variation.

further--the best implementation of the 100 point scale that i have seen, harvard's, which included explicit instructions for how to give the points: 1) was ignored in many cases and led to wide variations. this means that debaters get speaker awards/advanced to elims BASED ON LUCK OF DRAW OF WHO THEY GOT AS A JUDGE; 2) the scale INHERENTLY WILL LEAD TO EVERYONE SCORING IN THE 90s as time moves on. how many of you profs give students of the caliber of most policy debaters a "c" or even a "b"? soon, 90 will be considered low.

go back to the 30 point scale just add in .1 increments. seriously.


jim hanson :)
seattle u debate forensics speech rhetoric
ScottyP
« Reply #2 on: November 09, 2009, 11:07:13 PM »

http://www.ndtceda.com/pipermail/edebate/2005-October/063965.html
hansonjb
« Reply #3 on: November 09, 2009, 11:45:58 PM »

gary's point is well taken in the context of .1 increments versus .5.

his point doesn't apply to .1 increments versus the 100 point scale where variation is randomly based on the judge you get.

yes, there will be some level of randomness with .1, but not anything like the 100 scale. and again, as ryan correctly notes, judges get the 30 point scale. we've got several years before the 100 point scale is handled consistently and my argument is that when that happens, it will be scores in the 90s; 80s will be perceived as b's and hence unacceptable/a slap in the face.

in concrete terms, if i give a student a 28.5--that's a good, solid debater. a 28.4 just slightly less so; a 28.6 just slightly more so.

if i give a student an 84, 85, or 86--i'm very frankly going to feel like i'm telling that debater they aren't so good--even with a harvard scale staring me in the face. i get that the two are not any different but that's how i will react.
V I Keenan
« Reply #4 on: November 10, 2009, 10:56:54 AM »

Jim's "feelings" about giving an 80-something are both a good indicator of why the "grade metaphor" is bad for accomplishing the differentiation we're looking for and why a transition to 100 points is going to come with baggage.

Many debate coaches as educators assign grades, and giving something on a 100 point scale does feel a lot like that.  Every debate judge was at one point a student who was assigned grades, and each brings their own personal history to assigning a "score" to others based on their own experiences being scored.  Basically, because of our experiences it's really hard to distance using the 100 point scale from the experience of grading unless we make an active effort to do so and establish a new paradigm for ourselves as critics and as coaches explaining this evaluation tool.

I admit I've found it interesting how difficult it has been for people to be a "hard grader" - to know that a certain score is what a student probably deserves in a certain scale, but to feel bad about the number because of its connotation in other academic contexts.  I also find it amusing that a number of individuals seem to feel like "B" is a "bad" grade ... but that's a whole other commentary on grade inflation and the educational system.

Yes, debaters tend to be smart.  That doesn't mean they necessarily have always done the assignments asked or met the criteria of a rubric, which means even if they are smart it doesn't necessarily mean they deserved an "A".  It also doesn't mean that they don't appreciate seeing "improvement" in their performance, which is another reason that speaker points aren't grades but a metric ... think of it like studying for a standardized exam.  You got 600 in verbal on Practice SAT #1, and 630 on practice SAT #2, and 670 on the actual exam.  The practice/work/effort pays off.  THAT is how speaker points should be maybe?  Enough of our students took standardized tests and prep to get into college, or will do it for the LSAT, that this shouldn't be too difficult a paradigm to explain. Could we work with that metaphor?

Or batting averages?

Or something?

But basically, until we change the "feelings" we have about numbers and their abstract meaning of worth, there will never be any real change in points in the community and inflation and distribution will continue to be problematic.

Speaker points are "symbolic" - not actual.  One of the reasons I did like the 30 scale was that it was a lot easier to convey the idea of a "symbolic" meaning to both students and new judges, because it did not tie into a previously established metric paradigm.  It is possible that the debate community will eventually create some kind of consensus on 100 point symbolism, but it will take time and a lot more discussion (maybe changes in judge philosophies to mimic grading rubrics?).  The real question is whether we have the patience to go through those growing pains because we believe the new system will ultimately be better, or whether it will have no comparative advantage to the sqo, meaning the frustration was pointless.

ScottyP
« Reply #5 on: November 10, 2009, 11:03:05 AM »

Jim,

Correct me if I'm wrong: the rubric for the Harvard scale was 30 points with decimals, converted to the 100 point scale?
hansonjb
« Reply #6 on: November 10, 2009, 01:54:57 PM »

yes, harvard was 27 = 70, 28 = 80, 28.1 = 81, 28.2 = 82, etc.
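From the pairs listed (27 = 70, 28 = 80, 28.1 = 81), the translation looks like a simple linear map: each whole point is worth 10 and each tenth is worth 1, i.e. (x - 20) * 10. A sketch, with the caveat that this formula is inferred from the listed examples rather than taken from Harvard's actual instructions:

```python
# Inferred 30-scale to Harvard-100-scale translation. The rule
# (points_30 - 20) * 10 matches the pairs listed in the post above;
# it is a guess at the general formula, not the official rubric.
def harvard_100(points_30):
    return round((points_30 - 20) * 10, 1)

assert harvard_100(27) == 70.0    # 27 -> 70
assert harvard_100(28) == 80.0    # 28 -> 80
assert harvard_100(28.1) == 81.0  # 28.1 -> 81
```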
A Numbers Game
« Reply #7 on: November 10, 2009, 04:16:08 PM »

Disclaimer: The following numbers are meant to help decide which scale to use. I don't have an opinion on which scale is best.

Quote from Ryan Galloway:
This is a follow-up on Stefan's rubric for evaluating the 100 point scale.  My basic thesis is that the 100 point scale is causing wide variations in points, and it is having the greatest impact not at the top of the scale (those winning speaker awards), but rather in providing highly inconsistent information for coaches who are attempting to bolster the debating skills of younger/improving debaters.

One specific way the 100-point scale might cause inconsistency is between judges who treat the new scale differently. This can be measured by calculating each judge's deviation from the points other judges give to the same debaters. Using the Mann-Whitney U test, we can normalize these scores and then get an overall measure of how much judges at a tournament disagree with each other about points.

Using this scale of disagreement, 1.0 would mean total agreement. The average disagreement (since 2003-2004, credit debateresults.com) for divisions using the 30-point scale is 1.67, with a standard deviation of 0.25. Using the 100-point scale, the average is 1.99 and the standard deviation is 0.26.

Some divisions using the 100-point scale have high disagreement: The UNLV open division this year had a disagreement of 2.61. On the other hand, the Wake tournaments from last year and the year before had disagreements of 1.72 and 1.61, respectively. Two more framing examples: Harvard had a 1.73 this year, and Liberty's open division had a 2.11 this year.
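For readers curious how a disagreement number like this might be computed, here is a much simplified sketch of the underlying idea (it is not the Mann-Whitney-based statistic the poster describes): measure how far each judge's score sits from the average score the other judges gave the same debater, then average over all ballots.

```python
# Simplified stand-in for a judge-disagreement statistic: mean
# absolute deviation of each ballot from the other judges' mean
# for the same debater. Illustrative only.
from collections import defaultdict

def judge_disagreement(ballots):
    """ballots: list of (judge, debater, points) tuples."""
    by_debater = defaultdict(list)
    for judge, debater, pts in ballots:
        by_debater[debater].append((judge, pts))
    deviations = []
    for scores in by_debater.values():
        for judge, pts in scores:
            others = [p for j, p in scores if j != judge]
            if others:
                deviations.append(abs(pts - sum(others) / len(others)))
    return sum(deviations) / len(deviations) if deviations else 0.0

# Two judges one point apart on the same debater -> disagreement 1.0
print(judge_disagreement([("A", "x", 28), ("B", "x", 29)]))  # 1.0
```

Under this kind of measure, a division where judges use different scales (a "75 is average" judge alongside an "87 is average" judge) would show much larger deviations than one with a shared norm.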

Quote from Ryan Galloway:
Why not just directly allow .1 increments?

One problem with going to a 0.1-increment scale is that some people may not "get the memo", as you mentioned. When USC used this system in 2008-2009, more than one out of every ten rounds was judged by a critic who did not, in that or any other of his or her rounds, use the new scale. Perhaps this would be alleviated by requiring what St. Marks required this year in their move to the 100-point scale: no two debaters in any round may be assigned the same points.

Quote from Ryan Galloway:
I would also encourage tournaments using the 100 point scale to use a different metric for clearing on points in the interim.  I'm unconvinced that speaker points are a very meaningful measure of very much right now.  I have opposed "opp wins" in the past, but it may be a far better metric than speaker points until the scale calms down.

I expected the switch to the 100-point scale to substantially improve the ability of speaker points to predict winners between teams with the same record. I was wrong.

It's hard to compare how well points predict winners because it's hard to find apples-to-apples comparisons. I compared some tournaments to their previous incarnations and found that speaker points as a tiebreaker in the 100-point scale are doing about as well as they were under the 30-point scale. This might just be because of the small sample size, however.

Quote from Ryan Galloway:
Perhaps others do not use points the way I do.  But I seriously used to evaluate "220" as a critical benchmark of improvement for a debater--it meant you averaged 27.5--which most considered to be average.  If you were getting to "224" then you and your partner would probably clear or barely miss on points.  "228" meant speaker award range.  These numbers had meaning, but the almost random number I'll see at the end of a tourney now means almost nothing.  It's not really even worth looking at it.

Perhaps tournament directors who choose to use the 100-point scale could put some very simple statistics in the packets: average points per round, average total points, etc. This, of course, won't fix the problem of judges who use different scales.

gabemurillo
« Reply #8 on: November 10, 2009, 04:32:15 PM »

Nothing profound to add - just wanted to echo support for the 30 point scale with .1 increments. The 100 point scale has brought so much stress to my judging. It seems like there is real tension between the original goal of the scale (real difference between speakers) and the outcome of the scale (obscenely high points).

I am generally speaking a judge who tries to follow community norms when it comes to speaker points, which is why I circled yes on every ballot at Harvard and followed their recommended scale (ditto for the Missouri State tournament). At the very least I hope every tournament that uses a 100 point scale provides similar guidance.

I just feel that it would be so much easier to adapt to a 30 point scale with .1 increments. I suppose my post here is less to engage in a debate about the mathematical warrants of the different systems, and more to say, as a person who judges a lot of rounds and teams on both ends of the argument and skill spectrum, that the 100 point scale is not working. The one tournament I've been to where a 30 point scale with .1 increments was used was the easiest time I've had adjusting to a different point scale, and I felt that I really had freedom to use the whole range of options offered by the scale.
 
hansonjb
« Reply #9 on: November 13, 2009, 04:49:26 PM »

http://cedadebate.org/forum/harvard/more-on-the-100-(99)-point-scale/
antonucci23
« Reply #10 on: November 13, 2009, 05:58:00 PM »

A (mild) defense of the 100 point scale.  My feelings on this are pretty tepid, I admit - I implemented a .1 increment 30 point scale in a tournament in 04 or 05 and it appeared to work fine.

The primary defense is "forced destratification."  You have to change.  A 30 point decimal scale allows many judges to just go back to their pointing patterns of the last five years.  If some judges adopt nuance and others stay stuck in "block 28" mode, all the "nuanced" points simply register as statistical noise.  It's randomness.

Imagine a tied speaker point race for first going into round eight (or the third and final prelim, I guess, if present trends continue).  One debater is judged by a "nuance" judge.  The other draws a judge who is stuck on .5 increments, and either rounds their points down or rounds them up.  Both speakers perform identically, but quirks make the point sets either (28.8/29) or (28.7/28.5).

Totally arbitrary, random determination of the first speaker. It would be better, in fact, if the two judges gave the SAME points and let JVAR sort it out, because JVAR is slightly more meaningful.  The "nuance" judge FEELS as if they are assigning very meaningful points, because they've been trained to falsely foreground the "signaling" function of speaker points - they disregard the larger social context because they think it's all dumb math.  You feel free, you feel awesome and all precisely calibrated, but you actually screwed everything up, unless a certain critical mass of judges joins you.

The RKS scale forces individuals to re-evaluate their habits.  Absent some external stimulus, judges are often inherently resistant to change.  It probably helped shame outliers who somehow presumed that their idiosyncratic point scale "meant" something aside from total randomness.  It forced tournaments to publish rubrics (a practice that probably should have started years ago.)  The idea that we all "know" what constitutes a 28 forestalled a necessary conversation on this subject for years.  Slipping back into .1 might make us all feel more grounded, but I'm unconvinced that grounding's meaningful absent a critical mass that recognizes the defects of a scale with five increments.

BTW: The solution:

Use a 10 point scale with decimal increments.  For some reason I can't define, a "7" seems pretty good to me, whereas a "70" sounds terrible.

Call this the "hotornot" scale.
« Last Edit: November 13, 2009, 06:02:57 PM by antonucci23 »
hansonjb
« Reply #11 on: November 16, 2009, 04:14:27 AM »

at the wnpt, a 23 team regional tournament that i just ran, i decided to go with our traditional 24-30 scale but with .1 increments.
i'd say 75% of the judges used the .1 increments. 25% used it infrequently or not at all.

here's the distribution--pretty standard, consistent but with differentiation (there are not many tie-breakers needed after drop h/l):

Drop H/L   Total Points   Z-Score
114.4   171.3   169.74
114.12   171.12   171.62
113.6   170.3   168.64
113.3   170.3   170.72
113.2   169.8   170.4
112.8   169.8   170.64
112.58   169.08   169.51
112.2   168.6   169.11
112   167.9   168.22
111.96   167.76   168.05
111.4   166.6   167.08
111.2   166.9   167.47
111.2   166.8   166.41
111.1   166.5   167.07
110.92   166.92   166.6
110.7   165.3   165.57
110.5   165.5   166.17
110.5   165.5   165.92
110.4   165.5   165.97
110.34   165.84   165.5
110.3   165.4   165.55
110.3   165.4   165.1
110.3   164.8   165.87
109.9   165.1   164.47
109.8   164.8   165.52
109.8   164.6   165.42
109.7   164.2   163.89
109.6   164.8   165.11
109.5   164.3   164.16
109.5   163.9   164.73
109.4   164   164.82
109.4   164   164.75
109.4   163.9   163.95
109.3   164.1   163.33
109.3   164   162.51
109.16   162.96   163.78
109.1   163.6   163.4
108.9   162.7   163.51
108.78   163.08   163.8
108.7   162.7   163.31
108.56   162.96   162.94
108.32   162.72   162.5
108.06   162.36   162.01
107.2   160.2   160.28
106.68   160.08   160.91
106.6   159.6   159.44
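For reference, the two left columns of the table can be reproduced from a speaker's round-by-round scores: "Total Points" is a plain sum, and "Drop H/L" discards the single highest and single lowest score before summing (the z-score column depends on each judge's point distribution and isn't sketched here). The six round scores below are made up for illustration.

```python
# Sketch of the two tiebreaker columns in the table above.
def total_points(scores):
    return sum(scores)

def drop_high_low(scores):
    # drop one highest and one lowest score, then sum the rest
    return sum(scores) - max(scores) - min(scores)

rounds = [28.6, 28.5, 28.7, 28.4, 28.6, 28.5]  # hypothetical 6-round line
print(round(total_points(rounds), 1))   # 171.3
print(round(drop_high_low(rounds), 1))  # 114.2
```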

hansonjb
« Reply #12 on: November 16, 2009, 02:16:19 PM »

for comparison, here are the 2008 wnpt speaks; a lot more ties.
24-30 scale. .5 allowed. why .9 and .2 below? because of byes.

Drop H/L   Total Points   Z-Score
113.5   170   170.75
113.5   169.5   168.37
112.5   168.5   168.04
112   168   167.76
111.9   167.4   167.52
111.5   167   167.6
111   167   165.66
111   166.5   166.6
111   166.5   165.64
110.6   165.6   164.56
110.5   165.5   164
110.1   165.6   167.41
110   164.5   164.12
110   164   163.74
109.9   164.4   162.99
109.5   164.5   164.97
109.5   164.5   161.97
109.4   164.4   165.08
109.4   164.4   162.58
109   163.5   165.45
109   163.5   162.81
109   163.5   162.68
109   163.5   162.31
109   162.5   161.3
108.8   163.8   165.27
108.7   163.2   161.94
108.5   163.5   163.79
108.5   163.5   162.61
108.5   162.5   162.77
108   162.5   160.97
108   162.5   160.47
108   162   162.75
108   162   160.4
107.3   160.8   163.64
107.3   160.8   162.03
106.5   159.5   161.4
106   158   160.94
105.7   157.2   157.71
105.7   157.2   157.71
105.2   157.2   158.27
104.6   156.6   157.82
96   144   147.68 (this was not a real debater; it was a maverick's "partner")
« Last Edit: November 16, 2009, 02:18:21 PM by hansonjb »

hansonjb
« Reply #13 on: November 16, 2009, 02:21:06 PM »

one final note: do the higher numbers for 2009 indicate the .1 system = higher scores? i doubt these numbers justify that conclusion. the 2009 tournament had, i believe, stronger top teams and the 2008 tournament had more "novice" teams. i believe that explains the higher numbers for the 2009 tournament, not the scale.