College Policy Debate Forums
Author Topic: Possible Wake Forest experiment - please read and comment  (Read 9033 times)
hansonjb
« Reply #15 on: November 06, 2011, 11:26:57 PM »

hi gary

i'm just wondering what metrics you will use to determine that your new system equalizes opp difficulty better than current powermatching?

in my tests--i took the same tournament and ran it three ways
--1) the normal pairing powermatching--some high-low, some high-high
--2) sorting teams based on wins and then opp records (a very crude version of what gary is suggesting) and then pairing them within brackets
--3) sorting teams based on wins and opp records but powermatching within brackets of teams that had the same number of wins, or one more or one fewer win.

the normal way was the actual tournament results.

the 2 experimental ways were done with me projecting who would win based on overall results at the actual tournament and who had beaten whom in rounds at the tournament, and including a few minor and modest 'upsets' to simulate the typical 'unexpected' results at tournaments.

it was a 6 round tournament fyi and had about 30 teams.

as i noted--the approach within the same win bracket didn't do much, whereas powermatching within a larger bracket (within 1 win or loss) produced a modest to very noticeable improvement in equity of opponent difficulty (i tried the experiment three times with each approach). this is pretty logical since limiting the program to teams within a single bracket greatly narrows its choices for equalization (probably even more so at 'smaller' tournaments like the one i was testing).

my testing certainly raises a variety of questions about its own validity as a measurement, but at the time i thought it made its case. i gave up on the effort because it was very time-consuming, and i was hoping to develop a pairing algorithm that could actually fully implement it, but that never happened.
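
just to give a flavor of what i mean, here is a very rough python sketch of that 'looser bracket' idea (purely illustrative--the field names and the simple high-low choice within the loose bracket are made up for this post, not anything from an actual tab program):

[code]
# purely illustrative; 'wins', 'opp_wins', and 'opponents' are made-up fields
def loose_bracket_pairing(teams):
    # sort by wins, then opponent wins (a crude strength-of-schedule)
    pool = sorted(teams, key=lambda t: (t['wins'], t['opp_wins']), reverse=True)
    pairings = []
    while len(pool) >= 2:
        top = pool.pop(0)
        # eligible: within one win of 'top' and not already debated
        eligible = [t for t in pool
                    if abs(t['wins'] - top['wins']) <= 1
                    and t['name'] not in top['opponents']]
        if not eligible:
            eligible = pool  # fall back rather than leave a team unpaired
        partner = eligible[-1]  # lowest-sorted eligible team = high-low
        pool.remove(partner)
        pairings.append((top['name'], partner['name']))
    return pairings  # any leftover team would get the bye
[/code]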

i would think that comparing say g state, northwestern, harvard, etc. to wake won't work too well since the disparity of opp difficulty can vary a fair amount from tournament to tournament (which might even be a problem in my own testing--the 3 tests using the same method produced somewhat different opponent difficulty equalization each time).

so, i'm truly wondering how you can measure it and say it is better . . .

JIM: 2. have you tested your second proposal? i did some tests about 3 years ago of something quasi-similar--it was more just making the brackets looser so that teams within one win/loss of each other could be matched, and then pairing teams based on opp records--so definitely not the same, but in the ballpark of what you are suggesting. i tested it out and found a 10 to 25% improvement in the final opp difficulty for the teams (as in, teams with the same record were more equal in opposition difficulty). your approach is more nuanced since it uses seeding, and so may very well--probably will--do better.

oh wait, maybe your second proposal keeps the same brackets as exist now (e.g. 3-1's hit 3-1's, 2-2's hit 2-2's, etc.)? when i tried that, i got almost no improvement in equality of opposition difficulty--as i remember, it was about 5%.


GARY: First, let me make a quick observation about testing.  While I can do a lot of evaluation of the outcome of a completed tournament (e.g. Kentucky) and can look at what happens with alternative pairings of a single round, it is never possible to fully test an alternative in any lab other than a tournament.  Each pairing creates a set of winners and losers that creates different subsequent pairings.  There is no “controlled experiment.”  But that doesn’t mean that we’re throwing dice.  The conceptual basis for the experiment can be well understood (the goal of this dialogue).  And there will be a number of metrics that permit evaluation of the outcomes.  And for what it’s worth, there is very little downside risk in the experiment.  The range of anomalous things that already happen at a tournament (massive pull-ups in round 8 due to side skews, etc.) means that nobody could say that the experiment prevented the outcome that SHOULD have happened (namely, me clearing).

Regarding the last paragraph:  Yes, the proposal will continue to pair within W/L brackets.  I’ve actually tested an even more radical alternative that loosens the requirement that teams have the same record (we already have a lot of pull-ups), but those experiments created worse rather than better results.  And our goal is not that everyone in the tournament have more equal strength of opposition, but rather that teams with the same record or opportunity to clear should have similar strength of opposition.  In fact, at the end of the day, the ideal scenario would be a very high correlation between a team’s own seed and the average seed of its opponents: the top teams progressively face the toughest opposition.

jim hanson :)
seattle u debate forensics speech rhetoric
glarson
« Reply #16 on: November 07, 2011, 09:44:53 AM »

Reply to Jim,

I believe the following metrics represent appropriate tests:

1)  Range and distribution of strength of opposition within each W/L bracket.  Distributions within the 4-, 5-, and 6-win brackets are most important.
2)  Difference between the lowest strength of opposition among teams that clear vs. the highest strength of opposition among teams that don't clear.
3)  Spearman rank-order correlation between final seed and final strength of opposition.  While the correlation won't and can't be "perfect," it should generally be the case that higher-seeded teams also have a higher strength of opposition (even without including strength of opposition in the calculus that determines seeding - at present it is a relatively low and infrequently used tiebreaker).
4)  Round-by-round comparison of the difference between seeding and a hybrid metric that includes strength of opposition.  If the process is indeed "self-correcting," the difference should decrease over the course of the tournament, with an "ideal" (but unrealizable) outcome being that strength of opposition comprises a relatively small component in the determination of round 8 pairings.  This would be true if the system had already largely equalized strength of opposition prior to the round 8 pairing.  At Kentucky, the results were just the opposite: in each subsequent round, the difference between seeding and the hybrid statistic grew, indicating that ordinary power-pairing procedures did very little to equalize within-bracket differences in strength of opposition.
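
For concreteness, here is a rough sketch of how metrics 1 and 3 could be computed (a Python sketch with hypothetical field names - an actual tab export would of course look different):

[code]
from scipy.stats import spearmanr

# hypothetical team records: {'seed': int, 'wins': int, 'opp_strength': float}

def bracket_spread(teams, wins):
    """Metric 1: range of strength of opposition within one W/L bracket."""
    vals = [t['opp_strength'] for t in teams if t['wins'] == wins]
    return max(vals) - min(vals) if vals else 0.0

def seed_opposition_correlation(teams):
    """Metric 3: Spearman rank-order correlation of final seed vs. opp strength."""
    seeds = [t['seed'] for t in teams]
    opp = [t['opp_strength'] for t in teams]
    rho, _ = spearmanr(seeds, opp)
    # if seed 1 is the top seed, a strongly negative rho means that
    # higher-seeded teams also faced tougher opposition
    return rho
[/code]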

One other quick comment about method.  While I run lots of simulations, one of the inherent weaknesses is that I have to artificially assign wins and losses based on some metric like "favored team based on final tournament results."  But the rather good news about our activity is that what could be called "upsets" by any artificial procedure rather routinely happen.  At the end of the day, the best indicator of who should win any given debate round is "which team debated better in THAT DEBATE."
hansonjb
« Reply #17 on: November 07, 2011, 12:59:58 PM »

thanks gary.

jim hanson :)
seattle u debate forensics speech rhetoric
coach_hanes
« Reply #18 on: September 17, 2016, 01:46:01 PM »

Although this is an old thread, I just read these posts. The ideas are fascinating, and I would like to share some thoughts. (My background: I've never coded a tab program, but I do have a math degree from Columbia. Which is to say, I'm primarily treating this like any other math problem.) I did debate for 8 years and have coached for many years, too, and would love to see tournaments run the best way possible. Thank you to Jim, Gary, Jon, and others for all the work they've done over the years to improve debate tab.

Rather than simply recalibrate the pairing method (HH giving way to HL, opp seed vs. opp wins, etc.), maybe it's time to radically rethink the process. Computers can run giant optimization matrices in the blink of an eye, and multiple variables can be optimized simultaneously. No matter how much one tweaks HL pairings, they can't optimize multiple variables. If your basic technology is the HL pairing, there will always be "sloppy" pairings in your results, where a team that has had weak opponents gets another weak opponent, and also the opposite. For example, say team A is weak and has had strong opponents, and team B is weak and has had weak opponents. If you do HL pairings based on schedule strength, A and B might meet, setting up B with another weak opponent. On the other hand, say team C is strong and has had weak opponents, and team D is weak and has had strong opponents. If you do HL pairings based on strength, C and D might meet, setting them both up with opponents they don't deserve. There's just no way around this--except for optimization matrices. I even made a nifty visual of these problems and the solution: http://art-of-logic.blogspot.com/2009/08/another-way-to-visualize-strength-of.html

Here's how an optimization matrix works:

1. Generate for each team two variables: strength, x, and schedule strength, y. These can be based on whatever you like: wins, points, opponent wins, opponent seed, whatever. These would be updated before each round is paired.

2. Generate a matrix of all possible matches. If there are 60 teams and it's an even round (side-constrained), it will be a 30 x 30 matrix.

3. Code any blocked matches (same school, already met) with a 0.

4. Code the remaining matches with a score. Scores are scaled to make pull-ups more difficult but not impossible. For two teams A and B, the scores use a formula based on x_a, x_b, y_a, and y_b: the strength AND schedule strength of each team.

5. Pick optimal pairing (highest total score) out of the matrix. This is a known, solved problem in math--the Hungarian algorithm works nicely.
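
Putting steps 1-5 together, a minimal sketch in Python (assuming scipy is available; pair_round, blocked, and score are hypothetical names for this post, not functions from any existing tab program):

[code]
import numpy as np
from scipy.optimize import linear_sum_assignment

def pair_round(aff, neg, blocked, score):
    """aff/neg are the two side-constrained halves of the field (e.g. 30 and 30)."""
    n = len(aff)
    matrix = np.zeros((n, n))                # step 2: all possible matches
    for i, a in enumerate(aff):
        for j, b in enumerate(neg):
            if not blocked(a, b):            # step 3: blocked matches stay at 0
                matrix[i, j] = score(a, b)   # step 4: match-quality score
    # step 5: Hungarian algorithm; scipy minimizes cost, so negate to maximize
    rows, cols = linear_sum_assignment(-matrix)
    return [(aff[i], neg[j]) for i, j in zip(rows, cols)]
[/code]

(In practice, a large negative value rather than 0 would make a blocked match effectively impossible instead of merely unattractive.)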


The real trick is the formula for assigning the scores. In a rough sense, the formula is similar to abs((x_a - y_b)*(x_b - y_a)). A score is high only if each team's strength is very different from its potential opponent's schedule strength. A good matchup: team A is strong and has high schedule strength, team B is weak and has weak schedule strength. Each team rounds out the other's schedule. A mediocre matchup: team A as before, and team C is weak but has high schedule strength. Team A gets the right opponent, but team C would be screwed over. A bad matchup: team A as before, and team D is strong and has high schedule strength. Both A and D would get the wrong opponents to balance out their schedules. The scores would reflect that A vs. B is preferable to A vs. C, and both are preferable to A vs. D. Do this over all the matches, and the matrix optimization makes all the tradeoffs to produce the best overall pairing.

The formula I have come up with achieves this scoring outcome but is more elegant than what I listed above: (standard deviation of the strength of team A's opponents, including B) * (standard deviation of the strength of team B's opponents, including A). Standard deviation is a better measure than average opponent strength. A team that has met three average-strength opponents should not meet a fourth, but a team that has met one great opponent, one bad opponent, and one average opponent could meet a second average opponent. The standard deviation scores these two scenarios differently, while also ensuring the cross-mixing I described in the previous paragraph.
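
As a sketch, that score in Python (where 'strength' and 'opp_strengths' are hypothetical fields standing in for whatever measure you picked in step 1):

[code]
import numpy as np

def match_score(team_a, team_b):
    """Score a potential A vs. B pairing: the product of the standard deviations
    of each team's opponent strengths once the other team is added."""
    sd_a = np.std(team_a['opp_strengths'] + [team_b['strength']])
    sd_b = np.std(team_b['opp_strengths'] + [team_a['strength']])
    return sd_a * sd_b
[/code]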

It works, and it's not hard. My friend cut-and-pasted the Hungarian algorithm for me because I'm not a coder, and the scoring formula is simplicity itself. Computers are fast enough to do it even for very, very large tournaments. Furthermore, the formula can also factor in extra points for geographic diversity (a boost for two teams from different regions) or anything else you like. I've tested it in some experiments, and it works: at the end of the tournament, teams that break have more similar opposition strength.