Instigator / Pro
Points: 4
Rating: 1470 | Debates: 50 | Won: 40.0%
Topic #2070

The ELO ranking system is an inadequate method of surveying one's skill level on DArt

Status
Finished

The debate is finished. The distribution of the voting points and the winner are presented below.

Winner & statistics
Better arguments: Pro 0, Con 3
Better sources: Pro 2, Con 2
Better legibility: Pro 1, Con 1
Better conduct: Pro 1, Con 1

After 1 vote and with 3 points ahead, the winner is...

Jeff_Goldblum
Parameters
Publication date:
Last updated date:
Type: Standard
Number of rounds: 3
Time for argument: Two days
Max argument characters: 10,000
Voting period: One week
Point system: Multiple criterions
Voting system: Open
Contender / Con
Points: 7
Rating: 1634 | Debates: 13 | Won: 80.77%
Description

No information

Round 1
Pro
#1
I would like to thank Jeff Goldblum for accepting this debate and defining the terms for me in the comments while I was away.

Here is his definition/description. We have both agreed to use it: I accepted it, and he is bound by it because he voluntarily posted it.
ELO is a score attached to all our profiles. It starts out at 1500 and adjusts in response to our wins, losses, and ties. Under this system, it's a bigger deal to beat someone with a higher score than it is to beat someone with a lower score.
It's derived from Chess rankings and was developed by someone named Elo. There's a whole Wikipedia page devoted to it.
On DART, our rankings on the Leaderboard are determined by ELO score.
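To make this definition concrete, here is a minimal sketch of a standard Elo update, assuming DART follows the classic formula; the site's actual K-factor and tie handling are not documented here, so the K value and the numbers printed are illustrative only.

```python
# Minimal sketch of a standard Elo update (assumption: DART follows the classic
# formula; its real K-factor and tie handling are unknown, so K = 32 is a placeholder).
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update(rating: float, opponent: float, score: float, k: float = 32) -> float:
    """Return the new rating after one result (score: 1 = win, 0.5 = tie, 0 = loss)."""
    return rating + k * (score - expected_score(rating, opponent))

# Beating a higher-rated opponent moves your score more than beating a lower-rated one:
print(round(update(1500, 1600, 1) - 1500, 1))  # roughly +20.5 points gained
print(round(update(1600, 1500, 1) - 1600, 1))  # roughly +11.5 points gained
```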
I will start my points now. My format here may seem a little janky, but I hope my opponent can follow it.

1. Elo score does not necessarily survey debating skill.
  • Elo goes up when you win a debate.
  • Elo goes down when you lose a debate.
  • Elo is affected by winning/losing.
  • Elo stays the same when the debate is on Unranked mode.
    • This example is one of my unranked debates; had I won it in ranked mode, in theory I should be somewhere around 1450 instead of as low as I am now (1427).[1][2]
    • This would show that if a person always debates in Unranked mode, no matter how good or bad he is at debating, he stays at 1500 no matter what (until he accepts a ranked debate and wins or loses).
  • People are, presumably, at different skill levels. There are legends like Ramshutu, Oromagi, and RM, and then there are trolls at the bottom of the barrel like Ramdatt.
    • So, if I am a very good debater, but I am inferior to those legends, yet I only pick ranked debates with those legends and lose most of the time, I will have a 1400 Elo even though my skill is at least top 30.
  • People may have different Elo on different accounts, and different people with different skill levels may have the same Elo. 
    • On DArt, PinkFreud and LittleCookie are the same person, so the accounts share the same skill. However, PinkFreud sits at 1614 Elo and LittleCookie at 1500.[3][4]
    • BlueCrystal and 3RU7AL both have a 1500 ranking, but BlueCrystal only offered minimalistic, single-line responses, while 3RU7AL exhibited coherent logic, and the only vote he received awarded him the argument points.[5][6]
      • Because these two people have different skill levels,
      • And they have the same Elo,
      • This would mean the Elo system is not an adequate method for surveying people's skill level on DArt, at least for this example. 
    • Envisage, an intelligent debater from DDO[7], came here and forfeited all of his debates.[8]
      • If a person chooses not to try in debates, that is not a sign that the person is bad.
    • One debate called "The Earth is Round" got a number of votes, but Con failed to respond at all, and Pro won without exhibiting skill in debating.[9]
      • If a person wins a debate without exhibiting much skill, that is not a sign that the person is good.
  • Elo depends on how people vote. If an intelligent debate with an obvious winner gets zero votes, then even though the "virtually winning" side is more skilled than the "virtually losing" side, neither receives any judgment (and if both are users who just fled from DDO, they will be rated the same under Elo even though one is better than the other).
    • The people with the most Elo tend to debate more, because more debates won = more points added. Skilled debaters who rarely debate will not accumulate the same number of points as debaters at the same level who debate more. Elo is effective at counting how many debates are won here, and people may conclude that more debates won = better; however, these examples suggest otherwise.
  • Debating is not chess; the two can be very different. Someone who is good at rap battling may not be good at regular logical debating.
  • I do not need to give a better example, as I only need to prove why Elo is an inadequate system in this sense.
I rest my case for R1. I welcome my opponent to respond.

==SOURCES==
Con
#2
(this text was drafted before I read User's opening round. I intend to begin rebuttals in my R2)

--

Thanks to my opponent for setting up this debate. As I indicated in the comments, I am excited to participate.

My argument proceeds in two broad parts: first, I frame the debate, not only by offering definitions, but also by arguing for basic conceptions of debating skill and the role of ELO as a measurement. With the groundwork laid, I then proceed in the second section to demonstrate how DART’s ELO should be understood as an adequate measurement of skill.

SECTION 1: ADEQUACY, SKILL, AND OPERATIONALIZATION
Adequacy
The word “adequate” is used to describe something that is “satisfactory or acceptable in quality or quantity.” (source) I draw voters’ attention to the fact that satisfactory/acceptable is not the same thing as ideal, perfect, or “the best.” This is important to recognize at the outset. My opponent may put forth examples of ELO failing to be perfect, but this is not enough. My opponent must show that ELO is worse than imperfect - he must show it to be outright unacceptable. Obviously, my role is the opposite: I need only to defend ELO’s acceptability.

Skill
In this context, when we say skill, we refer to the skill of a debater on this site. So, what makes for a skilled debater? Of course, we all know a skilled debater when we see one. Similarly, we know a poor debater when we see one. Simply put, we all share an intuitive understanding of what makes a debater skilled or unskilled. The challenge my opponent and I face - and the challenge ELO faces - is meaningfully articulating this intangible sense.

Before I turn to a direct defense of ELO, it will be beneficial to list what I consider to be the attributes of a skilled debater. I expect these will be non-controversial, as I think they will comport with everyone’s intuitions. My list:

-Quality arguments = a debater who puts forward logical, clear, and relevant arguments is demonstrating skill as a debater

-Quality writing = a debater who can articulate their arguments with concise, understandable writing is demonstrating skill as a debater

-Consistency = a debater who has a long track record of demonstrating the above attributes is further proving their skill as a debater (why? The longer someone’s impressive track record, the more confident we can be that their skill is durable and not just a “flash in the pan”).

Operationalization of “Skill”
In the realm of social science, operationalization is defined thus:

Operationalization is the process by which a researcher defines how a concept is measured, observed, or manipulated within a particular study. This process translates the theoretical, conceptual variable of interest into a set of specific operations or procedures that define the variable’s meaning in a specific study.
The referenced source goes on to use the study of aggression as an example. We know what aggression looks like, but it’s qualitative - so how do we measure it objectively? This requires researchers to create a series of metrics that produce a measurement for the concept.

In simpler terms, operationalization is how we create measurements for a qualitative phenomenon. In the case of this debate, ELO is an operationalization of debate skill, which is why I am discussing the term.

Again, I remind voters of the adequacy standard. In order to really get at the question of ELO’s adequacy, we need to ask ourselves: what makes for an adequate operationalization of a phenomenon?

I contend that an adequate operationalization is one that produces results that generally comport with our intuitive understanding of the phenomenon in question.

I say “results that generally” match our intuition, because operationalization will never be perfect. By its nature, the process of operationalization loses some of the phenomenon’s qualitative richness in its pursuit of objective measurement. Even so, operationalization is valuable because it allows us to assess broad trends and support claims with objective data. If we refused to quantify anything, we’d be left in an ambiguous state where each person’s personal opinion carries no more weight than another’s. Therefore, operationalization, though imperfect, is a valuable tool for understanding reality in an objective, consistent manner.

I don’t want to belabor this discussion of operationalization as a concept much longer, but I do want to offer an example of operationalization. It’s my hope this example will make clear to voters why I define an adequate operationalization as one that generally comports with our understanding of the phenomenon in question.

Let’s say that a good soldier is someone who’s effective in combat, follows orders, and makes a positive contribution to squad morale. Now, these attributes are, to varying degrees, qualitative and intangible phenomena. We know it when we see it, but measuring it quantitatively is another matter entirely.

Let’s say the Army creates a series of tests for the purpose of measuring these qualities. Perhaps combat efficacy is measured by performance at the shooting range, “follows orders” is measured by a psychological evaluation score, and “makes a positive contribution to morale” is measured by anonymous survey responses from the soldier’s comrades.

Finally, let’s say Army commanders, after seeing these tests in action for a while, decide these measurements do a pretty good job of predicting who will be a good soldier and who won’t. Of course, these tests aren’t perfect, because they fail to capture certain nuances or special (mis)attributes a soldier may have, but on balance, they are good at determining the quality of a soldier. If this is the case, we have an example of an adequate operationalization, because the test results generally comport with the commanders’ intuitive sense for good soldiers and bad soldiers.

If instead the tests were terrible at predicting the quality of a soldier - if soldiers with high scores were frequently perceived as poor soldiers and vice versa - the commanders would quickly decide the tests were failing to adequately operationalize the concept of “good soldiering.”

Similarly, the test for DART’s ELO is whether the scores and rankings generally comport with our intuitive sense for what makes a good debater. If high scores are generally held by people we acknowledge are skilled and low scores are generally held by people we acknowledge are unskilled, the ELO rating system is adequate. If the opposite is true - if ELO scores generally fail to comport with our intuitive understanding of what makes for a good debater - then DART’s ELO is an inadequate operationalization of debate skill.

SECTION 2: EVIDENCE OF ADEQUACY
ELO is not the only operationalization of skill we can call on. For this debate, in fact, I’d like to compare win ratio to ELO score. First, allow me to justify this approach.

As I outlined in Section 1, our intuitive sense for what constitutes a skilled debater involves three broad attributes: quality arguments, quality writing, consistency. The better a debater is in each of these qualitative categories, the more skilled we will generally assess them to be. I contend that as a matter of common sense, debaters who generally put forth quality arguments, write well, and perform at a high level with consistency will consequently win more debates than those who put forth poor arguments, write poorly, and perform inconsistently. Thus, win ratio is a rough measurement for debater skill.

Obviously, I acknowledge that a skilled debater could have a worse win ratio than an unskilled debater, for whatever reason. But I maintain that, generally speaking, skilled debaters will have better win ratios than unskilled debaters.

Proceeding on this premise, we can test whether ELO scores generally comport with our sense for what makes a good debater. Since good debaters win more often than bad debaters, high-score ELO debaters should have a better win ratio than low-score ELO debaters, if ELO is working as an adequate measure of skill.

Based on my calculations, the top 10 debaters on this site have an average win ratio of 88.15%. Meanwhile, the bottom 10 debaters possess an average win ratio of 19.75%.

A tougher quantitative test for ELO would be to examine the win rates of those just below the top 10 debaters. Predictably, the next group, 11-20, sports a slightly lower win rate, coming in at an average of 84.59%. The next group below them, 21-30, possesses a win rate of 72.51%.
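For readers who want to check band averages like these against a leaderboard snapshot, a minimal sketch of the calculation follows. The entries shown are hypothetical placeholders rather than real DART data, so running it will not reproduce the exact percentages quoted above.

```python
# Hedged sketch of computing average win ratios per rank band from a leaderboard
# snapshot. The records below are hypothetical placeholders, not real DART data.
leaderboard = [
    # (rank, wins, losses): illustrative values only
    (1, 50, 5), (2, 41, 6), (11, 30, 7), (12, 27, 8), (21, 22, 9), (22, 19, 10),
]

def average_win_ratio(entries, lo, hi):
    """Mean win percentage for debaters ranked lo..hi (inclusive)."""
    ratios = [w / (w + l) for rank, w, l in entries if lo <= rank <= hi]
    return 100 * sum(ratios) / len(ratios)

print(round(average_win_ratio(leaderboard, 1, 10), 2))   # top band
print(round(average_win_ratio(leaderboard, 11, 20), 2))  # ranks 11-20
print(round(average_win_ratio(leaderboard, 21, 30), 2))  # ranks 21-30
```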

Clearly, ELO ranking is correlated with win rate. Since we know as a matter of common sense that win rate is a rough predictor of skill, we can further say that ELO scores generally comport to our intuitive understanding of what it means to be a skilled debater.

CONCLUSION
Let me be clear: I am not saying ELO rankings are perfect. You can scroll around the leaderboard and point out a debater who is ranked in the 11-20 group that is just as good as someone in the 1-10 group. I’m sure the same could be done down to the 21-30 group.

I am not claiming that ELO is an exact, perfect measurement of your quality as a debater. ELO is just one part of your overall “portfolio,” if you will.

And again, it’s not my job to argue that ELO is perfect in all cases. It’s not my job to demonstrate conclusively that the number 1 ranked debater is better than the number 2 ranked debater, and so on and so forth.

Rather, my job is to demonstrate that ELO is an adequate measurement (or, “operationalization”) of skill. I have argued that for a measurement to be adequate, its results must generally comport to our intuitive understanding of what makes for a skilled debater. By demonstrating the positive correlation between win rate and ELO, I believe I have done so.

I look forward to my opponent’s reply. Thank you for reading.
Round 2
Pro
#3
I am starting my argument this late in the day because I have only now finished all the work I needed to do today.

1. My opponent has failed to cite his sources. 
Look at the previous argument: sources are nowhere to be found. Either way, my sources are real examples that support my claim, while my opponent's sources are just elaborate definitions.

2. Elo was not meant to indicate or survey skill. It is meant to indicate a person's wins and losses.
  • Look at one example
    • This example mainly shows why Elo is not a good system for video games. However, the reasoning behind it can also be applied to debating.
    • This supports my idea: if a former DDO veteran just transferred to this site (with, of course, 1500 Elo) and immediately won against a DArt veteran with 1600 Elo, would you say that the former is actually a 1500-Elo debater? No! The former is above 1500 Elo based on his true skill.
      • The text states that: For starters, the Elo system was created to gauge player skill on a 1v1 basis. This is pretty simple to accomplish because the players have full control over their destiny. If Player A, rated at 1000 Elo, beats Player B, rated at 1400, there should be a substantial swing in both player’s ratings because Player B was clearly supposed to have no problems beating Player A. When there isn’t any random element involved in the game, like chess, this makes complete sense. Player B is expected to consistently play at a 1400 rating level. Seeing as how he lost to someone with only 1000 Elo, he’s clearly no longer playing at a 1400 rating.
      • Assume the DDO veteran has a skill level equivalent to 1629 Elo in DArt terms. The DArt veteran should then lose fewer points than he did, because he is really debating someone at a higher level than it appears. Likewise, because DDO veterans are winning debates while listed at 1500 Elo, they are gaining points much more quickly than they should, because their real level is above 1500. Their Elo no longer reflects their skill (see the worked sketch after this list).
  • I extend all of my arguments made above, since my opponent never attempted a refutation during this two-day period. Everything you see here is built on top of the last argument.
  • The more often you win, the more "skilled" you are, according to Elo. 
    • Oromagi is the top debater on the entire site because he debates very frequently and wins nearly every time. The DArt base clearly thinks Bsh1 and Blamonkey are more skilled than Oromagi, because the quality of those two debaters' work is better than Oromagi's average debates. Oro has admitted he has sniped noobs and trolls on the site because he wants to boost his record. I admit that Blamonkey has also debated less experienced noobs and I-give-up forfeiters, but Blamonkey's quality is higher. Bsh1, on the other hand, was one of the most skilled people on DDO, and the reason he is not ranked as highly as we would expect is that he LEFT THE SITE. Oro has also admitted that other debaters are better than him and was surprised that he beat every single one of them.
    • CONCLUSION: THE TOP DEBATERS ARE NOT NECESSARILY THE MOST SKILLED. THEY ARE JUST THE MOST FREQUENT WINNERS.
  • CONCLUSION: IF THE TOP DEBATER ISN'T THE MOST SKILLFUL, IT MEANS THIS SYSTEM SHOULD BE REVISED. 
  • Also, debates can range from a full forfeit to a true clash of ideas. The "round Earth" debate cited last round should not boost Pro's ranking, because he did not exhibit any skill. Unvoted and unranked debates should add something to both users' accounts based on skill, especially when it is clear who won or lost. If unranked and tied debates exist, it means both users might not have an accurate ranking. A truly skill-based ranking system should have an AI determine who won and who lost instead of people voting on it, because soon enough the winners will learn to appease the most frequent voters (Oromagi did it) instead of exhibiting the most potential. The goal is shifted.
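To make the ratings arithmetic in the list above concrete, here is a hedged worked example using the standard Elo expected-score formula; the K-factor of 32 and the specific ratings are assumptions for illustration, not DART's documented values.

```python
# Worked example of the "under-rated newcomer" point above, using the standard
# Elo formula. K = 32 and all ratings are illustrative assumptions only.
def expected_score(rating_a: float, rating_b: float) -> float:
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

K = 32

# A newcomer listed at 1500, whose "true" level is roughly 1629, beats a 1600 veteran.
nominal_gain = K * (1 - expected_score(1500, 1600))  # points actually awarded (~ +20.5)
true_gain = K * (1 - expected_score(1629, 1600))     # points a 1629 player would deserve (~ +14.7)
veteran_loss = K * (0 - expected_score(1600, 1500))  # what the 1600 veteran is docked (~ -20.5)

print(round(nominal_gain, 1), round(true_gain, 1), round(veteran_loss, 1))
```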
Rebuttals coming up.

-Quality arguments = a debater who puts forward logical, clear, and relevant arguments is demonstrating skill as a debater

-Quality writing = a debater who can articulate their arguments with concise, understandable writing is demonstrating skill as a debater

-Consistency = a debater who has a long track record of demonstrating the above attributes is further proving their skill as a debater (why? The longer someone’s impressive track record, the more confident we can be that their skill is durable and not just a “flash in the pan”).
Skill, and whatever else my opponent listed here, is qualitative, not quantitative. Debates are vastly different from each other, and no quantitative system can adequately judge them, because it treats everything the same. Bsh1 and Blamonkey are better than Oromagi on the first two points, so I want my opponent to justify Oro's position. 

I say “results that generally” match our intuition, because operationalization will never be perfect. By its nature, the process of operationalization loses some of the phenomenon’s qualitative richness in its pursuit of objective measurement. Even so, operationalization is valuable because it allows us to assess broad trends and support claims with objective data. If we refused to quantify anything, we’d be left in an ambiguous state where each person’s personal opinion carries no more weight than another’s. Therefore, operationalization, though imperfect, is a valuable tool for understanding reality in an objective, consistent manner.
We have something that was not meant to survey skill, and it does not work. I would prefer things to be left in an ambiguous state, because the Elo system does not truly represent debaters. If anything, I would prefer an AI telling me who won and who lost based on textual evidence; DebateIsland.com did that. This system could only work in a utopia in which everything correctly represents everyone. Because of the voting system on DArt, one winning factor is studying the frequent voters and learning to appease them, which was never part of the debate itself. Thus, Elo is shifted by the voters rather than by the debates. I thought I was getting a grade for my debating, when in reality the system has voters judging the winning and losing sides, something that should be done by a robotic system acting as a judge.

This is a last resort, and that does not mean it is justified. If a man thinks his only way out is stealing, does that mean he can get away with it because it was his "last resort"? Nopity nopity nope.

Similarly, the test for DART’s ELO is whether the scores and rankings generally comport with our intuitive sense for what makes a good debater.
We on this site do not think Oromagi is better than Bsh1. Your claim implies that Oromagi is, without a doubt, better than Bsh1. 

Based on my calculations, the top 10 debaters on this site have an average win ratio of 88.15%. Meanwhile, the bottom 10 debaters possess an average win ratio of 19.75%.

A tougher quantitative test for ELO would be to examine the win rates of those just below the top 10 debaters. Predictably, the next group, 11-20, sports a slightly lower win rate, coming in at an average of 84.59%. The next group below them, 21-30, possesses a win rate of 72.51%.
Forfeiting does not mean your skill is low. It just means you are not trying. If RM, the Rational Madman, didn't forfeit at all, he would be much higher on the leaderboard. Many skilled debaters went down the rankings because they forfeited while busy, and that does not constitute a lack of skill, because they really CAN debate.

And again, it’s not my job to argue that ELO is perfect in all cases. It’s not my job to demonstrate conclusively that the number 1 ranked debater is better than the number 2 ranked debater, and so on and so forth.
No. Your job is to tell me why Elo is adequate. So far, I have proved that Elo has no direct correlation with skill, which defeats its whole purpose. If Elo has no correlation with skill, then it should simply be revised, replaced, or scrapped. 

I am done for round 2. I thank my opponent for providing a meaningful discussion. 

Sources:


Con
#4
My thanks to any prospective voters who have stuck with us thus far. I hope you are finding the read worthwhile.

In this round, I will begin by rebutting my opponent's claims. To do so, I will start first with miscellaneous quote-and-replies. Then, I will proceed to identify my opponent's main points and respond to those points on the general level. After offering these refutations, I will conclude this round by defending the status of my R1 argument.

Miscellaneous Quote-and-Replies
Look at the previous argument: sources are nowhere to be found. Either way, my sources are real examples that support my claim, while my opponent's sources are just elaborate definitions.
The first sentence claims I provide no sources, while the second seems to admit that I do. Therefore, I'm not exactly sure what my opponent's critique of me is, but let me be clear: I cite my sources in R1 (hyperlinks). The ELO calculations I offer can be verified by crunching a handful of numbers off the leaderboards. As for the idea that my sources are "just elaborate definitions," I find them quite important, as I will reiterate this round.

The DArt base clearly thinks Bsh1 and Blamonkey are more skilled than Oromagi, because the quality of those two debaters' work is better than Oromagi's average debates.
I politely request that my opponent prove Bsh1 and Blamo are better than Oro. At minimum, I request that my opponent prove "The DArt base clearly thinks" this is so.

Oro has admitted he has sniped noobs and trolls on the site because he wants to boost his record.
In the forum thread in question, Oro also defends his refusal to solely debate other elites because he wouldn't be able to do that many debates as a result (see post #32 if you're interested in Oro's defense). I think there's a risk of selectively quoting from this very long thread, so I wanted to point this out.

CONCLUSION: IF THE TOP DEBATER ISN'T THE MOST SKILLFUL, IT MEANS THIS SYSTEM SHOULD BE REVISED.
I politely request that my opponent justify this statement. According to my reading of the text that preceded it, he did not substantiate this claim. In particular, I am wondering why a measurement system must be so precise as to ensure that the #1 ranked debater will also be the undisputed "best" debater around in order to be considered adequate. Would it not be more reasonable to require that, for a measurement to be adequate, it need only generally comport to our intuitive understanding of the phenomenon in question?

So far, I have proved that Elo has no direct correlation with skill, which defeats its whole purpose.
I disagree with the statement I've bolded. As I will demonstrate soon, my opponent has only highlighted instances where ELO fails to perfectly reflect skill. He has not shown that ELO doesn't correlate with skill. There's a big difference between correlation and 1:1 precision. However, I do agree with the sentiment expressed when my opponent says a lack of correlation would defeat ELO's "whole purpose." I agree that ELO should correlate with skill. Further, I believe I have shown it does. Thus, ELO satisfies its purpose and should be considered adequate.

Look at one example
Having reviewed this source, it's not clear to me that the article is actually helpful to my opponent. The article criticizes the use of ELO in online multiplayer gaming. The basis of this criticism lies in the randomness that comes with online, team-based games. For example, what if one of your teammates rage quits? What if their internet is terrible? These things could impact your ELO even though they aren't genuine reflections of your skill. I don't think these critiques transfer to DART. Debates are 1 vs. 1 affairs that have little to do with luck.

Reply to My Opponent's Main Point
Though I am using my own words to describe my opponent's argument, I'm fairly confident voters will find my summary accurate. In essence, my opponent has been identifying circumstances wherein the ELO system fails to perfectly reflect skill. In other words, we could say my opponent is identifying "instances of system breakdown." In his view, the fact of these instances of system breakdown means ELO is an inadequate measure of skill.

A few examples my opponent has used to illustrate instances of system breakdown:
  • Some debates are unranked, thus, any skill exhibited therein will not affect the participants' ELO scores
  • Differences in opponent quality could result in a higher ELO debater actually being less skilled than a lower ELO debater
  • Many noobs who start out with 1500 ELO may not deserve this score, for whatever reason
  • Sometimes people forfeit. Sometimes debates don't get votes
I have absolutely no problem agreeing with User that these are instances where ELO will fail to perfectly reflect skill. However, this does not mean I have lost the argument. In order to show why my position can withstand these instances of system breakdown, I need to briefly reiterate my R1 argument.

I summarize my R1 argument as follows:

  • ELO is an operationalization of debater skill
  • A good operationalization is one that produces results that generally comport with our intuitive understanding of the phenomenon in question
  • ELO scores generally comport with our intuitive understanding of what makes for a good debater
  • Therefore, ELO ought to be considered an adequate measurement of debater skill on DART
With all due respect, I don't think my opponent has come close to addressing the key elements of my argument. It is true my opponent quoted some parts of my R1 and offered replies, but these replies were essentially a rehash of his pre-existing points. That is to say, he replied to my argument by identifying instances of system breakdown. This is not helpful to him, at least in my view, because my argument is resistant to instances of system breakdown. In fact, in my R1, I went out of my way to make clear that ELO does not need to perfectly represent skill in all circumstances to be considered an adequate measurement:
I am not claiming that ELO is an exact, perfect measurement of your quality as a debater. ELO is just one part of your overall “portfolio,” if you will.

And again, it’s not my job to argue that ELO is perfect in all cases. It’s not my job to demonstrate conclusively that the number 1 ranked debater is better than the number 2 ranked debater, and so on and so forth.

Rather, my job is to demonstrate that ELO is an adequate measurement (or, “operationalization”) of skill. I have argued that for a measurement to be adequate, its results must generally comport to our intuitive understanding of what makes for a skilled debater. By demonstrating the positive correlation between win rate and ELO, I believe I have done so.
To put the matter simply: a measurement can fail to be 100% accurate at all times yet still generally comport with our intuitive sense of the phenomenon it's measuring. I believe I have demonstrated that ELO produces rankings which generally comport with our intuitive sense of the phenomenon in question. Therefore, my opponent's identification of instances of system breakdown need not sabotage my position. My argument, structurally speaking, can withstand the points my opponent has made thus far. Since he has not attacked the basics of my argument itself, I believe my position is secure (at least at this point in the debate).

Conclusion
  • My opponent has identified instances of "system breakdown." That is to say, he has identified situations where ELO will fail to 100% accurately reflect skill.
  • My argument can withstand these points.
  • My opponent has not substantively addressed the key elements of my argument.
  • Therefore, my claim that ELO is an adequate measurement of skill should be considered to stand (at least at this point in the debate).

Round 3
Pro
#5
I thank Jeff Goldblum for giving a detailed argument. However, the debate is still a debate, and I will still try to do my best. The sentence before this is not an argument and shall not be treated as one. 

1. Leaderboards' purpose

It is a truism that ELO is used in the leaderboards.[1]


So, Elo is the main system behind the online ranking of debaters. This also applies to DDO, but the situation there is more catastrophic than on DArt because DDO has been in anarchy for two years or more. 

Now, what is the purpose of a leaderboard? Online articles explain this better than I can, so I will let the source talk.[2] 

Leaderboards are a visualization of achievement. The purpose of a leaderboard is to show people where they rank in a gamified system. The leaderboard shows them where they stand in relation to their peers.
So, if Elo were correctly made, Oromagi would undoubtedly be better than RM, Ramshutu, and Ragnar. However, that is not the case at all. My opponent has even agreed with my examples showing that Elo does not rank debaters exactly or objectively. The purpose of the leaderboard is to visualize how good you are in this field, but DArt's leaderboard mainly shows how many debates you have won (more wins = better...?), except that not all people are the same, and not all debates are the same. The leaderboard on DArt, which runs on Elo, does not necessarily survey skill. Many debaters who are good are not getting the ranking they deserve because they are only here once in a while (Blamonkey, etc.), while someone equally good or not even as good might get a higher ranking (Oromagi, etc.). Thus, in order to get to the top of the leaderboard, one would focus on quantity (the number of debates won) instead of quality (skill).

A decent amount of skill plus frequent visits to this site means a lot of wins, and a lot of ranking points. Oromagi is an example: he even admitted that Ragnar and Bsh1 are better, yet his ranking is above theirs. Leaderboards aren't tiered lists; they are supposed to measure exactly. 

If you know the person at the top has a score of 900 points, you'll know it's at least possible to increase your skill to reach that higher score.
The strategy? More debates! Showing your skill in the forums (as Grayparrot and 3ru7al do) changes nothing in the ranking. DArt only changes your ranking based on how much you have won, and noob sniping is an unfairly advantageous strategy that cannot be banned. Increasing your skill while not debating will not increase your ranking in any way. 

A common use for a leaderboard could be to represent a sales team. Each item in the list would represent a sales person and their sales over a period of time. 
Debating is not as simple as sales. You have to take skill into account, and those with little skill who are paired up with forfeiters will win debates without exhibiting skill. Overall, I have no system that ranks debaters adequately, and there is none currently; definitely not Elo. 

Conclusion: Leaderboards are meant to measure one's skill exactly, and Elo is not doing that. It may hint at higher skill, but it is not exact. 
Just to be sure, many other webpages also suggest this:
  1. Leaderboards should be able to tell you who is best at a game.
  2. Leaderboards should tell you who has the best time in a game.
The purpose of a leaderboard is to show users where they rank in a gamified system. Those at the top enjoy the notoriety it brings; as for everyone else, the leaderboard shows them where they stand relative to their peers.
Rebuttals:
(Also, my opponent forgot to put the sources AGAIN)

Oh wait, there is no need for rebuttals, because I went for the roots of my opponent's case instead of its ends. Everything above suffices to respond to my opponent's points. 

Sources: 






Con
#6
Another thanks is due to any readers who are still with us. I hope you have enjoyed the debate.

Once again, I will begin my argument with miscellaneous quote-and-replies, followed by a reply to my opponent's overall point in his R3. I will then conclude with a summary of my argument as well as a defense of its validity.

Miscellaneous quote-and-replies
(Also, my opponent forgot to put the sources AGAIN)
Just like last time, my opponent's critique isn't exactly clear. I have cited the sources I've used. In R2, I did not bring any new sources because I didn't think my argument required them, so there was nothing to cite.

Oh wait, there is no need for rebuttals, because I went for the roots of my opponent's case instead of its ends. Everything above suffices to respond to my opponent's points. 
If my opponent's R3 conclusion had been properly justified by argumentation, I would agree that he went after the root of my claims. However, as I will show, I think my opponent failed to mount an effective case against my position. As a result, he has effectively dropped my R2.

The Purpose of ELO
My opponent tries to argue that ELO should be held to an incredibly high standard: perfection.
Leaderboards are meant to measure one's skill exactly, and Elo is not doing that. It may hint at higher skill, but it is not exact. 
Though my opponent does not say so explicitly, he presumably finds this standard superior to the standard I advanced in R1: that, in order to be considered adequate, ELO should produce rankings that generally fit our intuition of what makes for a good debater. Since my opponent offered no direct attack on my proposed standard, all I need to do is rebut his proposed standard.

So, how does my opponent attempt to substantiate this standard? To my reading, his argument can be summarized as follows:
  1. His sources say ELO should be a perfect measurement of skill
  2. ELO is not a perfect measurement of skill
Re: #1
My opponent's R3 sources do not substantiate his claim that leaderboards should be a perfect measurement of skill. As he himself quotes, they say leaderboards measure rank in a gamified system. This is not the same thing as skill. According to this Wikipedia article on gamification, leaderboards "rank players according to their relative success, measuring them against a certain success criterion." Obviously the success criterion used to determine ranking may not necessarily 100% accurately reflect skill. This is important because my opponent relies on his sources to assert that ELO should be held to the standard of perfection, when, in fact, his sources say nothing at all about leaderboards needing to perfectly reflect skill.

I should also note that one of my opponent's sources is a blog post in which the author discusses the debate among video game speedrunners as to what their leaderboards should reflect in a gamer. It's not at all clear to me how this relates back to DART's ELO.

Re: #2
My opponent continues to harp on his point about Oro not being as good as some other people, thus showing that ELO is a failure:
The leaderboard on DArt, which runs on Elo, does not necessarily survey skill. Many debaters who are good are not getting the ranking they deserve because they are only here once in a while (Blamonkey, etc.), while someone equally good or not even as good might get a higher ranking (Oromagi, etc.). Thus, in order to get to the top of the leaderboard, one would focus on quantity (the number of debates won) instead of quality (skill).
In my R2, I requested that my opponent prove Oro is an inferior debater to Blamo/Bsh1. I asked him to do this because it seems to be a central theme throughout his argument. If he's going to continually claim that ELO is a failure because the #1 ranked debater isn't as good as some other people, he should be willing to do more than simply assert that Oro isn't as good as Blamo/Bsh1. Given that my opponent has provided no evidence to support this claim, I feel I have no more I need to say here.

The missing #3
There is something missing from my opponent's R3: an explanation for why an imperfect ELO should also be considered inadequate. As I indicated in R1, something is adequate if it is acceptable/satisfactory. Of course, my opponent could have claimed that imperfect=unacceptable, but he didn't explicitly do this. Therefore, we are left to assume this final point.

Simply put, because Pro does not tie his R3 argument back to the question of adequacy, his argument suffers.

Conclusion
In R1, I argued the following:
  1. ELO is an operationalization of debater skill
  2. An adequate operationalization will produce results that generally comport to our intuitive understanding of the phenomenon in question
  3. ELO produces rankings that generally comport to our intuitive understanding of debater skill
  4. Therefore, ELO ought to be considered an adequate measurement (or, "operationalization") of debater skill
In his R1 and R2, Pro pointed out instances where ELO fails to perfectly reflect skill. In my R2, I pointed out that my R1 argument could withstand these "instances of system breakdown," as I termed it.

Most recently, my opponent tried to argue that ELO should be held to the standard of perfection. To do this, he relied on sources that failed to support his claim. As such, my standard ("An adequate operationalization will produce results that generally comport to our intuitive understanding of the phenomenon in question") is the only man left standing, so to speak. Because my opponent did not make any meaningful attempts to address points #1 or #3, my argument remains intact and valid.

For any voters who are still on the fence, I leave them with the following (rhetorical) question:

Is it truly reasonable to deem a measurement adequate only if it perfectly reflects the phenomenon it is intended to measure?