Date: Fri, 21 Dec 2007 10:34:47 -0800
From: "Luca de Alfaro" luca@dealfaro.org
If you want to pick out the malicious changes, you also need to flag small changes.
"Sen. Hillary Clinton did *not* vote in favor of war in Iraq"
"John Doe, born in *1947*"
The asterisks indicate the changes.
On Dec 21, 2007 11:57 AM, Jonathan Leybovich jleybov@gmail.com wrote:

Yes, and I did not mean to include cases such as this, which involve the insertion of a few words that can radically alter the semantic content of a unit of text. But legitimate spelling corrections (which can easily be identified using any of the various spell-checker databases of common misspellings) do not. In short, I cannot imagine a case where changing "Senater Clinton" to "Senator Clinton" could constitute vandalism (the "smoother" algorithm should of course also take into account that if a "misspelling" appears repeatedly in an article, or better still, in articles on related subjects by different authors, it is probably a valid technical term or a proper name). I also cannot imagine how moving a large block of relatively self-contained text (i.e. a paragraph, since even parsing at the level of sentences is problematic given all the uses of the period '.') without modifying its interior could have any large semantic repercussions (readability is, of course, a matter for a different discussion ;-)
Again, these are mainly quibbles, but for the articles I sampled it was quite annoying to have my eye repeatedly drawn to a single orange word that represented nothing more than a minor, good-faith correction. And overall the system seems to work well!
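To make the "smoother" idea above concrete, here is a minimal sketch in Python; the misspelling table, the threshold, and the function name are illustrative assumptions, not part of any existing tool:

# Hypothetical sketch of the "smoother" heuristic described above. The
# misspelling table and the occurrence threshold are made-up examples; a real
# system would load a per-language database of common misspellings.
COMMON_MISSPELLINGS = {
    "senater": "senator",
    "recieve": "receive",
}

def is_probable_spelling_fix(old_word, new_word, article_tokens, min_occurrences=3):
    """Return True if replacing old_word with new_word looks like a routine,
    good-faith spelling correction rather than a change of content."""
    old, new = old_word.lower(), new_word.lower()
    # Not a known misspelling/correction pair: treat it as a content change.
    if COMMON_MISSPELLINGS.get(old) != new:
        return False
    # If the "misspelling" recurs throughout the article, it is probably a
    # valid technical term or proper name, so do not treat it as a typo.
    occurrences = sum(1 for token in article_tokens if token.lower() == old)
    return occurrences < min_occurrences

A change that such a predicate accepts could then be shown in a muted colour rather than the full orange, which is all the "quibble" above asks for.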
On Dec 21, 2007 3:05 PM, Luca de Alfaro luca@dealfaro.org wrote:

The point is that I wanted to make a language-independent tool. If you go into misspellings, you then need a spelling tool for each language you want to support, and you need to worry about support for special names, locations, etc. I did not want to go into that. Should we get into that? Would there be a marked advantage? I would be interested to know. Moving blocks is also tricky: I can cut-and-paste a "did not" ... and change the meaning of the destination. So, as you say, you need to look for self-contained blocks, but even changing the order of blocks can affect meaning...
Maybe a good proposal is the following:
1. Still flag for trust as we do now, paying attention even to minor changes.
2. When giving reputation, only give reputation to authors who contribute non-negligible amounts of text.
But we tried 2, and it did not work well: it decreased the predictive power of the reputation we computed. Many editors make mostly small edits, and under 2 they would not receive much reputation for their work. We found that valuing the authors of even small changes actually led to a better reputation system (as measured by its predictive power).
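For concreteness, here is a toy sketch of how option 2 above differs from crediting every author; the update rule, the names, and the threshold are invented for illustration and are not the algorithm actually used:

# Toy illustration of option 2 above: withhold reputation from authors whose
# surviving contribution is below a size threshold. The update rule itself is
# invented for illustration; it is not the rule used by the actual system.
def update_reputation(reputation, author, surviving_words, quality, min_words=0):
    """reputation: dict mapping author -> score
    surviving_words: how many of the author's words survived later revisions
    quality: +1 if the contribution was kept/endorsed, -1 if it was reverted
    min_words: 0 credits everyone; a positive value implements option 2."""
    if surviving_words < min_words:
        return  # under option 2, small edits earn no reputation at all
    reputation[author] = reputation.get(author, 0.0) + quality * surviving_words

With min_words > 0, the many editors who mostly make small fixes accumulate almost no reputation, which is exactly the loss of predictive power described above.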
Luca
On Dec 22, 2007 9:09 AM, Andrew Whitworth wknight8111@gmail.com wrote:
No, I don't think that you should. You would end up wasting a lot of server resources for a small potential gain in some languages, and you would run into massive complexity when it comes to dealing accurately with fringe cases. I would venture to guess that the precision lost in trying to account for spelling/grammar/whatever would outweigh the potential precision gained from it. Best to stick with an algorithm that is simple, elegant, not resource-intensive, and universally applicable.
--Andrew Whitworth
Luca de Alfaro wrote:

Thanks :-)
(About the server load: you are absolutely right, especially because using misspellings would transform a string-matching problem into one where the string matching has to be done modulo misspellings. We put a lot of work into making the string matching efficient, and it would be a big hit to have to account for misspellings.)
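A small illustration (not the actual matcher) of why this is a big hit: exact token matching can index the old revision once in a hash table, while matching "modulo misspellings" forces a similarity test per pair of tokens:

from difflib import SequenceMatcher

def exact_matches(old_tokens, new_tokens):
    # Roughly O(n + m): hash each old token once, then look up each new token.
    index = {}
    for i, token in enumerate(old_tokens):
        index.setdefault(token, []).append(i)
    return {j: index.get(token, []) for j, token in enumerate(new_tokens)}

def fuzzy_matches(old_tokens, new_tokens, threshold=0.8):
    # O(n * m): a misspelled token no longer hashes to the same key, so every
    # new token has to be scored against every old token.
    return {
        j: [i for i, old_token in enumerate(old_tokens)
            if SequenceMatcher(None, old_token, new_token).ratio() >= threshold]
        for j, new_token in enumerate(new_tokens)
    }

The function names and the 0.8 similarity threshold are assumptions for the sketch; the point is only the change in cost per word.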
Luca
On Saturday, 22 December 2007 at 19:49:10, John Erling Blad wrote:

Whether or not you factor misspellings into the trust metric is a choice, but not correcting for misspellings will give a suboptimal solution. It is important to note that you do this, and how it changes the system. I have no doubt that trust metrics will incorporate this as an option in the future, whether or not it is part of an official system. Likewise, I believe they will incorporate systems for weighting cooperation between users and the overall quality of articles. There is no easy single solution to this; the solution is a complex, connected, multivariate system.
John E Blad
Daniel Arnold wrote:
An automated system (regardless of which one) should never care about spelling:

a) Many citations are in outdated or non-standard orthography. This is especially true for German, which changed its orthography again just a few years ago. A system that gives an incentive to tamper with citations is bad.

b) There are assistive systems integrated into the browser (Konqueror has had this for many years, and Firefox now has spell checking as well). Furthermore, there is a Toolserver + JavaScript based solution that highlights probably misspelled words when reading an article (currently only for German, but it could be adapted to other languages): http://de.wikipedia.org/wiki/Wikipedia:Helferlein/Rechtschreibpr%C3%BCfung (see http://de.wikipedia.org/wiki/Bild:Rp_js_beispiel.png and http://de.wikipedia.org/wiki/MediaWiki:Gadget-Rechtschreibpruefung.js). A more advanced external tool is http://rupp.de/cgi-bin/WP-autoreview.pl. These tools can optionally be integrated into the Wikipedia interface via the gadgets extension.
So if you make it obvious to editors that there is something they should check, they will very likely change it. And if someone wrote a text with bad orthography, someone else gets reputation for spell-checking it; since that person did a review, it is absolutely right that the text gets more trust afterwards. (I know you will come up with examples of rubbish text that got corrected to the right spelling, but there are nonsense texts both with and without bad spelling.)
Arnomane
John Erling Blad wrote:

At the moment there are automatic classification algorithms that do and do not use grammars and vocabularies, with and without self-learning. Some of those systems are very advanced, and some have been around for several years. There have been discussions about purely statistical systems versus purely lexical systems, about whether such systems should be fully automatic or not, and about whether they produce biased results. The short version of the results is that fully statistical systems give poor results and should include some form of lexical analysis at some level. Guided systems become very work-intensive and should be avoided if possible.
For examples of what automatic text analysis systems can do, see http://www1.cs.columbia.edu/nlp/projects.cgi

For advanced automatic classification engines, see http://www.autonomy.com/content/home/index.en.html and http://www.cyberwatcher.com/
If lexical analysis is not handled one way or another, someone _will_ add it, and if someone runs a simulation and finds that the system can be improved by tweaking it somehow, there _will_ be questions about why it wasn't done before. There is other work on vandalism at the moment, and neglecting it would also be questionable. "Exploring the feasibility of automatic rating online article quality" and "Creating, destroying and restoring value in Wikipedia" are only two such papers.
I think it is wise to accept that the proposed system is a solution to a subproblem, and that an overall system is considerably more advanced than this one. I am not saying it does not work, and I am not saying it should not be tested; I am saying it is only a solution to a very small subset of the overall problem. As such it should be built in such a way as not to block further refinements. It should not be viewed as a final solution, and there should definitely not be any claims that alternative systems do not work without backing such claims with hard proof.
If someone wants to put this particular system forward as the ultimate one, good luck! I don't think it is the ultimate system. Still, I do think it can be a very good tool if used as what it is: a solution to one of several subproblems.
John E Blad