Article blaming


Platonides

23 Jan 2009

7:19 p.m.

With all the discussion on foundation-l about contributors and attribution, I have noted that while there are two different implementations for blaming MediaWiki articles, neither of them seems to be publicly available. There are some example results, but not the tools themselves.

The implementations I am aware of are:

* Roman Nosov (svn user roman) blamemap extension (2006-2007), which was available at http://217.147.83.36:9001/wiki/Freebsd?trackchanges=blamemap&oldid=1524

* Greg Hewgill wikiblame (2008) http://hewgill.com/journal/entries/461-wikipedia-blame

Is the code available and I have missed it? Do we have any other implementation?


Michael Rosenthal

23 Jan

7:45 p.m.

If you mean something like that, here are some:

http://de.wikipedia.org/wiki/Benutzer:Jah/Rhic

http://de.wikipedia.org/wiki/Benutzer:APPER/WikiHistory


Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Daniel Kinzler

24 Jan

9:38 a.m.

Platonides wrote:


...

Is the code available and I have missed it? Do we have any other implementation?

WikiTrust has another implementation, but I don't know if a demo is live. The code is checked into svn, though: http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/WikiTrust/

WikiTrust does a lot more; the blame feature is kind of a side product. See http://wikitrust.soe.ucsc.edu/index.php/Main_Page.

-- daniel

PS: perhaps forward this info to foundation-l. WikiTrust would be *very* useful to have imho, and it would be good to have more people knowing and caring about it.

Petr Kadlec

26 Jan

11:47 a.m.

...

Is the code available and I have missed it? Do we have any other implementation?

I tried to do something similar (two examples are at http://mormegil.info/wp/blame/AIM.htm and http://mormegil.info/wp/blame/AFC_Ajax.htm). The code is nothing secret, even though it is not too clean and there is no rocket science in it (you have been warned): https://opensvn.csie.org/traccgi/MWTools/browser/MWTools/trunk/PageBlame

The biggest problem I see with such tools is that they are IMHO unusable for any copyright-related purposes. My tool works by diffing the article revisions and tracking who was the last author of every word. Even though you can be much smarter than that, I don't believe you would be able to track all copyright-relevant contributions that way. As an example, consider using such a tool on an article that was created by:

1. Importing an article with all its history from the English Wikipedia to some other-language wiki.
2. Translating it into the local language (for more fun, imagine a language using a different script, e.g. Russian, or even Chinese).

There is IMHO no way the blame tool could track copyright properly through the translation (which it has to, copyright-wise). And even in the general case, I believe such tracking would be an AI-hard task (often, even a human is unable to do it properly…). Of course, such Blame tools are great for many reasons (which is why I wrote them), but I think the current context (license change, attribution etc.) does not fit them at all.
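The word-level approach described above (diff consecutive revisions, remember the last author of every word) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the PageBlame code: the `blame` function name, the `(author, text)` input format, whitespace tokenization, and the use of Python's `difflib.SequenceMatcher` as the diff engine are all choices made for this sketch.

```python
from difflib import SequenceMatcher


def blame(revisions):
    """Return [(word, author), ...] for the latest revision.

    `revisions` is a list of (author, text) pairs, oldest first
    (a hypothetical input format chosen for this sketch).
    """
    words, authors = [], []
    for author, text in revisions:
        new_words = text.split()  # naive whitespace tokenization
        matcher = SequenceMatcher(a=words, b=new_words, autojunk=False)
        new_authors = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged words keep whoever was credited before.
                new_authors.extend(authors[i1:i2])
            else:
                # Inserted or replaced words are credited to this editor.
                # For "delete", j1 == j2, so nothing is added.
                new_authors.extend([author] * (j2 - j1))
        words, authors = new_words, new_authors
    return list(zip(words, authors))


history = [
    ("Alice", "the quick brown fox"),
    ("Bob", "the quick red fox jumps"),
]
print(blame(history))
# [('the', 'Alice'), ('quick', 'Alice'), ('red', 'Bob'), ('fox', 'Alice'), ('jumps', 'Bob')]
```

If a third revision translates the text into another language, it shares no words with its predecessor, so this sketch credits every word to the translator. That is exactly the failure mode described above.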

-- [[cs:User:Mormegil | Petr Kadlec]]

Robert Rohde

6:41 p.m.

On Mon, Jan 26, 2009 at 3:47 AM, Petr Kadlec petr.kadlec@gmail.com wrote:


I tried to do something similar (two examples are at http://mormegil.info/wp/blame/AIM.htm and http://mormegil.info/wp/blame/AFC_Ajax.htm). The code is nothing secret, even though it is not too clean and there is no rocket science in it (you have been warned): https://opensvn.csie.org/traccgi/MWTools/browser/MWTools/trunk/PageBlame

I also have a blame engine of my own design. It is new and I haven't released the source.

...

The biggest problem I see with such tools is that they are IMHO unusable for any copyright-related purposes. My tool works by diffing the article revisions and tracking who was the last author of every word. Even though you can be much smarter than that, I don't believe you would be able to track all copyright-relevant contributions that way. As an example, consider using such a tool on an article that was created by:

1. Importing an article with all its history from the English Wikipedia to some other-language wiki.
2. Translating it into the local language (for more fun, imagine a language using a different script, e.g. Russian, or even Chinese).

There is IMHO no way the blame tool could track copyright properly through the translation (which it has to, copyright-wise). And even in the general case, I believe such tracking would be an AI-hard task (often, even a human is unable to do it properly…). Of course, such Blame tools are great for many reasons (which is why I wrote them), but I think the current context (license change, attribution etc.) does not fit them at all.

I think I have a more positive view than you do. Blame engines as a tool can certainly inform copyright discussions and provide relevant information, even though I agree they aren't by themselves a complete solution.

For example, with situations where one is trying to list a fixed number of "major authors" (as provided in the GFDL, for example), blaming tools can make a reasonable guess at which authors are relevant. They also help estimate the answer to important meta questions, such as "How many authors does a typical Wikipedia article really have?"

When the license calls for attribution to be treated in a "reasonable" way, I suspect that one could make a good case that relying on a good blame engine would often generate a reasonable attempt at attribution, even though there are cases (like translation) where they will fail. Attribution generated by blaming can be a good starting point, though it may not necessarily be the final answer.
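As a rough illustration of how a blame result could feed a "major authors" list, the sketch below counts surviving words per author and keeps the top few. The function names, the `(word, author)` input format, and the default limit of 5 are assumptions made for this example; the message above only speaks of "a fixed number".

```python
from collections import Counter


def major_authors(blamed_words, limit=5):
    """Rank authors by how many surviving words are credited to them."""
    counts = Counter(author for _word, author in blamed_words)
    # most_common() sorts by count, highest first.
    return counts.most_common(limit)


def author_count(blamed_words):
    """Number of distinct authors with at least one surviving word."""
    return len({author for _word, author in blamed_words})
```

The output is a starting point for a human to review, not a final list: it inherits every misattribution the underlying blame engine makes.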

-Robert Rohde

Gregory Maxwell

8:38 p.m.

On Mon, Jan 26, 2009 at 1:41 PM, Robert Rohde rarohde@gmail.com wrote: [snip]

...

When the license calls for attribution to be treated in a "reasonable" way, I suspect that one could make a good case that relying on a good blame engine would often generate a reasonable attempt at attribution,

[snip]

Often, sure. But what happens when it fails and you have someone yelling loudly on the talk page "Hey! It's misattributing my authorship to some dumb bot, yet I wrote the whole thing!" ...

It's not reasonable by any human (or legal) standard to continue to misattribute in a case like that, yet addressing that case with some automatically generated report is not easy.

(And, of course, it's a great starting point… so long as someone remembers to continually point out that it's not a final answer.)

Soxred93

8:47 p.m.

It's more of a starting point, to flag editors who may have made the edits. All that would remain is checking whether that user did in fact make that edit (and if they didn't, it's back to square one).


Platonides

27 Jan

12:02 a.m.

Gregory Maxwell wrote:

...

Often, sure. But what happens when it fails and you have someone yelling loudly on the talk page "Hey! It's misattributing my authorship to some dumb bot, yet I wrote the whole thing!" ...

It's not reasonable by any human (or legal) standard to continue to misattribute in a case like that, yet addressing that case with some automatically generated report is not easy.

There could be an override for some articles for outsiders, just to stop misattribution for the time being, but it can't be treated as a solution. If it attributes a bot, the algorithm is wrong and should be fixed. Add it to the tests and improve the algorithm. One very important point of having an automated way of dealing with this is precisely that it allows us to be completely neutral. The "author reviewer" doesn't have a POV; it doesn't care that the contributor is named "Willy on wheels" or "Gmaxwell".

The hard part is obviously how to get that magic attribution tool. If only there had been some convention on the summary syntax to be used when attributing someone else...

Ilmari Karonen

12:46 a.m.

Platonides wrote:

...

Gregory Maxwell wrote:

...
It's not reasonable by any human (or legal) standard to continue to misattribute in a case like that, yet addressing that case with some automatically generated report is not easy.

There could be an override for some articles for outsiders, just to stop misattribution for the time being, but it can't be treated as a solution. If it attributes a bot, the algorithm is wrong and should be fixed.

This particular case should be easy to fix, but only because we're fairly meticulous about flagging bot accounts. Essentially, we'd be falling back to a human saying "that account is a bot, don't attribute anything to it".

Which, I suppose, is a fairly straightforward and objective decision to make, but it's still a human decision made according to someone's personal point of view. Certainly such a fix won't easily generalize to more controversial cases.

...

Add it to the tests and improve the algorithm. One very important point of having an automated way of dealing with this is precisely that it allows us to be completely neutral. The "author reviewer" doesn't have a POV, it doesn't care that the contributor is named "Willy on wheels" or "Gmaxwell".

Sort of reminds me of the story somewhere in the Jargon File about closing one's eyes so that the room would be empty.

Deciding who are the "major contributors" to an article is ultimately a subjective issue; it can be algorithmically approximated, given a suitable weighting function (such as "number of words contributed"), but ultimately the choice of the weighting function itself is to an extent a matter of personal opinion. Trying to achieve neutrality by focusing solely on purely objective metrics is likely to produce results that are neutral only in the sense that everyone can agree that they suck.
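The dependence on the weighting function is easy to demonstrate. In the sketch below, the edit records, the account names, and both weighting functions are invented for illustration; the point is only that two equally "objective" metrics can rank the same history in different orders.

```python
from collections import defaultdict


def rank(edits, weight):
    """Rank authors by the summed weight of their edits, highest first."""
    totals = defaultdict(float)
    for edit in edits:
        totals[edit["author"]] += weight(edit)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(totals, key=lambda author: (-totals[author], author))


def by_words(edit):
    """Weight an edit by the number of words it added."""
    return edit["words_added"]


def by_edits(edit):
    """Weight every edit equally, regardless of size."""
    return 1


# Made-up edit history: one bulk import by a bot, then human edits.
edits = [
    {"author": "ImportBot", "words_added": 400},
    {"author": "Alice", "words_added": 30},
    {"author": "Alice", "words_added": 25},
    {"author": "Alice", "words_added": 40},
    {"author": "Bob", "words_added": 120},
]

print(rank(edits, by_words))  # ['ImportBot', 'Bob', 'Alice']
print(rank(edits, by_edits))  # ['Alice', 'Bob', 'ImportBot']
```

Neither ordering is wrong as arithmetic. Choosing between them is the subjective step.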

...

The hard part is obviously how to get that magic attribution tool. If only there had been some convention on the summary syntax to be used when attributing someone else...

Such a convention (or, better yet, a separate field for attributing edits to someone else) would help a lot. Of course, it would only be as reliable as the users entering the data, and wouldn't help with legacy edits.
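To make the idea concrete, here is a sketch of how a blame tool might read such a convention from edit summaries. No such convention exists in the thread: the `credit: [[User:Name]]` syntax, the regular expressions, and the `credited_authors` function are all hypothetical, invented only for this example.

```python
import re

# Hypothetical convention, invented for this sketch: an edit summary credits
# other authors as "credit: [[User:Name]], [[User:Other]]".
CREDIT = re.compile(r"credit:\s*(.*)", re.IGNORECASE)
USER_LINK = re.compile(r"\[\[User:([^\]|]+)")


def credited_authors(summary, editor):
    """Return the authors to credit for an edit with this summary."""
    match = CREDIT.search(summary)
    if match:
        names = [name.strip() for name in USER_LINK.findall(match.group(1))]
        if names:
            return names
    # No usable credit marker: fall back to the account that saved the edit.
    return [editor]
```

As noted above, a parser like this is only as reliable as the users entering the data, and it does nothing for legacy edits whose summaries predate the convention.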

A particularly common and tricky use case would be text copied from one article to another (possibly across different wikis). Either the blame tool would have to chase such references (which seems impractical at least in cross-wiki cases) or it would have to rely on the users making the copy to correctly determine the actual author(s) of the copied text (a process which is highly error prone, even with computer assistance, and likely to lead to errors accumulating as text misattributed once is further copied using the incorrect attribution).

Also, there would obviously have to be some way to correct attribution errors after the fact. Which, of course, leads to the question of who should be authorized to make such corrections, and how this would be any less subjective than revising the author list directly.

-- Ilmari Karonen

Robert Rohde

1:08 a.m.

On Mon, Jan 26, 2009 at 4:46 PM, Ilmari Karonen nospam@vyznev.net wrote:

...

Platonides wrote:

...
Gregory Maxwell wrote:

...
It's not reasonable by any human (or legal) standard to continue to misattribute in a case like that, yet addressing that case with some automatically generated report is not easy.

There could be an override for some articles for outsiders, just to stop misattribution for the time being, but it can't be treated as a solution. If it attributes a bot, the algorithm is wrong and should be fixed.

This particular case should be easy to fix, but only because we're fairly meticulous about flagging bot accounts. Essentially, we'd be falling back to a human saying "that account is a bot, don't attribute anything to it".

Which, I suppose, is a fairly straightforward and objective decision to make, but it's still a human decision made according to someone's personal point of view. Certainly such a fix won't easily generalize to more controversial cases.

<snip>

And what is inherently wrong with attributing a bot? Some wikis are heavily influenced by bots that import systematic content (like the Rambot articles on small towns).

At a legal level users like "ShadowCat" and "ShadowCat's Bot" are both pseudonyms and I doubt it makes any difference whether you attribute one or the other. From a practical point of view, I think distinguishing bot generated content is actually a quite useful detail for downstream users to be aware of (especially if very few editors other than the bot have influenced the text).

Depending on where and how the bot got its content, it might be necessary to add attribution to some external source(s). However, if we are talking about a tool for aiding attribution, and not The Answer to attribution, then I certainly wouldn't drop bots until a human has had a look, and maybe not even then.

-Robert Rohde

participants (8)

Daniel Kinzler
Gregory Maxwell
Ilmari Karonen
Michael Rosenthal
Petr Kadlec
Platonides
Robert Rohde
Soxred93