[Wikimedia-l] Fwd: The most controversial topics in Wikipedia: A multilingual and geographical analysis

Taha Yasseri taha.yasseri at oii.ox.ac.uk
Mon Jul 22 15:24:35 UTC 2013


Anders,
I really like your idea of "universal" articles, given that translation
and communication across languages is no longer a very difficult task.

By the way, in a blog post, I have released some more data on languages such as
Japanese, Chinese, and Portuguese, in case anyone is interested:
http://tahayasseri.wordpress.com/2013/05/27/wikipedia-modern-platform-ancient-debates-on-land-and-gods/

bests,
Taha


On Mon, Jul 22, 2013 at 4:17 PM, Anders Wennersten <mail at anderswennersten.se
> wrote:

> I find the differences between the language versions most interesting, as
> well as getting some insight into the Arabic version, which I have not had before.
>
> On a "small version" like sv:wp we are very used to "stealing with pride"
> content from other versions, primarily en:wp but also de:wp and others, and we
> do this especially for controversial subjects that are not specific to a
> country/culture. But are en:wp and other big versions doing the same? It is
> very refreshing for a deadlocked discussion to start with an almost entirely
> new text version.
>
> Also I wonder about articles like Homeopathy
> <http://en.wikipedia.org/wiki/Homeopathy>, which seems to be
> at the top of the controversies. Would it be an idea to compile a universal
> article with help from the different versions, i.e. do we really utilize the
> power of having many versions and many experts?
>
> Anders
>
>
>
> Osmar Valdebenito skrev 2013-07-22 16:13:
>
>> I was interviewed a few days ago by a Chilean newspaper because of this
>> paper. For those interested who can read Spanish, here is the full
>> article:
>> http://www.latercera.com/noticia/tendencias/2013/07/659-533645-9-estudio-dice-que-chile-es-el-articulo-de-wikipedia-mas-editado-en-espanol.shtml
>>
>> I read the paper in full and I have to admit it has very interesting
>> approaches to remove the "vandalism" effect. Probably it won't be perfect,
>> especially for a platform where it is impossible to have an exact,
>> quantitative measure of quality or neutrality. Is there a measure of
>> controversiality? I will consider controversial those articles where I
>> usually edit and probably I will ignore several others that are more
>> controversial and so on...
>>
>> But besides the particular issue of which is the most controversial
>> article, I'm more interested in the trends that each Wikipedia has. They
>> seem consistent and I think there are a lot of things that we can learn
>> from them.
>>
>> *Osmar Valdebenito G.*
>> Director Ejecutivo
>> A. C. Wikimedia Argentina
>>
>>
>> 2013/7/22 Taha Yasseri <taha.yasseri at oii.ox.ac.uk>
>>
>>  Thanks Tilman.
>>>
>>> Especially for your effort to resolve the misunderstandings, most of
>>> which, I suppose, are due to a shallow reading: "I had a bit of free time
>>> last night waiting for trains and I skimmed through the study and its
>>> findings."
>>>
>>> We had two strategies to get rid of vandalism, as you mentioned:
>>> considering only mutual reverts and weighting editors by their maturity. I
>>> suppose a vandal could not have a large maturity score, by definition.
>>>
>>> As for the data, this study was carried out in 2011, and we worked on
>>> the latest available dump at the time. Anyone experienced in academic
>>> research, especially at this scale, knows well that it really takes time
>>> to get the analysis done, write the reports, get them reviewed, etc.,
>>> especially given that we have published 7-8 other papers during the same
>>> period. I see no problem in this as long as the metadata and information
>>> about the methods and the data under study are mentioned in the
>>> manuscript, which is clearly the case here. I have seen many Wikipedia
>>> studies without any mention of the dump they used!
>>>
>>> Back to your concern about the general impression the news media give
>>> of Wikipedia being a battlefield: I'd like to mention that I have
>>> emphasised the small number of controversial articles compared to the
>>> total number of articles in every single media response I gave. Again, as
>>> you mentioned, we gave the percentages explicitly in our previous work.
>>> But of course, for obvious reasons, journalists are not happy to highlight
>>> this. They like to report on controversies and wars! It is not our fault
>>> if what they report is misleading, as long as we have tried our best
>>> to avoid it. An interview of mine with BBC Radio Scotland, in which at
>>> 04:00 I clearly say that there are millions and thousands of articles in
>>> Wikipedia which are not controversial, is available here:
>>> https://www.dropbox.com/s/8whovkmipbqdzlv/bbc_radio_Scotland.mp3
>>> I have done the same in all the others.
>>>
>>> Finally, I hope that the media coverage of our research, which is
>>> clearly far from perfect, can also give members of the public a
>>> better understanding of how Wikipedia works and how fascinating it is!
>>>
>>> Thanks again,
>>>
>>> Taha
>>>
>>>
>>> On 22 Jul 2013 05:58, "Tilman Bayer" <tbayer at wikimedia.org> wrote:
>>>
>>>  On Sun, Jul 21, 2013 at 2:32 PM, MZMcBride <z at mzmcbride.com> wrote:
>>>>
>>>>> Anders Wennersten wrote:
>>>>>
>>>>>> A most interesting study looking at findings from 10 different
>>>>>> language
>>>>>> versions.
>>>>>>
>>>>>> Jesus and Middle east are the most controversial articles seen over
>>>>>> the
>>>>>> world, but George Bush on en:wp and Chile on es:wp
>>>>>>
>>>>>> http://arxiv.org/ftp/arxiv/papers/1305/1305.5566.pdf
>>>>>>
>>>> FWIW, here is the review by Giovanni Luca Ciampaglia in last month's
>>>> Wikimedia Research Newsletter:
>>>>
>>>> https://blog.wikimedia.org/2013/06/28/wikimedia-research-newsletter-june-2013/#.22The_most_controversial_topics_in_Wikipedia:_a_multilingual_and_geographical_analysis.22
>>>>
>>>> (also published in the Signpost, the weekly newsletter on the English
>>>> Wikipedia)
>>>>
>>>>> Thanks for sharing this.
>>>>>
>>>>> I had a bit of free time last night waiting for trains and I skimmed
>>>>> through the study and its findings. Two points stuck out at me: a
>>>>> seemingly fatally flawed methodology and the age of data used.
>>>>>
>>>>> The methodology used in this study seems to be pretty inherently
>>>>> flawed. According to the paper, controversiality was measured by full
>>>>> page reverts, which are fairly trivial to identify and study in a
>>>>> database dump (using cryptographic hashes, as the study did), but I
>>>>> don't think full reverts give an accurate impression _at all_ of which
>>>>> articles are the most controversial.
>>>>>
>>>>> Pages with many full reverts are indicative of pages that are heavily
>>>>> vandalized. For example, the "George W. Bush" article is/was heavily
>>>>> vandalized for years on the English Wikipedia. Does blanking the
>>>>> article or replacing its contents with the word "penis" mean that it's
>>>>> a very controversial article? Of course not. Measuring only full
>>>>> reverts (as the study seems to have done, though it's certainly
>>>>> possible I've overlooked something) seems to be really misleading and
>>>>> inaccurate.
>>>>>
>>>> They didn't. You may have overlooked the description of the
>>>> methodology on p.5: It's based on "mutual reverts" where user A has
>>>> reverted user B and user B has reverted user A, and gives higher
>>>> weight to disputes between more experienced editors. This should
>>>> exclude most vandalism reverts of the sort you describe. As noted in
>>>> Giovanni's review, this method was proposed in an earlier paper, Sumi
>>>> et al.
>>>> (https://meta.wikimedia.org/wiki/Research:Newsletter/2011/July#Edit_wars_and_conflict_metrics).
>>>> That paper explains at length how this metric serves to distinguish
>>>> vandalism reverts from edit wars. Of course there are ample
>>>> possibilities to refine it, e.g. taking into account page protection
>>>> logs.
>>>>
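A minimal sketch of the mutual-revert idea described above, in Python. It
assumes the reverts have already been extracted as (reverter, reverted)
pairs, and that an editor's total edit count stands in for "experience".
The function names, the sample editors and the choice of the smaller edit
count as the weight are illustrative assumptions, not necessarily the
paper's exact formula.

```python
def mutual_revert_pairs(reverts):
    """Return the unordered editor pairs that have reverted each other.

    reverts: iterable of (reverter, reverted) editor names.
    """
    directed = set(reverts)
    pairs = set()
    for a, b in directed:
        # Keep the pair only if the revert also happened the other way round.
        if a != b and (b, a) in directed:
            pairs.add(frozenset((a, b)))
    return pairs


def controversy_score(reverts, edit_counts):
    """Sum, over mutual-revert pairs, the edit count of the less
    experienced editor (assumed weighting, for illustration only)."""
    score = 0
    for pair in mutual_revert_pairs(reverts):
        a, b = tuple(pair)
        score += min(edit_counts.get(a, 0), edit_counts.get(b, 0))
    return score


# Hypothetical sample data.
sample_reverts = [
    ("alice", "bob"),
    ("bob", "alice"),      # alice and bob reverted each other
    ("carol", "vandal"),   # one-way reverts of a vandal
    ("dave", "vandal"),
]
sample_edit_counts = {"alice": 500, "bob": 120, "carol": 900,
                      "dave": 300, "vandal": 1}

sample_score = controversy_score(sample_reverts, sample_edit_counts)
```

In this sample the one-way reverts of the hypothetical "vandal" account
contribute nothing, since the vandal never reverted anyone back; only the
alice/bob pair counts.
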
>>>> Personally, I'm more concerned that the new paper totally fails to put
>>>> its subject into perspective by stating how frequent such
>>>> controversial articles are overall on Wikipedia. Thus it's no wonder
>>>> that the ample international media coverage that it generated mostly
>>>> transports the notion (or reinforces the preconception) of Wikipedia
>>>> as a huge battleground.
>>>>
>>>> The 2011 Sumi et al. paper did a better job in that respect: "less
>>>> than 25k articles, i.e. less than 1% of the 3m articles available in
>>>> the November 2009 English WP dump, can be called controversial, and of
>>>> these, less than half are truly edit wars."
>>>>
>>>>
>>>>> In order to measure how controversial an article is, there are a
>>>>> number of metrics that could be used, though of course no metric is
>>>>> perfect and many metrics can be very difficult to accurately and
>>>>> rigorously measure:
>>>>>
>>>>> * amount of talk page discussion generated for each article;
>>>>> * number of page watchers;
>>>>> * number of page views (possibly);
>>>>> * number of arbitration cases or other dispute resolution procedures
>>>>> related to the article (perhaps a key metric in determining which
>>>>> articles are truly most controversial); and
>>>>> * edit frequency and time between certain edits and partial or full
>>>>> reverts of those edits.
>>>>>
>>>>> There are likely a number of other metrics that could be used as well
>>>>> to measure controversiality; these were simply off the top of my head.
>>>>>
>>>> Perhaps you are interested in this 2012 paper comparing such metrics,
>>>> which the authors of the present paper cite to justify their choice of
>>>> metric:
>>>> Sepehri Rad, H., Barbosa, D.: Identifying controversial articles in
>>>> Wikipedia: A comparative study.
>>>> http://www.wikisym.org/ws2012/p18wikisym2012.pdf
>>>>
>>>> Regarding detection of (partial or full) reverts, see also
>>>> https://meta.wikimedia.org/wiki/Research:Revert_detection
>>>>
>>>>> The second point that stuck out at me was that the study relied on a
>>>>> database dump from March 2010. While this may be unavoidable, being
>>>>> over three years later, this introduces obvious bias into the data and
>>>>> its findings. Put another way, for the English Wikipedia started in
>>>>> 2001, this omits a quarter of the project's history(!). Again, given
>>>>> the length of time needed to draft and prepare a study, this gap may
>>>>> very well be unavoidable, but it certainly made me raise an eyebrow.
>>>>>
>>>>> One final comment I had from briefly reading the study was that in the
>>>>> past few years we've made good strides in making research like this
>>>>> easier. Not that computing cryptographic hashes is particularly
>>>>> intensive, but these days we now store such hashes directly in the
>>>>> database (though we store SHA-1 hashes, not MD5 hashes as the study
>>>>> used). Storing these hashes in the database saves researchers the need
>>>>> to compute the hashes themselves and allows MediaWiki and other
>>>>> software the ability to easily and quickly detect full reverts.
>>>>>
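As a rough illustration of how hashes make full reverts easy to spot, here
is a minimal Python sketch. It assumes a page history is available as a
chronological list of (editor, text) tuples; the function name and sample
history are hypothetical, and the hex SHA-1 digest is used only for
illustration, not as a statement of how MediaWiki encodes its stored hashes.

```python
import hashlib


def find_full_reverts(revisions):
    """Find revisions whose text is identical to an earlier, non-adjacent one.

    revisions: list of (editor, text) tuples in chronological order.
    Returns a list of (reverting_index, restored_index) tuples.
    """
    last_seen = {}  # digest -> index of the latest revision with that text
    reverts = []
    for i, (_editor, text) in enumerate(revisions):
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        # A match with the immediately preceding revision is a null edit,
        # not a revert, so require at least one revision in between.
        if digest in last_seen and last_seen[digest] < i - 1:
            reverts.append((i, last_seen[digest]))
        last_seen[digest] = i
    return reverts


# Hypothetical sample history: alice and bob keep restoring their versions.
sample_history = [
    ("alice", "Version A"),
    ("bob", "Version B"),
    ("alice", "Version A"),   # restores revision 0
    ("bob", "Version B"),     # restores revision 1
    ("carol", "Version C"),
]

sample_full_reverts = find_full_reverts(sample_history)
```

Only whole-page restorations are caught this way; partial reverts would need
a text comparison rather than a hash lookup.
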
>>>>> MZMcBride
>>>>>
>>>>> P.S. Noting that this study is still a draft, I happened to notice a
>>>>> small typo on page nine: "We tried to a as diverse as possible sample
>>>>> including West European [...]". Hopefully this can be corrected before
>>>>> formal publication.
>>>>>
>>>>>
>>>>
>>>> --
>>>> Tilman Bayer
>>>> Senior Operations Analyst (Movement Communications)
>>>> Wikimedia Foundation
>>>> IRC (Freenode): HaeB
>>>>
>>>>
>>>
>>> --
>>> Dr Taha Yasseri
>>> http://www.oii.ox.ac.uk/people/yasseri/
>>> Oxford Internet Institute
>>> University of Oxford
>>> 1 St.Giles
>>> Oxford OX1 3JS
>>> Tel.01865-287229
>>> -------------------------------------------
>>> Latest Article: Phys. Rev. Lett. Opinions, Conflicts, and Consensus:
>>> Modeling Social Dynamics in a Collaborative
>>> Environment<http://prl.aps.org/abstract/PRL/v110/i8/e088701>
>>>
>>> Non-technical review: University of Oxford, Mathematical model
>>> 'describes' how online conflicts are
>>> resolved<http://www.ox.ac.uk/media/news_stories/2013/130220.html>
>>> _______________________________________________
>>> Wikimedia-l mailing list
>>> Wikimedia-l at lists.wikimedia.org
>>> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
>>> <mailto:wikimedia-l-request at lists.wikimedia.org?subject=unsubscribe>



-- 
Dr Taha Yasseri
http://www.oii.ox.ac.uk/people/yasseri/
Oxford Internet Institute
University of Oxford
1 St.Giles
Oxford OX1 3JS
Tel.01865-287229
-------------------------------------------
Latest Article: Phys. Rev. Lett. Opinions, Conflicts, and Consensus:
Modeling Social Dynamics in a Collaborative
Environment<http://prl.aps.org/abstract/PRL/v110/i8/e088701>

Non-technical review: University of Oxford, Mathematical model 'describes'
how online conflicts are
resolved<http://www.ox.ac.uk/media/news_stories/2013/130220.html>

