Please join the Summer of Research fellows (http://meta.wikimedia.org/wiki/Research:WSOR11) as we present the results of our research on the causes, effects, characteristics, tools, and visualizations of the editor decline.
Test your knowledge of Wikipedia against some of the Summer of Research results by taking our quiz (https://docs.google.com/spreadsheet/viewform?formkey=dFdscUFfY0dhdUN5eGkwUT…). Submit the quiz before the brown bag on Thursday, August 25th, from 2:00-3:30pm PST. We'll share our research and the answers to the quiz. Prizes will be given for the highest grade, but you must attend (in SF or remotely) to receive your prize.
**REMOTE call-in instructions**
Topic: Summer of Research Wrap-up
Date: Thursday, August 25, 2011
Time: 2:00 pm, Pacific Daylight Time (San Francisco, GMT-07:00)
Meeting Number/Access Code: 801 468 325
Meeting Password: (This meeting does not require a password.)
-------------------------------------------------------
To join the online meeting (Now from mobile devices!)
-------------------------------------------------------
1. Go to https://wikimedia.webex.com/wikimedia/j.php?ED=159146047&UID=1207190257&RT=…
2. If requested, enter your name and email address.
3. If a password is required, enter the meeting password: (This meeting does not require a password.)
4. Click "Join".
To view in other time zones or languages, please click the link:
https://wikimedia.webex.com/wikimedia/j.php?ED=159146047&UID=1207190257&ORT…
-------------------------------------------------------
To join the audio conference only
-------------------------------------------------------
To receive a call back, provide your phone number when you join the meeting, or call the number below and enter the access code.
Call-in toll-free number (US/Canada): 1-877-669-3239
Call-in toll number (US/Canada): +1-408-600-3600
Global call-in numbers: https://wikimedia.webex.com/wikimedia/globalcallin.php?serviceType=MC&ED=15…
Toll-free dialing restrictions: http://www.webex.com/pdf/tollfree_restrictions.pdf
I'm doing some analysis on the Wikipedia image metadata and seeing some
missing image rows in the SQL dumps.
I downloaded
enwiki-latest-image.sql, enwiki-latest-imagelinks.sql,
and enwiki-latest-oldimage.sql from
http://dumps.wikimedia.org/enwiki/latest/
I picked a page, 25041,
http://en.wikipedia.org/wiki/Special:Export/Lockheed_P-38_Lightning
I get 39 links from
"SELECT il_to FROM imagelinks WHERE il_from = 25041"
When I query the image table for these, only 8 of the 39 appear.
Some of the missing files are 050218-F-1234P-076.jpg, 020930-O-9999G-017.jpg
I grepped the original mysql file for these and get nothing.
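The comparison itself boils down to a set difference. A minimal sketch, assuming the linked names and the image-table names have already been extracted from MySQL into Python lists (the helper function and the sample names below are hypothetical, not part of any dump tooling):

```python
def missing_images(linked_names, image_table_names):
    """Return imagelinks targets with no matching row in the local image table."""
    local = set(image_table_names)
    return sorted(name for name in linked_names if name not in local)

# Toy illustration (a made-up subset of names):
linked = ["P-38_Lightning.jpg", "050218-F-1234P-076.jpg", "020930-O-9999G-017.jpg"]
local_rows = ["P-38_Lightning.jpg"]
print(missing_images(linked, local_rows))
# ['020930-O-9999G-017.jpg', '050218-F-1234P-076.jpg']
```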
I can see the original file here though:
http://en.wikipedia.org/wiki/File:050218-F-1234P-076.jpg
I did a SELECT COUNT(*) and got a total of 849,801 rows, which seems low for
the total number of Wikipedia images.
Any ideas why I'm getting missing data?
--
@tommychheng
http://tommy.chheng.com
I've updated my dump processing python project to include code for quickly
detecting identity reverts from XML dumps. See
https://bitbucket.org/halfak/wikimedia-utilities for the project and the
process() function at the bottom of
https://bitbucket.org/halfak/wikimedia-utilities/src/f1c8fe7224f3/wmf/dump/…
for the algorithm. The actual function with the revert detection logic is about
50 lines long.
The resulting dump.map run using this revert process() function will emit
"revert" revisions and "reverted" revisions with the following fields:
Revert revision:
- "revert" - denotes that this row is a reverting edit
- revision_id - the rev_id of the reverting edit
- reverted_to_id - the rev_id of the reverted-to edit
- for_vandalism - flagged by matching the D_LOOSE/D_STRICT regular
expressions against the reverting edit's comment (see Priedhorsky et al.,
"Creating, Destroying, and Restoring Value in Wikipedia", GROUP 2007)
- reverted_revs - the number of revisions that were reverted (i.e., the
number of revisions between the reverting edit and the reverted-to edit)
Reverted revision:
- "reverted" - denotes that this row is a reverted edit
- revision_id - the rev_id of the reverted edit
- reverting_id - the rev_id of the reverting edit
- reverted_to_id - the rev_id of the reverted-to edit
- for_vandalism - flagged by matching the D_LOOSE/D_STRICT regular
expressions against the reverting edit's comment (see Priedhorsky et al.,
"Creating, Destroying, and Restoring Value in Wikipedia", GROUP 2007)
- reverted_revs - the number of revisions that were reverted (i.e., the
number of revisions between the reverting edit and the reverted-to edit)
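The core idea is small enough to sketch. This is not the actual wikimedia-utilities code, just a minimal illustration of identity-revert detection by checksum; the function name is made up, and the ~15-revision window is taken from the discussion quoted below:

```python
import hashlib

RADIUS = 15  # max revisions a single revert may undo; ~15 worked best per this thread

def detect_identity_reverts(revisions):
    """Given (rev_id, text) pairs in chronological order, yield
    (reverting_id, reverted_to_id, [reverted_ids]) for each identity revert."""
    history = []    # (checksum, rev_id) in chronological order
    last_seen = {}  # checksum -> index of the most recent revision with that text
    for i, (rev_id, text) in enumerate(revisions):
        checksum = hashlib.md5(text.encode("utf-8")).hexdigest()
        if checksum in last_seen:
            j = last_seen[checksum]
            # require at least one intervening (reverted) revision, within the window
            if 0 < i - j - 1 <= RADIUS:
                reverted = [history[k][1] for k in range(j + 1, i)]
                yield rev_id, history[j][1], reverted
        history.append((checksum, rev_id))
        last_seen[checksum] = i

# The "foo"/"bar" toy sequence from the quoted discussion below:
revs = [(1, "foo"), (2, "bar"), (3, "foobar"), (4, "bar"), (5, "barbar")]
print(list(detect_identity_reverts(revs)))  # [(4, 2, [3])]
```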
I hope this is helpful.
-Aaron
On Fri, Aug 19, 2011 at 3:08 PM, Aaron Halfaker <aaron.halfaker(a)gmail.com> wrote:
> An identity revert is one which changes the article to an absolutely
> identical previous state. This is a common operation in the English
> Wikipedia.
>
> There is a Kittur & Kraut (and others) paper, which I can't recall, that
> found the vast majority of reverts of any sort were identity reverts. Some
> other types they define are:
>
> - "Partial reverts": Part of an edit is discarded
> - "Effective reverts": Looks to be an identity revert, but not
> *exactly* the same as a previous revision. Often a few white-space
> characters were out of place.
>
> See http://www.grouplens.org/node/427 for a discussion of the difficulty
> of detecting reverts in better ways.
>
> My code detects identity reverts. For example suppose the following is the
> content of a sequence of revisions.
>
>
> 1. "foo"
> 2. "bar"
> 3. "foobar"
> 4. "bar"
> 5. "barbar"
>
> Revision #4 reverts back to revision #2 and revision #3 is reverted. When
> looking for identity reverts, I have found that limiting the number of
> revisions that can be reverted to ~15 produces the highest quality of
> results. This is discussed in http://www.grouplens.org/node/416 (see
> http://www-users.cs.umn.edu/~halfak/summaries/A_Jury_of_Your_Peers.html for
> a quick-and-dirty summary of the work).
>
> This subject deserves a long conversation, but I think the bit you might be
> interested in is that the identity revert (described and illustrated above)
> seems to be the accepted approach to identifying reverts for most types of
> analyses.
>
> -Aaron
>
> On Fri, Aug 19, 2011 at 4:39 PM, Flöck, Fabian <fabian.floeck(a)kit.edu> wrote:
>
>> Hi Aaron,
>>
>> thanks, that would be awesome :) we built something ourselves, but I'm not
>> quite content with it.
>>
>> Could you also tell me how you defined a revert (and maybe how you
>> determine who is the reverter)? Because this is a crucial issue for me.
>> Is it the complete deletion of all the characters entered by an editor in
>> an edit? What about editors that revert others or delete content? do you
>> treat their edits as being reverted if the deleted content gets
>> reintroduced? Did you take into account location of the words in the text or
>> did you use a bag-of-words model?
>> I read many papers and tool documentations that use "reverts", and some
>> mention their method (while many don't), but it seems almost no one
>> describes their definition of what a "revert" actually is.
>>
>> But maybe I will get the answers to this from your code as well :)
>>
>> Anyway, thanks for the help!
>>
>> Best,
>> Fabian
>>
>>
>> On 19 Aug 2011, at 18:31, Aaron Halfaker wrote:
>>
>> Fabian,
>>
>> I actually have some software for quickly producing reverts from a
>> database dump. The framework for doing it is available here:
>> https://bitbucket.org/halfak/wikimedia-utilities. I still have to
>> package up the code that actually generates the reverts though. It's just a
>> matter of finding time to sit down with it and figure out the dependencies!
>> I expect that I can have it ready by Monday. I hope to actually package up
>> the revert detecting code into the above python project as an example.
>>
>> I just wanted to let you know that I have a response for you on the way.
>>
>> -Aaron
>>
>> On Thu, Aug 18, 2011 at 4:40 AM, Flöck, Fabian <fabian.floeck(a)kit.edu> wrote:
>>
>>> Hi,
>>>
>>> I'm trying to detect reverts in Wikipedia for my research, right now with
>>> a self-built script using MD5 hashes and diffs between revisions. I always
>>> read about people taking reverts into account in their data, but it's
>>> seldom described HOW exactly a revert is determined or what tool they use
>>> to do that. Can you point me to any research or tools, or tell me maybe what
>>> you used in your own research to identify which edits were reverted and/or
>>> who reverted them?
>>>
>>> Best,
>>>
>>> Fabian
>>>
>>>
>>>
>>>
>>> --
>>> Karlsruhe Institute of Technology (KIT)
>>> Institute of Applied Informatics and Formal Description Methods
>>>
>>> Dipl.-Medwiss. Fabian Flöck
>>> Research Associate
>>>
>>> Building 11.40, Room 222
>>> KIT-Campus South
>>> D-76128 Karlsruhe
>>>
>>> Phone: +49 721 608 4 6584
>>> Skype: f.floeck_work
>>> E-Mail: fabian.floeck(a)kit.edu
>>> WWW: http://www.aifb.kit.edu/web/Fabian_Flöck
>>>
>>> KIT – University of the State of Baden-Wuerttemberg and
>>> National Research Center of the Helmholtz Association
>>>
>>>
>>>
>>
>>
>>
>>
>>
>>
>>
>>
>
It's worth pointing out that in our research at PARC, we also discussed
the possibility of using a containment-based measure, as described in:
"On the Resemblance and Containment of Documents", A. Z. Broder.
In the end, we realized that the real issue is that there is no
universal agreement on what is a 'revert'.
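For anyone curious, Broder's containment measure is easy to sketch. This is just an illustrative toy over word shingles (the shingle width of 3 is an arbitrary choice, not anything from our work at PARC):

```python
def shingles(text, w=3):
    """Set of w-word shingles of a text (Broder's S(D, w))."""
    tokens = text.split()
    if len(tokens) < w:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def containment(a, b, w=3):
    """c(A, B) = |S(A) & S(B)| / |S(A)|: how much of A is contained in B."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa)

# Every shingle of the first text also appears in the second:
print(containment("the quick brown fox jumps", "the quick brown fox jumps over it"))  # 1.0
```

Under this measure, a revision whose shingles are fully contained in a later revision was (in some sense) restored by it, which is one way to relax the strict identity-revert definition.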
--Ed
On Sun, Aug 21, 2011 at 3:15 PM,
<wiki-research-l-request(a)lists.wikimedia.org> wrote:
> There have been a few publications on the subject:
> 1. "Us vs. Them: Understanding social dynamics in Wikipedia with revert
> graph visualizations", B. Suh, E. H. Chi, B. A. Pendleton.
> 2. "He Says, She Says: Conflict and coordination in Wikipedia", A. Kittur,
> B. Suh, B. A. Pendleton.
>
We received several requests to extend the submission deadline for WikiViz 2011 – a competition organized by WikiSym and the Wikimedia Foundation to visualize Wikipedia's impact with open data.
The WikiSym committee is glad to announce that the deadline has been extended to August 28, 2011.
http://www.wikisym.org/ws2011/wikiviz:presentation
http://twitter.com/WikiViz/status/104349201680437248
The 3 finalists will have their travel costs covered for the award ceremony at WikiSym 2011 in Mountain View, CA (3-5 October 2011), and their work will be showcased at the conference, featured in our partners' dataviz outlets (FlowingData, Information Aesthetics, Periscopic, Visualizing.org), and published by El Mundo – the largest digital newspaper by readership in Spanish.
Please circulate the call to anyone who might be interested.
Best,
Dario
--
Dario Taraborelli, PhD
Senior Research Analyst
Wikimedia Foundation
http://wikimediafoundation.org
http://nitens.org/taraborelli
Hi,
I'm trying to detect reverts in Wikipedia for my research, right now with a self-built script using MD5 hashes and diffs between revisions. I always read about people taking reverts into account in their data, but it's seldom described HOW exactly a revert is determined or what tool they use to do that. Can you point me to any research or tools, or tell me maybe what you used in your own research to identify which edits were reverted and/or who reverted them?
Best,
Fabian
--
Karlsruhe Institute of Technology (KIT)
Institute of Applied Informatics and Formal Description Methods
Dipl.-Medwiss. Fabian Flöck
Research Associate
Building 11.40, Room 222
KIT-Campus South
D-76128 Karlsruhe
Phone: +49 721 608 4 6584
Skype: f.floeck_work
E-Mail: fabian.floeck(a)kit.edu
WWW: http://www.aifb.kit.edu/web/Fabian_Flöck
KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association
Hello,
we have written a tool that simulates the hardware meter. It should run on
Windows, Linux, Android, and MacOS.
You can find it on our project page (http://l3q.de/pediameter) and in the
Android Market once it has left beta status.
Greets,
Lukas Benedix and Jens Hantke.
Hello listeners,
in a project at our university (FU Berlin), we visualize the recent
changes on the major Wikipedias (grouped by language). It's called Pediameter.
If you're interested, you can have a look at our project at
http://l3q.de/pediameter/ .
It's supported on Windows, Linux, and MacOS X.
Greets,
Lukas Benedix and Jens Hantke.