Hi,

 

You might want to have a look at the JWPL Revision Toolkit:

http://code.google.com/p/jwpl/

It should provide most of the information you are looking for, in particular access to all the modifications (revisions) and a parser to extract the plain text from Wikipedia markup.

 

The UIMA toolkit

http://code.google.com/p/dkpro-core-asl/

also contains a component that gets you all pairs of adjacent revisions, which makes it quite easy to spot the ones that are additions only.
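
To illustrate the idea of checking adjacent revision pairs for additions only, here is a minimal Python sketch. It is not the JWPL or DKPro API; the function names (added_tokens, addition_only_pairs) are made up for this example, and it assumes you already have the revision texts of one page as a list of strings in chronological order. It treats a revision as "addition only" if the older text, split on whitespace, is a subsequence of the newer text, so pure whitespace changes are ignored.

```python
from typing import List, Optional, Tuple


def added_tokens(old_text: str, new_text: str) -> Optional[List[str]]:
    """Return the tokens added in new_text if the edit is addition-only.

    An edit counts as addition-only when every whitespace-separated token
    of old_text still appears in new_text in the same order and at least
    one new token was inserted. Otherwise None is returned.
    """
    old = old_text.split()
    new = new_text.split()
    added = []
    i = 0  # position of the next old token that still has to be matched
    for tok in new:
        if i < len(old) and tok == old[i]:
            i += 1             # token kept from the older revision
        else:
            added.append(tok)  # token not accounted for by the older revision
    if i == len(old) and added:
        return added
    return None                # something was deleted, replaced, or nothing changed


def addition_only_pairs(revisions: List[str]) -> List[Tuple[int, int, List[str]]]:
    """Scan adjacent revision pairs and keep the addition-only ones.

    Returns (older_index, newer_index, added_tokens) triples.
    """
    result = []
    for idx, (older, newer) in enumerate(zip(revisions, revisions[1:])):
        added = added_tokens(older, newer)
        if added is not None:
            result.append((idx, idx + 1, added))
    return result


if __name__ == "__main__":
    # Hypothetical revision history of a single page.
    history = [
        "The cat sat.",
        "The black cat sat.",            # addition only
        "The black dog sat.",            # replacement, not addition only
        "The black dog sat. It slept.",  # addition only
    ]
    for older, newer, tokens in addition_only_pairs(history):
        print(older, newer, tokens)
```

Comparing tokens rather than characters keeps the check simple, but a moved paragraph will show up as a deletion plus an addition, so you may want a proper diff for such cases.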

 

-Torsten

 

 

From: wiki-research-l-bounces@lists.wikimedia.org [mailto:wiki-research-l-bounces@lists.wikimedia.org] On Behalf Of Rami Al-Rfou'
Sent: Tuesday, October 18, 2011 9:29 PM
To: Research into Wikimedia content and communities
Cc: Yanqing Chen
Subject: [Wiki-research-l] Wikipedia Users database

 

Hi,

 

I am planning to study differences in users' editing styles and their spelling errors in the English Wikipedia as part of a research project I am involved in.

 

So I downloaded some of the partial Wikipedia XML dumps and converted them to SQL. My understanding is that Wikipedia stores every revision of each page in the database.

 

  • I cannot see the users table. Is the users table stored in a separate partial dump?
  • Does the users table contain any properties related to the user's country, preferred Wikipedias, or skill in different languages?
  • I am interested in user edits that only add text to articles, not edits that modify or delete text. I am currently planning to diff consecutive revisions to get this data. Are you aware of any tool or effort that can help?
  • Are you aware of any tools that extract the plain text from Wikipedia markup?

Regards.

 

--

Rami Al-Rfou'

PhD student at Stony Brook University