I had a discussion on IRC, and it turns out this statement of mine is not true:

"More so when an admin revert is preformed it seemingly overwrites the earlier revision itself - at least for the datadumps so the ~210gb of archives are barred of any vandalism at all at the moment."

Just mentioning that so nobody puts their thinking caps on unnecessarily. My other problem, however, is still unresolved. :)

---------- Forwarded message ----------
From: White Cat <wikipedia.kawaii.neko@gmail.com>
Date: Thu, Mar 17, 2011 at 12:14 AM
Subject: Re: [Xmldatadumps-l] [xmldatadumps-l] Torrents
To: xmldatadumps-l@lists.wikimedia.org


Ah, sorry, I just realized I wasn't emailing xmldatadumps-l@lists.wikimedia.org at all and was instead emailing the individual people replying to me.

Thanks to the much-appreciated help from people like Jamie Morken and Tim Starling, I have the data dumps up and running. I have also set up my database to record the filtered version of the dump on my end, and the code seems to be working just fine.
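In case it is useful to anyone, the recording step is roughly the following (a minimal sketch rather than my actual code; the file name, the SQLite table layout and what "filtering" means are all just placeholders):

import bz2
import sqlite3
import xml.etree.ElementTree as ET

# Check the xmlns on your dump's <mediawiki> root element; 0.4 is only a guess here.
NS = '{http://www.mediawiki.org/xml/export-0.4/}'

db = sqlite3.connect('filtered.db')
db.execute('CREATE TABLE IF NOT EXISTS revisions '
           '(page_id INTEGER, rev_id INTEGER, comment TEXT, text TEXT)')

with bz2.BZ2File('enwiki-20110115-pages-meta-history1.xml.bz2') as dump:
    for _, elem in ET.iterparse(dump):
        if elem.tag != NS + 'page':
            continue
        page_id = int(elem.findtext(NS + 'id'))
        for rev in elem.findall(NS + 'revision'):
            db.execute('INSERT INTO revisions VALUES (?, ?, ?, ?)',
                       (page_id,
                        int(rev.findtext(NS + 'id')),
                        rev.findtext(NS + 'comment') or '',
                        rev.findtext(NS + 'text') or ''))
        elem.clear()  # drop the processed page element so memory stays bounded
    db.commit()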

Just to recap: I am trying to analyse the data dumps themselves to create a database of words used in vandalism.

My current strategy is to take the revision right before an admin revert was performed and process it as vandalism/unwanted. I am not sure this is the best strategy and am open to ideas.
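Concretely, the labelling pass looks something like this at the moment (a simplified sketch; the revert detection is just a regular expression over edit summaries, which is obviously crude, and the field names are the same placeholders as in the earlier sketch):

import re

# Rough approximation of rollback/undo edit summaries; not an exhaustive list.
REVERT_RE = re.compile(r'\brevert|\brv\b|rollback|undid revision', re.IGNORECASE)

def label_vandalism(revisions):
    """revisions: a page's revisions in chronological order, each a dict with
    at least 'comment' and 'text'. Returns the revisions that sat immediately
    before a revert."""
    flagged = []
    for prev, curr in zip(revisions, revisions[1:]):
        if REVERT_RE.search(curr.get('comment') or ''):
            flagged.append(prev)
    return flagged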

From my many years of experience as a Wikipedian, I know that vandals often spread their vandalism across multiple edits. My current strategy would only catch the last revision, which may miss much of the vandalism itself; I quickly realized this flaw in my current strategy.
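One way around that (again only a sketch, reusing REVERT_RE from above, and it assumes each revision dict also carries a 'contributor' field) would be to walk backwards from the revert and collect every consecutive edit by the same editor, not just the last one:

def label_vandalism_run(revisions):
    """Like label_vandalism above, but collects the whole run of consecutive
    edits by the same editor leading up to the revert."""
    flagged = []
    for i in range(1, len(revisions)):
        if not REVERT_RE.search(revisions[i].get('comment') or ''):
            continue
        culprit = revisions[i - 1].get('contributor')
        j = i - 1
        while j >= 0 and revisions[j].get('contributor') == culprit:
            flagged.append(revisions[j])
            j -= 1
    return flagged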

Aside from that, I noticed that each individual revision contains the full text of the article as it stood when the edit was saved. No wonder these dumps are so large; the majority of the information is a redundant copy of the entire article. I only care about what was removed by the admin revert, and I am not sure how to extract just that part.
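The best I have come up with so far is diffing the reverted revision against the revision the admin restored and keeping only the lines that the revert made disappear (a sketch with Python's difflib; I have no idea yet whether a word-level diff would work better):

import difflib

def removed_by_revert(bad_text, restored_text):
    """Lines present in the vandalised revision but gone from the revision the
    admin reverted to, i.e. roughly what the revert removed."""
    diff = difflib.ndiff(restored_text.splitlines(), bad_text.splitlines())
    return [line[2:] for line in diff if line.startswith('+ ')]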

More so, when an admin revert is performed it seemingly overwrites the earlier revision itself - at least in the data dumps, so the ~210 GB of archives are devoid of any vandalism at all at the moment.

Once I have my vandalism database, I'll apply various methods of statistical analysis and filters to exclude certain "good" words. I expect common words like "the" or "I" to be collected along with the "vandalism" words. The exact methods I'll use will depend on the processed data, so that's a decision I want to leave for later on.
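My first attempt will probably be something as simple as comparing each word's frequency in the "vandalism" text against its frequency in ordinary revisions and keeping the heavily over-represented ones (a sketch only; the thresholds are arbitrary):

import re
from collections import Counter

WORD_RE = re.compile(r"[a-z']+")

def word_counts(texts):
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    return counts

def vandalism_words(vandal_texts, normal_texts, min_ratio=5.0, min_count=10):
    """Words at least min_ratio times more frequent (per word of text) in
    vandalism than in normal revisions; common words like 'the' drop out
    because their rates are similar in both corpora."""
    bad, good = word_counts(vandal_texts), word_counts(normal_texts)
    bad_total, good_total = sum(bad.values()) or 1, sum(good.values()) or 1
    scores = {}
    for word, count in bad.items():
        if count < min_count:
            continue
        bad_rate = count / bad_total
        good_rate = (good.get(word, 0) + 1) / good_total  # +1 smoothing
        if bad_rate / good_rate >= min_ratio:
            scores[word] = bad_rate / good_rate
    return scores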

On Wed, Mar 16, 2011 at 11:22 AM, emijrp <emijrp@gmail.com> wrote:
Can you use an old full dump (from 2010)? Or do you need an updated one?

2011/2/26 White Cat <wikipedia.kawaii.neko@gmail.com>
http://dumps.wikimedia.org/enwiki/20110115/

Hi, has anyone got plans to create individual torrents for "All pages with complete page edit history (.bz2)"? I downloaded them and it turns out several of the files seem to be corrupted. I am unable to re-download them, but I feel a torrent would be able to repair the corrupted parts. All of the individual parts of the dump are complete except for the 1st, 8th, 9th and 10th ones.
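(In case it helps anyone else checking their copies: the dump directory publishes an md5sums file, so something along the lines of the sketch below - file names assumed - should say exactly which parts are corrupted before re-downloading anything.)

import hashlib

def md5sum(path, chunk=1 << 20):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk), b''):
            h.update(block)
    return h.hexdigest()

# Assumed format of the published md5sums file: "<md5>  <filename>" per line.
with open('enwiki-20110115-md5sums.txt') as sums:
    for line in sums:
        expected, name = line.split()
        if 'pages-meta-history' not in name:
            continue
        try:
            status = 'OK' if md5sum(name) == expected else 'CORRUPTED'
        except IOError:
            status = 'missing'
        print(name, status)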

I need these dumps because I will analyse revisions in hopes of better identifying vandalism on the wikis through machine learning. However, I need the database soon so I can process it, as my assignment is due in about a month.

_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l