I had a discussion on IRC, and it turns out this statement of mine is not true:

"More so, when an admin revert is performed it seemingly overwrites the
earlier revision itself - at least for the data dumps, so the ~210GB of
archives are free of any vandalism at all at the moment."

Just mentioning that so nobody puts their thinking caps on unnecessarily.
My other problem, however, is still unresolved. :)
---------- Forwarded message ----------
From: White Cat <wikipedia.kawaii.neko(a)gmail.com>
Date: Thu, Mar 17, 2011 at 12:14 AM
Subject: Re: [Xmldatadumps-l] [xmldatadumps-l] Torrents
To: xmldatadumps-l(a)lists.wikimedia.org
Ah, sorry - I just realized I wasn't emailing
xmldatadumps-l(a)lists.wikimedia.org at all and was instead replying
directly to the individual people talking to me.
Thanks to the much-appreciated help from people like Jamie Morken and Tim
Starling, I have the data dumps up and running. I have also set up my
database to record the filtered version of the dump on my end, and the
code seems to be working just fine.
Just to recap: I am trying to analyse the data dumps to build a database
of words used in vandalism.

My current strategy is to take the edit made right before an admin revert
was performed and process it as vandalism/unwanted. I am not sure this is
the best strategy and am open to ideas.
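Here is a rough sketch of what I mean, in Python. It assumes the revisions
of a single page have already been pulled out of the dump as (user, text)
pairs in chronological order, that is_admin is a hypothetical lookup for
admin usernames, and that a revert is a revision whose text exactly matches
some earlier revision's text:

    import hashlib

    def find_reverted_revisions(revisions, is_admin):
        """revisions: (user, text) pairs for one page, oldest first.
        Returns indexes of revisions flagged as vandalism."""
        flagged = []
        seen = {}  # sha1 of text -> index of earliest revision with it
        for i, (user, text) in enumerate(revisions):
            digest = hashlib.sha1(text.encode('utf-8')).hexdigest()
            if digest in seen and is_admin(user):
                # The admin restored an earlier version, so treat the
                # revision just before the revert as the vandalism.
                flagged.append(i - 1)
            seen.setdefault(digest, i)
        return flagged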
From my experience as a Wikipedian for many years, vandals often spread
their vandalism across multiple edits. My current strategy would only
catch the last revision before the revert, which may miss the vandalism
itself. I quickly realized this flaw in my strategy.
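One fix (again only a sketch, under the same assumptions as above) would
be to remember which earlier revision the admin restored and flag
everything between it and the revert, instead of just the last edit:

    import hashlib

    def find_reverted_runs(revisions, is_admin):
        """Flag every revision between the restored version and the
        revert, not just the last one before it."""
        flagged = []
        seen = {}  # sha1 of text -> index of earliest revision with it
        for i, (user, text) in enumerate(revisions):
            digest = hashlib.sha1(text.encode('utf-8')).hexdigest()
            if digest in seen and is_admin(user):
                restored = seen[digest]
                # Everything after the restored revision and before
                # the revert is part of the vandalism run.
                flagged.extend(range(restored + 1, i))
            seen.setdefault(digest, i)
        return flagged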
Aside from that, I noticed that each individual revision contains
everything that was on the page when the edit was saved. No wonder these
dumps are so large: the majority of the information is a redundant copy of
the entire article. I only care about what was removed by the admin
revert, but I am not sure how to extract only that removed text.
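One diff-based possibility (untested at scale, just an illustration) is to
compare the last vandalized revision against the revision the admin
restored and keep only what the revert took out, e.g. with Python's
difflib:

    import difflib

    def removed_by_revert(vandalized, restored):
        """Return the lines the admin revert removed: lines present
        in the vandalized revision but not in the restored one."""
        old = restored.splitlines()
        new = vandalized.splitlines()
        matcher = difflib.SequenceMatcher(None, old, new)
        removed = []
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op in ('insert', 'replace'):
                removed.extend(new[j1:j2])  # the vandal's additions
        return removed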
More so, when an admin revert is performed it seemingly overwrites the
earlier revision itself - at least for the data dumps, so the ~210GB of
archives are free of any vandalism at all at the moment.
Once I have my vandalism database, I'll apply various statistical analyses
and filters to exclude certain "good" words. I expect common words like
"the" or "I" to be collected along with the "vandalism" words. The exact
methods I'll use will depend on the processed data, so that's a
consideration I want to deal with later on.
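As one illustration of the kind of filter I have in mind (the exact method
is still open, and the min_ratio threshold below is an arbitrary
assumption), a word could be kept only if it is proportionally much more
frequent in reverted text than in ordinary article text:

    import re
    from collections import Counter

    def vandalism_words(vandal_texts, clean_texts, min_ratio=5.0):
        """Keep words at least min_ratio times more frequent
        (proportionally) in vandalism than in clean text."""
        def tokenize(s):
            return re.findall(r"[a-z']+", s.lower())
        vandal = Counter(w for t in vandal_texts for w in tokenize(t))
        clean = Counter(w for t in clean_texts for w in tokenize(t))
        v_total = float(sum(vandal.values()) or 1)
        c_total = float(sum(clean.values()) or 1)
        kept = {}
        for word, count in vandal.items():
            v_rate = count / v_total
            c_rate = (clean[word] + 1) / c_total  # +1 smoothing
            if v_rate / c_rate >= min_ratio:
                kept[word] = v_rate / c_rate
        return kept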
On Wed, Mar 16, 2011 at 11:22 AM, emijrp <emijrp(a)gmail.com> wrote:
Can you use an old full dump (from 2010)? Or do you
need an updated one?
2011/2/26 White Cat <wikipedia.kawaii.neko(a)gmail.com>
http://dumps.wikimedia.org/enwiki/20110115/
Hi, has anyone got plans to create individual torrents for "All pages with
complete page edit history (.bz2)"? I downloaded them, and it turns out
several of the files seem to be corrupted. I am unable to re-download
them, but I feel a torrent would be able to repair the corrupted parts.
All of the individual parts of the dump are complete except the 1st, 8th,
9th, and 10th.

I need these dumps because I will analyse revisions in the hope of better
identifying vandalism on the wikis through machine learning. However, I
need the database soon so I can process it, as my assignment is due in
about a month.
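For what it's worth, one way to spot which parts are corrupted would be to
check each downloaded part against the md5sums listing published alongside
the dump; a rough sketch, with the filename below only a guess at the
naming scheme:

    import hashlib

    def md5_of(path, chunk_size=1 << 20):
        """MD5 of a (possibly very large) dump part, read in chunks."""
        digest = hashlib.md5()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                digest.update(chunk)
        return digest.hexdigest()

    def verify_parts(md5sums_file):
        """md5sums_file: the '<hash>  <filename>' listing published
        with the dump, e.g. enwiki-20110115-md5sums.txt."""
        for line in open(md5sums_file):
            if not line.strip():
                continue
            expected, name = line.split()
            try:
                ok = md5_of(name) == expected
            except IOError:
                ok = False  # part missing entirely
            print('%s %s' % ('OK ' if ok else 'BAD', name))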
_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l