Hi Rami,
Hi Diederik,

I have two questions:

- Which algorithm did you use to compute the added/removed content between two revisions of a Wikipedia article? (A toy sketch of what I mean follows below.)
- What is the size of the diffdb dump after extraction? I do not want to waste Wikipedia bandwidth if I already know I cannot handle it ;).
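
To make concrete what I mean by added/removed content, here is a toy sketch of the kind of diff I am after, using Python's difflib purely as a stand-in (my own assumption for illustration, not necessarily the algorithm Wikihadoop actually uses):

# Toy illustration: difflib's SequenceMatcher as a stand-in for whatever
# diff algorithm Wikihadoop really uses.
import difflib

def added_removed(old_rev, new_rev):
    """Return (added, removed) word lists between two revision texts."""
    old_words = old_rev.split()
    new_words = new_rev.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words)
    added, removed = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            added.extend(new_words[j1:j2])
        if tag in ("delete", "replace"):
            removed.extend(old_words[i1:i2])
    return added, removed

rev1 = "Wikipedia is a free encyclopedia."
rev2 = "Wikipedia is a free online encyclopedia anyone can edit."
added, removed = added_removed(rev1, rev2)
print("added:  ", added)    # ['online', 'encyclopedia', 'anyone', 'can', 'edit.']
print("removed:", removed)  # ['encyclopedia.']

This is word-level only for simplicity; I realise the real diffs may work at a different granularity (characters, sentences, or whole paragraphs).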
By the way, what you did is exactly what I had just started implementing for my project, so thanks a lot :)

Regards.

On Fri, Nov 4, 2011 at 13:19, Diederik van Liere <dvanliere@gmail.com> wrote:
Dear Wiki Researchers,

During the summer we have worked on Wikihadoop [0], a tool that allows us to create the diffs between two revisions of a Wiki article using Hadoop. Now I am happy to announce that the entire diffdb is available for download at http://dumps.wikimedia.org/other/diffdb/

This dataset is based on the English Wikipedia April 2011 XML dump files. The advantages of this dataset are that:
a) You can search for specific content being added/removed
b) You can measure more accurately how much text an editor has added or removed

We are currently working on a Lucene-based application [1] that will allow us to quickly search for specific strings being added or removed.

If you have any questions, then please let me know!

Best regards,
Diederik van Liere
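
Regarding point (b) above: this is the kind of rough sketch I had started for my own project, purely my own assumption (plain revision texts diffed with Python's difflib), not how Wikihadoop/diffdb actually computes it:

# Rough sketch: tally per-editor characters added/removed across consecutive
# revisions; difflib is only a stand-in for the real diff algorithm.
import difflib
from collections import defaultdict

def added_removed_chars(old_rev, new_rev):
    """Characters added and removed between two revision texts."""
    m = difflib.SequenceMatcher(None, old_rev, new_rev)
    added = removed = 0
    for tag, i1, i2, j1, j2 in m.get_opcodes():
        if tag in ("insert", "replace"):
            added += j2 - j1
        if tag in ("delete", "replace"):
            removed += i2 - i1
    return added, removed

# Hypothetical input: (editor, revision text) pairs in chronological order.
revisions = [
    ("Alice", "Wikipedia is an encyclopedia."),
    ("Bob",   "Wikipedia is a free encyclopedia."),
    ("Alice", "Wikipedia is a free online encyclopedia."),
]

totals = defaultdict(lambda: [0, 0])  # editor -> [chars added, chars removed]
prev_text = ""
for editor, text in revisions:
    a, r = added_removed_chars(prev_text, text)
    totals[editor][0] += a
    totals[editor][1] += r
    prev_text = text

for editor, (a, r) in totals.items():
    print(editor, "+%d chars, -%d chars" % (a, r))

Having the diffs precomputed in the dump obviously saves exactly this pairwise diffing step over the full revision history.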
--
Rami Al-Rfou'
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l