Hi Diederik,
I have two questions:
1. Which algorithm did you use to get the added/removed content between
two revisions of a Wikipedia article?
2. What is the size of the diffdb dump after extraction? I do not want
to waste Wikipedia's bandwidth if I already know I cannot handle it ;).
By the way, what you did is exactly what I had just started implementing
for my project (sketched below), so thanks a lot :)
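For reference, the naive approach I had started sketching for a single
pair of revisions, using Python's difflib (just to illustrate what I mean
by added/removed content, not a guess at your algorithm):

    import difflib

    def added_removed(old_text, new_text):
        # Line-level diff of two revision texts -> (added_lines, removed_lines).
        old_lines = old_text.splitlines()
        new_lines = new_text.splitlines()
        matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
        added, removed = [], []
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op in ("insert", "replace"):
                added.extend(new_lines[j1:j2])
            if op in ("delete", "replace"):
                removed.extend(old_lines[i1:i2])
        return added, removed

    # added, removed = added_removed(rev_n_text, rev_n_plus_1_text)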
Regards.
On Fri, Nov 4, 2011 at 13:19, Diederik van Liere <dvanliere(a)gmail.com> wrote:
Dear Wiki Researchers,
During the summer we worked on Wikihadoop [0], a tool that allows us
to create the diffs between two revisions of a Wiki article using Hadoop.
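Conceptually, for each article the job walks the revision history in
timestamp order and diffs each consecutive pair. A simplified pure-Python
sketch of that idea (illustration only, not the actual Hadoop code):

    import difflib

    def diff_consecutive(revisions):
        # revisions: one article's revisions as (rev_id, text), sorted by timestamp.
        prev_text = ""
        for rev_id, text in revisions:
            diff = "\n".join(difflib.unified_diff(prev_text.splitlines(),
                                                  text.splitlines(), lineterm=""))
            yield rev_id, diff
            prev_text = text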
Now I am happy to announce that the entire diffdb is available for
download at
http://dumps.wikimedia.org/other/diffdb/
This dataset is based on the English Wikipedia April 2011 XML dump files.
The advantages of this dataset are that:
a) you can search for specific content being added / removed, and
b) you can measure more accurately how much text an editor has added or
removed (see the sketch below).
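As an illustration of (b), once you have per-revision diff records,
measuring an editor's contribution is a simple aggregation. A rough
sketch, with hypothetical field names rather than the actual diffdb
schema:

    from collections import defaultdict

    def editor_totals(diff_records):
        # diff_records: dicts like {"editor": ..., "added": [...], "removed": [...]}
        # (hypothetical field names, not the actual diffdb schema).
        totals = defaultdict(lambda: [0, 0])  # editor -> [chars_added, chars_removed]
        for rec in diff_records:
            totals[rec["editor"]][0] += sum(len(s) for s in rec["added"])
            totals[rec["editor"]][1] += sum(len(s) for s in rec["removed"])
        return totals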
We are currently working on a Lucene-based application [1] that will allow
us to quickly search for specific strings being added or removed.
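Until that is ready, the idea can be approximated with a plain inverted
index from terms to the revisions whose diffs contain them (a toy
stand-in, not the Lucene application itself):

    from collections import defaultdict

    def build_index(diff_records):
        # Map each term to the revision ids whose added/removed text contains it.
        # Record layout is hypothetical: {"rev_id": ..., "added": [...], "removed": [...]}.
        index = defaultdict(set)
        for rec in diff_records:
            for line in rec["added"] + rec["removed"]:
                for term in line.lower().split():
                    index[term].add(rec["rev_id"])
        return index

    # build_index(records).get("vandalism", set()) -> matching revision ids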
If you have any questions, please let me know!
[0] https://github.com/whym/wikihadoop
[1] https://github.com/whym/diffindexer
Best regards,
Diederik van Liere
--
Rami Al-Rfou'
631-371-3165