Dear Wiki Researchers,
During the summer we worked on Wikihadoop [0], a tool that uses Hadoop to compute the diffs between two revisions of a Wiki article. I am now happy to announce that the entire diffdb is available for download at http://dumps.wikimedia.org/other/diffdb/
This dataset is based on the English Wikipedia April 2011 XML dump files. The advantages of this dataset are that: a) you can search for specific content being added or removed, and b) you can measure more accurately how much text an editor has added or removed.
We are currently working on a Lucene-based application [1] that will allow us to quickly search for specific strings being added or removed.
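To give an idea of what such an index could look like, here is a minimal sketch written against a recent Lucene API: one document per diff, with the added and removed text as searchable fields. The field names and the sample revision id are only illustrative assumptions; the actual diffindexer schema may well differ.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class DiffIndexSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        FSDirectory dir = FSDirectory.open(Paths.get("diff-index"));

        // Index one document per diff record: the revision id plus the text
        // that the edit added and removed (field names are made up here).
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
        Document doc = new Document();
        doc.add(new StringField("rev_id", "423104071", Field.Store.YES));
        doc.add(new TextField("added_text", "the red mat", Field.Store.YES));
        doc.add(new TextField("removed_text", "", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // Search for a specific string being added.
        DirectoryReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        Query q = new QueryParser("added_text", analyzer).parse("\"red mat\"");
        for (ScoreDoc hit : searcher.search(q, 10).scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("rev_id"));
        }
        reader.close();
    }
}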
If you have any questions, please let me know!
[0] https://github.com/whym/wikihadoop
[1] https://github.com/whym/diffindexer
Best regards,
Diederik van Liere
Hi Diederik,
I have two questions:
1. Which algorithm did you use to get the added/removed content between two revisions of Wikipedia?
2. What is the size of the diffdb dump after extraction? I do not want to waste Wikipedia's bandwidth if I know that I cannot deal with it ;).
By the way, what you did is exactly what I had just started implementing for my project, so thanks a lot :)
Regards.
Hi Rami,
If I recall correctly, we used the diff library from Google (http://code.google.com/p/google-diff-match-patch/), and the total size is about 420 GB after decompression.
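That library returns a list of insert/delete/equal segments, from which you can count how much text an edit added or removed. Below is a minimal sketch using the library's Java port; it only illustrates the idea and is not the actual Wikihadoop code, and the two revision strings are placeholders.

import java.util.LinkedList;
import name.fraser.neil.plaintext.diff_match_patch;

public class RevisionDiffSketch {
    public static void main(String[] args) {
        // Two placeholder revisions of the same article.
        String oldRev = "The cat sat on the mat.";
        String newRev = "The cat sat on the red mat. It purred.";

        diff_match_patch dmp = new diff_match_patch();
        LinkedList<diff_match_patch.Diff> diffs = dmp.diff_main(oldRev, newRev);
        dmp.diff_cleanupSemantic(diffs);  // merge tiny edits into readable chunks

        long added = 0, removed = 0;
        for (diff_match_patch.Diff d : diffs) {
            if (d.operation == diff_match_patch.Operation.INSERT) {
                added += d.text.length();
            } else if (d.operation == diff_match_patch.Operation.DELETE) {
                removed += d.text.length();
            }
            // EQUAL segments are text shared by both revisions.
        }
        System.out.println("chars added: " + added + ", chars removed: " + removed);
    }
}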
But you can also just download a couple of chunks and see if you can handle those.
Best,
Diederik