Re: [Wiki-research-l] Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files

17 Aug 2011

Perhaps fine-tuning it for EC2, maybe even hosting the dataset there? I can
see how this can be very useful! Otherwise... well... It seems like Hadoop
gives you a lot of overhead, and it is just not practical to do parsing this
way.

With a straightforward implementation in Python, on a single Core2 Duo you
can parse the dump (7z), compute diffs, md5, etc., and store everything in a
binary form in about 6-7 days.
For example, the implementation here: http://code.google.com/p/pymwdat/ can
do exactly that. I imagine that with faster C++ code and a modern i7 box
it can be done within a day.
After that, this precomputed binary form (diffs + metadata + stats take up
several times the size of the .7z dump, ~100 GB) can be serialized very
efficiently (just about an hour on a single box).
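
A minimal sketch of the per-revision work mentioned above (line diffs plus an md5 per revision), using only the Python standard library. This is not the pymwdat code; the function name `revision_records` and the record layout are hypothetical and chosen only for illustration, and the input is assumed to be the revision texts of one page in chronological order.

```python
import difflib
import hashlib


def revision_records(texts):
    """Yield one record per revision: its md5 and a line diff against the previous revision.

    `texts` is assumed to hold the revision texts of a single page, oldest first.
    The dict layout is hypothetical, not the pymwdat binary format.
    """
    previous = []
    for text in texts:
        lines = text.splitlines()
        matcher = difflib.SequenceMatcher(a=previous, b=lines, autojunk=False)
        added, removed = [], []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag in ("replace", "delete"):
                removed.extend(previous[i1:i2])
            if tag in ("replace", "insert"):
                added.extend(lines[j1:j2])
        yield {
            "md5": hashlib.md5(text.encode("utf-8")).hexdigest(),
            "added": added,
            "removed": removed,
        }
        previous = lines
```

The first revision is diffed against an empty page, so all of its lines count as added.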

That said, I still think using Hadoop/EC2 could be really nice,
particularly if the dump can be made available on S3/EC2.

-- Best, Dmitry

On Wed, Aug 17, 2011 at 3:07 PM, Diederik van Liere <dvanliere(a)gmail.com> wrote:

...
 So the 14-day task included XML parsing and creating diffs. We might gain
 performance improvements by fine-tuning the Hadoop configuration, although
 that seems to be more of an art than a science.
 Diederik

 On Wed, Aug 17, 2011 at 5:28 PM, Dmitry Chichkov <dchichkov(a)gmail.com> wrote:

 Hello,

 This is excellent news!

 Have you tried running it on Amazon EC2? It would be really nice to know
 how well WikiHadoop scales up with the number of nodes.
 Also, this timing ("3 x Quad Core / 14 days / full Wikipedia dump"): on
 what kind of task (XML parsing, diffs, md5, etc.) was it obtained?

 -- Best, Dmitry

 On Wed, Aug 17, 2011 at 9:58 AM, Diederik van Liere <dvanliere(a)gmail.com> wrote:

 Hello!

 Over the last few weeks, Yusuke Matsubara, Shawn Walker, Aaron Halfaker
 and Fabian Kaelin (who are all Summer of Research fellows)[0] have worked
 hard on a customized stream-based InputFormatReader that allows parsing of
 both bz2-compressed and uncompressed files of the full Wikipedia dump (the
 dump file with the complete edit histories) using Hadoop. Prior to
 WikiHadoop and the accompanying InputFormatReader, it was not possible to
 use Hadoop to analyze the full Wikipedia dump files (see the detailed
 tutorial / background for an explanation of why that was not possible).

 This means:
 1) We can now harness Hadoop's distributed computing capabilities in
 analyzing the full dump files.
 2) You can send either one or two revisions to a single mapper, so it's
 possible to diff two revisions and see what content has been added /
 removed.
 3) You can exclude namespaces by supplying a regular expression.
 4) We are using Hadoop's Streaming interface, which means people can use
 this InputFormatReader from different languages such as Java, Python, Ruby
 and PHP.
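
 As a rough illustration of point 2, here is a sketch of what a Streaming mapper in Python might look like. It rests on assumptions that are not stated in this announcement: that each record reaches the mapper on stdin as a `<page>` XML fragment holding a `<title>` and one or two `<revision>` elements, and that records can be split on the closing `</page>` tag. The names `iter_pages` and `diff_page` and the tab-separated output are hypothetical; the tutorial linked below describes the actual record format.

```python
import difflib
import sys
import xml.etree.ElementTree as ET

END_TAG = "</page>"


def iter_pages(stream):
    """Group input lines into <page> fragments (assumed record boundary: </page>)."""
    buffer = []
    for line in stream:
        buffer.append(line)
        if END_TAG in line:
            text = "".join(buffer)
            # Drop anything after the closing tag so the fragment parses cleanly.
            yield text[: text.index(END_TAG) + len(END_TAG)].lstrip()
            buffer = []


def diff_page(fragment):
    """Return (title, lines_added, lines_removed) for one page fragment, or None."""
    page = ET.fromstring(fragment)
    texts = [(rev.findtext("text") or "").splitlines() for rev in page.findall("revision")]
    if not texts:
        return None
    # With a single revision, diff against an empty page.
    old, new = ([], texts[0]) if len(texts) == 1 else (texts[0], texts[1])
    added = removed = 0
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed += i2 - i1
        if tag in ("replace", "insert"):
            added += j2 - j1
    return page.findtext("title", default=""), added, removed


if __name__ == "__main__":
    for fragment in iter_pages(sys.stdin):
        result = diff_page(fragment)
        if result is not None:
            # Emit tab-separated key/value output, as Streaming mappers conventionally do.
            print("%s\t%d\t%d" % result)
```

 A reducer could then sum the added and removed line counts per title.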

 The source code is available at: https://github.com/whym/wikihadoop
 A more detailed tutorial and installation guide is available at:
 https://github.com/whym/wikihadoop/wiki

 (Apologies for cross-posting to wikitech-l and wiki-research-l)

 [0] http://blog.wikimedia.org/2011/06/01/summerofresearchannouncement/

 Best,

 Diederik

 _______________________________________________
 Wiki-research-l mailing list
 Wiki-research-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l



