So the 14-day task included XML parsing and creating diffs. We might gain
performance improvements by fine-tuning the Hadoop configuration, although
that seems to be more of an art than a science.
Diederik
On Wed, Aug 17, 2011 at 5:28 PM, Dmitry Chichkov <dchichkov(a)gmail.com> wrote:
Hello,
This is excellent news!
Have you tried running it on Amazon EC2? It would be really nice to know
how well WikiHadoop scales up with the number of nodes.
Also, this timing, "3 x Quad Core / 14 days / full Wikipedia dump": on
what kind of task (XML parsing, diffs, md5, etc.) was it obtained?
-- Best, Dmitry
On Wed, Aug 17, 2011 at 9:58 AM, Diederik van Liere <dvanliere(a)gmail.com> wrote:
Hello!
Over the last few weeks, Yusuke Matsubara, Shawn Walker, Aaron Halfaker
and Fabian Kaelin (who are all Summer of Research fellows)[0] have worked
hard on a customized stream-based InputFormatReader that allows parsing of
both bz2-compressed and uncompressed files of the full Wikipedia dump (the
dump file with the complete edit histories) using Hadoop. Prior to WikiHadoop
and the accompanying InputFormatReader, it was not possible to use Hadoop to
analyze the full Wikipedia dump files (see the detailed tutorial / background
for an explanation of why that was not possible).
This means:
1) We can now harness Hadoop's distributed computing capabilities in
analyzing the full dump files.
2) You can send either one or two revisions to a single mapper, so it's
possible to diff two revisions and see what content has been added /
removed.
3) You can exclude namespaces by supplying a regular expression.
4) We are using Hadoop's Streaming interface, which means people can use
this InputFormatReader from different languages such as Java, Python, Ruby
and PHP.
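To make point 2 concrete, here is a minimal Python sketch of the kind of
logic a streaming mapper might apply to a pair of revisions. It is only an
illustration under stated assumptions: the record layout (a <page> element
holding a <title> and one or two <revision> elements, each with a <text>
child) is assumed here rather than taken from the WikiHadoop documentation,
and the helper names diff_revisions and summarize_pair are made up for this
example.

```python
import difflib
import xml.etree.ElementTree as ET


def diff_revisions(old_text, new_text):
    """Return (added, removed) lines between two revision texts."""
    added, removed = [], []
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
        # Lines starting with "  " (unchanged) or "? " (hints) are ignored.
    return added, removed


def summarize_pair(fragment):
    """Summarize one record as 'title<TAB>lines added<TAB>lines removed'.

    ASSUMPTION: `fragment` is a well-formed <page> element containing a
    <title> and one or two <revision> elements. With a single revision,
    the previous text is treated as empty.
    """
    page = ET.fromstring(fragment)
    title = page.findtext("title") or ""
    revisions = page.findall("revision")
    old_text = (revisions[0].findtext("text") or "") if len(revisions) > 1 else ""
    new_text = (revisions[-1].findtext("text") or "") if revisions else ""
    added, removed = diff_revisions(old_text, new_text)
    return "%s\t%d\t%d" % (title, len(added), len(removed))
```

A real mapper would read its records from standard input and write one such
tab-separated line per record to standard output, which is the contract
Hadoop Streaming expects; the tutorial linked below describes the actual
record format.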
The source code is available at:
https://github.com/whym/wikihadoop
A more detailed tutorial and installation guide is available at:
https://github.com/whym/wikihadoop/wiki
(Apologies for cross-posting to wikitech-l and wiki-research-l)
[0]
http://blog.wikimedia.org/2011/06/01/summerofresearchannouncement/
Best,
Diederik
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l