Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files

Diederik van Liere

17 Aug 2011

9:58 a.m.

Hello!

Over the last few weeks, Yusuke Matsubara, Shawn Walker, Aaron Halfaker and Fabian Kaelin (who are all Summer of Research fellows)[0] have worked hard on a customized stream-based InputFormatReader that allows parsing of both bz2 compressed and uncompressed files of the full Wikipedia dump (dump file with the complete edit histories) using Hadoop. Prior to WikiHadoop and the accompanying InputFormatReader it was not possible to use Hadoop to analyze the full Wikipedia dump files (see the detailed tutorial / background for an explanation why that was not possible).

This means:

1) We can now harness Hadoop's distributed computing capabilities to analyze the full dump files.
2) You can send either one or two revisions to a single mapper, so it is possible to diff two revisions and see what content has been added or removed.
3) You can exclude namespaces by supplying a regular expression.
4) We are using Hadoop's Streaming interface, which means people can use this InputFormatReader from different languages such as Java, Python, Ruby and PHP.
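To illustrate point 3, here is a minimal Python sketch of excluding namespaces with a regular expression. It is not WikiHadoop's code: the pattern, the `keep_page` helper and the idea of matching against the title prefix are all assumptions made for illustration, so check the tutorial on the wiki for how the option is actually supplied and applied.

```python
import re

# Hypothetical exclusion pattern: drop pages whose title starts with one of
# these namespace prefixes. The prefixes are examples, not a complete list.
EXCLUDE = re.compile(r"^(Talk|User|User talk|Wikipedia):")


def keep_page(title, exclude=EXCLUDE):
    """Return True when the title does not match the exclusion pattern."""
    return exclude.match(title) is None


titles = ["Hadoop", "Talk:Hadoop", "User:Example", "Wikipedia:About", "Apache Hadoop"]
kept = [t for t in titles if keep_page(t)]
print(kept)
```

Only the main-namespace titles survive the filter in this example.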

The source code is available at: https://github.com/whym/wikihadoop

A more detailed tutorial and installation guide is available at: https://github.com/whym/wikihadoop/wiki

(Apologies for cross-posting to wikitech-l and wiki-research-l)

[0] http://blog.wikimedia.org/2011/06/01/summerofresearchannouncement/

Best,

Diederik

Dmitry Chichkov

17 Aug

2:28 p.m.

New subject: Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files

Hello,

This is excellent news!

Have you tried running it on Amazon EC2? It would be really nice to know how well WikiHadoop scales up with the number of nodes. Also, this timing, "3 x Quad Core / 14 days / full Wikipedia dump": on what kind of task (XML parsing, diffs, MD5, etc.) was it obtained?

-- Best, Dmitry

Aaron Halfaker

3:02 p.m.

New subject: Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files

We haven't tried EC2. Since this use case was really strange for Hadoop (if we break the dump up by pages, some are > 160GB!!), we have been rolling our own code and testing on our own machines so that we had the flexibility to get things working. Theoretically, one should be able to use this on EC2 (or another large cluster) to get vastly improved runtime.

We are generating diffs (using http://code.google.com/p/google-diff-match-patch/) in PyPy via the streaming interface. I expect much better performance from tasks that are less computationally intensive.
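For readers who want a feel for the diff step, here is a rough Python sketch of comparing two revision texts. It uses the standard library's `difflib` rather than google-diff-match-patch, works on whitespace-separated words, and the `diff_revisions` helper is a made-up name, so treat it as an illustration and not as the code run in the job described above.

```python
import difflib


def diff_revisions(old_text, new_text):
    """Return (added, removed) word lists between two revision texts."""
    old_words = old_text.split()
    new_words = new_text.split()
    # autojunk=False keeps frequent words from being treated as junk.
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    added, removed = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(old_words[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_words[j1:j2])
    return added, removed


old = "Hadoop is a framework for batch jobs"
new = "Hadoop is an open source framework for jobs"
added, removed = diff_revisions(old, new)
print("added:", added)
print("removed:", removed)
```

In a streaming mapper, a function like this would be called on each pair of revisions read from standard input; the exact record layout delivered to the mapper is documented in the WikiHadoop wiki and is not reproduced here.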

-Aaron

Diederik van Liere

3:07 p.m.

New subject: Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files

So the 14-day task included XML parsing and creating diffs. We might gain performance improvements by fine-tuning the Hadoop configuration, although that seems to be more of an art than a science.

Diederik

Dmitry Chichkov

4:14 p.m.

New subject: Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files

Perhaps fine-tuning it for EC2, maybe even hosting the dataset there? I can see how this can be very useful! Otherwise... well... It seems like Hadoop gives you a lot of overhead, and it is just not practical to do parsing this way.

With a straightforward implementation in Python, on a single Core 2 Duo, you can parse the dump (7z), compute diffs, MD5 hashes, etc., and store everything in a binary form in about 6-7 days. For example, the implementation at http://code.google.com/p/pymwdat/ can do exactly that. I imagine that with faster C++ code and a modern i7 box it could be done within a day. After that, this precomputed binary form (diffs + metadata + stats come to several times the size of the .7z dump, ~100 GB) can be serialized very efficiently (in about an hour on a single box).

That said, I still think using Hadoop/EC2 could be really nice, particularly if the dump can be made available on S3/EC2.
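As a rough illustration of the single-machine approach, the sketch below streams a dump-like XML document and computes an MD5 digest per revision. It is not taken from pymwdat: the `revision_hashes` helper and the tiny sample document are assumptions, and real dump files declare an XML namespace on their elements, so the tag names would need to be adjusted before running this on actual data.

```python
import hashlib
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump file; real dumps are far larger and use an
# XML namespace on every element.
SAMPLE = """<mediawiki>
  <page>
    <title>Hadoop</title>
    <revision><id>1</id><text>first version</text></revision>
    <revision><id>2</id><text>second version</text></revision>
  </page>
</mediawiki>"""


def revision_hashes(stream):
    """Yield (title, revision id, MD5 hex digest) while streaming the XML."""
    title = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "title":
            title = elem.text
        elif elem.tag == "revision":
            text = elem.findtext("text") or ""
            digest = hashlib.md5(text.encode("utf-8")).hexdigest()
            yield title, elem.findtext("id"), digest
            # Drop the revision's text and children once it has been hashed.
            elem.clear()


rows = list(revision_hashes(io.BytesIO(SAMPLE.encode("utf-8"))))
for row in rows:
    print(row)
```

Because `iterparse` reads the input incrementally, the same loop could be pointed at a decompressed stream instead of the in-memory sample.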

-- Best, Dmitry

Dirk Riehle

14 Sep

4:28 a.m.

New subject: [Wikitech-l] Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files

Hello everyone!

Wikihadoop sounds like a great project!

I wanted to point out that you can make it even more powerful for many research applications by combining it with the Sweble Wikitext parser.

Doing so would enable Wikipedia dump processing not only at the coarse level of the XML dump, but also at the fine-grained level of individual elements (bold text, headings, paragraphs, categories, pages, etc.).

You can learn more about Sweble here: http://sweble.org

Cheers, Dirk

-- Website: http://dirkriehle.com - Twitter: @dirkriehle Ph (DE): +49-157-8153-4150 - Ph (US): +1-650-450-8550
