Hello!
Over the last few weeks, Yusuke Matsubara, Shawn Walker, Aaron Halfaker and Fabian Kaelin (who are all Summer of Research fellows)[0] have worked hard on a customized stream-based InputFormatReader that allows parsing of both bz2-compressed and uncompressed files of the full Wikipedia dump (the dump files with complete edit histories) using Hadoop. Prior to WikiHadoop and the accompanying InputFormatReader, it was not possible to use Hadoop to analyze the full Wikipedia dump files (see the detailed tutorial / background for an explanation of why that was not possible).
This means:
1) We can now harness Hadoop's distributed computing capabilities for analyzing the full dump files.
2) You can send either one or two revisions to a single mapper, so it's possible to diff two revisions and see what content has been added / removed.
3) You can exclude namespaces by supplying a regular expression.
4) We are using Hadoop's Streaming interface, which means people can use this InputFormatReader from different languages such as Java, Python, Ruby and PHP (see the mapper sketch below).
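As a sketch of what 2) and 4) make possible: a Hadoop Streaming mapper is just a program that reads records from standard input, so it can be written in plain Python. The snippet below is hypothetical and only an illustration; it assumes each record is a <page> fragment wrapping one or two <revision> elements (the exact record layout produced by the InputFormatReader is documented in the wiki linked below), and it counts the lines added and removed between the two revisions of each pair.

#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper -- a sketch, not part of WikiHadoop.
# Assumption: stdin carries XML in which each record is a <page> fragment
# containing one or two <revision> elements; see the WikiHadoop wiki for
# the exact record layout emitted by the InputFormatReader.
import difflib
import re
import sys

TEXT = re.compile(r'<text[^>]*>(.*?)</text>', re.DOTALL)

def diff_counts(old, new):
    # Count lines added and removed between two revision texts.
    added = removed = 0
    for line in difflib.ndiff(old.splitlines(), new.splitlines()):
        if line.startswith('+ '):
            added += 1
        elif line.startswith('- '):
            removed += 1
    return added, removed

def main():
    for page in sys.stdin.read().split('</page>'):
        texts = TEXT.findall(page)
        if len(texts) == 2:      # a revision pair: diff old vs. new
            added, removed = diff_counts(texts[0], texts[1])
        elif len(texts) == 1:    # a single revision, e.g. a page creation
            added, removed = len(texts[0].splitlines()), 0
        else:
            continue
        # Emit tab-separated key/value pairs for the reducer to aggregate.
        sys.stdout.write('added\t%d\n' % added)
        sys.stdout.write('removed\t%d\n' % removed)

if __name__ == '__main__':
    main()

Such a script would be handed to the hadoop-streaming jar via its -mapper option, with WikiHadoop's input format selected through -inputformat; the installation guide below has the exact class name and invocation.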
The source code is available at: https://github.com/whym/wikihadoop
A more detailed tutorial and installation guide is available at: https://github.com/whym/wikihadoop/wiki
(Apologies for cross-posting to wikitech-l and wiki-research-l)
[0] http://blog.wikimedia.org/2011/06/01/summerofresearchannouncement/
Best,
Diederik
Very cool!
--tomasz
Way cool - Look forward to a brown bag on this project - Diederik? :-)
-Alolita
Hello everyone!
Wikihadoop sounds like a great project!
I wanted to point out that you can make it even more powerful for many research applications by combining it with the Sweble Wikitext parser.
Doing so, you could enable Wikipedia dump processing not only at the raw XML dump level, but at the level of individual, fine-grained elements (bold spans, headings, paragraphs, categories, pages, etc.).
You can learn more about Sweble here: http://sweble.org
Cheers, Dirk