I've long been interested in offline tools that make use of WikiMedia information, particularly the English Wiktionary.
I've recently come across a tool which can provide random access to a bzip2 archive without decompressing it, and I would like to make use of it in my own tools. However, I can't get it to compile or run correctly with any free Windows compiler I have access to. It works fine on the *nix boxes I have tried, but my personal machine is a Windows XP netbook.
The tool is "seek-bzip2" by James Taylor and is available here: http://bitbucket.org/james_taylor/seek-bzip2
* The free Borland compiler won't compile it due to missing (Unix?) header files.
* lcc compiles it, but it always fails with the error "unexpected EOF".
* mingw compiles it if the -m64 option is removed from the Makefile, but it then has the same behaviour as the lcc build.
My C experience is now quite stale and my 64-bit programming experience negligible.
(I'm also interested in hearing from other people working on offline tools for dump files, wikitext parsing, or Wiktionary)
Andrew Dunbar (hippietrail)
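Some background on how random access into .bz2 is possible at all: bzip2 compresses its input in independent blocks of up to 900 KB, and each block begins with a 48-bit magic number that can land on any bit boundary. An indexer therefore scans the whole file bit by bit, recording the position of every magic it finds, and a reader like seek-bzip2 can then jump to one of those positions and decode a single block. The sketch below only illustrates that scanning step; it is not James Taylor's code, and a real indexer also has to tolerate the rare false positive where the magic value happens to occur inside compressed data.

    /*
     * Minimal sketch (not James Taylor's code): scan a .bz2 file for the
     * 48-bit block magic 0x314159265359, which may start at any bit
     * offset, and print the bit position of each block.  A random-access
     * reader can later seek to one of these positions and decode just
     * that block.  No bzip2 library is needed for the scan itself.
     */
    #include <stdio.h>
    #include <stdint.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s file.bz2\n", argv[0]);
            return 1;
        }
        FILE *f = fopen(argv[1], "rb");   /* "rb": binary mode matters on Windows */
        if (!f) { perror("fopen"); return 1; }

        const uint64_t BLOCK_MAGIC = 0x314159265359ULL; /* start of a data block */
        const uint64_t EOS_MAGIC   = 0x177245385090ULL; /* end-of-stream marker  */
        const uint64_t MASK        = 0xFFFFFFFFFFFFULL; /* low 48 bits           */

        uint64_t window = 0;   /* most recent bits read, newest bit in bit 0 */
        uint64_t bitpos = 0;   /* total number of bits consumed so far       */
        int c;

        while ((c = fgetc(f)) != EOF) {
            for (int i = 7; i >= 0; i--) {          /* bzip2 is bit-wise big-endian */
                window = (window << 1) | ((uint64_t)(c >> i) & 1);
                bitpos++;
                if (bitpos < 48)
                    continue;
                if ((window & MASK) == BLOCK_MAGIC)
                    printf("block starts at bit %llu\n",
                           (unsigned long long)(bitpos - 48));
                else if ((window & MASK) == EOS_MAGIC)
                    printf("end of stream at bit %llu\n",
                           (unsigned long long)(bitpos - 48));
            }
        }
        fclose(f);
        return 0;
    }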
On Thu, 16 Dec 2010 02:21:34 +1100, Andrew Dunbar hippytrail@gmail.com wrote:
I've long been interested in offline tools that make use of WikiMedia information, particularly the English Wiktionary.
Have a look at the openZIM project, which we launched exactly for this purpose:
* free software
* supported by the WMF
* compiles on many systems
* extremely fast
* uses LZMA2 (better than bzip2)
* though primarily aimed at Wikipedia content
* used by many other programs (Kiwix, for example)
* ...
Emmanuel
On 15/12/10 16:21, Andrew Dunbar wrote:
- The free Borland compiler won't compile it due to missing (Unix?) header files
- lcc compiles it but it always fails with error "unexpected EOF"
- mingw compiles it if the -m64 option is removed from the Makefile, but it then has the same behaviour as the lcc build.
Your problem is Windows text streams. The attached patch fixes it.
Thank you for the link. I was completely unaware of it when I basically did the same thing for mediawiki a couple years ago. http://www.wiki-web.es/mediawiki-offline-reader/
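Ángel's patch travelled as an attachment and is not reproduced here, but the usual shape of this class of fix is small. On Windows the C runtime opens files and the standard streams in text mode by default, translating line endings and treating a 0x1A byte as end-of-file, which is exactly how a binary read can die with "unexpected EOF". A sketch of the typical change follows; the open_archive function is illustrative and not taken from seek-bzip2's source.

    /*
     * Sketch of the usual Windows text-stream fix (not the actual attached
     * patch; open_archive is an illustrative name).  In text mode the
     * Microsoft C runtime translates CR/LF and stops reading at a 0x1A
     * byte, so binary data must be handled in binary mode.
     */
    #include <stdio.h>
    #ifdef _WIN32
    #include <io.h>      /* _setmode, _fileno */
    #include <fcntl.h>   /* _O_BINARY         */
    #endif

    FILE *open_archive(const char *path)
    {
    #ifdef _WIN32
        /* If decompressed output goes to stdout, that stream must be
         * binary too, or piped output gets corrupted. */
        _setmode(_fileno(stdout), _O_BINARY);
    #endif
        /* "rb" rather than "r": in text mode the first 0x1A byte in the
         * archive reads as end-of-file -- an "unexpected EOF". */
        return fopen(path, "rb");
    }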
2010/12/16 Ángel González keisial@gmail.com:
Your problem is Windows text streams. The attached patch fixes it.
Thank you for the link. I was completely unaware of it when I basically did the same thing for mediawiki a couple years ago. http://www.wiki-web.es/mediawiki-offline-reader/
Thanks Ángel! I feel like a fool for not realizing this. It's the same problem I've worked around many times in the past, but not recently. I just got a similar answer on stackoverflow.com.
By the way I'm keen to find something similar for .7z
It would be incredibly useful if these indices could be created as part of the dump creation process. Should I file a feature request?
Andrew Dunbar (hippietrail)
On Wed, Dec 15, 2010 at 12:01 PM, Andrew Dunbar hippytrail@gmail.com wrote:
By the way I'm keen to find something similar for .7z
I've written something similar for .xz, which uses LZMA2, the same as .7z. It creates a virtual read-only filesystem using FUSE (the FUSE part is in Perl, which uses pipes to dd and xzcat). The only real problem is that it doesn't use a stock .xz file; it uses a specially created one which concatenates lots of smaller .xz files (currently I concatenate between 5 and 20 or so 900K bz2 blocks into one .xz stream - between 5 and 20 because there's a preference to split on </page><page> boundaries).
Apparently the folks at openzim have done something similar, using LZMA2.
If anyone is interested in working with me to make a package capable of being released to the public, I'd be willing to share my code. But it sounds like I'm just reinventing a wheel already invented by openZIM.
It would be incredibly useful if these indices could be created as part of the dump creation process. Should I file a feature request?
With concatenated .xz files, creating the index is *much* faster, because the .xz format puts the stream size at the end of each stream. Plus with .xz all streams are broken on 4-byte boundaries, whereas with .bz2 blocks can end at any *bit* (which means you have to do painful bit shifting to create the index).
The file is also *much* smaller, on the order of 5-10% of bzip2 for a full history dump.
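To make the concatenation approach concrete, here is a rough C sketch of the read side using liblzma. It is only an illustration of the technique, not Anthony's Perl/FUSE code, and it assumes an index, built when the big file was created, that maps each uncompressed region to the byte offset of the .xz stream containing it; the decode_one_stream helper and the command-line interface are invented for the example.

    /*
     * Sketch of the read side of the concatenated-.xz approach, using
     * liblzma (xz utils).  An index built at compression time records the
     * byte offset of each member stream; to read a chunk, seek to that
     * offset and decode exactly one stream.
     * Build with:  cc xzseek.c -llzma
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <lzma.h>

    /* Decode the single .xz stream starting at `offset` in `f` and write
     * its uncompressed contents to stdout. */
    static int decode_one_stream(FILE *f, long offset)
    {
        if (fseek(f, offset, SEEK_SET) != 0)
            return -1;

        lzma_stream strm = LZMA_STREAM_INIT;
        /* No LZMA_CONCATENATED flag: the decoder stops at the end of this
         * one stream instead of running on into the next one. */
        if (lzma_stream_decoder(&strm, UINT64_MAX, 0) != LZMA_OK)
            return -1;

        uint8_t inbuf[1 << 16], outbuf[1 << 16];
        lzma_ret ret = LZMA_OK;
        strm.avail_in = 0;

        while (ret != LZMA_STREAM_END) {
            if (strm.avail_in == 0) {
                strm.next_in  = inbuf;
                strm.avail_in = fread(inbuf, 1, sizeof inbuf, f);
                if (strm.avail_in == 0)
                    break;                          /* truncated stream */
            }
            strm.next_out  = outbuf;
            strm.avail_out = sizeof outbuf;
            ret = lzma_code(&strm, LZMA_RUN);
            if (ret != LZMA_OK && ret != LZMA_STREAM_END)
                break;                              /* corrupt data etc. */
            fwrite(outbuf, 1, sizeof outbuf - strm.avail_out, stdout);
        }
        lzma_end(&strm);
        return ret == LZMA_STREAM_END ? 0 : -1;
    }

    int main(int argc, char **argv)
    {
        /* Usage: xzseek big-file.xz <offset>, where <offset> would normally
         * come from the index built alongside the concatenated file. */
        if (argc != 3) {
            fprintf(stderr, "usage: %s file.xz stream-offset\n", argv[0]);
            return 1;
        }
        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror("fopen"); return 1; }
        int rc = decode_one_stream(f, atol(argv[2]));
        fclose(f);
        return rc ? 1 : 0;
    }

Because every stream in the concatenation is a complete .xz file in its own right, a stock xz will still decompress the whole file sequentially; only the random-access reader needs the extra index.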
On 15 December 2010 20:41, Anthony wikimail@inbox.org wrote:
On Wed, Dec 15, 2010 at 12:01 PM, Andrew Dunbar hippytrail@gmail.com wrote:
By the way I'm keen to find something similar for .7z
I've written something similar for .xz, which uses LZMA2, the same as .7z. It creates a virtual read-only filesystem using FUSE (the FUSE part is in Perl, which uses pipes to dd and xzcat). The only real problem is that it doesn't use a stock .xz file; it uses a specially created one which concatenates lots of smaller .xz files (currently I concatenate between 5 and 20 or so 900K bz2 blocks into one .xz stream - between 5 and 20 because there's a preference to split on </page><page> boundaries).
At the moment I'm interested in .bz2 and .7z because those are the formats WikiMedia currently publishes data in. Though some files are also in .gz so I would also like to find a solution for those.
I thought about the concatenation solution splitting at <page> boundaries for .bz2 until I found out there was already a solution that worked with the vanilla dump files as is.
Apparently the folks at openzim have done something similar, using LZMA2.
If anyone is interested in working with me to make a package capable of being released to the public, I'd be willing to share my code. But it sounds like I'm just reinventing a wheel already invented by openZIM.
I'm interested in what everybody else is doing regarding offline WikiMedia content. I'm also mainly using Perl though I just ran into a problem with 64-bit values when indexing huge dump files.
It would be incredibly useful if these indices could be created as part of the dump creation process. Should I file a feature request?
With concatenated .xz files, creating the index is *much* faster, because the .xz format puts the stream size at the end of each stream. Plus with .xz all streams are broken on 4-byte boundaries, whereas with .bz2 blocks can end at any *bit* (which means you have to do painful bit shifting to create the index).
The file is also *much* smaller, on the order of 5-10% of bzip2 for a full history dump.
Have we made the case for this format to the WikiMedia people? I think they use .bz2 because it is pretty fast for very good compression ratios, but they use .7z for the full-history dumps, where the extremely good compression ratios warrant the slower compression times, since these files can be gigantic.
How is .xz for compression times? Would we have to worry about patent issues for LZMA?
Andrew Dunbar (hippietrail)
On Thu, Dec 16, 2010 at 12:47 AM, Andrew Dunbar hippytrail@gmail.com wrote:
At the moment I'm interested in .bz2 and .7z because those are the formats WikiMedia currently publishes data in.
I'm fairly certain the specific 7z format which Wikimedia uses doesn't allow for random access, because the dictionary is never reset.
Have we made the case for this format to the WikiMedia people?
No, there's no off-the-shelf tool to create these files - the standard .xz file created by xz utils puts everything in one stream, which is basically equivalent to the .7z files already being made. I'm sure "patches are welcome", but I don't have the time to create the patch.
How is .xz for compression times?
At the default settings, it's quite slow. I believe it's pretty much the same as 7zip with its default settings. The main reason I was using xz instead of 7zip is that xz handles pipes better - specifically, 7zip doesn't allow you to pipe from stdin to stdout. (See https://bugs.launchpad.net/ubuntu/+source/p7zip/+bug/383667 and the response - "You should use lzma." - well, lzma utils has been replaced by xz utils.)
For decompression, .xz is generally faster than .bz2 and slower than .gz.
Would we have to worry about patent issues for LZMA?
No, it uses LZMA2.
Hi Andrew,
maybe you'd like to check out ZIM: this is a standardized file format for compressed HTML dumps, focused on Wikimedia content at the moment.
There is some C++ code around to read and write ZIM files, and there are several projects using it, e.g. the WP1.0 project, the Israeli and Kenyan Wikipedia Offline initiatives, and more. The Wikimedia Foundation is also currently in the process of adopting the format in order to provide ZIM files from Wikimedia wikis in the future.
/Manuel
On 15 December 2010 20:24, Manuel Schneider manuel.schneider@wikimedia.ch wrote:
maybe you'd like to check out ZIM: this is a standardized file format for compressed HTML dumps, focused on Wikimedia content at the moment.
This is very interesting and I'll be watching it. Where do the HTML dumps come from? I'm pretty sure I've only seen "static" dumps for Wikipedia and not for Wiktionary, for example. I am also looking at adapting the parser for offline use, to generate HTML from the dump file wikitext.
Andrew Dunbar (hippietrail)
On Thu, 16 Dec 2010 07:50:56 +0200, Andrew Dunbar hippytrail@gmail.com wrote:
This is very interesting and I'll be watching it. Where do the HTML dumps come from?
I do the HTML dumps on my own, using a customized version of the DumpHTML extension and additional scripts.
Emmanuel
Hi,
On 16.12.2010 06:50, Andrew Dunbar wrote:
This is very interesting and I'll be watching it. Where do the HTML dumps come from? I'm pretty sure I've only seen "static" dumps for Wikipedia and not for Wiktionary, for example. I am also looking at adapting the parser for offline use, to generate HTML from the dump file wikitext.
there are several ways to get the HTML data.
Emmanuel (Kelson), who is doing Kiwix and WP1.0, uses SQL dumps to set up a separate instance of MediaWiki and then dumps the data on the command line.
Ralf from PediaPress has written a wrapper for zimlib to use it with Python, in order to integrate a ZIM export into the Collection extension:
* http://github.com/schmir/pyzim
A few weeks ago Tommi from openZIM committed "wikizim" to the zimwriter codebase, a command-line tool that dumps a whole wiki into a ZIM file using the MediaWiki API:
* http://svn.openzim.org/viewvc.cgi/trunk/zimwriter/
There have been discussions with Roan Kattouw about the best way to use wikizim on the Wikimedia wikis. His approach is to integrate the relevant code from wikizim into the MediaWiki codebase, avoiding abstraction layers as much as possible for the sake of performance and system load.
/Manuel
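For anyone curious what dumping a wiki over the MediaWiki API looks like at its simplest, the sketch below fetches the rendered HTML of one page with action=parse. A real tool like wikizim additionally has to enumerate every title, rewrite links and image references, and store the results in a ZIM file; the wiki URL, page title, and use of libcurl here are illustrative assumptions, not details taken from the zimwriter code.

    /*
     * Minimal illustration of pulling rendered HTML for a single page
     * through the MediaWiki API with action=parse.  The wiki URL and page
     * title are examples only.
     * Build with:  cc fetchpage.c -lcurl
     */
    #include <stdio.h>
    #include <curl/curl.h>

    static size_t write_cb(char *data, size_t size, size_t nmemb, void *userdata)
    {
        /* Stream the JSON response straight to the FILE* we were given. */
        return fwrite(data, size, nmemb, (FILE *)userdata);
    }

    int main(void)
    {
        const char *url =
            "https://en.wiktionary.org/w/api.php"
            "?action=parse&page=free&prop=text&format=json";

        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL *curl = curl_easy_init();
        if (!curl)
            return 1;

        curl_easy_setopt(curl, CURLOPT_URL, url);
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, stdout);
        curl_easy_setopt(curl, CURLOPT_USERAGENT, "offline-dump-sketch/0.1");

        CURLcode res = curl_easy_perform(curl);
        if (res != CURLE_OK)
            fprintf(stderr, "curl: %s\n", curl_easy_strerror(res));

        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return res == CURLE_OK ? 0 : 1;
    }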