Hi
I'm very proud to announce the release of our new tool: warc2zim.
warc2zim is a command-line tool for GNU/Linux and macOS that converts a WARC file into a ZIM file. Because WARC is a widely used storage format in the web archiving world, warc2zim offers new opportunities to reuse data stored as WARC and to benefit from the full feature set of the ZIM file format and of readers like Kiwix.
The tool was built in close collaboration with the Webrecorder team. It is one milestone of a bigger project called Zimit, which we run with sponsorship from the Mozilla Foundation.
ZIM files created this way work slightly differently from traditional ones, although the ZIM specification is formally respected. We are currently working on updating all the Kiwix readers, but these files already work well with Kiwix Serve.
The tool is distributed at: https://pypi.org/project/warc2zim/
More news to come about warc2zim and Zimit in January 2020.
Happy scraping! Happy coding! Happy offline reading!
Emmanuel
oh neat! That does indeed open all kinds of interesting possibilities! :)
A.
Asaf Bartov (he/him/his)
Senior Program Officer, Emerging Wikimedia Communities
Wikimedia Foundation https://wikimediafoundation.org/
Imagine a world in which every single human being can freely share in the sum of all knowledge. Help us make it a reality! https://donate.wikimedia.org
On Thu, Oct 29, 2020 at 1:14 PM Emmanuel Engelhart kelson@kiwix.org wrote:
-- Kiwix - Wikipedia Offline & more
- Web: https://kiwix.org/
- Twitter: https://twitter.com/KiwixOffline
- Wiki: https://wiki.kiwix.org/
Offline-l mailing list Offline-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/offline-l
Hello,
I am working on a search engine (unlike Sphinx or Elasticsearch, more like Bing or Google). I was planning to use .zim files to feed the index; the problem is that there is no systematic way to find the original URL of the documents.
I am wondering whether the Kiwix project could do one of the following:
A) Add a <meta url="https://foobar"> to the HTML inside the .zim files,
A bis) Add a per-document metadata field with the original URL inside the .zim files,
B) Publish .warc files of Wikipedia, Stack Overflow dumps, etc., so that people like myself can reuse them. WARC files are more useful than .zim files but still less user-friendly than the following proposal...
C) ... One last alternative is to move the custom .zim file storage to an OKVS [0] such as RocksDB or the SQLite LSM extension [1]. The idea is to make the Kiwix dumps very easy to access from many programming languages, unlike the current approach, which is limited to C++ and Python. It would also be easier to extend a given dump with custom fields, unlike the current .zim, which seems to be read-only.
Let me know what you think :-)
Thanks in advance!
[0] https://en.wikipedia.org/wiki/Ordered_Key-Value_Store
[1] https://github.com/sqlite/sqlite/tree/master/ext/lsm1
Hi Amirouche -- is this for an offline search? Would love to read more about it.
On Sun, Nov 1, 2020 at 6:36 AM Amirouche Boubekki < amirouche.boubekki@gmail.com> wrote:
-- Amirouche ~ https://hyper.dev
Hello Samuel,
On Tue, Nov 3, 2020 at 00:05, Samuel Klein meta.sj@gmail.com wrote:
Hi Amirouche -- is this for an offline search? Would love to read more about it.
The primary use-case is not offline search. I am working on a description of how it works, but it is not ready yet.
That said, with the "cache" feature it will be possible to use it offline. In particular, it will be possible to crawl the internet and feed the search engine with WARC files, or with another format based on SQLite.
Hi Amirouche
On 01.11.20 08:52, Amirouche Boubekki wrote:
I am working on a search engine (unlike sphinx or elastic search, more like bing or google), I was planning to use .zim files to feed the index, the problem is there is no systematic way to find the original URL of the documents.
Yes. For ZIM files created from MediaWiki, it should to a very large extent be the same articleId.
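To illustrate the point above, here is a minimal sketch, assuming the articleId of a ZIM entry equals the page title used in the wiki's URL. The function name `reconstruct_url` and the base URL are hypothetical and chosen only for this example; the guess will not hold for mash-up ZIMs.

```python
from urllib.parse import quote


def reconstruct_url(base_url: str, article_id: str) -> str:
    """Guess the original URL of a MediaWiki article from its ZIM articleId.

    Assumption (not guaranteed by the ZIM format): the articleId is the
    page title exactly as it appears in the wiki URL.
    """
    # Percent-encode non URL-safe characters, keeping "/" for subpages.
    return base_url.rstrip("/") + "/" + quote(article_id, safe="/")


if __name__ == "__main__":
    print(reconstruct_url("https://en.wikipedia.org/wiki/", "Albert_Einstein"))
```

The base URL has to be supplied by the caller, since the sketch does not read it from the ZIM file.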
I am wondering whether one of the following will be possible for kiwix project to do:
A) Add a <meta url="https://foobar"> in the html inside the .zim files,
Yes, but such a link is not always possible: many ZIMs are mash-ups. That said, it is trivial to add such a meta node to the HTML in MWoffliner.
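For the consuming side, a minimal sketch of how an indexer might read such a node, assuming the HTML carries the `<meta url="...">` form proposed in A). That form is only a proposal in this thread, not something current ZIM files contain, and `extract_original_url` is a hypothetical helper name.

```python
from html.parser import HTMLParser
from typing import Optional


class OriginalUrlParser(HTMLParser):
    """Remember the `url` attribute of the first <meta url="..."> node."""

    def __init__(self) -> None:
        super().__init__()
        self.original_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only the first matching <meta> node is kept.
        if tag == "meta" and self.original_url is None:
            url = dict(attrs).get("url")
            if url:
                self.original_url = url


def extract_original_url(html: str) -> Optional[str]:
    """Return the proposed original-URL meta value, or None if absent."""
    parser = OriginalUrlParser()
    parser.feed(html)
    return parser.original_url
```

Pages without the node simply yield `None`, which matches the mash-up case where no single original URL exists.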
A bis) Add a metadata field per document with the original url inside the .zim files,
Yes. I was sure we had such a ticket open, at least for MWoffliner, but I cannot find it anymore. Both approaches are probably doable; you should open a ticket in the MWoffliner repository.
B) Publish .warc files of wikipedia, stackoverflow dumps etc... so that people like myself can re-use those. WARC files are more useful than .zim files but still less user friendly than the following proposal...
Why are WARC files more useful, besides the fact that they hold an exact copy of the original Web page?
C) ... One last alternative, is to pivot the custom .zim file storage to an okvs [0] like rocksdb or sqlite lsm extension [1]. The idea is to make it very easy to access the kiwix dumps from many programming languages unlike the current approach that is limited to C++ and Python. Also, it will be easier to extend a given dump with custom fields, unlike the current .zim which seems to be read-only.
There are bindings for Go and JavaScript as well. Which additional binding do you need? Creating a binding for libzim takes from a few days to two weeks, depending on how you want to build it. This is easy. If you want to read the content of Wikipedia, this is the easiest solution (if the current tools are not enough). zimdump also lets you extract the content extremely efficiently from the command line.
I personally think we should create a FUSE driver.
Yes, the ZIM format is read-only. If you want to write content, then you definitely need a read-write DB, whatever its name. That does not mean you have to replace the ZIM; the two can be complementary. Another point is that if you need a DB, it basically means you are dealing with an additional source of information, something you have not talked about.
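One way to picture the complementary setup is a small read-write side store that sits next to the untouched ZIM file. This is only a sketch under assumptions: the table name `extra_fields`, the helper names, and the use of the ZIM entry path as the key are all hypothetical choices made for this example.

```python
import json
import sqlite3


def open_sidecar(db_path: str) -> sqlite3.Connection:
    """Open (or create) a side database holding custom fields per ZIM entry."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extra_fields ("
        " entry_path TEXT PRIMARY KEY,"
        " fields TEXT NOT NULL)"
    )
    return conn


def set_fields(conn: sqlite3.Connection, entry_path: str, fields: dict) -> None:
    """Store custom fields (as JSON) for one entry, replacing earlier ones."""
    conn.execute(
        "INSERT OR REPLACE INTO extra_fields (entry_path, fields) VALUES (?, ?)",
        (entry_path, json.dumps(fields)),
    )
    conn.commit()


def get_fields(conn: sqlite3.Connection, entry_path: str) -> dict:
    """Return the custom fields for one entry, or an empty dict."""
    row = conn.execute(
        "SELECT fields FROM extra_fields WHERE entry_path = ?", (entry_path,)
    ).fetchone()
    return json.loads(row[0]) if row else {}
```

The ZIM file itself is never modified here; the content still comes from libzim or zimdump, and only the extra fields live in SQLite.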
Please open a ticket in the relevant repository if you need anything from libzim or MWoffliner.
Regards Emmanuel
Hello Emmanuel,
I figured out how I will make it work: I will translate the .zim files into an SQLite database with an ad hoc Python script, so that I can add any metadata I need.
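The script itself is not shown in the thread. A minimal sketch of what such a translation could look like follows, assuming the entries have already been read out of the .zim file (for example via libzim or zimdump) as `(path, title, html)` tuples. The schema, the function names, and the `url_for` callback are hypothetical.

```python
import sqlite3
from typing import Callable, Iterable, Tuple

# Assumed shape of one extracted ZIM entry: (path, title, html).
Entry = Tuple[str, str, str]


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the documents table, including a column for the original URL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        " path TEXT PRIMARY KEY,"
        " title TEXT,"
        " url TEXT,"
        " html TEXT)"
    )


def load_entries(
    conn: sqlite3.Connection,
    entries: Iterable[Entry],
    url_for: Callable[[str], str],
) -> int:
    """Copy entries into SQLite, attaching a URL computed by `url_for`."""
    create_schema(conn)
    rows = [(path, title, url_for(path), html) for path, title, html in entries]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO documents (path, title, url, html)"
            " VALUES (?, ?, ?, ?)",
            rows,
        )
    return len(rows)


if __name__ == "__main__":
    sample = [("A/Paris", "Paris", "<p>Paris</p>")]
    db = sqlite3.connect(":memory:")
    print(load_entries(db, sample, lambda p: "https://example.org/" + p))
```

Further metadata columns can be added to the table as needed, which is the flexibility the read-only ZIM format does not offer.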