Folks,
Coming here by way of the Wikipedia:Version 1.0 Editorial Team pages, I have a few questions.
Disclaimer: I am in South Africa, where bandwidth is expensive, so a lot of my 'fiddling' is done on my server in the USA, where I do not have bandwidth limitations.
My interest is to be able to serve Wikipedia content off a server in a classroom - usually a thin-client setup, but not always, and they rarely have internet access.
Weak, small clients, networked to a strong server, running Linux.
I downloaded a torrent - en.wikipedia.okawix - which appeared to be a tarball with these contents:
-rw-r--r-- 1 andyr andyr  571428918 Jul  6 23:24 article.index
-rw-r--r-- 1 andyr andyr  108812163 Jul  6 23:24 article.map
-rw-r--r-- 1 andyr andyr 5601891132 Jul  6 23:23 en.wikipedia.zeno
-rw-r--r-- 1 andyr andyr         11 Jun 23 18:58 entry.url
-rw-r--r-- 1 andyr andyr        307 Jul  6 23:23 licence.xml
-rw-r--r-- 1 andyr andyr  596303890 Jul  6 23:25 word.index
-rw-r--r-- 1 andyr andyr  108108941 Jul  6 23:25 word.map
I see a zeno file - a precursor to the zim format?
The ZimReader static binary from http://openzim.org/Releases dated 2009-06-07 (same date as the files above?) will not read the file.
Am I doing this wrong?
Secondly, I downloaded wikipedia_en_wp1_0.7_30000+_05_2009_beta2.zim and pointed the ZimReader at this - hoping to browse to server:8080 and read the file. But I have no index, and no search - do I need to get the index files separately?
And the css files (?) are in German - do I need to compile my own reader from svn to get an English reader?
Can I build a deb (for Ubuntu)?
Cheers, Andy!
Dear Andy,
> My interest is to be able to serve Wikipedia content off a server in a classroom - usually a thin-client setup, but not always, and they rarely have internet access. Weak, small clients, networked to a strong server, running Linux.
The Zim and Zeno file formats are not intended for network access: Zim/Zeno readers like Okawix and Kiwix need direct access to the files. You could work around that by using an NFS or Samba share.
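For instance, a minimal sketch of the NFS variant, run as root (the /srv/zim path and the client subnet are made-up examples, not from this thread):

  # on the strong server: export a directory holding the Zim/Zeno files read-only
  echo '/srv/zim 192.168.0.0/24(ro,no_subtree_check)' >> /etc/exports
  exportfs -ra
  # on each thin client: mount the share so a local reader can open the files directly
  mkdir -p /mnt/zim
  mount -t nfs server:/srv/zim /mnt/zim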
> I downloaded a torrent - en.wikipedia.okawix - which appeared to be a tarball with these contents: [...]
> I see a zeno file - a precursor to the zim format?
Yeah, Zim is an upgraded version of the Zeno format.
> The ZimReader static binary from http://openzim.org/Releases dated 2009-06-07 (same date as the files above?) will not read the file. Am I doing this wrong?
Yeah, Zim readers can't read the Zeno format. Use Okawix (http://okawix.com) to read Zeno files.
> Secondly, I downloaded wikipedia_en_wp1_0.7_30000+_05_2009_beta2.zim and pointed the ZimReader at this - hoping to browse to server:8080 and read the file. But I have no index, and no search - do I need to get the index files separately?
As I already said, Zim and Zeno files are mostly intended for direct access, not for network access. As for the search indexes, they are provided alongside the Zeno contents in the Okawix bundles and are only useful for the Okawix search components.
> And the css files (?) are in German - do I need to compile my own reader from svn to get an English reader?
Ehm... not sure what you are talking about...
> Can I build a deb (for Ubuntu)?
A .deb for Okawix would be great. If you need help on that matter (or anything else), join irc://irc.freenode.net/okawix
Hi!
Pascal Martin wrote:
>> My interest is to be able to serve Wikipedia content off a server in a classroom - usually a thin-client setup, but not always, and they rarely have internet access. Weak, small clients, networked to a strong server, running Linux.
> The Zim and Zeno file formats are not intended for network access: Zim/Zeno readers like Okawix and Kiwix need direct access to the files. You could work around that by using an NFS or Samba share.
well, this is not true.
If you get the zimreader from the openZIM website (better: get it from SVN and compile it yourself, as the binary is quite old), you already have a webserver.
Start the zimreader, pointing it to the ZIM files (article file and index file), and you have a website running on localhost:8080.
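For instance, a minimal sketch (the argument syntax here is an assumption, not something confirmed in this thread - check zimreader --help for the real options):

  # serve a ZIM file over HTTP on port 8080; run this on the strong server,
  # then any thin client can browse to http://server:8080/
  ./zimreader wikipedia_en_wp1_0.7_30000+_05_2009_beta2.zim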
Emmanuel (on this list) has made some ZIM files, including English ones, which work.
At the end of this week we have another developers meeting and we aim to fix some minor incompatibilities we still have between different ZIM file creators and ZIM readers.
Greets,
Manuel
On Mon, 16 Nov 2009, Manuel Schneider wrote:
> If you get the zimreader from the openZIM website (better: get it from SVN and compile it yourself, as the binary is quite old), you already have a webserver.
> Start the zimreader, pointing it to the ZIM files (article file and index file), and you have a website running on localhost:8080.
I am glad to hear this.
I will take the code from svn.
Is there an index file to go with wikipedia_en_wp1_0.7_30000+_05_2009_beta2.zim?
Even some small files - say, just Wikipedia:Vital articles - would allow a test of everything and make the files more manageable.
I couldn't download any zim files on the web except yours.
I would like to see categories - do they work?
Searches can be title search, article lede search, or full-text search.
Cheers, Andy!
On Mon, 16 Nov 2009, Andy Rabagliati wrote:
> I would like to see categories - do they work?
> Searches can be title search, article lede search, or full-text search.
These indexes http://ai.cs.utsa.edu/wikipedia0.7/ seem to have been built using categories.
Is that a part of the zim file too?
Maybe that is good enough?
Good enough for search, but not good enough for the page until the category link is on the page itself, so we can easily go from "Gamma ray burst" to other pages in the category Astronomy.
Cheers, Andy!
On Mon, 16 Nov 2009, Andy Rabagliati wrote:
> Good enough for search, but not good enough for the page until the category link is on the page itself, so we can easily go from "Gamma ray burst" to other pages in the category Astronomy.
It has <meta name="keywords" content="Gamma-Ray Burst, Gamma burst, Gamma-ray burst, Gamma rays burst, Gamma-ray bursts, Afterglow (gamma ray burst), Gamma-ray burster, Gamma ray burster, Gammy ray bursts, Gama ray burst, Gamma ray bursts" />
(are those the redirects?) but nothing obviously saying "Categories: Gamma-ray bursts | Astronomical events | Stellar phenomena" the way the main WP article does.
And I see that 2/3rds of the "Gamma ray burst" article is references.
It is important that this valuable information is not lost, in case you *do* happen to have internet access - but I am not suggesting you take them out .. :)
Cheers, Andy!
Andy Rabagliati wrote:
> On Mon, 16 Nov 2009, Andy Rabagliati wrote:
>> Good enough for search, but not good enough for the page until the category link is on the page itself, so we can easily go from "Gamma ray burst" to other pages in the category Astronomy.
> It has <meta name="keywords" content="Gamma-Ray Burst, Gamma burst, Gamma-ray burst, Gamma rays burst, Gamma-ray bursts, Afterglow (gamma ray burst), Gamma-ray burster, Gamma ray burster, Gammy ray bursts, Gama ray burst, Gamma ray bursts" />
> (are those the redirects?)
Yes: when building the ZIM file, I add the titles of the redirect pages pointing to each article to "the already there keywords".
> It is important that this valuable information is not lost, in case you *do* happen to have internet access - but I am not suggesting you take them out .. :)
:)
I'm not enthusiastic about dumping category pages... but this is only part of the issue. The other part is that I have no method to know, given a list of articles, which categories I have to integrate in the final dump! Do you?
Emmanuel
On Mon, 16 Nov 2009, Emmanuel Engelhart wrote:
> Yes: when building the ZIM file, I add the titles of the redirect pages pointing to each article to "the already there keywords".
What are "the already there keywords"?
> I'm not enthusiastic about dumping category pages... but this is only part of the issue. The other part is that I have no method to know, given a list of articles, which categories I have to integrate in the final dump! Do you?
You could do it iteratively?
You must already have a method of unlinking 'red-linked' pages - links in articles that point to pages not in our collection.
Include all categories, then remove those that point to zero or one article in our collection.
You can leave all the categories in the article, just unlink the ones that do not 'make the cut' - see the sketch below.
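A minimal sketch of that pruning rule, assuming two made-up input files: articles.txt (one title per line - the collection) and category_links.tsv (tab-separated "category, article" membership pairs):

  # keep a category only if it reaches at least two articles in the collection
  awk -F'\t' '
      NR == FNR { in_collection[$0] = 1; next }  # first file: collection titles
      $2 in in_collection { hits[$1]++ }         # second file: count members we do have
      END { for (c in hits) if (hits[c] >= 2) print c }
  ' articles.txt category_links.tsv > categories_to_keep.txt

Category links not listed in categories_to_keep.txt would then be unlinked in the article HTML.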
My point is - if references make up half the text dump, categories surely deserve to be in there.
Re: References - could you perhaps link to http://en.wikipedia.org/wiki/Gamma-ray_burst#References ?
Then, if you *do* have internet access, you can get to the refs.
(Still not very satisfactory - the correspondence between the ref in the article and the actual one is lost - you have to look for it.)
I appreciate all the work that has already gone into this, but I see a lot of effort going into one or two zim files, and not enough on the process - where you could create another zim file which is just chemistry-related, or Africa-related, or the top 1000 articles.
Cheers, Andy!
Hi Andy,
Andy Rabagliati wrote:
> On Mon, 16 Nov 2009, Andy Rabagliati wrote:
>> I would like to see categories - do they work?
It depends what you mean exactly.
In every dump I currently do (http://tmp.kiwix.org/zim/), categories are avoided, and I think it will stay like that as long as the ZIM format does not support them natively.
So the ZIM format does not natively support a category system, but there is a feature request for it: http://bugs.openzim.org/show_bug.cgi?id=1
... but every ZIM creator can choose to integrate categories as HTML pages (not so trivial, IMO).
>> Searches can be title search, article lede search, or full-text search.
I understand your problem. In fact, we have many search engine solutions, but currently nothing that seems to be really what you need (easy to index a ZIM file and able to run as an HTTP server). This is a pity, because it would be really good and the most complicated part of the code is already there.
Tommi, maybe you can help Andy to work with the openZIM search engine? Andy has been working for a long time in South Africa to spread Wikipedia content offline.
> These indexes http://ai.cs.utsa.edu/wikipedia0.7/ seem to have been built using categories.
This dump is one I built (maybe extracted from the ZIM)... but modified a little. This is a pretty interesting URL; it would be great to know exactly how the developer behind it did it... maybe you would be able to do the same.
> Is that a part of the zim file too?
Nothing to do with openZIM... but he re-uses my work.
> Maybe that is good enough?
Maybe :)
> Good enough for search, but not good enough for the page until the category link is on the page itself, so we can easily go from "Gamma ray burst" to other pages in the category Astronomy.
I think you won't have that soon, because:
* I didn't do it (and all the WP 0.7 content comes from my original ZIM file);
* I know nobody who is currently able to do that cleanly.
Regards,
Emmanuel
Hey there,
Emmanuel Engelhart wrote:
> So the ZIM format does not natively support a category system, but there is a feature request for it: http://bugs.openzim.org/show_bug.cgi?id=1
Well, it is already thought through and documented in parts - all that is needed is the implementation. Categorisation is one of the hottest topics for the upcoming dev meeting.
>> Searches can be title search, article lede search, or full-text search.
> I understand your problem. In fact, we have many search engine solutions, but currently nothing that seems to be really what you need (easy to index a ZIM file and able to run as an HTTP server). This is a pity, because it would be really good and the most complicated part of the code is already there.
hmm, I don't understand that answer.
ZIM features a fulltext index, and with the zimwriter it can even be created when you just have the ZIM file with the article contents.
just do a
zimwriter -Z wikipedia-de.zim wikipedia-de-x.zim
where wikipedia-de.zim is the file containing the articles and wikipedia-de-x.zim is the file where the index will be written to.
Just run that command (it takes some time as the process is quite I/O intensive) and you have a fulltext index which you can use with the zimreader.
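The whole pipeline, under the same assumption about zimreader's argument syntax as above, would then be something like:

  # 1. build the fulltext index from the article file (command from Manuel's mail)
  zimwriter -Z wikipedia-de.zim wikipedia-de-x.zim
  # 2. serve both files over HTTP; exact option names are an assumption
  ./zimreader wikipedia-de.zim wikipedia-de-x.zim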
What else do you need?
Manuel
On Mon, 16 Nov 2009, Emmanuel Engelhart wrote:
>> These indexes http://ai.cs.utsa.edu/wikipedia0.7/ seem to have been built using categories.
> This dump is one I built (maybe extracted from the ZIM)... but modified a little. This is a pretty interesting URL; it would be great to know exactly how the developer behind it did it... maybe you would be able to do the same.
This is his explanation:
This collection of articles is called "beta2" because they were extracted from wikipedia_en_wp1_0.7_30000+_05_2009_beta2.zim at tmp.kiwix.org. My distribution has different file names for all the articles based on their titles and has a title search capability that only depends on Javascript. Below is some explanation of the steps I performed, not necessarily done in this order.
1. I extracted the articles from the zim file by downloading, compiling and running zimDump from openzim.org. Compiling zimDump is nontrivial because it involves downloading and compiling other packages in versions that will work together.
2. I created three lists with Perl scripts and manual cleaning afterwards (see the sketch after list C below).
A. The first list was a list of all articles: zim file name, UTF8 title, and ASCII title.
B. The second list was a list of all zim files (articles, images and other files, Javascript and CSS): zim file name and target file name in my distribution.
C. The third list was a list for redirecting one zim file name to another. The zim dump creates a lot of empty files in the A subdirectory (A contains all the articles). It turns out that each of them needs to be redirected to another article. The redirects can be determined by downloading and running the zimReader program for Linux, which can be found at openzim.org.
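A rough sketch of how the first list might be rebuilt with standard tools (this is not Tom's actual Perl; it assumes the zimDump output keeps articles as HTML files under A/ with a <title> element, and the name article_list.tsv is made up):

  #!/bin/sh
  for f in A/*; do
      # pull the first <title>...</title> out of the dumped HTML
      utf8=$(sed -n 's|.*<title>\([^<]*\)</title>.*|\1|p' "$f" | head -n 1)
      # crude ASCII fallback: keep only printable ASCII characters
      ascii=$(printf '%s' "$utf8" | tr -cd ' -~')
      printf '%s\t%s\t%s\n' "$f" "$utf8" "$ascii"
  done > article_list.tsv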
There appear to be a few duplicate articles (none were deleted), which I list below (in ASCII) for anyone who is interested:
Abu Rayhan Biruni
'Alawi
Battle of Mohacs
Beer-Lambert law
Charismatic movement
Elian Gonzalez affair
Ismail Enver
Ismet Inonu
Istiklal Marsi
Izmir Province
Wikipedia:0.7/0.7geo/Leopold
Macapa or Macapai
Maceia or Maceio
Nicole Vaidisova
PRIDE Fighting Championships
War in Afghanistan (2001-present)
3. I used a Perl script to copy all the files from the zim dump to a staging area, modifying the links along the way. There are many, many dead image links (26314 by my count); I changed those links to empty strings. There are also some dead article links; most of them correspond to dead image links, but a few of them should have been redirected, so they got added to my third list above. Here are all the dead article links and any appropriate redirect, for anyone who is interested:
A/5ISM ignore
A/35A A/D6N
A/53Z A/CWO
A/5J03 ignore
A/5J55 ignore
A/9XO A/HQO
A/APD A/A35
A/D07 A/9PW
A/F5G A/163K
A/PRL ignore
A/TKV A/S4X
A/TR4 ignore
A/VBB ignore
A/ZM2 ignore
A/2QE6 ignore
A/T3B A/NQR
A/11B0 ignore
A/1NN2 ignore
A/5ISU A/4O
A/5JAZ ignore
A/5IV3 ignore
A/5IXU ignore
A/102Z ignore
A/1QTM ignore
A/5IOB ignore
A/5J51 ignore
A/5IP6 ignore
A/5JBO ignore
A/Y91 ignore
4. In addition to changing links, I made a few other changes. Each article now has a search box for title search. I took some existing GPLed Javascript (JSE search engine) and made extensive modifications for this application. It only searches the titles; there is no keyword index, and there is no text search. The motto of the code is "Linear Search FTW". It is surprisingly snappy, though in hindsight, searching 30000 titles is not a lot for a computer to do. The results page is functional, but otherwise not too exciting.
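For a feel of what that search does, here is a rough command-line analogue of the linear scan (this is not Tom's Javascript; article_list.tsv is the hypothetical title list from the sketch above):

  # case-insensitive linear title search; print file name and UTF8 title
  grep -i 'gamma ray' article_list.tsv | cut -f1,2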
I changed the titles of the index pages to something less geeky, e.g. "Topical Index: Wikipedia" for the topic index page on Wikipedia. I also fixed a number of links that incorrectly pointed to the topical index page instead of the alphabetical index page.
Enjoy,
Tom Bylander (bylander@cs.utsa.edu), written November 13, 2009