Re: [openZIM dev-l] zeno, zim formats - Offline-l

20 Nov 2009

 On Fri 20/11/09 11:11, "Andy Rabagliati" andyr(a)wizzy.com wrote:
...
  On Mon, 16 Nov 2009, Emmanuel Engelhart wrote:

  > These indexes http://ai.cs.utsa.edu/wikipedia0.7/ seem to have been built
  > using categories.

 This dump is one I built (maybe extracted from the ZIM)... but slightly
 modified. This is a pretty interesting URL; it would be great to know
 exactly how the developer behind it did this... maybe you would be able
 to do the same.

 This is his explanation :-

 This collection of articles is called "beta2" because they were
 extracted from wikipedia_en_wp1_0.7_30000+_05_2009_beta2.zim at
 tmp.kiwix.org.  My distribution has different file names for all the
 articles based on their titles and has a title search capability that
 only depends on Javascript.  Below is some explanation of the steps I
 performed, not necessarily done in this order.

 1. I extracted the articles from the zim file by downloading,
 compiling and running zimDump from openzim.org.  Compiling zimDump is
 nontrivial because it involves downloading and compiling other
 packages in versions that will work together.

 2. I created three lists with Perl scripts and manual cleaning
 afterwards.  

 A. The first list was a list of all articles: zim file name, UTF8
 title, and ASCII title.

 B. The second list was a list of all zim files (articles, images and
 other files, Javascript and CSS): zim file name and target file name
 in my distribution.

 C. The third list was a list for redirecting one zim file name to
 another.  The zim dump creates a lot of empty files in the A
 subdirectory (A contains all the articles).  It turns out that each
 of them needs to be redirected to another article.  The redirects
 can be determined by downloading and running the zimReader program
 for Linux, which can be found at openzim.org.
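
A minimal sketch of how the empty files in the A subdirectory might be
listed as candidates for the third (redirect) list. It assumes that
"empty" means zero bytes; the function name is hypothetical, and the
redirect targets themselves would still have to be looked up separately
(for example with zimReader, as described above).

```python
from pathlib import Path


def find_empty_articles(dump_root):
    """Return the sorted names of zero-byte files under <dump_root>/A.

    Assumption: an "empty" article file is one with a size of 0 bytes.
    """
    article_dir = Path(dump_root) / "A"
    return sorted(
        entry.name
        for entry in article_dir.iterdir()
        if entry.is_file() and entry.stat().st_size == 0
    )
```

Each name returned is a zim file name that still needs a redirect target.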

 There appear to be a few duplicate articles (none were deleted), which
 I list below (in ASCII) for anyone who is interested:

 Abu Rayhan Biruni
 'Alawi
 Battle of Mohacs
 Beer-Lambert law
 Charismatic movement
 Elian Gonzalez affair
 Ismail Enver
 Ismet Inonu
 Istiklal Marsi
 Izmir Province
 Wikipedia:0.7/0.7geo/Leopold
 Macapa or Macapai
 Maceia or Maceio
 Nicole Vaidisova
 PRIDE Fighting Championships
 War in Afghanistan (2001-present)
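
One way such duplicates could be spotted is to fold each UTF8 title to
ASCII and group the zim file names by the folded title. The sketch below
is only an illustration in Python, not the Perl actually used; the
function names are hypothetical, and the folding rule (drop combining
marks after NFKD normalization) is an assumption. Letters that have no
decomposition, such as the Turkish dotless i, are simply dropped by this
rule, so it will not reproduce every ASCII title in the list above.

```python
import unicodedata
from collections import defaultdict


def ascii_title(title):
    """Fold a UTF-8 title to ASCII by stripping accents (assumed rule)."""
    decomposed = unicodedata.normalize("NFKD", title)
    # Combining marks and other non-ASCII code points are discarded.
    return decomposed.encode("ascii", "ignore").decode("ascii")


def find_title_collisions(articles):
    """Group zim file names by folded title; keep titles seen more than once.

    `articles` is an iterable of (zim_file_name, utf8_title) pairs.
    """
    by_title = defaultdict(list)
    for zim_name, title in articles:
        by_title[ascii_title(title)].append(zim_name)
    return {t: names for t, names in by_title.items() if len(names) > 1}
```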

 3. I used a Perl script to copy all the files from the zim dump to a
 staging area, modifying the links along the way.  There are many, many
 dead image links (26314 in my count); I changed those links to empty
 strings.  There are also some dead article links.  Most of them
 correspond to dead image links, but a few of them should have been
 redirected; they got added to my third list above.  Here are all the
 dead article links and any appropriate redirect for anyone who is
 interested.

 A/5ISM	ignore
 A/35A	A/D6N
 A/53Z	A/CWO
 A/5J03	ignore
 A/5J55	ignore
 A/9XO	A/HQO
 A/APD	A/A35
 A/D07	A/9PW
 A/F5G	A/163K
 A/PRL	ignore
 A/TKV	A/S4X
 A/TR4	ignore
 A/VBB	ignore
 A/ZM2	ignore
 A/2QE6	ignore
 A/T3B	A/NQR
 A/11B0	ignore
 A/1NN2	ignore
 A/5ISU	A/4O
 A/5JAZ	ignore
 A/5IV3	ignore
 A/5IXU	ignore
 A/102Z	ignore
 A/1QTM	ignore
 A/5IOB	ignore
 A/5J51	ignore
 A/5IP6	ignore
 A/5JBO	ignore
 A/Y91	ignore

 4. In addition to changing links, I made a few other changes.  Each
 article now has a search box for title search.  I took some existing
 GPLed Javascript (JSE search engine) and made extensive modifications
 for this application.  It only searches the titles; there is no
 keyword index, and there is no text search.  The motto of the code is
 "Linear Search FTW".  It is surprisingly snappy, though in hindsight,
 searching 30000 titles is not a lot for a computer to do.  The results
 page is functional, but otherwise not too exciting.
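
To illustrate why a linear scan is enough at this scale, here is a
minimal sketch of a title-only search. It is written in Python rather
than the Javascript actually used, it is not the JSE code, and the
function name, the case-insensitive substring matching, and the result
limit are all assumptions.

```python
def search_titles(titles, query, limit=20):
    """Linear scan: return up to `limit` titles containing `query`.

    Matching is a case-insensitive substring test (assumed behaviour).
    """
    needle = query.casefold()
    hits = []
    for title in titles:
        if needle in title.casefold():
            hits.append(title)
            if len(hits) == limit:
                break
    return hits
```

Each query touches every title at most once, so the cost grows linearly
with the number of titles.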

 I changed the titles of the index pages to something less geeky, e.g.,
 "Topical Index: Wikipedia" for the topic index page on Wikipedia.  I
 also fixed a number of incorrect links that pointed to the topical
 index page instead of the alphabetical index page.

 Enjoy,

 Tom Bylander

Thank you very much, Andy, for forwarding this email; this is really interesting.

That confirms a few things:
* The wikipedia_en_wp1_0.7_30000+_05_2009_beta2 is only a beta and should be improved;
I will do that soon.
* I need to write a Perl script to check many things in an HTML directory
(https://sourceforge.net/tracker/?func=detail&aid=2901059&group_id=1…)
before building the ZIM.
* I think we need a similar tool (zim-check?), written in C++, to do the same with ZIM
files. I see this as pretty necessary if we want to set up a ZIM library: nobody wants us
to spread poor-quality ZIM files. http://bugs.openzim.org/show_bug.cgi?id=14
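
Which checks such a script would run is not spelled out here. Given
Tom's report, dead links are one plausible candidate, so the following
is only a sketch of that single check, in Python rather than Perl. The
names are hypothetical, and it assumes pages end in .html and use
relative local links.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urlparse


class LinkCollector(HTMLParser):
    """Collect the values of href and src attributes from one page."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.targets.append(value)


def find_dead_links(root):
    """Return (page, target) pairs whose local target file is missing."""
    root = Path(root)
    dead = []
    for page in sorted(root.rglob("*.html")):
        collector = LinkCollector()
        collector.feed(page.read_text(encoding="utf-8", errors="replace"))
        for target in collector.targets:
            parsed = urlparse(target)
            if parsed.scheme or parsed.netloc or not parsed.path:
                continue  # external link or pure in-page anchor
            local = page.parent / unquote(parsed.path)
            if not local.exists():
                dead.append((page.relative_to(root).as_posix(), target))
    return dead
```

An empty result would mean every local link in the directory resolves to
an existing file.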

As soon as I have finished with that, I will publish a new version of the ZIM
file and contact Tom.

Regards
Emmanuel