Hi, Emmanuel.

Thanks a lot for this information!  We can live with a 0.5GB index, so now Kiwix is a realistic option for us.
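
To make the arithmetic concrete, here is a rough sketch using the figures from this thread: a ~300MB ZIM file, a 3GB or 5GB allocation, and index sizes of 2.3G, 1.1G and 573M. It assumes 1GB = 1024MB and that nothing else shares the allocation; the function name is only for illustration.

```python
ZIM_MB = 300  # approximate size of the Hebrew Wikipedia ZIM file (text only)


def space_left_for_images_mb(budget_gb, index_mb, zim_mb=ZIM_MB):
    """Return the MB left for images after the ZIM file and the index."""
    # Assumption: 1GB = 1024MB, and only the ZIM file and index use the budget.
    return budget_gb * 1024 - zim_mb - index_mb


# Index sizes reported in this thread, converted to MB where needed.
index_sizes_mb = {
    "original (2.3G)": 2.3 * 1024,
    "improved indexer (1.1G)": 1.1 * 1024,
    "after xapian-compact (573M)": 573,
}

for label, index_mb in index_sizes_mb.items():
    for budget_gb in (3, 5):
        left = space_left_for_images_mb(budget_gb, index_mb)
        print(f"{label}, {budget_gb}GB budget: about {left:.0f}MB for images")
```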

In the coming week, we will be submitting an initial localization file for Kiwix, providing Hebrew strings for its user interface.

   Asaf

On Wed, Jul 8, 2009 at 10:23 PM, Emmanuel Engelhart <emmanuel@engelhart.org> wrote:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Asaf,

I have improved the indexing code, and now (with the Kiwix SVN code) the
index is only 1.1G for your ZIM file. This will be available to you in
the next Kiwix release (~2 weeks).

There is also a tool called "xapian-compact" that can reduce the index
size by about 50%. I do not currently plan to integrate it into Kiwix,
but if you want to ship the software with the content and index already
included, you can use it. I have tested it, and the index is now 573M.
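
For reference, a minimal sketch of how the compaction step could be scripted, assuming xapian-compact is installed and on the PATH. xapian-compact takes the source database directory followed by the destination directory. The directory names in the usage comment are placeholders, not real paths, and the helper names are only for illustration.

```python
import subprocess
from pathlib import Path


def compact_command(source_index, compacted_index):
    """Build the xapian-compact command line: source database, then destination."""
    return ["xapian-compact", str(source_index), str(compacted_index)]


def directory_size_mb(path):
    """Sum the sizes of all regular files under path, in MiB."""
    total_bytes = sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file())
    return total_bytes / (1024 * 1024)


def compact_index(source_index, compacted_index):
    """Run xapian-compact and return (size before, size after) in MiB."""
    before = directory_size_mb(source_index)
    # check=True raises CalledProcessError if xapian-compact exits with an error.
    subprocess.run(compact_command(source_index, compacted_index), check=True)
    after = directory_size_mb(compacted_index)
    return before, after


# Usage (placeholder paths, substitute the real index directory):
#   before, after = compact_index("example.index", "example.index.compact")
#   print(f"{before:.0f} MiB -> {after:.0f} MiB")
```

The compacted copy is written to a separate directory, so the original index stays untouched until the result has been checked.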

So it seems that this is all I can do with Xapian for now, but it is a
lot better than the initial 2.3G :)

Another point is that the upcoming Xapian release will have a new
storage backend, which should reduce the index by a further ~50%. I have
not tested it myself, but that is what the developers say.

Regards
Emmanuel

Asaf Bartov wrote:
> Clarification:
>
> This last message was by Rotem, a fellow WM-IL member helping me with the
> embedding of the Hebrew Wikipedia in the One Computer Per Child project.
>
> He is reporting issues with Kiwix and the ZIM file I created last week.
>
> Regarding size:  Size is important, because we intend to add images (the
> 300MB ZIM file is the complete Hebrew Wikipedia text, but no pictures).  We
> are hoping to have at least 5GB reserved for us in those One Computer Per
> Child machines we are to install on, but we may be forced to make do with
> 3GB.  So every MB saved from the index is another MB available for
> images...
>
>    Asaf Bartov
>    Wikimedia Israel
>
> On Mon, Jul 6, 2009 at 3:58 PM, Rotem Simha <hidroo@gmail.com> wrote:
>
>> * There are some errors in links to files and special pages. Examples:
>>   - קובץ:Nuvola_apps_important.svg <http://commons.wikimedia.org/wiki/File:Nuvola_apps_important.svg>
>>     links to ויקיפדיה:מיזמי ויקיפדיה/מיזם ערכים ללא תמונות/קטגוריות/ספורטאים איטלקים
>>     (Wikipedia:Wikipedia projects/Articles without images/Categories/Sports people from Italy)
>>   - מיוחד:אקראי (Special:Random) links to 15 במאי (May 15)
>>   - מיוחד:שינויים אחרונים (Special:RecentChanges) links to 10_באוגוסט (August 10)
>>
>> * size is important because we intend to add images
>>
>> 2009/7/6 <dev-l-request@openzim.org>
>>
>>> Send dev-l mailing list submissions to
>>>        dev-l@openzim.org
>>>
>>> To subscribe or unsubscribe via the World Wide Web, visit
>>>        https://intern.openzim.org/mailman/listinfo/dev-l
>>> or, via email, send a message with subject or body 'help' to
>>>        dev-l-request@openzim.org
>>>
>>> You can reach the person managing the list at
>>>        dev-l-owner@openzim.org
>>>
>>> When replying, please edit your Subject line so it is more specific
>>> than "Re: Contents of dev-l digest..."
>>>
>>>
>>> Today's Topics:
>>>
>>>   1. Kiwix index size (Asaf Bartov)
>>>   2. Re: Kiwix index size (Manuel Schneider)
>>>   3. Re: Kiwix index size (Emmanuel Engelhart)
>>>
>>>
>>> ----------------------------------------------------------------------
>>>
>>> Message: 1
>>> Date: Sun, 5 Jul 2009 19:18:57 +0300
>>> From: Asaf Bartov <asaf.bartov@gmail.com>
>>> Subject: [openZIM dev-l] Kiwix index size
>>> To: dev-l@openzim.org
>>> Message-ID:
>>>        <50a20d900907050918r3fcff23l275c67690ed7fc20@mail.gmail.com>
>>> Content-Type: text/plain; charset="iso-8859-1"
>>>
>>> Hi, everyone.
>>>
>>> When running Kiwix's indexer on the ZIM file I had created from the Hebrew
>>> Wikipedia last week, the Kiwix data directory ran up to a total of 31 items,
>>> totalling 2.3 GB.  The ZIM file itself is ~300MB.  Does this proportion make
>>> sense?
>>>
>>> Detailed ls output attached.
>>>
>>> Thanks in advance,
>>>
>>>   Asaf Bartov
>>> --
>>> Asaf Bartov <asaf@forum2.org>
>>> -------------- next part --------------
>>> rotem@desktop:~/.www.kiwix.org/kiwix$ ls -l -h -a -R
>>> .:
>>> total 16K
>>> drwx------ 3 rotem rotem 4.0K 2009-07-01 16:10 .
>>> drwx------ 3 rotem rotem 4.0K 2009-07-01 16:10 ..
>>> drwx------ 4 rotem rotem 4.0K 2009-07-05 19:00 7680jxd5.default
>>> -rw-r--r-- 1 rotem rotem   94 2009-07-01 16:10 profiles.ini
>>>
>>> ./7680jxd5.default:
>>> total 1.7M
>>> drwx------ 4 rotem rotem 4.0K 2009-07-05 19:00 .
>>> drwx------ 3 rotem rotem 4.0K 2009-07-01 16:10 ..
>>> drwxr-xr-x 2 rotem rotem 4.0K 2009-07-02 05:13 31c26198d06ad265677b450796cc09aa.index
>>> -rw------- 1 rotem rotem  162 2009-07-05 18:19 compatibility.ini
>>> -rw-r--r-- 1 rotem rotem 135K 2009-07-05 18:19 compreg.dat
>>> drwxr-xr-x 2 rotem rotem 4.0K 2009-07-01 16:10 extensions
>>> -rw-r--r-- 1 rotem rotem  169 2009-07-01 16:10 localstore.rdf
>>> -rw-r--r-- 1 rotem rotem  304 2009-07-05 18:39 mimeTypes.rdf
>>> -rw-r--r-- 1 rotem rotem    0 2009-07-05 18:40 .parentlock
>>> -rw-r--r-- 1 rotem rotem 2.0K 2009-07-01 16:10 permissions.sqlite
>>> -rw-r--r-- 1 rotem rotem 128K 2009-07-05 18:54 places.sqlite
>>> -rw------- 1 rotem rotem  951 2009-07-05 19:00 prefs.js
>>> -rw-r--r-- 1 rotem rotem 1.1M 2009-07-05 18:20 XPC.mfasl
>>> -rw-r--r-- 1 rotem rotem  98K 2009-07-05 18:19 xpti.dat
>>> -rw-r--r-- 1 rotem rotem  98K 2009-07-05 18:20 XUL.mfasl
>>>
>>> ./7680jxd5.default/31c26198d06ad265677b450796cc09aa.index:
>>> total 2.4G
>>> drwxr-xr-x 2 rotem rotem 4.0K 2009-07-02 05:13 .
>>> drwx------ 4 rotem rotem 4.0K 2009-07-05 19:00 ..
>>> -rw-r--r-- 1 rotem rotem    0 2009-07-02 01:46 flintlock
>>> -rw-r--r-- 1 rotem rotem   12 2009-07-02 01:46 iamflint
>>> -rw-r--r-- 1 rotem rotem  22K 2009-07-02 05:13 position.baseA
>>> -rw-r--r-- 1 rotem rotem  21K 2009-07-02 05:10 position.baseB
>>> -rw-r--r-- 1 rotem rotem 1.4G 2009-07-02 05:13 position.DB
>>> -rw-r--r-- 1 rotem rotem  12K 2009-07-02 05:13 postlist.baseA
>>> -rw-r--r-- 1 rotem rotem  12K 2009-07-02 05:10 postlist.baseB
>>> -rw-r--r-- 1 rotem rotem 754M 2009-07-02 05:13 postlist.DB
>>> -rw-r--r-- 1 rotem rotem   70 2009-07-02 05:13 record.baseA
>>> -rw-r--r-- 1 rotem rotem   70 2009-07-02 05:10 record.baseB
>>> -rw-r--r-- 1 rotem rotem 3.3M 2009-07-02 05:13 record.DB
>>> -rw-r--r-- 1 rotem rotem 4.4K 2009-07-02 05:13 termlist.baseA
>>> -rw-r--r-- 1 rotem rotem 4.3K 2009-07-02 05:10 termlist.baseB
>>> -rw-r--r-- 1 rotem rotem 278M 2009-07-02 05:13 termlist.DB
>>> -rw-r--r-- 1 rotem rotem  232 2009-07-02 05:13 value.baseA
>>> -rw-r--r-- 1 rotem rotem  230 2009-07-02 05:10 value.baseB
>>> -rw-r--r-- 1 rotem rotem  14M 2009-07-02 05:13 value.DB
>>>
>>> ./7680jxd5.default/extensions:
>>> total 8.0K
>>> drwxr-xr-x 2 rotem rotem 4.0K 2009-07-01 16:10 .
>>> drwx------ 4 rotem rotem 4.0K 2009-07-05 19:00 ..
>>> rotem@desktop:~/.www.kiwix.org/kiwix$
>>>
>>> ------------------------------
>>>
>>> Message: 2
>>> Date: Sun, 5 Jul 2009 20:57:39 +0200
>>> From: Manuel Schneider <manuel.schneider@wikimedia.ch>
>>> Subject: Re: [openZIM dev-l] Kiwix index size
>>> To: asaf@forum2.org, dev-l@openzim.org
>>> Message-ID: <200907052057.39966.manuel.schneider@wikimedia.ch>
>>> Content-Type: text/plain;  charset="utf-8"
>>>
>>> Hi Asaf,
>>>
>>> On Sunday, 5 July 2009, Asaf Bartov wrote:
>>>> When running Kiwix's indexer on the ZIM file I had created from the Hebrew
>>>> Wikipedia last week, the Kiwix data directory ran up to a total of 31
>>>> items, totalling 2.3 GB.  The ZIM file itself is ~300MB.  Does this
>>>> proportion make sense?
>>> I am not sure about the other files which were created; you only need the
>>> ZIM file with the index itself.
>>>
>>> For 900'000 articles the ZIM file containing the articles was 1.4 GB, the
>>> Index ZIM was 1.0 GB.
>>>
>>> So I think 300 MB looks fine.
>>>
>>> Greets,
>>>
>>>
>>> Manuel
>>> --
>>> Regards
>>> Manuel Schneider
>>>
>>> Wikimedia CH - Verein zur Förderung Freien Wissens
>>> Wikimedia CH - Association for the advancement of free knowledge
>>> www.wikimedia.ch
>>>
>>>
>>> ------------------------------
>>>
>>> Message: 3
>>> Date: Sun, 05 Jul 2009 21:05:33 +0200
>>> From: Emmanuel Engelhart <emmanuel@engelhart.org>
>>> Subject: Re: [openZIM dev-l] Kiwix index size
>>> To: asaf@forum2.org, dev-l@openzim.org
>>> Message-ID: <4A50F97D.2030607@engelhart.org>
>>> Content-Type: text/plain; charset=ISO-8859-1
>>>
>>> Hi Asaf
>>>
>>> Asaf Bartov wrote:
>>>> When running Kiwix's indexer on the ZIM file I had created from the Hebrew
>>>> Wikipedia last week, the Kiwix data directory ran up to a total of 31 items,
>>>> totalling 2.3 GB.  The ZIM file itself is ~300MB.  Does this proportion make
>>>> sense?
>>>
>>> This is possible. Kiwix uses the Xapian search engine, which generates
>>> pretty big index files.
>>>
>>> I have two questions:
>>> * Are the search results OK?
>>> * Do you have a problem with the size of the index? Do you have a size
>>> limit?
>>>
>>> There are many open search/index software packages. I chose Xapian for
>>> many reasons, but it would be possible, under certain conditions, to add
>>> support for another search engine to Kiwix. It should also be possible
>>> to make a modified version of the indexer that uses less disk space
>>> (but indexes fewer words).
>>>
>>> OpenZIM itself provides a search solution; Tommi can tell you more
>>> about it. Maybe it would be interesting for you to test it and give us
>>> feedback!
>>>
>>> Regards
>>> Emmanuel
>>>
>>> ------------------------------
>>>
>>> End of dev-l Digest, Vol 5, Issue 2
>>> ***********************************
>>>
>>
>> --
>> Rotem Simha

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkpU8hEACgkQn3IpJRpNWtNWZwCeJz9ljyt4QrxaAOJnQdebD3Sw
qvsAoLfj1pJFYPjUW5WucEs8HhHetR0H
=RhGg
-----END PGP SIGNATURE-----



--
Asaf Bartov <asaf@forum2.org>