Hi,
we (Emmanuel and I) have already started to talk about this title problem, but we have not come to a conclusion. It is one of the topics we need to address at the next developer meeting.
I developed libzim and the indexer. As you know, Emmanuel uses Xapian in Kiwix. Maybe Emmanuel is willing to try out my indexer, but I can't promise anything.
The index size of the German Wikipedia is 1.5 GB because you did not pass the trivial words list to the zimwriter. There is a trivial words list in zimwriter/db/trivialwords-de.txt, which can be passed to the writer with -T. All words in this list are then ignored. They account for about 0.4 GB of the index size.
As the name suggests, the list contains trivial words like "der", "und" or "ein", where searching does not make sense because they are found in almost every article. The list currently contains 303 words. You can tune the list if you want.
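To illustrate the idea, here is a small sketch in Python. It is not the actual zimwriter code: the function names, the sample articles and the one-word-per-line file format are assumptions made only for this example. It builds a simple word index twice, once with and once without a trivial words list, so you can see how the ignored words drop out of the index.

```python
import re
from collections import defaultdict


def load_trivial_words(lines):
    """Collect trivial words from an iterable of lines.

    Assumption: one word per line; blank lines are skipped.
    """
    return {line.strip().lower() for line in lines if line.strip()}


def build_index(articles, trivial_words=frozenset()):
    """Map each non-trivial word to the set of article titles containing it."""
    index = defaultdict(set)
    for title, text in articles.items():
        for word in re.findall(r"\w+", text.lower()):
            if word in trivial_words:
                continue  # ignored, so it takes no space in the index
            index[word].add(title)
    return dict(index)


# Made-up sample data, only for illustration.
articles = {
    "Hund": "Der Hund ist ein Haustier und ein Raubtier",
    "Katze": "Die Katze ist ein Haustier",
}
trivial = load_trivial_words(["der", "und", "ein"])

full = build_index(articles)            # no trivial words list passed
small = build_index(articles, trivial)  # trivial words are ignored

print(len(full), len(small))
```

With a real list you would read the lines from the file instead of the hard-coded example list. The entries that disappear are exactly the ones that would otherwise point to almost every article.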
Tommi
On Tuesday, 11 August 2009 16:26:41 Marc Bantle wrote:
Hi all,
as I wrote in an earlier post, I have compiled ZimReader for the Openmoko platform. To make more databases available for the device, I also had a look at Kiwix and the ZIM files supplied on its site [1].
Some questions arose from that, and it appears that this list is the right place to discuss them. Please correct me if I'm wrong.
I observed that Kiwix produces an "ad-hoc" type index. This may be useful for desktops, as they have the power to generate an index file on the fly. On small-footprint devices this will not reasonably be possible because of their limited memory and CPU resources.
Even on a dual-core desktop with 3.5 GB of memory, Kiwix failed to produce an "ad-hoc" index of the openzim edition of the German Wikipedia, running out of memory after many hours.
Question 1: From the change log I see that Kiwix is using a prominent search engine (Xapian) instead of the mechanism ZimReader/Writer are using. Is there an easy way to reuse an index produced by Kiwix on a different machine?
Question 2: Are there plans to enable Kiwix to read reusable indexes of the format released for ZimReader/Writer?
Question 3: Are there plans to enable Kiwix to produce such a reusable index?
Question 4: Wouldn't it be desirable to deliver reusable indexes together with the ZIM article databases on the Kiwix site, for all those people with less capable devices (MIDs, netbooks, phones)?
Question 5: The ZIM databases supplied on the Kiwix site [1] seem to use the article's title field as the article ID field, which, I'm sure, solves some problems for Kiwix, but means that a search in ZimReader returns a list of article IDs instead of a list of article titles. Since both Kiwix and ZimReader are part of the openzim standardization effort, this confuses me a bit. Which format is supposed to be the standard?
Question 6: I succeeded in producing a ZIM-format index of the openzim edition of the German Wikipedia using ZimWriter on the 3.5 GB machine mentioned above. Unlike the index supplied on DVD, the generated index is 1.5 GB in size (instead of 1.1 GB). Any ideas why that is?
Cheers, Marc
[1] http://tmp.kiwix.org/zim
_______________________________________________
dev-l mailing list
dev-l@openzim.org
https://intern.openzim.org/mailman/listinfo/dev-l