Hi,
Emmanuel and I have already started to discuss this title problem, but we
have not come to a conclusion. It is one of the topics we need to address at
the next developer meeting.
I developed libzim and the indexer. As you know, Emmanuel uses Xapian
in Kiwix. Maybe Emmanuel is willing to try out my indexer, but I can't promise
anything.
The index of the German Wikipedia is 1.5 GB in size because you did not pass
the trivial words list to the zimwriter. There is a trivial words list in
zimwriter/db/trivialwords-de.txt, which can be passed to the writer with -T.
All words in this list are then ignored. They account for about 0.4 GB of the
index size.
As the name suggests, the list contains trivial words like "der", "und" or
"ein", for which searching does not make sense because they are found in
almost every article. The list currently contains 303 words. You can tune the
list if you want.
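
To illustrate why skipping trivial words shrinks an index, here is a minimal
Python sketch. It is not the zimwriter implementation: the function names
load_trivial_words and build_index are hypothetical, the sample articles are
made up, and the one-word-per-line layout of the list is an assumption.

```python
import re
from collections import defaultdict


def load_trivial_words(lines):
    """Build a set of trivial words (assumed format: one word per line)."""
    return {line.strip().lower() for line in lines if line.strip()}


def build_index(articles, trivial_words=frozenset()):
    """Map every non-trivial word to the set of article titles containing it."""
    index = defaultdict(set)
    for title, text in articles.items():
        for word in re.findall(r"\w+", text.lower()):
            if word not in trivial_words:
                index[word].add(title)
    return dict(index)


# Made-up sample data, only for illustration.
articles = {
    "Berlin": "Berlin ist die Hauptstadt und ein Land der Bundesrepublik",
    "Hamburg": "Hamburg ist ein Stadtstaat und der Hafen ist bekannt",
}
trivial = load_trivial_words(["der", "und", "ein"])

full = build_index(articles)               # index without a trivial words list
reduced = build_index(articles, trivial)   # index with trivial words ignored

print(len(full), "words indexed without the list")
print(len(reduced), "words indexed with the list")
```

The trivial words are exactly the entries that would point to almost every
article, so dropping them removes the longest posting lists at little cost
to search quality.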
Tommi
On Tuesday, 11 August 2009 16:26:41 Marc Bantle wrote:
Hi all,
as I wrote in an earlier post, I have compiled
ZimReader for the Openmoko platform. To make more
databases available for the device, I also had a look at
Kiwix and the ZIM files supplied on its site [1].
Some questions arose from that, and it appears that this
list is the right place to discuss them. Please correct me
if I'm wrong.
I observed that Kiwix produces an "ad-hoc" type
index. This may be useful for desktops, as they have
the power to generate an index file on the fly. On
small-footprint devices this will not reasonably be
possible, due to a lack of memory and CPU resources.
Even on a dual-core desktop with 3.5 GB of memory,
Kiwix failed to produce an "ad-hoc" index of the
openzim edition of the German Wikipedia, running
out of memory after many hours.
Question 1:
From the change log I see that Kiwix is using a
prominent search engine (Xapian) instead of the
mechanism ZimReader/Writer are using. Is there an
easy way to reuse an index produced by Kiwix on
a different machine?
Question 2:
Are there plans to enable Kiwix to read reusable
indexes of the format released for ZimReader/
Writer?
Question 3:
Are there plans to enable Kiwix to produce such a
reusable index?
Question 4:
Wouldn't it be desirable to deliver reusable indexes
together with zim-article-databases for all those
people with less capable devices (mids, netbooks,
phones) on the Kiwix site?
Question 5:
The ZIM databases supplied on the Kiwix site [1]
seem to use the article's title field as the article ID field,
which, I'm sure, solves some problems for Kiwix,
but results in a list of article IDs as the result of a search
in ZimReader instead of a list of article titles. Since
both Kiwix and ZimReader are part of the openzim
standardization effort, this confuses me a bit. Which
format is supposed to be the standard?
Question 6:
I succeeded in producing a ZIM-format index of
the openzim edition of the German Wikipedia using
ZimWriter on the above 3.5 GB machine. Unlike
the index supplied on DVD, the generated index is
1.5 GB in size (instead of 1.1 GB). Any ideas why
that is?
Cheers,
Marc
[1] http://tmp.kiwix.org/zim
_______________________________________________
dev-l mailing list
dev-l(a)openzim.org
https://intern.openzim.org/mailman/listinfo/dev-l