Hi Tommi,
sorry for the late answer.
Tommi Mäkitalo wrote:
On Tuesday, 11 August 2009 16:26:41 Marc Bantle wrote:
Question 6:
I succeeded in producing a ZIM-Format index of
the openzim-edition of the German Wikipedia using
ZimWriter on the above 3.5 GB machine. Unlike
the index supplied on DVD, the generated index is
1.5 GB in size (instead of 1.1 GB). Any ideas why
that is?
The index size of the German Wikipedia is 1.5 GB because you did not pass
the trivial words list to the zimwriter. There is a trivial words list in
zimwriter/db/trivialwords-de.txt, which can be passed with -T to the writer.
Then all words in this list are ignored. They account for about 0.4 GB of
the index size.
Thanks for the hint. That will probably reduce memory requirements
and increase processing speed as well. I will give that a try next time.
How did you determine the list?
Just an idea: Wouldn't zimwriter be the place to automatically
generate such a list during indexing or in a separate pass?
An additional parameter could specify the maximum size of that
list, the minimum number of occurrences of a word, or maybe a
minimum percentage of articles the word occurs in before it
gets marked as trivial.
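
To make the idea concrete, here is a minimal Python sketch of such a pass. It is not zimwriter code: the function name find_trivial_words and the parameters min_fraction, min_occurrences and max_size are hypothetical, and the simple regex tokenizer is an assumption that will not match how zimwriter splits words.

```python
import re
from collections import Counter


def find_trivial_words(articles, min_fraction=0.5, min_occurrences=1, max_size=None):
    """Return candidate trivial words, most widespread first.

    articles        -- iterable of article texts (plain strings)
    min_fraction    -- minimum share of articles a word must occur in (0..1)
    min_occurrences -- minimum total number of occurrences of the word
    max_size        -- optional cap on the length of the returned list
    """
    doc_freq = Counter()   # number of articles containing each word
    term_freq = Counter()  # total occurrences of each word
    n_articles = 0

    for text in articles:
        n_articles += 1
        # Assumption: words are runs of word characters, case-insensitive.
        words = re.findall(r"\w+", text.lower())
        term_freq.update(words)
        doc_freq.update(set(words))  # count each word once per article

    if n_articles == 0:
        return []

    candidates = [
        word
        for word, df in doc_freq.items()
        if df / n_articles >= min_fraction and term_freq[word] >= min_occurrences
    ]
    # Most widespread words first; ties broken alphabetically for stable output.
    candidates.sort(key=lambda word: (-doc_freq[word], word))

    return candidates if max_size is None else candidates[:max_size]
```

For example, find_trivial_words(texts, min_fraction=0.6, max_size=3) would keep at most the three words that appear in at least 60% of the articles. The resulting list could then be written to a file and handed to the writer with -T, as with the existing list.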
Cheers,
Marc