Hi Tommi,
sorry for the late reply.
Tommi Mäkitalo wrote:
On Tuesday, 11 August 2009 16:26:41 Marc Bantle wrote:
Question 6: I succeeded in producing a ZIM-format index of the openzim edition of the German Wikipedia using ZimWriter on the above 3.5 GB machine. Unlike the index supplied on the DVD, the generated index is 1.5 GB in size (instead of 1.1 GB). Any idea why that is?
The index size of the German Wikipedia is 1.5 GB because you did not pass the trivial words list to zimwriter. There is a trivial words list in zimwriter/db/trivialwords-de.txt, which can be passed to the writer with -T. All words in this list are then ignored during indexing. They account for about 0.4 GB of the index size.
Thanks for the hint. That will probably reduce memory requirements and increase processing speed as well. I will give that a try next time.
How did you determine the list?
Just an idea: wouldn't zimwriter be the place to generate such a list automatically, either during indexing or in a separate pass? Additional parameters could specify the maximum size of the list, the minimum number of occurrences of a word, or a minimum percentage of articles the word must occur in before it is marked as trivial.
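To make the idea a bit more concrete, here is a rough sketch in Python of what such a separate pass might look like. It is not zimwriter code: the function name trivial_words and the parameters min_fraction, min_count and max_size are made up for illustration, and the tokenization is deliberately naive.

```python
import re
from collections import Counter

# Naive tokenizer: runs of word characters (Unicode-aware, so umlauts work).
WORD_RE = re.compile(r"\w+")


def trivial_words(articles, min_fraction=0.0, min_count=1, max_size=None):
    """Return candidate trivial words, most widespread first.

    All names and parameters are hypothetical, for illustration only:
      min_fraction: minimum fraction of articles a word must occur in
      min_count:    minimum total number of occurrences of a word
      max_size:     maximum length of the returned list (None = unlimited)
    """
    doc_freq = Counter()   # number of articles containing the word
    term_freq = Counter()  # total occurrences across all articles
    total = 0
    for text in articles:
        words = WORD_RE.findall(text.lower())
        term_freq.update(words)
        doc_freq.update(set(words))  # count each word once per article
        total += 1
    if total == 0:
        return []
    candidates = [
        w for w in doc_freq
        if doc_freq[w] / total >= min_fraction and term_freq[w] >= min_count
    ]
    # Most widespread words first; ties broken alphabetically.
    candidates.sort(key=lambda w: (-doc_freq[w], w))
    return candidates if max_size is None else candidates[:max_size]


if __name__ == "__main__":
    sample = [
        "Der Hund und die Katze",
        "Die Katze und der Vogel",
        "Ein Vogel und ein Fisch",
    ]
    print(trivial_words(sample, min_fraction=0.6, max_size=3))
```

On a real dump the thresholds would need tuning, and the counters for the full German Wikipedia would themselves take a fair amount of memory, so a separate pass might be the safer choice.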
Cheers, Marc