Hello,
I am new to the mailing list. I used to work on sotoki.
My question is somewhat related to my failed attempt to store Stack Overflow dumps inside WiredTiger. Eventually, I figured out that WiredTiger could not keep up with the write load; that is a pain point at least with WiredTiger (but also with the SQLite LSM extension). The workaround is to have as much RAM as data (which is in my opinion not acceptable) and to fine-tune eviction triggers and the like.
My questions are about libzim, zimwriterfs and how full-text search is implemented:
1) Why do zimwriterfs and libzim succeed at putting together all the HTML dumps of Wikipedia? Is it because they use a lot of RAM? Or is it a particular algorithm?
2) A follow-up question: how is the full-text search put together? Is the index written document by document and then packed into the ZIM file?
I am working in my free time on a search engine [1]; my goal is to have my own search engine that I can use locally. That is why I was thinking about Kiwix: via .zim files, Kiwix provides readily available dumps of many useful resources. The last question is:
3) How can I read the content of a .zim file from C code? Are there C bindings for libzim?
Thanks!
[1] It does not work anymore, but the code is at https://github.com/amirouche/babelia
Hi Amirouche
On 21.06.20 12:12, Amirouche Boubekki wrote:
> I am new to the mailing list. I used to work on sotoki.
> My question is somewhat related to my failed attempt to store Stack Overflow dumps inside WiredTiger. Eventually, I figured out that WiredTiger could not keep up with the write load; that is a pain point at least with WiredTiger (but also with the SQLite LSM extension). The workaround is to have as much RAM as data (which is in my opinion not acceptable) and to fine-tune eviction triggers and the like.
Your initiative on StackExchange/Sotoki has not been forgotten or lost. We maintain and develop the tools. We have really improved the scraper and made many releases these last few months: https://pypi.org/project/sotoki/#history
> My questions are about libzim, zimwriterfs and how full-text search is implemented:
> 1) Why do zimwriterfs and libzim succeed at putting together all the HTML dumps of Wikipedia? Is it because they use a lot of RAM? Or is it a particular algorithm?
libzim does not use a lot of RAM; otherwise it would not be able to run on smaller devices like RPis or low-end smartphones.
libzim succeeds in storing huge amounts of data and making them available on really small devices because the file format and libzim have been designed for that purpose. I won't explain all the details here, but everything needed to understand it is available at https://openzim.org/.
> 2) A follow-up question: how is the full-text search put together? Is the index written document by document and then packed into the ZIM file?
The full-text search engine has "nothing" to do with the ZIM format. We use the Xapian engine for that optional feature. We keep only the keywords in the Xapian index, not the documents (they are already in the ZIM). For a few years now, this index has been embedded in the ZIM file itself for a better UX. See https://xapian.org/ for more details.
> I am working in my free time on a search engine [1]; my goal is to have my own search engine that I can use locally. That is why I was thinking about Kiwix: via .zim files, Kiwix provides readily available dumps of many useful resources. The last question is:
If you deal with large amounts of free text and want to build a full-text search engine, this might be a good choice indeed.
> 3) How can I read the content of a .zim file from C code? Are there C bindings for libzim?
libzim is written in C++, so you won't be able to use it properly from plain C!
Good luck with your project.
Regards,
Emmanuel