Hello,
I am new to the mailing list. I used to work on sotoki.
My question is somewhat related to my failed attempt to store Stack Overflow dumps inside WiredTiger. Eventually, I figured out that WiredTiger could not keep up with the write load; that is a pain point at least with WiredTiger (but also with the SQLite LSM extension). The workaround is to have as much RAM as there is data (which is, in my opinion, not acceptable) and to fine-tune eviction triggers and the like.
My questions are about libzim, zimwriterfs and how full-text search is implemented:
1) How do zimwriterfs and libzim succeed at putting together all the HTML dumps of Wikipedia? Is it because they use a lot of RAM? Or is it a particular algorithm?
2) Follow-up question: how is the full-text search index put together? Is the index built document by document and then packed into the zim file?
I am working in my free time on a search engine [0]; my goal is to have my own search engine that I can use locally. That is why I was thinking about Kiwix: via .zim files, Kiwix provides readily available dumps of many useful resources. My last question is:
3) How can I read the content of a .zim file from C code? Are there C bindings for libzim?
Thanks!
[0] It does not work anymore, but the code is at https://github.com/amirouche/babelia