Hello,
I am new to the mailing list. I used to work on sotoki.
My question is somewhat related to my failed attempt to store Stack Overflow dumps inside WiredTiger. Eventually, I figured out that WiredTiger could not keep up with the write load; that is a pain point at least with WiredTiger (but also with the SQLite LSM extension). The workaround is to have as much RAM as data (which is in my opinion not acceptable) and to fine-tune eviction triggers and the like.
My questions are about libzim, zimwriterfs and how full-text search is implemented:
1) Why do zimwriterfs and libzim succeed at putting together all the HTML dumps of Wikipedia? Is it because they use a lot of RAM? Or is it a particular algorithm?
2) A follow-up question: how is the full-text search put together? Is the index written document by document and then packed into the ZIM file?
I am working in my free time on a search engine [1]; my goal is to have my own search engine that I can use locally. That is why I was thinking about Kiwix: via .zim files, Kiwix provides readily available dumps of many useful resources. The last question is:
3) How can I read the content of a .zim file from C code? Are there C bindings for libzim?
Thanks!
[1] It does not work anymore, but the code is at https://github.com/amirouche/babelia
Hi Amirouche
On 21.06.20 12:12, Amirouche Boubekki wrote:
> I am new to the mailing list. I used to work on sotoki.
> My question is somewhat related to my failed attempt to store Stack Overflow dumps inside WiredTiger. Eventually, I figured out that WiredTiger could not keep up with the write load; that is a pain point at least with WiredTiger (but also with the SQLite LSM extension). The workaround is to have as much RAM as data (which is in my opinion not acceptable) and to fine-tune eviction triggers and the like.
Your initiative on StackExchange/Sotoki has not been forgotten or lost. We maintain and develop the tools. We have really improved the scraper and made many releases these last few months: https://pypi.org/project/sotoki/#history
> My questions are about libzim, zimwriterfs and how full-text search is implemented:
> 1) Why do zimwriterfs and libzim succeed at putting together all the HTML dumps of Wikipedia? Is it because they use a lot of RAM? Or is it a particular algorithm?
libzim does not use a lot of RAM; otherwise it would not be able to run on smaller devices like RPis or low-end smartphones.
libzim succeeds in storing huge amounts of data and making them available on really small devices because the file format and libzim have been designed for that purpose. I won't explain all the details here, but everything needed to understand it is available at https://openzim.org/.
> 2) A follow-up question: how is the full-text search put together? Is the index written document by document and then packed into the ZIM file?
The full-text search engine has "nothing" to do with the ZIM format. We use the Xapian engine for that optional feature. We keep only the keywords in the Xapian index, not the documents (they are already in the ZIM). For a few years now, this index has been embedded in the ZIM file itself for a better UX. See https://xapian.org/ for more details.
> I am working in my free time on a search engine [1]; my goal is to have my own search engine that I can use locally. That is why I was thinking about Kiwix: via .zim files, Kiwix provides readily available dumps of many useful resources. The last question is:
If you deal with large amounts of free text and want to build a full-text search engine, this might be a good choice indeed.
> 3) How can I read the content of a .zim file from C code? Are there C bindings for libzim?
libzim is written in C++, so you won't be able to use it properly from plain C!
Good luck with your project.
Regards,
Emmanuel