Hey all,
Henry (in CC) and I have been looking into the possibility of importing a dataset on the order of 10-20 million items, with around 50 million claims, into Wikibase. Wikibase would be perfect for our needs, but we have been struggling quite a lot to load the data.
We are using the Docker version. Initial attempts on a small sample of 10-20 thousand items were not promising, with the load taking a very long time. We found that RaiseWikibase helped considerably in speeding up the initial load:
https://github.com/UB-Mannheim/RaiseWikibase
but even on that small sample of 10-20 thousand items, the secondary indexing process was taking several hours. This is the building_indexing() function here (which just calls maintenance scripts):
https://github.com/UB-Mannheim/RaiseWikibase/blob/main/RaiseWikibase/raiser....
This seems to be necessary for labels to appear correctly in the wiki, and for search to work.
Rather than call that method, we have been trying to invoke the maintenance scripts directly and play with arguments that might help, such as batch size. However, some of the scripts still take a long time, even considering the small size of what we are loading. For example:
docker exec wikibase-docker_wikibase_1 bash -c "php extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --sleep 0.1 --batch-size 10000"
This takes around 2 hours on the small sample (which we could multiply by a thousand for the full dataset, i.e., 83 days as an estimate). Investigating the MySQL database, the script seems to be populating four tables: wbt_item_terms, wbt_term_in_lang, wbt_text, and wbt_text_in_lang, but these are on the order of 20,000 tuples when finished, so it is surprising that the process takes so long. My guess is that the PHP code is looking up pages per item, generating thousands of random accesses on disk, when it would seem better to just stream tuples/pages contiguously from the table/disk?
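For what it's worth, the 83-day figure is just a linear extrapolation from the sample run; a rough back-of-envelope sketch (the sample size, observed time, and target size are our numbers, not a measured throughput guarantee):

```shell
# Linear extrapolation of rebuildItemTerms.php runtime from our sample run.
# Real throughput may vary with batch size, hardware, and I/O patterns.
sample_items=20000       # items in the test sample
sample_hours=2           # observed wall-clock time for rebuildItemTerms.php
full_items=20000000      # target dataset size (~20 million items)

rate=$(( sample_items / sample_hours ))   # ~10,000 items/hour
est_hours=$(( full_items / rate ))        # 2,000 hours
est_days=$(( est_hours / 24 ))            # ~83 days
echo "estimated full-load time: ${est_days} days"
```

So even a modest 10x speedup would still leave us at over a week of indexing.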
Later on, the CirrusSearch indexing also takes a long time on the small sample, generating batched jobs that take a long time to clear. In our previous experience, Elasticsearch will happily ingest millions of documents in an hour. We are still looking at how batch sizes might help, but it feels like it is taking much longer than it should.
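In case it helps anyone reproduce, these are the sorts of commands we have been using to watch and drain the job queue (a sketch: showJobs.php and runJobs.php are standard MediaWiki maintenance scripts, but the container name and flag values reflect our setup and may need adjusting). The snippet just prints the commands so they can be reviewed before running them against the container:

```shell
# Sketch: inspect and drain the MediaWiki job queue (CirrusSearch indexing
# runs as deferred jobs). Container name matches our Docker Compose setup;
# adjust it, and the --maxjobs value, for your install. The commands are
# printed here for review; run them directly once the paths match.
JOB_QUEUE_CMDS=$(cat <<'EOF'
docker exec wikibase-docker_wikibase_1 bash -c "php maintenance/showJobs.php --group"
docker exec wikibase-docker_wikibase_1 bash -c "php maintenance/runJobs.php --maxjobs 100000"
EOF
)
printf '%s\n' "$JOB_QUEUE_CMDS"
```

showJobs.php --group gives per-job-type counts, which is how we noticed the CirrusSearch jobs piling up.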
Overall, we were wondering whether we are approaching this bulk import in the right way. It seems that the PHP scripts are not optimised for performance/scale? Does anyone have experience, tips or pointers on converting and loading large-ish scale legacy data into Wikibase? Is there no complete solution (envisaged) for this right now?
Best, Aidan