Hey all,
Henry (in CC) and I have been looking into the possibility of importing
a dataset on the order of 10-20 million items, and perhaps around 50
million claims, into Wikibase. Wikibase would be perfect for our needs,
but we have been struggling quite a lot to load the data.
We are using the Docker version. Initial attempts on a small sample of
10-20 thousand items were not promising, with the load taking a very
long time. We found that RaiseWikibase considerably sped up the initial
load:
https://github.com/UB-Mannheim/RaiseWikibase
but even on that small sample of 10-20 thousand items, the secondary
indexing process was taking several hours. This is the
building_indexing() function here (which just calls maintenance
scripts):
https://github.com/UB-Mannheim/RaiseWikibase/blob/main/RaiseWikibase/raiser…
This seems to be necessary for labels to appear correctly in the wiki,
and for search to work.
Rather than call that method, we have been trying to invoke the
maintenance scripts directly and play with arguments that might help,
such as batch size. However, some of the scripts still take a long time,
even considering the small size of what we are loading. For example:
docker exec wikibase-docker_wikibase_1 bash -c \
  "php extensions/Wikibase/repo/maintenance/rebuildItemTerms.php \
     --sleep 0.1 --batch-size 10000"
This takes around 2 hours on the small sample; scaling by a factor of a
thousand for the full dataset gives an estimate of roughly 83 days.
Investigating the MySQL database, we see that the script populates four
tables: wbt_item_terms, wbt_term_in_lang, wbt_text, and
wbt_text_in_lang. These hold on the order of 20,000 rows when finished,
so it is surprising that the process takes so long. My guess is that
the PHP code is looking up pages item by item, generating thousands of
random accesses on disk, when it would seem better to just stream
tuples/pages contiguously from the table/disk?
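For reference, here is a quick way to check those row counts from the
host (a sketch only: the MySQL container name, credentials and database
name below are placeholders that will differ per setup):

docker exec wikibase-docker_mysql_1 mysql -u wikiuser -p'changeme' my_wiki \
  -e "SELECT 'wbt_item_terms' AS tbl, COUNT(*) AS n FROM wbt_item_terms
      UNION ALL SELECT 'wbt_term_in_lang', COUNT(*) FROM wbt_term_in_lang
      UNION ALL SELECT 'wbt_text', COUNT(*) FROM wbt_text
      UNION ALL SELECT 'wbt_text_in_lang', COUNT(*) FROM wbt_text_in_lang;"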
Later on, the CirrusSearch indexing also takes a long time for the
small sample, generating jobs in batches that take a long time to
clear. In previous experience, Elasticsearch will happily eat millions
of documents in an hour. We are still looking at how batch sizes might
help, but it feels like it is taking much longer than it should.
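For concreteness, this is the sort of invocation we mean (again a
sketch, not exactly what we ran: the container name is specific to our
setup, and the --queue/--maxjobs flags are ones we are experimenting
with):

# queue the CirrusSearch indexing jobs, then drain the job queue
docker exec wikibase-docker_wikibase_1 bash -c \
  "php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --queue && \
   php maintenance/runJobs.php --maxjobs 100000"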
Overall, we were wondering: are we approaching this bulk import in the
right way? It seems that the PHP scripts are not optimised for
performance/scale? Does anyone have experience, tips or pointers on
converting and loading large-ish scale legacy data into Wikibase? Is
there no complete solution (envisaged) for this right now?
Best,
Aidan