Hey all,
Henry (in CC) and I have been looking into the possibility of importing a dataset on the order of 10-20 million items, with around 50 million claims, into Wikibase. Wikibase would be perfect for our needs, but we have been struggling quite a lot to load the data.
We are using the Docker version. Initial attempts on a small sample of 10-20 thousand items were not promising, with the load taking a very long time. We found that RaiseWikibase helped considerably in speeding up the initial load:
https://github.com/UB-Mannheim/RaiseWikibase
but even on that small sample of 10-20 thousand items, the secondary indexing process was taking several hours. This is the building_indexing() function here (which just calls maintenance scripts):
https://github.com/UB-Mannheim/RaiseWikibase/blob/main/RaiseWikibase/raiser....
This seems to be necessary for labels to appear correctly in the wiki, and for search to work.
Rather than call that method, we have been trying to invoke the maintenance scripts directly and play with arguments that might help, such as batch size. However, some of the scripts still take a long time, even considering the small size of what we are loading. For example:
docker exec wikibase-docker_wikibase_1 bash -c "php extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --sleep 0.1 --batch-size 10000"
This takes around 2 hours on the small sample (which we could multiply by a thousand for the full dataset, i.e., 83 days as an estimate). Investigating the MySQL database, the script seems to be populating four tables: wbt_item_terms, wbt_term_in_lang, wbt_text, and wbt_text_in_lang, but these are on the order of 20,000 tuples when finished, so it is surprising that the process takes so long. My guess is that the PHP code is looking up pages per item, generating thousands of random accesses on disk, when it would seem better to just stream tuples/pages contiguously from the table/disk?
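For what it's worth, the 83-day figure is just a linear extrapolation from the sample run; a rough back-of-envelope sketch (the sample size, observed time, and target size are our numbers, not a measured throughput guarantee):

```shell
# Linear extrapolation of rebuildItemTerms.php runtime from our sample run.
# Real throughput may vary with batch size, hardware, and I/O patterns.
sample_items=20000       # items in the test sample
sample_hours=2           # observed wall-clock time for rebuildItemTerms.php
full_items=20000000      # target dataset size (~20 million items)

rate=$(( sample_items / sample_hours ))   # ~10,000 items/hour
est_hours=$(( full_items / rate ))        # 2,000 hours
est_days=$(( est_hours / 24 ))            # ~83 days
echo "estimated full-load time: ${est_days} days"
```

So even a modest 10x speedup would still leave us at over a week of indexing.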
Later on, the CirrusSearch indexing also takes a long time on the small sample, generating batched jobs that take a long time to clear. In our previous experience, Elasticsearch will happily ingest millions of documents in an hour. We are still looking at how batch sizes might help, but it feels like it is taking much longer than it should.
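In case it helps anyone reproduce, these are the sorts of commands we have been using to watch and drain the job queue (a sketch: showJobs.php and runJobs.php are standard MediaWiki maintenance scripts, but the container name and flag values reflect our setup and may need adjusting). The snippet just prints the commands so they can be reviewed before running them against the container:

```shell
# Sketch: inspect and drain the MediaWiki job queue (CirrusSearch indexing
# runs as deferred jobs). Container name matches our Docker Compose setup;
# adjust it, and the --maxjobs value, for your install. The commands are
# printed here for review; run them directly once the paths match.
JOB_QUEUE_CMDS=$(cat <<'EOF'
docker exec wikibase-docker_wikibase_1 bash -c "php maintenance/showJobs.php --group"
docker exec wikibase-docker_wikibase_1 bash -c "php maintenance/runJobs.php --maxjobs 100000"
EOF
)
printf '%s\n' "$JOB_QUEUE_CMDS"
```

showJobs.php --group gives per-job-type counts, which is how we noticed the CirrusSearch jobs piling up.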
Overall, we were wondering whether we are approaching this bulk import in the right way. It seems that the PHP scripts are not optimised for performance/scale? Does anyone have experience, tips or pointers on converting and loading large-ish scale legacy data into Wikibase? Is there no complete solution (envisaged) for this right now?
Best, Aidan