Dear Aidan,
Thanks a lot for your comments and questions. I am a developer of RaiseWikibase and, in fact, I have the same questions as you. I am not sure whether we will get answers here, so maybe we could open an issue on Phabricator describing the performance problems with rebuildItemTerms.php and CirrusSearch indexing.
Regarding "the labels to appear correctly" using rebuildItemTerms.php: yes, it is slow. :-( Maybe we should avoid using it at all. Inserting data directly into the wbt_item_terms, wbt_term_in_lang, wbt_text and wbt_text_in_lang tables from Python could be faster, but I first need to understand the logic of how those entries are created.
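To make the idea concrete, here is a minimal sketch of the insert order through the normalised term store, modelled in SQLite purely for illustration (a real import would target MariaDB). The column names follow my reading of the Wikibase term store schema and may differ in your version, and the type id for "label" is an assumption.

```python
import sqlite3

# Toy model of the Wikibase term store (column names are assumptions
# based on the documented schema; verify against your Wikibase version).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wbt_text         (wbx_id INTEGER PRIMARY KEY, wbx_text TEXT UNIQUE);
CREATE TABLE wbt_text_in_lang (wbxl_id INTEGER PRIMARY KEY, wbxl_language TEXT,
                               wbxl_text_id INTEGER);
CREATE TABLE wbt_term_in_lang (wbtl_id INTEGER PRIMARY KEY, wbtl_type_id INTEGER,
                               wbtl_text_in_lang_id INTEGER);
CREATE TABLE wbt_item_terms   (wbit_id INTEGER PRIMARY KEY, wbit_item_id INTEGER,
                               wbit_term_in_lang_id INTEGER);
""")

LABEL_TYPE_ID = 1  # assumed id for the 'label' term type

def insert_label(item_id: int, language: str, text: str) -> None:
    """Insert one label bottom-up: text -> text_in_lang -> term_in_lang -> item_terms."""
    conn.execute("INSERT OR IGNORE INTO wbt_text (wbx_text) VALUES (?)", (text,))
    text_id = conn.execute("SELECT wbx_id FROM wbt_text WHERE wbx_text = ?",
                           (text,)).fetchone()[0]
    cur = conn.execute("INSERT INTO wbt_text_in_lang (wbxl_language, wbxl_text_id) "
                       "VALUES (?, ?)", (language, text_id))
    cur2 = conn.execute("INSERT INTO wbt_term_in_lang (wbtl_type_id, wbtl_text_in_lang_id) "
                        "VALUES (?, ?)", (LABEL_TYPE_ID, cur.lastrowid))
    conn.execute("INSERT INTO wbt_item_terms (wbit_item_id, wbit_term_in_lang_id) "
                 "VALUES (?, ?)", (item_id, cur2.lastrowid))

insert_label(42, "en", "Douglas Adams")

# Read the label back through the full join chain, as the term store would.
row = conn.execute("""
    SELECT t.wbx_text FROM wbt_item_terms it
    JOIN wbt_term_in_lang til ON it.wbit_term_in_lang_id = til.wbtl_id
    JOIN wbt_text_in_lang  xl ON til.wbtl_text_in_lang_id = xl.wbxl_id
    JOIN wbt_text           t ON xl.wbxl_text_id = t.wbx_id
    WHERE it.wbit_item_id = 42
""").fetchone()
print(row[0])  # Douglas Adams
```

The point of the sketch is the dependency order: the text must exist before the text-in-lang row, and so on up the chain, so a bulk loader can fill each table in a single pass.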
Regarding the CirrusSearch indexing: I have thought about using parallel indexing as explained at https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/R.... But I have not tried it yet.
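A rough sketch of how the parallelism could look: generate one forceSearchIndex.php invocation per disjoint ID range and run the commands concurrently. The --fromId/--toId options come from the CirrusSearch maintenance script; please check them against your CirrusSearch version before running anything.

```python
# Sketch: split CirrusSearch indexing into ID ranges for parallel workers.
# The script path and --fromId/--toId flags are assumptions from the
# CirrusSearch maintenance docs; verify against your installed version.
SCRIPT = "extensions/CirrusSearch/maintenance/ForceSearchIndex.php"

def index_commands(max_id: int, step: int):
    """Yield one maintenance command per disjoint [from_id, to_id] range."""
    for from_id in range(0, max_id, step):
        to_id = min(from_id + step, max_id)
        yield f"php {SCRIPT} --fromId {from_id} --toId {to_id}"

cmds = list(index_commands(1_000_000, 250_000))
for cmd in cmds:
    # Run these concurrently, e.g. via subprocess.Popen or GNU parallel.
    print(cmd)
```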
Overall, we were wondering if we are approaching this bulk import in the right way?
What could be an alternative? There is MySQL with its many tables. `LOAD DATA INFILE` would require creating many files, which I wanted to avoid, but maybe that is worth trying.
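If we do go that route, the file generation itself is cheap: one tab-separated file per table plus the matching `LOAD DATA INFILE` statement. A minimal sketch (the table and column names are assumptions based on the term store schema, and the file path is just an example):

```python
import csv
import io

def write_tsv(fileobj, rows):
    """Write rows as tab-separated values, one row per line."""
    writer = csv.writer(fileobj, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)

def load_data_sql(table, columns, path):
    """Build the LOAD DATA INFILE statement matching the TSV layout."""
    cols = ", ".join(columns)
    return (f"LOAD DATA INFILE '{path}' INTO TABLE {table} "
            f"FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n' ({cols});")

# Demo with an in-memory buffer; a real run would write to disk,
# one file per wbt_* table.
buf = io.StringIO()
write_tsv(buf, [(1, "Douglas Adams"), (2, "English writer")])
sql = load_data_sql("wbt_text", ["wbx_id", "wbx_text"], "/tmp/wbt_text.tsv")
print(buf.getvalue())
print(sql)
```

Note that `LOAD DATA INFILE` needs the `FILE` privilege and a path readable by the MySQL server (or `LOAD DATA LOCAL INFILE` from the client side), which is part of the operational overhead I wanted to avoid.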
It seems that the PHP scripts are not optimised for performance/scale?
It seems so.
Does anyone have experience, tips or pointers on converting and loading large-ish scale legacy data into Wikibase?
1. Only to Blazegraph: http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData
2. To MariaDB with internal PHP scripts: https://wikibase.consulting/transferring-wikibase-data-between-wikis/
3. Apart from that, people mainly use the Wikibase API and its wrappers (to the best of my knowledge).
Is there no complete solution (envisaged) for this right now?
RaiseWikibase could become such a solution, but the issues you mentioned need to be addressed.
Kind regards, Renat