Dear Aidan,
Thanks a lot for your comments and questions. I am a developer of RaiseWikibase and, in
fact, I have the same questions as you. I am not sure whether we will receive the answers
here, so maybe we could open a task on Phabricator describing the performance issues with
rebuildItemTerms.php and CirrusSearch indexing.
Regarding "the labels to appear correctly" using rebuildItemTerms.php: Yes, it
is slow. :-( May be we should avoid using it at all. Inserting data directly into
wbt_item_terms, wbt_term_in_lang, wbt_text, and wbt_text_in_lang tables from Python could
be faster. I need to understand the logic of creating those indexes first.
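For illustration, a minimal Python sketch of that idea using pymysql. The wbt_* column
names and the wbt_type lookup are my assumptions from reading the term store schema, so
please verify them against your Wikibase version before trusting this:

```python
import pymysql

def _get_or_insert(cur, select_sql, insert_sql, params):
    """Return the id of an existing row, inserting it if missing."""
    cur.execute(select_sql, params)
    row = cur.fetchone()
    if row:
        return row[0]
    cur.execute(insert_sql, params)
    return cur.lastrowid

def insert_item_term(conn, item_id, language, text, type_id):
    """Attach `text` as a term (label/description/alias) to item Q<item_id>.

    `type_id` is the wbt_type id of the term type; look it up once with
    SELECT wby_id FROM wbt_type WHERE wby_name = 'label' (assumed schema).
    """
    with conn.cursor() as cur:
        # wbt_text deduplicates the raw strings.
        text_id = _get_or_insert(
            cur,
            "SELECT wbx_id FROM wbt_text WHERE wbx_text = %s",
            "INSERT INTO wbt_text (wbx_text) VALUES (%s)",
            (text,),
        )
        # wbt_text_in_lang binds a string to a language code.
        text_in_lang_id = _get_or_insert(
            cur,
            "SELECT wbxl_id FROM wbt_text_in_lang"
            " WHERE wbxl_language = %s AND wbxl_text_id = %s",
            "INSERT INTO wbt_text_in_lang (wbxl_language, wbxl_text_id)"
            " VALUES (%s, %s)",
            (language, text_id),
        )
        # wbt_term_in_lang adds the term type (label, description, alias).
        term_in_lang_id = _get_or_insert(
            cur,
            "SELECT wbtl_id FROM wbt_term_in_lang"
            " WHERE wbtl_type_id = %s AND wbtl_text_in_lang_id = %s",
            "INSERT INTO wbt_term_in_lang (wbtl_type_id, wbtl_text_in_lang_id)"
            " VALUES (%s, %s)",
            (type_id, text_in_lang_id),
        )
        # wbt_item_terms links the term to the numeric item id.
        cur.execute(
            "INSERT INTO wbt_item_terms (wbit_item_id, wbit_term_in_lang_id)"
            " VALUES (%s, %s)",
            (item_id, term_in_lang_id),
        )
    conn.commit()
```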
Regarding the CirrusSearch indexing: I thought about using parallel indexing, as explained
at https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/…, but
I have not tried it yet.
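If someone wants to experiment, here is a rough Python sketch of the idea: run several
ForceSearchIndex.php processes over disjoint page-id ranges. The script path, the
--fromId/--toId options, and their exact semantics are my assumptions; check them against
the CirrusSearch docs linked above:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Path and option names are assumptions; check the CirrusSearch docs.
SCRIPT = "extensions/CirrusSearch/maintenance/ForceSearchIndex.php"

def index_range(from_id, to_id):
    """Run one maintenance-script process over a page-id range."""
    subprocess.run(
        ["php", SCRIPT, "--fromId", str(from_id), "--toId", str(to_id)],
        check=True,
    )

def parallel_index(max_page_id, workers=4):
    """Split [0, max_page_id] into chunks and index them in parallel."""
    step = max_page_id // workers + 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        jobs = [
            pool.submit(index_range, lo, min(lo + step, max_page_id))
            for lo in range(0, max_page_id + 1, step)
        ]
        for job in jobs:
            job.result()  # re-raise any failed run

if __name__ == "__main__":
    parallel_index(max_page_id=1_000_000, workers=4)
```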
> Overall, we were wondering if we are approaching this bulk import in the right way?
What could be an alternative? Wikibase stores everything in MySQL across many tables.
`LOAD DATA INFILE` would require creating one file per table; I wanted to avoid that,
but maybe it is something to try.
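A minimal sketch of that route for a single table: dump the values to a TSV file, then
bulk-load it. It assumes pymysql with local_infile enabled on both client and server,
and the same wbt_text/wbx_text names as above; repeat the pattern per table:

```python
import csv
import pymysql

def bulk_load_texts(conn, texts, path="wbt_text.tsv"):
    """Dump term strings to a TSV file and bulk-load them into wbt_text."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        for text in texts:
            writer.writerow([text])
    with conn.cursor() as cur:
        # Requires local_infile=1 on the MySQL/MariaDB server as well.
        cur.execute(
            "LOAD DATA LOCAL INFILE %s INTO TABLE wbt_text"
            " FIELDS TERMINATED BY '\\t' (wbx_text)",
            (path,),
        )
    conn.commit()

conn = pymysql.connect(host="localhost", user="wiki", password="secret",
                       database="my_wiki", local_infile=True)
bulk_load_texts(conn, ["Douglas Adams", "writer"])
```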
> It seems that the PHP scripts are not optimised for performance/scale?
It seems so.
> Anyone has experience, tips or pointers on converting and loading large-ish scale
> legacy data into Wikibase?
1. Only to Blazegraph:
http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData
2. To MariaDB with internal PHP scripts:
https://wikibase.consulting/transferring-wikibase-data-between-wikis/
3. Apart from that, people mainly use the Wikibase API and its wrappers, to the best of
my knowledge (see the sketch after this list).
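For completeness, a bare-bones sketch of approach 3 without any wrapper, using plain
`requests` against the action API. It assumes a dev wiki at API_URL that allows anonymous
edits; a real import needs login, error handling, and maxlag-aware throttling:

```python
import json
import requests

API_URL = "http://localhost/w/api.php"  # adjust to your wiki

session = requests.Session()

# Fetch a CSRF token; an anonymous token only works if the wiki allows it.
token = session.get(API_URL, params={
    "action": "query", "meta": "tokens", "format": "json",
}).json()["query"]["tokens"]["csrftoken"]

# Create one new item with a single English label.
item = {"labels": {"en": {"language": "en", "value": "Douglas Adams"}}}
resp = session.post(API_URL, data={
    "action": "wbeditentity", "new": "item",
    "data": json.dumps(item), "token": token, "format": "json",
}).json()
print(resp.get("entity", {}).get("id"))  # e.g. 'Q1' on success
```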
> Is there no complete solution (envisaged) for this right now?
RaiseWikibase could become such a solution, but the issues you mentioned need to be
addressed first.
Kind regards,
Renat