Hi Renat,
On 2021-07-15 5:51, renat.shigapov@bib.uni-mannheim.de wrote:
Dear Aidan,
Thanks a lot for your comments and questions. I am a developer of RaiseWikibase and, in fact, I have the same questions as you. I am not sure whether we will receive the answers here, so maybe we could open an issue at Phabricator explaining the performance issues with rebuildItemTerms.php and CirrusSearch indexing.
Many thanks for your work on RaiseWikibase! It really helped us out a lot in a project and accelerated the initial load considerably! :)
Regarding "the labels to appear correctly" using rebuildItemTerms.php: Yes, it is slow. :-( May be we should avoid using it at all. Inserting data directly into wbt_item_terms, wbt_term_in_lang, wbt_text, and wbt_text_in_lang tables from Python could be faster. I need to understand the logic of creating those indexes first.
Yes, what I have seen is that in our monolingual mono-label dataset, it is creating very regular tables with item, lang, item label id, text, etc. Some of the tables just contain something like:
1 1 1 1
2 2 2 1
3 3 3 1
...
n n n 1
They appear to be foreign key IDs for items, labels, etc. There is then another table that is just a dictionary for labels. We're looking into just "faking" the tables in our use case for the moment since they are so regular and we are hoping to have an internal demo based on Wikibase ready soon.
My guess is that the tables will not be so regular when one is working with multilingual labels, or more generally, multiple labels per entity.
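To make the idea concrete, here is a minimal Python sketch of how one might generate the rows for those four tables in the monolingual, one-label-per-item case. The column meanings reflect my reading of the wbt_* term-store schema, and type id 1 meaning "label" is an assumption; the real tables should be checked first (for instance, I believe wbt_text deduplicates texts via a unique index, which this naive version ignores):

    LABEL_TYPE_ID = 1  # assumption: the wbt_type row for "label"

    def term_store_rows(labels, lang="en"):
        """labels: iterable of (item_numeric_id, label_text) pairs."""
        wbt_text, wbt_text_in_lang, wbt_term_in_lang, wbt_item_terms = [], [], [], []
        for i, (item_id, text) in enumerate(labels, start=1):
            wbt_text.append((i, text))                      # (wbx_id, wbx_text)
            wbt_text_in_lang.append((i, lang, i))           # (wbxl_id, wbxl_language, wbxl_text_id)
            wbt_term_in_lang.append((i, LABEL_TYPE_ID, i))  # (wbtl_id, wbtl_type_id, wbtl_text_in_lang_id)
            wbt_item_terms.append((i, item_id, i))          # (wbit_id, wbit_item_id, wbit_term_in_lang_id)
        return wbt_text, wbt_text_in_lang, wbt_term_in_lang, wbt_item_terms

The resulting tuples could then be inserted with executemany or written to files for LOAD DATA INFILE (mentioned below).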
Regarding the CirrusSearch indexing: I thought about using parallel indexing as explained at https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/R.... But I have not tried it.
Indeed, I think this might help. I also think the batch size might help, as I suspect the issue is the number of jobs/requests being made and not really the data volume involved. In my experience in other projects with Elasticsearch, batching n documents per request can lead to roughly n-fold speed-ups, especially when n is a low-ish value (but I would suggest batches of around 10000). What I do not know is whether CirrusSearch is doing something more intensive, such as applying ranking based on links; this might be more costly, but it could be addressed using external frameworks (like Hadoop, Spark, etc.).
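As a point of comparison, this is roughly what batching looks like against a plain Elasticsearch instance from Python; it is not the CirrusSearch maintenance path, and the index name and document shape here are made up:

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")
    items = [(1, "Universal Declaration of Human Rights"), (2, "Human rights")]

    def actions(pairs):
        for item_id, label in pairs:
            yield {"_index": "wikibase_content",  # hypothetical index name
                   "_id": item_id,
                   "_source": {"title": "Q%d" % item_id, "labels": {"en": label}}}

    # chunk_size controls how many documents go into each bulk request:
    # one request per 10000 documents rather than one request per document.
    helpers.bulk(es, actions(items), chunk_size=10000)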
Overall, we were wondering if we are approaching this bulk import in the right way?
What could be an alternative? There is MySQL with many tables. `LOAD DATA INFILE` would require creating many files. I wanted to avoid that. But maybe that's something to try.
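For what it's worth, the LOAD DATA INFILE route from Python might look roughly like the following (connection details, file name and target table are placeholders, and the MariaDB server needs local_infile enabled):

    import csv
    import pymysql

    rows = [(1, "first label"), (2, "second label")]  # example data

    with open("/tmp/wbt_text.tsv", "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(rows)

    conn = pymysql.connect(host="localhost", user="wiki", password="...",
                           database="my_wiki", local_infile=True)
    with conn.cursor() as cur:
        cur.execute(
            "LOAD DATA LOCAL INFILE '/tmp/wbt_text.tsv' "
            "INTO TABLE wbt_text FIELDS TERMINATED BY '\\t'")
    conn.commit()

One file (or a few) per table would be needed, but each file is loaded in a single statement rather than with millions of individual INSERTs.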
I mention a couple of alternatives below ...
It seems that the PHP scripts are not optimised for performance/scale?
It seems so.
Anyone has experience, tips or pointers on converting and loading large-ish scale legacy data into Wikibase?
- Only to Blazegraph: http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData
- To MariaDB with internal PHP scripts: https://wikibase.consulting/transferring-wikibase-data-between-wikis/
- Apart from that, people mainly use the Wikibase API and its wrappers (to the best of my knowledge)
Is there no complete solution (envisaged) for this right now?
RaiseWikibase could become such a solution, but the issues you mentioned need to be addressed.
I've been thinking about this, and I see four possible directions for Wikibase to support bulk inserts, listed here from least "radical" to most radical:
1) Extend RaiseWikibase with a custom Python script to stream pages from the MariaDB tables, and bulk load the corresponding data for the secondary tables into MariaDB.
2) SQL "Views": Use RaiseBase for the initial population of data and create the secondary tables within the database itself. This assumes that the data needed for the secondary tables can be queried and enumerated efficiently within the database. One of the issues is that the pages are stored in JSON, but many database engines now support a JSON datatype and SQL functions to extract data from the JSON datatype. Each table could then be created with a single SQL command (these secondary tables are more-or-less equivalent to a materialised view, assuming the JSON data can be queried from the pages). I think this should be a reasonable solution for medium-sized graphs (in the tens of millions of nodes I would say, taking maybe hours).
3) JSON dump -> SQL dump: Create external jobs to populate the data for the tables using an external framework, such as Hadoop or Spark. As input, the process receives a JSON document (or a collection thereof) representing the pages (similar to the JSON dump), and as output it creates an SQL document that can be run to load the data into the tables in batch; a rough Spark sketch also follows after the list. This would be a good solution for importing large-scale graphs into MariaDB (the limit on scale would rather become the MariaDB instance on one machine, but indexing billions of elements should be fine on a reasonable machine).
4) Migrate Wikibase: Create a new database connector such that MariaDB is no longer needed. For example, create a fork of the Wikibase repository that can query BlazeGraph directly. Then the data can be bulk loaded as RDF and accessed through SPARQL. This might lead to a performance cost as the notion of a materialised view does not really exist in RDF/SPARQL, though one could of course add additional triples to create a sort of concise "graph view" of the data needed. Another option would be to port to ElasticSearch, which offers distributed storage (as well as excellent text search with Lucene, interfaces like Kibana, etc). Data can be stored in a "denormalised" way so that precisely the data for each article can be loaded with one lookup.
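Regarding (2), the idea would be something like the following, heavily simplified: the source table "entity_json" (columns item_id and json) stands in for wherever the entity JSON actually lives, which in a real MediaWiki database would have to be reached via the page/slots/content/text tables, and the id/deduplication handling of wbt_text is glossed over here:

    import pymysql

    conn = pymysql.connect(host="localhost", user="wiki", password="...",
                           database="my_wiki")
    with conn.cursor() as cur:
        # Populate a term table in one statement using MariaDB's JSON functions.
        cur.execute("""
            INSERT INTO wbt_text (wbx_id, wbx_text)
            SELECT item_id, JSON_VALUE(json, '$.labels.en.value')
            FROM entity_json
            WHERE JSON_VALUE(json, '$.labels.en.value') IS NOT NULL
        """)
    conn.commit()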
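And a rough sketch of (3) with Spark, assuming a dump with one entity JSON object per line, writing tab-separated part-files that could afterwards be loaded with LOAD DATA INFILE (paths and output layout are made up):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("wikibase-terms").getOrCreate()

    entities = spark.read.json("entities.ndjson")  # one JSON entity per line

    labels = (entities
              .select(F.col("id").alias("item_id"),
                      F.col("labels.en.value").alias("label_en"))
              .where(F.col("label_en").isNotNull()))

    # One directory of tab-separated part-files per target table, to be
    # loaded into MariaDB in batch afterwards.
    labels.write.mode("overwrite").csv("out/wbt_text", sep="\t")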
Do you have further plans for extending RaiseWikibase? I could potentially try to find a student interested in working on something like (3) using Spark, for example, but at the earliest, this would yield results around this time next year. I think (4) is probably not a good idea as a "third party" project, as it would be a radical change to Wikibase. :)
Best, Aidan
Kind regards, Renat