Hi Renat,
On 2021-07-15 5:51, renat.shigapov@bib.uni-mannheim.de wrote:
Dear Aidan,
Thanks a lot for your comments and questions. I am a developer of RaiseWikibase and, in fact, I have the same questions as you. I am not sure whether we will receive the answers here, so maybe we could open an issue at Phabricator explaining the performance issues with rebuildItemTerms.php and CirrusSearch indexing.
Many thanks for your work on RaiseWikibase! It really helped us out a lot in a project and accelerated the initial load considerably! :)
Regarding "the labels to appear correctly" using rebuildItemTerms.php: Yes, it is slow. :-( May be we should avoid using it at all. Inserting data directly into wbt_item_terms, wbt_term_in_lang, wbt_text, and wbt_text_in_lang tables from Python could be faster. I need to understand the logic of creating those indexes first.
Yes, what I have seen is that in our monolingual mono-label dataset, it is creating very regular tables with item, lang, item label id, text, etc. Some of the tables just contain something like:
1 1 1 1
2 2 2 1
3 3 3 1
...
n n n 1
They appear to be foreign key IDs for items, labels, etc. There is then another table that is just a dictionary for labels. We're looking into just "faking" the tables in our use case for the moment since they are so regular and we are hoping to have an internal demo based on Wikibase ready soon.
My guess is that the tables will not be so regular when one is working with multilingual labels, or more generally, multiple labels per entity.
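To make the idea concrete, here is a minimal Python sketch of how one might generate the rows for those four tables in the monolingual, one-label-per-item case. The column meanings reflect my reading of the wbt_* term-store schema, and type id 1 meaning "label" is an assumption; the real tables should be checked first (for instance, I believe wbt_text deduplicates texts via a unique index, which this naive version ignores):

    LABEL_TYPE_ID = 1  # assumption: the wbt_type row for "label"

    def term_store_rows(labels, lang="en"):
        """labels: iterable of (item_numeric_id, label_text) pairs."""
        wbt_text, wbt_text_in_lang, wbt_term_in_lang, wbt_item_terms = [], [], [], []
        for i, (item_id, text) in enumerate(labels, start=1):
            wbt_text.append((i, text))                      # (wbx_id, wbx_text)
            wbt_text_in_lang.append((i, lang, i))           # (wbxl_id, wbxl_language, wbxl_text_id)
            wbt_term_in_lang.append((i, LABEL_TYPE_ID, i))  # (wbtl_id, wbtl_type_id, wbtl_text_in_lang_id)
            wbt_item_terms.append((i, item_id, i))          # (wbit_id, wbit_item_id, wbit_term_in_lang_id)
        return wbt_text, wbt_text_in_lang, wbt_term_in_lang, wbt_item_terms

The resulting tuples could then be inserted with executemany or written to files for LOAD DATA INFILE (mentioned below).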
Regarding the CirrusSearch indexing: I thought about using parallel indexing as explained at https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/R.... But I have not tried it.
Indeed, I think this might help. I also think the batch size might help, as I suspect the issue is the number of jobs/requests being made and not really the data volume involved. In my experience in other projects with Elasticsearch, batching n documents per request can lead to roughly n-fold speed-ups, especially when n is a low-ish value (but I would suggest batches of around 10000). What I do not know is whether CirrusSearch is doing something more intensive, such as applying ranking based on links; this might be more costly, but it could be addressed using external frameworks (like Hadoop, Spark, etc.).
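As a point of comparison, this is roughly what batching looks like against a plain Elasticsearch instance from Python; it is not the CirrusSearch maintenance path, and the index name and document shape here are made up:

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")
    items = [(1, "Universal Declaration of Human Rights"), (2, "Human rights")]

    def actions(pairs):
        for item_id, label in pairs:
            yield {"_index": "wikibase_content",  # hypothetical index name
                   "_id": item_id,
                   "_source": {"title": "Q%d" % item_id, "labels": {"en": label}}}

    # chunk_size controls how many documents go into each bulk request:
    # one request per 10000 documents rather than one request per document.
    helpers.bulk(es, actions(items), chunk_size=10000)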
Overall, we were wondering if we are approaching this bulk import in the right way?
What could be an alternative? There is MySQL with many tables. `LOAD DATA INFILE` would require creating many files. I wanted to avoid that. But maybe that's something to try.
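For what it's worth, the LOAD DATA INFILE route from Python might look roughly like the following (connection details, file name and target table are placeholders, and the MariaDB server needs local_infile enabled):

    import csv
    import pymysql

    rows = [(1, "first label"), (2, "second label")]  # example data

    with open("/tmp/wbt_text.tsv", "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(rows)

    conn = pymysql.connect(host="localhost", user="wiki", password="...",
                           database="my_wiki", local_infile=True)
    with conn.cursor() as cur:
        cur.execute(
            "LOAD DATA LOCAL INFILE '/tmp/wbt_text.tsv' "
            "INTO TABLE wbt_text FIELDS TERMINATED BY '\\t'")
    conn.commit()

One file (or a few) per table would be needed, but each file is loaded in a single statement rather than with millions of individual INSERTs.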
I mention a couple of alternatives below ...
It seems that the PHP scripts are not optimised for performance/scale?
It seems so.
Anyone has experience, tips or pointers on converting and loading large-ish scale legacy data into Wikibase?
- Only to Blazegraph: http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData
- To MariaDB with internal PHP scripts: https://wikibase.consulting/transferring-wikibase-data-between-wikis/
- Apart from that, people mainly use the Wikibase API and its wrappers (to the best of my knowledge)
Is there no complete solution (envisaged) for this right now?
RaiseWikibase could become such a solution, but the issues you mentioned need to be addressed.
I've been thinking about this, and I see four possible directions for Wikibase to support bulk inserts, listed here from least "radical" to most radical:
1) Extend RaiseWikibase with a custom Python script to stream pages from the MariaDB tables, and bulk load the corresponding data for the secondary tables into MariaDB.
2) SQL "Views": Use RaiseBase for the initial population of data and create the secondary tables within the database itself. This assumes that the data needed for the secondary tables can be queried and enumerated efficiently within the database. One of the issues is that the pages are stored in JSON, but many database engines now support a JSON datatype and SQL functions to extract data from the JSON datatype. Each table could then be created with a single SQL command (these secondary tables are more-or-less equivalent to a materialised view, assuming the JSON data can be queried from the pages). I think this should be a reasonable solution for medium-sized graphs (in the tens of millions of nodes I would say, taking maybe hours).
3) JSON dump -> SQL dump: Create external jobs to populate the data for the tables using an external framework, such as Hadoop or Spark. As input, the process receives a JSON document (or a collection thereof) representing the pages (similar to the JSON dump), and as output it creates an SQL document that can be run to load the data into the tables in batch; a rough Spark sketch also follows after the list. This would be a good solution for importing large-scale graphs into MariaDB (the limit on scale would rather become the MariaDB instance on one machine, but indexing billions of elements should be fine on a reasonable machine).
4) Migrate Wikibase: Create a new database connector such that MariaDB is no longer needed. For example, create a fork of the Wikibase repository that can query BlazeGraph directly. Then the data can be bulk loaded as RDF and accessed through SPARQL. This might lead to a performance cost as the notion of a materialised view does not really exist in RDF/SPARQL, though one could of course add additional triples to create a sort of concise "graph view" of the data needed. Another option would be to port to ElasticSearch, which offers distributed storage (as well as excellent text search with Lucene, interfaces like Kibana, etc). Data can be stored in a "denormalised" way so that precisely the data for each article can be loaded with one lookup.
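Regarding (2), the idea would be something like the following, heavily simplified: the source table "entity_json" (columns item_id and json) stands in for wherever the entity JSON actually lives, which in a real MediaWiki database would have to be reached via the page/slots/content/text tables, and the id/deduplication handling of wbt_text is glossed over here:

    import pymysql

    conn = pymysql.connect(host="localhost", user="wiki", password="...",
                           database="my_wiki")
    with conn.cursor() as cur:
        # Populate a term table in one statement using MariaDB's JSON functions.
        cur.execute("""
            INSERT INTO wbt_text (wbx_id, wbx_text)
            SELECT item_id, JSON_VALUE(json, '$.labels.en.value')
            FROM entity_json
            WHERE JSON_VALUE(json, '$.labels.en.value') IS NOT NULL
        """)
    conn.commit()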
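And a rough sketch of (3) with Spark, assuming a dump with one entity JSON object per line, writing tab-separated part-files that could afterwards be loaded with LOAD DATA INFILE (paths and output layout are made up):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("wikibase-terms").getOrCreate()

    entities = spark.read.json("entities.ndjson")  # one JSON entity per line

    labels = (entities
              .select(F.col("id").alias("item_id"),
                      F.col("labels.en.value").alias("label_en"))
              .where(F.col("label_en").isNotNull()))

    # One directory of tab-separated part-files per target table, to be
    # loaded into MariaDB in batch afterwards.
    labels.write.mode("overwrite").csv("out/wbt_text", sep="\t")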
Do you have further plans for extending RaiseWikibase? I could potentially try to find a student interested in working on something like (3) using Spark, for example, but at the earliest, this would yield results around this time next year. I think (4) is probably not a good idea as a "third party" project, as it would be a radical change to Wikibase. :)
Best, Aidan
Kind regards, Renat