We have been facing the same issue, because we need to upload 20 million items, and I wonder if you were able to update those tables at least? Perhaps we could work on the others together?
I have now been able to insert the data directly into the database.
The process consists of these steps:
- generate the data for an item in JSON
- determine the next Q number and update the JSON item data accordingly
- insert the data into the various database tables
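For illustration, a minimal sketch of the first two steps might look like this (the JSON is drastically shortened, and reading the first free numeric id from Wikibase's `wb_id_counters` table is an assumption, not necessarily what the actual program does):

```java
// Sketch of steps 1 and 2: build the item JSON and assign the next Q number.
// Real items also need claims, sitelinks, etc.; this is only the skeleton.
public class ItemJsonBuilder {

    private static final String TEMPLATE =
            "{\"type\":\"item\",\"id\":\"%s\"," +
            "\"labels\":{\"en\":{\"language\":\"en\",\"value\":\"%s\"}}}";

    private long nextNumericId;

    public ItemJsonBuilder(long firstFreeId) {
        // determined once at start-up (assumption: read from wb_id_counters)
        this.nextNumericId = firstFreeId;
    }

    /** Returns the JSON for one item and advances the Q counter. */
    public String nextItemJson(String englishLabel) {
        String qId = "Q" + nextNumericId++;
        // note: englishLabel would need proper JSON escaping in real code
        return String.format(TEMPLATE, qId, englishLabel);
    }
}
```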
There are several foreign keys in the database. Therefore, you can either
a) use MySQL's auto-increment and read the assigned ids after an insert, or
b) assume that you are the only process writing to the database and assign the keys in advance.
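As a rough JDBC sketch of the two options (table and column names are simplified; a real MediaWiki `page` row needs more columns than shown here):

```java
import java.sql.*;

public class KeyStrategies {

    // Option a): let MySQL assign the id and read it back after the insert.
    static long insertWithAutoIncrement(Connection con, String title) throws SQLException {
        // simplified: a real page row also needs page_latest, page_len, page_random, ...
        String sql = "INSERT INTO page (page_namespace, page_title) VALUES (?, ?)";
        try (PreparedStatement ps = con.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS)) {
            ps.setInt(1, 120);              // item namespace (assumption: default Wikibase setup)
            ps.setString(2, title);
            ps.executeUpdate();
            try (ResultSet keys = ps.getGeneratedKeys()) {
                keys.next();
                return keys.getLong(1);     // the auto-increment page_id
            }
        }
    }

    // Option b): we are the only writer, so we hand out the page_id ourselves.
    static void insertWithPredefinedId(Connection con, long pageId, String title) throws SQLException {
        String sql = "INSERT INTO page (page_id, page_namespace, page_title) VALUES (?, ?, ?)";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setLong(1, pageId);
            ps.setInt(2, 120);
            ps.setString(3, title);
            ps.executeUpdate();
        }
    }
}
```

With option b) the same pre-computed id can also be used directly for the rows in the other tables that reference it via foreign keys.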
I have written a Java program to perform these steps. The process is still terribly slow. Generating and updating the JSON item data is blazing fast; I can process thousands of items per second.
The bottleneck is the MySQL database. Up to now I have used the standard docker-compose setup. With the Java program I can write only one item per second.
So I tried option b) with pre-defined ids and wrote the SQL statements into a text file. I fed these SQL statements into the MySQL command-line tool, but after two hours I lost patience.
Now I will try to optimize the MySQL database setup.
Jesper
I suspect that I have found the solution. The trick is to put the insert operations into a transaction.
In a simple experiment with 1,000 items it took 760 ms per item. Within a transaction an item could be inserted in less than 3 ms. What a speed-up!
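The core of the change, as a rough JDBC sketch (the real code inserts into several tables per item; the chunk size is just an example):

```java
import java.sql.*;
import java.util.List;

public class TransactionalInsert {

    // Insert many items inside one transaction instead of one autocommit per statement.
    static void insertAll(Connection con, List<String> itemJsonList) throws SQLException {
        con.setAutoCommit(false);            // the decisive step
        try {
            int count = 0;
            for (String json : itemJsonList) {
                insertItem(con, json);       // the per-item INSERT statements
                if (++count % 10_000 == 0) {
                    con.commit();            // commit in chunks so the transaction stays manageable
                }
            }
            con.commit();
        } catch (SQLException e) {
            con.rollback();
            throw e;
        } finally {
            con.setAutoCommit(true);
        }
    }

    private static void insertItem(Connection con, String json) throws SQLException {
        // placeholder for the actual inserts into the Wikibase tables
    }
}
```

Most of the difference seems to come from the fact that with autocommit every single INSERT is its own transaction and is flushed to disk on commit.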
Now the insertion of my 40 million items seems to be doable. :-)
After I have cleaned up my demo code a bit, I am going to share it via GitHub.
Jesper
Hi Jesper,
Yes, please! I'm really interested in such a speedup. We need to load eight million items and any speedup is appreciated :)
Cheers, Hans-Jürgen
You can find it here: https://github.com/jze/wikibase-insert
Jesper
Dear all,
First, let me introduce myself: I am Pascal Lefeuvre from the French national library (BnF). I am the scrum master on a project which aims at building the new software for cataloguing documents at the BnF. For this software, we have decided to use a private Wikibase instance to store and manage the data we will produce.
We are very interested in what you have done, Jesper, since we will need to initialize our Wikibase with more than 50 million items. We have made some experiments using the MediaWiki/Wikibase API: we developed a bot that calls the create-item API in several threads in order to speed up the process, but it is not sufficient. We plan to run the bot on several servers at a time to see if it goes faster.
I have some questions about what you did: are the Elasticsearch index and Blazegraph automatically synchronized when you create your items directly in the database without using the API, or do you have to run specific scripts to synchronize everything?
Thank you again for your experiments.
Pascal
De : "Jesper Zedlitz" jesper@zedlitz.de A : wikibaseug@lists.wikimedia.org Date : 11/06/2020 15:17 Objet : Re: [Wikibase] propographical data and insert performance Envoyé par : "Wikibaseug" wikibaseug-bounces@lists.wikimedia.org
After I have cleaned up my demo code a bit I am going to share it via GitHub.
yes, please! I'm really interested in such a speedup. We need to load eight million of items and any speedup is appreciated :)
You can find it here: https://github.com/jze/wikibase-insert
Jesper
_______________________________________________ Wikibaseug mailing list Wikibaseug@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikibaseug
En raison de la situation sanitaire en France concernant le Covid-19, et suite aux instructions du Gouvernement, tous les sites de la Bibliothèque nationale de France sont fermés au public jusqu’à nouvel ordre. Avant d'imprimer, pensez à l'environnement.
After a little modification to my code at https://github.com/jze/wikibase-insert/ the `recentchanges` table is also updated correctly.
Inserted items can be found via the SPARQL interface. You can see the updates in the log output of docker-compose. I have not yet measured how long it takes until all inserted items have been written to Blazegraph.
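For a quick check from Java, something like this queries the local query service (the endpoint URL and port are assumptions based on the standard docker-compose setup and may differ in your installation):

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class SparqlCheck {
    public static void main(String[] args) throws Exception {
        // Assumed local query service endpoint of the docker-compose setup; adjust if yours differs.
        String endpoint = "http://localhost:8989/bigdata/namespace/wdq/sparql";
        String query = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(endpoint + "?query=" +
                        URLEncoder.encode(query, StandardCharsets.UTF_8)))
                .header("Accept", "application/sparql-results+json")
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```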
However, the items do not appear in Elasticsearch. For my purposes it is enough to query the data via the SPARQL endpoint. That's why I did not look at the synchronization to Elasticsearch.
Jesper