Hey all,
Henry (in CC) and I have been looking into the possibility of importing a dataset of around 10-20 million items, and maybe around 50 million claims, into Wikibase. Wikibase would be perfect for our needs, but we have been struggling quite a lot to load the data.
We are using the Docker version. Initial attempts on a small sample of 10-20 thousand items were not promising, with the load taking a very long time. We found that RaiseWikibase helped to considerably speed up the initial load:
https://github.com/UB-Mannheim/RaiseWikibase
but on a small sample of 10-20 thousand items, the secondary indexing process was taking several hours. This is the building_indexing() process here (which just calls maintenance scripts):
https://github.com/UB-Mannheim/RaiseWikibase/blob/main/RaiseWikibase/raiser....
This seems to be necessary for labels to appear correctly in the wiki, and for search to work.
Rather than call that method, we have been trying to invoke the maintenance scripts directly and play with arguments that might help, such as batch size. However, some of the scripts still take a long time, even considering the small size of what we are loading. For example:
docker exec wikibase-docker_wikibase_1 bash "-c" "php extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --sleep 0.1 --batch-size 10000"
Takes around 2 hours on the small sample (which we could multiply by a thousand for the full dataset, i.e., 83 days as an estimate). Investigating the mysql database, it seems to be generating four tables: wbt_item_terms, wbt_term_in_lang, wbt_text, and wbt_text_in_lang, but these are in the order of 20,000 tuples when finished, so it is surprising that the process takes so long. My guess is that the PHP code is looking up pages per item, generating thousands of random accesses on the disk, when it would seem better to just stream tuples/pages contiguously from the table/disk?
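To illustrate the kind of contiguous scan I have in mind, something along these lines would let a rebuild stream item pages in large, ordered batches instead of looking each page up individually (a rough sketch only: the namespace number 120 for items is, as far as I know, the default in the Docker setup, and the keyset-style batching is just for illustration):

  -- stream item pages in contiguous, ordered batches
  SELECT page_id, page_title        -- page_title holds the entity ID, e.g. Q123
  FROM page
  WHERE page_namespace = 120        -- assumed item namespace (check your configuration)
    AND page_id > 0                 -- last page_id seen in the previous batch
  ORDER BY page_id
  LIMIT 10000;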
Later on, the CirrusSearch indexing is also taking a long time for the small sample, generating jobs for batches that take a long time to clear. In previous experience, Elasticsearch will happily eat millions of documents in an hour. We are still looking at how batch sizes might help, but it feels like it is taking much longer than it should.
Overall, we were wondering if we are approaching this bulk import in the right way? It seems that the PHP scripts are not optimised for performance/scale? Does anyone have experience, tips or pointers on converting and loading large-ish scale legacy data into Wikibase? Is there no complete solution (envisaged) for this right now?
Best, Aidan
Dear Aidan,
Thanks a lot for your comments and questions. I am a developer of RaiseWikibase and, in fact, I have the same questions as you. I am not sure whether we will get answers here, so maybe we could open an issue on Phabricator explaining the performance issues with rebuildItemTerms.php and CirrusSearch indexing.
Regarding "the labels to appear correctly" using rebuildItemTerms.php: Yes, it is slow. :-( May be we should avoid using it at all. Inserting data directly into wbt_item_terms, wbt_term_in_lang, wbt_text, and wbt_text_in_lang tables from Python could be faster. I need to understand the logic of creating those indexes first.
Regarding the CirrusSearch indexing: I thought about using parallel indexing as explained at https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/R.... But I have not tried it.
Overall, we were wondering if we are approaching this bulk import in the right way?
What could be an alternative? There is MySQL with many tables. `LOAD DATA INFILE` would require creating many files. I wanted to avoid that. But maybe that's something to try.
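For one of the secondary tables it might look roughly like the following (only a sketch: the column names are written from memory and would have to be checked against the actual term store schema):

  -- bulk-load pre-generated rows for wbt_text from a tab-separated file
  LOAD DATA LOCAL INFILE '/tmp/wbt_text.tsv'
  INTO TABLE wbt_text
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
  (wbx_id, wbx_text);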
It seems that the PHP scripts are not optimised for performance/scale?
It seems so.
Does anyone have experience, tips or pointers on converting and loading large-ish scale legacy data into Wikibase?
1. Only to Blazegraph: http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData
2. To MariaDB with internal PHP scripts: https://wikibase.consulting/transferring-wikibase-data-between-wikis/
3. Apart from that, people mainly use the Wikibase API and its wrappers (to the best of my knowledge)
Is there no complete solution (envisaged) for this right now?
RaiseWikibase could become such a solution, but the issues you mentioned need to be addressed.
Kind regards, Renat
Hi Renat,
On 2021-07-15 5:51, renat.shigapov@bib.uni-mannheim.de wrote:
Dear Aidan,
Thanks a lot for your comments and questions. I am a developer of RaiseWikibase and, in fact, I have the same questions as you. I am not sure whether we will get answers here, so maybe we could open an issue on Phabricator explaining the performance issues with rebuildItemTerms.php and CirrusSearch indexing.
Many thanks for your work on RaiseWikibase! It really helped us out a lot in a project and accelerated the initial load considerably! :)
Regarding "the labels to appear correctly" using rebuildItemTerms.php: Yes, it is slow. :-( May be we should avoid using it at all. Inserting data directly into wbt_item_terms, wbt_term_in_lang, wbt_text, and wbt_text_in_lang tables from Python could be faster. I need to understand the logic of creating those indexes first.
Yes, what I have seen is that in our monolingual mono-label dataset, it is creating very regular tables with item, lang, item label id, text, etc. Some of the tables just contain something like:
1 1 1 1
2 2 2 1
3 3 3 1
...
n n n 1
They appear to be foreign key IDs for items, labels, etc. There is then another table that is just a dictionary for labels. We're looking into just "faking" the tables in our use case for the moment since they are so regular and we are hoping to have an internal demo based on Wikibase ready soon.
My guess is that the tables will not be so regular when one is working with multilingual labels, or more generally, multiple labels per entity.
Regarding the CirrusSearch indexing: I thought about using parallel indexing as explained at https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/R.... But I have not tried it.
Indeed, I think this might help. I also think the batch size might help, as I suspect the issue is the number of jobs/requests being made and not really the data volume involved. In my experience in other projects with Elasticsearch, using batches of size n can generally lead to O(n) speed-ups, especially when n is a low-ish value (but I would suggest batches of around 10000). What I do not know is whether CirrusSearch is doing something more intensive, such as applying ranking based on links; this might be more costly, but could be solved using external frameworks (like Hadoop, Spark, etc.).
Overall, we were wondering if we are approaching this bulk import in the right way?
What could be an alternative? There is MySQL with many tables. `LOAD DATA INFILE` would require creating many files. I wanted to avoid that. But may be that's something to try.
I mention a couple of alternatives below ...
It seems that the PHP scripts are not optimised for performance/scale?
It seems so.
Does anyone have experience, tips or pointers on converting and loading large-ish scale legacy data into Wikibase?
- Only to Blazegraph: http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData
- To MariaDB with internal PHP scripts: https://wikibase.consulting/transferring-wikibase-data-between-wikis/
- Apart of that people mainly use the Wikibase API and its wrappers (to the best of my knowledge)
Is there no complete solution (envisaged) for this right now?
RaiseWikibase could become such a solution, but the issues you mentioned need to be addressed.
I've been thinking about this, and I see four possible directions, listed here roughly from least to most "radical", in order for Wikibase to support bulk inserts:
1) Extend RaiseWikibase with a custom Python script to stream pages from the MariaDB tables, and bulk load the corresponding data for the secondary tables into MariaDB.
2) SQL "Views": Use RaiseBase for the initial population of data and create the secondary tables within the database itself. This assumes that the data needed for the secondary tables can be queried and enumerated efficiently within the database. One of the issues is that the pages are stored in JSON, but many database engines now support a JSON datatype and SQL functions to extract data from the JSON datatype. Each table could then be created with a single SQL command (these secondary tables are more-or-less equivalent to a materialised view, assuming the JSON data can be queried from the pages). I think this should be a reasonable solution for medium-sized graphs (in the tens of millions of nodes I would say, taking maybe hours).
3) JSON dump -> SQL dump: Create external jobs to populate the data for the tables using an external framework, such as Hadoop or Spark. As input the process receives a JSON document (or collection thereof) representing the pages (similar to the JSON dump), and as output it creates an SQL document that one can call to load the data into the tables in batch. This would be a good solution for importing large-scale graphs into MariaDB (the limit on scale would rather become the MariaDB instance on one machine, but indexing billions of elements should be fine on a reasonable machine).
4) Migrate Wikibase: Create a new database connector such that MariaDB is no longer needed. For example, create a fork of the Wikibase repository that can query BlazeGraph directly. Then the data can be bulk loaded as RDF and accessed through SPARQL. This might lead to a performance cost as the notion of a materialised view does not really exist in RDF/SPARQL, though one could of course add additional triples to create a sort of concise "graph view" of the data needed. Another option would be to port to ElasticSearch, which offers distributed storage (as well as excellent text search with Lucene, interfaces like Kibana, etc). Data can be stored in a "denormalised" way so that precisely the data for each article can be loaded with one lookup.
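To sketch what I mean by (2): assuming, purely for illustration, that the entity JSON were accessible in a staging table entity_json(entity_id, json) (in reality it sits behind MediaWiki's revision/content tables and would need a join), something like the following could fill the label dictionary with a single statement. The target column name is an assumption on my part, and this only covers one language:

  -- hypothetical sketch: derive the distinct English label strings for wbt_text
  INSERT INTO wbt_text (wbx_text)
  SELECT DISTINCT JSON_VALUE(json, '$.labels.en.value')
  FROM entity_json
  WHERE JSON_VALUE(json, '$.labels.en.value') IS NOT NULL;

The other term tables could be derived similarly, which is why I say each table could in principle be created with a single SQL command.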
Do you have further plans for extending RaiseWikibase? I could potentially try to find a student interested in working on something like (3) using Spark, for example, but at the earliest this would yield results around this time next year. I think (4) is probably not a good idea as a "third party" project, as it would be a radical change to Wikibase. :)
Best, Aidan
Hi Aidan,
I am on holidays for a few days with limited access to the internet, but my quick reply to "Do you have further plans for extending RaiseWikibase?" is yes. I'll try to handle those secondary tables.
Regarding (2): can that be done in MariaDB?
If you can make (3) done, the whole Wikibase community would be very thankful.
Regarding (4): yeah, too radical and maintenance problems.
Sorry for the short answer; I'll add more on my return.
Kind regards, Renat
Hi Renat,
On 2021-07-16 15:54, Renat Shigapov wrote:
Hi Aidan,
I am on holidays for a few days with limited access to the internet, but my quick reply to "Do you have further plans for extending RaiseWikibase?" is yes. I'll try to handle those secondary tables.
Go enjoy your holidays. :)
I can quickly respond to one point:
Regarding (2): can that be done in MariaDB?
Since 10.2.3, MariaDB has support for JSON:
https://mariadb.com/kb/en/json-functions/
I have no experience with handling JSON in a relational database, so not sure overall whether it's a good solution, but I think it should be considerably more performant so long as the pages are iterated over, rather than each one being searched in the index.
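For example, something like this already extracts a label from a JSON document (a made-up document, just to show the function):

  SELECT JSON_VALUE(
    '{"labels": {"en": {"language": "en", "value": "Douglas Adams"}}}',
    '$.labels.en.value');
  -- returns: Douglas Adams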
Best, Aidan
Hi all,
Some general thoughts on this thread.
Overall, we were wondering if we are approaching this bulk import in the right way? It seems that the PHP scripts are not optimised for performance/scale?
In general I would say that the "right" thing to do is to speed up the app, rather than try to write directly to the SQL db. Lots / all of the work currently being done in these maintenance scripts can be / is done by a deferred job queue already within Wikibase (if configured). If this is done, then having one or more separate job runners would be beneficial; see https://www.mediawiki.org/wiki/Manual:Job_queue. The other key things could be a fast primary SQL database, and a fast host (or hosts) serving MediaWiki/Wikibase. Wikidata.org runs on a very large shared cluster (shared with all other Wikimedia sites) with multiple web, db, cache, and job hosts etc.
If you are running these maintenance scripts as part of an import where performance of the Wikibase does not matter to users, then using --sleep 0 will provide a performance increase.
Something else that might make all of this slightly simpler would be the ability to have this secondary index work head straight to the job queue, rather than run as part of the maintenance scripts themselves? (I see that runJobs is called as the last step at https://github.com/UB-Mannheim/RaiseWikibase/blob/main/RaiseWikibase/raiser....)
Currently it looks like in the building_indexing method in Raise all items will be loaded from the DB 2 times, and all properties will be loaded 3 times. This pattern for large imports is likely to get slower and slower, and also not take advantage of any caching etc that exists. The regular API sequence for example would 1) store data in the DB 2) write to any quick secondary stores 3) add the entity to a shared cache and 4) schedule jobs to populate the secondary indexes. These jobs would run fairly soon after the initial write, leading to better performance.
It seems that the PHP scripts are not optimised for performance/scale?
It seems so.
Scale, yes (these scripts ran across the whole of Wikidata.org). Performance, not so much: these scripts were primarily designed to be a long-running task between Wikibase updates, not as part of an import process.
Regarding rebuildItemTerms
Takes around 2 hours on the small sample (which we could multiply by a thousand for the full dataset, i.e., 83 days as an estimate).
Indeed, this script will only go so fast: it processes all entities in series, so you will have the bottleneck of a single process iterating through all entities. The job queue comparison again is that jobs get queued and executed by any number of job queue runners that you wish. The next bottleneck would then be the speed of your SQL database.
There is probably lots more to discuss in this thread but I found navigating it quite hard. Hopefully the above will prompt some more discussion. It would be great to be able to chat somewhere more realtime (not an email thread) on this topic too.
Adam / Addshore Wikidata Wikibase Tech Lead
This thread and my earlier email prompted me to write a deep dive blog post on exactly what happens when you send a create item request to wbeditentity.
You can find this newly published (and thus probably with some typos) below https://addshore.com/2021/07/what-happens-in-wikibase-when-you-make-a-new-it...
Hopefully this can prompt some more questions, thoughts, changes, etc. for the bulk import case, both for RaiseWikibase and for the API.
Adam / Addshore Wikidata Wikibase Tech Lead
Thanks Adam for the detailed blog post!
From your mail, I understand that the current Wikibase framework is based on continuous updates, and in that sense it can process a high rate of manual edits without problem. I think though that for high performance bulk updates, the best option would be an "ETL" style framework that loads data directly into the underlying databases using custom tools. What seems to be hurting performance in the bulk load scenario is:
1) The creation of massive amounts of requests/jobs all at once.
2) Random disk accesses that require about 10 ms per go on a conventional disk if the data are not cached or read sequentially.
These are not a problem per se for the typical running and maintenance of a Wikibase instance, but rather only occur when one tries to import millions of items at once. Given that we can assume that an "admin" is importing the data, we can also bypass a lot of the typical processes that are run to verify edit permissions, rate limits, etc.
Overall, with the goal of importing a legacy dataset of around 10 million items, with maybe 100 million "values", in one day, on conventional hardware, these two problems would need to be resolved. The underlying databases should not have a problem processing data at this scale/rate using bulk update/transaction methods, but they will have a problem if each item is updated as a separate transaction; even with batching, it can still lead to hundreds of thousands of transactions that would choke up any persistent database running on a traditional hard-disk. So I think that the most performant solution would be to bypass Wikibase, and to prepare the data (e.g., a user-provided JSON dump) to be loaded directly into the underlying databases that Wikibase can later query. The issue with this approach is that it is not so trivial to understand what data need to be loaded where (this requires a deep understanding of Wikibase), and that such a bulk loader would need to be kept "synchronised" with changes to the main Wikibase software, so it would require maintenance over the years.
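To make the contrast concrete, the difference is roughly the following (a sketch using wbt_text just as an example target; a real load would of course also have to keep the IDs and foreign keys consistent):

  -- per-item style: one transaction (and one log flush) per row
  INSERT INTO wbt_text (wbx_text) VALUES ('label 1');
  INSERT INTO wbt_text (wbx_text) VALUES ('label 2');
  -- ... repeated hundreds of thousands of times ...

  -- bulk style: one transaction, multi-row inserts
  START TRANSACTION;
  INSERT INTO wbt_text (wbx_text) VALUES
    ('label 1'), ('label 2'), ('label 3');   -- ... thousands of rows per statement
  COMMIT;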
Best, Aidan
I wondered earlier if you were trying to use hard disks when talking about random access time. I'm interested because I may be deploying Wikibase on HDD in the near future. But I'm only considering that because I know size and usage will be relatively low; I won't be doing bulk loads.
The reality is, Wikibase is not designed for HDD, and it doesn't work well on it. It may work slightly better with hardware RAID, but realistically any high-speed Wikibase implementation needs to be on SSD. Wikidata uses several, just for WDQS: https://mediawiki.org/wiki/Wikidata_Query_Service/Implementation#Hardware and https://phabricator.wikimedia.org/T221632 which mentions they are used in RAID 0. I think the MediaWiki side of things may be even larger, because more must be stored.
While testing optimal load strategy, Adam notes "SSDs should obviously be used for storage." https://addshore.com/2021/02/testing-wdqs-blazegraph-data-load-performance/
This is in part a limitation of Blazegraph, for which fast disks are the primary requirement, after which CPU and RAM matter (RAM may matter more if you have to use HDD, and for reads; but for loading it has to save the results): https://github.com/blazegraph/database/wiki/Hardware_Configuration https://sourceforge.net/p/bigdata/discussion/676946/thread/5b4acb02/#2810 https://github.com/blazegraph/database/issues/94#issuecomment-399141685
If they are not available to you, it might be worth using a cloud system to perform the load and take a compressed dump off of it afterwards. They are not necessarily cheap, but you might only need it for a relatively brief time. (Oracle is even giving away ARM VMs and SSD storage for free now. Regrettably, Wikibase does not yet have an arm64 distribution, but I'm not aware of a specific reason why it could not eventually.)
It might also be worth looking into whether Blazegraph (and MySQL) has any degraded-reliability modes which you can enable during the loading process. From my own experience I know database tuning can be key when attempting to use HDDs as a platform. Some ideas for MySQL: https://mariadb.com/kb/en/innodb-system-variables/#innodb_flush_log_at_trx_c... https://mariadb.com/kb/en/replication-and-binary-log-system-variables/#binlo...
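For example, something like the following for the duration of the import only (these settings trade durability for speed, so be prepared to redo the load if the server crashes, and restore the defaults afterwards):

  SET GLOBAL innodb_flush_log_at_trx_commit = 0;  -- flush/sync the redo log roughly once per second
  SET GLOBAL sync_binlog = 0;                     -- let the OS decide when to sync the binary log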
You can also try 'nobarrier' as a filesystem mount option to disable fsync, and 'data=writeback' for ext4. Be aware it violates ACID expectations and should only be used for test purposes or when you are willing to rebuild the entire database/FS if something goes wrong. Similarly, RAID 0 might be useful if you bear in mind losing one disk means losing all the data.
Best regards, -- Laurence "GreenReaper" Parry https://GreenReaper.co.uk
Indeed, yes! Seeks on a traditional disk are in the order of 10 ms, while on a solid-state disk they tend rather to be around 0.1 ms, so one could expect a 100x speed-up when doing a lot of random accesses. One could also perhaps memory-map the database files for even lower latency, assuming the data fit in RAM.
So an SSD should help a lot within the current framework, but it should not be necessary in general. Plus, an ETL load framework should still be much more efficient for bulk loads than an SSD with the current process. SSDs have lower latency, but they still have latency: one million seeks on a hard disk take around 3 hours, one million seeks on an SSD take around 2 minutes, while reading one million records (around 100 MB) sequentially from a standard hard disk should take around one second.
Still I think that trying with an SSD for now is a good idea!
Best, Aidan
Dear all,
We have the ticket "Improve bulk import via API" on Phabricator now: https://phabricator.wikimedia.org/T287164. It is aimed at uniting the related tickets and discussing further development around bulk import in Wikibase. Your contributions are very welcome.
Kind regards, Renat
Hi all,
Just a quick update on RaiseWikibase (see the commit https://github.com/UB-Mannheim/RaiseWikibase/commit/428c28b9334e9bdb613b3a49...).
The fingerprint data (labels, aliases and descriptions) are now inserted into the secondary tables on the fly as well.
The results of the first tests using https://github.com/UB-Mannheim/RaiseWikibase/blob/main/megaWikibase.py are here:
1. 8965 properties with monolingual labels, descriptions and aliases are uploaded in 99 seconds (previously in 42 seconds). So roughly 90 properties per second.
2. 20000 items with one label and without aliases/descriptions (but with 2-3 claims with a qualifier & reference) are uploaded in 112 seconds (previously in 84 seconds). Roughly 178 items per second.
Performance has dropped (the secondary tables are now populated as well). Let's see what can be optimized.
Kind regards, Renat
Hi all,
I'm going to try to reply to various bits of the last 10 messages in a single response / comment.
Overall, with the goal of importing a legacy dataset of around 10 million items, with maybe 100 million "values", in one day, on conventional hardware
Sounds like a great goal. Tying this to some real-world data for Wikidata: I believe (looking at some chat messages I have) that on or around the 4th of March 2021 Wikidata made it to 1.8k edits per minute. If this were sustained for a 24-hour period, this would be 2.5 million changes per day. Continuing to throw some napkin-math numbers around, and drawing on the "What happens in Wikibase when you make a new Item" blog post [0]: roughly 75% of save time is spent in AbuseFilter, which could be turned off and which import use cases wouldn't need to worry about. So napkin math says that in Wikidata production we could probably live in the realm of 2.5-10 million entity changes per day using the main web APIs, hypothetically. I'm sure that would expose some other stresses etc. though. Of course it would not be possible to realize this speed with a single Wikibase instance setup, which I believe is the setup for most current testing & profiling.
It might also be worth looking into whether Blazegraph (and MySQL) has any degraded-reliability modes which you can enable during the loading process.
Blazegraph does; I have experimented with them before, but ultimately I didn't manage to make them improve load time any more than we already have. You can read some stuff about that in another blog post [1].
We have the ticket "Improve bulk import via API" on Phabricator now: https://phabricator.wikimedia.org/T287164. It is aimed at uniting the related tickets and discussing further development around bulk import in Wikibase. Your contributions are very welcome.
Thanks for the ticket; I look forward to seeing this continue to move forward!
The greatest speedup is achieved when many thousands of items and statements are added in a single database transaction. The normal importer starts and commits a database transaction for every single item - or maybe even statement.
Indeed during regular editing a transaction will happen around each edit, or rather each web request. Looking forward to a bulk loading / importing API this is certainly one of the things that could be considered.
The fingerprint data (labels, aliases and descriptions) are now inserted into the secondary tables on the fly as well.
I had a quick chat with "jwass" today and we touched on labels, aliases and descriptions briefly. One thing to note here, and that hopefully we can discuss more now that all of the moving parts are more apparent in [0], is whether some of these things are needed at all. The secondary term storage has some use cases, but there will be situations where those features are not important for users, or where the performance tradeoff is fine. In those cases we could probably avoid writing to such stores at all. These, and more questions, I think are really important to be raised by Wikibase users.
Looking forward to continuing on this topic!
Adam
[0] https://addshore.com/2021/07/what-happens-in-wikibase-when-you-make-a-new-it... [1] https://addshore.com/2021/02/testing-wdqs-blazegraph-data-load-performance/
Does anyone have experience, tips or pointers on converting and loading large-ish scale legacy data into Wikibase? Is there no complete solution (envisaged) for this right now?
Even though this topic is a few days old, I would like to add some of my experiences. I had the same problem about a year ago and wrote a Java program to insert millions of items pretty fast. It works for the LTS version. I don't know if it also works with the current version.
You can find the code here: https://github.com/jze/wikibase-insert
Best wishes, Jesper
@Renat, thank you for starting the Phabricator ticket!
@Jesper, this looks really great! Do you know if this takes care of the secondary indexing for labels, specifically the wbt_item_terms, wbt_term_in_lang, wbt_text, and wbt_text_in_lang tables? I notice that these tables are not mentioned in the code.
Also one doubt: when you say "if you do this without a transaction", do you mean doing it without an explicit transaction, i.e., with a separate transaction per operation? I guess that the default behaviour is that each operation will form its own transaction and do the corresponding logging, latching, etc., for each insert, update, etc.?
@Dennis, agreed that this is part of the issue. Some of the scripts do provide options for batching, which certainly help significantly, but can still lead to lots of transactions / tasks / requests when importing at scale. It seems that some of the scripts for secondary indexing, however, do not support batching.
Best, Aidan
Hi all,
@Jesper, thank you for your code! It helped me a lot!
@Aidan, just a short comment: Jesper's wikibase-insert is adapted to Wikibase 1.34, and `wb_terms` is the table where the labels, aliases and descriptions were stored (https://www.mediawiki.org/wiki/Wikibase/Schema/wb_terms). I started to develop RaiseWikibase based on WB 1.34 as well and inserted data into `wb_terms`, but during development Wikibase 1.35.1 was released, so I adapted RaiseWikibase to 1.35.1. Since Wikibase 1.35.1, `wb_terms` is deprecated and the secondary storage for items and properties was introduced (https://doc.wikimedia.org/Wikibase/master/php/md_docs_storage_terms.html).
Kind regards, Renat
@Jesper, this looks really great! Do you know if this takes care of the secondary indexing for labels, specifically the wbt_item_terms, wbt_term_in_lang, wbt_text, and wbt_text_in_lang tables? I notice that these tables are not mentioned in the code.
The experiment was performed with version 1.34. I have used and extended the code in a real-world project to work with LTS version 1.35. That was quite a bit of work again. I will check if it is possible to understand the new code without the context of the real-world project.
Also one doubt: when you say "if you do this without a transaction", do you mean doing it without an explicit transaction, i.e., with a separate transaction per operation?
The greatest speedup is achieved when many thousands of items and statements are added in a single database transaction. The normal importer starts and commits a database transaction for every single item - or maybe even statement.
Best wishes, Jesper
Thanks for this, I found your thread from last year. I've not yet installed Wikibase, but I'm browsing the mailing list to see if it's worth even trying for my use case. That would be 100 million items and 1 billion statements. Did you use this importer in production? What's the largest amount of data you imported?