I wondered earlier whether you were referring to hard disks when you mentioned random access times. I'm interested because I may be deploying Wikibase on HDDs in the near future, but only because I know size and usage will be relatively low; I won't be doing bulk loads.

The reality is that Wikibase is not designed for HDDs and does not work well on them. It may work slightly better with hardware RAID, but realistically any high-speed Wikibase deployment needs to be on SSDs. Wikidata uses several just for WDQS: https://mediawiki.org/wiki/Wikidata_Query_Service/Implementation#Hardware and https://phabricator.wikimedia.org/T221632 (the latter mentions they are run in RAID 0). I suspect the MediaWiki side of things needs even more, because more data must be stored.

While testing WDQS data-load performance, Adam notes that "SSDs should obviously be used for storage": https://addshore.com/2021/02/testing-wdqs-blazegraph-data-load-performance/

This is in part a limitation of Blazegraph, for which fast disks are the primary requirement, after which CPU and RAM matter (RAM may matter more if you have to use HDDs, and for reads; but for loading, the results still have to be written to disk):
https://github.com/blazegraph/database/wiki/Hardware_Configuration
https://sourceforge.net/p/bigdata/discussion/676946/thread/5b4acb02/#2810
https://github.com/blazegraph/database/issues/94#issuecomment-399141685

If fast disks are not available to you, it might be worth using a cloud system to perform the load and taking a compressed dump off it afterwards. Such systems are not necessarily cheap, but you might only need one for a relatively brief time. (Oracle is even giving away ARM VMs and SSD storage for free now. Regrettably, Wikibase does not yet have an arm64 distribution, but I'm not aware of a specific reason why it couldn't eventually.)

It might also be worth looking into whether Blazegraph (and MySQL/MariaDB) has any degraded-reliability modes which you can enable during the loading process. From my own experience I know database tuning can be key when attempting to use HDDs as a platform. Some ideas for MySQL/MariaDB, with a rough sketch after the links:
https://mariadb.com/kb/en/innodb-system-variables/#innodb_flush_log_at_trx_commit
https://mariadb.com/kb/en/replication-and-binary-log-system-variables/#binlog_commit_wait_usec
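
For illustration only, a minimal my.cnf sketch (the values are hypothetical and trade crash-safety for fewer log flushes; revert them once the load is finished) might look like:

    [mysqld]
    # Write the InnoDB log at each commit but only flush it to disk about
    # once per second; a crash can lose up to ~1s of transactions.
    innodb_flush_log_at_trx_commit = 2
    # Let several transactions share one binary-log group commit; the usec
    # value is only used when the count is non-zero.
    binlog_commit_wait_count = 20
    binlog_commit_wait_usec  = 100000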

You can also try 'nobarrier' as a filesystem mount option to disable write barriers (forced cache flushes), and 'data=writeback' for ext4. Be aware that this violates ACID expectations and should only be used for testing, or when you are willing to rebuild the entire database/filesystem if something goes wrong. Similarly, RAID 0 might be useful, if you bear in mind that losing one disk means losing all the data.
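
For example, a hypothetical /etc/fstab entry for a dedicated, disposable load volume (device and mount point are placeholders) could be:

    # ext4 with write barriers off and metadata-only journalling; only for a
    # volume you are prepared to rebuild from scratch.
    /dev/sdb1  /var/lib/mysql  ext4  noatime,nobarrier,data=writeback  0  2

Note that ext4 will not switch data= journalling modes on a remount, so set this before the database files are created.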

Best regards,
--
Laurence "GreenReaper" Parry
https://GreenReaper.co.uk

From: Aidan Hogan <aidhog@gmail.com>
Sent: Thursday, July 22, 2021 10:50:06 PM
To: wikibaseug@lists.wikimedia.org <wikibaseug@lists.wikimedia.org>
Subject: [Wikibase] Re: Experiences/doubts regarding bulk imports into Wikibase
 
Thanks Adam for the detailed blog post!

From your mail, I understand that the current Wikibase framework is
based on continuous updates, and in that sense it can process a high
rate of manual edits without problem. I think though that for high
performance bulk updates, the best option would be an "ETL" style
framework that loads data directly into the underlying databases using
custom tools. What seems to be hurting performance in the bulk load
scenario is:

1) The creation of massive amounts of requests/jobs all at once.
2) Random disk accesses that require about 10ms per go on a conventional
disk if the data are not cached or read sequentially.

These are not a problem per se for the typical running and maintenance
of a Wikibase instance, but rather only occur when one tries to import
millions of items at once. Given that we can assume that an "admin" is
importing the data, we can also bypass a lot of the typical processes
that are run to verify edit permissions, rate limits, etc.

Overall, with the goal of importing a legacy dataset of around 10
million items, with maybe 100 million "values", in one day, on
conventional hardware, these two problems would need to be resolved. The
underlying databases should not have a problem processing data at this
scale/rate using bulk update/transaction methods, but they will have a
problem if each item is updated as a separate transaction; even with
batching, it can still lead to hundreds of thousands of transactions
that would choke up any persistent database running on a traditional
hard-disk. So I think that the most performant solution would be to
bypass Wikibase, and to prepare the data (e.g., a user-provided JSON
dump) to be loaded directly into the underlying databases that Wikibase
can later query. The issue with this approach is that it is not so
trivial to understand what data need to be loaded where (this requires a
deep understanding of Wikibase), and that such a bulk loader would need
to be "synchronised" with changes to the main Wikibase software, so it
would require maintenance over the years.
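
As a rough illustration of the difference (a minimal sketch only, assuming
pymysql and a made-up item_staging table rather than the real Wikibase
schema):

    import pymysql

    conn = pymysql.connect(host="localhost", user="wiki", password="***",
                           database="wikibase", autocommit=False)

    def load_batch(rows):
        # One multi-row INSERT and a single COMMIT per batch: the log is
        # flushed once per batch, rather than once per imported item.
        with conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO item_staging (item_id, item_json) VALUES (%s, %s)",
                rows)
        conn.commit()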

Best,
Aidan

On 2021-07-19 17:50, Addshore wrote:
> This thread and my earlier email prompted me to write a deep dive blog
> post on exactly what happens when you send a create item request to
> wbeditentity.
>
> You can find this newly published (and thus probably with some typos) below
> https://addshore.com/2021/07/what-happens-in-wikibase-when-you-make-a-new-item/
>
> Hopefully this can prompt some more questions, thoughts, changes etc for
> the bulk import case both for RAISE and for the API etc.
>
> Adam / Addshore
> Wikidata Wikibase Tech Lead
>
> On Mon, 19 Jul 2021 at 17:45, Addshore <addshorewiki@gmail.com> wrote:
>
>     Hi all,
>
>     Some general thoughts on this thread.
>
>      > Overall, we were wondering if we are approaching this bulk import
>     in the
>      > right way? It seems that the PHP scripts are not optimised for
>      > performance/scale?
>
>     In general I would say that the "right" thing to do is to speed up
>     the app, rather than try to write directly to the SQL db.
>     Lots / all of the work currently being done in these maintenance
>     scripts can be / is done by a deferred job queue already within
>     Wikibase (if configured)
>     If this is done, then having one or more separate job runners would
>     be beneficial, see https://www.mediawiki.org/wiki/Manual:Job_queue
>     The other key things could be a fast primary SQL database, and fast
>     host (or hosts) serving MediaWiki/Wikibase.
>     Wikidata.org runs on a very large shared cluster (shared with all
>     other Wikimedia sites) with multiple web, db, cache, and job hosts etc.
>
>     If you are running these maintenance scripts as part of an import
>     where performance of the wikibase does not matter to users then
>     using sleep 0 will provide a performance increase.
>
>     Something else that might make all of this slightly simpler would be
>     the ability to have this secondary index work head straight to the
>     job queue, rather than run as part of the maintenance scripts
>     themselves?
>     (I see that runJobs is called as the last step at
>     https://github.com/UB-Mannheim/RaiseWikibase/blob/main/RaiseWikibase/raiser.py#L172-L175)
>
>     Currently it looks like in the building_indexing method in Raise all
>     items will be loaded from the DB 2 times, and all properties will be
>     loaded 3 times.
>     This pattern for large imports is likely to get slower and slower,
>     and also not take advantage of any caching etc that exists.
>     The regular API sequence for example would 1) store data in the DB
>     2) write to any quick secondary stores 3) add the entity to a shared
>     cache and 4) schedule jobs to populate the secondary indexes.
>     These jobs would run fairly soon after the initial write, leading to
>     better performance.
>
>     > > It seems that the PHP scripts are not optimised for performance/scale?
>      > It seems so.
>
>     Scale yes (these scripts ran across the whole of Wikidata.org)
>     Performance not so much, these scripts were primarily designed to be
>     a long running task between Wikibase updates, not as part of an
>     import process.
>
>     Regarding rebuildItemterms
>      > Takes around 2 hours on the small sample (which we could multiply
>     by a
>      > thousand for the full dataset, i.e., 83 days as an estimate).
>     Indeed this script will only go so fast, it processes all entities
>     in series, so you will have the bottleneck of a single process
>     iterating through all entities.
>     The job queue comparison again is that jobs get queued and executed by any
>     number of job queue runners that you wish.
>     The next bottleneck would then be the speed of your SQL database.
>
>     There is probably lots more to discuss in this thread but I found
>     navigating it quite hard.
>     Hopefully the above will prompt some more discussion.
>     It would be great to be able to chat somewhere more realtime (not an
>     email thread) on this topic too.
>
>     Adam / Addshore
>     Wikidata Wikibase Tech Lead
>
>     On Sat, 17 Jul 2021 at 22:54, Aidan Hogan <aidhog@gmail.com> wrote:
>
>         Hi Renat,
>
>         On 2021-07-16 15:54, Renat Shigapov wrote:
>          > Hi Aidan,
>          >
>          > I am on holidays for a few days with a limited access to
>         internet, but my quick reply to "Do you have further plans for
>         extending RaiseWikibase?" is yes. I'll try to handle with those
>         secondary tables.
>
>         Go enjoy your holidays. :)
>
>         I can quickly respond to one point:
>
>          > Regarding (2): can that be done in MariaDB?
>
>         Since 10.2.3, MariaDB has support for JSON:
>
>         https://mariadb.com/kb/en/json-functions/
>
>         I have no experience with handling JSON in a relational
>         database, so not
>         sure overall whether it's a good solution, but I think it should be
>         considerably more performant so long as the pages are iterated
>         over,
>         rather than each one being searched in the index.
>
>         Best,
>         Aidan
>
>          > If you can make (3) done, the whole Wikibase community would
>         be very thankful.
>          >
>          > Regarding (4): yeah, too radical and maintenance problems.
>          >
>          > Sorry for a short answer, I'll add on my return.
>          >
>          > Kind regards,
>          > Renat
>
_______________________________________________
Wikibaseug mailing list -- wikibaseug@lists.wikimedia.org
To unsubscribe send an email to wikibaseug-leave@lists.wikimedia.org