Some general thoughts on this thread.
In general I would say that the "right" thing to do is to speed up the app, rather than try to write directly to the SQL db.
Much, if not all, of the work currently being done in these maintenance scripts can be, or already is, handled by the deferred job queue within Wikibase (if configured).
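As a minimal sketch of moving job execution out of web requests (the setting and the script below are standard MediaWiki; paths and options may differ on your install, so check --help):

    // LocalSettings.php: don't run jobs at the end of web requests
    $wgJobRunRate = 0;

and then keep one or more dedicated runners going:

    php maintenance/runJobs.php --wait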
The other key things would be a fast primary SQL database and a fast host (or hosts) serving MediaWiki/Wikibase.
Wikidata.org runs on a very large shared cluster (shared with all other Wikimedia sites) with multiple web, db, cache, and job hosts etc.
If you are running these maintenance scripts as part of an import, where performance of the Wikibase does not matter to users, then setting the sleep option to 0 will provide a performance increase.
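For example, for the term index rebuild (the script path below is from the Wikibase repo; check the script's --help output for the exact options your version supports):

    php extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --sleep 0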
Something else that might make all of this slightly simpler would be the ability to have this secondary index work go straight to the job queue, rather than running as part of the maintenance scripts themselves.
Currently it looks like, in the building_indexing method in Raise, all items will be loaded from the DB twice, and all properties three times.
For large imports this pattern is likely to get slower and slower, and it also does not take advantage of any caching that exists.
The regular API sequence, for example, would:
1) store the data in the DB,
2) write to any quick secondary stores,
3) add the entity to a shared cache, and
4) schedule jobs to populate the secondary indexes.
These jobs would run fairly soon after the initial write, leading to better performance.
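A simplified sketch of that write path (all class and method names here are illustrative, not real Wikibase APIs; they just mirror the four steps above):

    // Hypothetical outline of the save path; not actual Wikibase code.
    function saveEntity( $db, $secondaryStores, $cache, $jobQueue, $entity ) {
        $db->storeEntity( $entity );               // 1) primary SQL write
        $secondaryStores->updateQuick( $entity );  // 2) cheap secondary stores
        $cache->set( $entity->getId(), $entity );  // 3) shared entity cache
        $jobQueue->push( new RebuildTermsJob( $entity->getId() ) ); // 4) deferred index jobs
    }

The maintenance scripts, by contrast, do all of the index work themselves, in a single process, after the fact.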
> > It seems that the PHP scripts are not optimised for performance/scale?
> It seems so.
Scale, yes (these scripts ran across the whole of Wikidata.org).
Performance, not so much; these scripts were primarily designed to be a long-running task between Wikibase updates, not part of an import process.
Regarding rebuildItemTerms:
> Takes around 2 hours on the small sample (which we could multiply by a
> thousand for the full dataset, i.e., 83 days as an estimate).
Indeed, this script will only go so fast: it processes all entities in series, so you will have the bottleneck of a single process iterating through all entities.
The job queue comparison again: jobs get queued and then executed by as many job queue runners as you wish.
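For example, multiple runners can drain the queue in parallel (--wait and --procs are options of MediaWiki's runJobs.php, but verify against --help on your version):

    php maintenance/runJobs.php --wait --procs 4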
The next bottleneck would then be the speed of your SQL database.
There is probably lots more to discuss in this thread, but I found navigating it quite hard.
Hopefully the above will prompt some more discussion.
It would be great to be able to chat somewhere more real-time (not an email thread) on this topic too.