This thread and my earlier email prompted me to write a deep dive blog post on exactly what happens when you send a create item request to wbeditentity.

You can find the newly published post (and thus probably one with a few typos) here:
https://addshore.com/2021/07/what-happens-in-wikibase-when-you-make-a-new-item/

Hopefully this can prompt some more questions, thoughts, changes, etc. for the bulk import case, both for RAISE and for the API.

Adam / Addshore
Wikidata Wikibase Tech Lead

On Mon, 19 Jul 2021 at 17:45, Addshore <addshorewiki@gmail.com> wrote:
Hi all,

Some general thoughts on this thread.

> Overall, we were wondering if we are approaching this bulk import in the
> right way? It seems that the PHP scripts are not optimised for
> performance/scale?

In general I would say that the "right" thing to do is to speed up the app, rather than try to write directly to the SQL db.
Much (or all) of the work currently being done in these maintenance scripts can be, and already is, done by a deferred job queue within Wikibase (if configured).
If this is done, then having one or more separate job runners would be beneficial; see https://www.mediawiki.org/wiki/Manual:Job_queue
The other key factors would be a fast primary SQL database and a fast host (or hosts) serving MediaWiki/Wikibase.
Wikidata.org runs on a very large shared cluster (shared with all other Wikimedia sites) with multiple web, db, cache, and job hosts etc.
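
As an illustration, here is a minimal sketch of what a separate job runner loop could look like, assuming a standard MediaWiki install; the path, batch size, and sleep interval are example values only, not recommendations:

#!/usr/bin/env python3
# Minimal sketch of a dedicated job runner loop. Assumes MediaWiki/Wikibase
# lives at MEDIAWIKI_DIR and that $wgJobRunRate is set to 0 so jobs are not
# run during web requests.
import subprocess
import time

MEDIAWIKI_DIR = "/var/www/mediawiki"  # assumption: adjust to your install

while True:
    # runJobs.php is MediaWiki's standard maintenance script for executing
    # queued jobs; --maxjobs caps how many jobs one invocation processes.
    subprocess.run(
        ["php", f"{MEDIAWIKI_DIR}/maintenance/runJobs.php", "--maxjobs", "500"],
        check=False,
    )
    time.sleep(1)  # avoid busy-looping when the queue is empty

Several of these loops can be run side by side if the queue is large.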

If you are running these maintenance scripts as part of an import, where performance of the Wikibase does not matter to users, then setting their sleep option to 0 will provide a performance increase.

Something else that might make all of this slightly simpler would be the ability to have this secondary index work go straight to the job queue, rather than running as part of the maintenance scripts themselves.
(I see that runJobs is called as the last step at https://github.com/UB-Mannheim/RaiseWikibase/blob/main/RaiseWikibase/raiser.py#L172-L175)

Currently it looks like, in the building_indexing method in Raise, all items will be loaded from the DB two times and all properties three times.
For large imports this pattern is likely to get slower and slower, and it also does not take advantage of any caching that exists.
The regular API sequence, for example, would 1) store the data in the DB, 2) write to any quick secondary stores, 3) add the entity to a shared cache, and 4) schedule jobs to populate the secondary indexes.
These jobs would run fairly soon after the initial write, leading to better performance.
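
Sketched very roughly in Python (every function name here is a hypothetical placeholder for illustration, not the actual Wikibase internals):

# Illustrative sketch only: placeholder functions showing the ordering of the
# regular API write path described above.
def store_revision_in_primary_db(entity):     # 1) store the data in the DB
    print("stored revision for", entity["id"])

def update_quick_secondary_stores(entity):    # 2) small, cheap secondary tables
    print("updated quick secondary stores")

def add_to_shared_cache(entity):              # 3) later reads can skip the DB
    print("cached", entity["id"])

def schedule_secondary_index_jobs(entity):    # 4) job runners fill these in shortly after
    print("queued secondary index jobs")

def save_entity(entity):
    store_revision_in_primary_db(entity)
    update_quick_secondary_stores(entity)
    add_to_shared_cache(entity)
    schedule_secondary_index_jobs(entity)

save_entity({"id": "Q1", "labels": {"en": {"value": "example"}}})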

> > It seems that the PHP scripts are not optimised for performance/scale?
> It seems so.  

Scale, yes (these scripts have run across the whole of Wikidata.org).
Performance, not so much; these scripts were primarily designed as a long-running task between Wikibase updates, not as part of an import process.

Regarding rebuildItemTerms
> Takes around 2 hours on the small sample (which we could multiply by a
> thousand for the full dataset, i.e., 83 days as an estimate).
Indeed, this script will only go so fast: it processes all entities in series, so you have the bottleneck of a single process iterating through all entities.
By comparison, with the job queue, jobs get queued and executed by as many job queue runners as you wish.
The next bottleneck would then be the speed of your SQL database.
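
For contrast with the serial script, a rough sketch of starting several job runner processes in parallel (the install path and worker count are assumptions, just to illustrate the idea):

# Rough sketch: several parallel runJobs.php workers, so secondary index work
# is not limited to one serial process. Path and counts are example values.
import subprocess

MEDIAWIKI_DIR = "/var/www/mediawiki"
WORKERS = 4

procs = [
    subprocess.Popen(
        ["php", f"{MEDIAWIKI_DIR}/maintenance/runJobs.php", "--maxjobs", "1000"]
    )
    for _ in range(WORKERS)
]
for p in procs:
    p.wait()  # each worker exits once its batch is processed or the queue is empty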

There is probably lots more to discuss in this thread, but I found it quite hard to navigate.
Hopefully the above will prompt some more discussion.
It would be great to be able to chat somewhere more real-time (not an email thread) on this topic too.

Adam / Addshore
Wikidata Wikibase Tech Lead

On Sat, 17 Jul 2021 at 22:54, Aidan Hogan <aidhog@gmail.com> wrote:
Hi Renat,

On 2021-07-16 15:54, Renat Shigapov wrote:
> Hi Aidan,
>
> I am on holidays for a few days with a limited access to internet, but my quick reply to "Do you have further plans for extending RaiseWikibase?" is yes. I'll try to handle with those secondary tables.

Go enjoy your holidays. :)

I can quickly respond to one point:

> Regarding (2): can that be done in MariaDB?

Since version 10.2.3, MariaDB has had support for JSON functions:

https://mariadb.com/kb/en/json-functions/

I have no experience with handling JSON in a relational database, so I am not
sure overall whether it's a good solution, but I think it should be
considerably more performant so long as the pages are iterated over,
rather than each one being looked up individually in the index.
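
To give a flavour of what that could look like from Python, here is a minimal sketch using MariaDB's JSON_EXTRACT. Note that the table and column names (entity_staging, entity_id, entity_json) are hypothetical examples for illustration, not the real Wikibase storage layout:

# Minimal sketch: iterate over rows and pull fields out of JSON with MariaDB's
# JSON_EXTRACT (available since 10.2.3). Table/column names are hypothetical.
import pymysql  # assumption: any MariaDB/MySQL client library works similarly

conn = pymysql.connect(host="localhost", user="wiki", password="secret", database="wikibase")
with conn.cursor() as cur:
    cur.execute(
        "SELECT entity_id, "
        "       JSON_EXTRACT(entity_json, '$.labels.en.value') AS en_label "
        "FROM entity_staging"
    )
    for entity_id, en_label in cur:
        print(entity_id, en_label)
conn.close()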

Best,
Aidan

> If you can make (3) done, the whole Wikibase community would be very thankful.
>
> Regarding (4): yeah, too radical and maintenance problems.
>
> Sorry for a short answer, I'll add on my return.
>
> Kind regards,
> Renat
_______________________________________________
Wikibaseug mailing list -- wikibaseug@lists.wikimedia.org
To unsubscribe send an email to wikibaseug-leave@lists.wikimedia.org