Hi all,
Some general thoughts on this thread.
> Overall, we were wondering if we are approaching this bulk import in the
> right way? It seems that the PHP scripts are not optimised for
> performance/scale?
In general I would say that the "right" thing to do is to speed up the app,
rather than try to write directly to the SQL db.
Most, if not all, of the work currently done in these maintenance scripts
can already be handled by Wikibase's deferred job queue (if configured).
If this is done, then having one or more separate job runners would be
beneficial; see
https://www.mediawiki.org/wiki/Manual:Job_queue
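As a rough sketch of what a dedicated job runner can look like (the
MediaWiki install path and option values here are assumptions for
illustration; adjust them to your setup):

```shell
#!/bin/bash
# Minimal dedicated job runner loop for a MediaWiki/Wikibase install.
# Assumes MediaWiki lives in /var/www/mediawiki -- adjust to your setup.
MW=/var/www/mediawiki

while true; do
  # runJobs.php drains the queue in batches; --maxjobs caps each batch
  # so the PHP process restarts regularly instead of growing in memory.
  php "$MW/maintenance/runJobs.php" --maxjobs 1000
  sleep 1
done
```

Several copies of this loop can run in parallel, on one host or many; the
job queue coordinates the work between them.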
The other key factors would be a fast primary SQL database and a fast host
(or hosts) serving MediaWiki/Wikibase.
Wikidata.org runs on a very large shared cluster (shared with all other
Wikimedia sites) with multiple web, db, cache, and job hosts etc.
If you are running these maintenance scripts as part of an import, where
performance of the Wikibase does not matter to users, then setting the
sleep option to 0 will provide a performance increase.
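For example (the script path is from a typical Wikibase install, but the
exact option names can differ between scripts, so treat these as
assumptions and check --help first):

```shell
# During a bulk import, drop the inter-batch pause these scripts use to
# protect live traffic. Path and option names assumed from a typical
# Wikibase install; verify with:
#   php extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --help
php extensions/Wikibase/repo/maintenance/rebuildItemTerms.php \
  --sleep 0 \
  --batch-size 1000
```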
Something else that might make all of this slightly simpler would be the
ability to send this secondary index work straight to the job queue,
rather than running it as part of the maintenance scripts themselves?
(I see that runJobs is called as the last step at
https://github.com/UB-Mannheim/RaiseWikibase/blob/main/RaiseWikibase/raiser…
Currently it looks like, in the building_indexing method in RaiseWikibase,
all items will be loaded from the DB two times, and all properties three
times.
This pattern for large imports is likely to get slower and slower, and also
not take advantage of any caching etc that exists.
The regular API sequence, for example, would:
1) store the data in the DB,
2) write to any quick secondary stores,
3) add the entity to a shared cache, and
4) schedule jobs to populate the secondary indexes.
These jobs would run fairly soon after the initial write, leading to better
performance.
> It seems that the PHP scripts are not optimised for performance/scale?
It seems so.
Scale, yes (these scripts have run across the whole of Wikidata.org).
Performance, not so much: these scripts were primarily designed to be
long-running tasks between Wikibase updates, not part of an import process.
Regarding rebuildItemTerms:
> Takes around 2 hours on the small sample (which we could multiply by a
> thousand for the full dataset, i.e., 83 days as an estimate).
Indeed, this script will only go so fast: it processes all entities in
series, so you will have the bottleneck of a single process iterating
through all entities.
With the job queue, by comparison, jobs get queued and then executed by as
many job queue runners as you wish.
The next bottleneck would then be the speed of your SQL database.
There is probably lots more to discuss in this thread, but I found
navigating it quite hard.
Hopefully the above will prompt some more discussion.
It would be great to be able to chat somewhere more realtime (not an email
thread) on this topic too.
Adam / Addshore
Wikidata Wikibase Tech Lead
On Sat, 17 Jul 2021 at 22:54, Aidan Hogan <aidhog(a)gmail.com> wrote:
Hi Renat,
On 2021-07-16 15:54, Renat Shigapov wrote:
Hi Aidan,
I am on holiday for a few days with limited access to the internet, but my
quick reply to "Do you have further plans for extending RaiseWikibase?"
is yes. I'll try to handle those secondary tables.
Go enjoy your holidays. :)
I can quickly respond to one point:
Regarding (2): can that be done in MariaDB?
Since 10.2.3, MariaDB has support for JSON:
https://mariadb.com/kb/en/json-functions/
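For instance, a label lookup against stored entity JSON might look like
the following. Note that the table and column names here are purely
illustrative and are not the actual Wikibase schema:

```shell
# Illustrative only: querying a JSON blob with MariaDB's JSON functions
# (available since 10.2.3). The 'entity_json' table and its columns are
# hypothetical stand-ins, not real Wikibase tables.
mysql my_wiki -e "
  SELECT entity_id,
         JSON_VALUE(content, '$.labels.en.value') AS label_en
  FROM entity_json
  WHERE JSON_VALUE(content, '$.labels.en.value') IS NOT NULL
  LIMIT 10;
"
```

Whether this is faster than the current approach will depend on whether
the query can scan the pages in sequence rather than probing an index for
each one, as noted below.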
I have no experience with handling JSON in a relational database, so not
sure overall whether it's a good solution, but I think it should be
considerably more performant so long as the pages are iterated over,
rather than each one being searched in the index.
Best,
Aidan
If you can get (3) done, the whole Wikibase community would be very
thankful.
Regarding (4): yeah, too radical, and it would cause maintenance problems.
Sorry for a short answer, I'll add on my return.
Kind regards,
Renat
_______________________________________________
Wikibaseug mailing list -- wikibaseug(a)lists.wikimedia.org
To unsubscribe send an email to wikibaseug-leave(a)lists.wikimedia.org