Hey all,
Henry (in CC) and I have been looking into the possibility of importing a dataset of around 10-20 million items, and maybe around 50 million claims, into Wikibase. Wikibase would be perfect for our needs, but we have been struggling quite a lot to load the data.
We are using the Docker version. Initial attempts on a small sample of 10-20 thousand items were not promising, with the load taking a very long time. We found that RaiseWikibase helped to considerably speed up the initial load:
https://github.com/UB-Mannheim/RaiseWikibase
but on a small sample of 10-20 thousand items, the secondary indexing process was taking several hours. This is the building_indexing() process here (which just calls maintenance scripts):
https://github.com/UB-Mannheim/RaiseWikibase/blob/main/RaiseWikibase/raiser....
This seems to be necessary for labels to appear correctly in the wiki, and for search to work.
Rather than call that method, we have been trying to invoke the maintenance scripts directly and play with arguments that might help, such as batch size. However, some of the scripts still take a long time, even considering the small size of what we are loading. For example:
docker exec wikibase-docker_wikibase_1 bash "-c" "php extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --sleep 0.1 --batch-size 10000"
Takes around 2 hours on the small sample (which we could multiply by a thousand for the full dataset, i.e., 83 days as an estimate). Investigating the mysql database, it seems to be generating four tables: wbt_item_terms, wbt_term_in_lang, wbt_text, and wbt_text_in_lang, but these are in the order of 20,000 tuples when finished, so it is surprising that the process takes so long. My guess is that the PHP code is looking up pages per item, generating thousands of random accesses on the disk, when it would seem better to just stream tuples/pages contiguously from the table/disk?
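To illustrate the kind of contiguous scan I have in mind, something along these lines would let a rebuild stream item pages in large, ordered batches instead of looking each page up individually (a rough sketch only: the namespace number 120 for items is, as far as I know, the default in the Docker setup, and the keyset-style batching is just for illustration):

  -- stream item pages in contiguous, ordered batches
  SELECT page_id, page_title        -- page_title holds the entity ID, e.g. Q123
  FROM page
  WHERE page_namespace = 120        -- assumed item namespace (check your configuration)
    AND page_id > 0                 -- last page_id seen in the previous batch
  ORDER BY page_id
  LIMIT 10000;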
Later on, the CirrusSearch indexing is also taking a long time for the small sample, generating jobs for batches that take a long time to clear. In previous experience, Elasticsearch will happily eat millions of documents in an hour. We are still looking at how batch sizes might help, but it feels like it is taking much longer than it should.
Overall, we were wondering if we are approaching this bulk import in the right way? It seems that the PHP scripts are not optimised for performance/scale? Does anyone have experience, tips or pointers on converting and loading large-ish scale legacy data into Wikibase? Is there no complete solution (envisaged) for this right now?
Best, Aidan
Dear Aidan,
Thanks a lot for your comments and questions. I am a developer of RaiseWikibase and, in fact, I have the same questions as you. I am not sure whether we will get answers here, so maybe we could open an issue on Phabricator explaining the performance issues with rebuildItemTerms.php and CirrusSearch indexing.
Regarding "the labels to appear correctly" using rebuildItemTerms.php: Yes, it is slow. :-( May be we should avoid using it at all. Inserting data directly into wbt_item_terms, wbt_term_in_lang, wbt_text, and wbt_text_in_lang tables from Python could be faster. I need to understand the logic of creating those indexes first.
Regarding the CirrusSearch indexing: I thought about using parallel indexing as explained at https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/R.... But I have not tried it.
Overall, we were wondering if we are approaching this bulk import in the right way?
What could be an alternative? There is MySQL with many tables. `LOAD DATA INFILE` would require creating many files. I wanted to avoid that. But maybe that's something to try.
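For one of the secondary tables it might look roughly like the following (only a sketch: the column names are written from memory and would have to be checked against the actual term store schema):

  -- bulk-load pre-generated rows for wbt_text from a tab-separated file
  LOAD DATA LOCAL INFILE '/tmp/wbt_text.tsv'
  INTO TABLE wbt_text
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
  (wbx_id, wbx_text);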
It seems that the PHP scripts are not optimised for performance/scale?
It seems so.
Does anyone have experience, tips or pointers on converting and loading large-ish scale legacy data into Wikibase?
1. Only to Blazegraph: http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData
2. To MariaDB with internal PHP scripts: https://wikibase.consulting/transferring-wikibase-data-between-wikis/
3. Apart from that, people mainly use the Wikibase API and its wrappers (to the best of my knowledge)
Is there no complete solution (envisaged) for this right now?
RaiseWikibase could become such a solution, but the issues you mentioned need to be addressed.
Kind regards, Renat
Hi Renat,
On 2021-07-15 5:51, renat.shigapov@bib.uni-mannheim.de wrote:
Dear Aidan,
Thanks a lot for your comments and questions. I am a developer of RaiseWikibase and, in fact, I have the same questions as you. I am not sure whether we will get answers here, so maybe we could open an issue on Phabricator explaining the performance issues with rebuildItemTerms.php and CirrusSearch indexing.
Many thanks for your work on RaiseWikibase! It really helped us out a lot in a project and accelerated the initial load considerably! :)
Regarding "the labels to appear correctly" using rebuildItemTerms.php: Yes, it is slow. :-( May be we should avoid using it at all. Inserting data directly into wbt_item_terms, wbt_term_in_lang, wbt_text, and wbt_text_in_lang tables from Python could be faster. I need to understand the logic of creating those indexes first.
Yes, what I have seen is that in our monolingual mono-label dataset, it is creating very regular tables with item, lang, item label id, text, etc. Some of the tables just contain something like:
1 1 1 1
2 2 2 1
3 3 3 1
...
n n n 1
They appear to be foreign key IDs for items, labels, etc. There is then another table that is just a dictionary for labels. We're looking into just "faking" the tables in our use case for the moment since they are so regular and we are hoping to have an internal demo based on Wikibase ready soon.
My guess is that the tables will not be so regular when one is working with multilingual labels, or more generally, multiple labels per entity.
Regarding the CirrusSearch indexing: I thought about using parallel indexing as explained at https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/R.... But I have not tried it.
Indeed, I think this might help. I also think the batch size might help, as I suspect the issue is the number of jobs/requests being made and not really the data volume involved. In my experience in other projects with Elasticsearch, using batches of size n can generally lead to O(n) speed-ups, especially when n is a low-ish value (but I would suggest batches of around 10000). What I do not know is whether CirrusSearch is doing something more intensive, such as applying ranking based on links; this might be more costly, but could be solved using external frameworks (like Hadoop, Spark, etc.).
Overall, we were wondering if we are approaching this bulk import in the right way?
What could be an alternative? There is MySQL with many tables. `LOAD DATA INFILE` would require creating many files. I wanted to avoid that. But may be that's something to try.
I mention a couple of alternatives below ...
It seems that the PHP scripts are not optimised for performance/scale?
It seems so.
Does anyone have experience, tips or pointers on converting and loading large-ish scale legacy data into Wikibase?
- Only to Blazegraph: http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData
- To MariaDB with internal PHP scripts: https://wikibase.consulting/transferring-wikibase-data-between-wikis/
- Apart of that people mainly use the Wikibase API and its wrappers (to the best of my knowledge)
Is there no complete solution (envisaged) for this right now?
RaiseWikibase could become such a solution, but the issues you mentioned need to be addressed.
I've been thinking about this, and I see four possible directions, listed here roughly from least to most "radical", in order for Wikibase to support bulk inserts:
1) Extend RaiseWikibase with a custom Python script to stream pages from the MariaDB tables, and bulk load the corresponding data for the secondary tables into MariaDB.
2) SQL "Views": Use RaiseBase for the initial population of data and create the secondary tables within the database itself. This assumes that the data needed for the secondary tables can be queried and enumerated efficiently within the database. One of the issues is that the pages are stored in JSON, but many database engines now support a JSON datatype and SQL functions to extract data from the JSON datatype. Each table could then be created with a single SQL command (these secondary tables are more-or-less equivalent to a materialised view, assuming the JSON data can be queried from the pages). I think this should be a reasonable solution for medium-sized graphs (in the tens of millions of nodes I would say, taking maybe hours).
3) JSON dump -> SQL dump: Create external jobs to populate the data for the tables using an external framework, such as Hadoop or Spark. As input the process receives a JSON document (or collection thereof) representing the pages (similar to the JSON dump), and as output it creates an SQL document that one can call to load the data into the tables in batch. This would be a good solution for importing large-scale graphs into MariaDB (the limit on scale would rather become the MariaDB instance on one machine, but indexing billions of elements should be fine on a reasonable machine).
4) Migrate Wikibase: Create a new database connector such that MariaDB is no longer needed. For example, create a fork of the Wikibase repository that can query BlazeGraph directly. Then the data can be bulk loaded as RDF and accessed through SPARQL. This might lead to a performance cost as the notion of a materialised view does not really exist in RDF/SPARQL, though one could of course add additional triples to create a sort of concise "graph view" of the data needed. Another option would be to port to ElasticSearch, which offers distributed storage (as well as excellent text search with Lucene, interfaces like Kibana, etc). Data can be stored in a "denormalised" way so that precisely the data for each article can be loaded with one lookup.
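To sketch what I mean by (2): assuming, purely for illustration, that the entity JSON were accessible in a staging table entity_json(entity_id, json) (in reality it sits behind MediaWiki's revision/content tables and would need a join), something like the following could fill the label dictionary with a single statement. The target column name is an assumption on my part, and this only covers one language:

  -- hypothetical sketch: derive the distinct English label strings for wbt_text
  INSERT INTO wbt_text (wbx_text)
  SELECT DISTINCT JSON_VALUE(json, '$.labels.en.value')
  FROM entity_json
  WHERE JSON_VALUE(json, '$.labels.en.value') IS NOT NULL;

The other term tables could be derived similarly, which is why I say each table could in principle be created with a single SQL command.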
Do you have further plans for extending RaiseWikibase? I could potentially try to find a student interested in working on something like (3) using Spark, for example, but at the earliest this would yield results around this time next year. I think (4) is probably not a good idea as a "third party" project, as it would be a radical change to Wikibase. :)
Best, Aidan
Hi Aidan,
I am on holidays for a few days with limited access to the internet, but my quick reply to "Do you have further plans for extending RaiseWikibase?" is yes. I'll try to handle those secondary tables.
Regarding (2): can that be done in MariaDB?
If you can make (3) done, the whole Wikibase community would be very thankful.
Regarding (4): yeah, too radical and maintenance problems.
Sorry for the short answer; I'll add more on my return.
Kind regards, Renat
Hi Renat,
On 2021-07-16 15:54, Renat Shigapov wrote:
Hi Aidan,
I am on holidays for a few days with limited access to the internet, but my quick reply to "Do you have further plans for extending RaiseWikibase?" is yes. I'll try to handle those secondary tables.
Go enjoy your holidays. :)
I can quickly respond to one point:
Regarding (2): can that be done in MariaDB?
Since 10.2.3, MariaDB has support for JSON:
https://mariadb.com/kb/en/json-functions/
I have no experience with handling JSON in a relational database, so not sure overall whether it's a good solution, but I think it should be considerably more performant so long as the pages are iterated over, rather than each one being searched in the index.
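For example, something like this already extracts a label from a JSON document (a made-up document, just to show the function):

  SELECT JSON_VALUE(
    '{"labels": {"en": {"language": "en", "value": "Douglas Adams"}}}',
    '$.labels.en.value');
  -- returns: Douglas Adams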
Best, Aidan
Hi all,
Some general thoughts on this thread.
Overall, we were wondering if we are approaching this bulk import in the right way? It seems that the PHP scripts are not optimised for performance/scale?
In general I would say that the "right" thing to do is to speed up the app, rather than try to write directly to the SQL db. Lots / all of the work currently being done in these maintenance scripts can be / is done by a deferred job queue already within Wikibase (if configured). If this is done, then having one or more separate job runners would be beneficial; see https://www.mediawiki.org/wiki/Manual:Job_queue. The other key things could be a fast primary SQL database, and a fast host (or hosts) serving MediaWiki/Wikibase. Wikidata.org runs on a very large shared cluster (shared with all other Wikimedia sites) with multiple web, db, cache, and job hosts etc.
If you are running these maintenance scripts as part of an import where performance of the Wikibase does not matter to users, then using --sleep 0 will provide a performance increase.
Something else that might make all of this slightly simpler would be the ability to have this secondary index work head straight to the job queue, rather than run as part of the maintenance scripts themselves? (I see that runJobs is called as the last step at https://github.com/UB-Mannheim/RaiseWikibase/blob/main/RaiseWikibase/raiser....)
Currently it looks like in the building_indexing method in Raise all items will be loaded from the DB 2 times, and all properties will be loaded 3 times. This pattern for large imports is likely to get slower and slower, and also not take advantage of any caching etc that exists. The regular API sequence for example would 1) store data in the DB 2) write to any quick secondary stores 3) add the entity to a shared cache and 4) schedule jobs to populate the secondary indexes. These jobs would run fairly soon after the initial write, leading to better performance.
It seems that the PHP scripts are not optimised for performance/scale?
It seems so.
Scale, yes (these scripts ran across the whole of Wikidata.org). Performance, not so much: these scripts were primarily designed to be a long-running task between Wikibase updates, not as part of an import process.
Regarding rebuildItemTerms
Takes around 2 hours on the small sample (which we could multiply by a thousand for the full dataset, i.e., 83 days as an estimate).
Indeed, this script will only go so fast: it processes all entities in series, so you will have the bottleneck of a single process iterating through all entities. The job queue comparison again is that jobs get queued and executed by any number of job queue runners that you wish. The next bottleneck would then be the speed of your SQL database.
There is probably lots more to discuss in this thread but I found navigating it quite hard. Hopefully the above will prompt some more discussion. It would be great to be able to chat somewhere more realtime (not an email thread) on this topic too.
Adam / Addshore Wikidata Wikibase Tech Lead
This thread and my earlier email prompted me to write a deep dive blog post on exactly what happens when you send a create item request to wbeditentity.
You can find this newly published (and thus probably with some typos) below https://addshore.com/2021/07/what-happens-in-wikibase-when-you-make-a-new-it...
Hopefully this can prompt some more questions, thoughts, changes, etc. for the bulk import case, both for RaiseWikibase and for the API.
Adam / Addshore Wikidata Wikibase Tech Lead
Thanks Adam for the detailed blog post!
From your mail, I understand that the current Wikibase framework is based on continuous updates, and in that sense it can process a high rate of manual edits without problem. I think though that for high performance bulk updates, the best option would be an "ETL" style framework that loads data directly into the underlying databases using custom tools. What seems to be hurting performance in the bulk load scenario is:
1) The creation of massive amounts of requests/jobs all at once.
2) Random disk accesses that require about 10 ms per go on a conventional disk if the data are not cached or read sequentially.
These are not a problem per se for the typical running and maintenance of a Wikibase instance, but rather only occur when one tries to import millions of items at once. Given that we can assume that an "admin" is importing the data, we can also bypass a lot of the typical processes that are run to verify edit permissions, rate limits, etc.
Overall, with the goal of importing a legacy dataset of around 10 million items, with maybe 100 million "values", in one day, on conventional hardware, these two problems would need to be resolved. The underlying databases should not have a problem processing data at this scale/rate using bulk update/transaction methods, but they will have a problem if each item is updated as a separate transaction; even with batching, it can still lead to hundreds of thousands of transactions that would choke up any persistent database running on a traditional hard-disk. So I think that the most performant solution would be to bypass Wikibase, and to prepare the data (e.g., a user-provided JSON dump) to be loaded directly into the underlying databases that Wikibase can later query. The issue with this approach is that it is not so trivial to understand what data need to be loaded where (this requires a deep understanding of Wikibase), and that such a bulk loader would need to be kept "synchronised" with changes to the main Wikibase software, so it would require maintenance over the years.
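To make the contrast concrete, the difference is roughly the following (a sketch using wbt_text just as an example target; a real load would of course also have to keep the IDs and foreign keys consistent):

  -- per-item style: one transaction (and one log flush) per row
  INSERT INTO wbt_text (wbx_text) VALUES ('label 1');
  INSERT INTO wbt_text (wbx_text) VALUES ('label 2');
  -- ... repeated hundreds of thousands of times ...

  -- bulk style: one transaction, multi-row inserts
  START TRANSACTION;
  INSERT INTO wbt_text (wbx_text) VALUES
    ('label 1'), ('label 2'), ('label 3');   -- ... thousands of rows per statement
  COMMIT;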
Best, Aidan
I wondered earlier if you were trying to use hard disks when talking about random access time. I'm interested because I may be deploying Wikibase on HDD in the near future. But I'm only considering that because I know size and usage will be relatively low; I won't be doing bulk loads.
The reality is, Wikibase is not designed for HDD, and it doesn't work well on it. It may work slightly better with hardware RAID, but realistically any high-speed Wikibase implementation needs to be on SSD. Wikidata uses several, just for WDQS: https://mediawiki.org/wiki/Wikidata_Query_Service/Implementation#Hardware and https://phabricator.wikimedia.org/T221632 which mentions they are used in RAID 0. I think the MediaWiki side of things may be even larger, because more must be stored.
While testing optimal load strategy, Adam notes "SSDs should obviously be used for storage." https://addshore.com/2021/02/testing-wdqs-blazegraph-data-load-performance/
This is in part a limitation of Blazegraph, for which fast disks are the primary requirement, after which CPU and RAM matter (RAM may matter more if you have to use HDD, and for reads; but for loading it has to save the results): https://github.com/blazegraph/database/wiki/Hardware_Configuration https://sourceforge.net/p/bigdata/discussion/676946/thread/5b4acb02/#2810 https://github.com/blazegraph/database/issues/94#issuecomment-399141685
If they are not available to you, it might be worth using a cloud system to perform the load and take a compressed dump off of it afterwards. They are not necessarily cheap, but you might only need it for a relatively brief time. (Oracle is even giving away ARM VMs and SSD storage for free now. Regrettably, Wikibase does not yet have an arm64 distribution, but I'm not aware of a specific reason why it could not eventually.)
It might also be worth looking into whether Blazegraph (and MySQL) has any degraded-reliability modes which you can enable during the loading process. From my own experience I know database tuning can be key when attempting to use HDDs as a platform. Some ideas for MySQL: https://mariadb.com/kb/en/innodb-system-variables/#innodb_flush_log_at_trx_c... https://mariadb.com/kb/en/replication-and-binary-log-system-variables/#binlo...
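For example, something like the following for the duration of the import only (these settings trade durability for speed, so be prepared to redo the load if the server crashes, and restore the defaults afterwards):

  SET GLOBAL innodb_flush_log_at_trx_commit = 0;  -- flush/sync the redo log roughly once per second
  SET GLOBAL sync_binlog = 0;                     -- let the OS decide when to sync the binary log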
You can also try 'nobarrier' as a filesystem mount option to disable fsync, and 'data=writeback' for ext4. Be aware it violates ACID expectations and should only be used for test purposes or when you are willing to rebuild the entire database/FS if something goes wrong. Similarly, RAID 0 might be useful if you bear in mind losing one disk means losing all the data.
Best regards, -- Laurence "GreenReaper" Parry https://GreenReaper.co.uk
Indeed, yes! Seeks on a traditional disk are in the order of 10 ms, while on a solid-state disk they tend rather to be around 0.1 ms, so one could expect a 100x speed-up when doing a lot of random accesses. One could also perhaps memory-map the database files for even lower latency, assuming the data fit in RAM.
So an SSD should help a lot within the current framework, but it should not be necessary in general. Plus, an ETL load framework should still be much more efficient for bulk loads than an SSD with the current process. SSDs have lower latency, but they still have latency: one million seeks on a hard disk take around 3 hours, one million seeks on an SSD take around 2 minutes, while reading one million records (around 100 MB) sequentially from a standard hard disk should take around one second.
Still I think that trying with an SSD for now is a good idea!
Best, Aidan
Dear all,
We have the ticket "Improve bulk import via API" on Phabricator now: https://phabricator.wikimedia.org/T287164. It is aimed at uniting the related tickets and discussing further development around bulk import in Wikibase. Your contributions are very welcome.
Kind regards, Renat
Hi all,
Just a quick update on RaiseWikibase (see the commit https://github.com/UB-Mannheim/RaiseWikibase/commit/428c28b9334e9bdb613b3a49...).
The fingerprint data (labels, aliases and descriptions) are now inserted into the secondary tables on the fly as well.
The results of the first tests using https://github.com/UB-Mannheim/RaiseWikibase/blob/main/megaWikibase.py are here:
1. 8965 properties with monolingual labels, descriptions and aliases are uploaded in 99 seconds (previously in 42 seconds). So roughly 90 properties per second.
2. 20000 items with one label and without aliases/descriptions (but with 2-3 claims with a qualifier & reference) are uploaded in 112 seconds (previously in 84 seconds). Roughly 178 items per second.
Performance has dropped (the secondary tables are now populated as well). Let's see what can be optimized.
Kind regards, Renat
Hi all,
I'm going to try to reply to various bits of the last 10 messages in a single response / comment.
Overall, with the goal of importing a legacy dataset of around 10 million items, with maybe 100 million "values", in one day, on conventional hardware
Sounds like a great goal. Tying this to some real-world data for Wikidata: I believe (looking at some chat messages I have) that on or around the 4th of March 2021 Wikidata made it to 1.8k edits per minute. If this were sustained for a 24-hour period, this would be 2.5 million changes per day. Continuing to throw some napkin-math numbers around, and drawing on the "What happens in Wikibase when you make a new Item" blog post [0]: roughly 75% of save time is spent in AbuseFilter, which could be turned off and which import use cases wouldn't need to worry about. So napkin math says that in Wikidata production we could probably live in the realm of 2.5-10 million entity changes per day using the main web APIs, hypothetically. I'm sure that would expose some other stresses etc. though. Of course it would not be possible to realize this speed with a single Wikibase instance setup, which I believe is the setup for most current testing & profiling.
It might also be worth looking into whether Blazegraph (and MySQL) has any degraded-reliability modes which you can enable during the loading process.
Blazegraph does; I have experimented with them before, but ultimately I didn't manage to make them improve load time any more than we already have. You can read some stuff about that in another blog post [1].
We have the ticket "Improve bulk import via API" on Phabricator now: https://phabricator.wikimedia.org/T287164. It is aimed at uniting the related tickets and discussing further development around bulk import in Wikibase. Your contributions are very welcome.
Thanks for the ticket; I look forward to seeing this continue to move forward!
The greatest speedup is achieved when many thousands of items and statements are added in a single database transaction. The normal importer starts and commits a database transaction for every single item - or maybe even statement.
Indeed during regular editing a transaction will happen around each edit, or rather each web request. Looking forward to a bulk loading / importing API this is certainly one of the things that could be considered.
The fingerprint data (labels, aliases and descriptions) are now inserted into the secondary tables on the fly as well.
I had a quick chat with "jwass" today and we touched on labels, aliases and descriptions briefly. One thing to note here, and that hopefully we can discuss more now that all of the moving parts are more apparent in [0], is whether some of these things are needed at all. The secondary term storage has some use cases, but there will be situations where those features are not important for users, or where the performance tradeoff is fine. In those cases we could probably avoid writing to such stores at all. These, and more questions, I think are really important to be raised by Wikibase users.
Looking forward to continuing on this topic!
Adam
[0] https://addshore.com/2021/07/what-happens-in-wikibase-when-you-make-a-new-it... [1] https://addshore.com/2021/02/testing-wdqs-blazegraph-data-load-performance/
Does anyone have experience, tips or pointers on converting and loading large-ish scale legacy data into Wikibase? Is there no complete solution (envisaged) for this right now?
Even though this topic is a few days old, I would like to add some of my experiences. I had the same problem about a year ago and wrote a Java program to insert millions of items pretty fast. It works for the LTS version. I don't know if it also works with the current version.
You can find the code here: https://github.com/jze/wikibase-insert
Best wishes, Jesper
@Renat, thank you for starting the Phabricator ticket!
@Jesper, this looks really great! Do you know if this takes care of the secondary indexing for labels, specifically the wbt_item_terms, wbt_term_in_lang, wbt_text, and wbt_text_in_lang tables? I notice that these tables are not mentioned in the code.
Also one doubt: when you say "if you do this without a transaction", do you mean doing it without an explicit transaction, i.e., with a separate transaction per operation? I guess that the default behaviour is that each operation will form its own transaction and do the corresponding logging, latching, etc., for each insert, update, etc.?
@Dennis, agreed that this is part of the issue. Some of the scripts do provide options for batching, which certainly help significantly, but can still lead to lots of transactions / tasks / requests when importing at scale. It seems that some of the scripts for secondary indexing, however, do not support batching.
Best, Aidan
Hi all,
@Jesper, thank you for your code! It helped me a lot!
@Aidan, just a short comment: Jesper's wikibase-insert is adapted to Wikibase 1.34, and `wb_terms` is the table where the labels, aliases and descriptions were stored (https://www.mediawiki.org/wiki/Wikibase/Schema/wb_terms). I started to develop RaiseWikibase based on WB 1.34 as well and inserted data into `wb_terms`, but during development Wikibase 1.35.1 was released, so I adapted RaiseWikibase to 1.35.1. Since Wikibase 1.35.1, `wb_terms` is deprecated and the secondary storage for items and properties was introduced (https://doc.wikimedia.org/Wikibase/master/php/md_docs_storage_terms.html).
Kind regards, Renat
@Jesper, this looks really great! Do you know if this takes care of the secondary indexing for labels, specifically the wbt_item_terms, wbt_term_in_lang, wbt_text, and wbt_text_in_lang tables? I notice that these tables are not mentioned in the code.
The experiment was performed with version 1.34. I have used and extended the code in a real-world project to work with LTS version 1.35. That was quite a bit of work again. I will check if it is possible to understand the new code without the context of the real-world project.
Also one doubt: when you say "if you do this without a transaction", do you mean doing it without an explicit transaction, i.e., with a separate transaction per operation?
The greatest speedup is achieved when many thousands of items and statements are added in a single database transaction. The normal importer starts and commits a database transaction for every single item - or maybe even statement.
Best wishes, Jesper
Thanks for this, I found your thread from last year. I've not yet installed Wikibase, but I'm browsing the mailing list to see if it's worth even trying for my use case. That would be 100 million items and 1 billion statements. Did you use this importer in production? What's the largest amount of data you imported?