Hi all,

I'm going to try to reply to various bits of the last 10 messages in a single response / comment.

> Overall, with the goal of importing a legacy dataset of around 10
>   million items, with maybe 100 million "values", in one day, on
>   conventional hardware

Sounds like a great goal.
To tie this to some real-world data from Wikidata:
I believe (looking at some chat messages I have) that on or around 4 March 2021 Wikidata peaked at roughly 1.8k edits per minute.
If that rate were sustained for 24 hours, it would amount to roughly 2.6 million changes per day.
Continuing the napkin math with the "What happens in Wikibase when you make a new Item" blog post [0]:
roughly 75% of save time is spent in AbuseFilter, which could be turned off, and which import use cases wouldn't need anyway.
So napkin math says that in Wikidata production we could hypothetically live in the realm of 2.5-10 million entity changes per day using the main web APIs.
I'm sure that would expose some other stresses along the way though.
And of course this speed would not be achievable with a single Wikibase instance setup, which I believe is what most current testing & profiling uses.
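
To make the napkin math explicit, here is a quick Python sketch using the rounded figures above (all of these numbers are approximate):

    edits_per_minute = 1800                    # peak observed around 4 March 2021
    edits_per_day = edits_per_minute * 60 * 24
    print(edits_per_day)                       # 2,592,000 -> roughly 2.6 million edits/day

    # If ~75% of save time goes to AbuseFilter and it is skipped for imports,
    # a very rough upper bound is about 4x the sustained rate:
    print(edits_per_day * 4)                   # ~10.4 million entity changes per day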

> It might also be worth looking into whether Blazegraph (and MySQL) has any degraded reliability modes which you can enable during the loading process.

Blazegraph does, and I have experimented with them before, but ultimately I didn't manage to use them to reduce load time beyond what we have already achieved.
You can read more about that in another blog post [1].
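
On the MySQL side, the usual knobs are durability and constraint checks. Below is a rough sketch of the kind of settings one might relax for the duration of a bulk load (pymysql, the connection details and the exact values are illustrative assumptions on my part, and the server-wide options trade away crash safety):

    import pymysql  # illustrative client choice; any MySQL client would do

    conn = pymysql.connect(host="localhost", user="wikiuser",
                           password="change-me", database="my_wiki")
    with conn.cursor() as cur:
        # Skip constraint checking for this session while loading
        cur.execute("SET SESSION foreign_key_checks = 0")
        cur.execute("SET SESSION unique_checks = 0")
        # Server-wide durability trade-offs (need privileges, reduce crash safety):
        #   innodb_flush_log_at_trx_commit = 0 or 2, sync_binlog = 0
        # ... run the bulk INSERTs here ...
    conn.commit()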

> We have the ticket "Improve bulk import via API" at phabricator now: https://phabricator.wikimedia.org/T287164. It's aimed to unite the related tickets and to discuss further development around bulk import in Wikibase. Your contributions are very welcome.

Thanks for creating the ticket.
I look forward to seeing this continue to move forward!

> The greatest speedup is achieved when many thousands of items and
>   statements are added in a single database transaction. The normal
>   importer starts and commits a database transaction for every single item
>   - or maybe even statement.

Indeed, during regular editing a transaction is opened around each edit, or rather around each web request.
Looking ahead to a bulk loading / importing API, this is certainly one of the things that could be reconsidered.
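
As a self-contained sketch of the difference (sqlite3 stands in for MySQL here purely to keep the example runnable, and the table and batch size are made up):

    import sqlite3  # stand-in for MySQL, just to keep the sketch self-contained

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, label TEXT)")
    items = [(i, f"Item {i}") for i in range(100_000)]

    # Today: roughly one transaction per edit / web request, i.e. per item:
    #   for item in items:
    #       conn.execute("INSERT INTO item VALUES (?, ?)", item)
    #       conn.commit()

    # Bulk import idea: one transaction per batch of many thousands of items:
    BATCH_SIZE = 10_000
    for start in range(0, len(items), BATCH_SIZE):
        conn.executemany("INSERT INTO item VALUES (?, ?)",
                         items[start:start + BATCH_SIZE])
        conn.commit()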

> The fingerprint data (labels, aliases and descriptions) are now inserted into the secondary tables on the fly as well.

I had a quick chat with "jwass" today and we touched on labels, aliases and descriptions briefly.
One thing to note here, and something we can hopefully discuss more now that all of the moving parts are laid out in [0], is whether some of these writes are needed at all.
The secondary term storage has its use cases, but there will be situations where those features are not important to users, or where the performance trade-off is acceptable.
In those cases we could probably avoid writing to such stores at all.
These questions, and more, are I think really important ones for Wikibase users to raise.
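
As a purely hypothetical sketch (none of these function names exist in Wikibase or RaiseWikibase) of what such an opt-out could look like from an importer's point of view:

    # Hypothetical importer structure, not a real Wikibase or RaiseWikibase API.

    def write_primary_storage(entity):
        ...  # page / revision / text rows holding the canonical entity JSON

    def write_term_store(entity):
        ...  # wbt_* rows for labels, descriptions and aliases

    def import_entities(entities, write_secondary_terms=True):
        for entity in entities:
            write_primary_storage(entity)
            if write_secondary_terms:
                write_term_store(entity)

A bulk import that doesn't need label/description lookups during the load could then run with write_secondary_terms=False and repopulate the term store afterwards (if I remember right, Wikibase ships maintenance scripts for rebuilding item and property terms).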

Looking forward to continuing on this topic!

Adam

[0] https://addshore.com/2021/07/what-happens-in-wikibase-when-you-make-a-new-item/
[1] https://addshore.com/2021/02/testing-wdqs-blazegraph-data-load-performance/



On Tue, 3 Aug 2021 at 15:45, Renat Shigapov <renat.shigapov@bib.uni-mannheim.de> wrote:
Hi all,

Just a quick update on RaiseWikibase (see the commit https://github.com/UB-Mannheim/RaiseWikibase/commit/428c28b9334e9bdb613b3a49d42384a88c7398a2).

The fingerprint data (labels, aliases and descriptions) are now inserted into the secondary tables on the fly as well.

The results of the first tests using https://github.com/UB-Mannheim/RaiseWikibase/blob/main/megaWikibase.py are here:

1. 8965 properties with monolingual labels, descriptions and aliases are uploaded in 99 seconds (previously in 42 seconds). So roughly 90 properties per second.
2. 20000 items with one label and without aliases/descriptions (but with 2-3 claims with a qualifier & reference) are uploaded in 112 seconds (previously in 84 seconds). Roughly 178 items per second.

Performance has dropped. Let's see what can be optimized.

Kind regards,
Renat