Hi all,
I'm going to try to reply to various bits of the last 10 messages in a
single response / comment.
> Overall, with the goal of importing a legacy dataset of around 10 million
> items, with maybe 100 million "values", in one day, on conventional
> hardware
Sounds like a great goal.
Tying this to some real-world data from Wikidata:
I believe (looking at some chat messages I have) that on or around 4 March
2021 Wikidata peaked at 1.8k edits per minute.
If that rate were sustained for 24 hours, it would be roughly 2.5 million
changes per day.
Continuing to throw some napkin math numbers around, with help from the
"What happens in Wikibase when you make a new Item" blog post [0]:
Roughly 75% of save time is spent in AbuseFilter, which could be turned
off, and which import use cases wouldn't need to worry about anyway.
So napkin math says that in Wikidata production we could hypothetically
live in the realm of 2.5-10 million entity changes per day using the main
web APIs.
I'm sure that would expose some other stresses, though.
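For anyone wanting to check the napkin math, here it is spelled out
(assuming the 1.8k/min peak above and the ~75% AbuseFilter share from [0]):

```python
# Napkin math: extrapolate the observed peak edit rate to a full day,
# then treat disabling AbuseFilter (~75% of save time) as a hypothetical
# ~4x throughput increase.
edits_per_minute = 1_800                # observed Wikidata peak, March 2021
edits_per_day = edits_per_minute * 60 * 24
print(edits_per_day)                    # 2592000, i.e. roughly 2.5 million

abusefilter_share = 0.75                # fraction of save time in AbuseFilter
speedup = 1 / (1 - abusefilter_share)   # 4.0x if that work disappears
print(int(edits_per_day * speedup))     # 10368000, i.e. roughly 10 million
```

Hence the 2.5-10 million per day range.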
Of course it would not be possible to realize this speed with a
single-instance Wikibase setup, which I believe is the setup used for most
current testing & profiling.
> It might also be worth looking into whether Blazegraph (and MySQL) has
> any degraded reliability modes which you can enable during the loading
> process.
Blazegraph does, and I have experimented with them before, but ultimately I
didn't manage to make them reduce load time beyond what we had already
achieved. You can read more about that in another blog post [1].
> We have the ticket "Improve bulk import via API" at Phabricator now:
> https://phabricator.wikimedia.org/T287164. It's aimed at uniting the
> related tickets and discussing further development around bulk import in
> Wikibase. Your contributions are very welcome.
Thanks for the ticket
I look forward to seeing this continue to move forward!
> The greatest speedup is achieved when many thousands of items and
> statements are added in a single database transaction. The normal
> importer starts and commits a database transaction for every single item
> - or maybe even statement.
Indeed, during regular editing a transaction happens around each edit, or
rather around each web request.
Looking forward to a bulk loading / importing API this is certainly one of
the things that could be considered.
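To make the difference concrete, here is a rough sketch, using SQLite and
made-up table/column names purely for illustration (this is not the actual
Wikibase schema or importer code), of committing once per batch instead of
once per item:

```python
# Illustrative only: one COMMIT per batch of items, rather than one per
# item. Table "item" and its columns are hypothetical stand-ins.
import sqlite3

def import_items_batched(conn, items, batch_size=5_000):
    """Insert items in large batches, committing once per batch."""
    cur = conn.cursor()
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        cur.executemany("INSERT INTO item (id, label) VALUES (?, ?)", batch)
        conn.commit()  # one commit per batch_size items, not per item

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id TEXT PRIMARY KEY, label TEXT)")
import_items_batched(conn, [(f"Q{i}", f"Item {i}") for i in range(10_000)])
print(conn.execute("SELECT COUNT(*) FROM item").fetchone()[0])  # 10000
```

With 10,000 items and a batch size of 5,000 that is 2 commits instead of
10,000, which is where most of the speedup in bulk loading tends to come
from.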
> The fingerprint data (labels, aliases and descriptions) are now inserted
> into the secondary tables on the fly as well.
I had a quick chat with "jwass" today and we touched on labels, aliases and
descriptions briefly.
One thing to note here, and that hopefully we can discuss more now that
all of the moving parts are laid out in [0], is whether some of these
things are needed at all.
The secondary term storage has some use cases, but there will be
situations where those features are not important to users, or where the
performance tradeoff is acceptable.
In those cases we could probably avoid writing to such stores at all.
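As a sketch of what I mean (all names here are hypothetical, not a real
Wikibase or RaiseWikibase API), a bulk importer could expose a flag that
skips the secondary term store entirely:

```python
# Hypothetical importer hook: always write the canonical entity data,
# but make the secondary (term) store write optional for bulk loads.
primary, term_store = [], []

def write_primary(entity):
    primary.append(entity["id"])         # canonical storage, always written

def write_term_store(entity):
    term_store.append(entity["id"])      # labels/descriptions/aliases tables

def save_entity(entity, write_secondary_terms=True):
    write_primary(entity)
    if write_secondary_terms:
        write_term_store(entity)

save_entity({"id": "Q1"})
save_entity({"id": "Q2"}, write_secondary_terms=False)
print(len(primary), len(term_store))  # 2 1
```

Users who don't need term-based lookup during the import could then opt
out and backfill those tables later, or never.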
I think these questions, and others like them, are really important for
Wikibase users to raise.
Looking forward to continuing on this topic!
Adam
[0]
https://addshore.com/2021/07/what-happens-in-wikibase-when-you-make-a-new-i…
[1]
https://addshore.com/2021/02/testing-wdqs-blazegraph-data-load-performance/
On Tue, 3 Aug 2021 at 15:45, Renat Shigapov <
renat.shigapov(a)bib.uni-mannheim.de> wrote:
> Hi all,
>
> Just a quick update on RaiseWikibase (see the commit
>
> https://github.com/UB-Mannheim/RaiseWikibase/commit/428c28b9334e9bdb613b3a4…
> ).
>
> The fingerprint data (labels, aliases and descriptions) are now inserted
> into the secondary tables on the fly as well.
>
> The results of the first tests using
>
> https://github.com/UB-Mannheim/RaiseWikibase/blob/main/megaWikibase.py
> are here:
>
> 1. 8965 properties with monolingual labels, descriptions and aliases are
> uploaded in 99 seconds (previously in 42 seconds). So roughly 90 properties
> per second.
> 2. 20000 items with one label and without aliases/descriptions (but with
> 2-3 claims with a qualifier & reference) are uploaded in 112 seconds
> (previously in 84 seconds). Roughly 178 items per second.
>
> Performance has dropped. Let's see what can be optimized.
>
> Kind regards,
> Renat
> _______________________________________________
> Wikibaseug mailing list -- wikibaseug(a)lists.wikimedia.org
> To unsubscribe send an email to wikibaseug-leave(a)lists.wikimedia.org
>