Hello,
In July, many of you participated in our Wikibase Installation & Updating
surveys (see announcement
<https://lists.wikimedia.org/hyperkitty/list/wikibaseug@lists.wikimedia.org/…>).
We compiled the results – you can have a look on Meta
<https://meta.wikimedia.org/wiki/Wikibase/Wikibase_Installation_%26_Updating…>.
Many thanks to all those who participated in the survey. Your answers will
help us find ways to improve the installation & updating process for users.
If you have any questions or additional feedback, please feel free to let
us know on the discussion page
<https://meta.wikimedia.org/wiki/Talk:Wikibase/Wikibase_Installation_%26_Upd…>
or write to me privately.
Cheers,
--
Mohammed Sadat
*Community Communications Manager for Wikidata/Wikibase*
Wikimedia Deutschland e. V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Phone: +49 (0)30 219 158 26-0
https://wikimedia.de
Dear all,
I'm posting here because there is an open DevOps position at the Lab where
I work:
https://www.tib.eu/en/tib/careers-and-apprenticeships/vacancies/details/job…
We are looking for someone with experience in OSS / MediaWiki / Wikibase
software (ideally), hence I'm posting here. Please feel free to spread the
word if you know anyone who might be interested, and feel free to reach out
to me directly at lozana.rossenova(a)tib.eu if you have any questions or
want to learn more.
Cheers,
Lozana
--
Lozana Rossenova (PhD, London South Bank University)
Digital Archives Designer and Researcher
Hello everyone,
The Wikibase development team is excited to see the emergence of
community-created import tooling such as RaiseWikibase and wikibase-insert,
particularly because Wikibase does not yet come equipped with its own
import mechanism “out of the box”. To better support the community, we
would like to offer some advice to toolmakers and provide some insight into
our planned explorations into making API-based importing better and faster.
We anticipate an inherent issue with tools that directly inject information
into the Wikibase database tables: the schemas that such tools rely on are
subject to change. Normal development processes across Wikimedia can (and
likely will) endanger the long-term health and stability of these tools, an
outcome we would like to avoid as much as possible.
Specifically, we can’t guarantee that the layout or content of the tables
that these tools write to will not change. Although Wikibase does not have
its own public stable interface policy, we work with an eye on Wikidata's
policy
<https://www.wikidata.org/wiki/Wikidata:Stable_Interface_Policy#Unstable_Int…>.
We would like to offer the following advice to tool developers (a short
sketch illustrating the first two points follows the list):
1. Wherever possible, use the HTTP APIs.
2. When using them, make sure to batch requests when possible.
3. Understand that reading from or writing directly to the DB tables may
break Wikibase behavior in subtle ways.
4. Understand that tools which read from or write to the DB on today’s
Wikibase may no longer work on tomorrow’s Wikibase.
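To make the first two points concrete, here is a rough sketch of what
API-based item creation can look like, using Python and the requests
library. Each item is created with a single wbeditentity request rather
than separate requests per label or statement. The endpoint and the item
payloads are placeholders, and we assume the session has already logged in
(for example with a bot password via action=login):

# Rough sketch of API-based item creation (Python + requests): each item is
# created with a single wbeditentity call rather than separate calls per
# label or statement. Endpoint and payloads are placeholders; the session is
# assumed to be logged in already, e.g. with a bot password via action=login.
import json
import requests

API = "https://your-wikibase.example/w/api.php"  # placeholder endpoint
session = requests.Session()

def get_csrf_token():
    """Fetch an edit (CSRF) token for the logged-in session."""
    r = session.get(API, params={
        "action": "query", "meta": "tokens", "type": "csrf", "format": "json",
    })
    return r.json()["query"]["tokens"]["csrftoken"]

def create_item(entity_data, token):
    """Create one item; the whole entity (labels, descriptions, claims)
    goes into one wbeditentity request."""
    r = session.post(API, data={
        "action": "wbeditentity",
        "new": "item",
        "data": json.dumps(entity_data),
        "token": token,
        "bot": 1,      # only honored if the account has the bot right
        "maxlag": 5,   # back off when the servers are lagged
        "format": "json",
    })
    return r.json()

token = get_csrf_token()
for i in range(1000):  # placeholder payloads
    item = {"labels": {"en": {"language": "en", "value": f"Example item {i}"}}}
    result = create_item(item, token)
    if "error" in result:
        print("edit failed:", result["error"])  # add retry/maxlag handling here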
The Wikibase development roadmap
<https://www.wikidata.org/wiki/Wikidata:Development_plan#Wikibase_ecosystem>
for the year is already tightly booked. However, Wikimedia Germany intends
to dedicate some resources during our next 2021 prototyping effort towards
exploring ways to optimize our API -- specifically, a solution approaching
something like an "import mode" which would bypass unnecessary actions when
inserting a body of previously vetted information. This might, for example,
include ignoring checks on the uniqueness of labels or user permissions. In
addition, we plan to dedicate time toward evaluating OpenRefine’s new Wikibase
reconciliation functionality
<https://docs.openrefine.org/next/manual/wikibase/overview> on behalf of
the community. We will keep you updated on our efforts related to these
topics.
If you are interested, we welcome you to watch the progress of these
related Phabricator tickets as well:
- https://phabricator.wikimedia.org/T287164 -- Improve bulk import via API
- https://phabricator.wikimedia.org/T285987 -- Do not generate full html
parser output at the end of Wikibase edit requests
Cheers,
--
Mohammed Sadat
*Community Communications Manager for Wikidata/Wikibase*
Wikimedia Deutschland e. V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Phone: +49 (0)30 219 158 26-0
https://wikimedia.de
This breaking change is relevant for anyone who consumes Wikidata RDF data
through Special:EntityData (rather than the dumps) without using the “dump”
flavor.
When an Item references other entities (e.g. the statement P31:Q5), the
non-dump RDF output of that Item (i.e. without ?flavor=dump) currently
includes the labels and descriptions of the referenced entities (e.g. P31
and Q5) in all
languages. That bloats the output drastically and causes performance
issues. See Special:EntityData/Q1337.rdf
<https://www.wikidata.org/wiki/Special:EntityData/Q1337.rdf> as an example.
We will change this so that for referenced entities, only labels and
descriptions in the request language (set e.g. via ?uselang=) and its
fallback languages are included in the response. For the main entity being
requested, labels, descriptions and aliases are still included in all
languages available, of course.
If you don’t actually need this “stub” data of referenced entities at all,
and are only interested in data about the main entity being requested, we
encourage you to use the “dump” flavor instead (include flavor=dump in the
URL parameters). In that case, this change will not affect you at all,
since the dump flavor includes no stub data, regardless of language.
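For illustration, here is roughly what the two request styles look like
from a script (a Python sketch using the Q1337 example above, with German
as an arbitrary choice of request language; any HTTP client will do):

# Sketch of the two request styles for Special:EntityData, using the Q1337
# example from above (German is an arbitrary choice of request language).
import requests

ENTITY_DATA = "https://www.wikidata.org/wiki/Special:EntityData/Q1337.rdf"

# Non-dump output: after the change, referenced entities ("stubs") only carry
# labels and descriptions in the request language (set via uselang) and its
# fallback languages.
non_dump = requests.get(ENTITY_DATA, params={"uselang": "de"})

# Dump flavor: no stub data for referenced entities at all, so this request
# is not affected by the change.
dump = requests.get(ENTITY_DATA, params={"flavor": "dump"})

print(len(non_dump.text), len(dump.text))  # compare output sizes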
This change is currently available for testing at test.wikidata.org. It
will be deployed on Wikidata on August 23rd. You are welcome to give us
general feedback by leaving a comment in this ticket
<https://phabricator.wikimedia.org/T285795>.
If you have any questions please do not hesitate to ask.
Cheers,
--
Mohammed Sadat
*Community Communications Manager for Wikidata/Wikibase*
Wikimedia Deutschland e. V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Phone: +49 (0)30 219 158 26-0
https://wikimedia.de
Hi everyone,
We have our next Wikibase Live Session on Thursday, August 26th at
16:00 UTC (18:00 Berlin).
What are you working on around Wikibase? You're welcome to come and share
with the Wikibase community.
*Details about how to participate are below:*
Time: 16:00 UTC (18:00 Berlin), 1 hour, Thursday 26th August 2021
Google Meet: https://meet.google.com/nky-nwdx-tuf
Join by phone:
https://meet.google.com/tel/nky-nwdx-tuf?pin=4267848269474&hs=1
Notes: https://etherpad.wikimedia.org/p/WBUG_2021.08.26
If you have any questions, please do not hesitate to ask.
Talk to you soon!
--
Mohammed Sadat
*Community Communications Manager for Wikidata/Wikibase*
Wikimedia Deutschland e. V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Phone: +49 (0)30 219 158 26-0
https://wikimedia.de
Hey all,
Henry (in CC) and I have been looking into the possibility of importing
a dataset on the order of 10-20 million items into Wikibase, with maybe
around 50 million claims. Wikibase would be perfect for our needs,
but we have been struggling quite a lot to load the data.
We are using the Docker version. Initial attempts on a small sample of
10-20 thousand items were not promising, with the load taking a very
long time. We found that RaiseWikibase helped to considerably speed up
the initial load:
https://github.com/UB-Mannheim/RaiseWikibase
but on a small sample of 10-20 thousand items, the secondary indexing
process was taking several hours. This is the building_indexing()
process here (which just calls maintenance scripts):
https://github.com/UB-Mannheim/RaiseWikibase/blob/main/RaiseWikibase/raiser…
This seems to be necessary for labels to appear correctly in the wiki,
and for search to work.
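As far as we can tell, that step amounts to running a handful of PHP
maintenance scripts inside the container, roughly along the following lines
(a sketch only; the exact scripts and flags that building_indexing() runs
may differ):

# Rough sketch of the secondary-indexing step as maintenance-script calls
# inside the container (container name as in our wikibase-docker setup; the
# exact scripts and flags RaiseWikibase runs may differ).
import subprocess

CONTAINER = "wikibase-docker_wikibase_1"

SCRIPTS = [
    # Rebuild the wbt_* secondary term tables (labels/descriptions/aliases).
    "php extensions/Wikibase/repo/maintenance/rebuildItemTerms.php",
    # (Re)create the Elasticsearch indices and push pages into them.
    "php extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php",
    "php extensions/CirrusSearch/maintenance/ForceSearchIndex.php",
    # Work through whatever the above left in the job queue.
    "php maintenance/runJobs.php",
]

for cmd in SCRIPTS:
    subprocess.run(["docker", "exec", CONTAINER, "bash", "-c", cmd], check=True)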
Rather than call that method, we have been trying to invoke the
maintenance scripts directly and play with arguments that might help,
such as batch size. However, some of the scripts still take a long time,
even considering the small size of what we are loading. For example:
docker exec wikibase-docker_wikibase_1 bash -c "php
extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --sleep 0.1
--batch-size 10000"
This takes around 2 hours on the small sample (which we could multiply by a
thousand for the full dataset, i.e., roughly 83 days as an estimate).
Investigating the MySQL database, we see that the script seems to be
populating four tables: wbt_item_terms, wbt_term_in_lang, wbt_text, and
wbt_text_in_lang, but these contain on the order of 20,000 tuples when
finished, so it is surprising that the process takes so long. My guess is
that the PHP code is looking up pages per item, generating thousands of
random accesses on the disk, when it would seem better to just stream
tuples/pages contiguously from the table/disk?
Later on, the CirrusSearch indexing is also taking a long time for the
small sample, generating jobs for batches that take a long time to clear.
In our previous experience, Elasticsearch will happily eat millions of
documents in an hour. We are still looking at how batch sizes might help,
but it feels like it is taking much longer than it should.
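(For concreteness, the job-queue side can be inspected and drained with the
stock MediaWiki maintenance scripts, roughly like the sketch below; the
--procs value is only an example.)

# Monitoring and draining the job queue with stock MediaWiki maintenance
# scripts (a sketch; the --procs value is only an example).
import subprocess

CONTAINER = "wikibase-docker_wikibase_1"

def mw(cmd):
    subprocess.run(["docker", "exec", CONTAINER, "bash", "-c", cmd], check=True)

# Show how many jobs of each type are queued (the CirrusSearch write jobs
# show up here).
mw("php maintenance/showJobs.php --group")

# Run the queue with several worker processes instead of waiting for the
# built-in job runner to get through it.
mw("php maintenance/runJobs.php --procs 4")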
Overall, we are wondering whether we are approaching this bulk import in
the right way. It seems that the PHP scripts are not optimised for
performance/scale? Does anyone have experience, tips or pointers on
converting and loading large-ish scale legacy data into Wikibase? Is there
no complete solution (envisaged) for this right now?
Best,
Aidan