WDQS status - Wikidata

9 Jul 2020

Hello all!

The Search Platform team will join the Wikidata office hours on July 21st
at 16:00 UTC [1]. We are looking forward to discussing Wikidata Query
Service (WDQS) and anything else you might find of interest.

We've been hard at work on Wikimedia Commons Query Service (WCQS) [2]. This
will be a SPARQL endpoint similar to WDQS, but serving the Structured Data
on Commons dataset. Our goal is to open a beta service, hosted on Wikimedia
Cloud Services (WMCS), by the end of July. The service will require an
account on Commons for authentication and will allow federation with WDQS.
We don't have a streaming update process ready yet, so to start with the
data will be reloaded weekly from the Commons dumps.

As part of that work, the dumps for Structured Data on Commons are now
available [3]. Note that the prefix used in the TTL dumps is "wd", which
does not make much sense. We are working with WMDE on renaming the
prefixes, but this is more complex than expected since "wd" is hardcoded in
more places than it should be. Those prefixes should only be valid in the
local context of the dumps, so renaming them is technically a non-breaking
change. That being said, if you start using those dumps, make sure you
don't rely on this prefix, or that you are ready for a rename [4].
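
One way to avoid depending on the prefix is to resolve every prefixed name
to its full IRI using the declarations at the top of the dump, and to work
with full IRIs from then on. The sketch below illustrates the idea with the
Python standard library only. It is not a real Turtle parser (a proper RDF
library would be the better choice for actual dumps), and the namespace IRI,
the "sdc" prefix and the "M1" identifier are placeholders made up for the
example, not the values used in the dumps.

```python
import re

# Matches a Turtle prefix declaration such as:
#   @prefix wd: <https://example.org/entity/> .
PREFIX_RE = re.compile(r'^\s*@prefix\s+([A-Za-z][\w-]*)?:\s*<([^>]*)>\s*\.\s*$')


def read_prefixes(lines):
    """Collect prefix -> namespace IRI mappings from @prefix lines."""
    prefixes = {}
    for line in lines:
        match = PREFIX_RE.match(line)
        if match:
            # The default (empty) prefix is stored under "".
            prefixes[match.group(1) or ""] = match.group(2)
    return prefixes


def expand(name, prefixes):
    """Turn a prefixed name like 'wd:M1' into a full IRI."""
    prefix, _, local = name.partition(":")
    return prefixes[prefix] + local


# Hypothetical headers: the same namespace declared under two different prefixes.
header_before = ["@prefix wd: <https://example.org/entity/> ."]
header_after = ["@prefix sdc: <https://example.org/entity/> ."]

iri_before = expand("wd:M1", read_prefixes(header_before))
iri_after = expand("sdc:M1", read_prefixes(header_after))
```

Both calls resolve to the same full IRI, so code that compares or stores
full IRIs keeps working whatever prefix the dump declares.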

We are planning to dig more into the data we have to get a better
understanding of the use cases around WDQS [5] (not much content on that
task yet, but it is coming). Some very preliminary analysis indicates that
less than 2% of the queries on WDQS generate more than 90% of the load.
This is definitely something we need to better understand. We will be
working on defining the kind of questions we need to answer, and improving
our data collection to be able to answer those questions.
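
To make the "small share of queries, large share of load" figure concrete,
here is a minimal sketch of how such a number could be computed from
per-query cost measurements. The function name and the sample costs are
invented for illustration; this is not the analysis we actually ran, and
"cost" stands in for whatever load metric ends up being collected.

```python
def share_of_queries_for_load(costs, load_share=0.9):
    """Return the smallest fraction of queries whose combined cost
    reaches `load_share` of the total cost."""
    if not costs:
        raise ValueError("costs must not be empty")
    total = sum(costs)
    running = 0.0
    # Take the most expensive queries first and stop once the target share is reached.
    for count, cost in enumerate(sorted(costs, reverse=True), start=1):
        running += cost
        if running >= load_share * total:
            return count / len(costs)
    return 1.0


# Made-up sample: one very expensive query and 99 cheap ones.
sample_costs = [1000] + [1] * 99
fraction = share_of_queries_for_load(sample_costs)
```

With this made-up sample, the single expensive query alone covers more than
90% of the total cost, so the function returns 0.01 (1% of the queries).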

We have started an internal discussion around "planning for disaster" [6].
We want to better understand the potential failure scenarios around WDQS
and have a plan if the worst case does happen. This will include some
analytics work and some testing to better understand the constraints and
what degraded mode we might still be able to provide in case of
catastrophic failure.

Thanks for reading!

   Guillaume

[1] https://www.wikidata.org/wiki/Wikidata:Events#Office_hours
[2] https://phabricator.wikimedia.org/T251488
[3] https://dumps.wikimedia.org/other/wikibase/commonswiki/
[4] https://dumps.wikimedia.org/other/wikibase/commonswiki/README_commonsrdfdum…
[5] https://phabricator.wikimedia.org/T257045
[6] https://phabricator.wikimedia.org/T257055

-- 
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+1 / CET