Dear Wikidata community members,


The Search Platform team has been busy on the Search side improving how we ingest documents into Elasticsearch. This work is ramping down and our next priority is to focus on exploring options to scale Wikidata Query Service as mentioned in our annual plan.

What are we trying to address?

We are convinced that the highest current risk to Wikidata Query Service is data size and data growth.


Wikidata is growing at a rate of roughly 1 billion triples per year and is already one of the largest public SPARQL endpoints on the internet. This is already causing visible issues, such as queries that used to run in a reasonable amount of time and are now timing out. It is also creating less visible issues, both in managing the infrastructure (it took us ~3 months to reload data from scratch last time we tried) and in the overall stability of the system (see the Blazegraph failure playbook). We have strong evidence that increased data size could lead to a hard scaling limit of the service.

What are we NOT trying to address?

Write load: Wikidata sees around 1 million edits per day. Wikidata Query Service used to be a bottleneck, and was preventing bots from editing via maxlag. This has been addressed with the Wikidata Streaming Updater and does not need further work at the moment.


Query load / query optimization: We know there are issues with queries timing out, and that Wikidata Query Service is sometimes overloaded to the point where we are dropping queries. The stability of the system is imperfectly addressed by throttling queries, and more servers have been added to handle additional load. While this is certainly inconvenient to WDQS users, we think it is manageable and does not have as much impact as a complete failure of the system due to data size.


Replacing Blazegraph: Blazegraph is unmaintained and will eventually need to be replaced. Our analysis of alternative backends shows that there are limited options for a graph the size of Wikidata. Addressing the graph size first will open more options when we work on replacing Blazegraph.

What is the plan?

We want to experiment with splitting the Wikidata Query Service graph and using federation for the queries that need access to all subgraphs. This is a breaking change, which will require a number of queries to be rewritten, either to access a new SPARQL endpoint, or to use federation. We want to have a good understanding of the trade-offs before we commit to any long-term solution.


We’ve identified separation of scholarly articles as a good first experiment: Scholarly articles represent roughly half of Wikidata triples, affect only about 2% of queries (many of which are done as part of data imports), and such a split would be easy to understand.
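To make the idea of federation more concrete, here is a sketch of what a query spanning both subgraphs might look like after such a split. This is purely illustrative: the scholarly-article endpoint URL is hypothetical (no such endpoint exists yet), and the exact query shape will depend on the outcome of the experiment.

```sparql
# Hypothetical example: find scholarly articles whose main subject is a
# chemical element. After a split, the article triples would live in a
# separate subgraph, reached via a SPARQL 1.1 SERVICE clause.
SELECT ?paper ?paperLabel ?element WHERE {
  # Evaluated against the main Wikidata graph:
  ?element wdt:P31 wd:Q11344 .            # instance of: chemical element

  # Federated call to the scholarly-article subgraph.
  # The endpoint URL below is a placeholder, not a real service:
  SERVICE <https://query-scholarly.example.org/sparql> {
    ?paper wdt:P31 wd:Q13442814 ;         # instance of: scholarly article
           wdt:P921 ?element .            # main subject
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
```

Queries that touch only one subgraph would simply run against the corresponding endpoint without any SERVICE clause; only the minority of queries that join across the split would need a rewrite along these lines.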


We did consider other potential splits, but they don’t seem as promising.



To better understand the impact of such a split, we will create a test environment based on a split of a static dump, and create a test plan based on a subset of queries that we see in the current WDQS production environment. Eventually, we will expose this test environment to gather more feedback from all of you.

What is the approximate schedule?

We have an approximate schedule in mind, but may need to adjust it if there are unforeseen challenges.


What is NOT part of the plan?

Success criteria

Part of the experimentation is understanding the impacts of this split, so we only have imperfect metrics at this time.


How to learn more?

We will create a wiki page for the project shortly; it will be the main focal point for discussions. You are always welcome to join the Search Platform Office Hours (first Wednesday of every month) to ask more questions and have a direct discussion with the team.


This communication is also available on wiki.


Thank you all for your help and support!


   Guillaume




--
Guillaume Lederrey (he/him)
Engineering Manager
Wikimedia Foundation