Dear Wikidata community members,
The Search Platform team has been busy on the Search side, improving how we ingest documents into Elasticsearch https://wikitech.wikimedia.org/wiki/Search/Update_Pipeline. This work is ramping down, and our next priority, as mentioned in our annual plan, is to explore options to scale Wikidata Query Service.

What are we trying to address?
We are convinced that the current highest risk to Wikidata Query Service is the data size and data growth.
Wikidata is growing at a rate of roughly 1 billion triples per year and is already one of the largest public SPARQL endpoints on the internet. This is already causing visible issues, such as queries that used to complete in a reasonable amount of time now timing out. It is also creating less visible issues, both in managing the infrastructure (it took us ~3 months to reload the data from scratch the last time we tried) and in the overall stability of the system (see the Blazegraph failure playbook https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update/Blazegraph_failure_playbook). We have strong evidence that increased data size could lead to a hard scaling limit for the service.

What are we NOT trying to address?
Write load: Wikidata sees around 1 million edits per day. Wikidata Query Service used to be a bottleneck here, preventing bots from editing via the maxlag mechanism https://www.mediawiki.org/wiki/Manual:Maxlag_parameter. This has been addressed by the Wikidata Streaming Updater and does not need further work at the moment.
Query load / query optimization: We know there are issues with queries timing out, and that Wikidata Query Service is sometimes overloaded to the point where we drop queries. The stability of the system is imperfectly addressed by throttling queries, and more servers have been added to handle the additional load. While this is certainly inconvenient for WDQS users, we think it is manageable and does not have as much impact as a complete failure of the system due to data size.
Replacing Blazegraph: Blazegraph is unmaintained and will eventually need to be replaced. Our analysis of alternative backends https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update/WDQS_backend_alternatives shows that there are limited options for a graph the size of Wikidata. Addressing the graph size first will open up more options when we work on replacing Blazegraph.

What is the plan?
We want to experiment with splitting the Wikidata Query Service graph and using federation for the queries that need access to all subgraphs. This is a breaking change, which will require a number of queries to be rewritten, either to access a new SPARQL endpoint or to use federation. We want to have a good understanding of the trade-offs before we commit to any long-term solution.
We’ve identified separating out scholarly articles as a good first experiment: scholarly articles represent roughly half of Wikidata's triples https://wikitech.wikimedia.org/wiki/User:AKhatun/Wikidata_Subgraph_Analysis#Table_of_top_50_subgraph_information, they affect only about 2% of queries https://wikitech.wikimedia.org/wiki/User:AKhatun/Wikidata_Scholarly_Articles_Subgraph_Analysis#Queries_related_to_scholarly_articles_2 (many of which are run as part of data imports), and such a split would be easy to understand.
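For readers curious about what "use federation" could look like in practice, here is a minimal sketch, assuming a hypothetical separate SPARQL endpoint for the scholarly-article subgraph. The endpoint URL below is a placeholder for illustration only, not an announced service; the prefixes, properties and items are the standard Wikidata ones.

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Mathematicians are looked up in the main Wikidata graph; their scholarly
# articles are fetched from the hypothetical split-out endpoint via SERVICE.
SELECT ?author ?authorLabel ?article WHERE {
  ?author wdt:P31 wd:Q5 ;                                 # instance of: human
          wdt:P106 wd:Q170790 ;                           # occupation: mathematician
          rdfs:label ?authorLabel .
  FILTER(LANG(?authorLabel) = "en")
  SERVICE <https://query-scholarly.example.org/sparql> {  # placeholder endpoint
    ?article wdt:P31 wd:Q13442814 ;                       # instance of: scholarly article
             wdt:P50 ?author .                            # author
  }
}
LIMIT 100

Today the same pattern runs against a single endpoint without the SERVICE clause; the experiment is about understanding what that extra hop costs for the roughly 2% of queries that touch both subgraphs.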
We did consider other potential splits, but they don’t seem as promising. For example:
- Truthy vs fully reified graph https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Statement_types: the truthy graph would be smaller, but we would still need to maintain the full graph (see the sketch after this list for the difference between the two statement forms).
- Labels and descriptions: these only amount to ~20% of the graph, and such a split would require most queries to use federation.
- Astronomical objects: these only amount to ~10% of the graph.
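For readers less familiar with the RDF dump format, here is a minimal sketch, not part of the proposal itself, contrasting the two statement forms on a single item; it uses the standard WDQS prefixes and runs against the current endpoint.

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>

SELECT ?truthyPopulation ?statedPopulation ?pointInTime WHERE {
  # Truthy form: one direct triple per best-rank statement, no qualifiers or references.
  wd:Q64 wdt:P1082 ?truthyPopulation .       # Berlin -> population
  # Fully reified form: a statement node that carries values, qualifiers, references and rank.
  wd:Q64 p:P1082 ?statement .
  ?statement ps:P1082 ?statedPopulation ;
             pq:P585 ?pointInTime .          # qualifier: point in time
}

The truthy triples are what most simple queries rely on, but qualifiers, references and ranks only exist in the reified form, which is why the full graph would still need to be maintained even if a truthy-only graph were split out.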
To better understand the impact of such a split, we will create a test environment based on a split of a static dump, and create a test plan based on a subset of queries that we see in the current WDQS production environment. Eventually, we will expose this test environment to gather more feedback from all of you.

What is the approximate schedule?
We anticipate the following, but may need to adjust if there are unforeseen challenges.
- By end of January 2024: availability of a somewhat stable testing environment
- By end of January 2024: testing of the split on a subset of existing queries, and feedback from all of you about how this split is functioning for different workloads
- By end of March 2024: reflection on this experiment and next steps, either experimentation with a different split or productionizing the current one
What is NOT part of the plan?
- Splits other than scholarly articles: other experiments might come up in the future, but we want to focus on scholarly articles first.
- Real-time updates: to reduce the complexity of the experiment, we will focus on a static dump. If the experiment is successful, more work will be done to ensure that the split graphs can be updated in real time.
- Production implementation of multiple graphs: we will only commit to a production implementation if the experiment is successful.
Success criteria
Part of the experimentation is understanding the impacts of this split, so we only have imperfect metrics at this time.
- Blazegraph stability is not threatened by the size of the graph. Our expectation is that a size reduction of 25% will give us leeway. A proxy metric for stability is our ability to reload the data from scratch in less than 10 days.
- Query time is not increased for most queries.
- The number of queries requiring a rewrite due to federation is minimal.
- The number of queries rendered too expensive by federation is minimal.
How to learn more?
We will create a wiki page for the project shortly; it will be the main focal point for discussions. You are always welcome to join the Search Platform Office Hours https://wikitech.wikimedia.org/wiki/Search_Platform/Contact#Office_Hours (first Wednesday of every month) to ask more questions and have a direct discussion with the team.
This communication is also available on wiki at https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update/October_2023_scaling_update.
Thank you all for your help and support!
Guillaume