Hello all!
We have been hard at work on our Graph Split experiment [1], and we now have a working graph split that is loaded onto 3 test servers. We are running tests on a selection of queries from our logs to help understand the impact of the split. We need your help to validate the impact on various use cases and workflows around Wikidata Query Service.
**What is the WDQS Graph Split experiment?**
We want to address the growing size of the Wikidata graph by splitting it into 2 subgraphs of roughly half the size of the full graph, which should support the growth of Wikidata for the next 5 years. This experiment is about splitting the full Wikidata graph into a scholarly articles subgraph and a “main” graph that contains everything else.
See our previous update for more details [2].
**Who should care?**
Anyone who uses WDQS through the UI or programmatically should check the impact on their use cases, scripts, bots, code, etc.
**What are those test endpoints?**
We expose 3 test endpoints, for the full, main and scholarly articles graphs. Those graphs are all created from the same dump and are not live updated. This allows us to compare queries between the different endpoints, with stable, unchanging data (the data are from the middle of October 2023).
The endpoints are:
* https://query-full-experimental.wikidata.org/
* https://query-main-experimental.wikidata.org/
* https://query-scholarly-experimental.wikidata.org/
Each of the endpoints is backed by a single dedicated server of performance similar to the production WDQS servers. We don’t expect performance to be representative of production due to the different load and to the lack of updates on the test servers.
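To compare a query across the three test endpoints, something like the following Python sketch may help. It assumes each test host serves SPARQL at a `/sparql` path, as the production service does; that path, the `User-Agent` string and the helper names are illustrative assumptions, not something confirmed for the test servers.

```python
import json
import urllib.parse
import urllib.request

# Assumption: each test host exposes SPARQL at /sparql, like production WDQS.
ENDPOINTS = {
    "full": "https://query-full-experimental.wikidata.org/sparql",
    "main": "https://query-main-experimental.wikidata.org/sparql",
    "scholarly": "https://query-scholarly-experimental.wikidata.org/sparql",
}


def build_request(endpoint, query):
    """Build a GET request that asks for SPARQL JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {
        "Accept": "application/sparql-results+json",
        # Placeholder User-Agent; replace it with one identifying your tool.
        "User-Agent": "graph-split-check/0.1 (example script)",
    }
    return urllib.request.Request(url, headers=headers)


def run_query(endpoint, query, timeout=60):
    """Run the query on one endpoint and return its list of result bindings."""
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]


def compare(query):
    """Return the number of result rows per test endpoint for one query."""
    return {name: len(run_query(url, query)) for name, url in ENDPOINTS.items()}
```

Calling `compare(...)` with one of your own queries returns a row count per endpoint; a difference between "full" and "main" suggests the query depends on scholarly articles.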
**What kind of feedback is useful?**
We expect queries that don’t require scholarly articles to work transparently on the “main” subgraph. Queries that do require scholarly articles will need to be rewritten to use SPARQL federation between the “main” and scholarly subgraphs. Federation is already supported for some external SPARQL servers [3]; here it is used for internal server-to-server communication. We are doing tests and analysis based on a sample of query logs.
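As a rough illustration of what such a rewrite could look like, the Python sketch below builds a federated query intended to run on the “main” endpoint: it selects authors by employer there and fetches their articles from the scholarly subgraph through a `SERVICE` clause. The `/sparql` path on the scholarly host, the function name and the choice of properties (P108 employer, P50 author, P1476 title) are assumptions made for the example; the actual federation setup may differ.

```python
import re

# Assumption: the scholarly test host accepts federated requests at /sparql.
SCHOLARLY_SPARQL = "https://query-scholarly-experimental.wikidata.org/sparql"


def articles_by_employer_query(employer_qid, limit=10):
    """Build a federated query: authors from "main", articles from scholarly.

    Hypothetical helper, meant to be sent to the "main" test endpoint.
    """
    # Reject anything that is not a plain item ID such as "Q123".
    if not re.fullmatch(r"Q[1-9]\d*", employer_qid):
        raise ValueError(f"not a Wikidata item ID: {employer_qid!r}")
    return f"""PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?author ?article ?title WHERE {{
  # Evaluated on the "main" subgraph: people employed by the given item.
  ?author wdt:P108 wd:{employer_qid} .
  # Evaluated on the scholarly subgraph: articles written by those people.
  SERVICE <{SCHOLARLY_SPARQL}> {{
    ?article wdt:P50 ?author ;
             wdt:P1476 ?title .
  }}
}}
LIMIT {int(limit)}"""
```

Printing the result of `articles_by_employer_query(...)` for an item ID of your choice gives a query you can paste into the “main” test UI. Reports on whether rewrites of this shape work for your use case are exactly the kind of feedback we are looking for.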
**We want to hear about:**
* General use cases or classes of queries which break under federation
* Bots or applications that need significant rewrite of queries to work with federation
* And also about use cases that work just fine!
Examples of queries and pointers to code will be helpful in your feedback.
**Where should feedback be sent?**
You can reach out to us using the project’s talk page [1], the Phabricator ticket for community feedback [4], or by pinging Sannita (WMF) [5] directly.
**Will feedback be taken into account?**
Yes! We will review feedback and it will influence our path forward. That being said, there are limits to what is possible. The size of the Wikidata graph is a threat to the stability of WDQS and thus a threat to the whole Wikidata project. Scholarly articles is the only split we know of that would reduce the graph size sufficiently. We can work together on providing support for a migration, on reviewing the rules used for the graph split, but we can’t just ignore the problem and continue with a WDQS that provides transparent access to the full Wikidata graph.
Have fun!
Guillaume
[1] https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_graph_split
[2] https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_upd...
[3] https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Federation
[4] https://phabricator.wikimedia.org/T356773
[5] https://www.wikidata.org/wiki/User:Sannita_(WMF)

--
Guillaume Lederrey (he/him)
Engineering Manager
Wikimedia Foundation
cloud-announce@lists.wikimedia.org