Thanks for the questions and comments. I’ll try to address what I can at the moment (any mistakes in understanding here are my own, and will hopefully be corrected by someone with more expertise!).
1. LDF: We currently have a version of Wikidata LDF at https://query.wikidata.org/bigdata/ldf, though I do not believe it currently functions very well, and it does not see much usage/traffic (these are probably related issues). LDF may be part of a solution, but it likely cannot be the only solution.
2. The meaning of "delete" in communications: First, I want to re-emphasize that this would be part of a mitigation strategy for a potential unintended catastrophic Blazegraph failure. In this scenario, we would remove data from the WDQS graph without permanently removing it from Wikidata itself, until we are able to restore it. While these data would be invisible to WDQS, they should still be accessible from the dumps: https://www.wikidata.org/wiki/Wikidata:Database_download
3.
Java 9+ JVM, G1GC garbage collection. I do not have the personal technical knowledge to answer this question at the moment, but I trust that my team has considered and/or tried many of the documented tactics/patches to help with scaling. And if not, then I appreciate the suggestions for other things to try!
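For anyone curious what the commonly documented G1GC tactics generally look like, here is a generic illustration; the heap sizes, pause target, and jar name are placeholders, not the actual WDQS configuration:

```shell
# Illustrative G1GC settings for a large-heap JVM service.
# All values here are placeholders for discussion, not WDQS's real config.
java -Xms16g -Xmx16g \
     -XX:+UseG1GC \
     -XX:MaxGCPauseMillis=200 \
     -XX:+ParallelRefProcEnabled \
     -jar blazegraph.jar
```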
best, Mike
—
*Mike Pham* (he/him) Sr Product Manager, Search Wikimedia Foundation https://wikimediafoundation.org/
On 19 August 2021 at 08:52:49, Marco Fossati (fossati@spaziodati.eu) wrote:
Dropping my two cents here: I'm wondering about the Wikidata Linked Data Fragments (LDF) service [1] usage.
LDF [2] is nice because it shifts the computation burden to the client, at the cost of less expressive SPARQL queries, IIRC. I think it would be a good idea to forward simple queries to that service, instead of WDQS.
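To illustrate the client-side model: a TPF server answers only single triple-pattern requests, and the client fetches and joins fragments itself. A minimal sketch of how such a request could be built (parameter names follow the Triple Pattern Fragments convention; the exact interface of the Wikidata endpoint may differ):

```python
from urllib.parse import urlencode

# Hypothetical sketch of a Triple Pattern Fragments (TPF) request builder.
# The server only answers single triple patterns; combining fragments into
# richer queries is the client's job, which is what shifts the computation
# burden client-side.
LDF_ENDPOINT = "https://query.wikidata.org/bigdata/ldf"

def fragment_url(subject=None, predicate=None, obj=None, page=1):
    """Build the URL for one triple-pattern fragment request."""
    params = {}
    if subject is not None:
        params["subject"] = subject
    if predicate is not None:
        params["predicate"] = predicate
    if obj is not None:
        params["object"] = obj
    if page > 1:
        params["page"] = page
    return LDF_ENDPOINT + ("?" + urlencode(params) if params else "")

# e.g. the fragment for "everything that is an instance of (P31) human (Q5)":
url = fragment_url(
    predicate="http://www.wikidata.org/prop/direct/P31",
    obj="http://www.wikidata.org/entity/Q5",
)
```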
Cheers,
Marco
[1] https://query.wikidata.org/bigdata/ldf
[2] https://linkeddatafragments.org/
On 8/19/21 12:48 AM, Imre Samu wrote:
(i) identify and delete lower priority data (e.g. labels,
descriptions, aliases, non-normalized values, etc);
Ouch. For me
- as a native Hungarian: the labels, descriptions, and aliases are
extremely important
- as a data user: I am using "labels" and "aliases" in my concordance tools
(mapping Wikidata IDs with external IDs)
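If labels and aliases ever became invisible to WDQS, a concordance like this could in principle fall back on the dumps. A sketch against the Wikidata JSON dump entity format, with a sample entity inlined for illustration (a real dump streams roughly one entity per line):

```python
import json

# Sketch: recovering labels/aliases for an ID-mapping concordance from the
# Wikidata JSON dump format, where each entity carries "labels" and
# "aliases" keyed by language code.
sample = json.loads("""
{
  "id": "Q64",
  "labels": {"hu": {"language": "hu", "value": "Berlin"}},
  "aliases": {"hu": [{"language": "hu", "value": "Berlin, Németország"}]}
}
""")

def names_for(entity, lang):
    """Collect the label and all aliases of one entity in one language."""
    names = []
    label = entity.get("labels", {}).get(lang)
    if label:
        names.append(label["value"])
    for alias in entity.get("aliases", {}).get(lang, []):
        names.append(alias["value"])
    return names

mapping = {sample["id"]: names_for(sample, "hu")}
# mapping: {"Q64": ["Berlin", "Berlin, Németország"]}
```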
So please clarify the practical meaning of *"delete"*.
Thanks in advance, Imre
Mike Pham <mpham@wikimedia.org> wrote (on 18 Aug 2021, Wed, 23:08):
Wikidata community members,
Thank you for all of your work helping Wikidata grow and improve over the years. In the spirit of better communication, we would like to take this opportunity to share some of the current challenges Wikidata Query Service (WDQS) is facing, and some strategies we have for dealing with them.
WDQS currently risks failing to provide acceptable service quality due to the following reasons:
Blazegraph scaling
Graph size. WDQS uses Blazegraph as our graph backend. While Blazegraph can theoretically support 50 billion edges https://blazegraph.com/, in reality Wikidata is the largest graph we know of running on Blazegraph (~13 billion triples https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=7...), and there is a risk that we will reach the size limit https://www.w3.org/wiki/LargeTripleStores#Bigdata.28R.29_.2812.7B.29 of what it can realistically support https://phabricator.wikimedia.org/T213210. Once Blazegraph is maxed out, WDQS can no longer be updated. This will also break Wikidata tools that rely on WDQS.
Software support. Blazegraph is end-of-life software that is no longer actively maintained, making it an unsustainable backend for the long term.
Blazegraph maxing out in size poses the greatest risk of catastrophic failure, as it would effectively prevent WDQS from being updated further and cause it to inevitably fall out of date. Our long-term strategy to address this is to move to a new graph backend that best meets our WDQS needs and is actively maintained, and to begin the migration off of Blazegraph as soon as a viable alternative is identified https://phabricator.wikimedia.org/T206560.
In the interim, we are exploring disaster mitigation options for reducing Wikidata's graph size in case we hit this upper size limit: (i) identifying and deleting lower-priority data (e.g. labels, descriptions, aliases, non-normalized values, etc.); (ii) separating out certain subgraphs (such as Lexemes and/or scholarly articles). This would be a last-resort scenario to keep Wikidata and WDQS running with reduced functionality until we are able to deploy a more long-term solution.
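To make option (ii) a little more concrete: Lexeme entity IDs begin with "L" while item IDs begin with "Q", so a Lexeme subgraph could in principle be split off by ID prefix alone. A purely illustrative sketch, not the actual mechanism under consideration (splitting out scholarly articles would need a property-based test instead, since those are ordinary Q-items):

```python
# Illustrative partitioning of entities into subgraphs by ID prefix.
def route_entity(entity_id):
    """Pick a target subgraph for one entity ID."""
    if entity_id.startswith("L"):
        return "lexeme-subgraph"
    return "main-subgraph"

ids = ["Q42", "L99", "Q5"]
routing = {eid: route_entity(eid) for eid in ids}
# routing: {"Q42": "main-subgraph", "L99": "lexeme-subgraph", "Q5": "main-subgraph"}
```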
Update and access scaling
Throughput. WDQS is currently trying to provide fast updates and fast, unlimited queries for all users. As the number of SPARQL queries grows over time https://www.mediawiki.org/wiki/User:MPopov_(WMF)/Wikimania_2021_Hackathon alongside graph updates, WDQS is struggling to sufficiently keep up https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=8... with each dimension of service quality without compromising anywhere. For users, this often leads to timed-out queries.
Equitable service. We are currently unable to adjust system behavior per user/agent. As such, it is not possible to provide equitable service to users: for example, a heavy user could swamp WDQS enough to hinder usability by community users.
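The "equitable service" idea is essentially per-agent throttling. A minimal sketch of one common approach, a token bucket per user agent, so a heavy user exhausts only its own budget (illustrative only; not how WDQS is actually configured, and the rate/capacity numbers are placeholders):

```python
import time

# One token bucket per user agent: tokens refill at a fixed rate, each
# admitted request spends one token, and requests are rejected once a
# given agent's bucket is empty.
class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # burst ceiling
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}  # one bucket per user agent

def admit(agent):
    """Admit or reject one request from the given user agent."""
    bucket = buckets.setdefault(agent, TokenBucket(rate=5, capacity=10))
    return bucket.allow()
```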
In addition to being a querying service for Wikidata, WDQS is also part of the edit pipeline of Wikidata (every edit on Wikidata is pushed to WDQS to update the data there). While deploying the new Flink-based Streaming Updater https://phabricator.wikimedia.org/T244590 will help with increasing the throughput of Wikidata updates, there is a substantial risk that WDQS will be unable to keep up with the combination of increased querying and updating, resulting in more trade-offs between update lag and querying latency/timeouts.
In the near term, we would like to work more closely with you to determine what acceptable trade-offs would be for preserving WDQS functionality while we scale up Wikidata querying. In the long term, we will be conducting more user research to better understand your needs so we can (i) optimize querying via SPARQL and/or other methods, (ii) explore better user management that will allow us to prevent heavy use of WDQS that does not align with the goals of our movement and projects, and (iii) make it easier for users to set up and run their own query services.
Though this information about the current state of WDQS may not be a total surprise to many of you, we want to be as transparent as possible, to ensure that there are as few surprises as possible in the case of any potential service disruptions or catastrophic failures, and that we can accommodate your work as best we can in the future evolution of WDQS. We plan on doing a session on WDQS scaling challenges during WikidataCon this year, at the end of October.
Thanks for your understanding with these scaling challenges, and for any feedback you have already been providing. If you have new concerns, comments, or questions, you can best reach us at this talk page: https://www.wikidata.org/wiki/Wikidata_talk:Query_Service_scaling_update_Aug....
Additionally, if you have not yet had a chance to fill out our survey https://docs.google.com/forms/d/e/1FAIpQLSe1H_OXQFDCiGlp0QRwP6-Z2CGCgm96MWBB..., please tell us how you use the Wikidata Query Service (see privacy statement https://foundation.wikimedia.org/wiki/WDQS_User_Survey_2021_Privacy_Statemen...).
Whether you are an occasional user or create tools, your feedback is needed to decide our future development.
Best,
WMF Search + WMDE
Wikidata mailing list -- wikidata@lists.wikimedia.org To unsubscribe send an email to wikidata-leave@lists.wikimedia.org