Wikidata community members,
Thank you for all of your work helping Wikidata grow and improve over the years. In the spirit of better communication, we would like to take this opportunity to share some of the current challenges Wikidata Query Service (WDQS) is facing, and some strategies we have for dealing with them.
WDQS currently risks failing to provide acceptable service quality due to the following reasons:
1. Blazegraph scaling

   1. Graph size. WDQS uses Blazegraph as our graph backend. While Blazegraph can theoretically support 50 billion edges (https://blazegraph.com/), in reality Wikidata is the largest graph we know of running on Blazegraph (~13 billion triples: https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=7&orgId=1&refresh=1m), and there is a risk that we will reach the size limit of what it can realistically support (https://www.w3.org/wiki/LargeTripleStores#Bigdata.28R.29_.2812.7B.29, https://phabricator.wikimedia.org/T213210). Once Blazegraph is maxed out, WDQS can no longer be updated. This will also break Wikidata tools that rely on WDQS.

   2. Software support. Blazegraph is end-of-life software that is no longer actively maintained, making it an unsustainable backend to keep building on long term.
Blazegraph maxing out in size poses the greatest risk of catastrophic failure, as it would effectively prevent WDQS from being updated further, and WDQS would inevitably fall out of date. Our long-term strategy to address this is to move to a new graph backend that best meets our WDQS needs and is actively maintained, and to begin the migration off of Blazegraph as soon as a viable alternative is identified (https://phabricator.wikimedia.org/T206560).
In the interim, we are exploring disaster mitigation options for reducing Wikidata's graph size in case we hit this upper size limit: (i) identify and delete lower-priority data (e.g. labels, descriptions, aliases, non-normalized values); (ii) separate out certain subgraphs (such as Lexemes and/or scholarly articles). This would be a last-resort scenario to keep Wikidata and WDQS running with reduced functionality until we are able to deploy a more long-term solution.
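To make the stakes of option (i) concrete, here is a minimal, illustrative sketch (not part of any current plan; the example query and client identification are just placeholders) of the kind of label-dependent lookup that tools commonly run against WDQS today:

    import requests

    # Illustrative only: a label-dependent WDQS query of the sort that would be
    # affected if labels were dropped from the graph (the data would remain in
    # Wikidata itself and in the dumps, but not be visible to SPARQL).
    # WDQS predefines the wd:, wdt:, and rdfs: prefixes used here.
    query = """
    SELECT ?item ?label WHERE {
      ?item wdt:P31 wd:Q146 .        # instances of house cat
      ?item rdfs:label ?label .
      FILTER(LANG(?label) = "hu")    # Hungarian labels
    } LIMIT 10
    """
    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "wdqs-example/0.1"},  # please identify your client
        timeout=60,
    )
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        print(row["item"]["value"], row["label"]["value"])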
2. Update and access scaling

   1. Throughput. WDQS currently tries to provide fast updates and fast, unlimited queries for all users. As the number of SPARQL queries grows over time (https://www.mediawiki.org/wiki/User:MPopov_(WMF)/Wikimania_2021_Hackathon) alongside graph updates, WDQS is struggling to keep up (https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=8&from=now-6M&to=now&refresh=1d) in every dimension of service quality without compromising somewhere. For users, this often leads to timed-out queries.

   2. Equitable service. We are currently unable to adjust system behavior per user/agent. As such, it is not possible to provide equitable service: for example, a heavy user could swamp WDQS enough to hinder usability for community users.
In addition to being a querying service for Wikidata, WDQS is also part of the edit pipeline of Wikidata (every edit on Wikidata is pushed to WDQS to update the data there). While deploying the new Flink-based Streaming Updater (https://phabricator.wikimedia.org/T244590) will help increase the throughput of Wikidata updates, there is a substantial risk that WDQS will be unable to keep up with the combination of increased querying and updating, resulting in more trade-offs between update lag and query latency/timeouts.
In the near term, we would like to work more closely with you to determine what acceptable trade-offs would be for preserving WDQS functionality while we scale up Wikidata querying. In the long term, we will be conducting more user research to better understand your needs so we can (i) optimize querying via SPARQL and/or other methods, (ii) explore better user management that will allow us to prevent heavy use of WDQS that does not align with the goals of our movement and projects, and (iii) make it easier for users to set up and run their own query services.
Though this information about the current state of WDQS may not be a total surprise to many of you, we want to be as transparent as possible, so that there are as few surprises as possible in the case of any service disruptions or catastrophic failures, and so that we can accommodate your work as best we can in the future evolution of WDQS. We plan to hold a session on WDQS scaling challenges during WikidataCon this year at the end of October.
Thanks for your understanding of these scaling challenges, and for any feedback you have already provided. If you have new concerns, comments, or questions, you can best reach us at this talk page: https://www.wikidata.org/wiki/Wikidata_talk:Query_Service_scaling_update_Aug_2021. Additionally, if you have not had a chance to fill out our survey (https://docs.google.com/forms/d/e/1FAIpQLSe1H_OXQFDCiGlp0QRwP6-Z2CGCgm96MWBBmiqsMLu0a6bhLg/viewform?usp=sf_link) yet, please tell us how you use the Wikidata Query Service (see the privacy statement: https://foundation.wikimedia.org/wiki/WDQS_User_Survey_2021_Privacy_Statement)! Whether you are an occasional user or a tool creator, your feedback is needed to guide our future development.
Best,
WMF Search + WMDE
If you have new concerns, comments and questions, you can best reach us at this talk page
Oops, sorry, this is the link that should actually work: https://www.wikidata.org/w/index.php?title=Wikidata_talk:Query_Service_scali...
—
*Mike Pham* (he/him) Sr Product Manager, Search Wikimedia Foundation https://wikimediafoundation.org/
On 18 August, 2021 at 16:07:01, Mike Pham (mpham@wikimedia.org) wrote:
Hi Mike,
Are the Blazegraph instances running on a Java 9+ JVM? I assume they are configured with the G1GC garbage collector? Did you also try enabling -XX:+UseStringDeduplication to see if it can at least help a little and reduce some of the heap overhead from long-lived strings? https://github.com/blazegraph/database/blob/master/blazegraph-rpm/src/rpm/co...
Also, has the team ever seen humongous objects fragmenting the old generation? In large numbers they can fragment the heap, as mentioned in https://www.oracle.com/technical-resources/articles/java/g1gc.html
There are also a few things to consider in that G1GC article that might be worth looking into, or thinking about from a data <--> memory perspective, since you don't have much control over heap usage in many places within Blazegraph.
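For concreteness, something like this is what I have in mind (a sketch only: the jar path and heap sizes are placeholders, not your actual configuration):

    import subprocess

    # Sketch: launching a standalone Blazegraph instance with the GC options
    # discussed above. "blazegraph.jar" and the 16g heap are placeholders.
    jvm_opts = [
        "-Xms16g", "-Xmx16g",           # fixed heap; size to your hardware
        "-XX:+UseG1GC",                 # explicit, though G1 is the default on Java 9+
        "-XX:+UseStringDeduplication",  # G1-only: dedupe long-lived identical strings
        "-Xlog:gc*:file=gc.log",        # unified GC logging (Java 9+); look for
                                        # humongous allocations and old-gen fragmentation
    ]
    subprocess.run(["java", *jvm_opts, "-jar", "blazegraph.jar"], check=True)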
Hope this helps a bit. If I think of any other quick patches, I'll let you know.
Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/
(i) identify and delete lower priority data (e.g. labels, descriptions, aliases, non-normalized values, etc);
Ouch. For me, as a native Hungarian, the labels, descriptions, and aliases are extremely important. And as a data user, I use "labels" and "aliases" in my concordance tools (mapping Wikidata IDs to external IDs).
So please clarify the practical meaning of *"delete"*.
Thanks in advance, Imre
Mike Pham (mpham@wikimedia.org) wrote (on Wed, 18 Aug 2021 at 23:08):
Dropping my two cents here: I'm wondering about usage of the Wikidata Linked Data Fragments (LDF) service [1].
LDF [2] is nice because it shifts the computation burden to the client, at the cost of less expressive SPARQL queries, IIRC. I think it would be a good idea to forward simple queries to that service instead of WDQS.
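For instance, a single triple pattern can be fetched over plain HTTP (a sketch following the Triple Pattern Fragments convention; the parameter names and response handling here are simplified assumptions):

    import requests

    # Sketch: one Triple Pattern Fragments request against the Wikidata LDF
    # endpoint [1], asking for triples matching "?s wdt:P31 wd:Q146".
    # The client (not the server) then joins such fragments to answer a query.
    resp = requests.get(
        "https://query.wikidata.org/bigdata/ldf",
        params={
            "predicate": "http://www.wikidata.org/prop/direct/P31",
            "object": "http://www.wikidata.org/entity/Q146",
        },
        headers={"Accept": "text/turtle"},
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.text[:1000])  # first triples of the fragment, plus paging metadata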
Cheers,
Marco
[1] https://query.wikidata.org/bigdata/ldf [2] https://linkeddatafragments.org/
On 8/19/21 12:48 AM, Imre Samu wrote:
Hi Marco,
The problem with LDF is that not only is the compute shifted to the client, but in order to do the compute, the data to be computed over must sometimes be transferred as well; even for a simple query, that transfer can be substantial. It is not always substantial, but it depends on what the query asks. (Think of those who don't have dependable internet, or who are limited in their daily transfer quota or speed.) That's the con of LDF, while it certainly provides the pro of letting you compute things yourself.
So you have clients (users wanting free knowledge) that might be limited when it comes to LDF. But then there are Wikidata's limits, specifically Blazegraph and WDQS. Us || Them: who do we choose or prioritize?
As part of the Wikimedia mission, the choice has always been *offering both*, so that those clients that do live in a world with limits might still be afforded *a small luxury of free knowledge*.
Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/
On Thu, Aug 19, 2021 at 8:52 AM Marco Fossati fossati@spaziodati.eu wrote:
Thanks for the questions and comments. I'll try to address what I can at the moment (any mistakes in understanding here are my own, and will hopefully be corrected by someone with more expertise!).

1. LDF: We currently do have some version of Wikidata LDF (https://query.wikidata.org/bigdata/ldf), though I do not believe it currently functions very well, and it does not currently see much usage/traffic (these are probably related issues). LDF may be part of a solution, but likely cannot be the only solution.

2. Meaning of "delete" in communications: First, I want to re-emphasize that this would be part of a mitigation strategy for a potential unintended catastrophic Blazegraph failure. In this scenario, we would be removing data from the WDQS graph without permanently removing it from Wikidata itself, until we are able to restore it. While these data would be invisible to WDQS, they should still be accessible from the dumps (https://www.wikidata.org/wiki/Wikidata:Database_download); see the sketch after this list.

3. Java 9+ JVM, G1GC garbage collection: I do not have the personal technical knowledge to answer this question at the moment, but I trust that my team has considered and/or tried many of the documented tactics/patches to help with scaling. And if not, then I appreciate the suggestions for other things to try!
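For example (a sketch only; the dump filename is a placeholder, and this assumes the one-entity-per-line layout of the JSON dumps), labels and aliases could still be recovered offline like this:

    import gzip
    import json

    # Sketch: recovering Hungarian labels/aliases from a Wikidata JSON dump.
    # The JSON dump is one large array with one entity object per line.
    with gzip.open("wikidata-all.json.gz", "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue  # skip the array brackets and blank lines
            entity = json.loads(line)
            label = entity.get("labels", {}).get("hu", {}).get("value")
            aliases = [a["value"] for a in entity.get("aliases", {}).get("hu", [])]
            if label:
                print(entity["id"], label, aliases)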
best, Mike
—
*Mike Pham* (he/him) Sr Product Manager, Search Wikimedia Foundation https://wikimediafoundation.org/
On 19 August, 2021 at 08:52:49, Marco Fossati (fossati@spaziodati.eu) wrote:
Marco,
First, I just wanted to say it is *awesome* to see this level of transparency and clarity about the state of the service.
Maybe this is oversimplifying things, but is it accurate to say that there are two orthogonal problems here?

1. The underlying technology (Blazegraph) is end-of-life and needs to be replaced.
2. The query service is open to the public, with no authentication, which means the Wikidata team has very little idea who/what depends on the service.

For 2, I wonder if it might make sense to start requiring registration & authentication? Or is this not the wiki way?
//Ed
Hey everyone,
On Thu, Aug 19, 2021 at 9:42 PM Ed Summers ehs@pobox.com wrote:
First, I just wanted to say it is *awesome* to see this level of transparency and clarity about the state of the service.
+1. Really glad that we're hearing about this now, before the problem becomes so serious that measures have to be taken without informing the community beforehand.
2. The query service is open to the public, with no authentication, which means the Wikidata team has very little idea who/what depends on the service.

For 2, I wonder if it might make sense to start requiring registration & authentication? Or is this not the wiki way?
Right now, the Wikimedia Commons Query Service (in beta) is doing something like that: you need to log in and authenticate before you can use it. I'm not sure if there have been any insights on the points you mention (who uses it, in what capacity). But if I'm reading Mike's original email correctly, I think the main problem is ingesting the ever-growing graph of data into Blazegraph, not people (ab)using the query service in such a way that it doesn't scale. I understand that it might make sense to add some kind of authentication layer, but it would be a shame. The Wikimedia projects' APIs (I'm counting the query service among those) are some of the very few large-scale APIs with useful data that don't need any kind of authentication, making them really useful for teaching students or for people just dabbling in writing software.
Kind regards, -- Hay / Husky
On 8/18/21 5:07 PM, Mike Pham wrote:
Hi Mike,
Do bear in mind that both before and after the selection of Blazegraph for Wikidata, we've always offered an RDF-based DBMS that can handle the current and future requirements of Wikidata, just as we do for DBpedia.
At the time of our first rendezvous, handling 50 billion triples would have typically required our Cluster Edition, which is a commercial-only offering -- basically, that was the deal breaker back then.
Anyway, in recent times our Open Source Edition has evolved to handle some 80 billion+ triples (exemplified by the live Uniprot instance), where performance and scale are primarily a function of available memory.
I hope this helps.
Related:
[1] https://wikidata.demo.openlinksw.com/sparql -- our live Wikidata SPARQL query endpoint
[2] https://docs.google.com/spreadsheets/d/15AXnxMgKyCvLPil_QeGC0DiXOP-Hu8Ln97fZ683ZQF0/edit#gid=0 -- Google Spreadsheet about various Virtuoso configurations associated with some well-known public endpoints
[3] https://t.co/EjAAO73wwE -- this query doesn't complete with the current Blazegraph-based Wikidata endpoint
[4] https://t.co/GTATPPJNBI -- the same query completing when applied to the Virtuoso-based endpoint
[5] https://t.co/X7mLmcYC69 -- about loading Wikidata's datasets into a Virtuoso instance
[6] https://twitter.com/search?q=%23Wikidata%20%23VirtuosoRDBMS%20%40kidehen&src=typed_query&f=live -- various demos shared via Twitter over the years regarding Wikidata
Ah, that's lovely. Thanks for the update, Kingsley! Uniprot is a good parallel to keep in mind.
For Egon, Andra, and others who work with them: is there someone you'd recommend chatting with at UniProt? "Scaling alongside UniProt", or at least engaging them on how to solve shared, comparable issues (they also offer authentication-free SPARQL querying), sounds like a compelling option.
S.
On Thu, Aug 19, 2021 at 4:32 PM Kingsley Idehen via Wikidata <wikidata@lists.wikimedia.org> wrote:
Hi Samuel, All,
I am the software engineer responsible for sparql.uniprot.org. I already offered to help in https://phabricator.wikimedia.org/T206561, so no need to ask Andra or Egon ;)
We are happy users of Virtuoso and strongly suggest that it be evaluated, as it is in general a good product that does scale. [1]
One of the things we did differently from WDQS is to introduce a controlled layer between the "public" and the "database". This allows things like query rewriting/redirection upon data model changes, as well as rewriting some schema rediscovery queries to a known faster query. We also parse the queries with RDF4J before handing them to Virtuoso. This makes sure that the queries we accept are only valid SPARQL 1.1, avoiding users getting used to almost-SPARQL dialects (i.e. retaining the flexibility to move to a different endpoint). We are in the process of updating this code and contributing it to RDF4J, with the first contribution in the develop/4.0.0 branch.
I think a number of the current customizations in WDQS could be moved to a front RDF4J layer. The RDF4J sail/repository layer can then be used to preserve flexibility, so that WDQS can more easily switch between backend databases in the future.
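In Python terms (our layer is Java/RDF4J, but the gatekeeping idea is the same; this is a sketch using rdflib's SPARQL parser, not our actual code):

    from rdflib.plugins.sparql.parser import parseQuery

    # Sketch of the gatekeeping idea: reject anything that is not valid
    # SPARQL 1.1 before it reaches the backend, so clients can never come
    # to depend on a vendor-specific dialect.
    def is_valid_sparql(query_string: str) -> bool:
        try:
            parseQuery(query_string)  # raises on invalid syntax
            return True
        except Exception:
            return False

    assert is_valid_sparql("SELECT ?s WHERE { ?s ?p ?o } LIMIT 1")
    assert not is_valid_sparql("SELEKT ?s WHERE { ?s ?p ?o }")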
One large difference between UniProt and WDQS is that Wikidata is continually updated, while UniProt is batch-released a few times a year. That makes WDQS somewhat easier in some areas and more difficult in others.
Regards, Jerven
[1] No database is perfect, but Virtuoso does scale a lot better than Blazegraph (which we also evaluated in the past) did. There is still a lot of potential for Virtuoso to scale even better in the future.
On 23/08/2021 21:36, Samuel Klein wrote:
Aha, hello Jerven :) I should have remembered your earlier comment; delighted you are here.
Thank you again for sharing your promising experience + benchmarks + suggestions -- and for highlighting both similarities and differences.
SJ
On Tue, Aug 24, 2021 at 2:18 AM jerven Bolleman jerven.bolleman@sib.swiss wrote:
Hi Samuel, All,
I am the software engineer responsible for sparql.uniprot.org. I already offered to help in https://phabricator.wikimedia.org/T206561. So no need to ask Andra or Egon ;)
While we are good users of virtuoso, and strongly suggest it is evaluated. As it is in general a good product that does scale.[1]
One of the things we did differently than WDQS is to introduce a controlled layer between the "public" and the "database". To allow things like query rewriting/redirection upon data model changes, as well as rewriting some schema rediscovery queries to a known faster query. We also parse the queries with RDF4J before handing them to virtuoso. This makes sure that the queries that we accept are only valid SPARQL 1.1. Avoiding users getting used to almost SPARQL dialects (i.e. retain the flexiblity to move to a different endpoint). We are in the process of updating this code and contributing it to RDF4J, with the first contribution in the develop/4.0.0 branch
I think a number of current customizations in WDQS can be moved to a front RDF4J layer. Then the RDF4J sail/repository layer can be used to preserve flexibility. So that WDQS can more easily switch between backend databases in the future.
One large difference between UniProt and WDQS is that WikiData is continually updated while UniProt is batch released a few times a year. WDQS is somewhat easier in some areas and more difficult in others because of that.
Regards, Jerven
[1] No Database is perfect, but it does scale a lot better than Blazegraph did. Which we also evaluated in the past. There is still a lot of potential in Virtuoso to scale even better in the future.
--
*Jerven Tjalling Bolleman*
Principal Software Developer *SIB | Swiss Institute of Bioinformatics* 1, rue Michel Servet - CH 1211 Geneva 4 - Switzerland t +41 22 379 58 85 Jerven.Bolleman@sib.swiss - www.sib.swiss
Thanks for all the suggestions, and for the general enthusiasm about helping scale WDQS! A number of you have suggested graph backends we might move to from Blazegraph, and I wanted to take a minute to respond more generally.
There are several criteria we need to consider for a Blazegraph alternative. Ideally we would have this list of criteria ready and available to share, so that the community could help vet alternatives with us. Unfortunately, we do not currently have a full list of these criteria. While the criteria we previously judged candidate graph backends on are available here https://docs.google.com/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit?usp=sharing, it is highly unlikely they will be the exact set we use in this next stage of scaling, so they should be treated only as a historical reference.
It is likely that there is no silver-bullet solution that will satisfy every criterion. We will probably need to make compromises in some areas in order to optimize for others. This is a primary reason for conducting the WDQS user survey https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2021/08#Wikidata_Query_Service_(WDQS)_User_Survey_2021: we would like a better understanding of the overall community priorities, including those of people who may be less vocal in existing discussions. These priorities will then be a major component in distilling the criteria (and weights) for a new graph backend.
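As an aside on what "criteria (and weights)" could look like mechanically (the criterion names and numbers below are invented for illustration, not taken from the spreadsheet or the survey), a candidate backend's overall score would simply be the weight-adjusted sum of its per-criterion scores:

import java.util.Map;

public class BackendScore {
    public static void main(String[] args) {
        // Hypothetical community-derived weights, summing to 1.0.
        Map<String, Double> weights = Map.of(
            "graph capacity", 0.4,
            "update throughput", 0.3,
            "query latency", 0.2,
            "active maintenance", 0.1);
        // Hypothetical 0-10 scores for one candidate backend.
        Map<String, Double> scores = Map.of(
            "graph capacity", 8.0,
            "update throughput", 6.0,
            "query latency", 7.0,
            "active maintenance", 9.0);
        double total = weights.entrySet().stream()
            .mapToDouble(e -> e.getValue() * scores.get(e.getKey()))
            .sum();
        System.out.printf("weighted score: %.2f / 10%n", total);
    }
}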
The current plan is to share the survey results (as up to date as we can make them) at WikidataCon https://www.wikidata.org/wiki/Wikidata:WikidataCon_2021 this year. I appreciate the discussion around potential candidates so far, and welcome the continued insight and help, but I also want to be clear that we are not making any decisions about a new graph backend right now, nor do we yet have a complete list of criteria or a testing process; WikidataCon will be the next strategic check-in point.
As always, your patience is appreciated, and I’m looking forward to the continuing discussions and collaboration!
Best, Mike
—
*Mike Pham* (he/him) Sr Product Manager, Search Wikimedia Foundation https://wikimediafoundation.org/
On 25 August 2021 at 09:41:28, Samuel Klein (meta.sj@gmail.com) wrote:
Aha, hello Jerven :) I should have remembered your earlier comment; delighted you are here.
Thank you again for sharing your promising experience + benchmarks + suggestions -- and for highlighting both similarities and differences.
SJ
On 8/25/21 3:17 PM, Mike Pham wrote:
Hi Mike,
Here's a suggestion regarding this important matter, as of 2021:
At the very least, a candidate platform should be able to deliver a live instance of the Wikidata dataset, accessible for interaction via a SPARQL query service endpoint.
Based on the interesting list of suggestions presented in this mailing list (and in the Google Spreadsheet https://docs.google.com/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit#gid=0&range=M1 it has spawned), the larger goal of a vibrant LOD Cloud knowledge graph would benefit enormously if each candidate platform actually offered a live instance.
Irrespective of the final decision made, we are always going to offer a live Wikidata instance, just as we do a LOD Cloud Cache etc.
Also note that the WDQS/SPARQL loose coupling suggested by Jerven is extremely important: making the Query Service app independent of the SPARQL query service backend will immensely improve utility and general resilience.
*Links*
[1] https://wikidata.demo.openlinksw.com/sparql -- Wikidata instance we've been hosting for quite some time
[2] http://lod.openlinksw.com/sparql -- 40 Billion+ Triples instance (used to be the largest live SPARQL Query Service instance until Uniprot dethroned it!).
[3] https://medium.com/virtuoso-blog/on-the-mutually-beneficial-nature-of-dbpedia-and-wikidata-5fb2b9f22ada -- On the Mutually Beneficial Nature of DBpedia and Wikidata
Kingsley
I like the idea of comparing live instances; could we pose a test-instance challenge, with some benchmarks, and invite different communities to take it up, hosting their own demos of what a well-tuned instance of WD could look like? (These could also be hosted by us, or spun up by advocates for a tool in our community; the challenge could also spur some Kaggle interest.)
The size of the community actively interested in the health of Wikidata seems like complementary information, alongside overall community size/health (which appears on the existing metrics list). //S
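A sketch of what the benchmark harness for such a challenge might look like (the endpoint list and query set are placeholders to be agreed by participants; the 60-second timeout mirrors the current WDQS query cutoff): fire the same SPARQL queries at every candidate's live instance and record status and latency.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.List;

public class EndpointBenchmark {
    static final List<String> ENDPOINTS = List.of(
        "https://query.wikidata.org/sparql",             // Blazegraph (current WDQS)
        "https://wikidata.demo.openlinksw.com/sparql");  // Virtuoso demo instance
    static final List<String> QUERIES = List.of(
        "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }");   // placeholder benchmark query

    public static void main(String[] args) {
        HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .build();
        for (String endpoint : ENDPOINTS) {
            for (String query : QUERIES) {
                HttpRequest request = HttpRequest.newBuilder(URI.create(
                        endpoint + "?query="
                        + URLEncoder.encode(query, StandardCharsets.UTF_8)))
                    .header("Accept", "application/sparql-results+json")
                    .timeout(Duration.ofSeconds(60))  // WDQS-style query cutoff
                    .GET()
                    .build();
                long start = System.nanoTime();
                try {
                    HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                    long ms = (System.nanoTime() - start) / 1_000_000;
                    System.out.printf("%s -> HTTP %d in %d ms%n",
                        endpoint, response.statusCode(), ms);
                } catch (Exception e) {  // timeouts and I/O errors count as failures
                    System.out.printf("%s -> failed: %s%n", endpoint, e);
                }
            }
        }
    }
}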
On 9/1/21 8:38 AM, Samuel Klein wrote:
Yes, but for best effect, in line with the ultimate goal, it should progress in stages:
[1] A basic Wikidata instance and SPARQL query service endpoint -- one that provides users and user agents with 24/7 query capability
[2] Specific query-related challenges
There should be an open invite as part of an effort to move Wikidata forward in light of its current scaling-related challenges.
If we get three stage-1 participants, that would be awesome!
Related Links:
[1] https://wikidata.demo.openlinksw.com/ -- our live instance, whose free-text query interface is a segue into faceted search & browsing
[2] https://wikidata.demo.openlinksw.com/sparql -- SPARQL Query Services Endpoint
[3] https://community.openlinksw.com/t/loading-wikidata-into-virtuoso-open-sourc... -- Loading Wikidata into a Virtuoso Open Source Edition instance
[4] https://github.com/openlink/virtuoso-opensource -- Virtuoso Open Source Edition Github Repo
Kingsley
On Fri, Aug 27, 2021 at 10:19 AM Kingsley Idehen via Wikidata <wikidata@lists.wikimedia.org mailto:wikidata@lists.wikimedia.org> wrote:
On 8/25/21 3:17 PM, Mike Pham wrote:
Thanks for all suggestions, and general enthusiasm in helping scale WDQS! A number of you have suggested various graph backends to consider moving to from Blazegraph, and I wanted to take a minute to respond more generically. There are several criteria we need to consider for a Blazegraph alternative. Ideally we would have this list of criteria ready and available to share, so that the community can help vet alternatives with us. Unfortunately, we do not currently have a full list of these criteria. While the criteria we judged candidate graph backends on are available here <https://docs.google.com/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit?usp=sharing>, it is highly unlikely these will be the exact set we will use in this next stage of scaling, and should only be used as a historical reference. It is likely that there is no silver bullet solution that will satisfy every criteria. We will probably need to make compromises in some areas in order to optimize for others. This is a primary reason for conducting the WDQS user survey <https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2021/08#Wikidata_Query_Service_(WDQS)_User_Survey_2021>: we would like a better understanding of what the overall community priorities are, including from those who may be less vocal in existing discussions. These priorities will then be a major component in distilling the criteria (and weights) for a new graph backend. The current plan is to share the (most up to date as we can) survey results at WikidataCon <https://www.wikidata.org/wiki/Wikidata:WikidataCon_2021> this year. I appreciate the discussion around potential candidates so far, and welcome the continued insight/help, but wanted to also be clear that we will not be making any decisions about a new graph backend, or have a complete list of criteria or testing process, at the moment — WikidataCon will be the next strategic check-in point. As always, your patience is appreciated, and I’m looking forward to the continuing discussions and collaboration! Best, Mike — *Mike Pham* (he/him) Sr Product Manager, Search Wikimedia Foundation <https://wikimediafoundation.org/>
Hi Mike, Here's a suggestion regarding this important matter, circa 2021: At the very least, a candidate platform should be able to deliver on a live instance of the Wikidata dataset accessible for interaction via SPARQL Query Services Endpoint. Based on the interesting list of suggestions presented in this mailing list (and in the Google Spreadsheet <https://docs.google.com/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit#gid=0&range=M1> it's spawned), the larger goal of a vibrant LOD Cloud Knowledge Graph would benefit exponentially if each platform actually offered a live instance. Irrespective of the final decision made, we are always going to offer a live Wikidata instance, just as we do a LOD Cloud Cache etc.. Also note, the WDQS and SPARQL loose-coupling suggested by Jerven is ultra-important, making that cool Query Services App independent of SPARQL Query Service backend will improve utility and general resilience, immensely. *Links* [1] https://wikidata.demo.openlinksw.com/sparql <https://wikidata.demo.openlinksw.com/sparql> -- Wikidata instance we've been hosting for quite some time [2] http://lod.openlinksw.com/sparql <http://lod.openlinksw.com/sparql> -- 40 Billion+ Triples instance (used to be the largest live SPARQL Query Service instance until Uniprot dethroned it!). [3] https://medium.com/virtuoso-blog/on-the-mutually-beneficial-nature-of-dbpedia-and-wikidata-5fb2b9f22ada <https://medium.com/virtuoso-blog/on-the-mutually-beneficial-nature-of-dbpedia-and-wikidata-5fb2b9f22ada> -- On the Mutually Beneficial Nature of DBpedia and Wikidata Kingsley
On 25August, 2021 at 09:41:28, Samuel Klein (meta.sj@gmail.com <mailto:meta.sj@gmail.com>) wrote:
Aha, hello jerven :) I should have remembered your earlier comment, delighted you are here. Thank you again for sharing your promising experience + benchmarks + suggestions -- and for highlighting both similarities and differences. SJ On Tue, Aug 24, 2021 at 2:18 AM jerven Bolleman <jerven.bolleman@sib.swiss> <mailto:jerven.bolleman@sib.swiss> wrote: Hi Samuel, All, I am the software engineer responsible for sparql.uniprot.org <http://sparql.uniprot.org>. I already offered to help in https://phabricator.wikimedia.org/T206561 <https://phabricator.wikimedia.org/T206561>. So no need to ask Andra or Egon ;) While we are good users of virtuoso, and strongly suggest it is evaluated. As it is in general a good product that does scale.[1] One of the things we did differently than WDQS is to introduce a controlled layer between the "public" and the "database". To allow things like query rewriting/redirection upon data model changes, as well as rewriting some schema rediscovery queries to a known faster query. We also parse the queries with RDF4J before handing them to virtuoso. This makes sure that the queries that we accept are only valid SPARQL 1.1. Avoiding users getting used to almost SPARQL dialects (i.e. retain the flexiblity to move to a different endpoint). We are in the process of updating this code and contributing it to RDF4J, with the first contribution in the develop/4.0.0 branch I think a number of current customizations in WDQS can be moved to a front RDF4J layer. Then the RDF4J sail/repository layer can be used to preserve flexibility. So that WDQS can more easily switch between backend databases in the future. One large difference between UniProt and WDQS is that WikiData is continually updated while UniProt is batch released a few times a year. WDQS is somewhat easier in some areas and more difficult in others because of that. Regards, Jerven [1] No Database is perfect, but it does scale a lot better than Blazegraph did. Which we also evaluated in the past. There is still a lot of potential in Virtuoso to scale even better in the future. On 23/08/2021 21:36, Samuel Klein wrote: > Ah, that's lovely. Thanks for the update, Kingsley! Uniprot is a good > parallel to keep in mind. > > For Egon, Andra, others who work with them: Is there someone you'd > recommend chatting with at uniprot? > "scaling alongside uniprot" or at least engaging them on how to solve > shared + comparable issues (they also offer authentication-free SPARQL > querying) sounds like a compelling option. > > S. > > On Thu, Aug 19, 2021 at 4:32 PM Kingsley Idehen via Wikidata > <wikidata@lists.wikimedia.org <mailto:wikidata@lists.wikimedia.org> <mailto:wikidata@lists.wikimedia.org <mailto:wikidata@lists.wikimedia.org>>> wrote: > > On 8/18/21 5:07 PM, Mike Pham wrote: >> >> Wikidata community members, >> >> >> Thank you for all of your work helping Wikidata grow and improve >> over the years. In the spirit of better communication, we would >> like to take this opportunity to share some of the current >> challenges Wikidata Query Service (WDQS) is facing, and some >> strategies we have for dealing with them. >> >> >> WDQS currently risks failing to provide acceptable service quality >> due to the following reasons: >> >> 1. >> >> Blazegraph scaling >> >> 1. >> >> Graph size. WDQS uses Blazegraph as our graph backend. 
>> While Blazegraph can theoretically support 50 billion >> edges <https://blazegraph.com/ <https://blazegraph.com/>>, in reality Wikidata is >> the largest graph we know of running on Blazegraph (~13 >> billion triples >> <https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=7&orgId=1&refresh=1m <https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=7&orgId=1&refresh=1m>>), >> and there is a risk that we will reach a size >> <https://www.w3.org/wiki/LargeTripleStores#Bigdata.28R.29_.2812.7B.29 <https://www.w3.org/wiki/LargeTripleStores#Bigdata.28R.29_.2812.7B.29>>limit >> of what it can realistically support >> <https://phabricator.wikimedia.org/T213210 <https://phabricator.wikimedia.org/T213210>>. Once >> Blazegraph is maxed out, WDQS can no longer be updated. >> This will also break Wikidata tools that rely on WDQS. >> >> 2. >> >> Software support. Blazegraph is end of life software, >> which is no longer actively maintained, making it an >> unsustainable backend to continue moving forward with long >> term. >> >> >> Blazegraph maxing out in size poses the greatest risk for >> catastrophic failure, as it would effectively prevent WDQS from >> being updated further, and inevitably fall out of date. Our long >> term strategy to address this is to move to a new graph backend >> that best meets our WDQS needs and is actively maintained, and >> begin the migration off of Blazegraph as soon as a viable >> alternative is identified >> <https://phabricator.wikimedia.org/T206560 <https://phabricator.wikimedia.org/T206560>>. >> > > Hi Mike, > > Do bear in mind that pre and post selection of Blazegraph for > Wikidata, we've always offered an RDF-based DBMS that can handle > current and future requirements for Wikidata, just as we do DBpedia. > > At the time of our first rendezvous, handling 50 billion triples > would have typically required our Cluster Edition which is a > Commercial Only offering -- basically, that was the deal breaker > back then. > > Anyway, in recent times, our Open Source Edition has evolved to > handle some 80 Billion+ triples (exemplified by the live Uniprot > instance) where performance and scale is primary a function of > available memory. > > I hope this helps. 
> > Related: > > [1] https://wikidata.demo.openlinksw.com/sparql <https://wikidata.demo.openlinksw.com/sparql> > <https://wikidata.demo.openlinksw.com/sparql <https://wikidata.demo.openlinksw.com/sparql>>-- Our Live Wikidata > SPARQL Query Endpoint > [2] > https://docs.google.com/spreadsheets/d/15AXnxMgKyCvLPil_QeGC0DiXOP-Hu8Ln97fZ683ZQF0/edit#gid=0 <https://docs.google.com/spreadsheets/d/15AXnxMgKyCvLPil_QeGC0DiXOP-Hu8Ln97fZ683ZQF0/edit#gid=0> > <https://docs.google.com/spreadsheets/d/15AXnxMgKyCvLPil_QeGC0DiXOP-Hu8Ln97fZ683ZQF0/edit#gid=0 <https://docs.google.com/spreadsheets/d/15AXnxMgKyCvLPil_QeGC0DiXOP-Hu8Ln97fZ683ZQF0/edit#gid=0>> > -- Google Spreadsheet about various Virtuoso Configurations > associated with some well-known public endpoints > [3] https://t.co/EjAAO73wwE <https://t.co/EjAAO73wwE> <https://t.co/EjAAO73wwE <https://t.co/EjAAO73wwE>> -- this query > doesn't complete with the current Blazegraph-based Wikidata endpoint > [4] https://t.co/GTATPPJNBI <https://t.co/GTATPPJNBI> <https://t.co/GTATPPJNBI <https://t.co/GTATPPJNBI>> -- same query > completing when applied to the Virtuoso-based endpoint > [5] https://t.co/X7mLmcYC69 <https://t.co/X7mLmcYC69> <https://t.co/X7mLmcYC69 <https://t.co/X7mLmcYC69>> -- about > loading Wikidata's datasets into a Virtuoso instance > [6] > https://twitter.com/search?q=%23Wikidata%20%23VirtuosoRDBMS%20%40kidehen&src=typed_query&f=live <https://twitter.com/search?q=%23Wikidata%20%23VirtuosoRDBMS%20%40kidehen&src=typed_query&f=live> > <https://twitter.com/search?q=%2523Wikidata%20%2523VirtuosoRDBMS%20%2540kidehen&src=typed_query&f=live <https://twitter.com/search?q=%2523Wikidata%20%2523VirtuosoRDBMS%20%2540kidehen&src=typed_query&f=live>> > -- various demos shared via Twitter over the years regarding Wikidata > > -- > Regards, > > Kingsley Idehen > Founder & CEO > OpenLink Software > Home Page:http://www.openlinksw.com <http://www.openlinksw.com> <http://www.openlinksw.com <http://www.openlinksw.com>> > Community Support:https://community.openlinksw.com <https://community.openlinksw.com> <https://community.openlinksw.com <https://community.openlinksw.com>> > Weblogs (Blogs): > Company Blog:https://medium.com/openlink-software-blog <https://medium.com/openlink-software-blog> <https://medium.com/openlink-software-blog <https://medium.com/openlink-software-blog>> > Virtuoso Blog:https://medium.com/virtuoso-blog <https://medium.com/virtuoso-blog> <https://medium.com/virtuoso-blog <https://medium.com/virtuoso-blog>> > Data Access Drivers Blog:https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers <https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers> <https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers <https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers>> > > Personal Weblogs (Blogs): > Medium Blog:https://medium.com/@kidehen <https://medium.com/@kidehen> <https://medium.com/@kidehen <https://medium.com/@kidehen>> > Legacy Blogs:http://www.openlinksw.com/blog/~kidehen/ <http://www.openlinksw.com/blog/~kidehen/> <http://www.openlinksw.com/blog/~kidehen/ <http://www.openlinksw.com/blog/~kidehen/>> > http://kidehen.blogspot.com <http://kidehen.blogspot.com> <http://kidehen.blogspot.com <http://kidehen.blogspot.com>> > > Profile Pages: > Pinterest:https://www.pinterest.com/kidehen/ <https://www.pinterest.com/kidehen/> <https://www.pinterest.com/kidehen/ <https://www.pinterest.com/kidehen/>> > Quora:https://www.quora.com/profile/Kingsley-Uyi-Idehen <https://www.quora.com/profile/Kingsley-Uyi-Idehen> 
>
> --
> Samuel Klein @metasj w:user:sj +1 617 529 4266

--
Jerven Tjalling Bolleman
Principal Software Developer
SIB | Swiss Institute of Bioinformatics
1, rue Michel Servet - CH 1211 Geneva 4 - Switzerland
t +41 22 379 58 85
Jerven.Bolleman@sib.swiss - www.sib.swiss

_______________________________________________
Wikidata mailing list -- wikidata@lists.wikimedia.org
To unsubscribe send an email to wikidata-leave@lists.wikimedia.org
-- Samuel Klein @metasj w:user:sj +1 617 529 4266