Hello all!
First of all, my apologies for the long silence. We need to do better at communicating, and I'll try my best to send a monthly update from now on. Keep me honest and remind me if I fail.
First, we had a security incident at the end of December, which forced us to move from our Kafka-based update stream back to the RecentChanges poller. The details are still private, but you will be able to get the full story soon on Phabricator [1]. The RecentChanges poller is less efficient, and this is leading to high update lag again (just when we thought we had things slightly under control). We tried to mitigate this by improving the parallelism in the updater [2], which helped a bit, but not as much as we need.
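For context, the RecentChanges poller has to repeatedly ask the MediaWiki API for new edits, one HTTP round trip per batch, instead of receiving a push-based stream. A minimal sketch of one poll's request parameters (the helper name and batch size are illustrative; the API parameters themselves are the standard `list=recentchanges` ones):

```python
def recentchanges_params(since_iso, batch_size=100):
    """Build query parameters for one poll of MediaWiki's
    list=recentchanges API. A poller issues these requests in a loop,
    resuming from the last seen timestamp, which is inherently less
    efficient than consuming a push-based Kafka stream."""
    return {
        "action": "query",
        "list": "recentchanges",
        "rcstart": since_iso,   # resume from the last seen timestamp
        "rcdir": "newer",       # oldest first, so ordering is preserved
        "rclimit": batch_size,
        "rcprop": "title|ids|timestamp",
        "format": "json",
    }
```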
Another attempt to get update lag under control is to apply back pressure on edits by adding the WDQS update lag to the Wikidata maxlag [6]. This is obviously less than ideal (at least as long as WDQS updates lag as often as they do), but it does allow the service to recover from time to time. We probably need to iterate on this: provide better granularity and differentiate better between operations that affect update lag and those that don't.
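For bot and tool authors, cooperating with this back pressure means sending the standard `maxlag` parameter on write requests and backing off when the API refuses them. A minimal sketch, assuming the usual MediaWiki maxlag error shape (the helper names are illustrative):

```python
def with_maxlag(params, maxlag=5):
    # Ask the API to reject the write when the reported lag exceeds
    # `maxlag` seconds; 5 is the commonly recommended value for bots.
    return {**params, "maxlag": maxlag}

def is_lagged(response_json):
    # MediaWiki signals back pressure with error code "maxlag"; a
    # well-behaved bot waits and retries instead of hammering the API.
    return response_json.get("error", {}).get("code") == "maxlag"
```

In practice a bot would sleep before retrying a lagged request (for example, for the number of seconds suggested by the Retry-After header).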
On the slightly better news side, we now have a much better understanding of the update process and of its shortcomings. The current process does a full diff between each updated entity and what we have in Blazegraph. Even if a single triple needs to change, we still read tons of data from Blazegraph. While this approach is simple and robust, it is obviously not efficient. We need to rewrite the updater to take a more event-streaming / reactive approach and only work on the actual changes. This is a big chunk of work (almost a complete rewrite of the updater), and we need a new solution to stream changes with guaranteed ordering (something that our Kafka queues don't offer). This is where we are focusing our energy at the moment, as it looks like the best option to improve the situation in the medium term. This change will probably have some functional impacts [3].
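To illustrate the difference between the two strategies, here is a toy model treating an entity as a set of triples (the function names and the change-event shape are made up for illustration):

```python
def full_diff_update(stored_triples, new_triples):
    """Current approach: read *all* of the entity's triples from the
    store and diff them against the new version. Simple and robust,
    but the read cost scales with entity size even when only a single
    triple changed."""
    to_delete = stored_triples - new_triples
    to_insert = new_triples - stored_triples
    return to_delete, to_insert

def streamed_update(change_event):
    """Reactive approach: the (ordered) change event itself carries
    only what was added and removed, so no full read is needed."""
    return set(change_event["removed"]), set(change_event["added"])
```

Both produce the same delete/insert sets; the difference is that the streamed version never has to touch the unchanged bulk of a large entity.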
Some misc things:
We have done some work to get better metrics and a better understanding of what's going on: collecting more metrics during the update [4], loading RDF dumps into Hadoop for further analysis [5], and better logging of SPARQL requests. We are deferring the deeper analysis until we are in a more stable situation regarding update lag.
We have a new team member working on WDQS. He is still ramping up, but we should have a bit more capacity from now on.
Some longer term thoughts:
Keeping all of Wikidata in a single graph is most probably not going to work long term. We have not found examples of public SPARQL endpoints with > 10 billion triples, and there is probably a good reason for that. We will probably need to split the graphs at some point. We don't know how yet (that's why we loaded the dumps into Hadoop; that might give us some more insight). We might expose a subgraph with only truthy statements. Or have language-specific graphs, with only language-specific labels. Or something completely different.
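As a concrete example of what a truthy-only subgraph would mean for users: WDQS already distinguishes truthy direct triples (`wdt:`) from fully reified statements (`p:`/`ps:`). A truthy-only graph could answer the first query below but not the second, since qualifiers, references, and ranks hang off the statement node. The queries are plain WDQS SPARQL; the Python wrapper is just so they can be shown side by side:

```python
# Truthy form: one direct triple per best-rank statement.
TRUTHY_QUERY = """
SELECT ?country WHERE {
  ?country wdt:P31 wd:Q6256 .   # instance of: country
}
"""

# Full statement form: goes through a statement node, which is what
# makes ranks, qualifiers, and references reachable -- and what makes
# the full graph so much larger.
FULL_QUERY = """
SELECT ?country ?rank WHERE {
  ?country p:P31 ?stmt .
  ?stmt ps:P31 wd:Q6256 ;
        wikibase:rank ?rank .
}
"""
```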
Keeping WDQS / Wikidata as open as they are at the moment might not be possible in the long term. We need to think about if / how we want to implement some form of authentication and quotas, potentially increasing quotas for some use cases but keeping them strict for others. Again, we don't know what this will look like, but we're thinking about it.
What you can do to help:
Again, we're not sure. Of course, reducing the load (both in terms of edits on Wikidata and of reads on WDQS) will help. But not using those services makes them useless.
We suspect that some use cases are more expensive than others (a single property change to a large entity requires a comparatively huge amount of work to update on the WDQS side). We'd like to have real data on the cost of various operations, but we only have guesses at this point.
If you've read this far, thanks a lot for your engagement!
Have fun!
Guillaume
[1] https://phabricator.wikimedia.org/T241410
[2] https://phabricator.wikimedia.org/T238045
[3] https://phabricator.wikimedia.org/T244341
[4] https://phabricator.wikimedia.org/T239908
[5] https://phabricator.wikimedia.org/T241125
[6] https://phabricator.wikimedia.org/T221774
Thank you Guillaume. When do you expect a public update on the security incident [1]? Is any of our personal and private data (email, password, etc.) affected?
best, Marco
[1] https://phabricator.wikimedia.org/T241410
On Fri, Feb 7, 2020 at 2:54 PM Marco Neumann marco.neumann@gmail.com wrote:
Thank you Guillaume. When do you expect a public update on the security incident [1]? Is any of our personal and private data (email, password, etc.) affected?
It should be made public in the next few days. I'm not going to go into any more details until this is made public, but overall, don't worry too much.
On Fri, Feb 7, 2020 at 5:18 PM Guillaume Lederrey glederrey@wikimedia.org wrote:
It should be made public in the next few days. I'm not going to go into any more details until this is made public, but overall, don't worry too much.
Corrections and apologies on what I said above. We are not actually ready to make this ticket public. The underlying issue is under control and does not require any user action to mitigate. Given the security aspect, I'm not going to do any further communication on this.
Sorry to have been misleading on this.
Enjoy your day!
Guillaume
I accept your apology, Guillaume. No worries.
Regards, Marco
Better update granularity would probably help and may be worth prioritizing.
It is (still) unclear to me as a tool writer whether I can do anything. For instance, it is not clear to me whether the parallel SPARQL queries that come when a user visits a Scholia page are significant for the load on WDQS (not likely) or minuscule (likely).
As far as I understand from http://ceur-ws.org/Vol-2073/article-03.pdf, much of the query load comes via Magnus's tools. I presume another big chunk is from the Gene Wiki people.
If robotic queries are a source of problems, then tool writers/users can do something about it. But fixing issues would require the WMF to tell us whether they really are a problem and what the problems are.
best regards Finn
On Fri, Feb 7, 2020 at 3:12 PM fn@imm.dtu.dk wrote:
Better update granularity would probably help and may be worth prioritizing.
It is (still) unclear to me as a tool writer whether I can do anything. For instance, it is not clear to me whether the parallel SPARQL queries that come when a user visits a Scholia page are significant for the load on WDQS (not likely) or minuscule (likely).
Sadly, I don't have a good answer to that at the moment. The work we've done to better log queries and their context should help us to get some of that understanding.
In the meantime, query runtime is a good proxy for resource cost. If your queries have an aggregate runtime of 100 ms per minute, don't worry about it. If your queries have an aggregate runtime of 30 seconds per minute, there is probably a need to do something. The same goes if you have individual queries regularly running for more than 10 seconds.
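That rule of thumb is easy to check against your own logs. A minimal sketch, assuming you can extract (start time, runtime) pairs for your tool's queries (the helper names and thresholds below just encode the numbers above):

```python
from collections import defaultdict

def runtime_per_minute(queries):
    """queries: iterable of (start_epoch_seconds, runtime_seconds).
    Returns total query runtime aggregated per wall-clock minute."""
    buckets = defaultdict(float)
    for start, runtime in queries:
        buckets[int(start) // 60] += runtime
    return dict(buckets)

def looks_heavy(queries, per_minute_budget=30.0, single_query_limit=10.0):
    # Flags the two situations mentioned above: more than 30 s of
    # aggregate runtime in some minute, or individual queries running
    # for more than 10 s.
    heavy_minute = any(total > per_minute_budget
                       for total in runtime_per_minute(queries).values())
    slow_query = any(runtime > single_query_limit for _, runtime in queries)
    return heavy_minute or slow_query
```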
As far as I understand from http://ceur-ws.org/Vol-2073/article-03.pdf, much of the query load comes via Magnus. I presume another big chunk is from the genewiki people.
If robotic queries are sources of problems, then tool writers/users can do something. But fixing issues would require the WMF to tell us whether it really is a problem and what the problems are.
Yep, we're working on that! But our highest priority at the moment is rewriting the updater to be more efficient. Once this is done, we should have some free cycles for a better analysis.
Best regards, Finn
On 07.02.20 14:32, Guillaume Lederrey wrote:
Keeping all of Wikidata in a single graph is most probably not going to work long term. We have not found examples of public SPARQL endpoints with > 10 B triples and there is probably a good reason for that. We will probably need to split the graphs at some point. We don't know how yet (that's why we loaded the dumps into Hadoop, that might give us some more insight). We might expose a subgraph with only truthy statements. Or have language specific graphs, with only language specific labels. Or something completely different.
I have not looked in detail at query runtimes, nor at how Blazegraph indexing works internally, but I noticed that in many cases queries involving SPARQL property paths (and especially joins of those) take a long time to run. At the same time, I recently discovered that if we only store which entity is connected to which other entity (without storing the actual statement details, like property, qualifiers or ranks), those connections only take up about 2 GB compressed with Zstandard (I represented each connection as <32 bit int source entity> <32 bit int destination entity>). Of course that discards a lot of important information, but it made me wonder whether something could be done to evaluate queries more efficiently, given the relatively strict schema that the RDF representation of Wikidata adheres to (since it is generated from a more structured form, statements). As an example, Blazegraph doesn't know the relationship between wdt:Pxxx and p:Pxxx, or even things like p:Pxxx/ps:Pxxx.
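As a rough illustration of the packed representation described above, here is a minimal sketch (using zlib from the standard library instead of Zstandard so the snippet stays dependency-free; the entity IDs are toy values):

```python
import struct
import zlib

def pack_edges(edges):
    """Pack (source, destination) entity-ID pairs as consecutive
    unsigned 32-bit little-endian ints, then compress the byte string."""
    raw = b"".join(struct.pack("<II", src, dst) for src, dst in edges)
    return zlib.compress(raw)

def unpack_edges(blob):
    raw = zlib.decompress(blob)
    return [struct.unpack_from("<II", raw, off) for off in range(0, len(raw), 8)]

# Hypothetical toy graph: Q42 -> Q5, Q42 -> Q6581097
edges = [(42, 5), (42, 6581097)]
assert unpack_edges(pack_edges(edges)) == edges
```

The point is the density: 8 bytes per connection before compression, which is why the whole connectivity graph fits in a couple of gigabytes once statement details are dropped.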
Another, somewhat related idea: perhaps it's possible to keep the SPARQL interface for the frontend, but use a more efficient, split representation of the graph in the backend? I'm not sure how different that would be from the indexing that Blazegraph does already, though.
Regards,
Benno
PS: apologies to Guillaume if you receive this mail twice; I clicked the wrong button when replying
I don't know if this is helpful, as I'm not very familiar with Wikidata's infrastructure, but an idea that was discussed in the Wikimedia Strategy 2030 process is charging real money to organizations that consume large amounts of data from the Wikimedia API. By extension, an idea to consider is charging real money to consumers that want to use Wikidata services in resource-intensive ways. That would have several potential benefits: charging for resource-intensive requests could make consumers more value-conscious when deciding which queries to run, it would probably reduce the workload on WMF's end, the reduced workload could lead to faster performance, and the money could be used for the maintenance and/or upgrade of Wikidata services. I think that offering free services to consumers who are not making resource-intensive requests is good, but I am also fine with charging real money for resource-intensive requests. For anyone who wants to make a resource-intensive request and is unwilling or unable to pay accordingly, an option is to put them into a "slow lane" where their requests will be fulfilled, but at a slower pace than paid resource-intensive requests.
Hoi, In my opinion, Wikidata is a flagship project of the Wikimedia Foundation. With the current underperformance vis-à-vis demand, it is not only a technical question what to do; it is also an organisational issue.
As I have written in several blog posts, the current underperformance affects the projects using Wikidata as their infrastructure. What is the official response to the underachievement of a key and strategic resource? Thanks, GerardM
http://ultimategerardm.blogspot.com/2020/02/dear-krmaher-wikipedia-is-not-fl...
Indeed, having the Wikidata Query Service (WDQS) responsibility under a separate sub-organization, known as the 'Search Platform', is mischief.
On Fri, 7 Feb 2020 at 19:01, Pine W wiki.pine@gmail.com wrote:
I don't know if this is helpful, as I'm not very familiar with Wikidata's infrastructure, but I think that an idea that was discussed in the Wikimedia Strategy 2030 process is charging real money to organizations that consume large amounts of data from the Wikimedia API. By extension, an idea to consider is charging real money to consumers that want to use Wikidata services in resource-intensive ways.
I was told that charging on a per-API-request basis is very difficult to get right in terms of software, because measuring things is in general difficult. Take, for instance, a failed query: it should not be charged, should it? That is the reason why most cell phone plans are unlimited.
Hello Amirouche,
Regarding "most cell phone plans" being unlimited, here in the United States there are many phone plans which are not unlimited. I don't know what the proportion of unlimited to limited users is.
My understanding is that Twitter charges money for the use of their API under some circumstances. See https://techcrunch.com/2019/03/19/twitter-developer-review/. If Twitter can be successful with this then I would think that WMF can too, although in WMF's case the goals do not include profits for shareholders.
Pine ( https://meta.wikimedia.org/wiki/User:Pine )
Hi,
if you need some "Wikibase item diff" function, have a look at the Rust crate I am co-authoring: https://gitlab.com/tobias47n9e/wikibase_rs
It comes with diff code: https://gitlab.com/tobias47n9e/wikibase_rs/-/blob/master/src/entity_diff.rs
It should not be too hard to build, e.g., a simple diff command-line tool from that.
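For readers who want the idea without pulling in the Rust crate, the core of an entity diff is small. A minimal sketch (the flattened (property, value) representation is a simplification for illustration, not the wikibase_rs data model):

```python
def entity_diff(old, new):
    """Diff two entities represented as sets of (property, value) pairs,
    returning what an updater would need to delete and insert."""
    removed = old - new
    added = new - old
    return removed, added

# Toy example: one statement added between two revisions of an entity.
old = {("P31", "Q5"), ("P106", "Q82594")}
new = {("P31", "Q5"), ("P106", "Q82594"), ("P569", "1952-03-11")}
removed, added = entity_diff(old, new)
assert removed == set()
assert added == {("P569", "1952-03-11")}
```

This is exactly the shape of work the updater rewrite aims for: touch only the changed triples rather than re-reading the whole entity.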
Cheers, Magnus
Hello Guillaume,
On Fri, 7 Feb 2020 at 14:33, Guillaume Lederrey glederrey@wikimedia.org wrote:
Hello all!
First of all, my apologies for the long silence. We need to do better in terms of communication. I'll try my best to send a monthly update from now on. Keep me honest, remind me if I fail.
It will be nice to have some feedback on my grant request at:
https://meta.wikimedia.org/wiki/Grants:Project/Future-proof_WDQS
Or one of the other threads on the very same mailing list.
Another attempt to get update lag under control is to apply back pressure on edits, by adding the WDQS update lag to the Wikidata maxlag [6]. This is obviously less than ideal (at least as long as WDQS updates are lagging as often as they are), but does allow the service to recover from time to time. We probably need to iterate on this, provide better granularity, differentiate better between operations that have an impact on update lag and those which don't.
On the slightly better news side, we now have a much better understanding of the update process and of its shortcomings. The current process does a full diff between each updated entity and what we have in blazegraph. Even if a single triple needs to change, we still read tons of data from Blazegraph. While this approach is simple and robust, it is obviously not efficient. We need to rewrite the updater to take a more event streaming / reactive approach, and only work on the actual changes.
Even when that is done, it will still be a short-term solution
This is a big chunk of work, almost a complete rewrite of the updater,
and we need a new solution to stream changes with guaranteed ordering (something that our kafka queues don't offer). This is where we are focusing our energy at the moment, this looks like the best option to improve the situation in the medium term. This change will probably have some functional impacts [3].
Guaranteed ordering in a multi-party distributed setting has no easy solution, and apparently it is not provided by Kafka. For a non-technical person, you can read https://en.wikipedia.org/wiki/Two_Generals%27_Problem
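For context, Kafka does guarantee ordering within a single partition; a common workaround is therefore to partition the change stream by entity ID and order changes within each entity by revision number. A minimal sketch (the event field names are hypothetical):

```python
from collections import defaultdict

def order_per_entity(events):
    """Group change events by entity and sort each group by revision.
    This gives per-entity ordering, which is what a per-entity diff
    needs, without requiring a global total order across the stream."""
    per_entity = defaultdict(list)
    for ev in events:
        per_entity[ev["entity"]].append(ev)
    for evs in per_entity.values():
        evs.sort(key=lambda ev: ev["revision"])
    return per_entity

events = [
    {"entity": "Q42", "revision": 7, "change": "del"},
    {"entity": "Q1", "revision": 3, "change": "add"},
    {"entity": "Q42", "revision": 5, "change": "add"},
]
ordered = order_per_entity(events)
assert [ev["revision"] for ev in ordered["Q42"]] == [5, 7]
```

Whether this is enough depends on whether changes ever span multiple entities, which is where the hard distributed-ordering problems come back.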
Some longer term thoughts:
Keeping all of Wikidata in a single graph is most probably not going to work long term.
:(
We have not found examples of public SPARQL endpoints with > 10 B triples and there is probably a good reason for that.
Because Wikimedia is the only non-profit in the field?
We will probably need to split the graphs at some point.
:(
We don't know how yet
:(
(that's why we loaded the dumps into Hadoop, that might give us some more insight).
:(
We might expose a subgraph with only truthy statements. Or have language-specific graphs, with only language-specific labels.
:(
Or something completely different.
:)
Keeping WDQS / Wikidata as open as they are at the moment might not be possible in the long term. We need to think if / how we want to implement some form of authentication and quotas.
With blacklists and whitelists, but this is huge anyway.
Potentially increasing quotas for some use cases, but keeping them strict for others. Again, we don't know what this will look like, but we're thinking about it.
What you can do to help:
Again, we're not sure. Of course, reducing the load (both in terms of edits on Wikidata and of reads on WDQS) will help. But not using those services makes them useless.
What about making the lag part of the service? I mean, you could reload WDQS periodically, for instance daily, and drop the updater altogether. Who needs to see edits live in WDQS as soon as they are made in Wikidata?
We suspect that some use cases are more expensive than others (a single property change to a large entity will require a comparatively insane amount of work to update it on the WDQS side). We'd like to have real data on the cost of various operations, but we only have guesses at this point.
If you've read this far, thanks a lot for your engagement!
Have fun!
Will do.
On Tue, Feb 11, 2020, 12:11 AM Amirouche Boubekki amirouche.boubekki@gmail.com wrote:
Again, we're not sure. Of course, reducing the load (both in terms of
edits on Wikidata and of reads on WDQS) will help. But not using those services makes them useless.
What about making the lag part of the service. I mean, you could reload WDQS periodically, for instance daily, and drop the updater altogether. Who needs to see the updates live in WDQS as soon as edits are done in wikidata?
I think Wikidata editors are the ones who really need up-to-date query results as a form of feedback (e.g., to check if imports are okay or QuickStatements batches are done right).
It might make sense to explore having two WDQS instances:
1. A public instance that is only updated daily or even weekly
2. A community instance that is up-to-date, where you need to log in with your Wikimedia user account, and where we can therefore control access via whitelisting or blacklisting
Why all the sad faces? The Semantic Web will be distributed after all, and there is no need to stuff everything into one graph. It just requires us as an RDF community to spend more time developing ideas around efficient query distribution, and to focus on relationships and links in Wikidata rather than building a monolithic database for humongous arbitrary joins and table scans as a free-for-all. The slogan "sum of all human knowledge" in one place should not be taken too literally.
It's, I believe, what Wikidata as a project already does in any event. Actually, the SPARQL endpoint, as an extension to the Wikidata architecture around Wikibase, should be used more proactively to connect multiple RDF data providers for search. I would think that this is actually already a common use case for Wikidata users who enrich their remote queries with Wikidata data.
All that said, it's quite an achievement to scale the Wikidata SPARQL endpoint to where it is now. Congratulations to the team, and I look forward to seeing more of it in the future.
On Mon, 10 Feb 2020 at 18:23, Marco Neumann marco.neumann@gmail.com wrote:
why all the sad faces?
the Semantic Web will be distributed after all
The Semantic Web is already distributed.
and there is no need to stuff everything into one graph.
Everything in one graph, or if you prefer in one place, is the gist of the idea of a library or an encyclopedia.
it just requires us as an RDF community to spend more time developing ideas around efficient query distribution
Maybe. But that does not preclude the aggregation or sum of knowledge from happening.
and focus on relationships and links in wikidata
Like I wrote above, a distributed knowledge base is already the state of things. I am not sure how to understand that part of the sentence.
rather than building a monolithic database
That is the gist of my proposal. Without the ability to run wikidata at a small scale, WMF will fail at knowledge equity.
for humongous arbitrary joins and table scans
I proposed something along the lines of https://linkeddatafragments.org, also known as "thin server, thick client". I had no feedback :(
as a free for all.
With that, I heartily agree. With the ability to downscale the Wikidata infrastructure, and to make companies and institutions pay for the stream of changes to apply to their local instance, things would be much easier.
The slogan "sum of all human knowledge" in one place should not be taken too literally.
I disagree.
it's I believe what wikidata as a project already does in any event, actually, the SPARQL endpoint as an extension to the wikidata architecture around wikibase should be used more pro-actively to connect multiple RDF data providers for search.
Read my proposal at https://meta.wikimedia.org/wiki/Grants:Project/Future-proof_WDQS
The title is misleading; I intended to change it to "Future-proof Wikidata". WDQS, or querying, is an integral part of Wikidata and must not be merely an add-on.
I would think that this is already a common use case for wikidata users who enrich their remote queries with wikidata data.
I do not understand. Yes, people enrich wikidata queries with their data. And?
All that said it's quite an achievement to scale the wikidata SPARQL endpoint to where it is now. Congratulations to the team and I look forward to seeing more of it in the future.
Yes, I agree with that. Congratulations! I am very proud to be part of the Wikimedia community.
The current WMF proposal is called "sharding"; see details at:
https://en.wikipedia.org/wiki/Shard_(database_architecture)
It is not future-proof. I have not done any analysis, but I bet that most of the 2TB of wikidata is English, so even if you shard by language, you will still end up with a gigantic graph. Also, most of the data is not specific to a natural language, so one cannot possibly split the data by language.
If WMF comes up with another sharding strategy, how will edits that span multiple regions happen?
How will it make entering the wikidata party easier?
I dare to write in the open: it seems to me like we are witnessing an "Earth is flat vs. Earth is not flat" kind of event.
Thanks for the reply!
On Mon, Feb 10, 2020 at 4:11 PM Amirouche Boubekki amirouche.boubekki@gmail.com wrote:
Hello Guillaume,
On Fri, Feb 7, 2020 at 2:33 PM, Guillaume Lederrey glederrey@wikimedia.org wrote:
Hello all!
First of all, my apologies for the long silence. We need to do better in terms of communication. I'll try my best to send a monthly update from now on. Keep me honest, remind me if I fail.
It will be nice to have some feedback on my grant request at:
https://meta.wikimedia.org/wiki/Grants:Project/Future-proof_WDQS
Or one of the other threads on the very same mailing list.
Another attempt to get update lag under control is to apply back pressure on edits, by adding the WDQS update lag to the Wikidata maxlag [6]. This is obviously less than ideal (at least as long as WDQS updates are lagging as often as they are), but does allow the service to recover from time to time. We probably need to iterate on this, provide better granularity, and differentiate better between operations that have an impact on update lag and those that don't.
On the slightly better news side, we now have a much better understanding of the update process and of its shortcomings. The current process does a full diff between each updated entity and what we have in blazegraph. Even if a single triple needs to change, we still read tons of data from Blazegraph. While this approach is simple and robust, it is obviously not efficient. We need to rewrite the updater to take a more event streaming / reactive approach, and only work on the actual changes.
Even when that is done, it will still be a short-term solution.
This is a big chunk of work, almost a complete rewrite of the updater,
and we need a new solution to stream changes with guaranteed ordering (something that our kafka queues don't offer). This is where we are focusing our energy at the moment, this looks like the best option to improve the situation in the medium term. This change will probably have some functional impacts [3].
Guaranteed ordering in a multi-party distributed setting has no easy solution, and apparently it is not provided by Kafka. For a non-technical introduction, see https://en.wikipedia.org/wiki/Two_Generals%27_Problem
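For illustration, one common way to cope with the lack of global ordering is to make the consumer version-aware: if every change event carries the entity id and a monotonically increasing revision number, out-of-order or duplicate deliveries can simply be discarded. This is only a sketch, not the actual WDQS updater; the event shape here is invented.

```python
def apply_events(events):
    """Apply change events, keeping only the newest revision per entity."""
    latest = {}   # entity id -> highest revision applied so far
    applied = []
    for entity, revision, data in events:
        if revision <= latest.get(entity, -1):
            continue  # stale or duplicate event: safe to skip
        latest[entity] = revision
        applied.append((entity, revision, data))
    return applied

# Events for Q42 arrive out of order; the late revision 2 is discarded
# because revision 3 was already applied.
events = [("Q42", 1, "a"), ("Q7", 1, "x"), ("Q42", 3, "c"), ("Q42", 2, "b")]
print(apply_events(events))
```

The trade-off is that this gives last-write-wins semantics per entity rather than a faithful replay of every intermediate state, which may or may not be acceptable for a query service.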
Some longer term thoughts:
Keeping all of Wikidata in a single graph is most probably not going to work long term.
:(
We have not found examples of public SPARQL endpoints with > 10 B triples and there is probably a good reason for that.
Because Wikimedia is the only non-profit in the field?
We will probably need to split the graphs at some point.
:(
We don't know how yet
:(
(that's why we loaded the dumps into Hadoop, that might give us some more insight).
:(
We might expose a subgraph with only truthy statements. Or have language-specific graphs, with only language-specific labels.
:(
Or something completely different.
:)
Keeping WDQS / Wikidata as open as they are at the moment might not be possible in the long term. We need to think if / how we want to implement some form of authentication and quotas.
That could be done with blacklists and whitelists, but it is a huge undertaking either way.
Potentially increasing quotas for some use cases, but keeping them strict for others. Again, we don't know what this will look like, but we're thinking about it.
What you can do to help:
Again, we're not sure. Of course, reducing the load (both in terms of edits on Wikidata and of reads on WDQS) will help. But not using those services makes them useless.
What about making the lag part of the service? I mean, you could reload WDQS periodically, for instance daily, and drop the updater altogether. Who needs to see the updates live in WDQS as soon as edits are done in wikidata?
We suspect that some use cases are more expensive than others (a single property change to a large entity will require a comparatively insane amount of work to update it on the WDQS side). We'd like to have real data on the cost of various operations, but we only have guesses at this point.
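The cost asymmetry described above can be made concrete with a toy sketch (not the real updater): a full diff compares the complete triple set stored for an entity against the new dump, so even a one-triple change forces work proportional to the entity's size, not to the size of the change.

```python
def full_diff(stored, updated):
    """Return (to_delete, to_insert) by comparing complete triple sets."""
    stored, updated = set(stored), set(updated)
    return stored - updated, updated - stored

# A large entity with 100,000 triples (hypothetical data), of which exactly
# one value changes.
big_entity = {("Q64", f"P{i}", f"v{i}") for i in range(100_000)}
changed = (big_entity - {("Q64", "P1", "v1")}) | {("Q64", "P1", "v1-new")}

to_delete, to_insert = full_diff(big_entity, changed)
# Only one triple changed, yet both full triple sets had to be materialized
# and compared, which is the shortcoming the streaming rewrite targets.
print(len(to_delete), len(to_insert))
```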
If you've read this far, thanks a lot for your engagement!
Have fun!
Will do.
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
--
Marco Neumann KONA
The earth is not flat :)
I appreciate all of your thoughts in this thread, Amirouche. A linked data fragments approach w/ a [thick client] seems to me to be one useful option to explore and benchmark.
Other possible thoughts:
~ have some core highly-used subgraphs within which queries are lightning fast.
~ give queriers the option to search on fast subgraphs; and the option to set a quick query timeout.
~ give queriers quick estimates of how much load a query will impose
~ set the default query timeout to be quite quick (while letting any user raise their default to some cap, just like we can set how many results we want to see on RC / history pages)
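The quick-timeout idea can already be approximated from the client side: give up on a query fast instead of waiting on a busy server. A minimal sketch against the public WDQS SPARQL endpoint; the 2-second budget and the User-Agent string are arbitrary examples, not recommendations.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen
from urllib.error import URLError

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

def build_query_url(sparql):
    """URL for a WDQS GET request asking for SPARQL JSON results."""
    return f"{WDQS_ENDPOINT}?{urlencode({'query': sparql, 'format': 'json'})}"

def run_query(sparql, timeout_s=2.0):
    """Run a query with a hard client-side deadline; None if the budget is blown."""
    req = Request(build_query_url(sparql),
                  headers={"User-Agent": "wdqs-timeout-sketch/0.1"})
    try:
        with urlopen(req, timeout=timeout_s) as resp:
            return resp.read()
    except (URLError, TimeoutError):
        return None  # deadline exceeded or network error: fail fast

print(build_query_url("SELECT * WHERE { ?s ?p ?o } LIMIT 1"))
```

A server-side default timeout with per-user caps, as suggested above, would of course be stronger than this purely client-side discipline.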
//S
Hi,
I have limited knowledge of SPARQL and RDF so I might not have understood fully.
On Tue, 11 Feb 2020 at 07:05, Samuel Klein meta.sj@gmail.com wrote:
A linked data fragments approach w/ a [thick client] seems to me to be one useful option to explore and benchmark.
As far as I can see, WDQS does offer a linked data fragments service. See the documentation at https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Linked_Dat... and the service at https://query.wikidata.org/bigdata/ldf
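A hedged client sketch of the Triple Pattern Fragments interface behind that service: a TPF endpoint takes subject/predicate/object parameters and returns one page of matching triples, leaving joins to the client. The parameter names follow the spec at linkeddatafragments.org; check the endpoint's own documentation before relying on them.

```python
from urllib.parse import urlencode

LDF_ENDPOINT = "https://query.wikidata.org/bigdata/ldf"

def fragment_url(subject=None, predicate=None, obj=None):
    """Build the URL requesting one page of triples matching a pattern."""
    pattern = {"subject": subject, "predicate": predicate, "object": obj}
    return f"{LDF_ENDPOINT}?{urlencode({k: v for k, v in pattern.items() if v})}"

# All triples saying <something> P31 (instance of) Q5 (human). The heavy
# lifting of joining such fragments stays on the client, not the server.
url = fragment_url(
    predicate="http://www.wikidata.org/prop/direct/P31",
    obj="http://www.wikidata.org/entity/Q5",
)
print(url)
```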
Perhaps this is of some help?
Cheers, Tom
Hoi, I find it interesting that some do not need to see the result of what they do. What it tells me is that they deal with collections, big amounts of data that, like stamp collections, are dumped in Wikidata. What is the point of that? I consider them prime examples of what can be set aside.
I do need to know what the effects are of what I do. I add single items, link them to other items like awards and papers, and use tools like Scholia to consider the effects. I blog regularly, and typically it is based on the results that I see of what I do. It is of profound importance to people who edit like me that there is no lag.
Another thing to consider is that, given the bias in our projects, the worst thing we can do is make ghettos of everything non-English. It also totally destroys my approach, where I have listeria lists about Africa so that we can follow what is known about Africa in Wikidata. [1]
Again, what I notice is that the underperformance, the stagnation of Wikidata, is only considered as a technical issue. It has a huge effect on how Wikidata may be used; it is detrimental to all Wikimedia projects, and therefore it deserves a reaction from the board and the director of the Wikimedia Foundation. Thanks, GerardM
[1] https://en.wikipedia.org/wiki/User:GerardM/Africa
On Mon, 10 Feb 2020 at 16:11, Amirouche Boubekki amirouche.boubekki@gmail.com wrote:
Who needs to see the updates live in WDQS as soon as edits are done in wikidata?
https://www.wikidata.org/w/index.php?title=Wikidata:Contact_the_development_...
I am sorry to bring more problems to the table, but the indexing of lexemes in the "ordinary" elasticsearch-based search is now also often slow. The Q-items are also indexed slowly, but there you can at least type the Q-number into the edit field and it will look up the item. For L-numbers, I have not found a way to type them in, so one has to wait for some minutes before L-items are indexed.
An example use case is the entry of "fordømme" and "dømme", where one links to the other by P5238; see https://www.wikidata.org/wiki/Lexeme:L245454. As is apparent from the edit histories, I waited over 10 minutes for the indexing before I could link the two lexemes.
best regards Finn Årup Nielsen
Oh, wow, I just tried that out too, and indeed, if I remember correctly it used to be possible to link to the L-number very quickly, but now this is not the case anymore.
Weirdly enough, the SPARQL endpoint got updated with the new Lexeme very quickly. So I think these two things are not related.
On Mon, Feb 10, 2020 at 5:11 PM, Amirouche Boubekki amirouche.boubekki@gmail.com wrote:
It will be nice to have some feedback on my grant request at:
https://meta.wikimedia.org/wiki/Grants:Project/Future-proof_WDQS
Or one of the other threads on the very same mailing list.
Official feedback was written on the wiki: https://meta.wikimedia.org/wiki/Grants_talk:Project/Future-proof_WDQS#Some_m...
Thanks,
Amirouche ~ https://hyper.dev ~ "There is no city that is truly one other than this city that we are involved in bringing forth." Averroes