Performance and update versus query


Gerard Meijssen

26 Jun 2019

2:02 a.m.

Hoi, the performance of the query update is getting worse. Questions about this have been raised before. I do remember quality replies like "it is not exponential, so there is no problem". However, here we are and there is a problem.

The problem is that I run batch jobs, batch jobs that do not run [1]. I have the impression that they are put in some kind of suspended animation by a person. These jobs are submitted through the SourceMD tool by Magnus, and Magnus is well known for being responsive to suggestions on how he can improve them. So do not use as an argument that there is something wrong with these jobs. At most it is acceptable for these runs to be put on some kind of hold for the duration of a crisis, and then there has to be a release.

At the same time I notice that the reports indicating multiple items with the same ORCID iD include items that should have been picked up by earlier reports. I notice that the query does not pick up existing items with an ORCID iD, so new ones get created. For me this is an indication that Query is not reliable.

There is talk on the Wiki that there is no point in having fixed descriptions in anything but English. What caused this discussion is the sheer amount of updates needed just for one language. At the London Wikimania this perceived need for fixed descriptions was discussed vis a vis automated descriptions and as I recall the only argument for having them at all was "standards" in relation to dumps. Yes, automated descriptions may be cached and included in a dump.

I have been asked to write for the ORCiD blog and thereby in effect plug the relevance of the Scholia presentation for scientists. When I do, the number of jobs like the ones I run will mushroom. It is why I have not put anything forward so far because we cannot cope as it is.

The issues I see are:
* again, to what extent can we grow our content, both for query and update, in the short, medium and long term
* will batch jobs like mine be able to complete
* can we handle the attention when scholars discover how relevant Scholia is for them and for the subjects they care about
* do we care that the motivation of volunteers relies on the availability of sufficient performance to do the tasks they care about

Thanks, Gerard

[1] https://tools.wmflabs.org/sourcemd/?action=batches&user=GerardM


Guillaume Lederrey

27 Jun 2019

4:12 a.m.

Hello!

I'm not familiar with some of the issues you raised, but let me try a few guesses...

On Wed, Jun 26, 2019 at 8:02 AM Gerard Meijssen gerard.meijssen@gmail.com wrote:

> Hoi, the performance of the query update is getting worse. Questions about this have been raised before. I do remember quality replies like "it is not exponential, so there is no problem". However, here we are and there is a problem.
>
> The problem is that I run batch jobs, batch jobs that do not run [1]. I have the impression that they are put in some kind of suspended animation by a person. These jobs are submitted through the SourceMD tool by Magnus, and Magnus is well known for being responsive to suggestions on how he can improve them. So do not use as an argument that there is something wrong with these jobs. At most it is acceptable for these runs to be put on some kind of hold for the duration of a crisis, and then there has to be a release.

I'm not familiar with sourcemd, and the link you provided isn't very clear on what the actual error is. I'm just guessing, but maybe sourcemd assumes that updates to WDQS are synchronous, or quasi-synchronous. Another guess is that it might be subject to throttling and not backing off appropriately, and maybe it ends up being banned for some time. If anyone knows what user agent is used by sourcemd, I can have a look into the WDQS logs to get more information.
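
To make the "synchronous update" guess concrete, here is a minimal Python sketch of the kind of check a tool could run before trusting a WDQS lookup made shortly after an edit. It assumes the tool has already fetched the service's last-update timestamp, for example with the `schema:dateModified` query held in the string below. The names `is_query_fresh`, `edit_time`, `service_updated` and `margin` are hypothetical and are not taken from sourcemd or from WDQS itself.

```python
from datetime import datetime, timedelta, timezone

# SPARQL asking the query service when its data was last updated.
# Sending the query and parsing the reply are left out of this sketch.
LAST_UPDATE_QUERY = """
SELECT ?updated WHERE { <http://www.wikidata.org> schema:dateModified ?updated }
"""


def is_query_fresh(edit_time, service_updated, margin=timedelta(0)):
    """Return True if the query service has caught up with an edit.

    edit_time:       when the tool made its last relevant edit
    service_updated: last-update timestamp reported by the query service
    margin:          optional extra safety margin (an assumption of this sketch)
    """
    return service_updated >= edit_time + margin


edit = datetime(2019, 6, 26, 8, 0, tzinfo=timezone.utc)
# Service is five minutes ahead of the edit: the lookup can be trusted.
print(is_query_fresh(edit, edit + timedelta(minutes=5)))
# Service is five minutes behind the edit: a lookup may miss the new item.
print(is_query_fresh(edit, edit - timedelta(minutes=5)))
```

A tool that gets `False` here would wait and check again rather than create a new item on the strength of an empty query result.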


> The issues I see are:
>
> * again, to what extent can we grow our content, both for query and update, in the short, medium and long term
> * will batch jobs like mine be able to complete

Honestly, I'm not sure what the issue is, so I can't assure you those batches will be able to complete. What we can do is work together to understand the issue and see what needs to be fixed.

> * can we handle the attention when scholars discover how relevant Scholia is for them and for the subjects they care about
> * do we care that the motivation of volunteers relies on the availability of sufficient performance to do the tasks they care about

It depends on who "we" is. I care, and I know that people on my team care. Which does not mean we will be able to magically fix everything, but we're trying.

In more general terms, scaling Wikidata and the Wikidata Query Service will require challenging some of our assumptions. Workflows that assume WDQS is updated synchronously will fail more and more often. Throttling is becoming more and more important to the stability of the service and to fair access to resources, so clients will need to be able to smooth their load and back off appropriately.
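
As an illustration of what "back off appropriately" can look like on the client side, here is a minimal Python sketch. It assumes the service signals throttling with HTTP status 429 and may suggest a wait time in seconds. The `send` callable and its `(status, retry_after, body)` return shape are hypothetical stand-ins for whatever HTTP layer a tool actually uses.

```python
import random
import time


def call_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call send() until it is no longer throttled, waiting longer each time.

    send() is a hypothetical callable returning (status, retry_after, body),
    where retry_after is the server's requested wait in seconds, or None.
    """
    for attempt in range(max_attempts):
        status, retry_after, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break  # out of attempts, do not sleep again
        if retry_after is not None:
            # Prefer the wait time the server asked for.
            delay = float(retry_after)
        else:
            # Otherwise use capped exponential backoff with random jitter,
            # so that many clients do not all retry at the same moment.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
        sleep(delay)
    raise RuntimeError("still throttled after %d attempts" % max_attempts)
```

The important design choice is that a throttled client slows down instead of retrying immediately. Immediate retries are what turn a short throttle into a longer ban.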

Sorry to not have a direct solution to your current issues, but let's try to find one!

Have fun!

Guillaume


-- Guillaume Lederrey Engineering Manager, Search Platform Wikimedia Foundation UTC+2 / CEST

Gerard Meijssen

10:18 a.m.

Hoi, The good news, the issue with my jobs has been isolated. It is a bug in the software that occasionally manifests itself. Good because it has nothing to do with performance at this time. Magnus has a ticket at Crossref [1] so that will be fixed at some stage.

The reason why I need to be certain about the functionality is that when scientists find that their papers are well presented in Wikidata, they will submit jobs for their data to be imported from ORCID into Wikidata. Given the number of scientists we already know about, this may result in many more jobs updating what we know of science. I have been asked to write for the ORCID blog, and that only makes sense when we can accommodate the traffic.

Thanks, GerardM

[1] https://github.com/MattsSe/crossref-rs/issues/5


Guillaume Lederrey

28 Jun 2019

5:40 a.m.

On Thu, Jun 27, 2019 at 4:37 PM Gerard Meijssen gerard.meijssen@gmail.com wrote:

> Hoi, The good news, the issue with my jobs has been isolated. It is a bug in the software that occasionally manifests itself. Good because it has nothing to do with performance at this time. Magnus has a ticket at Crossref [1] so that will be fixed at some stage.
>
> The reason why I need to be certain about the functionality is that when scientists find that their papers are well presented in Wikidata, they will submit jobs for their data to be imported from ORCID into Wikidata. Given the number of scientists we already know about, this may result in many more jobs updating what we know of science. I have been asked to write for the ORCID blog, and that only makes sense when we can accommodate the traffic.

Predictions are hard, especially when about the future :)

It is difficult to tell you whether Wikidata / WDQS is going to be able to handle this additional load without knowing what that additional load will look like, both in terms of complexity and in terms of volume. That being said, for WDQS, the capacity issues we have seen so far are usually about peak traffic. A badly behaved bot starts sending way more traffic than we usually have (either read or write traffic) and we start lagging. Additional well-behaved clients are probably not going to be an issue in the short term (but again, I'm just guessing).
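
Since the problem described here is peak traffic rather than total volume, one way for a batch tool to stay "well behaved" is to pace its own requests. The following Python sketch shows a simple pacer that lets at most one call start per interval. `Pacer` and its parameters are hypothetical names for illustration, and the right interval for any real tool is not something this thread specifies.

```python
import time


class Pacer:
    """Space out calls so that at most one starts every min_interval seconds."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the next slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval
```

A batch job would call `wait()` before each edit or query, so that a large job becomes a steady trickle instead of a burst.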

Again, we are working on several performance improvements and we will hire an additional engineer to work on WDQS. For example, we're working on custom code to process updates into Blazegraph, hoping that this will help us reduce the load related to edits and reduce the update lag. We won't know the actual impact until this is implemented and tested, but we're doing our best.

I understand that my answer is vague enough to be disappointing, but at least you have some additional context.

Have fun!

Guillaume


Patricia Collins

29 Jun 2019

8:59 p.m.

I dont have a clue what u are talking about? I wish I did! Are u sure u have the right person? How did u get my email address?

