Hoi,
The good news is that the issue with my jobs has been isolated. It is a bug
in the software that manifests itself occasionally. That is good, because it
has nothing to do with performance at this time. Magnus has a ticket at
Crossref [1], so it will be fixed at some stage.
The reason why I need to be certain about the functionality is that when
scientists find that their papers are well presented in Wikidata, they will
submit jobs for their data to be imported from ORCID into Wikidata. Given
the number of scientists we already know about, this may result in many
more jobs updating what we know of science. I have been asked to write for
the ORCID blog, and that only makes sense when we can accommodate the
traffic.
Thanks,
GerardM
[1]
On Thu, 27 Jun 2019 at 10:39, Guillaume Lederrey <glederrey(a)wikimedia.org>
wrote:
Hello!
I'm not familiar with some of the issues you raised, but let me try a
few guesses...
On Wed, Jun 26, 2019 at 8:02 AM Gerard Meijssen
<gerard.meijssen(a)gmail.com> wrote:
Hoi,
The performance of the query update is getting worse. Questions about this
have been raised before. I do remember quality replies such as "it is not
exponential, so there is no problem". However, here we are and there is a
problem.
The problem is that I run batch jobs, batch jobs that do not run [1]. I
have the impression that they are put in some kind of suspended animation
by a person. These jobs are submitted through the SourceMD tool by Magnus,
and Magnus is well known for being responsive to suggestions on how he can
improve his tools. So do not use as an argument that there is something
wrong with these jobs. At most it is acceptable for these runs to be put on
some kind of hold for the duration of a crisis, and then there has to be a
release.
I'm not familiar with sourcemd, and the link you provided isn't very
clear on what the actual error is. I'm just guessing, but maybe sourcemd
has some assumptions about updates to WDQS being synchronous, or
quasi-synchronous. Another guess is that it might be subject to
throttling and not backing off appropriately, and maybe it ends up
being banned for some time. If anyone knows what user agent is used by
sourcemd, I can have a look at the WDQS logs to get more
information.
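To illustrate the throttling guess above: a client that backs off
appropriately waits before retrying whenever the service answers with a
throttling status. The Python sketch below is only an illustration under
stated assumptions, not a description of how sourcemd works. It assumes
throttling is signalled with HTTP 429 or 503 and an optional Retry-After
header given in seconds, and `send` is a hypothetical callable that performs
one request and returns (status, headers, body).

```python
import random
import time


def fetch_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call send() until it is no longer throttled, waiting between tries.

    send is a hypothetical callable returning (status, headers, body).
    Statuses 429 and 503 are assumed to mean "throttled, try again later".
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status not in (429, 503):
            return status, body
        if attempt == max_attempts - 1:
            break  # no point in sleeping after the last attempt
        retry_after = headers.get("Retry-After")
        if retry_after is not None and retry_after.isdigit():
            # The server said how long to wait (seconds form only;
            # the HTTP-date form falls through to the computed delay).
            delay = float(retry_after)
        else:
            # Exponential backoff, capped, with jitter so that many
            # clients do not all retry at the same moment.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            delay = random.uniform(ceiling / 2, ceiling)
        sleep(delay)
    raise RuntimeError("still throttled after %d attempts" % max_attempts)
```

With a real HTTP library, `send` would wrap a single request to the query
endpoint and should carry a descriptive User-Agent, so that the client can
be identified in the server logs.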
At the same time I notice that the reports indicating multiple items with
the same ORCID iD include items that should have been picked up by earlier
reports. I notice that Query does not pick up existing items with an ORCID
iD and creates new ones. For me this is an indication that Query is not
reliable.
There is talk on the Wiki that there is no point in having fixed
descriptions in anything but English. What caused this discussion is the
sheer amount of updates needed just for one language. At the London
Wikimania this perceived need for fixed descriptions was discussed
vis-à-vis automated descriptions, and as I recall the only argument for
having them at all was "standards" in relation to dumps. Yes, automated
descriptions may be cached and included in a dump.
I have been asked to write for the ORCID blog and thereby in effect plug
the relevance of the Scholia presentation for scientists. When I do, the
number of jobs like the ones I run will mushroom. That is why I have not
put anything forward so far: we cannot cope as it is.
The issues I see are:
* again, to what extent can we grow our content, both for query and
update, for the short, medium and long term
* will batch jobs like mine be able to complete
Honestly, I'm not sure what the issue is, so I can't assure you those
batches will be able to complete. What we can do is work together to
understand the issue and see what needs to be fixed.
* can we absorb the attention when scholars discover how relevant
Scholia is for them and for the subjects they care about
* do we care that the motivation of volunteers relies on the availability
of sufficient performance to do the tasks they care about
It depends on who "we" is. I care, and I know that people on my team
care. Which does not mean we will be able to magically fix everything,
but we're trying.
In more general terms, scaling Wikidata and Wikidata Query Service
will require challenging some of our assumptions. Workflows that
assume WDQS to be updated synchronously will fail more and more.
Throttling is becoming more and more important to the stability of the
service and to fair access to resources, so clients will need to be
able to smooth their load and back off appropriately.
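One way a workflow can stop assuming synchronous updates is to poll, with
a timeout, until an edit becomes visible in the query service before acting
on query results. A minimal Python sketch, where `is_visible` is a
hypothetical callable (for example, one that runs a query for the item just
edited) and the timeout and interval values are arbitrary assumptions:

```python
import time


def wait_until_visible(is_visible, timeout=300.0, interval=5.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll is_visible() until it returns True or `timeout` seconds pass.

    Returns True if the change became visible in time, False otherwise.
    is_visible is a hypothetical callable supplied by the caller.
    """
    deadline = clock() + timeout
    while True:
        if is_visible():
            return True
        if clock() >= deadline:
            return False
        # Fixed pause between checks keeps the polling load smooth.
        sleep(interval)
```

A caller that gets False back can postpone the dependent step instead of
creating a new item on the basis of a stale query result.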
Sorry to not have a direct solution to your current issues, but let's
try to find one!
Have fun!
Guillaume
Thanks,
Gerard
[1]
https://tools.wmflabs.org/sourcemd/?action=batches&user=GerardM
_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
--
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+2 / CEST