Hoi,
For your information, these huge numbers of authors are particularly noticeable when organisations like CERN are involved. Those people all have an ORCID identifier, and slowly but surely more of them are being associated with publications. As a consequence, papers are becoming complete with respect to their authors. As more authors become available, it will be possible to populate a paper item with more of its authors when it is first created.

Given that the SOURCEMD jobs run in a narrow batch mode, running more jobs concurrently offsets the impact of "CERN" jobs. Over time, fewer edits associated with big articles with large numbers of co-authors will need to be processed.

NB: this is an answer to the off-topic issues raised. This is only one instance of the functionality that we support.
Thanks,
       GerardM

On Sun, 5 May 2019 at 17:06, Andrew Gray <andrew@generalist.org.uk> wrote:
So, I'm not particularly involved with the scholarly-papers work, but
with my day-job bibliographic analysis hat on...

Papers like this are a *remarkable* anomaly - hyperauthorship like
this is confined to some quite specific areas of physics, and is still
relatively uncommon even in those. I don't think we have to worry
about it approaching anything like 2% of papers any time soon :-)

For 2018 publications, the global mean number of authors/paper is
slightly under five (all disciplines). Over all time, allowing for
there being more new papers than old ones, I'd guess it's something
like three.

Andrew.



On Sat, 4 May 2019 at 08:58, Stas Malyshev <smalyshev@wikimedia.org> wrote:
>
> Hi!
>
> > For the technical guys, consider our growth and plan for at least one
> > year. When the impression exists that the current architecture will not
> > scale beyond two years, start a project to future proof Wikidata.
>
> We may also want to consider if Wikidata is actually the best store for
> all kinds of data. Let's consider example:
>
> https://www.wikidata.org/w/index.php?title=Q57009452
>
> This is an entity that is almost 2M in size, with almost 3000 statements,
> and each edit to it produces another 2M data structure. Its dump, albeit
> slightly smaller, is still 780K and will need to be updated on each edit.
>
> Our database is obviously not optimized for such entities, and they
> won't perform very well. We have 21 million scientific articles in the
> DB, and if even 2% of them were like this, that would be almost a
> terabyte of data (multiplied by the number of revisions) and billions
> of statements.
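>
> As a rough back-of-envelope sketch of that estimate (Python; the
> per-item figures are the ones from the Q57009452 example above, so
> treat them as assumptions rather than measurements):
>
>     articles_total = 21_000_000      # scientific articles in Wikidata
>     share_affected = 0.02            # "even 2% of them"
>     size_mb_per_item = 2             # ~2M per entity, as with Q57009452
>     statements_per_item = 3000       # ~3000 statements per entity
>
>     affected = int(articles_total * share_affected)     # 420,000 items
>     size_tb = affected * size_mb_per_item / 1024 ** 2   # ~0.8 TB per revision
>     statements = affected * statements_per_item         # ~1.26 billion
>
>     print(affected, round(size_tb, 2), statements)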
>
> While I am not against storing this as such, I do wonder if it's
> sustainable to keep this kind of data together with other Wikidata data
> in a single database. After all, each query that you run - even if not
> related to those 21 million articles in any way - will still have to
> run within the same enormous database and be hosted on the same
> hardware. This is especially important for services like Wikidata
> Query Service, where all data (at least currently) occupies a shared
> space and cannot be easily separated.
>
> Any thoughts on this?
>
> --
> Stas Malyshev
> smalyshev@wikimedia.org
>



--
- Andrew Gray
  andrew@generalist.org.uk

_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata