On Thu, Jul 9, 2020 at 4:52 PM Egon Willighagen <egon.willighagen@gmail.com> wrote:

Dear Guillaume,

On Thu, Jul 9, 2020 at 3:23 PM Guillaume Lederrey <glederrey@wikimedia.org> wrote:
Some very preliminary analysis indicates that less than 2% of the queries on WDQS generate more than 90% of the load. This is definitely something we need to better understand.

Is the data behind that available? I wonder if I recognize any of the top 25 queries.

No, the data isn't publicly available. Queries can (and do) contain private information, so we don't publish raw queries. We might publish a subset of those queries at some point, but only after having reviewed them manually to ensure they are clean.

(I guess the top 2% can be simple queries run very many times, as well as hard queries rarely run, correct?)

The analysis at this point is just on individual queries, with no aggregation of similar queries. This means that the 2% consists of queries that are individually very expensive, not cheap queries that are run many times. We need to refine that analysis, and aggregation of similar queries is one of the things we should be working on.
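
To make the difference between the two views concrete, here is a rough sketch in Python. It is not the analysis we actually ran. It assumes a hypothetical log of (query text, cost) pairs, where cost is some per-query measure such as execution time, and the normalize() function is only one guess at what "similar queries" could mean (masking entity IDs and string literals, collapsing whitespace).

```python
import re
from collections import defaultdict


def share_of_load(costs, top_fraction):
    """Fraction of total cost contributed by the most expensive
    `top_fraction` of the items in `costs`."""
    ordered = sorted(costs, reverse=True)
    k = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:k]) / sum(ordered)


def normalize(query):
    """Hypothetical notion of 'similar': mask string literals and
    Wikidata entity IDs, then collapse runs of whitespace."""
    q = re.sub(r'"[^"]*"', '"?"', query)
    q = re.sub(r'\bwd:Q\d+\b', 'wd:Q?', q)
    return re.sub(r'\s+', ' ', q).strip()


def aggregate(log):
    """Sum the cost of all queries that share a normalized form."""
    totals = defaultdict(float)
    for query, cost in log:
        totals[normalize(query)] += cost
    return dict(totals)


# Made-up example log: (query text, cost).
log = [
    ("SELECT ?x WHERE { ?x wdt:P31 wd:Q5 }", 1.0),
    ("SELECT ?x  WHERE { ?x wdt:P31 wd:Q146 }", 2.0),
    ("ASK { wd:Q1 ?p ?o }", 10.0),
]

per_query = share_of_load([cost for _, cost in log], 1 / 3)
per_shape = share_of_load(list(aggregate(log).values()), 1 / 2)
```

Here per_query is the share of load from the single most expensive query, while per_shape is the share from the most expensive query shape after aggregation. On real data, a cheap query that is run very often would only show up in the second view.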

Egon

--
Hi, do you like citation networks? Already 51% of all citations are available for innovative new uses. Join me in asking the American Chemical Society to join the Initiative for Open Citations too. SpringerNature, the RSC and many others already did.

-----
E.L. Willighagen
Department of Bioinformatics - BiGCaT
Maastricht University (http://www.bigcat.unimaas.nl/)
Homepage: http://egonw.github.com/
Blog: http://chem-bla-ics.blogspot.com/
PubList: https://www.zotero.org/egonw
ORCID: 0000-0001-7542-0286
ImpactStory: https://impactstory.org/u/egonwillighagen
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata

Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+1 / CET