On Sat, Jul 28, 2018 at 2:02 AM Stas Malyshev <smalyshev@wikimedia.org> wrote:

Hi!

> The top 1000
> is: https://docs.google.com/spreadsheets/d/1E58W_t_o6vTNUAx_TG3ifW6-eZE4KJ2VGEaBX_74YkY/edit?usp=sharing

This one is pretty interesting, how do I extract this data? It may be
useful independently of what we're discussing here.

This can be extracted from elastic using aggregations. To obtain the top 1000 terms that do not match P31= or P279=, you can run this:

curl -XPOST 'localhost:9200/wikidatawiki_content/_search?size=0&pretty' -d '{"aggs": {"item_usage": { "terms": { "field": "statement_keywords", "exclude": "P(31|279)=.*", "size": 1000 }}}}' > top1k.json

To obtain an approximation of the cardinality (unique terms) of a field:

curl -XPOST 'localhost:9200/wikidatawiki_content/_search?size=0' -d '{"aggs": {"item_usage": { "cardinality": { "field": "statement_keywords" }}}}'

Note that I used the spare cluster to run these.

As for Property usage, I just realized that we have the outgoing_link field, which contains an array like:

"outgoing_link": ["Q1355298","Q1379672","Q15241312","Q8844594","Property:P18","Property:P1889","Property:P248","Property:P2612","Property:P279","Property:P3221","Property:P3417","Property:P373","Property:P3827","Property:P577","Property:P646","Property:P910"],

We don't have doc values enabled for this field, so we can't run aggregations on it, but if the list of terms is known, the counts could easily be extracted by running X count queries, where X is the number of possible properties.
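
As a rough sketch of those count queries (not something I have run as-is): the loop below sends one request per property to the _count endpoint. The function names and the ES_URL/INDEX variables are made up for illustration, and the term query assumes outgoing_link is indexed as exact, un-analyzed values; if the field is analyzed differently, the query type would need adjusting.

```shell
#!/bin/sh
# Hypothetical helpers. ES_URL and INDEX mirror the host and index used in
# the commands above; override them through the environment if needed.
ES_URL="${ES_URL:-localhost:9200}"
INDEX="${INDEX:-wikidatawiki_content}"

# Build the body of a count query for one property ID, e.g. P18.
# Assumption: outgoing_link stores values such as "Property:P18" verbatim.
count_query_body() {
  printf '{"query": {"term": {"outgoing_link": "Property:%s"}}}' "$1"
}

# For each property ID given, print the ID followed by the raw JSON
# response of the count request (the number is in its "count" field).
count_property_usage() {
  for prop in "$@"; do
    printf '%s ' "$prop"
    curl -s -XPOST "$ES_URL/$INDEX/_count" \
      -H 'Content-Type: application/json' \
      -d "$(count_query_body "$prop")"
    printf '\n'
  done
}
```

It would be called with the known list of properties, for example: count_property_usage P18 P279 P373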