On Sat, Jul 28, 2018 at 2:02 AM Stas Malyshev <smalyshev@wikimedia.org> wrote:
> Hi!
>
> > The top 1000 is:
> > https://docs.google.com/spreadsheets/d/1E58W_t_o6vTNUAx_TG3ifW6-eZE4KJ2VGEa…
>
> This one is pretty interesting, how do I extract this data? It may be
> useful independently of what we're discussing here.
This can be extracted from elastic using aggregations. To obtain a top 1000
of the terms that do not match P31= or P279= you can run this:
curl -XPOST 'localhost:9200/wikidatawiki_content/_search?size=0&pretty' -d '
{
  "aggs": {
    "item_usage": {
      "terms": {
        "field": "statement_keywords",
        "exclude": "P(31|279)=.*",
        "size": 1000
      }
    }
  }
}' > top1k.json
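For what it's worth, a quick way to flatten that response into a
term/count TSV, assuming jq is available (the buckets sit under
aggregations.item_usage.buckets in the response):

jq -r '.aggregations.item_usage.buckets[] | [.key, .doc_count] | @tsv' \
  top1k.json > top1k.tsv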
To obtain an approximation of the cardinality (the number of unique terms)
of a field:
curl -XPOST 'localhost:9200/wikidatawiki_content/_search?size=0' -d '
{
  "aggs": {
    "item_usage": {
      "cardinality": {
        "field": "statement_keywords"
      }
    }
  }
}'
Note that I used the spare cluster to run these.
As for property usage, I just realized that we have the outgoing_link
field, which contains an array like:
outgoing_link":
["Q1355298","Q1379672","Q15241312","Q8844594","Property:P18"
,"Property:P1889","Property:P248","Property:P2612","Property:P279","
Property:P3221","Property:P3417","Property:P373","Property:P3827","
Property:P577","Property:P646","Property:P910"],
We don't have doc values enabled for this field, so we can't run
aggregations over it, but if the list of terms is known the counts could
easily be extracted by running X count queries, where X is the number of
possible properties.
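A rough sketch of that approach (doc values are only needed for
aggregations and sorting, so plain term queries still work; the property
IDs below are just a sample, substitute the real list):

# Count documents linking to each property via the _count API.
for p in P18 P31 P279 P373; do
  count=$(curl -s -XPOST 'localhost:9200/wikidatawiki_content/_count' -d '
  {
    "query": {
      "term": {
        "outgoing_link": "Property:'"$p"'"
      }
    }
  }' | jq -r '.count')
  printf '%s\t%s\n' "$p" "$count"
done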