On Sat, Jul 28, 2018 at 2:02 AM Stas Malyshev <smalyshev@wikimedia.org> wrote:
> Hi!
>
> > The top 1000 is:
> > https://docs.google.com/spreadsheets/d/1E58W_t_o6vTNUAx_TG3ifW6-eZE4KJ2VGEa…
>
> This one is pretty interesting, how do I extract this data? It may be
> useful independently of what we're discussing here.
This can be extracted from elastic using aggregations. To obtain a top 1000
of the terms that do not match P31= or P279= you can run this:
curl -XPOST 'localhost:9200/wikidatawiki_content/_search?size=0&pretty' -d '
{
  "aggs": {
    "item_usage": {
      "terms": {
        "field": "statement_keywords",
        "exclude": "P(31|279)=.*",
        "size": 1000
      }
    }
  }
}' > top1k.json
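For what it's worth, a quick way to flatten that response into a
term/count TSV, assuming jq is available (the buckets sit under
aggregations.item_usage.buckets in the response):

jq -r '.aggregations.item_usage.buckets[] | [.key, .doc_count] | @tsv' \
  top1k.json > top1k.tsv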
To obtain an approximation of the cardinality (the number of unique terms)
of a field:
curl -XPOST 'localhost:9200/wikidatawiki_content/_search?size=0' -d '
{
  "aggs": {
    "item_usage": {
      "cardinality": {
        "field": "statement_keywords"
      }
    }
  }
}'
Note that I used the spare cluster to run these.
As for property usage, I just realized that we have the outgoing_link
field, which contains an array like:
outgoing_link":
["Q1355298","Q1379672","Q15241312","Q8844594","Property:P18"
,"Property:P1889","Property:P248","Property:P2612","Property:P279","
Property:P3221","Property:P3417","Property:P373","Property:P3827","
Property:P577","Property:P646","Property:P910"],
We don't have doc values enabled for this field, so we can't run
aggregations over it, but if the list of terms is known the counts could
easily be extracted by running X count queries, where X is the number of
possible properties.
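A rough sketch of that approach (doc values are only needed for
aggregations and sorting, so plain term queries still work; the property
IDs below are just a sample, substitute the real list):

# Count documents linking to each property via the _count API.
for p in P18 P31 P279 P373; do
  count=$(curl -s -XPOST 'localhost:9200/wikidatawiki_content/_count' -d '
  {
    "query": {
      "term": {
        "outgoing_link": "Property:'"$p"'"
      }
    }
  }' | jq -r '.count')
  printf '%s\t%s\n' "$p" "$count"
done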