Wikidata SPARQL query logs available


Markus Kroetzsch

7 Aug 2018

11 p.m.

Dear all,

I am happy to announce that, as part of an ongoing research collaboration between TU Dresden researchers and Wikimedia [1], we can now release pre-processed logs from the Wikidata SPARQL Query Service [2]. You can find details and download links on the following page:

https://iccl.inf.tu-dresden.de/web/Wikidata_SPARQL_Logs/en

The data so far comprises over 200 million queries answered in June-August 2017. There is also an accompanying publication that describes the workings of and practical experiences with the SPARQL query service [3].

The logs have been pre-processed to remove information that could potentially be used to identify individual users (e.g., comments were removed, geo-coordinates coarsened, and query strings reformatted completely; see the above page for details). Nevertheless, one can still learn many interesting things from the logs, e.g., which properties and entities are used in queries, which SPARQL features are most prominent, or which languages are requested.

We have also preserved some user agent information, but without overly detailed software versions and only in cases where the agents occurred many times across several weeks. This can at least be used to recognise the significant number of queries generated, e.g., by Magnus' tools, or to do a rough analysis of which software platforms are most often used to send queries. We used #TOOL comments found in queries to refine user agent information in some cases.

We also made an effort to identify those queries that come from browser agents *and* also behave as one would expect from a browser (not all "browsers" did). We called such queries "organic" and provide this classification with the logs (there is also a filtered dump of only organic queries, which is much smaller and therefore nicer to process, also for testing). See the paper for details on our methodology.

Finally, the data contains the time of each request, so one can reconstruct query loads over time.

Feedback is very welcome, both in terms of comments on the data (is it useful to you? would you like to see more? do you have concerns?) and in terms of insights that you can get from it (we did some analyses but one can surely do more).

Cheers,
Markus

[1] https://meta.wikimedia.org/wiki/Research:Understanding_Wikidata_Queries
[2] https://query.wikidata.org/ (or rather the web service that powers this UI and many other applications).
[3] Stanislav Malyshev, Markus Krötzsch, Larry González, Julius Gonsior, Adrian Bielefeldt: Getting the Most out of Wikidata: Semantic Technology Usage in Wikipedia’s Knowledge Graph. In Proceedings of the 17th International Semantic Web Conference (ISWC-18), Springer 2018. https://iccl.inf.tu-dresden.de/web/Wikidata_SPARQL_Logs/en

--
Prof. Dr. Markus Kroetzsch
Knowledge-Based Systems Group
Center for Advancing Electronics Dresden (cfaed)
Faculty of Computer Science
TU Dresden
+49 351 463 38486
https://kbs.inf.tu-dresden.de/


David Cuenca Tudela

7 Aug 2018

11:26 p.m.

Hi Markus,

Thanks for making the logs available. Personally, I would be interested in knowing how often a certain item appears in queries. That would make it easier to gauge the popularity of certain items. Do you think that is something that could be accomplished?

Regards, Micru

Markus Kroetzsch

11:37 p.m.

Hi Micru,

This would be quite easy to do. Each query is one line in the files, and we have expanded all URLs, so every URL closes with ">", which is URL-encoded as "%3E". You can therefore simply run zgrep -c over the files to count the queries that mention the item. Make sure to include the closing "%3E" in the pattern, so that a search for Q123 does not also match Q1234. One such grep over any of the three larger files takes less than a minute.

If you want a sorted list of "most popular" items, this is a bit more work and would require at least some Python script, or some less obvious combination of sed (extracting all URLs of entities) and sort.

Best, Markus
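For readers who prefer a script to zgrep, the per-item count described above can be sketched in Python. This is only a minimal sketch under assumptions, not part of the released tooling: the function names are invented for illustration, and it relies solely on what is stated in this thread, namely that each query occupies one line and that entity URLs appear URL-encoded with a closing "%3E".

```python
import gzip
from urllib.parse import quote


def entity_token(item_id):
    """Return the URL-encoded form of <http://www.wikidata.org/entity/ITEM>.

    Encoding the angle brackets as well means the closing "%3E" is part of
    the search string, so looking for Q123 cannot match Q1234.
    """
    return quote(f"<http://www.wikidata.org/entity/{item_id}>", safe="")


def count_queries_mentioning(lines, item_id):
    """Count lines (one query per line) that mention the item at least once."""
    token = entity_token(item_id)
    return sum(1 for line in lines if token in line)


def count_in_gzip(path, item_id):
    """Same count over a gzip-compressed log file (hypothetical helper)."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_queries_mentioning(handle, item_id)
```

A call such as count_in_gzip("dump.gz", "Q123") would then play the role of the zgrep -c invocation, where "dump.gz" is a placeholder for one of the downloaded files.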


Maximilian Marx

11:59 p.m.

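Hi,

On Tue, 7 Aug 2018 17:37:34 +0200, Markus Kroetzsch <markus.kroetzsch(a)tu-dresden.de> said:

> If you want a sorted list of "most popular" items, this is a bit more
> work and would require at least some Python script, or some less
> obvious combination of sed (extracting all URLs of entities), and
> sort.

    zgrep -Eoe '%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ[1-9][0-9]+%3E' dump.gz | cut -d 'Q' -f 2 | cut -d '%' -f 1 | sort | uniq -c | sort -nr

should do the trick.

Best, Maximilian

--
Dipl.-Math. Maximilian Marx
Knowledge-Based Systems Group
Faculty of Computer Science
TU Dresden
+49 351 463 43510
https://kbs.inf.tu-dresden.de/max

_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata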

David Cuenca Tudela

8 Aug 2018

12:06 a.m.

If someone could post the 10 (or 50!) most popular items, I would really appreciate it :-)

Cheers, Micru

On Tue, Aug 7, 2018 at 5:59 PM Maximilian Marx <maximilian.marx(a)tu-dresden.de> wrote:

> zgrep -Eoe '%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ[1-9][0-9]+%3E' dump.gz | cut -d 'Q' -f 2 | cut -d '%' -f 1 | sort | uniq -c | sort -nr
>
> should do the trick.
>
> Best, Maximilian

-- Etiamsi omnes, ego non

Daniel Mietchen

24 Aug 2018

3:57 a.m.

I just ran Max' one-liner over one of the dump files, and it worked smoothly. Not sure where the best place would be to store such things, so I simply put it in my sandbox for now: https://www.wikidata.org/w/index.php?title=User:Daniel_Mietchen/sandbox&… .

d.

Stas Malyshev

4:03 a.m.

Hi!

If you think it's a dataset others may want to reuse, tabular data on Commons may be a venue: https://www.mediawiki.org/wiki/Help:Tabular_Data

-- Stas Malyshev smalyshev(a)wikimedia.org

Daniel Mietchen

4:08 a.m.

Hi Stas,

I had thought about putting it on Commons as tabular data, but did not know how to reuse it from there for multilingual display using the Q template on Wikidata, so I went the simpler route. Can you (or someone else) perhaps demo that briefly?

Thanks, d.

Finn Aarup Nielsen

4:20 a.m.

I was wondering why our research section was number 8!? Then I recalled our dashboard running at "http://people.compute.dtu.dk/faan/cognitivesystemswikidata1.html". It updates about every 3 minutes, all day long :)

/Finn

fn＠imm.dtu.dk

4:39 a.m.

I was wondering why our research section was number 8. Then I recalled our dashboard running at "http://people.compute.dtu.dk/faan/cognitivesystemswikidata1.html". It updates about every 3 minutes, all day long...

/Finn

Daniel Mietchen

5:07 a.m.

Such automated queries should not be in the organic query file that I looked at.

d.

Stas Malyshev

5:10 a.m.

Hi!

On 8/23/18 2:07 PM, Daniel Mietchen wrote:

> Such automated queries should not be in the organic query file that I looked at.

If it's a browser page and the underlying code does not set a distinctive user agent, I think they will be. It'd be hard to identify such cases otherwise (ccing Markus in case he knows more on the topic).

-- Stas Malyshev smalyshev(a)wikimedia.org

Markus Kroetzsch

6:46 a.m.

On 8/23/18 2:07 PM, Daniel Mietchen wrote:

> Such automated queries should not be in the organic query file that I looked at.

Yes, the "organic" file is a subset of the queries from agents that presented themselves as browsers. We filtered out agents and query patterns that were clearly not "human-like", but a tool that sends one query every 3 minutes would not be recognised at this level.

Such a tool would also not strongly affect most statistics, but it can for statistics that have an extremely high number of possible values (e.g., items used in the query). In such cases, normal "organic" traffic is usually so diverse that no individual value receives much prominence, so even a rather small number of queries from one source can have an impact.

In general, popularity measures based on query traffic, even on the organic part, must be treated with caution, because many effects lead to skewed query volumes from a particular source (without this necessarily indicating real "popularity"). It is an open question how best to evaluate the traffic in the presence of these skews. Our two-class system of "robotic" [massively skewed] and "organic" [less skewed] is only a first step there.

Best, Markus

Lucas Werkmeister

5:31 a.m.

The top result freaks me out, to be honest. Are /that many/ people running the first query from the SPARQL tutorial <https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial>, or is there some other reason why Bach might be so overwhelmingly popular?

Lucas Werkmeister

5:35 a.m.

Ah, and I think I found a bug in your command: by grepping for 'Q[1-9][0-9]+', you’re excluding single-digit item IDs. I’m going to speculate that if you fix that, Q5 will comfortably beat all other items :)
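The ranking discussed in this thread, with the single-digit issue addressed, can also be sketched in Python. This is a hedged illustration rather than anyone's actual script: the names ENTITY_RE, count_item_mentions and top_items are invented here, and the pattern assumes entity URLs are URL-encoded exactly as in the one-liner quoted in the thread. Like grep -o, it counts every mention, so an item used twice in one query is counted twice.

```python
import gzip
import re
from collections import Counter

# "[0-9]*" instead of "[0-9]+" so that single-digit IDs such as Q5 match too.
ENTITY_RE = re.compile(
    r"%3Chttp%3A%2F%2Fwww\.wikidata\.org%2Fentity%2F(Q[1-9][0-9]*)%3E"
)


def count_item_mentions(lines):
    """Tally every item mention across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        counts.update(ENTITY_RE.findall(line))
    return counts


def top_items(path, n=10):
    """Return the n most-mentioned items in a gzip-compressed log file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_item_mentions(handle).most_common(n)
```

Calling top_items("dump.gz", 50) would give a list of (item, count) pairs, with "dump.gz" standing in for one of the downloaded files. As noted earlier in the thread, such counts reflect query traffic, including regular automated sources, and not necessarily real popularity.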


wikidata@lists.wikimedia.org
