Dear SPARQL users,
We are starting a research project to investigate the use of the Wikidata SPARQL Query Service, with the goal of gaining insights that may help to improve Wikidata and the query service [1]. Currently, we are still waiting for all data to become available. Meanwhile, we would like to ask for your input.
Preliminary analyses show that the use of the SPARQL query service varies greatly over time, presumably because power users and software tools are running large numbers of queries. For a meaningful analysis, we would like to understand such high-impact biases in the data. We therefore need your help:
(1) Are you a SPARQL power user who sometimes runs large numbers of queries (over 10,000)? If so, please let us know how your queries might typically look so we can identify them in the logs.
(2) Are you the developer of a tool that launches SPARQL queries? If so, then please let us know if there is any way to identify your queries.
If (1) or (2) applies to you, then it would be good if you could include an identifying comment in your SPARQL queries in the future, to make it easier to recognise them. In return, this would enable us to provide you with statistics on the usage of your tool [2].
Further feedback is welcome.
Cheers,
Markus
[1] https://meta.wikimedia.org/wiki/Research:Understanding_Wikidata_Queries
[2] Pending permission by the WMF. Like all Wikimedia usage data, the query logs are under strict privacy protection, so we will need to get clearance before sharing any findings with the public. We hope, however, that there won't be any reservations against publishing non-identifying information.
Would it help if I add the following header to every large batch of queries?
#######
# access: (http://query.wikidata.org or
#   https://query.wikidata.org/bigdata/namespace/wdq/sparql?query=%7BSPARQL%7D )
# contact: email, accountname, twittername, etc.
# bot: True/False
# .........
######
On 30.09.2016 16:18, Andra Waagmeester wrote:
This is already more detailed than what I had in mind. Having a way to tell bots and tools apart from "organic" queries would already be great. We are mainly looking for something that will help us to understand sudden peaks of activity. For this, it might be enough to have a short signature (a URL could be given, but a tool name with a version would also be fine). This is somewhat like the "User-Agent" field in HTTP.
But you are right that some formatting convention may help further here. How about this:
#TOOL:<any user agent information that you like to share>
Then one could look for comments of this form without knowing all the tools upfront. Of course, this is just a hint in any case, since one could always use the same comment in any manually written query.
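For illustration, a tool could prepend such a comment to every query it sends; in jQuery this might look roughly as follows (the tool name, version, and URL are placeholders, not a fixed convention):

    // Sketch only: prepend the proposed #TOOL comment to each query,
    // so the signature ends up in the SPARQL query logs.
    var TOOL_COMMENT = '#TOOL:ExampleTool/1.0 (https://example.org/exampletool)\n';

    function runQuery(sparql) {
      return $.get('https://query.wikidata.org/sparql', {
        query: TOOL_COMMENT + sparql,
        format: 'json'
      });
    }

    runQuery('SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 10');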
Best regards,
Markus
Just curious while we are on the topic: when you are inspecting the headers to separate "organic" queries from bot queries, would it be possible to count how often a given set of properties is used across the different queries? This would be a nice way to demonstrate to the original external resources how "their" data is used, and which combinations of properties are used together with "their" properties (e.g. P351 for NCBI Gene or P699 for the Disease Ontology). It would be interesting to know, for example, how often those two properties are used in one single query.
Cheers,
Andra
On 30.09.2016 19:50, Andra Waagmeester wrote:
Yes, we definitely want to do such analyses. The first task is to clean up and group/categorize queries so we can get a better understanding (if a property is used in 100K queries a day, it would still be nice to know if they come from a single script or from many users).
Once we have this, we would like to analyse for content (which properties and classes are used, etc.) but also for query features (how many OPTIONALs, GROUP BYs, etc. are used). Ideas on what to analyse further are welcome. Of course, SPARQL can only give a partial idea of "usage", since Wikidata content can be used in ways that don't involve SPARQL. Moreover, counting raw numbers of queries can also be misleading: we have had cases where a single query result was discussed by hundreds of people (e.g. the Panama Papers query that made it to Le Monde online), but in the logs it will still show up only as a single query among millions.
Best,
Markus
Hi Markus,
I assume I qualify for (1) and (2). I can add an identifiable comment with a '#Tool:' prefix to every major SPARQL query done by our tools.
One bot run usually generates a few very heavy queries and tens of thousands of smaller ones, depending on the actual task the bot performs. All of this serves to keep the data in Wikidata consistent, avoid duplicates, etc.; in principle, it acts as a combination of database connector and Wikidata API wrapper.
Best, Sebastian
Yes, I agree that SPARQL gives only a partial view, and we certainly need to look into different metrics of how Wikidata is used. I am happy to join the discussion, but even the partial view of the usage is already a big step forward. A lot of the data being fed into Wikidata through the different bots resulted from funded initiatives. Currently, we have no way of demonstrating to funders how using Wikidata to distribute their efforts benefits the community at large. Simply counting the shared use of different properties could already be a very crude metric of the dissemination of scientific knowledge across different domains.
I guess I qualify for #2 several times:
* The <mapframe> and <maplink> tags support access to the geoshapes service, which in turn can make requests to WDQS. For example, see https://en.wikipedia.org/wiki/User:Yurik/maplink (click on "governor's link").
* The <graph> wiki tag supports the same geoshapes service, as well as direct queries to WDQS. This graph uses both (one query to get all countries, the other to get the list of disasters): https://www.mediawiki.org/wiki/Extension:Graph/Demo/Sparql/Largest_disasters
* There has been some discussion about allowing direct WDQS querying from maps too, e.g. to draw points of interest based on Wikidata (very easy to implement, but we should be careful to cache it properly).
Since all these queries are made from either Node.js or our in-browser JavaScript, we could attach extra headers, like X-Analytics, which is already handled by Varnish. Also, the Node.js queries could set the user agent string.
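As a rough sketch, a backend client could identify itself like this in Node.js (the tool name, contact address, and X-Analytics key names are invented for illustration):

    // Sketch only: a backend client sending both a descriptive User-Agent
    // and an X-Analytics header along with its WDQS request.
    const https = require('https');

    const sparql = 'SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 10';

    https.get({
      host: 'query.wikidata.org',
      path: '/sparql?format=json&query=' + encodeURIComponent(sparql),
      headers: {
        'User-Agent': 'ExampleBot/1.0 (https://example.org/examplebot; owner@example.org)',
        'X-Analytics': 'tool=ExampleBot;toolver=1.0' // illustrative key names
      }
    }, (res) => {
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => console.log(body.length + ' bytes received'));
    }).on('error', console.error);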
Markus, do you have access to the corresponding HTTP request logs? The fields there might be helpful (although I might be overly optimistic about it).
Hi Denny,
On 30-09-16 20:47, Denny Vrandečić wrote:
I was about to say the same. I use pywikibot quite a lot, and it sends some nice headers, as described at https://www.mediawiki.org/wiki/API:Main_page#Identifying_your_client .
Maarten
On 30.09.2016 20:47, Denny Vrandečić wrote:
Yes, we can access all logs. For bot-based queries, this should be very helpful indeed. I can still think of several cases where this won't help much:
* People writing a quick Python (or whatever) script to run thousands of queries, without setting a meaningful user agent.
* Web applications like Reasonator or SQID that cause the client to run SPARQL queries when viewing a page (in this case, the user agent that gets logged is the user's browser).
But, yes, we will definitely look at all signals that we can get from the data.
Best,
Markus
Hi!
> Would it help if I add the following header to every large batch of queries?
I think having a distinct User-Agent header (maybe with a URL linking to the rest of the info) would be enough. This is recorded by the request log and can be used later in processing.
In general, every time you create a bot that does a large amount of processing, it is good practice to send a distinct User-Agent header, so that people on the other side know what's going on.
I'll try to throw in a #TOOL: comment where I can remember using SPARQL, but I'll be bound to forget a few...
Hi!
Thanks, though using a distinct User-Agent may be easier for analysis: user agents are stored as a separate field, and operating on a separate field is much easier than extracting comments from the query field, e.g. when doing Hive data processing.
I would highly recommend using the X-Analytics header for this, and establishing "well known" key names. X-Analytics gets parsed into key-value pairs (an object field) by our Varnish/Hadoop infrastructure, whereas the user agent is basically a semi-free-form text string. Also, the user agent cannot be set by a JavaScript client, so we would constantly have to perform two types of analysis: queries that came from the "backend" and those that were made by the browser.
Yuri/Stas:
This thread is missing some background context as to what the issues are; if you could forward it, that would be great.
X-Analytics is a separate field in our Hive data; we like it when info intended for analytics is put there. Please see the docs: https://wikitech.wikimedia.org/wiki/X-Analytics
Hi!
> This thread is missing some background context as to what the issues are; if you could forward it, that would be great.
Well, I'm not talking about specific issues, except for the general need to identify which tool is responsible for which queries. Basically, there are several ways of doing it:
1. Adding comments to the query itself
2. Adding query parameters
3. Adding query headers, specifically:
   a) distinct User-Agent
   b) distinct X-Analytics header
   c) custom headers
I think that 3a is good for statistics purposes, though 1 could be more efficient when we need to find out who sent a particular query. 3b may be superior to 3a, but I admit I don't know enough about it :)
I'm a bit late to the discussion, but still...
I think that as much metadata about a query as possible should be carried in HTTP headers. This way, it is not coupled to SPARQL itself and can be analysed with the generic tools already in place. Setting a user agent is a standard best practice and seems to be part of the MediaWiki API guidelines [1]; we should use the same guidelines, no reason to reinvent them.
The X-Analytics header might allow for more fine-grained information, but I'm not sure this is actually needed (and using X-Analytics should not preclude having a sensible user agent).
[1] https://www.mediawiki.org/wiki/API:Main_page#Identifying_your_client
Using custom HTTP headers would, of course, complicate calls for the tool authors (i.e., myself). $.ajax instead of $.get and all that. I would be less inclined to change to that.
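Concretely, the difference amounts to roughly this (the header name and value are just an example, not an agreed convention):

    var sparql = 'SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 10';

    // A plain $.get cannot set custom request headers...
    $.get('https://query.wikidata.org/sparql', { query: sparql, format: 'json' });

    // ...so a browser tool has to switch to $.ajax to add one:
    $.ajax({
      url: 'https://query.wikidata.org/sparql',
      data: { query: sparql, format: 'json' },
      headers: { 'X-Analytics': 'tool=ExampleTool' } // made-up value
    });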
On Mon, Oct 3, 2016 at 11:55 AM, Magnus Manske magnusmanske@googlemail.com wrote:
> Using custom HTTP headers would, of course, complicate calls for the tool authors (i.e., myself). $.ajax instead of $.get and all that. I would be less inclined to change to that.
Yes, the limitation of HTTP headers is that they make things a bit more complicated for tool authors. At the same time, it is a limitation that is already pushed to tool authors using the MediaWiki APIs. Having a specific way of doing things for WDQS increases the overall complexity of our infrastructure. As I am more involved in the general infrastructure and not only in WDQS, I am of course biased toward a globally standardized solution rather than a WDQS-specific one. I am not absolutely against having a WDQS-specific solution if it makes things sufficiently easier on tool authors; I just want to make sure we don't take this decision lightly...
Hi!
> Using custom HTTP headers would, of course, complicate calls for the tool authors (i.e., myself). $.ajax instead of $.get and all that. I would be less inclined to change to that.
Yes, if you're using a browser, you probably can't change the user agent. In that case I guess we need either X-Analytics or something in the query itself. Or maybe the Referer header would be fine then; it is also recorded. If the Referer is distinct enough, it can be used.
For consistency between all possible clients, we seem to have only two options: either part of the query, or the X-Analytics header. The User-Agent header is not really an option because it is not available for all types of clients, and we want to have just one way for everyone. Headers other than X-Analytics would need custom handling, whereas we already have plenty of Varnish code to deal with the X-Analytics header, split it into parts, and have Hive parse it. Yes, it will be an extra line of code in JS ($.ajax instead of $.get), but I am sure this is not such a big deal if we provide cookie-cutter code. Parsing the query string in Varnish/Hive would also be complex extra work, so let's keep X-Analytics. Proposed required values (semicolon separated):
* tool=<name of the tool>
* toolver=<version of the tool>
* contact=<some way of contacting you, e.g. @twitter, email@example.com, +1.212.555.1234, ...>
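As a sketch, a header built from those proposed keys might look like this (all values invented):

    // Illustrative only: the proposed keys with placeholder values.
    var xAnalytics = 'tool=ExampleBot;toolver=1.2.0;contact=owner@example.org';
    // Attached to each request, e.g. $.ajax({ ..., headers: { 'X-Analytics': xAnalytics } })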
Bikeshedding? See also: https://wikitech.wikimedia.org/wiki/X-Analytics
Hi again,
The solutions discussed here seem to be quite a bit more general than what I was thinking about. Of course it would be nice to have a uniform, cross-client way to indicate tools in any MediaWiki web service or API, but this is a slightly bigger (and probably more long-term) goal than what I had in mind. It is a good idea to suggest a standard approach to tool developers and to have a documentation page on that, but it would take some time before enough tools adopt it for this to work.
For our present task, we just need some more signals we can use. Analysing SPARQL queries requires us to parse them anyway, so comments are fine. In general, the data we are looking at has a lot of noise, so we cannot rely on a single field. We will combine user agents, X-Analytics, query comments, and also query shapes (if you get 1M+ similar-looking queries in one hour, you know it's a bot). With the current data, the query shape is often our main clue, so comments would already be a big step forward.
Best,
Markus
Mmm... there are several things here that are already taken care of by our user agent policy. For example, if you are using a bot or automated tool, we already ask you to please include "bot" in the user agent, plus contact info.
Please see: https://meta.wikimedia.org/wiki/User-Agent_policy
Now, we do not keep this information long term; after 60 days it gets deleted.
X-Analytics is used for bits of info of analytics value, and the contact info of a tool developer doesn't seem to be one of those. Can we backtrack a little bit? What is the goal of this project? To keep a tally of who is querying the Wikidata Query Service? Anything else?
Thanks,
Nuria
We already track use of the action API. Combine with this?
https://www.mediawiki.org/wiki/Wikimedia_Reading_Infrastructure_team/Action_...
-Toby
Hi Nuria and others,
For context: Stas and I are points of contact in the WMF for Markus et al.'s project. That's why I'm commenting here. :)
* The project and its goals at the proposal level are described at https://meta.wikimedia.org/wiki/Research:Understanding_Wikidata_Queries .
* As Markus said, they are not looking for global solutions; they're trying to increase the signal in the data, and comments seem to be one natural and relatively cheap place to begin with, given that query owners can add them once they're aware of this conversation, and that already helps.
* I suggest that we move discussions about possible changes to the X-Analytics header to a new thread, if there is a need for it (long term or short term), given that we don't need those changes for this research, at least for now.
Thanks, Leila
Hi!
> Proposed required values (semicolon separated):
> * tool=<name of the tool>
> * toolver=<version of the tool>
> * contact=<some way of contacting you, e.g. @twitter, email@example.com, +1.212.555.1234, ...>
I'd rather have a URL there, and on that page you can write whatever you want. That also solves problems with information being out of date, etc.
I think we can also merge tool and version: if there's a version, just put it into the tool name :)