Re: [Analytics] [Wikidata] SPARQL power users and developers

Yuri Astrakhan

3 Oct 2016

3:32 a.m.

I would highly recommend using the X-Analytics header for this, and establishing "well-known" key name(s). X-Analytics gets parsed into key-value pairs (an object field) by our Varnish/Hadoop infrastructure, whereas the user agent is basically a semi-free-form text string. Also, the user agent cannot be set by JavaScript clients running in a browser, so we would constantly have to perform two types of analysis: one for requests that came from a "backend" and one for requests made by a browser.
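To make the key-value idea concrete, here is a minimal JavaScript sketch of splitting such a header into fields. It assumes the value is a semicolon-separated list of `key=value` pairs; `parseXAnalytics` and the sample keys and values are hypothetical and are not the actual Varnish/Hadoop code.

```javascript
// Hypothetical helper: split an X-Analytics style string into an object.
// Assumption: the value is a list of "key=value" pairs separated by ';'.
function parseXAnalytics(header) {
  const fields = {};
  for (const part of header.split(';')) {
    const trimmed = part.trim();
    if (trimmed === '') continue;        // skip empty segments
    const eq = trimmed.indexOf('=');
    if (eq === -1) continue;             // skip segments without '='
    // Only the first '=' separates key from value.
    fields[trimmed.slice(0, eq)] = trimmed.slice(eq + 1);
  }
  return fields;
}

// Made-up sample value, for illustration only.
const parsed = parseXAnalytics('tool=example-tool; lang=en');
console.log(parsed);
```

A free-form User-Agent string has no comparable structure, so extracting a tool name from it would need per-tool pattern matching instead.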

On Sun, Oct 2, 2016 at 4:28 PM Stas Malyshev smalyshev@wikimedia.org wrote:

...

Hi!

...
I'll try to throw in a #TOOL: comment where I can remember using SPARQL, but I'll be bound to forget a few...

Thanks, though using distinct User-Agent may be easier for analysis, since those are stored as separate fields, and doing operations on separate field would be much easier than extracting comments from query field e.g. when doing Hive data processing.

-- Stas Malyshev smalyshev@wikimedia.org

Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata

Nuria Ruiz

3 Oct

5:07 a.m.

New subject: [Wikidata] SPARQL power users and developers

Yuri/Stas:

This thread is missing some background on what the issues are. If you could forward it, that would be great.

...

> Thanks, though using distinct User-Agent may be easier for analysis, since those are stored as separate fields, and doing operations on separate field would be much easier than extracting comments from query field e.g. when doing Hive data processing.

X-Analytics is a separate field in our Hive data, and we like it when information intended for analytics is put there. Please see the docs: https://wikitech.wikimedia.org/wiki/X-Analytics

Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics

Stas Malyshev

5:40 a.m.

Hi!

...

> This thread is missing some background context info as to what the issues are, if you could forward it it will be great.

Well, I'm not talking about specific issues, except for the general need of identifying which tool is responsible for which queries. Basically, there are several ways of doing it:

1. Adding comments to the query itself
2. Adding query parameters
3. Adding query headers, specifically:
   a) distinct User-Agent
   b) distinct X-Analytics header
   c) custom headers

I think that 3a is good for statistics purposes, though 1 could be more efficient when we need to find out who sent a particular query. 3b may be superior to 3a, but I admit I don't know enough about it :)
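Option 1 can be sketched in a few lines of JavaScript. The `#TOOL:` prefix follows the comment convention mentioned earlier in this thread; `tagQuery`, the tool name, and the sample query are hypothetical.

```javascript
// Hypothetical helper: prepend a "#TOOL:" comment line to a SPARQL query.
// In SPARQL, '#' outside an IRI or string starts a comment that runs to
// the end of the line, so the added line does not change the query.
function tagQuery(toolName, sparql) {
  // Collapse line breaks so the name cannot spill out of the comment line.
  const safeName = toolName.replace(/[\r\n]+/g, ' ');
  return `#TOOL: ${safeName}\n${sparql}`;
}

// Made-up tool name and query, for illustration only.
const tagged = tagQuery(
  'example-tool',
  'SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 1'
);
console.log(tagged);
```

Because the tag travels inside the query text, it works from any client, including browser code that cannot change its User-Agent.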

-- Stas Malyshev smalyshev@wikimedia.org

Guillaume Lederrey

4:41 p.m.

I'm a bit late to the discussion, but still...

I think that, as far as possible, metadata about a query should be passed via HTTP headers. This way, it is not coupled to SPARQL itself and can be analysed with generic tools already in place. Setting a user agent is a standard best practice and seems to be part of the MediaWiki API guidelines [1]. We should use the same guidelines; there is no reason to reinvent them.

The X-Analytics header might allow for more fine-grained information, but I'm not sure this is actually needed (and using X-Analytics should not preclude having a sensible user agent).

[1] https://www.mediawiki.org/wiki/API:Main_page#Identifying_your_client

-- Guillaume Lederrey Operations Engineer, Discovery Wikimedia Foundation UTC+2 / CEST

Magnus Manske

4:55 p.m.

Using custom HTTP headers would, of course, complicate calls for the tool authors (i.e., myself). $.ajax instead of $.get and all that. I would be less inclined to change to that.

Guillaume Lederrey

6:42 p.m.

On Mon, Oct 3, 2016 at 11:55 AM, Magnus Manske magnusmanske@googlemail.com wrote:

...

> Using custom HTTP headers would, of course, complicate calls for the tool authors (i.e., myself). $.ajax instead of $.get and all that. I would be less inclined to change to that.

Yes, the limitation of HTTP headers is that they make things a bit more complicated for tool authors. At the same time, it is a limitation that is already imposed on tool authors using the MediaWiki APIs. Having a specific way of doing things for WDQS increases the overall complexity of our infrastructure. As I am more involved in the general infrastructure and not only in WDQS, I am of course biased toward a globally standardized solution rather than a WDQS-specific one. I am not absolutely against having a WDQS-specific solution if it makes things sufficiently easier for tool authors; I just want to make sure we don't take this decision lightly...

-- Guillaume Lederrey Operations Engineer, Discovery Wikimedia Foundation UTC+2 / CEST

Stas Malyshev

4 Oct

11:45 a.m.

Hi!

...

> Using custom HTTP headers would, of course, complicate calls for the tool authors (i.e., myself). $.ajax instead of $.get and all that. I would be less inclined to change to that.

Yes, if you're using a browser, you probably can't change the user agent. In that case I guess we need to either use X-Analytics or put it in the query. Or maybe the Referer header would be fine then, since it is also recorded. If the Referer is distinct enough, it can be used.

-- Stas Malyshev smalyshev@wikimedia.org

Yuri Astrakhan

12:05 p.m.

For consistency between all possible clients, we seem to have only two options: either part of the query, or the X-Analytics header. The User-Agent header is not really an option because it is not available for all types of clients, and we want to have just one way for everyone. Headers other than X-Analytics will need custom handling, whereas we already have plenty of Varnish code to deal with the X-Analytics header and split it into parts, and for Hive to parse it. Yes, it will be an extra line of code in JS ($.ajax instead of $.get), but I am sure this is not such a big deal if we provide cookie-cutter code. Parsing the query string in Varnish/Hive is also complex extra work, so let's keep X-Analytics.

Proposed required values (semicolon separated):

* tool=<name of the tool>
* toolver=<version of the tool>
* contact=<some way of contacting you, e.g. @twitter, email@example.com, +1.212.555.1234, ...>

Bikeshedding? See also: https://wikitech.wikimedia.org/wiki/X-Analytics
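As a rough idea of what such cookie-cutter code could look like, here is a JavaScript sketch that builds the proposed header value and a settings object suitable for `$.ajax`. The helper names, the endpoint URL, the `format` parameter, and the sample values are assumptions for illustration. Whether the service accepts a client-supplied X-Analytics header on cross-origin requests is also an assumption.

```javascript
// Hypothetical helper: build the proposed "key=value;key=value" string
// from the three proposed keys, skipping any that are not provided.
function buildXAnalytics(info) {
  return ['tool', 'toolver', 'contact']
    .filter((key) => info[key] !== undefined)
    .map((key) => `${key}=${info[key]}`)
    .join(';');
}

// Hypothetical helper: settings object for jQuery's $.ajax, using its
// documented url, data, dataType and headers settings.
function buildAjaxSettings(sparql, info) {
  return {
    url: 'https://query.wikidata.org/sparql', // assumed endpoint
    data: { query: sparql, format: 'json' },  // assumed parameters
    dataType: 'json',
    headers: { 'X-Analytics': buildXAnalytics(info) },
  };
}

// Made-up values, for illustration only.
const settings = buildAjaxSettings(
  'SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 1',
  { tool: 'example-tool', toolver: '1.0', contact: 'email@example.com' }
);
console.log(settings.headers['X-Analytics']);
// prints "tool=example-tool;toolver=1.0;contact=email@example.com"

// In a page with jQuery loaded, the request itself would be:
// $.ajax(settings).done(function (result) { /* use result */ });
```

This sketch does not escape values, so a value containing a semicolon would break the format.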

Markus Kroetzsch

2:15 p.m.

Hi again,

The solutions discussed here seem to be quite a bit more general than what I was thinking about. Of course it would be nice to have a uniform, cross-client way to indicate tools in any MW Web service or API, but this is a slightly bigger (and probably more long-term) goal than what I had in mind. It is a good idea to suggest a standard approach to tool developers there and to have a documentation page on that, but it would take some time until this is adopted by enough tools to work.

For our present task, we just need some more signals we can use. Analysing SPARQL queries requires us to parse them anyway, so comments are fine. In general, the data we are looking at has a lot of noise, so we cannot rely on a single field. We will combine user agents, X-Analytics, query comments, and also query shapes (if you get 1M+ similar-looking queries in one hour, you know it's a bot). With the current data, the query shape is often our main clue, so comments would already be a big step forward.
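Two of these signals can be illustrated with a small JavaScript sketch: pulling a `#TOOL:` comment out of a query, and reducing a query to a crude "shape" so that similar-looking queries compare equal. Both helpers are hypothetical and regex-based, not a real SPARQL parser and not the method used in the project. The masking rules (Wikidata item IDs and double-quoted strings) are assumptions chosen for illustration.

```javascript
// Hypothetical helper: return the text after "#TOOL:" on the first
// matching comment line, or null if there is none.
function extractToolComment(query) {
  const match = /^[ \t]*#[ \t]*TOOL:[ \t]*(.*)$/m.exec(query);
  return match ? match[1].trim() : null;
}

// Hypothetical helper: a crude query "shape". It drops full-line
// comments, masks item IDs and simple string literals, and collapses
// whitespace. A '#' line inside a multi-line string would be
// misread as a comment.
function queryShape(query) {
  return query
    .replace(/^[ \t]*#.*$/gm, '')     // drop full-line comments
    .replace(/\bwd:Q\d+/g, 'wd:Q_')   // mask item IDs such as wd:Q42
    .replace(/"[^"]*"/g, '"_"')       // mask simple string literals
    .replace(/\s+/g, ' ')             // collapse whitespace
    .trim();
}

// Made-up queries, for illustration only.
const q1 = '#TOOL: example-tool\nSELECT ?a WHERE { ?a wdt:P31 wd:Q5 }';
const q2 = 'SELECT ?a  WHERE { ?a wdt:P31 wd:Q42 }';
console.log(extractToolComment(q1));
console.log(queryShape(q1) === queryShape(q2));
```

Counting requests per shape over a time window would then give the kind of "many similar queries in one hour" signal described above.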

Best,

Markus

Nuria Ruiz

9:56 p.m.

Hmm... Several things here are already covered by our User-Agent policy. For example, if you are using a bot or automated tool, we already ask you to include "bot" in the user agent, plus contact info.

Please see: https://meta.wikimedia.org/wiki/User-Agent_policy

Note that we do not keep this information long term; it gets deleted after 60 days.

X-Analytics is used for bits of info of analytics value, and the contact info of a tool developer doesn't seem to be one of those. Can we backtrack a little bit? What is the goal of this project? To keep a tally of who is querying the Wikidata Query Service? Anything else?

Thanks,

Nuria

Toby Negrin

10:47 p.m.

We already track use of the action API. Combine with this?

https://www.mediawiki.org/wiki/Wikimedia_Reading_Infrastructure_team/Action_...

-Toby

Leila Zia

11:51 p.m.

Hi Nuria and others,

For context: Stas and I are points of contact in the WMF for Markus et al.'s project. That's why I'm commenting here. :)

* The project and its goals at the proposal level are described at https://meta.wikimedia.org/wiki/Research:Understanding_Wikidata_Queries .

* As Markus said, they are not looking for global solutions. They are trying to increase the signal in the data, and comments seem to be a natural and relatively cheap place to begin: query owners can add them if they're aware of this conversation, and that already helps.

* I suggest that we move discussions about possible changes to the X-Analytics header to a new thread, if there is a need for it (long term or short term), given that we don't need those changes for this research, at least for now.

Thanks, Leila

Stas Malyshev

5 Oct

2:41 a.m.

Hi!

...

> X-Analytics. Proposed required values (semicolon separated):
>
> tool=<name of the tool>
> toolver=<version of the tool>
> contact=<some way of contacting you, e.g. @twitter, email@example.com, +1.212.555.1234, ...>

I'd rather have the URL there, and on that page you can write whatever you want. Also solves problems with information being out-of-date, etc.

I think we can also merge tool and version: if there's a version, just put it into the tool name :)

-- Stas Malyshev smalyshev@wikimedia.org
