Adding analytics@, a public e-mail list where you can post questions such as this one.
that doesn’t tell us how often entities are accessed through Special:EntityData or wbgetclaims
Does this data already exist, even in the form of raw access logs?
Is this data always requested via http from an api endpoint that will hit a varnish cache? (Daniel can probably answer this)
From what I see in our data, we have requests like the following:
www.wikidata.org /w/api.php ?action=wbgetclaims&format=json&entity=Q633155
www.wikidata.org /w/api.php ?callback=jQuery11130020702992017004984_1465195743367&format=json&action=wbgetclaims&property=P373&entity=Q5296&_=1465195743368
www.wikidata.org /w/api.php ?action=wbgetclaims&format=json&entity=Q573612
www.wikidata.org /w/api.php ?action=wbgetclaims&format=json&entity=Q472729
www.wikidata.org /w/api.php ?action=wbgetclaims&format=json&entity=Q349797
www.wikidata.org /w/api.php ?action=compare&torev=344163911&fromrev=344163907&format=json
www.wikidata.org /w/api.php ?action=wbgetentities&format=xml&ids=Q2356135
www.wikidata.org /w/api.php ?action=wbgetentities&format=xml&ids=Q2355988
www.wikidata.org /w/api.php ?action=compare&torev=344164023&fromrev=344163948&format=json
If the data you are interested in can be inferred from these requests there is no additional data gathering needed.
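(For illustration only, not an existing tool: a minimal Python sketch of how per-entity access counts could be pulled out of request lines like the ones above. The sample URLs and the helper name are made up for the example; the real logs expose host, path and query string as separate fields.)

from urllib.parse import urlparse, parse_qs
from collections import Counter

SAMPLE_REQUESTS = [
    "https://www.wikidata.org/w/api.php?action=wbgetclaims&format=json&entity=Q633155",
    "https://www.wikidata.org/w/api.php?action=wbgetentities&format=xml&ids=Q2356135",
]

def entity_ids(url):
    # Yield the Q-ids referenced by a wbgetclaims / wbgetentities request.
    params = parse_qs(urlparse(url).query)
    for key in ("entity", "ids"):          # 'ids' may carry several values, pipe-separated
        for value in params.get(key, []):
            for qid in value.split("|"):
                if qid.startswith("Q"):
                    yield qid

counts = Counter(qid for url in SAMPLE_REQUESTS for qid in entity_ids(url))
print(counts)  # Counter({'Q633155': 1, 'Q2356135': 1})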
If not, what effort would be required to gather this data? For the purposes of my proposal to the U.S. Census Bureau I am estimating around six weeks of effort for this for one person working full-time. If it will take more time I will need to know.
I think I have mentioned this before on an e-mail thread, but without knowing the details of what you want to do we cannot give you a time estimate. What are the exact metrics you are interested in? Is the project described anywhere on Meta?
Thanks,
Nuria
On Thu, Jun 30, 2016 at 11:45 AM, James Hare james@hxstrategy.com wrote:
Copying Lydia Pintscher and Daniel Kinzler (with whom I’ve discussed this very topic).
I am interested in metrics that describe how Wikidata is used. While we do have views on individual pages, that doesn’t tell us how often entities are accessed through Special:EntityData or wbgetclaims. Nor does it tell us how often statements/RDF triples show up in the Wikidata Query Service. Does this data already exist, even in the form of raw access logs? If not, what effort would be required to gather this data? For the purposes of my proposal to the U.S. Census Bureau I am estimating around six weeks of effort for this for one person working full-time. If it will take more time I will need to know.
Thank you, James Hare
On Thursday, June 2, 2016 at 2:18 PM, Nuria Ruiz wrote:
James:
My current operating assumption is that it would take one person, working on a full time basis, around six weeks to go from raw access logs to a functioning API that would provide information on how many times a Wikidata entity was accessed through the various APIs and the query service. Do you believe this to be an accurate level of effort estimation based on your experience with past projects of this nature?
You are starting from the assumption that we do have the data you are interested in in the logs, which I am not sure is the case. Have you done your checks in this regard with the Wikidata developers?
Analytics 'automagically' collects data from logs about *page* requests; any other request collection (and it seems that yours fits this scenario) needs to be instrumented. I would send an e-mail to the analytics@ public list and the Wikidata folks to ask about how to harvest the data you are interested in. It doesn't sound like it is being collected at this time, so your project scope might be quite a bit bigger than you think.
Thanks,
Nuria
On Thu, Jun 2, 2016 at 5:06 AM, James Hare james@hxstrategy.com wrote:
Hello Nuria,
I am currently developing a proposal for the U.S. Census Bureau to integrate their datasets with Wikidata. As part of this, I am interested in getting Wikidata usage metrics beyond the page view data currently available. My concern is that the page views API gives you information only on how many times a *page* is accessed – but Wikidata is not really used in this way. More often, it is the case that Wikidata’s information is accessed through the API endpoints (wbgetclaims etc.), through Special:EntityData, and through the Wikidata Query Service. If we have information on usage through those mechanisms, that would give me much better information on Wikidata’s usage.
To the extent these metrics are important to my prospective client, I am willing to provide in-kind support to the analytics team to make this information available, including expenses associated with the NDA process (I understand that such a person may need to deal with raw access logs that include PII.) My current operating assumption is that it would take one person, working on a full time basis, around six weeks to go from raw access logs to a functioning API that would provide information on how many times a Wikidata entity was accessed through the various APIs and the query service. Do you believe this to be an accurate level of effort estimation based on your experience with past projects of this nature?
Please let me know if you have any questions. I am happy to discuss my idea with you further.
Regards, James Hare
On 01.07.2016 at 01:42, Nuria Ruiz wrote:
Is this data always requested via http from an api endpoint that will hit a varnish cache? (Daniel can probably answer this)
Yes. Special:EntityData is a regular special page, and action=wbgetentities is a regular MW web API request, as your example shows.
If the data you are interested in can be inferred from these requests there is no additional data gathering needed.
Yay!
Nor does it tell us how often statements/RDF triples show up in the Wikidata Query Service.
I'm no expert on the query service, adding Stas for that. As far as I know, SPARQL queries go through Varnish directly to BlazeGraph. In any case, they are not processed by MediaWiki at all. Tracking how often an entity is mentioned in a GET request to the SPARQL service should be possible based on the varnish request logs, with a bit of regex magic. POST requests are more tricky, I suppose.
However, I don't think we are logging the contents of responses at all. I suppose that would have to be built into BlazeGraph somehow. And even if we did that, it would only tell us which entities were present in a result, not which entities were used to answer a query. E.g. if you list all instances of a class (including subclasses), the entities representing the classes are essential to answering the query, but they are not present in the result (and only the top-most class is present in the query).
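(As an illustration of the "regex magic" mentioned above, and not an existing pipeline: a small Python sketch that pulls the wd:Q... items out of the query= parameter of a GET request to the SPARQL endpoint. The example URL and helper name are made up; as noted, POST bodies and result contents would not be covered this way.)

import re
from urllib.parse import urlparse, parse_qs

ENTITY_RE = re.compile(r"\bwd:(Q\d+)\b")

def entities_in_sparql_request(url):
    # Return the set of wd:Q... items mentioned in a SPARQL GET request.
    # parse_qs already percent-decodes the query= parameter for us.
    query = parse_qs(urlparse(url).query).get("query", [""])[0]
    return set(ENTITY_RE.findall(query))

# Example: "all instances of Q146". Only items literally written in the query
# text show up this way; items that merely appear in the result do not.
url = ("https://query.wikidata.org/sparql?format=json&query="
       "SELECT%20%3Fitem%20WHERE%20%7B%20%3Fitem%20wdt%3AP31%20wd%3AQ146%20%7D")
print(entities_in_sparql_request(url))  # {'Q146'}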
POST requests are more tricky, I suppose.
FYI, we do not have POST data, nor responses to either GET or POST requests; we just store URLs and HTTP status codes for both GET and POST. Thus the body of a POST is also not available.
However, I don't think we are logging the contents of responses at all. I suppose that would have to be built into BlazeGraph somehow.
You can instrument code to report responses into the cluster, just like the search team does. Depending on how easy it is to fit in the instrumenting code, that can be a little or a lot of work. The MediaWiki API is also doing similar "custom" reporting.
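(A purely hypothetical sketch of what such "custom" reporting could look like: after answering a query, emit one event per entity seen in the result so it can be aggregated later. The Kafka transport, broker address, topic name and event schema are assumptions for illustration, not a description of how the search team or the MediaWiki API actually report.)

import json
import time
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",          # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def report_result_entities(entity_ids):
    # Send one usage event per entity that appeared in a query result.
    for qid in entity_ids:
        producer.send("wdqs_result_entities", {   # hypothetical topic name
            "entity": qid,
            "ts": int(time.time()),
        })
    producer.flush()

report_result_entities(["Q5296", "Q633155"])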
James: I think before asking for a time estimate we would need more detail on your end as to what metrics you are interested in measuring. If you could describe your project on Meta, that would be best. Just in case you might not be familiar with Meta, this is an example of how research projects are described: https://meta.wikimedia.org/wiki/Research:HTTPS_Transition_and_Article_Censor...
On Fri, Jul 1, 2016 at 1:33 AM, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
On 01.07.2016 at 01:42, Nuria Ruiz wrote:
Is this data always requested via http from an api endpoint that will hit a varnish cache? (Daniel can probably answer this)
Yes. Special:EntityData is a regular special page, and action=wbgetentities is a regular MW web API request, as your example shows.
If the data you are interested in can be inferred from these requests there is no additional data gathering needed.
Yay!
Nor does it tell us how often statements/RDF triples show up in the Wikidata Query Service.
I'm no expert on the query service, adding Stas for that. As far as I know, SPARQL queries go through Varnish directly to BlazeGraph. In any case, they are not processed by MediaWiki at all. Tracking how often an entity is mentioned in a GET request to the SPARQL service should be possible based on the varnish request logs, with a bit of regex magic. POST requests are more tricky, I suppose.
However, I don't think we are logging the contents of responses at all. I suppose that would have to be built into BlazeGraph somehow. And even if we did that, it would only tell us which entities were present in a result, not which entities were used to answer a query. E.g. if you list all instances of a class (including subclasses), the entities representing the classes are essential to answering the query, but they are not present in the result (and only the top-most class is present in the query).
--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.
On Fri, Jul 1, 2016 at 6:47 AM, Nuria Ruiz nuria@wikimedia.org wrote:
POST requests are more tricky, I suppose.
FYI, we do not have POST data, nor responses to either GET or POST requests; we just store URLs and HTTP status codes for both GET and POST. Thus the body of a POST is also not available.
For requests to api.php that hit the backend API servers, we have the ApiAction dataset in Hadoop [0], which includes detailed data on the request parameters. There is also a refined dataset based on the raw ApiAction data in the 'bd808' database [1] that may or may not be easier to work with. The ETL for that refined data needs to be converted to Oozie jobs and moved to the 'wmf' database [2], but for now I have some ad hoc scripting running on stat1002 that updates it daily.
SELECT SUM(viewcount) as views
FROM bd808.action_action_hourly
WHERE year = 2016
  AND month = 6
  AND action = 'wbgetclaims';
Total MapReduce CPU Time Spent: 6 minutes 52 seconds 920 msec
OK
views
5146909
Time taken: 111.763 seconds, Fetched: 1 row(s)
[0]: https://wikitech.wikimedia.org/wiki/Analytics/Data/ApiAction
[1]: https://phabricator.wikimedia.org/T116065#2151185
[2]: https://phabricator.wikimedia.org/T137321
Bryan