(cc-ing analytics public list)
Fundraising folks:
We were talking about the problems we have had with clickstream data and
Kafka as of late, and how to prevent issues like this one going forward:
https://phabricator.wikimedia.org/T132500
We think you guys could benefit from setting up the same set of
data-integrity alarms that we have on the webrequest end, and we will be
happy to help with that at your convenience.
An example of how these alarms could work (simplified version): every
message that comes from Kafka has a sequence ID. When sorted, those
sequence IDs should be more or less contiguous; a gap in sequence IDs
indicates data loss at the Kafka source. A script checks the sequence IDs
against the number of records and triggers an alarm if the two do not match.
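A minimal sketch of such a check (hedged: the "sequence_id" field name and
the threshold are made up here, and real webrequest data tracks sequences
per producer host, which this sketch ignores):

def check_sequence_gaps(records, max_loss_rate=0.01):
    # records: messages from one Kafka source, each carrying a
    # hypothetical "sequence_id" field.
    seqs = sorted(r["sequence_id"] for r in records)
    if not seqs:
        return
    expected = seqs[-1] - seqs[0] + 1    # IDs we should have seen
    missing = expected - len(set(seqs))  # gaps imply loss at the source
    loss_rate = missing / expected
    if loss_rate > max_loss_rate:
        raise RuntimeError(
            "possible data loss: %d of %d messages missing (%.2f%%)"
            % (missing, expected, 100 * loss_rate))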
Let us know if you want to proceed with this work.
Thanks,
Nuria
Hi Marc,
On Tue, Jun 28, 2016 at 6:36 AM, Marc Miquel <marcmiquel(a)gmail.com> wrote:
> Since this would be for a research project that I might seek funding for, I
> would like to know if I could count on that, what the nature of the
> available data is, what the procedure to obtain it would be, and whether
> there would be any implications because of privacy concerns.
>
We grant access to webrequest log data and its non-public derivatives only
infrequently. When we do, we do it by setting up formal collaborations with
the researchers. What these collaborations are and how we set them up are
explained at
https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations.
To provide more context:
Requiring formal collaborations as a necessary step for accessing the data
means that we cannot scale rapidly, i.e., each researcher on our team is
only able to be involved in so many of them. The practical cap is somewhere
around 3 collaborations per researcher, in my experience. We understand that
this is a problem, as we would like more researchers to work with this data.
We frequently reconsider ways to expand our capacity to collaborate. We
also always consider releasing more datasets publicly since, ultimately,
that's one of the best ways for us to empower others to do what they want to
work on and find value in.
Best,
Leila
> Thank you very much!
>
> Best,
>
> Marc Miquel
See also https://phabricator.wikimedia.org/T119352, which is proposing to
track time on site / page in general.
On Jul 1, 2016 4:24 PM, "Marcel Ruiz Forns" <mforns(a)wikimedia.org> wrote:
If we were doing this internally, a possibility would be to instrument
MediaWiki and send sampled events with the time on page to EventLogging.
This would not be retroactive, though; we would have to wait a couple of
months to collect significant data. In any case, I'm not sure whether this
would be possible under an NDA.
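For illustration only, the downstream aggregation could look roughly like
this (the schema and its page_id/enter_ts/leave_ts fields are hypothetical;
nothing like this exists yet):

from collections import defaultdict
from statistics import median

def median_time_on_page(events):
    # events: sampled EventLogging-style records with hypothetical
    # page_id, enter_ts and leave_ts (epoch seconds) fields.
    durations = defaultdict(list)
    for e in events:
        durations[e["page_id"]].append(e["leave_ts"] - e["enter_ts"])
    return {page: median(d) for page, d in durations.items()}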
On Fri, Jul 1, 2016 at 11:52 AM, Marc Miquel <marcmiquel(a)gmail.com> wrote:
> I see it is quite complicated to work with this data. It is a pity,
> considering that valuable insights could be derived from readers' behavior. I
> will think about what can be useful for the study.
>
> Thanks for the answers, Nuria and Marcel! :)
> Cheers,
>
> Marc
>
> On Thu, 30 June 2016 at 14:16, Marcel Ruiz Forns (<mforns(a)wikimedia.org>)
> wrote:
>
>> Marc, I also see what Nuria says. Also, please consider that the majority
>> of Wikipedia sessions have only one pageview, so in the majority of
>> sessions it would not be possible to approximate the time spent on a page
>> from session boundaries, as in Joseph's alternative.
>>
>> On Thu, Jun 30, 2016 at 2:02 PM, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
>>
>>> >Aye, as Joseph says, the time-on-page or time-leaving is not
>>> >collected, except as an extension of session reconstruction work. If you
>>> >want a concrete time, you're not gonna get it.
>>>
>>> I was about to make the same point: the data set that will most closely
>>> answer your questions is the one Oliver mentioned. Otherwise, we do not
>>> keep any information related to time on site and page requests, so there
>>> is no "approximation" possible that will work on overall data. Even if you
>>> calculate signatures with IP hash + user agent to approximate users (a
>>> method with known issues), there is no way for you to distinguish someone
>>> reading a page for an hour from someone who came to Wikipedia twice in the
>>> same hour and spent a minute each time. Hopefully my example makes things
>>> clearer.
>>>
>>> Thanks,
>>>
>>> Nuria
>>>
>>> On Wed, Jun 29, 2016 at 4:58 AM, Oliver Keyes <ironholds(a)gmail.com>
>>> wrote:
>>>
>>>> Aye, as Joseph says, the time-on-page or time-leaving is not collected,
>>>> except as an extension of session reconstruction work. If you want a
>>>> concrete time, you're not gonna get it.
>>>>
>>>> While PC-based data is more reliable than mobile, that does not
>>>> necessarily mean "reliable". I'm sort of confused, I guess, as to why the
>>>> datasets I linked (unless I'm misremembering them?) don't help: you would
>>>> have to do the calculation yourself but they should contain all the data
>>>> necessary to make that calculation (unless you want to have the pageID or
>>>> title associated with the time-on-page, in which case...yeah, that's an
>>>> issue).
>>>>
>>>> On Wed, Jun 29, 2016 at 3:16 AM, Marc Miquel <marcmiquel(a)gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks for the answer, Oliver. But I am not sure it answers my
>>>>> questions. I'd like to study aspects like how much time is spent on
>>>>> certain pages, as a proxy for how content is approached/read/understood.
>>>>> I'd be happy with the time of entering the page and the time of leaving.
>>>>> This is not entirely centered on 'user activity', but I said that because
>>>>> I imagined the data would be stored in a similar way to editor sessions,
>>>>> or in a database, and I would need to do the time calculations.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Marc
>>>>>
>>>>>
>>>>> On Wed, 29 June 2016 at 03:11, Oliver Keyes <ironholds(a)gmail.com>
>>>>> wrote:
>>>>>
>>>>>> If historic data is okay, there's already a dataset released (
>>>>>> https://figshare.com/articles/Activity_Sessions_datasets/1291033)
>>>>>> that was designed specifically to answer questions around how to best
>>>>>> calculate session length with regards to Wikipedia (
>>>>>> http://arxiv.org/abs/1411.2878)
>>>>>>
>>>>>> On Tue, Jun 28, 2016 at 3:42 PM, Marc Miquel <marcmiquel(a)gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello!
>>>>>>>
>>>>>>> I was thinking about user sessions, yes, so this would mean
>>>>>>> aggregating the pageviews visited by a user during a short amount of
>>>>>>> time (I should check the cutoff, but it could be around an hour or less).
>>>>>>>
>>>>>>> I am particularly interested in understanding the order in which
>>>>>>> pages are seen (start, end), duration, etc.
>>>>>>> I wouldn't need data from a long period either, but I think data
>>>>>>> from multiple languages would be helpful.
>>>>>>>
>>>>>>> I imagined reader data could be privacy-sensitive, but would an
>>>>>>> NDA with my university and some sort of data encoding help with this?
>>>>>>> As I said, it is for a scientific purpose.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Marc
>>>>>>>
>>>>>>> On Tue, 28 June 2016 at 21:09, Nuria Ruiz (<nuria(a)wikimedia.org>)
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Hello!
>>>>>>>>
>>>>>>>> >I am considering studying reader engagement for different article
>>>>>>>> >topics in different languages. Because of this, I would like to know
>>>>>>>> >if there is any plan to make available pageview dumps detailing
>>>>>>>> >activity logs at session level per user - in a similar way to editor
>>>>>>>> >sessions.
>>>>>>>>
>>>>>>>> Are you thinking of "all-pageviews-visited-by-a-certain-user"? If
>>>>>>>> so, no: we do not have any projects to provide that data, as due to
>>>>>>>> privacy concerns we neither have nor keep that information.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Nuria
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jun 28, 2016 at 6:55 PM, Leila Zia <leila(a)wikimedia.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> + Analytics
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jun 28, 2016 at 6:36 AM, Marc Miquel <marcmiquel(a)gmail.com
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> I have a question for you regarding pageview data dumps.
>>>>>>>>>>
>>>>>>>>>> I am considering studying reader engagement for different article
>>>>>>>>>> topics in different languages. Because of this, I would like to know
>>>>>>>>>> if there is any plan to make available pageview dumps detailing
>>>>>>>>>> activity logs at session level per user - in a similar way to editor
>>>>>>>>>> sessions.
>>>>>>>>>>
>>>>>>>>>> Since this would be for a research project that I might seek
>>>>>>>>>> funding for, I would like to know if I could count on that, what the
>>>>>>>>>> nature of the available data is, what the procedure to obtain it
>>>>>>>>>> would be, and whether there would be any implications because of
>>>>>>>>>> privacy concerns.
>>>>>>>>>>
>>>>>>>>>> Thank you very much!
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>>
>>>>>>>>>> Marc Miquel
>> --
>> *Marcel Ruiz Forns*
>> Analytics Developer
>> Wikimedia Foundation
--
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
+ Analytics
On Tue, Jun 28, 2016 at 6:36 AM, Marc Miquel <marcmiquel(a)gmail.com> wrote:
> Hello,
>
> I have a question for you regarding pageview data dumps.
>
> I am considering studying reader engagement for different article topics
> in different languages. Because of this, I would like to know if there is
> any plan to make available pageview dumps detailing activity logs at
> session level per user - in a similar way to editor sessions.
>
> Since this would be for a research project that I might seek funding for, I
> would like to know if I could count on that, what the nature of the
> available data is, what the procedure to obtain it would be, and whether
> there would be any implications because of privacy concerns.
>
> Thank you very much!
>
> Best,
>
> Marc Miquel
Adding analytics@, a public e-mail list where you can post questions such
as this one.
>that doesn’t tell us how often entities are accessed through
>Special:EntityData or wbgetclaims
>Does this data already exist, even in the form of raw access logs?
Is this data always requested via HTTP from an API endpoint that will hit a
Varnish cache? (Daniel can probably answer this.)
From what I see in our data, we have requests like the following:
www.wikidata.org /w/api.php ?action=wbgetclaims&format=json&entity=Q633155
www.wikidata.org /w/api.php ?callback=jQuery11130020702992017004984_1465195743367&format=json&action=wbgetclaims&property=P373&entity=Q5296&_=1465195743368
www.wikidata.org /w/api.php ?action=wbgetclaims&format=json&entity=Q573612
www.wikidata.org /w/api.php ?action=wbgetclaims&format=json&entity=Q472729
www.wikidata.org /w/api.php ?action=wbgetclaims&format=json&entity=Q349797
www.wikidata.org /w/api.php ?action=compare&torev=344163911&fromrev=344163907&format=json
www.wikidata.org /w/api.php ?action=wbgetentities&format=xml&ids=Q2356135
www.wikidata.org /w/api.php ?action=wbgetentities&format=xml&ids=Q2355988
www.wikidata.org /w/api.php ?action=compare&torev=344164023&fromrev=344163948&format=json
If the data you are interested in can be inferred from these requests, there
is no additional data gathering needed.
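As a sketch of what that inference could look like (assuming you can pull
out the raw query strings; this helper is illustrative, not an existing
tool):

from collections import Counter
from urllib.parse import parse_qs

ENTITY_ACTIONS = {"wbgetclaims", "wbgetentities"}

def count_entity_accesses(uri_queries):
    # uri_queries: the "?action=..." strings from requests like the above
    counts = Counter()
    for q in uri_queries:
        params = parse_qs(q.lstrip("?"))
        if params.get("action", [""])[0] not in ENTITY_ACTIONS:
            continue
        # wbgetclaims uses entity=Qxxx; wbgetentities uses ids=Qxxx|Qyyy
        for value in params.get("entity", []) + params.get("ids", []):
            for qid in value.split("|"):
                counts[qid] += 1
    return counts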
>If not, what effort would be required to gather this data? For the
>purposes of my proposal to the U.S. Census Bureau I am estimating around
>six weeks of effort for this for one person working full-time. If it will
>take more time I will need to know.
I think I have mentioned this before on an e-mail thread, but without
knowing the details of what you want to do we cannot give you a time
estimate. What are the exact metrics you are interested in? Is the
project described anywhere on Meta?
Thanks,
Nuria
On Thu, Jun 30, 2016 at 11:45 AM, James Hare <james(a)hxstrategy.com> wrote:
> Copying Lydia Pintscher and Daniel Kinzler (with whom I’ve discussed this
> very topic).
>
> I am interested in metrics that describe how Wikidata is used. While we do
> have views on individual pages, that doesn’t tell us how often entities are
> accessed through Special:EntityData or wbgetclaims. Nor does it tell us how
> often statements/RDF triples show up in the Wikidata Query Service. Does
> this data already exist, even in the form of raw access logs? If not, what
> effort would be required to gather this data? For the purposes of my
> proposal to the U.S. Census Bureau I am estimating around six weeks of
> effort for this for one person working full-time. If it will take more time
> I will need to know.
>
>
> Thank you,
> James Hare
>
> On Thursday, June 2, 2016 at 2:18 PM, Nuria Ruiz wrote:
>
> James:
>
> >My current operating assumption is that it would take one person,
> >working on a full time basis, around six weeks to go from raw access logs
> >to a functioning API that would provide information on how many times a
> >Wikidata entity was accessed through the various APIs and the query
> >service. Do you believe this to be an accurate level of effort estimation
> >based on your experience with past projects of this nature?
> You are starting from the assumption that we do have the data you are
> interested in in the logs, which I am not sure is the case. Have you done
> your checks in this regard with Wikidata developers?
>
> Analytics 'automagically' collects data from logs about *page* requests;
> any other request collection (and it seems that yours fits this
> scenario) needs to be instrumented. I would send an e-mail to the analytics@
> public list and Wikidata folks to ask about how to harvest the data you are
> interested in; it doesn't sound like it is being collected at this time, so
> your project scope might be quite a bit bigger than you think.
>
> Thanks,
>
> Nuria
>
>
>
>
> On Thu, Jun 2, 2016 at 5:06 AM, James Hare <james(a)hxstrategy.com> wrote:
>
> Hello Nuria,
>
> I am currently developing a proposal for the U.S. Census Bureau to
> integrate their datasets with Wikidata. As part of this, I am interested in
> getting Wikidata usage metrics beyond the page view data currently
> available. My concern is that the page views API gives you information only
> on how many times a *page* is accessed – but Wikidata is not really used
> in this way. More often it is the case that Wikidata’s information is
> accessed through the API endpoints (wbgetclaims etc.), through
> Special:EntityData, and the Wikidata Query Service. If we have information
> on usage through those mechanisms, that would give me much better
> information on Wikidata’s usage.
>
> To the extent these metrics are important to my prospective client, I am
> willing to provide in-kind support to the analytics team to make this
> information available, including expenses associated with the NDA process
> (I understand that such a person may need to deal with raw access logs that
> include PII.) My current operating assumption is that it would take one
> person, working on a full time basis, around six weeks to go from raw
> access logs to a functioning API that would provide information on how many
> times a Wikidata entity was accessed through the various APIs and the query
> service. Do you believe this to be an accurate level of effort estimation
> based on your experience with past projects of this nature?
>
> Please let me know if you have any questions. I am happy to discuss my
> idea with you further.
>
>
> Regards,
> James Hare
>
>
>
>