Hi all,
If you use Hive on stat1002/1004, you might have seen a deprecation
warning when you launch the hive client, saying that it is being replaced
with Beeline. The Beeline shell has always been available, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper script
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual and launching `beeline`.
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
If you run into any issues using this interface, please ping us on the
Analytics list or #wikimedia-analytics or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering stat1004 whaaat - there should be an announcement
coming up about it soon!)
Best,
--Madhu :)
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
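As a rough illustration of the first and last uses above, here is a minimal Python sketch. It assumes the data has already been loaded as (referer, article, count) tuples; the sample rows, the referer label `external_search`, and the function names are hypothetical and are not taken from the dataset itself.

```python
from collections import defaultdict

# Hypothetical sample rows in (referer, article, count) form. The real
# dataset's column names and file layout are not assumed here.
rows = [
    ("London", "England", 30),
    ("London", "River_Thames", 10),
    ("England", "London", 50),
    ("external_search", "London", 60),
]


def transition_probabilities(rows):
    """Normalize counts per referer into outgoing transition probabilities."""
    totals = defaultdict(int)
    for referer, _article, count in rows:
        totals[referer] += count

    chain = defaultdict(dict)
    for referer, article, count in rows:
        # Accumulate in case the same (referer, article) pair appears twice.
        previous = chain[referer].get(article, 0.0)
        chain[referer][article] = previous + count / totals[referer]
    return dict(chain)


def top_links(rows, referer, n=3):
    """Return the n most frequently followed links out of one referer."""
    outgoing = [(article, count) for ref, article, count in rows if ref == referer]
    return sorted(outgoing, key=lambda pair: pair[1], reverse=True)[:n]


chain = transition_probabilities(rows)
print(chain["London"])
print(top_links(rows, "London"))
```

Each row of `chain` is a probability distribution over the next article, which is the transition table of a first-order Markov chain. It only describes requests that had a recorded referer.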
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hello All,
As you might know WMF has an Open Access Policy that requires all work that
they fund to be Open Access[1]. A strange consequence of this policy, that
I recently ran into, is that it requires researchers funded by grants to
publish OA -- but without providing any funding to do so. That is, I
recently completed an Individual Engagement Grant (IEG), part of whose
scope was explicitly to write a paper about the work[2], and when I wrote
to WMF to acquire funds for OA publishing, they confirmed that the paper
was under the OA mandate but indicated that funds were not available to pay
for OA publishing.
Has anyone else used WMF's Open Access Policy? What was your experience?
[1] https://wikimediafoundation.org/wiki/Open_access_policy
[2] https://meta.wikimedia.org/wiki/Grants:IEG/WIGI:_Wikipedia_Gender_Index#Act…
Make a great day,
Max Klein ‽ http://notconfusing.com/
Hello,
I have a question for you regarding pageview data dumps.
I am considering studying reader engagement with different article topics in
different languages. Because of this, I would like to know if there is any
plan to make available pageview dumps detailing the activity log at session
level per user, in a similar way to editor sessions.
Since this would be for a research project for which I might request funding,
I would like to know whether I could count on that, what the nature of the
available data is, what the procedure to obtain this data would be, and
whether there would be any implications because of privacy concerns.
Thank you very much!
Best,
Marc Miquel
Hi All:
Given the conversation about fees for publishing articles about Wikipedia in OA journals, I wanted to call your attention to a new journal we are starting, Wiki Studies http://wikistudies.org/
Wiki Studies is an interdisciplinary, open access, peer-reviewed journal focusing on the intersection of Wikipedia and higher education. We are interested in almost all of the same topics discussed on the research listserv and in the newsletter, including articles about pedagogical practices, epistemology, bias, mission, and reliability. We will not charge for submission or publication, and will offer open access to readers. We will host on Open Journal Systems.
We are just getting started. We are recruiting editors, and plan to have a presence at the upcoming Wiki Conference North America in San Diego 7-10 October 2016. We hope to publish our first volume in March of 2017, consisting of submissions received by 31 December 2016.
Comments, queries, and suggestions all welcome at cummings(a)olemiss.edu
Yours,
Bob Cummings
Yaron Koren has proposed to reopen the "Unacceptable behavior" section
(https://www.mediawiki.org/wiki/Talk:Code_of_Conduct/Draft#Suggested_change_…).
His perspective and mine are given on the talk page.
In brief:
* He disagrees with how "marginalized and otherwise underrepresented
groups" and "encouraged" are handled in the original text.
* I support the current text and process, and have explained why on the
talk page.
Thanks,
Matt Flaschen
Hello!
I was thinking about user sessions, yes, so this would mean aggregating
the pageviews made by a user during a short amount of time (I should check
the cutoff, but it could be around an hour or less).
I am particularly interested in understanding the order in which pages are
seen (start, end), duration, etc.
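To make the idea concrete, here is a minimal Python sketch of cutoff-based sessionization. It assumes events arrive as (user_key, timestamp_in_seconds, page) tuples; `user_key` is a hypothetical identifier, and whether any such key exists in the available data is a separate question. The one-hour cutoff is only the rough value mentioned above.

```python
from collections import defaultdict


def sessionize(events, cutoff_seconds=3600):
    """Group (user_key, timestamp, page) events into sessions.

    A new session starts whenever the gap since the same user's previous
    event exceeds cutoff_seconds.
    """
    by_user = defaultdict(list)
    for user_key, timestamp, page in events:
        by_user[user_key].append((timestamp, page))

    sessions = []
    for user_key, views in by_user.items():
        views.sort()  # order each user's views by timestamp
        current = [views[0]]
        for view in views[1:]:
            if view[0] - current[-1][0] > cutoff_seconds:
                sessions.append((user_key, current))
                current = [view]
            else:
                current.append(view)
        sessions.append((user_key, current))
    return sessions


def summarize(session):
    """Report page order, start, end and duration for one session."""
    user_key, views = session
    return {
        "user_key": user_key,
        "pages": [page for _timestamp, page in views],
        "start": views[0][0],
        "end": views[-1][0],
        "duration": views[-1][0] - views[0][0],
    }


# Hypothetical events, with timestamps in seconds.
events = [
    ("u1", 0, "Cat"),
    ("u1", 120, "Dog"),
    ("u1", 10000, "Fish"),
    ("u2", 50, "Tree"),
]
summaries = [summarize(s) for s in sessionize(events)]
for summary in summaries:
    print(summary)
```

The duration here is only the time between the first and last request of a session, so the time spent on the final page of each session is not captured.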
I wouldn't need data from a long period either, but I think data from
multiple languages would be helpful.
I imagined reader data could be sensitive to privacy, but would an NDA with
my university and some sort of data encoding help with this? As I said, it
is for a scientific purpose.
Thanks,
Marc
On Tue, 28 June 2016 at 21:09, Nuria Ruiz (<nuria(a)wikimedia.org>) wrote:
>
> Hello!
>
> > I am considering studying reader engagement with different article topics
> > in different languages. Because of this, I would like to know if there is
> > any plan to make available pageview dumps detailing the activity log at
> > session level per user, in a similar way to editor sessions.
>
> Are you thinking of "all-pageviews-visited-by-a-certain-user"? If so, no:
> we do not have any projects to provide that data because, due to privacy
> concerns, we neither have nor keep that information.
>
> Thanks,
>
> Nuria
>
>
>
> On Tue, Jun 28, 2016 at 6:55 PM, Leila Zia <leila(a)wikimedia.org> wrote:
>
>> + Analytics
>>
>>
>> On Tue, Jun 28, 2016 at 6:36 AM, Marc Miquel <marcmiquel(a)gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> I have a question for you regarding pageview data dumps.
>>>
>>> I am considering studying reader engagement with different article topics
>>> in different languages. Because of this, I would like to know if there is
>>> any plan to make available pageview dumps detailing the activity log at
>>> session level per user, in a similar way to editor sessions.
>>>
>>> Since this would be for a research project for which I might request
>>> funding, I would like to know whether I could count on that, what the
>>> nature of the available data is, what the procedure to obtain this data
>>> would be, and whether there would be any implications because of privacy
>>> concerns.
>>>
>>> Thank you very much!
>>>
>>> Best,
>>>
>>> Marc Miquel
>>>
>>> _______________________________________________
>>> Wiki-research-l mailing list
>>> Wiki-research-l(a)lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>>
>>>
>>
>> _______________________________________________
>> Analytics mailing list
>> Analytics(a)lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
Hi Joseph. Perhaps these approximations could already provide me with valuable
information. If it is possible to distinguish between mobile and PC visits,
then I could filter out the mobile visits and keep the more reliable PC-based
data. This is all I wanted to know for now to prepare my project. In case I
need to progress with it, I will contact you. Thank you very much for the
answer.
Cheers,
Marc
On Wed, 29 June 2016 at 10:24, Joseph Allemandou (<
jallemandou(a)wikimedia.org>) wrote:
> Hi Marc,
>
> The information you're after is not available in the data we collect, for
> at least two reasons:
>
> - We don't collect data that would allow detecting user sessions (no
> id-cookie or identifier)
> - We don't collect time spent on page
>
> Approximations could be made using fingerprinting techniques as a proxy
> for sessions (with a significant error on mobile due to IP pooling), and
> successive events as boundaries for time spent on page.
> These approximations would in any case need an NDA.
>
> Cheers
> Joseph
>
> On Wed, Jun 29, 2016 at 9:16 AM, Marc Miquel <marcmiquel(a)gmail.com> wrote:
>
>> Thanks for the answer, Oliver. But I am not sure it answers my
>> questions. I'd like to study aspects like how much time is spent on
>> certain pages, as a proxy for how content is approached/read/understood.
>> I'd be happy with the time of entering the page and the time of leaving.
>> This is not entirely centered on 'user activity', but I said that because
>> I imagined the data would be stored in a similar way to editor sessions,
>> or in a database, and I would need to do the time calculations.
>>
>> Cheers,
>>
>> Marc
>>
>>
>> On Wed, 29 June 2016 at 03:11, Oliver Keyes <ironholds(a)gmail.com> wrote:
>>
>>> If historic data is okay, there's already a dataset released
>>> (https://figshare.com/articles/Activity_Sessions_datasets/1291033) that
>>> was designed specifically to answer questions around how best to calculate
>>> session length with regard to Wikipedia (http://arxiv.org/abs/1411.2878).
>>>
>>> On Tue, Jun 28, 2016 at 3:42 PM, Marc Miquel <marcmiquel(a)gmail.com>
>>> wrote:
>>>
>>>> Hello!
>>>>
>>>> I was thinking about user sessions, yes, so this would mean
>>>> aggregating the pageviews made by a user during a short amount of time (I
>>>> should check the cutoff, but it could be around an hour or less).
>>>>
>>>> I am particularly interested in understanding the order in which pages
>>>> are seen (start, end), duration, etc.
>>>> I wouldn't need data from a long period either, but I think data from
>>>> multiple languages would be helpful.
>>>>
>>>> I imagined reader data could be sensitive to privacy, but would an NDA
>>>> with my university and some sort of data encoding help with this? As I
>>>> said, it is for a scientific purpose.
>>>>
>>>> Thanks,
>>>>
>>>> Marc
>
>
> --
> *Joseph Allemandou*
> Data Engineer @ Wikimedia Foundation
> IRC: joal
Hi everybody,
We’re preparing for the June 2016 research newsletter and looking for contributors. Please take a look at: https://etherpad.wikimedia.org/p/WRN201606 and add your name next to any paper you are interested in covering. Our target publication date is Saturday July 2 UTC although actual publication might happen several days later. As usual, short notes and one-paragraph reviews are most welcome.
Highlights from this month:
• This issue won’t be published before Saturday (May 28), possibly a bit later
• Case study in political user behavior on Wikipedia
• Combining syntactic patterns and Wikipedia's hierarchy of hyperlinks to extract meronym relations
• Crowdsourcing not all sourced by the crowd: An observation on the behavior of Wikipedia participants
• Customer relationship management practices in the online community – Wikipedia
• Determining the influence of reddit posts on wikipedia pageviews
• Digital History Meets Wikipedia: Analyzing Historical Persons in Wikipedia
• Enriching Wikidata with Frame Semantics
• Manipulating Google’s Knowledge Graph Box to Counter Biased Information Processing During an Online Search on Vaccination: Application of a Technological Debiasing Strategy
• Quality Assessment of Wikipedia Articles Without Feature Engineering
• The double power law in human collaboration behavior: The case of Wikipedia
• Visualizations of relationships among knowledge? Try WikiSeeker!
• Wikipedia traffic data and electoral prediction: towards theoretically informed models
If you have any questions about the format or process, feel free to get in touch off-list.
Masssly, Tilman Bayer and Dario Taraborelli
[1] http://meta.wikimedia.org/wiki/Research:Newsletter