Hi all,
For those using Hive on stat1002/1004: you might have seen a deprecation
warning when you launch the hive client, saying that it is being replaced
by Beeline. The Beeline shell has always been available to use, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
script
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual and launching `beeline`.
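For the curious, the idea behind such a wrapper is simple: prepend the JDBC connection string so users just type `beeline`. A rough sketch in Python follows; the connection URL is a placeholder, not the actual cluster configuration, and the real wrapper may work differently.

```python
# Sketch of a beeline wrapper that fills in the connection string.
# JDBC_URL below is a placeholder, not the real cluster endpoint.
import sys

JDBC_URL = "jdbc:hive2://hive-server.example.org:10000/default"  # placeholder

def beeline_argv(extra_args):
    """Build the beeline command line with the connection string filled in."""
    return ["beeline", "-u", JDBC_URL] + list(extra_args)

# A real wrapper would exec beeline (os.execvp); here we just show
# the resulting command line.
print(" ".join(beeline_argv(sys.argv[1:])))
```

Running the wrapper with `-e "show tables;"` would then pass the query through to beeline along with the preset connection string.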
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
If you run into any issues using this interface, please ping us on the
Analytics list or #wikimedia-analytics or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering stat1004 whaaat - there should be an announcement
coming up about it soon!)
Best,
--Madhu :)
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770 <http://dx.doi.org/10.6084/m9.figshare.1305770>
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
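As a minimal sketch of the first two use cases above, here is how one might tally the most common referers to a given article from the released TSV. The column names (prev_title, curr_title, n) are assumptions based on the dataset description, not the authoritative schema, and the sample rows are illustrative.

```python
# Tally the most common referers to an article from clickstream TSV data.
# Column names and sample rows are illustrative assumptions.
import csv
import io
from collections import Counter

# In place of the real download, a few sample rows in the assumed format.
sample_tsv = """prev_title\tcurr_title\tn
other-google\tLondon\t1000
Hannibal\tLondon\t257
other-empty\tLondon\t734
London\tRiver_Thames\t500
"""

def top_referers(tsv_file, article, k=3):
    """Return the k most common (referer, count) pairs for an article."""
    counts = Counter()
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        if row["curr_title"] == article:
            counts[row["prev_title"]] += int(row["n"])
    return counts.most_common(k)

print(top_referers(io.StringIO(sample_tsv), "London"))
# [('other-google', 1000), ('other-empty', 734), ('Hannibal', 257)]
```

Swapping the filter to `prev_title` gives the reverse question (which links people click on from a given article), and the full (prev, curr, n) table is exactly the transition-count matrix needed for the Markov chain use case.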
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream <https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream>
Ellery and Dario
Hi all,
the webrequest and pageview_hourly tables on Hive contain the very
useful user_agent_map field, which stores the following data extracted
from the raw user agent (still available as a separate field):
device_family, browser_family, browser_major, os_family, os_major,
os_minor and wmf_app_version. (The Analytics Engineering team has
built a dashboard that uses this data and last month published a
popular blog post about it.) I understand it is mainly based on the
ua-parser library (http://www.uaparser.org/).
In contrast, the event capsule in our EventLogging tables only
contains the raw, unparsed user agent.
* Does anyone on this list have experience in parsing user agents in
EventLogging data for the purpose of detecting browser family, version
etc, and would like to share advice on how to do this most
efficiently? (In the past, I have written some expressions in MySQL to
extract the app version number for the Wikipedia apps. But it seems a
bit of a pain to do that for classifying browsers in general. One
option would be to export the data and use the Python version of
ua-parser, however doing it directly in MySQL would fit better into
existing workflows.)
* Assuming it is technically possible to add such a pre-parsed
user_agent_map field to the EventLogging tables, would other analysts
be interested in using it too?
This came up recently with the Reading web team, for the purpose of
investigating whether certain issues are caused by certain browsers
only. But I imagine it has arisen in other places as well.
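For anyone facing the same problem before a pre-parsed field exists, here is a simplified, regex-based sketch of classifying a raw user agent into a browser family and major version. It is similar in spirit to what ua-parser does but covers only a few common browsers; it is an illustration, not a substitute for the ua-parser rule set.

```python
# Simplified browser classification from a raw user agent string.
# Only a handful of families; not a replacement for ua-parser.
import re

# Order matters: Chrome's UA contains "Safari", and Edge's contains both.
BROWSER_PATTERNS = [
    ("Edge", re.compile(r"Edge/(\d+)")),
    ("Chrome", re.compile(r"Chrome/(\d+)")),
    ("Firefox", re.compile(r"Firefox/(\d+)")),
    ("Safari", re.compile(r"Version/(\d+).*Safari/")),
]

def classify_browser(user_agent):
    """Return (browser_family, browser_major) or ('Other', None)."""
    for family, pattern in BROWSER_PATTERNS:
        match = pattern.search(user_agent)
        if match:
            return family, int(match.group(1))
    return "Other", None

ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36")
print(classify_browser(ua))  # ('Chrome', 52)
```

The same ordered-pattern approach could in principle be expressed as nested CASE/REGEXP expressions in MySQL, though as noted above that quickly becomes painful to maintain.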
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB
Dear Analytics mailing list,
I am working, along with Issa Rice (cc'ed) on an analysis of changes to
Wikipedia pageviews since December 2007, when pageview statistics first
started being maintained. To help with our analysis, we collected key
events related to changes to user experience on the site as well as to
statistics availability and measurement. We've recorded our findings on this
page in Issa's userspace:
https://en.wikipedia.org/wiki/User:Riceissa/Timeline_of_Wikipedia_analytics
I'd appreciate it if you could highlight:
(a) Factual errors in the material currently in the timeline
(b) Missing events that you think should belong in the timeline, with
regards to the availability of statistics as well as any other events that
affected user experience significantly.
In addition, I had the following question: in the Wikimedia per-article
pageviews API
https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_per_article_project_access_agent_article_granularity_start_end,
where do Wikipedia Zero pageviews get recorded? Do they go under
mobile-web, or mobile-app, or neither?
https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageviews
<https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview> gives some
information on how the underlying pageviews are recorded in the pageviews
dataset, but I wasn't clear on how the pageview REST API processes that
data.
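For reference, the endpoint in question takes project, access, agent, article, granularity, and start/end as path segments. A small helper for building such a request URL is sketched below; the segment order follows the REST API documentation linked above, but please verify against the current docs before relying on it.

```python
# Build a per-article pageviews API request URL from its path segments.
# Segment order is taken from the REST API docs; verify before relying on it.
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def pageviews_url(project, access, agent, article, granularity, start, end):
    """Build the request URL; the article title must be URL-encoded."""
    return "/".join([
        BASE, project, access, agent,
        quote(article, safe=""), granularity, start, end,
    ])

url = pageviews_url("en.wikipedia", "mobile-web", "user",
                    "Albert Einstein", "daily", "20160101", "20160131")
print(url)
```

The `access` segment is where the mobile-web / mobile-app distinction from the question is expressed, which is why knowing where Wikipedia Zero traffic lands matters for interpreting the results.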
Thank you!
Vipul
NOTE: We don't intend to move the page to Wikipedia's main space, as we
know it won't meet the notability criterion. The user space was just a
convenient place to store it while taking advantage of MediaWiki's syntax
and Wikipedia's templates.
Hello Analytics,
Wikipedia’s search function exposes several modifiers (
https://www.mediawiki.org/wiki/Help:CirrusSearch)
On the recent German Wikicon there was a workshop on search and several
community members seemed to be enthusiastic about these functions.
I wonder whether there is existing information about the current use of such
queries. I did some research, but could not find much.
Such information could help to improve the search function, since sometimes
a few modifiers are heavily used (despite them being hard to access) and
could e.g. be exposed via the user interface.
Jan
--
Jan Dittrich
UX Design/ User Research
Wikimedia Deutschland e.V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Phone: +49 (0)30 219 158 26-0
http://wikimedia.de
Imagine a world, in which every single human being can freely share in the
sum of all knowledge. That‘s our commitment.
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
Körperschaften I Berlin, Steuernummer 27/029/42207.
Hello,
First of all, many thanks for this wonderful project!
I am writing as I downloaded the July pagecounts data from:
https://dumps.wikimedia.org/other/pagecounts-raw/2016/2016-07/
As I was browsing it, I was surprised to notice that some entities, such as
the movie "Suicide Squad", only seem to have gotten very sparse views in
July - see below. In comparison, the view counts for Jason Bourne seem to
have been much higher over the same period. Below are lines from the logs.
Am I doing something wrong?
Many thanks in advance,
Gheorghe
*Suicide Squad*:
(pagecounts-20160727-020000.gz,en Suicide_squad_(film) 1 6614)
(pagecounts-20160727-160000.gz,en Suicide_squad_(film) 1 25599)
(pagecounts-20160728-220000.gz,en Suicide_squad_(film) 2 32210)
(pagecounts-20160731-210000.gz,en Suicide_squad_(film) 11 72721)
*Jason Bourne*:
(pagecounts-20160731-210000.gz,sv Jason_Bourne_(film) 12 124894)
(pagecounts-20160731-210000.gz,tr Jason_Bourne_(film) 78 1852192)
(pagecounts-20160731-220000.gz,en File:Jason_Bourne_(film).jpg 2 19067)
(pagecounts-20160731-220000.gz,en Jason_Bourne_(film) 2119 73275075)
(pagecounts-20160731-220000.gz,en Talk:Jason_Bourne_(film) 1 10059)
(pagecounts-20160731-220000.gz,fr Jason_Bourne_(film) 55 1226127)
(pagecounts-20160731-220000.gz,hu Jason_Bourne_(film) 3 34335)
(pagecounts-20160731-220000.gz,it Jason_Bourne_(film) 29 579129)
(pagecounts-20160731-220000.gz,nl Jason_Bourne_(film) 11 125928)
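One thing worth checking here: pagecounts entries are keyed by the exact title string, so "Suicide_squad_(film)" and "Suicide_Squad_(film)" are counted as separate pages. A minimal parsing sketch follows; each line has the format "project page_title count_of_requests total_bytes", and the count values in the sample below are illustrative, not real data.

```python
# Parse pagecounts-raw lines ("project title count bytes") and sum
# request counts per exact title. Sample counts are illustrative only.
from collections import Counter

sample_lines = [
    "en Suicide_squad_(film) 11 72721",
    "en Suicide_Squad_(film) 5012 61234567",   # note: different case, separate entry
    "sv Jason_Bourne_(film) 12 124894",
]

def aggregate_counts(lines, project="en"):
    """Sum request counts per title for one project."""
    totals = Counter()
    for line in lines:
        proj, title, count, _bytes = line.split(" ", 3)
        if proj == project:
            totals[title] += int(count)
    return totals

totals = aggregate_counts(sample_lines)
print(totals["Suicide_squad_(film)"])  # 11
```

Because the two capitalizations are distinct keys, querying only one spelling can make a heavily viewed article appear to have sparse traffic.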
Hi,
I was wondering where the famous (or rather infamous) 2010 survey is? This
one was conducted by the WMF and showed that women made up less than 13% of WP
contributors (mentioned here
<https://meta.wikimedia.org/wiki/Women_and_Wikimedia_Survey_2011>). I can't
find it anywhere. Help is much appreciated. :)
Best,
Reem
--
*Kind regards,*
*Reem Al-Kashif*
Hi Everyone,
The next Research Showcase will be live-streamed this Wednesday, September
21, 2016 at 11:30 AM PST (18:30 UTC).
YouTube stream: https://www.youtube.com/watch?v=fTDkVeqjw80
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here
<https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#September_2016>.
This month's showcase includes:
Finding News Citations for Wikipedia
By *Besnik Fetahu <http://www.l3s.de/~fetahu/> (Leibniz University of
Hannover)*
An important
editing policy in Wikipedia is to provide citations for added statements in
Wikipedia pages, where statements can be arbitrary pieces of text, ranging
from a sentence to a paragraph. In many cases citations are either outdated
or missing altogether. In this work we address the problem of finding and
updating news citations for statements in entity pages. We propose a two-
stage supervised approach for this problem. In the first step, we construct
a classifier to find out whether statements need a news citation or other
kinds of citations (web, book, journal, etc.). In the second step, we
develop a news citation algorithm for Wikipedia statements, which
recommends appropriate citations from a given news collection. Apart from
IR techniques that use the statement to query the news collection, we also
formalize three properties of an appropriate citation, namely: (i) the
citation should entail the Wikipedia statement, (ii) the statement should
be central to the citation, and (iii) the citation should be from an
authoritative source. We perform an extensive evaluation of both steps,
using 20 million articles from a real-world news collection. Our results
are quite promising, and show that we can perform this task with high
precision and at scale.
Designing and Building Online Discussion Systems
By *Amy X. Zhang <http://people.csail.mit.edu/axz/> (MIT)*
Today, conversations are
everywhere on the Internet and come in many different forms. However, there
are still many problems with discussion interfaces today. In my talk, I
will first give an overview of some of the problems with discussion
systems, including difficulty dealing with large scales, which exacerbates
additional problems with navigating deep threads containing lots of
back-and-forth and getting an overall summary of a discussion. Other
problems include dealing with moderation and harassment in discussion
systems and gaining control over filtering, customization, and means of
access. Then I will focus on a few projects I am working on in this space
now. The first is Wikum, a system I developed to allow users to
collaboratively generate a wiki-like summary from threaded discussion. The
second, which I have just begun, is exploring the design space of
presentation and navigation of threaded discussion. I will next discuss
Murmur, a mailing list hybrid system we have built to implement and test
ideas around filtering, customization, and flexibility of access, as well
as combating harassment. Finally, I'll wrap up with what I am working on at
Google Research this summer: developing a taxonomy to describe online forum
discussion and using this information to extract meaningful content useful
for search, summarization of discussions, and characterization of
communities.
Hope to see you there!
Sarah R. Rodlund
Senior Project Coordinator-Engineering, Wikimedia Foundation
srodlund(a)wikimedia.org
Hi everybody,
the Analytics team is going to reboot all the stat hosts (stat1002,
stat1003 and stat1004) and the Hadoop cluster nodes to install new kernels
(security upgrade required). The work will start tomorrow morning (Sep
22nd) at around 9:00 AM CEST.
This work might interfere with ongoing Hadoop jobs or with processes running
on the stat* hosts, so please let me know if there is any reason to postpone
the maintenance.
Please also feel free to reach out to the analytics IRC channel or to me
directly if you have more questions :)
Thanks!
Regards,
Luca