Hive users on stat1002/1004: you may have seen a deprecation warning when
you launch the hive client, saying that it is being replaced by Beeline.
The Beeline shell has always been available to use, but it required
supplying a database connection string every time, which was pretty
annoying. We now have a wrapper set up to make this easier. The old Hive
CLI will continue to exist, but we encourage moving over to Beeline. You
can use it by logging into the stat1002/1004 boxes as usual and launching
`beeline`.
There is some documentation on this here:
If you run into any issues using this interface, please ping us on the
Analytics list or #wikimedia-analytics, or file a bug on Phabricator.
(If you are wondering stat1004 whaaat - there should be an announcement
coming up about it soon!)
Page Previews is now fully deployed to all but 2 of the Wikipedias. In
deploying it, we've created a new way to interact with pages without
navigating to them. This impacts the overall and per-page pageviews metrics
that are used in myriad reports, e.g. to editors about the readership of
their articles and in monthly reports to the board. Consequently, we need
to be able to report a user reading the preview of a page just as we
report them navigating to it.
Readers Web are planning to instrument Page Previews such that when a
preview is available and open for longer than X ms, a "page interaction" is
recorded. We're aware of a couple of mechanisms for recording something
like this from the client:
1. All files viewed with the media viewer are recorded by the client
requesting the /beacon/media?duration=X&uri=Y URL at some point – as
Nuria points out in that thread, requests to /beacon/... are already
filtered and a canned response is sent immediately by Varnish.
2. Requesting a URL with the X-Analytics header set to "preview". In
this context, we'd make a HEAD request to the URL of the page with the
IMO #1 is preferable from the operations and performance perspectives as
the response is always served from the edge and includes very few headers,
whereas the request in #2 may be served by the application servers if the
user is logged in (or in the mobile site's beta cohort). However, the
requests in #2 are already
We're currently considering recording page interactions when previews are
open for longer than 1000 ms. We estimate that this would increase overall
web requests by 0.3%.
Are there other ways of recording this information? We're fairly confident
that #1 is the best choice here, but it's referred to as the
"virtual file view hack". Is this really the case? Moreover, should we
request a distinct URL, e.g. /beacon/preview?duration=X&uri=Y, or should we
consolidate the URLs, since both represent essentially the same thing?
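
To make option #1 concrete, here is a rough Python sketch of the client-side
decision, not the actual instrumentation. It assumes the distinct
/beacon/preview URL floated above and the 1000 ms threshold under
consideration; the function name and the example page URL are hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed threshold: the value currently under consideration in this thread.
PREVIEW_THRESHOLD_MS = 1000


def preview_beacon_url(page_uri: str, open_ms: int,
                       threshold_ms: int = PREVIEW_THRESHOLD_MS) -> Optional[str]:
    """Return the beacon URL to request, or None if no interaction is recorded.

    A "page interaction" only counts when the preview was open for longer
    than the threshold.
    """
    if open_ms <= threshold_ms:
        return None
    # /beacon/preview is a hypothetical endpoint mirroring /beacon/media.
    query = urlencode({"duration": open_ms, "uri": page_uri})
    return "/beacon/preview?" + query


if __name__ == "__main__":
    print(preview_beacon_url("https://en.wikipedia.org/wiki/Cat", 1500))
```

Consolidating the URLs instead would only change the path in the last line
of the function; the threshold check stays the same.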
IRC (Freenode): phuedx
As part of a lecture on Information Retrieval that I am giving, we work a
lot with Simple Wikipedia articles. It's a great data set because it's
comprehensive and not domain-specific, so when building search on top of it
humans can easily judge result quality, and it's still small enough to be
handled by a regular computer.
This year I want to cover the topic of Machine Learning for search. The
idea is to look at result clicks from an internal search engine,
feed that into the Machine Learning and adjust search accordingly so that
the top-clicked results actually rank best. We will be using Solr LTR for
I would love to base this on Simple Wikipedia data since it would fit well
into the rest of the lecture. Unfortunately, I could not find that data.
The closest I came is
https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream but this
neither covers Simple Wikipedia nor specifies internal search queries.
Did I miss something? Is this data available somewhere? Can I produce it
myself from raw data? Ideally I would need (query-document) pairs with the
number of occurrences.
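
To illustrate the shape of data I mean, here is a minimal Python sketch that
turns raw click events into (query, document) pairs with counts. The input
format (one tuple per click) and the function name are assumptions for
illustration only; I do not know what the real logs look like.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_query_document_pairs(
    clicks: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str, int]]:
    """Aggregate (query, clicked document) events into pairs with counts."""
    # Normalise queries lightly so "Cat" and "cat " count as the same query.
    counts = Counter(
        (query.strip().lower(), document) for query, document in clicks
    )
    # Most frequent pairs first; ties broken alphabetically for stable output.
    return sorted(
        ((q, d, n) for (q, d), n in counts.items()),
        key=lambda row: (-row[2], row[0], row[1]),
    )


if __name__ == "__main__":
    sample = [("Cat", "Cat"), ("cat ", "Cat"), ("cat", "Felidae")]
    for row in count_query_document_pairs(sample):
        print(row)
```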
*Georg M. Sorst | CTO*
The Analytics team would like to announce that we have migrated the
reportcard to a new domain:
The migrated reportcard includes both legacy and current pageview data,
daily unique devices and new editors data. Pageview and devices data is
updated daily but editor data is still updated ad-hoc.
The team is currently revamping the way we compute edit data, and we hope
to be able to provide monthly updates for the main edit metrics this
quarter. Some of those will be visible in the reportcard, but the new
wikistats will have more detailed reports.
You can follow the new wikistats project here:
I recently spoke with "Next Big Sound", a company that tracks Wikipedia
page views for certain artists. They informed me that they get details of
the views directly from Wikipedia (I had emailed them because the view
counts shown on Wikipedia and on Next Big Sound show a major discrepancy).
There are rumors that the only information gathered is from desktop views,
for which the counts are extremely similar. Is there any way you can
confirm this? Or is there another counting method used for the data given
to other companies that collect views? I know you have no idea what Next
Big Sound is presenting to its worldwide audience, but I wanted to know if
you can explain what data is given to Next Big Sound. Thank you.
Dear Analytics Team,
I am an M.Sc. student at Copenhagen Business School. For my Master's thesis I would like to use page view data for certain Wikipedia articles. I found out that in July 2015 a new API was created which delivers this data. However, for my project I have to use data from before 2015.
In my further search I found out that the old page view data exists (https://dumps.wikimedia.org/other/pagecounts-raw/) and that until March 2017 it could be queried using stats.grok.se. Unfortunately, this site no longer exists, which is why I cannot filter and query the raw data in .gz format on the webpage.
Is there any way to get the page view data for certain articles from before July 2015?
Thanks a lot and best regards,
PS: I am conducting my research in R and for the post 2015 data the package “pageviews” works great.
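
For reference, filtering the raw files locally could look roughly like the Python sketch below. It assumes each line of an hourly pagecounts-raw file has four space-separated fields (project, page title, request count, bytes) and that titles appear as they do in URLs; the function names are hypothetical and the file path is whatever you downloaded.

```python
import gzip
from typing import Dict, Iterable, Set


def count_views(lines: Iterable[str], project: str, titles: Set[str]) -> Dict[str, int]:
    """Sum request counts for the given titles from 'project title count bytes' lines."""
    totals = {title: 0 for title in titles}
    for line in lines:
        parts = line.split()
        if len(parts) != 4 or not parts[2].isdigit():
            continue  # skip lines that do not match the assumed format
        line_project, title, count, _size = parts
        if line_project == project and title in totals:
            totals[title] += int(count)
    return totals


def count_views_in_file(path: str, project: str, titles: Set[str]) -> Dict[str, int]:
    """Read one downloaded hourly .gz file and filter it without unpacking to disk."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_views(handle, project, titles)
```

Each file covers one hour, so daily or monthly totals would mean calling `count_views_in_file` once per downloaded file and adding up the results.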
The next Research Showcase will be live-streamed this Wednesday, February
21, 2018 at 11:30 AM (PST) 19:30 UTC.
YouTube stream: https://www.youtube.com/watch?v=fpmRWCE7F_I
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here
This month's presentation:
*Visual enrichment of collaborative knowledge bases*
By Miriam Redi, Wikimedia Foundation
Images allow us to explain, enrich and complement knowledge without
language barriers. They can help illustrate the content of an item in a
language-agnostic way to external data consumers. Images can be extremely
helpful in multilingual collaborative knowledge bases such as Wikidata.
However, a large proportion of Wikidata items lack images. More than 3.6M
Wikidata items are about humans (Q5), but only 17% of them have an image
associated with them. Only 2.2M of 40 Million Wikidata items have an image.
A wider presence of images in such a rich, cross-lingual repository could
enable a more complete representation of human knowledge.
In this talk, we will discuss challenges and opportunities faced when using
machine learning and computer vision tools for the visual enrichment of
collaborative knowledge bases. We will share research to help Wikidata
contributors make Wikidata more “visual” by recommending high-quality
Commons images to Wikidata items. We will show the first results on
free-licence image quality scoring and recommendation and discuss future
work in this direction.
Van Hook, Steven R. "Modes and models for transcending cultural
differences in international classrooms." Journal of Research in
International Education 10.1 (2011): 5-27.
*Backlogs—backlogs everywhere: Using machine classification to clean up the
new page backlog*
By Aaron Halfaker, Wikimedia Foundation
If there's one insight that I've had about the functioning of Wikipedia and
other wiki-based online communities, it's that eventually self-directed
work breaks down and some form of organization becomes important for task
routing. In Wikipedia specifically, the notion of "backlogs" has become
dominant. There are backlogs of articles to create, articles to clean up,
articles to assess, new editor contributions to review, manual of style
rules to apply, etc. To a community of people working on a backlog, the
state of that backlog has deep effects on their emotional well-being. A
backlog that only grows is frustrating and exhausting.
Backlogs aren't inevitable, though, and there are many shapes that backlogs
can take. In my presentation, I'll tell a story about how English
Wikipedia editors defined a process and set of roles that formed a backlog
around new page creations. I'll make the argument that this formalization
of quality control practices has created a choke point and that
alternatives exist. Finally, I'll present a vision for such an alternative
using models that we have developed for ORES, the open machine prediction
service my team maintains.
Sarah R. Rodlund
Senior Project Coordinator-Product & Technology, Wikimedia Foundation