For all Hive users on stat1002/1004: you might have seen a deprecation
warning when launching the hive client, saying it is being replaced
by Beeline. The Beeline shell has always been available, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. To use it, log into the
stat1002/1004 boxes as usual and launch `beeline`.
There is some documentation on this here:
If you run into any issues using this interface, please ping us on the
Analytics list or in #wikimedia-analytics, or file a bug on Phabricator.
(If you are wondering "stat1004, what?" - there should be an announcement
coming up about it soon!)
I'm a PhD student studying mathematical models to improve the hit ratio
of web caches. In my research community, we lack realistic data
sets and frequently rely on outdated modelling assumptions.
Previously (~2007), a trace containing 10% of user requests issued to
Wikipedia was publicly released. This data set has been widely used
for performance evaluations of new caching algorithms, e.g., for
the new Caffeine caching framework for Java.
I would like to ask for your comments about compiling a similar
(updated) data set and making it public.
In my understanding, the necessary logs are readily available, e.g., in
the Analytics/Data/Mobile requests stream on stat1002, with a
sampling rate of 1:100. As this request stream contains sensitive data
(e.g., client IPs), it would need anonymization before making it public.
I would be glad to help with that.
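To make that concrete, anonymization could be as simple as dropping the client IP before release. A minimal sketch in Python (the field names and values below are assumptions for illustration, not the actual schema of the request stream):

```python
# Hypothetical raw log record; the real stream's fields may differ.
raw = {
    "client_ip": "203.0.113.7",          # sensitive: must not be published
    "timestamp": "2016-05-27T06:16:07Z",
    "url": "https://en.wikipedia.org/wiki/Main_Page",
    "response_size": 63533,
}

def anonymize(record):
    """Drop direct identifiers, keeping only fields safe to publish.

    A salted hash of the IP could preserve per-client request ordering,
    but since even hashed IPs carry re-identification risk, the field
    is simply removed here.
    """
    return {k: v for k, v in record.items() if k != "client_ip"}

print(anonymize(raw))
```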
The previously released data set contains no client information. It
contains 1) a counter, 2) a timestamp, 3) the URL, and 4) an update
flag. I would additionally suggest including 5) the cache's hostname,
6) the cache_status, and 7) the response size (from the Wikimedia caches).
I believe this format would preserve anonymity, and would be interesting
for many researchers.
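To make the proposed format concrete, a trace line with the seven suggested fields might be parsed like this. A sketch only: the delimiter, field order, and sample values are all assumptions for illustration:

```python
from collections import namedtuple

# One record per request, with the seven fields proposed above.
TraceRecord = namedtuple(
    "TraceRecord",
    ["counter", "timestamp", "url", "update_flag",
     "cache_host", "cache_status", "response_size"],
)

def parse_line(line):
    """Parse one tab-separated trace line into a TraceRecord."""
    counter, ts, url, flag, host, status, size = line.rstrip("\n").split("\t")
    return TraceRecord(int(counter), float(ts), url, flag == "1",
                       host, status, int(size))

# Hypothetical sample line in the proposed format.
sample = "42\t1464354967.0\thttps://en.wikipedia.org/wiki/Cache\t0\tcp1065\thit\t18342"
print(parse_line(sample))
```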
Let me know your thoughts.
Just a reminder: we will be deprecating the pagecounts datasets at the end
of May, as we mentioned earlier this year. This means the existing files will
remain available to researchers, but new files will not be generated
in the future.
*Pagecounts datasets that will be deprecated*
Options for switching to the new datasets:
- pageviews, for the same format but better-quality data
- pagecounts-ez, for compressed data
Hello Wikimedia analytics mailing list,
As part of research into how people read Wikipedia, a friend and I created
a short survey. We are interested in seeing how people on this mailing list
(not a representative sample of Wikipedia readers, for sure!) fill out the
survey. The survey should take 2 to 10 minutes to complete.
I would also appreciate it if any of you are able to circulate the
survey to a different audience. If you are interested in doing that, please
let me know (off-list, if you prefer) and I will give you a separate URL
for each such audience. The URLs identify which audience the survey was
shared with, which makes it easier to understand how responses differ
by audience.
Any feedback on the survey questions would also be appreciated, on- or off-list.
Thank you very much!
Jeremiah Lewis / Business Analyst /// Skype: jpsl91
From: Analytics <analytics-bounces(a)lists.wikimedia.org> on behalf of analytics-request(a)lists.wikimedia.org <analytics-request(a)lists.wikimedia.org>
Sent: Friday, May 27, 2016 18:12
Subject: Analytics Digest, Vol 51, Issue 22
Send Analytics mailing list submissions to
To subscribe or unsubscribe via the World Wide Web, visit
or, via email, send a message with subject or body 'help' to
You can reach the person managing the list at
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Analytics digest..."
1. vital signs (Toby Negrin)
2. Re: vital signs (Dmitry Brant)
3. Re: vital signs (Dan Andreescu)
4. Re: vital signs (Nuria Ruiz)
5. Re: vital signs (Jonas Augusto)
Date: Fri, 27 May 2016 06:16:07 -0700
From: Toby Negrin <tnegrin(a)wikimedia.org>
To: "A mailing list for the Analytics Team at WMF and everybody who
has an interest in Wikipedia and analytics."
Subject: [Analytics] vital signs
Content-Type: text/plain; charset="utf-8"
I can't seem to get the page views report from vital signs to render:
Other reports are working fine. Nothing urgent, just an FYI.
A few minutes ago dbstore1002 (I think you know it better as
analytics-store) was forced into unscheduled maintenance, a.k.a.
"it crashed and I am trying to give it first aid".
Please use db1047 (analytics-slave?) for now, if you can.
I will follow up with a status update once I know more.
Sorry for the inconvenience,
For a project we are trying to build an automatic analytical data
extraction script similar to BaGLAMa.
The BaGLAMa tool gives information about all media in a certain category.
We cannot figure out how BaGLAMa collects the filenames of all files within
a category. Does anyone know which dump or API this can be retrieved from?
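For what it's worth, one standard way to list every file in a category is the MediaWiki API's `list=categorymembers` module with `cmtype=file` (I can't confirm this is what BaGLAMa itself uses). A sketch that builds such a query against Commons; the category name is just an example:

```python
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"

def category_files_query(category, cmcontinue=None):
    """Build the URL for listing files in a category via the
    MediaWiki API's list=categorymembers module."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "file",    # return only File: pages
        "cmlimit": "500",    # maximum page size for anonymous clients
        "format": "json",
    }
    if cmcontinue:           # pagination token from the previous response
        params["cmcontinue"] = cmcontinue
    return API + "?" + urlencode(params)

print(category_files_query("Images_from_the_Rijksmuseum"))
```

Fetching the printed URL (e.g. with urllib.request) returns JSON whose `query.categorymembers` entries carry each file's title and pageid; the response's `continue.cmcontinue` token feeds the next call until the category is exhausted.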