Hi all,
For all Hive users on stat1002/1004: you might have seen a deprecation
warning when you launch the hive client, saying that it is being replaced
by Beeline. The Beeline shell has always been available to use, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper script
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual and launching `beeline`.
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
If you run into any issues using this interface, please ping us on the
Analytics list or in #wikimedia-analytics, or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering "stat1004, whaaat?" - there should be an announcement
about it coming up soon!)
Best,
--Madhu :)
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes (a short example follows the list):
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining what fraction of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
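As a minimal illustration of the first use case, here is a Python sketch. It assumes a tab-separated file with prev_title, curr_title, and n columns; the exact file name and column headers are documented with the dataset, so treat the ones below as placeholders.

import csv
from collections import Counter

def top_referers(path, article, k=10):
    # Count clicks per referer title for one target article.
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row["curr_title"] == article:
                counts[row["prev_title"]] += int(row["n"])
    return counts.most_common(k)

print(top_referers("2015_01_clickstream.tsv", "London"))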
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hi everyone,
I'm a PhD student studying mathematical models to improve the hit ratio
of web caches. In my research community, we lack realistic data
sets and frequently rely on outdated modelling assumptions.
Previously (~2007), a trace containing 10% of user requests issued to
Wikipedia was publicly released [1]. This data set has been used
widely for performance evaluations of new caching algorithms, e.g., for
the new Caffeine caching framework for Java [2].
I would like to ask for your comments about compiling a similar
(updated) data set and making it public.
In my understanding, the necessary logs are readily available, e.g., in
the Analytics/Data/Mobile requests stream [3] on stat1002, with a
sampling rate of 1:100. As this request stream contains sensitive data
(e.g., client IPs), it would need anonymization before being made public.
I would be glad to help with that.
The previously released data set [1] contains no client information. It
contains 1) a counter, 2) a timestamp, 3) the URL, and 4) an update
flag. I would additionally suggest including 5) the cache's hostname,
6) the cache_status, and 7) the response size (from the Wikimedia cache
log format).
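To make this concrete, here is a small Python sketch of how each anonymized record could be derived from a raw log entry; the raw field names below are only placeholders, since I do not know the exact schema of the Wikimedia cache logs.

from itertools import count

_counter = count(1)

def anonymize(raw):
    # Map one raw log entry (a dict) to the proposed 7-field record,
    # dropping the client IP and any other identifying fields.
    return (
        next(_counter),          # 1) counter
        raw["timestamp"],        # 2) timestamp
        raw["url"],              # 3) URL
        raw["is_update"],        # 4) update flag
        raw["cache_host"],       # 5) cache hostname
        raw["cache_status"],     # 6) cache status (hit/miss/pass)
        raw["response_size"],    # 7) response size in bytes
    )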
I believe this format would preserve anonymity and would be interesting
to many researchers.
Let me know your thoughts.
Thanks,
Daniel Berger
http://disco.cs.uni-kl.de/index.php/people/daniel-s-berger
[1] http://www.wikibench.eu/?page_id=60
[2] https://github.com/ben-manes/caffeine/wiki/Efficiency
[3]
https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
Just a reminder: we will be deprecating the pagecounts datasets at the end
of May, as we mentioned earlier this year [0]. This means the existing files
will remain available for researchers, but new files will not be generated
in the future.
*Pagecounts datasets that will be deprecated*
pagecounts-raw
pagecounts-all-sites
Options for switching to the new datasets [1]:
pageviews - same format, better quality data
pagecounts-ez - compressed data
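For anyone scripting against the new files, here is a minimal Python sketch of reading a single hourly pageviews file; the file name and column layout below are assumptions on our part, so please check the format notes under [1].

import gzip

def views_for(path, project, title):
    # Sum the view counts for one page in one hourly dump file.
    # Assumes space-separated lines: project page_title view_count byte_count
    total = 0
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split(" ")
            if len(parts) >= 3 and parts[0] == project and parts[1] == title:
                total += int(parts[2])
    return total

print(views_for("pageviews-20160501-000000.gz", "en.wikipedia", "Main_Page"))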
[0] https://lists.wikimedia.org/pipermail/analytics/2016-March/005060.html
[1] https://dumps.wikimedia.org/other/analytics/
Hello Wikimedia analytics mailing list,
As part of research into how people read Wikipedia, a friend and I created
a short survey. We are interested in seeing how people on this mailing list
(not a representative sample of Wikipedia readers, for sure!) fill out the
survey. The survey should take 2 to 10 minutes to complete.
https://www.surveymonkey.com/r/QBCCVFY
I would also appreciate it if any of you are able to circulate the
survey to a different audience. If you are interested in doing that, please
let me know (off-list, if you prefer) and I will give you a separate URL
through which to do so for each such audience. The URLs correspond to the
different audiences with whom the survey is shared, so that it is easier to
understand how responses differ by audience.
Any feedback on the survey questions would also be appreciated, on- or
off-thread.
Thank you very much!
Vipul
I can't seem to get the page views report from vital signs to render:
https://vital-signs.wmflabs.org/#projects=enwiki/metrics=Pageviews
Other reports are working fine. Nothing urgent, just an FYI.
-Toby
Hi all,
A few minutes ago dbstore1002 (I think you know it better as
analytics-store) was forced into unscheduled maintenance, a.k.a.
"it crashed and I am trying to give it first aid".
Please use db1047 (analytics-slave?) for now, if you can.
I will follow up with a status update once I know more.
Sorry for the inconvenience,
--
Jaime Crespo
<http://wikimedia.org>
Dear all,
For a project, we are trying to build an automatic analytical data
extraction script similar to BaGLAMa.
The BaGLAMa tool gives information about all media in a certain category.
We cannot find out how BaGLAMa collects the filenames for all files within
a category. Does anyone know from which dump or API this can be retrieved?
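For concreteness, the kind of listing we are after could be produced with the MediaWiki API's categorymembers query, sketched below in Python with a placeholder category name; we do not know whether BaGLAMa actually works this way or reads a dump instead.

import requests

API = "https://commons.wikimedia.org/w/api.php"

def files_in_category(category):
    # Yield the titles of all files in a category, following API continuation.
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": "file",
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        data = requests.get(API, params=params).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])

for name in files_in_category("Category:Example images"):
    print(name)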
Regards,
Sander Ubink