Hi all,
If you use Hive on stat1002/1004, you may have seen a deprecation warning
when launching the hive client, saying that it is being replaced by
Beeline. The Beeline shell has always been available, but it required
supplying a database connection string every time, which was pretty
annoying. We now have a wrapper script
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual, and launching `beeline`.
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
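For context, this is roughly what the wrapper saves you from typing; the
hostname below is a placeholder (not the real server address), and SHOW
TABLES is just an example query:

```shell
# Without a wrapper, Beeline needs a full JDBC connection string every
# time (the hostname here is a placeholder, not the actual server):
beeline -u "jdbc:hive2://hive-server.example.org:10000/default" -e "SHOW TABLES;"

# With the wrapper script on stat1002/1004, the same thing is just:
beeline -e "SHOW TABLES;"
```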
If you run into any issues using this interface, please ping us on the
Analytics list or #wikimedia-analytics or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering "stat1004, what?" - there should be an announcement
about it coming soon!)
Best,
--Madhu :)
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
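As a minimal sketch of the first use case: assuming the dataset is a
tab-separated file with columns along the lines of (referer title, article
title, count) - the sample rows and column names below are illustrative,
so check the Figshare page for the exact schema:

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows in the (referer, article, count) shape
# described above; the real file's columns may be named differently.
sample = io.StringIO(
    "prev\tcurr\tn\n"
    "London\tLondon_Eye\t1000\n"
    "London\tRiver_Thames\t500\n"
    "other-google\tLondon\t90000\n"
)

clicks_from = defaultdict(list)  # referer -> [(article, count), ...]
for row in csv.DictReader(sample, delimiter="\t"):
    clicks_from[row["prev"]].append((row["curr"], int(row["n"])))

# Most frequent links clicked from a given article, highest count first:
top = sorted(clicks_from["London"], key=lambda t: -t[1])
print(top)  # [('London_Eye', 1000), ('River_Thames', 500)]
```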
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
I'm noticing curious spikes occasionally in the usage stats ( via
tools.wmflabs.org/pageviews/ ) for a Wikibook I wrote and maintain. I
would guess that many of the visitors are coming via a search engine.
Some blogs provide authors with a sanitized subset of HTTP referer {sic}
header information, specifically the search engine search terms. I'm
looking for that or something similar for Wikibooks.
How may I go about getting a sanitized list of search terms used to
enter that Wikibook or its chapters?
Regards,
Lars
Hi everyone,
I'm a phd student studying mathematical models to improve the hit ratio
of web caches. In my research community, we are lacking realistic data
sets and frequently rely on outdated modelling assumptions.
Previously (~2007), a trace containing 10% of user requests issued to
Wikipedia was publicly released [1]. This data set has been used
widely for performance evaluations of new caching algorithms, e.g., for
the new Caffeine caching framework for Java [2].
I would like to ask for your comments about compiling a similar
(updated) data set and making it public.
In my understanding, the necessary logs are readily available, e.g., in
the Analytics/Data/Mobile requests stream [3] on stat1002, with a
sampling rate of 1:100. As this request stream contains sensitive data
(e.g., client IPs), it would need anonymization before making it public.
I would be glad to help with that.
The previously released data set [1] contains no client information. It
contains 1) a counter, 2) a timestamp, 3) the URL, and 4) an update
flag. I would additionally suggest including 5) the cache's hostname,
6) the cache_status, and 7) the response size (from the Wikimedia cache
log format).
I believe this format would preserve anonymity, and would be interesting
for many researchers.
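A small sketch of what producing such a record could look like, assuming
a parsed log entry as a dict; the field names here are illustrative,
based only on the seven fields proposed above, and the actual Wikimedia
cache log schema may differ:

```python
# Fields proposed for the public trace (illustrative names); everything
# else, notably client-identifying fields, is dropped.
KEPT_FIELDS = ("timestamp", "url", "update_flag",
               "cache_host", "cache_status", "response_size")

def anonymize(record: dict, counter: int) -> dict:
    """Keep only the proposed trace fields plus a running counter,
    dropping sensitive data such as client IP and user agent."""
    out = {"counter": counter}
    out.update({k: record[k] for k in KEPT_FIELDS})
    return out

raw = {
    "client_ip": "203.0.113.7",       # sensitive: dropped
    "user_agent": "Mozilla/5.0 ...",  # sensitive: dropped
    "timestamp": "2016-08-01T12:00:00Z",
    "url": "https://en.wikipedia.org/wiki/Cache",
    "update_flag": "-",
    "cache_host": "cp1065",
    "cache_status": "hit",
    "response_size": "5120",
}
print(anonymize(raw, 1)["cache_status"])  # hit
```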
Let me know your thoughts.
Thanks,
Daniel Berger
http://disco.cs.uni-kl.de/index.php/people/daniel-s-berger
[1] http://www.wikibench.eu/?page_id=60
[2] https://github.com/ben-manes/caffeine/wiki/Efficiency
[3]
https://wikitech.wikimedia.org/wiki/Analytics/Data/Mobile_requests_stream
WikiConference North America 2016
7-10 October 2016, San Diego, CA, USA
SUBMISSIONS DEADLINE: August 31, 11:59pm Samoa Time!
https://wikiconference.org/wiki/Submissions
WikiConference North America (formerly WikiConference USA) is the third
annual conference on the North American continent devoted to Wikipedia and
other Wikimedia projects. The weekend will feature both academic and casual
presentations on Wikimedia-related outreach activities, workshops to
improve the skills of grassroots organizers, and discussions on the past,
present, and future of the Wikimedia projects. The conference features
offerings about community outreach, online activity, partnerships with
institutions of knowledge, and technology.
Keynote speakers are scheduled to include Katherine Maher, Executive
Director of the Wikimedia Foundation, and Merrilee Proffitt, Senior Program
Officer of OCLC Research. The last day of the conference will feature
programming coinciding with Indigenous Peoples' Day.
Registration for the conference is now open. You can register at
https://wikiconference.org.
Scholarships partially covering costs of travel and attendance are
available for active contributors to Wikimedia projects. Apply by August
23rd for scholarships at https://wikiconference.org/wiki/2016/Scholarships.
This is a volunteer-run conference, and volunteers are needed for any
number of tasks. If you are attending, please consider volunteering at
https://wikiconference.org/wiki/Volunteers.
We seek presentations addressing topics related to Wikipedia or open access
and culture. Presentations may be from any discipline regarding any
relevant topic. Please submit a description of your proposed presentation
using our online submission process at https://wikiconference.org/
wiki/Submissions. If you are interested in participating in the
peer-reviewed academic track, see our call for academic submissions at
https://wikiconference.org/wiki/Call_for_Academic_Presentations.
- Sydney Poore (User:FloNight) and Rosie Stephenson-Goodknight
(User:Rosiestep), conference organizers
Hi,
Our team is trying to determine how pageviews are attributed to pages that redirect to other pages.
For instance, the page Panic!_at_the_disco redirects to the page Panic!_at_the_Disco; however, the pageview dumps file
contains an entry for both Panic!_at_the_disco and Panic!_at_the_Disco. Does this mean that a single visit to the page Panic!_at_the_disco generates two entries
in the pageview dumps file (one for the source page of the redirect and another for the target page)?
-best,
-ace
Hello Analytics,
In the past weeks I was asked several times whether it is possible to
count clicks on links, or even click paths (like n% click *foo*, and of
those, m% click *bar*), on normal Wikipages.
I did not know.
But I thought you could enlighten me, and maybe even point me to some
resources that would help me understand it better, if such a thing exists.
Jan
--
Jan Dittrich
UX Design/ User Research
Wikimedia Deutschland e.V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Phone: +49 (0)30 219 158 26-0
http://wikimedia.de
Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment.
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
Körperschaften I Berlin, Steuernummer 27/029/42207.
Hi all,
I need to restart all the Java daemons on the Analytics Hadoop cluster for
security upgrades. This procedure might affect ongoing jobs, so please let
me know if you see any issues during the next few hours.
IRC: elukey (#wikimedia-analytics Freenode)
Thanks for your patience!
Luca
Please comment on whether to approve the introduction to the "Committee"
section of the draft Code of conduct for technical spaces.
The draft text is at
https://www.mediawiki.org/w/index.php?title=Code_of_Conduct/Draft&oldid=220…
. This is the part after the "Page: Code of Conduct/Committee" heading
and before the "Diversity" heading.
You can comment at
https://www.mediawiki.org/wiki/Talk:Code_of_Conduct/Draft#Finalize_introduc…
. A position and brief comment is fine.
You can also send private feedback to conduct-discussion(a)wikimedia.org .
Thanks again,
Matt Flaschen
P.S. Sorry, I should have combined this into my previous email.