I am at
today and have mentioned that it would be interesting for our community to
see how much traffic they get from Wikimedia servers.
They said that they would be willing to look into the possibility of
providing such data, but before embarking on that, they would like to get
an overview of what kind of data or analyses people would be interested in.
If you have suggestions in this regard, please post them at
Thanks and cheers,
Cross-posting the announcement from the Wikimedia Blog. The details of the event are on Meta and we're also creating meetup.com pages for the local events. Check them out and RSVP if you're planning to attend. Looking forward to seeing you on November 9!
Dario, on behalf of the organizers
Join the inaugural Wiki Research Hackathon on November 9
Last summer at Wikimania in Hong Kong, the annual global Wikimedia conference, we (a group of Wikipedia researchers) discussed how we could make wiki research more impactful. In our work in academia and on Wikimedia projects, we saw a host of missed opportunities to share ideas, hypotheses, code, and research methods. We set out to create a space to bring researchers together with Wikipedians and facilitate problem solving, discovery and innovation with the use of open data and open source tools. Labs2 (L2) aims to build this space by providing infrastructure and venues for collaborative wiki research.
Today we’re thrilled to announce the inaugural Wiki Research Hackathon – a global event hosted by Wikimedia Foundation researchers, academic researchers and Wikipedians from around the world on Saturday, November 9, 2013.
This hackathon is an opportunity for anyone interested in research on wikis, Wikipedia, and open collaboration to meet, share ideas, and work together. It is targeted at Wikipedia editors, students, researchers, coders and anyone interested in designing new tools, statistics and data visualization, and producing new knowledge about Wikimedia projects and their communities.
The goal of this event is to:
- share knowledge about research tools and datasets (and how to use them)
- ask burning research questions (and learn how to answer them)
- get involved in ongoing research projects (or start new ones)
- design new data-driven apps and tools (or hack existing ones)
(Locations are approximate)
This hackathon will be held both as a series of local meetups (Perth, Mannheim, Oxford, Rio de Janeiro, Chicago, Minneapolis, San Francisco, Seattle, etc.) and virtual meetups (Asia/Oceania, Europe/Africa & The Americas) for those who can’t make it to the local events. An IRC channel (#wikimedia-labsconnect) and a Google Hangout open throughout the day will allow attendees to connect online.
Interested attendees can sign up for the event on Meta-wiki.
Local and virtual meetups are listed on the event page. All you need to do is add your name to the list of participants for the event that makes sense for you.
For any question about the event (including volunteering for a local meetup), you can reach us at wrh(a)wikimedia.org or leave a message on the hackathon’s talk page on Meta-wiki. We look forward to seeing you on November 9.
Aaron Halfaker, Wikimedia Foundation
Jonathan Morgan, Wikimedia Foundation
Morten Warncke-Wang, University of Minnesota
Aaron Shaw, Northwestern University
Dario Taraborelli, Wikimedia Foundation
Taha Yasseri, Oxford University
Henrique Andrade, Wikimedia Foundation
I've just configured 4 new ulsfo servers as mobile caches:
These are listed in cache.pp accordingly.
Please make sure that mobile requests from these servers are accounted for, and let me know when this is the case. I'd like to start giving them traffic ASAP. :)
Mark Bergsma <mark(a)wikimedia.org>
Lead Operations Architect
Greetings Ori and analytics team. Is there documentation somewhere about
how and where to determine that EventLogging events are being properly
recorded? We had to do a quick deployment last night to change schemas and
how we handle schemas internally for MF and realized after deploying the
changes that the best we could do was make sure that the events were firing
- we had no idea how to inspect the pipeline, and everyone who did know was
asleep/offline/etc (see below for more backstory).
On Thu, Oct 24, 2013 at 10:21 AM, Jon Robson <jrobson(a)wikimedia.org> wrote:
> I can confirm we are still logging.
> In terms of stat1.wikimedia.org access I forget exactly how I did it
> but you will need to talk to someone in analytics - maybe Dario to get
> setup there. I'd recommend doing this sooner rather than later to
> avoid this problem again.
> On Wed, Oct 23, 2013 at 7:31 PM, Arthur Richards
> <arichards(a)wikimedia.org> wrote:
> > On Wed, Oct 23, 2013 at 6:57 PM, Jon Robson <jrobson(a)wikimedia.org>
> >> Note if the events are firing and there are no errors in the console
> >> the change was successful :) If someone can double check they are
> showing up
> >> on stat1 though even better!
> > Are there details published somewhere on how to do this? After Kaldari
> > the changes out successfully, we realized neither of us knew how to
> check on
> > stat1 nor could I quickly find docs.
> > --
> > Arthur Richards
> > Software Engineer, Mobile
> > [[User:Awjrichards]]
> > IRC: awjr
> > +1-415-839-6885 x6687
Software Engineer, Mobile
Apologies if I missed some documentation or prior discussion about this,
but is there a reason why the seconds field in the /pagecounts-raw/ dump
filenames varies? It seems unnecessary to scrape and parse the html to get the
true filenames (e.g., pagecounts-20131021-160013.gz) instead of being able
to pass clean filenames (e.g., pagecounts-20131021-160000.gz) especially
when there's no true precision needed at the second-level here. Is it
unreasonable to request that these be renamed to a more consistent and
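
A minimal sketch of the scraping workaround described above, assuming the directory listing is already available as an HTML string. The function name `index_by_hour` and the sample listing are hypothetical; only the filename pattern comes from the examples in this message. It maps each (date, hour) pair to whatever filename actually appears, so callers do not need to know the seconds field in advance.

```python
import re

# Filename shape taken from the examples above: pagecounts-YYYYMMDD-HHMMSS.gz
FILENAME_RE = re.compile(r"pagecounts-(\d{8})-(\d{2})(\d{4})\.gz")


def index_by_hour(listing_html):
    """Map (YYYYMMDD, HH) to the real filename found in the listing."""
    index = {}
    for match in FILENAME_RE.finditer(listing_html):
        day, hour, _minutes_seconds = match.groups()
        index[(day, hour)] = match.group(0)
    return index


# Hypothetical listing fragment; the second filename is made up for the demo.
sample_listing = (
    '<a href="pagecounts-20131021-160013.gz">pagecounts-20131021-160013.gz</a>\n'
    '<a href="pagecounts-20131021-170002.gz">pagecounts-20131021-170002.gz</a>\n'
)

hourly = index_by_hour(sample_listing)
print(hourly[("20131021", "16")])
```

With such an index, a request for the 16:00 file on 2013-10-21 resolves to the real name regardless of the trailing seconds.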
tl;dr: Do we have data on the number of compressed vs. uncompressed requests
I'm investigating a fundraising issue where it appears that banners that
should be about the same size compressed, but which are differently sized
upon decompression, show markedly different conversion rates (the only
thing that's different about them is the name of the banner which affects
One of the angles I'm investigating is if perhaps we're serving a
significant number of banners uncompressed, which would affect the amount
of time it takes to appear on the site. If we have this data already, I can
compare it to data that I'm going to take from the banner stream.
Another thing I'm considering is whether it takes the caching layer longer to
retrieve or serve certain banner content and/or cache keys.
-- The Data --
For the truly curious: the two tests I've run so far that have led me down
this path both use two banners with the same content (cloned) but with
different names. Because names get substituted into the banners multiple times
through keyword expansion, the content lengths differ. I then compare how
many clicks each banner gets. The test is multivariate, the two variables
being content length and cache key.
Cache key setup 1 (Long name has a worse spot in the cache):
Short Name: 0.22% success rate (155300 samples)
Long Name: 0.19% success rate (160800 samples)
The 95% confidence interval has the long name performing from -31% to 3%
worse than the short name with a power of 0.014.
Cache key setup 2 (Long name has a better spot in the cache):
Short Name: 0.20% success rate (294900 samples)
Long Name: 0.19% success rate (309500 samples)
The 95% CI here still has the long name performing worse, but with power
that is too low to be useful.
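
For readers who want to re-run this kind of comparison, here is a rough sketch of a 95% interval for the difference between two success rates, using the setup 1 figures quoted above. The Wald (normal approximation) interval is an assumption chosen for simplicity and is not necessarily the method behind the numbers in this message. The rates are the rounded percentages given above, so the result is only approximate. The function name is hypothetical.

```python
import math


def two_proportion_ci(p1, n1, p2, n2, z=1.96):
    """Wald interval for p2 - p1; returns (diff, low, high)."""
    diff = p2 - p1
    # Standard error of the difference of two independent proportions.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - z * se, diff + z * se


# Cache key setup 1, rates as rounded in the message above.
short_rate, short_n = 0.0022, 155300
long_rate, long_n = 0.0019, 160800

diff, low, high = two_proportion_ci(short_rate, short_n, long_rate, long_n)

# Express the interval relative to the short-name rate.
relative = tuple(x / short_rate for x in (diff, low, high))
print("absolute:", diff, low, high)
print("relative to short name:", relative)
```

If the interval straddles zero, the data do not distinguish the two banners at this confidence level.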
Fundraising Technology Team
Is there a way to link up the username IDs which are included in the output
with the actual usernames?
Thank you so much!
Grantmaking Learning & Evaluation
FYI, useful new stats :)
We might want to build a directory of reports generated on ToolLabs
somewhere in the analytics hub on mediawiki.org.
---------- Forwarded message ----------
From: Gerard Meijssen <gerard.meijssen(a)gmail.com>
Date: Thu, Oct 17, 2013 at 10:26 PM
Subject: [Wikidata-l] Statistics
To: WikiData-l <wikidata-l(a)lists.wikimedia.org>
I do not know if you have seen the statistics compiled by Magnus.
They are up to date and useful.
I blogged about it. As far as I am concerned, the biggest
challenge we face is the lack of labels. Given that 280+ languages are
represented in Wikidata it clearly demonstrates that Wikidata is
useless as it is for most languages. Please tell me that I am wrong
and explain why.
VP of Engineering and Product Development, Wikimedia Foundation
I spoke to Dario today about investigating uses for our Hadoop cluster.
This is an internal cluster but it's mirrored on labs so I'm posting to
the public list in case people are interested in the technology and hearing
what we're up to.
The questions we need to answer are:
- What's an easy way to import lots of data from MySQL without killing
the source servers? We've used sqoop and drdee's sqoopy, but we think these
would hammer the prod servers too hard.
- drdee mentioned a way to pass a comment with select statements to
make them lower priority. Is this documented somewhere?
- Could we just stand up the MySQL backups and import them?
- Could we import from the xml dumps?
- Is there a way to do incremental importing once an initial load is
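
One generic pattern that touches both the "don't hammer the source" and the "incremental import" questions is to read in small batches keyed on the primary key, pause between batches, and remember the highest key seen. The sketch below is only an illustration of that idea: it uses an in-memory SQLite table as a stand-in for MySQL, and the table, column names, and `copy_in_batches` helper are hypothetical, not part of sqoop or any existing tooling.

```python
import sqlite3
import time


def copy_in_batches(conn, table, key, last_seen=0, batch_size=1000, pause=0.0):
    """Yield lists of rows with key > last_seen, one small batch at a time."""
    while True:
        rows = conn.execute(
            f"SELECT * FROM {table} WHERE {key} > ? ORDER BY {key} LIMIT ?",
            (last_seen, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_seen = rows[-1][0]  # assumes the key is the first column
        time.sleep(pause)  # throttle so the source server is not saturated


# Stand-in source database (hypothetical table modeled loosely on revisions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER)")
conn.executemany(
    "INSERT INTO revision VALUES (?, ?)", [(i, i % 7) for i in range(1, 2501)]
)

# Initial load: three throttled batches instead of one large scan.
initial_sizes = [len(b) for b in copy_in_batches(conn, "revision", "rev_id")]
high_water_mark = 2500

# Later, new rows arrive; only rows past the high-water mark are fetched.
conn.executemany(
    "INSERT INTO revision VALUES (?, ?)", [(i, 0) for i in range(2501, 2511)]
)
incremental = list(
    copy_in_batches(conn, "revision", "rev_id", last_seen=high_water_mark)
)
print(initial_sizes, len(incremental[0]))
```

This only picks up newly inserted rows; tables whose existing rows are updated in place would need a different marker, such as a modification timestamp.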
Once we figure this out, the fun starts. What are some useful questions
to ask once we have access to the core MediaWiki DB tables across all projects?