This is a reminder that you're invited to the pre-Wikimania hackathon,
10-11 July in Washington, DC, USA:
In order to come, you have to register for the Wikimania conference:
(Unfortunately, the period for requesting scholarships is now over.)
At the hackathon, we'll have trainings and projects for novices, and we
welcome creators of all Wikimedia technologies -- MediaWiki, gadgets,
bots, mobile apps, you name it -- to hack on stuff together and teach
each other. Hope to see you!
Engineering Community Manager
This weekend, TechWeek Chicago starts: http://techweek.com/
The Foundation's Peter Gehres is co-presenting the analytics presentation
"How Wikipedia Doubled its Online Fundraising" this Saturday. If you're
at TechWeek, he and other Wikimedians want to meet with you and talk shop!
Saturday June 23, 2012 4:00pm - 4:45pm @ 1 - Main Stage (222 Merchandise
Mart Plaza, Chicago, IL)
"In 2010, online donations to Wikipedia more than doubled, from $7.5
million to $16 million and, in 2011, increased another 33%. Much of this
increase was driven by user research conducted in Chicago. Design
researcher Billy Belchev from Webitects will get into the nitty-gritty
of form design, testing, and user interviews. Do one-step forms work
better than multi-step? Does PayPal help or hurt your numbers? What is
the effect of “Jimmy” banners? The answers are based on data from the
fifth most trafficked website in the world."
Engineering Community Manager
We're getting ready to make some changes to the udp2log sources. The first change will be the addition of two new fields at the end of the log line: Accept-Language and X-Carrier.
We also would like to consider enabling geocoding via udp-filter for more of the udp2log files.
Erik Z, I've attached some sample log lines of what these changes might generate. They include the two new fields and a geocoded and then anonymized IP address. (Note that the IPs in my sample were all 127.0.0.1, so they couldn't be properly geocoded, but the format will be the same.) Can you check this out and tell Diederik and me if this will cause any problems with your scripts?
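To make the change concrete, here is a rough sketch of how a downstream consumer could pick the two new fields off the end of each line. The whitespace-separated layout and field positions are assumptions based on my sample lines, not a final spec:

    import sys

    def parse_new_fields(line):
        # The two new fields are appended at the end of the line; everything
        # before them keeps its current position, so we just take the tail.
        fields = line.rstrip("\n").split(" ")
        accept_language, x_carrier = fields[-2], fields[-1]
        return accept_language, x_carrier

    for line in sys.stdin:
        accept_language, x_carrier = parse_new_fields(line)
        print("%s\t%s" % (accept_language, x_carrier))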
From a company with not exactly the same privacy standards or data mining needs as Wikimedia, but still an interesting read:
Begin forwarded message:
> Date: June 14, 2012 11:05:22 AM PDT
> Subject: Profile of Facebook Data Science Team
> Source: FlowingData
> Author: Nathan Yau
> MIT Technology Review profiles the Facebook Data Science Team, described as a gathering of grad students at a top school and headed by Cameron Marlow, the "young professor."
> Back at Facebook, Marlow isn't the one who makes decisions about what the company charges for, even if his work will shape them. Whatever happens, he says, the primary goal of his team is to support the well-being of the people who provide Facebook with their data, using it to make the service smarter. Along the way, he says, he and his colleagues will advance humanity's understanding of itself. That echoes Zuckerberg's often doubted but seemingly genuine belief that Facebook's job is to improve how the world communicates. Just don't ask yet exactly what that will entail. "It's hard to predict where we'll go, because we're at the very early stages of this science," says Marlow. "The number of potential things that we could ask of Facebook's data is enormous."
Alright! We've got a 10-node CDH3 Hadoop cluster set up. I am experimenting with (and learning about!) Hadoop as we go. We plan on doing some benchmarking of CDH3 vs. DataStax Enterprise (and vs. CDH4?) on this cluster before we make decisions. Right now is playtime!
I just added some notes to this Etherpad on some variable tweaking I will be doing. My new notes start at about line 187. (Can I link to a specific line in Etherpad?) I've also created a Google spreadsheet where I am keeping track of my benchmarking runs. Let me know if you need access to it. https://docs.google.com/a/wikimedia.org/spreadsheet/ccc?key=0AvpRkIqSY9hNdE…
If anyone on this list (who's on this list, anyway?!) has some insight or experience with Hadoop benchmarking, feel free to chime in. We'd love the help!
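If it helps to picture what a run looks like: for now I'm thinking of something as simple as timing the stock TeraGen/TeraSort examples with a tiny wrapper, roughly like the sketch below. The jar path, HDFS paths and row count are placeholders, not anything we've settled on:

    import subprocess
    import time

    EXAMPLES_JAR = "/usr/lib/hadoop/hadoop-examples.jar"  # placeholder; use your distro's examples jar
    ROWS = 10000000                                        # ~1 GB of TeraGen input

    def timed(label, cmd):
        """Run a command and report its wall-clock time."""
        start = time.time()
        subprocess.check_call(cmd)
        print("%s took %.1f s" % (label, time.time() - start))

    timed("teragen", ["hadoop", "jar", EXAMPLES_JAR, "teragen", str(ROWS),
                      "/benchmarks/teragen"])
    timed("terasort", ["hadoop", "jar", EXAMPLES_JAR, "terasort",
                       "/benchmarks/teragen", "/benchmarks/terasort"])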
(Hi everybody! Apparently I was already on this list, but had forgotten,
since the volume is so low. I'm a volunteer and a former board member of
the Swedish chapter. I also run Project Runeberg, the Scandinavian e-text archive.)
Here at the Berlin hackathon, I've improved the script I wrote in December
for compiling statistics on external links. My goal is to learn how many
links Wikipedia has to a particular website, and to monitor this over time.
I figure this might be interesting for GLAM collaborations.
This is found in the external links table, but since I want to filter out
links from talk and project pages, I need to join it with the page table,
where I can find the namespace. I've tried the join on the German Toolserver,
and it works fine for the minor wikis, but it tends to time out (beyond
30 minutes) for the ten largest Wikipedias. This is not because I fail to
use indexes, but because I want to run a substring operation on millions
of rows. Even an optimized query takes some time.
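To give an idea of the shape of the query, here is a simplified sketch (as a small Python/MySQLdb script). The externallinks and page columns are the standard MediaWiki schema, but the connection details are placeholders, and the namespace list and the SUBSTRING_INDEX domain extraction are only illustrations of the substring work I mean:

    import os
    import MySQLdb

    QUERY = """
    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(el_to, '/', 3), '/', -1) AS domain,
           COUNT(*) AS links
      FROM externallinks
      JOIN page ON el_from = page_id
     WHERE page_namespace IN (0, 6, 100)  -- main, File:, Portal:
     GROUP BY domain
     ORDER BY links DESC
    """

    # Placeholder connection details; point this at a replica of the wiki database.
    conn = MySQLdb.connect(host="HOSTNAME", db="dewiki_p",
                           read_default_file=os.path.expanduser("~/.my.cnf"))
    cursor = conn.cursor()
    cursor.execute(QUERY)
    for domain, links in cursor.fetchall():
        print("%s\t%d" % (domain, links))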
As a faster alternative, I have downloaded the database dumps, and processed
them with regular expressions. Since the page ID is a small integer,
from 1 up to a few million, and all I want to know for each page ID is
whether or not it belongs to a content namespace, I can make do with a bit vector
of a few hundred kilobytes. When this is loaded, and I read the dump of the
external links table, I can see if the page ID is of interest, truncate the
external link down to the domain name, and use a hash structure to count the
number of links to that domain. It runs fast and has a small RAM footprint.
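The script itself is Perl, but the core idea is small enough to sketch here (in Python, for brevity). The dump file names are examples, the regular expressions are simplified, and I use a plain set where the real script uses an actual bit vector:

    import re
    from collections import defaultdict
    from urlparse import urlparse          # urllib.parse on Python 3

    CONTENT_NS = set([0, 6, 100])          # main, File:, Portal: (plus Author:/Index: on some wikis)

    # Pass 1: remember which page IDs live in a content namespace.
    # Each INSERT row in the page dump starts with (page_id,page_namespace,...
    content_pages = set()
    page_row = re.compile(r"\((\d+),(\d+),")
    for line in open("enwiki-latest-page.sql"):
        for page_id, ns in page_row.findall(line):
            if int(ns) in CONTENT_NS:
                content_pages.add(int(page_id))

    # Pass 2: walk the externallinks dump, keep rows whose page is of interest,
    # truncate each URL to its domain and count it.
    counts = defaultdict(int)
    link_row = re.compile(r"\((\d+),'((?:[^'\\]|\\.)*)'")
    for line in open("enwiki-latest-externallinks.sql"):
        for el_from, el_to in link_row.findall(line):
            if int(el_from) in content_pages:
                counts[urlparse(el_to).netloc] += 1

    for domain, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        print("%d\t%s" % (n, domain))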
In December 2011 I downloaded all the database dumps I could find, and
uploaded the resulting statistics to the Internet Archive, see e.g.
One problem, though, is that I don't get links to Wikisource or Wikiquote this
way, because they are not in the external links table. Instead they are
interwiki links, found in the iwlinks table. My improvement here in Berlin
is that I now also read the interwiki prefix table and the iwlinks table.
It works fine.
One issue here is the definition of content namespaces. Back in December,
I decided to count links found in namespaces 0 (main) and 6 (File:), plus the
Portal:, Author: and Index: namespaces. Since then, the concept of "content namespaces"
has been introduced, as part of refining the way MediaWiki counts articles
in some projects (Wiktionary, Wikisource), where the normal definition
(all wiki pages in the main namespace that contain at least one link)
doesn't make sense. When Wikisource, using the ProofreadPage extension,
adds a lot of scanned books in the Page: namespace, this should count as
content, despite these pages not being in the main namespace, and whether
or not the pages contain any link (which they most often do not).
One problem is that I can't see which namespaces are "content" namespaces
in any of the database dumps. I can only see this from the API.
The API only provides the current value, which can change over time. I can't
get the value that was in effect when the database dump was generated.
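For reference, this is the siteinfo query I mean; a quick sketch (English Wikisource picked arbitrarily) that lists the namespaces the API currently flags as content:

    import json
    import urllib2                           # urllib.request on Python 3

    URL = ("https://en.wikisource.org/w/api.php"
           "?action=query&meta=siteinfo&siprop=namespaces&format=json")
    namespaces = json.load(urllib2.urlopen(URL))["query"]["namespaces"]
    for ns in namespaces.values():
        if "content" in ns:                  # the flag only appears on content namespaces
            print("%s\t%s" % (ns["id"], ns["*"] or "(main)"))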
Another problem is that I want to count links that I find in the File:
(ns=6) and Portal: (mostly ns=100) namespaces, but these aren't marked as
content namespaces by the API. Shouldn't they be?
Is anybody else doing similar things? Do you have opinions on what should
count as content? Should I submit my script (300 lines of Perl) somewhere?
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
Project Runeberg - free Nordic literature - http://runeberg.org/