Analytics June 2012

analytics@lists.wikimedia.org

4 participants
6 discussions

Upcoming hackathon for experts AND newbies: Washington, DC, USA July 10-11
by Sumana Harihareswara 22 Jun '12

This is a reminder that you're invited to the pre-Wikimania hackathon, 10-11 July in Washington, DC, USA: https://wikimania2012.wikimedia.org/wiki/Hackathon

In order to come, you have to register for the Wikimania conference: https://wikimania2012.wikimedia.org/wiki/Registration (Unfortunately, the period for requesting scholarships is now over.)

At the hackathon, we'll have trainings and projects for novices, and we welcome creators of all Wikimedia technologies -- MediaWiki, gadgets, bots, mobile apps, you name it -- to hack on stuff together and teach each other.

Hope to see you!

--
Sumana Harihareswara
Engineering Community Manager
Wikimedia Foundation

Our fundraising staff at TechWeek Chicago
by Sumana Harihareswara 22 Jun '12

This weekend, TechWeek Chicago starts: http://techweek.com/

The Foundation's Peter Gehres is copresenting the analytics presentation "How Wikipedia Doubled its Online Fundraising" this Saturday. If you're at TechWeek, he and other Wikimedians want to meet with you and talk shop!

http://schedule.techweek.com/event/003fc017e0530c08eb34f08033c50f86
Saturday June 23, 2012 4:00pm - 4:45pm @ 1 - Main Stage (222 Merchandise Mart Plaza, Chicago, IL)

"In 2010, online donations to Wikipedia more than doubled, from $7.5 million to $16 million and, in 2011, increased another 33%. Much of this increase was driven by user research conducted in Chicago. Design researcher Billy Belchev from Webitects will get into the nitty-gritty of form design and testing, user interviews. Do one-step forms work better than multi-step? Does PayPal help or hurt your numbers? What are the effect of “Jimmy” banners? The answers are based on data from the fifth most trafficked website in the world."

--
Sumana Harihareswara
Engineering Community Manager
Wikimedia Foundation

udp2log added fields + geocoding
by Andrew Otto 21 Jun '12

Hiya all!

We're getting ready to make some changes to the udp2log sources. The first change will be the addition of two new fields at the end of the log line: Accept-Language and X-Carrier. We would also like to consider enabling geocoding via udp-filter for more of the udp2log files.

Erik Z, I've attached some sample log lines showing what these changes might generate. They include the two new fields and a geocoded and then anonymized IP address. (Note that the IPs in my sample were all 127.0.0.1, so they couldn't be properly geocoded. But the format will be the same.) Can you check this out and tell Diederik and me if this will cause any problems with your scripts?

Thanks!
-Andrew Otto

Fwd: Profile of Facebook Data Science Team
by Dario Taraborelli 15 Jun '12

From a company with not exactly the same privacy standards or data mining needs as Wikimedia, but still an interesting read: http://www.technologyreview.com/featured-story/428150/what-facebook-knows/

Begin forwarded message:

> Date: June 14, 2012 11:05:22 AM PDT
> Subject: Profile of Facebook Data Science Team
> Source: FlowingData
> Author: Nathan Yau
>
> MIT Technology Review profiles the Facebook Data Science Team, described as a gathering of grad students at a top school and headed by Cameron Marlow, the "young professor."
>
> Back at Facebook, Marlow isn't the one who makes decisions about what the company charges for, even if his work will shape them. Whatever happens, he says, the primary goal of his team is to support the well-being of the people who provide Facebook with their data, using it to make the service smarter. Along the way, he says, he and his colleagues will advance humanity's understanding of itself. That echoes Zuckerberg's often doubted but seemingly genuine belief that Facebook's job is to improve how the world communicates. Just don't ask yet exactly what that will entail. "It's hard to predict where we'll go, because we're at the very early stages of this science," says Marlow. "The number of potential things that we could ask of Facebook's data is enormous."
>
> Related
>
> Facebook status updates: young people are self-centered and old ramble
> Data Science is catching on
> Taking a Look at Facebook Statistics from All Facebook
>
> Read more…

Benchmarking Kraken
by Andrew Otto 09 Jun '12

Alright! We've got a 10-node CDH3 Hadoop cluster set up. I am experimenting with (and learning about!) Hadoop as we go. We plan on doing some benchmarking of CDH3 vs. DataStax Enterprise (and vs. CDH4?) on this cluster before we make decisions. Right now is playtime!

I just added some notes to this Etherpad on some variable tweaking I will be doing. My new notes start at about line 187. (Can I link to a specific line in Etherpad?)

I've also created a Google spreadsheet where I am keeping track of my benchmarking runs. Let me know if you need access to it. https://docs.google.com/a/wikimedia.org/spreadsheet/ccc?key=0AvpRkIqSY9hNdE…

If anyone on this list (who's on this list, anyway?!) has some insight or experience with Hadoop benchmarking, feel free to chime in. We'd love the help!

Thanks all,
-Andrew Otto

External links statistics
by Lars Aronsson 03 Jun '12

(Hi everybody! Apparently I was already on this list, but had forgotten, since the volume is so low. I'm a volunteer and a former board member of the Swedish chapter. I also run Project Runeberg, the Scandinavian e-text archive, runeberg.org.)

Here at the Berlin hackathon, I've improved the script I wrote in December for compiling statistics on external links. My goal is to learn how many links Wikipedia has to a particular website, and to monitor this over time. I figure this might be interesting for GLAM cooperations.

This information is found in the external links table, but since I want to filter out links from talk and project pages, I need to join it with the page table, where I can find the namespace. I've tried the join on the German Toolserver, and it works fine for the minor wikis, but it tends to time out (beyond 30 minutes) for the ten largest Wikipedias. This is not because I fail to use indexes, but because I want to run a substring operation on millions of rows. Even an optimized query takes some time.

As a faster alternative, I have downloaded the database dumps and processed them with regular expressions. Since the page ID is a small integer, counting from 1 up to a few million, and all I want to know for each page ID is whether or not it belongs to a content namespace, I can make do with a bit vector of a few hundred kilobytes. Once this is loaded and I read the dump of the external links table, I can see if the page ID is of interest, truncate the external link down to the domain name, and use a hash structure to count the number of links to that domain. It runs fast and has a small RAM footprint.

In December 2011 I downloaded all the database dumps I could find, and uploaded the resulting statistics to the Internet Archive, see e.g. http://archive.org/details/Wikipedia_external_links_statistics_201101

One problem, though, is that I don't get links to Wikisource or Wikiquote this way, because they are not in the external links table.
Instead they are interwiki links, found in the iwlinks table. My improvement here in Berlin is that I now also read the interwiki prefix table and the iwlinks table. It works fine.

One issue here is the definition of content namespaces. Back in December, I decided to count links found in namespaces 0 (main), 6 (File:), Portal, Author and Index. Since then, the concept of "content namespaces" has been introduced, as part of refining the way MediaWiki counts articles in some projects (Wiktionary, Wikisource), where the normal definition (all wiki pages in the main namespace that contain at least one link) doesn't make sense. When Wikisource, using the ProofreadPage extension, adds a lot of scanned books in the Page: namespace, these should count as content, despite these pages not being in the main namespace, and whether or not the pages contain any link (which they most often do not).

One problem is that I can't see which namespaces are "content" namespaces in any of the database dumps. I can only see this from the API, http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespa… The API only provides the current value, which can change over time. I can't get the value that was in effect when the database dump was generated.

Another problem is that I want to count links that I find in the File: (ns=6) and Portal: (mostly ns=100) namespaces, but these aren't marked as content namespaces by the API. Shouldn't they be?

Is anybody else doing similar things? Do you have opinions on what should count as content? Should I submit my script (300 lines of Perl) somewhere?

--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
Project Runeberg - free Nordic literature - http://runeberg.org/
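
The dump-processing approach described in this message (a bit vector marking which page IDs are in a content namespace, then a hash of link counts per domain) might be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the Perl script itself: the function names, the `CONTENT_NAMESPACES` set, and the input shape (already-parsed `(page_id, namespace)` and `(page_id, url)` tuples) are all hypothetical, and the regular-expression parsing of the SQL dumps is left out.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumption: which namespaces count as "content" is a policy choice;
# 0 (main) and 6 (File:) are used here purely for illustration.
CONTENT_NAMESPACES = {0, 6}


def build_content_bitmap(pages, content_namespaces=CONTENT_NAMESPACES):
    """Build a bit vector with one bit per page ID, set for content pages.

    `pages` is an iterable of (page_id, namespace) tuples (hypothetical
    input shape; a real run would extract these from the page table dump).
    """
    bitmap = bytearray()
    for page_id, namespace in pages:
        if namespace not in content_namespaces:
            continue
        byte_index, bit = divmod(page_id, 8)
        if byte_index >= len(bitmap):
            # Grow the vector with zero bytes up to the needed index.
            bitmap.extend(bytes(byte_index + 1 - len(bitmap)))
        bitmap[byte_index] |= 1 << bit
    return bitmap


def is_content_page(bitmap, page_id):
    """Return True if the bit for `page_id` is set."""
    byte_index, bit = divmod(page_id, 8)
    return byte_index < len(bitmap) and bool(bitmap[byte_index] & (1 << bit))


def domain_of(url):
    """Truncate a URL to its lowercased host name, or None if it has none."""
    if url.startswith("//"):
        url = "http:" + url  # protocol-relative link: supply a scheme
    try:
        return urlsplit(url).hostname
    except ValueError:
        return None  # malformed URL


def count_links_by_domain(bitmap, external_links):
    """Count links per domain, skipping pages outside content namespaces.

    `external_links` is an iterable of (page_id, url) tuples.
    """
    counts = Counter()
    for page_id, url in external_links:
        if not is_content_page(bitmap, page_id):
            continue
        domain = domain_of(url)
        if domain:
            counts[domain] += 1
    return counts
```

With one bit per page ID, a wiki with a few million pages needs only a few hundred kilobytes for the lookup structure, which matches the small memory footprint reported above. The same counting loop could be fed rows from the iwlinks table, keyed by interwiki prefix instead of domain.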
