On Sunday, I posted the following to the Analytics mailing list, but didn't see any response there, so I'm reposting here.
At the Berlin hackathon, I improved the script I wrote in December for compiling statistics on external links. My goal is to learn how many links Wikipedia has to a particular website, and to monitor this over time. I figure this might be interesting for GLAM collaborations.
These links are found in the externallinks table, but since I want to filter out links from talk and project pages, I need to join it with the page table, where I can find the namespace. I've tried the join on the German Toolserver, and it works fine for the smaller wikis, but it tends to time out (beyond 30 minutes) for the ten largest Wikipedias. This is not because I fail to use indexes, but because I want to run a substring operation on millions of rows; even an optimized query takes some time.
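For reference, the query is roughly of this shape (a simplified sketch against the standard externallinks and page schema; the namespace list and the way the domain is cut out of el_to vary per wiki):

    -- Rough sketch of the join that times out on the largest wikis;
    -- SUBSTRING_INDEX is one way to cut a URL down to its domain name.
    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(el_to, '/', 3), '/', -1) AS domain,
           COUNT(*) AS links
    FROM externallinks
    JOIN page ON page_id = el_from
    WHERE page_namespace IN (0, 6, 100)   -- main, File:, Portal: (example list)
    GROUP BY domain
    ORDER BY links DESC;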
As a faster alternative, I have downloaded the database dumps and processed them with regular expressions. Since page IDs are small integers, counting from 1 up to a few million, and all I want to know for each page ID is whether or not it belongs to a content namespace, I can make do with a bit vector of a few hundred kilobytes. Once that is loaded, I read the dump of the externallinks table, check whether each page ID is of interest, truncate the external link down to the domain name, and use a hash to count the number of links to each domain. It runs fast and has a small RAM footprint.
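In outline, the dump-processing step looks something like this. This is a minimal sketch rather than my actual script: the file names, the namespace list and the regexes for picking apart the SQL dumps are simplified.

    #!/usr/bin/perl
    # Sketch: build a bit vector of page IDs in the wanted namespaces from
    # the page dump, then count domains from the externallinks dump.
    use strict;
    use warnings;

    my %content_ns = map { $_ => 1 } (0, 6, 100);  # main, File:, Portal: (example)
    my $wanted = '';                               # bit vector, one bit per page ID

    # Rows in the page dump look like (page_id,page_namespace,'title',...)
    open my $page, '<', 'page.sql' or die $!;
    while (<$page>) {
        while (/\((\d+),(\d+),'/g) {
            vec($wanted, $1, 1) = 1 if $content_ns{$2};
        }
    }
    close $page;

    # Rows in the externallinks dump look like (el_from,'el_to','el_index')
    my %count;
    open my $el, '<', 'externallinks.sql' or die $!;
    while (<$el>) {
        while (/\((\d+),'(.*?)(?<!\\)','/g) {
            next unless vec($wanted, $1, 1);       # skip non-content pages
            my ($domain) = $2 =~ m{^\w+://([^/']+)};
            $count{$domain}++ if defined $domain;
        }
    }
    close $el;

    printf "%8d  %s\n", $count{$_}, $_
        for sort { $count{$b} <=> $count{$a} } keys %count;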
In December 2011 I downloaded all the database dumps I could find and uploaded the resulting statistics to the Internet Archive; see e.g. http://archive.org/details/Wikipedia_external_links_statistics_201101
One problem, though, is that I don't get links to Wikisource or Wikiquote this way, because they are not in the externallinks table. Instead they are interwiki links, found in the iwlinks table. The improvement I made in Berlin is that I now also read the interwiki prefix table and the iwlinks table. It works fine.
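Continuing the sketch above, the extra pass might look roughly like this ($wanted and %count are the bit vector and counter from the earlier sketch; the row layouts are those of the standard interwiki and iwlinks tables):

    # Resolve interwiki prefixes to domain names via the interwiki table,
    # then count iwlinks rows the same way as external links.
    my %prefix_domain;
    open my $iw, '<', 'interwiki.sql' or die $!;
    while (<$iw>) {
        # Rows look like ('prefix','http://target.example.org/wiki/$1',...)
        while (/\('([^']+)','(?:\w+:)?\/\/([^\/']+)/g) {
            $prefix_domain{$1} = $2;
        }
    }
    close $iw;

    open my $iwl, '<', 'iwlinks.sql' or die $!;
    while (<$iwl>) {
        # Rows look like (iwl_from,'iwl_prefix','iwl_title')
        while (/\((\d+),'([^']+)','/g) {
            next unless vec($wanted, $1, 1);       # same bit vector as above
            my $domain = $prefix_domain{$2} or next;
            $count{$domain}++;
        }
    }
    close $iwl;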
One issue here is the definition of content namespaces. Back in December, I decided to count links found in namespaces 0 (main), 6 (File:), Portal, Author and Index. Since then, the concept of "content namespaces" has been introduced as part of refining the way MediaWiki counts articles in some projects (Wiktionary, Wikisource), where the normal definition (all wiki pages in the main namespace that contain at least one link) doesn't make sense. When Wikisource, using the ProofreadPage extension, adds a lot of scanned books in the Page: namespace, these should count as content, even though the pages are not in the main namespace and regardless of whether they contain any link (which they most often do not).
One problem is that I can't see which namespaces are "content" namespaces in any of the database dumps. I can only see this from the API, http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=... The API only provides the current value, which can change over time. I can't get the value that was in effect when the database dump was generated.
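For the record, reading the flags from the API can look something like this, assuming the JSON output format, where content namespaces carry a "content" attribute and the namespace name is under the "*" key:

    #!/usr/bin/perl
    # Sketch: list the namespaces the API currently flags as content namespaces.
    # This only reflects the present configuration, not the one in effect
    # when a given dump was generated.
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use JSON qw(decode_json);

    my $url = 'http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo'
            . '&siprop=namespaces&format=json';
    my $json = get($url) or die "API request failed\n";
    my $data = decode_json($json);

    for my $ns (sort { $a <=> $b } keys %{ $data->{query}{namespaces} }) {
        my $info = $data->{query}{namespaces}{$ns};
        printf "%4d  %s\n", $info->{id}, $info->{'*'}
            if exists $info->{content};
    }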
Another problem is that I want to count links that I find in the File: (ns=6) and Portal: (mostly ns=100) namespaces, but these aren't marked as content namespaces by the API. Shouldn't they be?
Is anybody else doing similar things? Do you have opinions on what should count as content? Should I submit my script (300 lines of Perl) somewhere?