On Sunday, I posted the following to the Analytics mailing list,
but didn't see any response there, so I'm reposting here.
At the Berlin hackathon, I improved the script I wrote in December
for compiling statistics on external links. My goal is to learn how many
links Wikipedia has to a particular website, and to monitor this over time.
I figure this might be intresting for GLAM cooperations.
This is found in the external links table, but since I want to filter out
links from talk and project pages, I need to join it with the page table,
where I can find the namespace. I've tried the join on the German Toolserver,
and it works fine for the minor wikis, but it tends to time out (beyond
30 minutes) for the ten largest Wikipedias. This is not because I fail to
use indexes, but because I want to run a substring operation on millions
of rows. Even an optimized query takes some time.
As a faster alternative, I have downloaded the database dumps, and processed
them with regular expressions. Since the page ID is a small integer, counting
from 1 up to a few millions, and all I want to know for each page ID is
whether or not it belongs to a content namespace, I can do with a bit vector
of a few hundred kilobytes. When this is loaded, and I read the dump of the
external links table, I can see if the page ID is of interest, truncate the
external link down to the domain name, and use a hash structure to count the
number of links to each domain. It runs fast and has a small RAM footprint.
In December 2011 I downloaded all the database dumps I could find, and
uploaded the resulting statistics to the Internet Archive, see e.g.
http://archive.org/details/Wikipedia_external_links_statistics_201101
One problem though is that I don't get links to Wikisource, Wikiquote this
way, because they are not in the external links table. Instead they are
interwiki links, found in the iwlinks table. The improvement I made in Berlin
is that I now also read the interwiki prefix table and the iwlinks table.
It works fine.
One issue here, is the definition of content namespaces. Back in December,
I decided to count links found in namespaces 0 (main), 6 (File:),
Portal, Author and Index. Since then, the concept of "content namespaces"
has been introduced, as part of refining the way MediaWiki counts articles
in some projects (Wiktionary, Wikisource), where the normal definition
(all wiki pages in the main namespace that contain at least one link)
doesn't make sense. When Wikisource, using the ProofreadPage extension,
adds a lot of scanned books in the Page: namespace, this should count as
content, despite these pages not being in the main namespace, and whether
or not the pages contain any link (which they most often do not).
One problem is that I can't see which namespaces are "content" namespaces
in any of the database dumps. I can only see this from the API,
http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop…
The API only provides the current value, which can change over time. I can't
get the value that was in effect when the database dump was generated.
Another problem is that I want to count links that I find in the File:
(ns=6) and Portal: (mostly ns=100) namespaces, but these aren't marked as
content namespaces by the API. Shouldn't they be?
Is anybody else doing similar things? Do you have opinions on what should
count as content? Should I submit my script (300 lines of Perl) somewhere?
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik -
http://aronsson.se
Project Runeberg - free Nordic literature -
http://runeberg.org/