On Sunday, I posted the following to the Analytics mailing list, but didn't see any response there, so I'm reposting here.
At the Berlin hackathon, I improved the script I wrote in December for compiling statistics on external links. My goal is to learn how many links Wikipedia has to a particular website, and to monitor this over time. I figure this might be interesting for GLAM cooperations.
These links are recorded in the external links table, but since I want to filter out links from talk and project pages, I need to join it with the page table, where I can find the namespace. I've tried the join on the German Toolserver, and it works fine for the smaller wikis, but it tends to time out (beyond 30 minutes) for the ten largest Wikipedias. This is not because I fail to use indexes, but because I want to run a substring operation on millions of rows. Even an optimized query takes some time.
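For the record, the Toolserver query was roughly of the following shape (a sketch, not the exact query: the connection details are placeholders, the namespace list is abbreviated, and the SUBSTRING_INDEX trick for cutting the URL down to the domain stands in for whatever substring handling is actually used):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Placeholder DSN; the real Toolserver connection details differ.
my $dbh = DBI->connect('dbi:mysql:database=dewiki_p', undef, undef,
                       { RaiseError => 1 });

# Join externallinks with page to filter on namespace, then group the
# links by domain. The substring operation over millions of rows is
# what makes this slow on the large wikis.
my $sql = <<'SQL';
SELECT SUBSTRING_INDEX(el_to, '/', 3) AS domain, COUNT(*) AS links
  FROM externallinks
  JOIN page ON page_id = el_from
 WHERE page_namespace IN (0, 6, 100)
 GROUP BY domain
 ORDER BY links DESC
SQL

for my $row (@{ $dbh->selectall_arrayref($sql) }) {
    printf "%7d %s\n", $row->[1], $row->[0];
}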
As a faster alternative, I have downloaded the database dumps and processed them with regular expressions. Since the page ID is a small integer, counting from 1 up to a few million, and all I want to know for each page ID is whether or not it belongs to a content namespace, I can make do with a bit vector of a few hundred kilobytes. Once this is loaded, I read the dump of the external links table, check whether the page ID is of interest, truncate the external link down to the domain name, and use a hash structure to count the number of links to each domain. It runs fast and has a small RAM footprint.
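In outline, the approach looks like this (a simplified sketch rather than the actual script: the dump file names are stand-ins, and the regular expressions are far more naive than the real SQL dump syntax requires):

#!/usr/bin/perl
use strict;
use warnings;

# Pass 1: build a bit vector with one bit per page ID, set for pages in
# the namespaces counted as content (main, File:, Portal: in this sketch).
my %content_ns = map { $_ => 1 } (0, 6, 100);
my $is_content = '';

open my $page, '<', 'page.sql' or die $!;
while (<$page>) {
    # rows look like (page_id,page_namespace,'page_title',...)
    while (/\((\d+),(-?\d+),/g) {
        vec($is_content, $1, 1) = 1 if $content_ns{$2};
    }
}

# Pass 2: for each external link from a content page, truncate the URL
# to its domain and count it in a hash.
my %count;
open my $el, '<', 'externallinks.sql' or die $!;
while (<$el>) {
    # rows look like (el_from,'el_to','el_index')
    while (/\((\d+),'https?:\/\/([^\/']+)/g) {
        $count{lc $2}++ if vec($is_content, $1, 1);
    }
}

printf "%8d %s\n", $count{$_}, $_
    for sort { $count{$b} <=> $count{$a} } keys %count;

Perl's vec() gives the one-bit-per-page-ID vector directly, which is what keeps the RAM footprint small.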
In December 2011, I downloaded all the database dumps I could find and uploaded the resulting statistics to the Internet Archive; see e.g. http://archive.org/details/Wikipedia_external_links_statistics_201101
One problem, though, is that I don't get links to Wikisource and Wikiquote this way, because they are not in the external links table. Instead they are interwiki links, found in the iwlinks table. The improvement I made in Berlin is that I now also read the interwiki prefix table and the iwlinks table. It works fine.
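Extending the sketch above (the $is_content bit vector and the %count hash are the ones built there), the addition amounts to a third pass of the same kind. The dump file names and field layouts here are guesses, and the interwiki prefixes are resolved to domains via the iw_url column:

# Pass 3: resolve interwiki prefixes (e.g. 'wikisource', 'q') to target
# domains, then count interwiki links from content pages into %count.
my %prefix_domain;
open my $iw, '<', 'interwiki.sql' or die $!;
while (<$iw>) {
    # rows look like ('prefix','http://en.wikisource.org/wiki/$1',...)
    while (/\('([^']+)','https?:\/\/([^\/']+)/g) {
        $prefix_domain{$1} = lc $2;
    }
}

open my $iwl, '<', 'iwlinks.sql' or die $!;
while (<$iwl>) {
    # rows look like (iwl_from,'iwl_prefix','iwl_title')
    while (/\((\d+),'([^']+)',/g) {
        $count{ $prefix_domain{$2} }++
            if vec($is_content, $1, 1) and exists $prefix_domain{$2};
    }
}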
One issue here is the definition of content namespaces. Back in December, I decided to count links found in the namespaces 0 (main), 6 (File:), Portal, Author and Index. Since then, the concept of "content namespaces" has been introduced as part of refining the way MediaWiki counts articles in some projects (Wiktionary, Wikisource), where the normal definition (all wiki pages in the main namespace that contain at least one link) doesn't make sense. When Wikisource, using the ProofreadPage extension, adds a lot of scanned books in the Page: namespace, these pages should count as content, despite not being in the main namespace and regardless of whether they contain any link (which they most often do not).
One problem is that I can't see which namespaces are "content" namespaces in any of the database dumps. I can only see this from the API, http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=... The API only provides the current value, which can change over time. I can't get the value that was in effect when the database dump was generated.
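For what it's worth, fetching the current list from the API can be done along these lines (a sketch; it assumes siprop=namespaces and JSON output, where content namespaces carry a "content" attribute):

#!/usr/bin/perl
# Fetch the namespaces currently flagged as content from the API.
use strict;
use warnings;
use LWP::Simple qw(get);
use JSON qw(decode_json);

my $url = 'http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo'
        . '&siprop=namespaces&format=json';
my $json = get($url) or die "API request failed\n";
my $ns   = decode_json($json)->{query}{namespaces};

# Content namespaces have a "content" key in this output format.
my @content = sort { $a <=> $b } grep { exists $ns->{$_}{content} } keys %$ns;
print "content namespaces: @content\n";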
Another problem is that I want to count links that I find in the File: (ns=6) and Portal: (mostly ns=100) namespaces, but these aren't marked as content namespaces by the API. Shouldn't they be?
Is anybody else doing similar things? Do you have opinions on what should count as content? Should I submit my script (300 lines of Perl) somewhere?
On 08/06/12 19:00, Lars Aronsson wrote:
> One problem is that I can't see which namespaces are "content" namespaces in any of the database dumps. I can only see this from the API, http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=...
> The API only provides the current value, which can change over time. I can't get the value that was in effect when the database dump was generated.
That's not a problem. You can process an old dump with the current contentness values.
What could have changed?
* A namespace wasn't listed as content, but it is now. As the content of a namespace will be of the same kind, it was a bug that it wasn't flagged as content, so it's right to use the newer value.
* A namespace didn't exist before and was added as content. No problem: a namespace that was missing earlier doesn't affect the results either way it's marked.
* A namespace was listed as content, but now it isn't. That wouldn't happen, unless it was a short-lived config typo. Still, it's better to use the newer data.
You should of course store which namespaces you treated as content along with the filtered external links; but then, you're the one generating that file.
The biggest problem may be if many pages with external links in a pseudo-namespace were moved to a real non-content namespace, as that would be detected as a loss of those links (because they shouldn't have been there).
> Another problem is that I want to count links that I find in the File: (ns=6)
There's usually no content there (license templates, fair use rationales...). Given that you won't be correctly computing the external links from transcluded Commons images, I wouldn't count it (except for Commons, where images are the content).
> and Portal: (mostly ns=100) namespaces,
I think these could be argued both ways. Do you see many external links from portal pages? I think they should list pages in the current wiki, so I wouldn't expect external links there.
> but these aren't marked as content namespaces by the API. Shouldn't they be?
> Is anybody else doing similar things? Do you have opinions on what should count as content?
> Should I submit my script (300 lines of Perl) somewhere?
Yes. Probably somewhere at svn.wikimedia.org/mediawiki/trunk/tools
On 2012-06-09 01:31, Platonides wrote:
> That's not a problem. You can process an old dump with the current contentness values.
That's right, but if my results are not the same as the results you get, this might cause doubts. As I imagine it, the number of links before and after a GLAM cooperation could be one metric of progress. For that, links in talk pages should not count, but links in content pages should. So I would prefer a robust way of counting the links, one that doesn't vary with the API.
> You should of course store which namespaces you treated as content along with the filtered external links; but then, you're the one generating that file.
Good idea.
>> Another problem is that I want to count links that I find in the File: (ns=6)
> There's usually no content there (license templates, fair use rationales...). Given that you won't be correctly computing the external links from transcluded Commons images, I wouldn't count it (except for Commons, where images are the content).
If there are no local images (as on Swedish Wikipedia), it does no harm to include the File namespace. But on Commons it's the only real content. I have also heard the suggestion that Wikisource should allow local uploads (of PDF/DjVu files with scanned books) to get around the overly restrictive admins on Commons.
>> Should I submit my script (300 lines of Perl) somewhere?
> Yes. Probably somewhere at svn.wikimedia.org/mediawiki/trunk/tools
I had the impression that SVN was replaced by Git, but perhaps that's just for MediaWiki core?
Most likely, I'll just use my user page on Meta.
On 09/06/12 21:11, Lars Aronsson wrote:
>>> Should I submit my script (300 lines of Perl) somewhere?
>> Yes. Probably somewhere at svn.wikimedia.org/mediawiki/trunk/tools
> I had the impression that SVN was replaced by Git, but perhaps that's just for MediaWiki core?
It's where we have some similar scripts right now. IMHO it's not worth creating a Git repo just for your script. Nor is it worth waiting two weeks for a repo to be created before you can publish it.