On 6/19/06, Eric Astor eastor1@swarthmore.edu wrote:
I've been investigating this for a while... I understand that per-page popularity data has been disabled. However, I'm trying to make some statistical investigations of Wikipedia, in an attempt to provide some utilities for the selection of articles for the One Encyclopedia Per Child (OEPC) project.
To the point, then. For the English and Simple English Wikipedias - is there any way to get per-page popularity data, or a stripped log file from one of the Squid proxies from which this information could be extracted? I already have a simple Perl utility that can strip a Squid log file down to the information I need (the URL).
The Squids don't log. Generally the feeling is that the information isn't useful. I'd like to see us do at least a periodic sample for these sorts of applications. This could be accomplished without impact on the production infrastructure by placing a box on a port mirror of one of the squids, sniffing the traffic, and reconstructing the TCP sessions enough to extract the requested URLs. Unfortunately this would require trusted staff access because of exposure to private data in raw traffic, and as far as I know none of our core devs are interested in this data.
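If anyone ever does set that up, the capture side doesn't need to be fancy. Roughly the sort of thing I mean, as a quick Python/scapy sketch (the interface name and port are placeholders, and it cheats by reading individual packets rather than properly reassembling TCP streams, which is usually good enough since the request line and Host header of a GET tend to fit in the first packet):

# Rough sketch only: per-packet URL extraction, no real TCP reassembly.
# Run as root; interface name "eth0" is just an example.
from scapy.all import sniff, TCP, Raw

def extract_request(pkt):
    if not (pkt.haslayer(TCP) and pkt.haslayer(Raw)):
        return
    payload = bytes(pkt[Raw].load)
    if not payload.startswith(b"GET "):
        return
    lines = payload.split(b"\r\n")
    path = lines[0].split(b" ")[1]
    # Pull the Host header so we can print a full URL, not just the path.
    host = next((l.split(b":", 1)[1].strip()
                 for l in lines if l.lower().startswith(b"host:")), b"?")
    print((b"http://" + host + path).decode(errors="replace"))

sniff(iface="eth0", filter="tcp dst port 80", prn=extract_request, store=False)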
In the absence of this data you can use pure connectedness to estimate important articles. You would just need the pagelinks and page tables to form metrics based on connectedness. We make MySQL dumps of this data available on download.wikimedia.org.
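For example, a quick-and-dirty way to rank articles by how many other articles link to them, assuming you've already parsed the pagelinks dump into (source, target) title pairs, would be something like this (plain Python, names purely illustrative):

from collections import Counter

def rank_by_inlinks(edges):
    # `edges` is an iterable of (source_title, target_title) pairs,
    # e.g. parsed out of the pagelinks SQL dump beforehand.
    inlinks = Counter(target for _source, target in edges)
    # Most-linked-to articles first.
    return inlinks.most_common()

# Tiny hand-made edge list, just to show the shape of the input:
edges = [
    ("Physics", "Isaac Newton"),
    ("Calculus", "Isaac Newton"),
    ("Apple", "Isaac Newton"),
    ("Isaac Newton", "Calculus"),
]
for title, count in rank_by_inlinks(edges):
    print(count, title)

Raw in-link counts are crude, but they're cheap to compute and correlate reasonably well with importance.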
For Wikipedia subsetting, the pagerank algorithm (ideally combined with popularity data as the initialization) is probably the best automated approach: even if a page is not widely read, if it is widely linked you should probably still include it.
A high performance implementation of pagerank is available in the Boost Graph Library.
I've computed the internal pagerank of Wikipedia articles in the past (initialized with neutral values), so if you need help with this I can make myself available.
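For illustration, here's a bare-bones power-iteration version in plain Python (not the Boost implementation, and the graph and seed inputs are just placeholders). The `seed` argument is where popularity counts, or a hand-picked core list, would plug in as the initialization data:

def pagerank(links, damping=0.85, iterations=50, seed=None):
    # `links` maps each page to the list of pages it links to.
    # `seed` is an optional dict of non-negative weights (popularity counts,
    # or a hand-picked list of core subjects) used as the teleport/start
    # distribution; if omitted, a uniform distribution is used.
    pages = set(links) | {t for targets in links.values() for t in targets}
    if seed:
        total = float(sum(seed.get(p, 0.0) for p in pages)) or 1.0
        teleport = {p: seed.get(p, 0.0) / total for p in pages}
    else:
        teleport = {p: 1.0 / len(pages) for p in pages}

    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) * teleport[p] for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: redistribute its rank along the teleport vector.
                for p in pages:
                    new_rank[p] += damping * rank[page] * teleport[p]
        rank = new_rank
    return rank

# Toy graph just to show usage; real input would come from the pagelinks dump.
toy = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
for page, score in sorted(pagerank(toy).items(), key=lambda x: -x[1]):
    print(page, round(score, 3))

This is fine for experimenting; for the full English Wikipedia graph you'd want the Boost implementation or something similarly optimized.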
If you have a substantial number of human-hours available for this purpose, I'd recommend that you get some folks building a hand-made list of core subjects that must be included. (See http://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team for work that's already been done in this area.) Such a list could be used as part of the initialization data for pagerank.
It would likely be useful to begin producing a negative list as well: a set of articles that you don't want to include because of the focus of your project. For example, you might wish to exclude pop-culture subjects. Be careful about using category data for this purpose: since there is no built-in method to view all the supercategories an article belongs to, links are sometimes made within the hierarchy that produce confusing results. For example, there are a great many articles reachable from every one of our top-level categories.
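You can see this for yourself by walking the category graph upward. Something along these lines (plain Python, with a `parents` map you'd build yourself from the categorylinks dump) shows how quickly cross-links and cycles pull in most of the hierarchy:

def all_supercategories(page, parents):
    # `parents` maps a page or category to the list of categories it is
    # directly a member of, e.g. built from the categorylinks dump.
    # The visited-set guard matters: the category graph contains cycles
    # and cross-links, which is exactly why a naive walk "reaches" far
    # more of the hierarchy than you might expect.
    seen = set()
    frontier = [page]
    while frontier:
        current = frontier.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return seen

Running that for a handful of articles and comparing the resulting sets against your negative list is a safer filter than trusting membership in any single category subtree.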