On 6/19/06, Eric Astor <eastor1(a)swarthmore.edu> wrote:
I've been investigating this for a while... I understand that per-page
popularity data has been disabled. However, I'm trying to make some
statistical investigations of Wikipedia, in an attempt to provide some
utilities for the selection of articles for the One Encyclopedia Per Child
(OEPC) project.
To the point, then. For the English and Simple English Wikipedias - is there
any way it could be possible to get per-page popularity data, or a stripped
log file from some Squid proxy that this information could be extracted
from? I already have a simple Perl utility that could be used to strip a
Squid log file down to the information I need (the URL).
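(The logic is trivial; purely for illustration, a Python equivalent of that
stripping step, assuming Squid's native access.log format where the
requested URL is the seventh whitespace-separated field:

import sys

# Squid native access.log: time elapsed client code/status bytes method URL ...
for line in sys.stdin:
    fields = line.split()
    if len(fields) >= 7:
        print(fields[6])  # the requested URL
)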
The Squids don't log. Generally the feeling is that the information
isn't useful. I'd like to see us do at least a periodic sample for
these sorts of applications. This could be accomplished without impact
on the production infrastructure by placing a box on a port mirror of
one of the squids, sniffing the traffic, and reconstructing the TCP
sessions enough to extract the requested URLs. Unfortunately, this
would require trusted staff access because of exposure to private data
in raw traffic, and as far as I know none of our core devs are
interested in this data.
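For concreteness, here's a rough Python sketch of that sniffer. It is
hypothetical, not anything we run: Linux-only (AF_PACKET), needs root,
and it cheats by scanning individual segments for request lines rather
than doing full TCP reassembly, which is usually enough since a GET
request line fits in one packet.

import re
import socket

# HTTP request lines we care about; matched over raw packet bytes.
REQUEST_LINE = re.compile(rb'^(?:GET|HEAD|POST) (\S+) HTTP/1\.[01]', re.M)
HOST_HEADER = re.compile(rb'^Host: (\S+)', re.M)

def sniff(interface='eth0'):
    # ETH_P_ALL (0x0003): receive every frame seen on the mirrored port.
    sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.ntohs(0x0003))
    sock.bind((interface, 0))
    while True:
        frame = sock.recv(65535)
        if frame[12:14] != b'\x08\x00':     # IPv4 EtherType only
            continue
        ip = frame[14:]
        ihl = (ip[0] & 0x0f) * 4            # IP header length in bytes
        if ip[9] != 6:                      # protocol 6 = TCP
            continue
        tcp = ip[ihl:]
        if len(tcp) < 20:
            continue
        payload = tcp[(tcp[12] >> 4) * 4:]  # skip the TCP header
        for m in REQUEST_LINE.finditer(payload):
            host = HOST_HEADER.search(payload)
            url = (b'http://' + host.group(1) + m.group(1)) if host else m.group(1)
            print(url.decode('ascii', 'replace'))

if __name__ == '__main__':
    sniff()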
In the absence of this data you can use pure connectedness to
estimate important articles. You would just need the pagelinks and page
tables to form metrics based on connectedness. We make MySQL dumps of
this data available on
download.wikimedia.org.
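As a crude but concrete starting point, in-degree over the pagelinks
table is a single query. This sketch assumes you've loaded the dumps
into a local MySQL instance; the table and column names are from the
MediaWiki schema (pl_from is the source page_id, pl_namespace/pl_title
name the target):

import MySQLdb  # the MySQL-python package

db = MySQLdb.connect(db='wikipedia')  # wherever you loaded the dumps
cur = db.cursor()
cur.execute("""
    SELECT pl_title, COUNT(*) AS indegree
    FROM pagelinks
    WHERE pl_namespace = 0  -- article namespace only
    GROUP BY pl_title
    ORDER BY indegree DESC
    LIMIT 1000
""")
for title, indegree in cur.fetchall():
    print(indegree, title)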
For Wikipedia subsetting, pagerank (ideally initialized with
popularity data) is probably the best automated approach: even if a
page is not widely read, if it is widely linked you should still
probably include it.
A high-performance implementation of pagerank is available in the
Boost Graph Library.
I've computed the internal pagerank of Wikipedia articles in the past
(initialized with neutral values), so if you need help with this I can
make myself available.
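If Boost is overkill for a first pass, the whole idea fits in a short
power-iteration sketch. This is plain Python, not the Boost
implementation, and the teleport vector is where popularity data (or
any other priority list) would plug in as initialization:

def pagerank(links, teleport=None, damping=0.85, iters=50):
    """links: dict of page -> list of pages it links to.
    teleport: optional dict of page -> weight (e.g. view counts);
    pages absent from it get zero teleport mass. Defaults to uniform."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    if teleport is None:
        teleport = dict.fromkeys(pages, 1.0)
    total = float(sum(teleport.get(p, 0.0) for p in pages))
    base = {p: teleport.get(p, 0.0) / total for p in pages}
    rank = dict(base)
    for _ in range(iters):
        new = {p: (1.0 - damping) * base[p] for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # dangling page: redistribute its mass via the teleport vector
                for t in pages:
                    new[t] += damping * rank[p] * base[t]
        rank = new
    return rank

# Toy usage, biasing the teleport step toward one article:
links = {'Physics': ['Mathematics'], 'Mathematics': ['Physics', 'Logic'], 'Logic': []}
print(pagerank(links, teleport={'Physics': 2.0, 'Mathematics': 1.0, 'Logic': 1.0}))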
If you have a substantial number of human-hours available for this
purpose, I'd recommend that you get some folks building a hand-made
list of core subjects which must be included. (See
http://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team for
work that's already been done in this area.) Such a list could be used
as part of the initialization data for pagerank.
It would likely be useful to begin producing a negative list, a set of
articles that you don't want to include because of the focus of your
project. For example, you might wish to exclude pop-culture subjects.
Be careful about using category data for this purpose: since there is
no built-in method to view all the supercategories that an article is
a member of, links are sometimes made within the hierarchy which
produce confusing results. For example, a great many articles are
reachable from every one of our top-level categories.
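If you do use categories, you'll have to compute the ancestor set
yourself. A rough sketch, assuming two dicts built from the
categorylinks and page dumps (the names here are illustrative):
parents maps a page_id to its parent category titles (cl_from ->
cl_to), and cat_page maps a category title to the page_id of its
namespace-14 page:

from collections import deque

def ancestor_categories(page_id, parents, cat_page):
    seen = set()
    queue = deque(parents.get(page_id, ()))
    while queue:
        cat = queue.popleft()
        if cat in seen:
            continue  # the category graph has cycles; don't loop forever
        seen.add(cat)
        cat_id = cat_page.get(cat)
        if cat_id is not None:
            queue.extend(parents.get(cat_id, ()))
    return seen

The cycle check is the important part: the hierarchy is a graph, not a
tree, which is exactly why naive category filtering misbehaves.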