(Wikitech-l: this is more on automatic subject classification, which Axel brought up recently on Wikipedia-l.)
On Mon, 23 Sep 2002, Axel Boldt wrote:
[snip excellent comments that I agree with]
I still believe that all of this can and should be done automatically, by tracing link paths from the main page.
I'm going to repeat some of what you've said earlier, adding my own perspective. I really hope some programmers pursue this--they needn't ask anyone's permission. The proof's in the pudding.
If automatic categorization could be done, and it sounds very plausible to me, it would be *far* superior to a hand-maintained list of subject area links. And incredibly useful, too.
OK, the following will reiterate some of the earlier discussion.
Presumably, nearly every page on Wikipedia can be reached from nearly every other page. (There are orphans; and there are pages that do not link to any other pages, though other pages link to them.)
This suggests that we can basically assign a number--using *some* algorithm (not necessarily any one in particular: here is where programmers can be creative)--giving the "closeness" of a page to all the listed subjects. (This is very much like the Kevin Bacon game, of course, and the "six degrees of separation" phenomenon.)
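To make the "closeness" number concrete, here is a minimal sketch, assuming the link table has already been extracted into a plain dictionary mapping each page title to the titles it links to (the function name and the sample data are mine, purely for illustration). It measures closeness as the smallest number of clicks needed to reach a page from a subject page, never revisiting a page:

    from collections import deque

    def click_distance(links, subject):
        """Breadth-first search over the link graph: number of clicks
        needed to reach each page from `subject`, never revisiting a page."""
        dist = {subject: 0}
        queue = deque([subject])
        while queue:
            page = queue.popleft()
            for target in links.get(page, ()):
                if target not in dist:        # skip pages already reached
                    dist[target] = dist[page] + 1
                    queue.append(target)
        return dist

    # Hypothetical toy link table, just to show the shape of the data.
    links = {
        "Philosophy": ["Plato", "Logic"],
        "Plato": ["Plato's allegory of the cave"],
        "Logic": [],
        "Plato's allegory of the cave": ["Plato"],
    }
    print(click_distance(links, "Philosophy"))
    # {'Philosophy': 0, 'Plato': 1, 'Logic': 1, "Plato's allegory of the cave": 2}

That is only one possible measure, of course; it is just the simplest one to state.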
The question of whether a *useful* algorithm can be stated is interesting from a theoretical point of view. As I understand it, the suggestion is that there is a simple and reliable (but how reliable?) algorithm such that, given simply a list of all the links in Wikipedia (viz., source and destination page for each link), and a list of subject categories, we can sort all pages into their proper categories.
It will not do to say, "There are obvious counterexamples, so let's not even try." We can live with some slop. This is Wikipedia! We could even fix errors by hand (ad hoc corrections are possible; why not?). As far as I'm concerned, the real question is, once we try *various* algorithms, what's the highest reliability we can actually generate? I'll bet it'll be reasonably high, certainly high enough to be quite useful.
Here's an attempt at expressing an algorithm:
For a given page P (e.g., [[Plato's allegory of the cave]]), if the average number of clicks (not backtracking to any page already reached--otherwise you deal with infinite regresses) needed to reach P from the subject page S (e.g., [[Philosophy]]) through all possible paths between P and S (or, perhaps, all paths below a certain benchmark length?) is lower than the average number of clicks needed to reach P from any other subject page, then P is "about" S.
The algorithm could be augmented in useful ways. In case of ties, or near ties, a page could be listed as under multiple subjects. I have no idea if this algorithm is correct, but that doesn't matter--it's just an example. If you think harder and longer, I'm sure you'll think of a better one.
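Here is one hedged way that rule might look in code, building on the click-distance sketch above. It uses shortest distance rather than an average over all paths (averaging over every possible path gets expensive quickly), and the `tolerance` parameter is my own addition for treating near-ties as multiple subjects; none of this is meant as *the* algorithm, only an example of the shape one could take:

    from collections import deque

    def click_distance(links, subject):
        """Shortest click-distance from `subject` to every reachable page."""
        dist, queue = {subject: 0}, deque([subject])
        while queue:
            page = queue.popleft()
            for target in links.get(page, ()):
                if target not in dist:
                    dist[target] = dist[page] + 1
                    queue.append(target)
        return dist

    def classify(links, subjects, tolerance=0):
        """Assign each page to the subject(s) whose page is nearest to it.
        Pages within `tolerance` clicks of the minimum count as near-ties
        and are listed under multiple subjects."""
        distances = {s: click_distance(links, s) for s in subjects}
        # Collect every page that appears anywhere in the link table.
        pages = set(links) | {t for targets in links.values() for t in targets}
        classification = {}
        for page in pages:
            reachable = {s: d[page] for s, d in distances.items() if page in d}
            if not reachable:
                continue                    # orphan: no subject page reaches it
            best = min(reachable.values())
            classification[page] = [s for s, d in reachable.items()
                                    if d <= best + tolerance]
        return classification

    # classification = classify(links, ["Philosophy", "Mathematics", "Physics"])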
This would be fascinating, I'm sure, for the programmers. Can't we just take the question of how long processing will take as a constraint on the algorithm, rather than as a knock-down argument that it's not feasible? The *exercise* is to find (and implement!) an algorithm that *is* feasible. We don't even have to do this on Wikipedia's server, if it would be too great a load; anyone could download the tarball and process it. You could run a cron job once a day, compile the 40-odd "subject numbers" for each article in Wikipedia, and sort articles into subject groups (in some cases, multiple subject groups for a given article--why not?). From there we could use scripts already written to create the many "recent changes" pages.
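And here, equally tentatively, is what the downstream step might look like: invert the classification from the sketch above into per-subject groups, then filter the sitewide recent-changes list against a group. The data shapes (a page-to-subjects dictionary, a list of (timestamp, title) pairs) are assumptions of mine, not anything the software currently produces:

    from collections import defaultdict

    def subject_groups(classification):
        """Invert the page -> subjects mapping into subject -> pages."""
        groups = defaultdict(set)
        for page, subjects in classification.items():
            for subject in subjects:
                groups[subject].add(page)
        return groups

    def recent_changes_for(subject, groups, recent_changes):
        """Filter a sitewide recent-changes list of (timestamp, title)
        pairs down to the pages classified under `subject`."""
        pages = groups.get(subject, set())
        return [(ts, title) for ts, title in recent_changes if title in pages]

Run nightly from that cron job over the downloaded data, the output of the second function is essentially the content such a per-subject page would display.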
I really, really, really want to see [[Philosophy Recent Changes]]. We desperately need pages like that, and this is one of the best possible ways we have of getting them. It's worth actually exploring.
--Larry