[Wikipedia-l] Feature request

Neil Harris usenet at tonal.clara.co.uk
Mon Sep 23 08:53:01 UTC 2002


Larry Sanger wrote:

 >Sorry if this is already in the hopper or (!) has already been done, but
 >it seems long overdue:
 >
 >We need a way to compile, based on lists of links (I guess), "Recent
 >Changes" lists for all articles about a general topic.  I think this is
 >actually a very pressing need that should have been taken care of long
 >ago.  (For the record, I asked for it probably a year ago or more.)  The
 >idea is that we'd be able to maintain lists of all articles on a given
 >general subject, such as philosophy, and any changes made to any articles
 >on the subject would show up on the recent changes page for that subject.
 >
 >Why do we need this?  At least one reason is it might help attract experts
 >to the project.  Just speaking for myself, I'm sure I'd spend more time on
 >Wikipedia if there were a philosophy recent changes page.  More
 >importantly, the recent changes page has *always* been huge.  It's now
 >more cumbersome than ever and makes it hard for people to focus their
 >attention, which would be a nice option.  Lack of the feature also makes
 >it hard to *monitor* goings-on in a general subject area.
 >
 >With the relatively new mysql-driven software, this shouldn't be as
 >difficult as it might have been before.  One could compile personal lists
 >using the "watch this page" feature; what I suggest is that we have
 >publicly-editable and publicly-viewable lists of the same nature.
 >
 >(Minor point: in a list of recent changes pages, I think there should be
 >automatically listed the number of topics that are listed under a given
 >subject.  That'd give us an idea of how much more work there is to be done
 >in adding to the list.)
 >
 >I am not committed to any particular version of the feature, by the way.
 >I'd just like to see it done.  I don't want to have to wade through 5000
 >edits just to see all the recent philosophy edits.
 >
 >Larry
 >
I have some ideas on this:

The problem of finding subject groups is closely related to the problem
of indexing.

The problem with the current link structure is that it is much too
dense: you can get between articles very easily, as Wikipedia is a very
"small world" network. This tends to defeat any attempts at automated
indexing. What is needed is a way of making some pages and links more
visible than others to automatic indexing systems.

We define a new "category" namespace. An article can contain any number
of "category" links, which do not appear in the main body of the
article, but instead in a separate area, like the inter-language links.
There, they link to a placeholder "category" page which can be used to
define and describe the category. (And its "category talk" page can be
used to discuss the category).

Now, all pages that link to a category page "belong" to that category.
Categories are just sets: there can be any number of them, and they can
belong to multiple competing schemes.  And categories can belong to
categories, too, allowing for hierarchies and networks of categories to
be created. The presence of categories will make machine indexing much
easier.
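
As a rough illustration of the "categories are just sets" idea, here is a minimal Python sketch. The in-memory `category_links` table, the "category:" key prefix for category pages, and the "mammal" sub-category are all assumptions made for the example; nothing here reflects how the Wikipedia software stores links.

```python
# Hypothetical link table: page title -> set of category names it links to.
# A category page is stored under the key "category:<name>", so a category
# can itself belong to other categories.
category_links = {}

def add_to_category(page, category):
    category_links.setdefault(page, set()).add(category)

def ancestors(page):
    """Every category reachable from `page` by following category links."""
    seen = set()
    stack = list(category_links.get(page, ()))
    while stack:
        cat = stack.pop()
        if cat in seen:
            continue  # categories form a network, so cycles are possible
        seen.add(cat)
        stack.extend(category_links.get("category:" + cat, ()))
    return seen

def members(category):
    """All pages belonging to `category`, directly or via a sub-category."""
    return {page for page in category_links if category in ancestors(page)}

# Example data: "mammal" is an invented sub-category of "animal".
add_to_category("wolf", "mammal")
add_to_category("category:mammal", "animal")
add_to_category("Sherlock Holmes", "person")
add_to_category("Sherlock Holmes", "fictional")
```

Because membership is only a set of links, the same page can sit in several competing schemes at once, as "Sherlock Holmes" does here.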

Some very basic categories that might be useful for a start:

*[[category:animal]] eg wolf, cat, bird, dinosaur
*[[category:vegetable]]  eg potato, cactus
*[[category:person]] eg Isaac Newton, Sherlock Holmes (but see below)
*[[category:time period]] eg 20th century, February, 200 BC, Cenozoic
*[[category:event]]  eg Wars of the Roses, 1997 Academy Awards
*[[category:place]] eg Dubrovnik, Alaska, Indian Ocean, Atlantis (but
see below)
*[[category:field of study]] eg Biology, Chemistry, Philosophy, Law,
Accountancy, Civil engineering
** not sure of the best name(s) for this: field of endeavour, subject of
inquiry?
*[[category:fictional]] eg Sherlock Holmes, Atlantis
*[[category:abstraction]] eg Soul, Mind, Sophie Germain prime,
Mathematical set

I'd like to get these very simple categories in place first, as a sort
of "page coloring" experiment. Notice that they are neither complete,
nor framed in the form of a hierarchy: this is not a taxonomy. For
example, "prion" belongs to none of these categories. Perhaps someone
will create a [[category:other lifeform]] page for the Archaea, prions
and viruses.
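
For what it's worth, pulling category links like the ones listed above out of an article's source might look something like this in Python. The link syntax is taken from the examples in this message; the function names and the regular expression are my own assumptions, not existing software.

```python
import re

# Assumed syntax: [[category:name]] or [[category:name|label]].
CATEGORY_LINK = re.compile(r"\[\[category:([^\]|]+)(?:\|[^\]]*)?\]\]", re.IGNORECASE)

def extract_categories(wikitext):
    """Return the set of category names linked from an article's source text."""
    return {name.strip().lower() for name in CATEGORY_LINK.findall(wikitext)}

def strip_categories(wikitext):
    """Return the article body with category links removed, since they
    would be displayed in a separate area rather than in the main text."""
    return CATEGORY_LINK.sub("", wikitext)
```

An article tagged with none of the basic categories, such as "prion", would simply yield an empty set.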

How to bootstrap the process? My first idea is this:

* assign categories to about 1000 articles by hand
* train a naive Bayesian classifier to recognise each category
* adjust thresholds to make sure that classifications are reasonably
accurate
* machine-classify the entire Wikipedia!

Now, this process will be less than perfect. Some articles will be
mis-classified, others will be missed because the threshold
probabilities were set too high: ie both type I and type II errors.
Mis-classification will not damage any actual articles, it will only
result in errors in machine-generated indexes. But at this point, manual
editing will take over.
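
To make the bootstrap idea a bit more concrete, here is a small from-scratch sketch in Python: one binary naive Bayes classifier per category, with a probability threshold deciding whether the category gets assigned. The class name, the tokenizer, the Laplace smoothing and the default threshold of 0.9 are all assumptions chosen for illustration, not a worked-out design.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

class CategoryClassifier:
    """Binary naive Bayes: does an article belong to one category or not?"""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}
        self.vocab = set()

    def train(self, text, belongs):
        words = tokenize(text)
        self.word_counts[belongs].update(words)
        self.doc_counts[belongs] += 1
        self.vocab.update(words)

    def _log_score(self, words, label):
        total_docs = sum(self.doc_counts.values())
        # Laplace smoothing on both the prior and the word likelihoods
        score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        total_words = sum(self.word_counts[label].values())
        for w in words:
            if w in self.vocab:  # ignore words never seen in training
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
        return score

    def probability(self, text):
        """Estimated probability that the article belongs to the category."""
        words = tokenize(text)
        pos = self._log_score(words, True)
        neg = self._log_score(words, False)
        m = max(pos, neg)  # normalise in log space to avoid underflow
        return math.exp(pos - m) / (math.exp(pos - m) + math.exp(neg - m))

def classify(text, classifiers, threshold=0.9):
    """Return the categories whose classifier clears the threshold."""
    return {cat for cat, clf in classifiers.items()
            if clf.probability(text) >= threshold}
```

Raising `threshold` trades mis-classifications for missed articles, which is exactly the type I versus type II balance mentioned above.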

New articles can be machine-classified once they reach, say, 250
characters, using a Bayesian classifier trained on the corpus as a
whole. Again, once an article has been machine-classified, it is left
alone thereafter.
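
The "classify once, then leave alone" rule could be sketched roughly as follows in Python. The article record (a dict with "text", "categories" and "machine_classified" keys) and the `classify_fn` parameter are hypothetical, made up purely for this example.

```python
MIN_LENGTH = 250  # characters, the figure suggested above

def maybe_classify(article, classify_fn):
    """Machine-classify an article at most once, and only when long enough.

    `article` is a hypothetical dict with "text", "categories" and
    "machine_classified" keys; `classify_fn` maps text to a set of
    category names. Returns True if the article was classified this time.
    """
    if article["machine_classified"]:
        return False  # already classified once: leave it to human editors
    if len(article["text"]) < MIN_LENGTH:
        return False  # too short to classify yet
    article["categories"] |= classify_fn(article["text"])
    article["machine_classified"] = True
    return True
```

The flag is what keeps the machine from overriding later manual edits to an article's categories.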

Now, at this point, we may not need to create a "philosophy" or
"chemistry" category. Instead, we can just note that these are
[[category:field of study|fields of study]] and hence that pages that
link to them, or are linked from them, "belong" to them in some sense.
Time periods could be treated in the same way.
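
Tying this back to Larry's request, a per-subject recent changes list could then be little more than a filter. This Python sketch assumes a hypothetical `links` mapping (page title to the set of titles it links to) and a `recent_changes` list of (timestamp, page title) pairs; both are invented for the example.

```python
def topic_pages(topic, links):
    """Pages that link to `topic` or are linked from it, plus the topic itself.

    `links` is a hypothetical mapping of page title -> set of linked titles.
    """
    outgoing = links.get(topic, set())
    incoming = {page for page, targets in links.items() if topic in targets}
    return outgoing | incoming | {topic}

def topic_recent_changes(topic, links, recent_changes):
    """Filter a list of (timestamp, page title) edits down to one topic."""
    pages = topic_pages(topic, links)
    return [change for change in recent_changes if change[1] in pages]
```

The size of `topic_pages(...)` would also give the per-subject article count that Larry asked for as a minor point.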

I'm not 100% sure how this would work, but I think that a workable
mechanism could be evolved, given the initial category coloring.

Neil