I think that a subject classification of articles would vastly improve
"soft security" and would save regulars a lot of time, since not
everyone would have to check every edit as currently seems to be the
case.
>I'd still like to see if we couldn't build those subjects
>automatically in some way based on links in the database.
How about this: the possible topics coincide with the major pages
listed on [[Main Page]] (from "Astronomy" to "Visual Arts"). The
shortest link path from such a topic page to an article defines that
article's topic. If there is no such path, then the article is
classified as a topic orphan.
To compute these topics quickly, the cur table gets two new columns:
topic and distance, where distance stands for the link distance from the
Main Page topic page. When a new article is created, it can be classified
immediately: look at the distance entries of all articles that link to it,
take the smallest one, inherit that article's topic, and set the distance
to that minimum plus one. If an existing
article is saved, the topic and distance entries of all articles it
links to (and their children) may need to be updated; these changes
can be propagated in a recursive manner.
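As a rough sketch of the classification step (using the proposed topic and
distance columns on cur; $newTitle stands for the new article's title, and
the links table column names l_from and l_to are guesses here, not checked
against the current schema):

# Inherit the topic of the minimum-distance page that links to the new
# article; the new article's distance is that minimum plus one.
$sql = "SELECT cur.topic, cur.distance + 1 AS newdist " .
       "FROM links, cur " .
       "WHERE links.l_to = '{$newTitle}' AND cur.cur_title = links.l_from " .
       "ORDER BY cur.distance LIMIT 1";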
Would that work?
Axel
I've added a (presently for sysops only) Special:Undelete, which should
appear as "View and restore deleted pages" in the special pages list. It
lists the archived deleted pages (the majority of which are pure drivel
and should be flushed at some point), and you can view the archived
pages and their histories and optionally restore them to life.
Restoring a page with the same title as one that currently exists will
tack the deleted page's history onto the end of the existing one's.
The interface is still rather rough and not yet fully integrated with
the deletion log (i.e., links to undelete should perhaps appear next to
deletion notices, and restorations should definitely be listed in the
same log), and I probably haven't tested it as thoroughly as I ought to
have.
It is to be considered experimental, so please be gentle with it. If Lee
wants to set it up on the piclab server, less gentle testing is, I'm sure,
welcome there. :) I'd offer my own test server, but it's currently
behind a modem (wah!).
(File is in CVS.)
-- brion vibber (brion @ pobox.com)
Can this be added to the User Preferences panel?
Note: this is just a text mockup.
Default Recent Changes page
Number of titles: [100]
Number of hours: [1|12|24|72]
Show minor edits: [x]
Namespaces to display:
All [x]
Main: (encyclopedia entries) [x] +Talk [x]
Wikipedia: (pages about the site) [x] +Talk [x]
The hidden motive is to prep for Meta:, List:, or other languages, etc.
The number of links to an article is counted, instead of the number of
other articles that link to it. The difference matters, because some pages
contain an enormous number of links to the same article. For example, the
alphabetical lists of cities in Poland have so many links to the
voivodship pages that those pages immediately become "most wanted".
With the current query the top "most wanted" article has a count of 609;
counting distinct linking articles instead, it has 69.
Query:
$sql = "SELECT bl_to, COUNT( bl_to ) as nlinks " .
"FROM brokenlinks GROUP BY bl_to HAVING nlinks > 1 " .
"ORDER BY nlinks DESC LIMIT {$offset}, {$limit}";
It should be changed to something like this:
$sql = "SELECT bl_to, COUNT( DISTINCT bl_from ) as nlinks " .
"FROM brokenlinks GROUP BY bl_to HAVING nlinks > 1 " .
"ORDER BY nlinks DESC LIMIT {$offset}, {$limit}";
> So, why not make it "wikipedia:Fulltext Stoplist", and load the
> whole list from the database on each query? Might actually save
> some time in the long run, and the non-English wikipedias could
> easily develop their own lists. Or would that be too risky?
The stoplist is compiled into MySQL; it can't be changed without
recompiling the database software.
The CVS contains FulltextStoplist.php, which is a list of the "common
words" excluded from search queries. It contains only English words,
which caused complaints on the German wikipedia, as they, at least,
don't want to be kept from searching for "false friends" common in English.
It would be easy to just make it another array/function in the Language
files, but
1. AFAIK, it is only used in one function, namely search
2. It might be nice if updating this list would be easy for everyone,
not just developers
So, why not make it "wikipedia:Fulltext Stoplist", and load the whole
list from the database on each query? Might actually save some time in
the long run, and the non-English wikipedias could easily develop their
own lists. Or would that be too risky?
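A minimal standalone sketch of what the loading could look like (plain
mysql_* calls here for clarity rather than the wiki's own query wrappers;
the page title, function name, and the assumption that the Wikipedia:
namespace is number 4 are all placeholders, not the real code):

# Load the stoplist from [[Wikipedia:Fulltext Stoplist]], one word per line.
function wfLoadStoplist() {
    $res = mysql_query( "SELECT cur_text FROM cur WHERE cur_namespace=4 " .
        "AND cur_title='Fulltext_Stoplist' LIMIT 1" );
    if ( !$res || mysql_num_rows( $res ) == 0 ) {
        return array();    # page missing: fall back to no stopwords at all
    }
    $row = mysql_fetch_row( $res );
    $words = preg_split( '/\s+/', strtolower( $row[0] ) );
    return array_flip( array_filter( $words ) );  # word => key, for fast isset()
}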
Magnus
I think the Nupedia software is written so that articles in a subject area
are not displayed unless a subject is "active." So, none of the
philosophy articles are displayed, because there is no philosophy editor
and therefore the subject area is "inactive."
This is just a bug, not a requested feature. Whether it can and will be
fixed--don't know!
Of course, if we were to implement the Nupedia system agreed upon last
November, or some other simpler system, presumably this sort of problem
wouldn't exist...
--Larry
(Wikitech-l: this is more on automatic subject classification, which Axel
brought up recently on Wikipedia-l.)
On Mon, 23 Sep 2002, Axel Boldt wrote:
[snip excellent comments that I agree with]
> I still believe that all of this can and should be
> done automatically, by tracing link paths from the
> main page.
I'm going to repeat some of what you've said earlier, adding my own
perspective. I really hope some programmers pursue this--they needn't ask
anyone's permission. The proof's in the pudding.
If automatic categorization could be done, and it sounds very plausible to
me, it would be *far* superior to a hand-maintained list of subject area
links. And incredibly useful, too.
OK, the following will reiterate some of the earlier discussion.
Presumably, nearly every page on Wikipedia can be reached from nearly
every other page. (There are orphans; and there are pages that do not
link to any other pages, though other pages link to them.)
This suggests that we can basically assign a number--using *some*
algorithm (not necessarily any one in particular: here is where
programmers can be creative)--giving the "closeness" of a page to all the
listed subjects. (This is very much like the Kevin Bacon game, of course,
and the "six degrees of separation" phenomenon.)
The question whether a *useful* algorithm can be stated is interesting
from a theoretical point of view. As I understand it, the suggestion is
that there is a simple and reliable (but how reliable?) algorithm, such
that, given simply a list of all the links in Wikipedia (viz., the source
page and destination page), and a list of subject categories, we can
reliably sort all pages into their proper categories.
It will not do to say, "There are obvious counterexamples, so let's not
even try." We can live with some slop. This is Wikipedia! We could even
fix errors by hand (ad hoc corrections are possible; why not?). As far as
I'm concerned, the real question is, once we try *various* algorithms,
what's the highest reliability we can actually generate? I'll bet it'll
be reasonably high, certainly high enough to be quite useful.
Here's an attempt at expressing an algorithm:
For a given page P (e.g., [[Plato's allegory of the cave]]), if the
average number of clicks (not backtracking to any page already reached--
otherwise you deal with infinite regresses) needed to reach P from the
subject page S (e.g., [[Philosophy]]) through all possible link paths between P
and S (or, perhaps, all paths below a certain benchmark length?)
is lower than the average number of clicks needed to reach P from any other
subject page, then P is "about" S.
The algorithm could be augmented in useful ways. In case of ties, or near
ties, a page could be listed as under multiple subjects. I have no idea
if this algorithm is correct, but that doesn't matter--it's just an
example. If you think harder and longer, I'm sure you'll think of a
better one.
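For instance, a shortest-path (breadth-first) variant of the above, run
offline over a dump rather than on the live server, might look roughly
like this (assuming $links is an array mapping each title to the titles it
links to; this is only an illustration, not the algorithm itself):

function wfSubjectDistances( $subject, $links ) {
    # Breadth-first search out from one subject page, recording the
    # minimum number of clicks needed to reach every other page.
    $dist  = array( $subject => 0 );
    $queue = array( $subject );
    while ( count( $queue ) > 0 ) {
        $page = array_shift( $queue );
        if ( !isset( $links[$page] ) ) {
            continue;
        }
        foreach ( $links[$page] as $target ) {
            if ( !isset( $dist[$target] ) ) {
                $dist[$target] = $dist[$page] + 1;
                $queue[] = $target;
            }
        }
    }
    return $dist;   # pages absent from the result are unreachable from $subject
}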
This would be fascinating, I'm sure, for the programmers. Can't we just
take the question about how long processing will require as a constraint
on the algorithm rather than as a knock-down argument that it's not
feasible? The *exercise* is to find (and implement!) an algorithm that
*is* feasible. We don't even have to do this using Wikipedia's server, if
it would be too great of a load; anyone could download the tarball and
process it. You could do a cron job once a day, compile the 40-odd
"subject numbers" for each article in Wikipedia, and sort articles into
subject groups (in some cases, multiple subject groups for a given
article--why not?). From there we could use scripts already written to
create the many "recent changes" pages.
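Concretely, the daily job could run the sketch above once per subject page
and keep, for each article, the subject (or subjects, in near ties) with
the smallest distance ($subjects and $links are again assumed inputs):

$best = array();   # title => array( subject, distance )
foreach ( $subjects as $subject ) {          # the 40-odd Main Page subjects
    foreach ( wfSubjectDistances( $subject, $links ) as $page => $d ) {
        if ( !isset( $best[$page] ) || $d < $best[$page][1] ) {
            $best[$page] = array( $subject, $d );
        }
    }
}
# Anything never reached ends up absent from $best: the topic orphans.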
I really, really, really want to see [[Philosophy Recent Changes]]. We
desperately need pages like that, and this is one of the best possible
ways we have of getting them. It's worth actually exploring.
--Larry
I'm forwarding this to wikitech-l.
I've been wanting to do a dump of our article titles to insert into the
search engines that I manage (bomis and 3apes, mainly), just to drive more
of the traffic that I influence towards wikipedia. I did a little program
for this in the old UseMod days, but Bomis hasn't updated its wikipedia
links since then. :-(
Perhaps we should have a script to generate RDF, which is a simple format used
by dmoz and familiar to search engine operators.
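A flat dump of the titles ought to only take a few lines against the
database; roughly this (namespace 0 being the encyclopedia articles;
redirects would still need filtering out somehow):

$res = mysql_query( "SELECT cur_title FROM cur WHERE cur_namespace=0" );
while ( $row = mysql_fetch_row( $res ) ) {
    # one headword per line, with underscores turned back into spaces
    print str_replace( "_", " ", $row[0] ) . "\n";
}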
--Jimbo
----- Forwarded message from Stephen Gilbert <canuck_in_korea2002(a)yahoo.com> -----
From: Stephen Gilbert <canuck_in_korea2002(a)yahoo.com>
Date: Sat, 21 Sep 2002 09:29:48 -0700 (PDT)
To: wikipedia-l(a)nupedia.com
Subject: [Wikipedia-l] Fwd: Adding Wikipedia to OneLook
I just received this from the maintainer of OneLook, a
meta-search of dictionaries (and some encyclopedias).
Is there any way to produce a flat file of all our
articles? This would also allow Wikipedia to once
again work with Sunir Shah's MetaWiki search.
If you haven't tried OneLook before, give it a spin.
It's fantastic.
http://www.onelook.com/
Stephen G.
--- Doug Beeferman <doug(a)dougb.com> wrote:
> Date: Fri, 20 Sep 2002 13:16:10 -0400 (EDT)
> From: Doug Beeferman <doug(a)dougb.com>
> To: <canuck_in_korea2002(a)yahoo.com>
>
>
> Hi Stephen,
>
> Thanks for your kind feedback on OneLook and sorry for the delay in
> responding to you. I already have some familiarity with Wikipedia/Nupedia
> as a user and would love to add its headwords to OneLook. Could you
> assist me in this? In particular, do you know of a flat file that I could
> point OneLook's update engine to that lists Wikipedia's headwords (or
> topic names, or however they're called in Wiki-land)?
>
> Running the available SQL files through MySQL to dump these headwords on
> every update would be a bit unwieldy for various reasons. One alternative
> would be a script that extracts the headwords from the .sql file, but I
> don't want to rewrite this if it's already been done.
>
> (I just spent fifteen minutes trying to figure out how to post this
> question on meta.wikipedia.com. I was defeated. There's probably
> something I'm missing or too lazy to read...)
>
> Thanks again for writing. Are you involved actively in Wikipedia's
> maintenance? If so, good luck with the project -- it looks like it's
> really taking off!
>
> Doug
> (a canuck in America)
>
> --- add canuck_in_korea2002(a)yahoo.com
> 2002-09-06 08:51:25 218.150.177.35 Mozilla/5.0 Galeon/1.2.5
> (X11; Linux i686; U;) Gecko/20020610 Debian/1.2.5-1
>
> I just discovered One-Look and it has quickly found a much coveted space
> on my browser's Personal Toolbar. I especially appreciate the lack of
> pop-up ads. Thanks for the great resource!
>
> I notice you have several encyclopedias in your database. I would like to
> suggest an encyclopedia called Wikipedia. It is an effort to build a
> complete encyclopedia from scratch, written by volunteers and released
> under the GNU Free Documentation License. Wikipedia is one and a half
> years old, and currently has about 40,000 articles, many of which are
> very good.
>
> Cheers!
>
> Stephen Gilbert
>
>
----- End forwarded message -----
Recently the image Great_Seal_of_the_United_States_(small).png vanished;
its image description page listed it as present and had a link to one
revision, but the image itself was gone.
After brief investigation, I found that the image file was still present
in the archives directory, and copied it back to where it belonged...
but there had also once been an earlier revision of the same file (a
non-transparent PNG), now missing both from the archive and the image page.
Grepping the access logs, I found that a spider had come across
the image description page and followed the links to both revert and delete
the older revision of the image -- simply loading these links caused
the wiki to move and permanently delete files.
Apparently telling it to both revert and delete the same revision
confused the poor wiki, and it ended up vanishing that revision entirely
_and_ leaving the newer one only in the archives.
As a workaround until a better way of handling these functions is
decided on, I've hacked Skin.php to not give the delete/revert links to
anonymous users (and therefore bots and spiders).
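Roughly, the hack just puts a guard like this ahead of the link generation
(illustrative only -- names and placement are not the actual Skin.php code):

if ( 0 == $wgUser->getID() ) {
    # Anonymous request, and therefore any bot or spider following links:
    # don't build the delete/revert links at all.
    return "";
}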
-- brion vibber (brion @ pobox.com)