Alphabetical article index (was [WikiEN-l] trying hard to do a list right)


Brion Vibber

17 Feb 2003

9:25 a.m.

(Moved from wikien-l to wikipedia-l; discussion of revamped alphabetical page index affects all languages.)

On Sun, 2003-02-16 at 17:17, Ray Saintonge wrote:

...

Brion Vibber wrote:

...
How about something like the alphabetical index for the online AHD: http://www.bartleby.com/61/s0.html ?

That would be a definite improvement over what we don't have now!

Okay, very preliminary version (code is in CVS): http://test.wikipedia.org/wiki/Special:Allpages

Note that the test wiki contains nearly all pages starting with 'A', so the index seems a little oddly weighted. ;)

There's probably some wiggle room in the ideal number of links per page and whatnot. It needs to be made prettier, with backlinks to the top level index and forward/back browsing, but the basic functionality is there.

Also we need to get a proper sorting system in (see my recent post on wikitech-l); for instance if I put this on the Esperanto wiki all the accented letters pile up at the end instead of in their proper places: http://eo.wikipedia.org/wiki/Speciala:Allpages

Other things: currently the list includes redirects. Some redirects should definitely stay -- alternate names that would not appear near each other, for instance. Others (spelling, caps variations) could maybe be dropped, but that's harder to do consistently. Or more simply, we could just italicize redirects or something.

The generation of the top level index is currently pretty inefficient; it makes a separate database query for each chunk of 480 articles, and takes a while to generate on a wiki with 100,000+ articles. Before putting it on the English wiki live, it'll need to have some sort of caching mechanism if it can't be made a lot faster.
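A minimal sketch of one way to avoid the per-chunk query, assuming the full list of titles can be fetched already sorted in a single query and held in memory. The function name `build_top_level_index` is hypothetical, not the code in CVS; only the chunk size of 480 comes from the message above. The result could then be cached and regenerated periodically.

```python
from typing import List, Tuple

CHUNK_SIZE = 480  # chunk size mentioned in the message


def build_top_level_index(sorted_titles: List[str],
                          chunk_size: int = CHUNK_SIZE) -> List[Tuple[str, str]]:
    """Return (first_title, last_title) for each consecutive chunk.

    Assumes sorted_titles is already in index order, e.g. the result of
    one ORDER BY query, so no further database round trips are needed.
    """
    ranges = []
    for start in range(0, len(sorted_titles), chunk_size):
        chunk = sorted_titles[start:start + chunk_size]
        ranges.append((chunk[0], chunk[-1]))
    return ranges
```

Each returned pair corresponds to one link on the top-level page, such as "Aardvark - Audio".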

Here's a saved copy of the toplevel index for the big English Wikipedia just to give an idea of scale (the links don't work): http://test.wikipedia.org/upload/c/cd/Allpages-demo.html

-- brion vibber (brion @ pobox.com)


Ray Saintonge

17 Feb

10:43 a.m.

Brion Vibber wrote:

...

On Sun, 2003-02-16 at 17:17, Ray Saintonge wrote:

...
Brion Vibber wrote:

...
How about something like the alphabetical index for the online AHD: http://www.bartleby.com/61/s0.html ?

That would be a definite improvement over what we don't have now!

Okay, very preliminary version (code is in CVS): http://test.wikipedia.org/wiki/Special:Allpages

It looks beautiful!

...

There's probably some wiggle room in the ideal number of links per page and whatnot. It needs to be made prettier, with backlinks to the top level index and forward/back browsing, but the basic functionality is there.

And I thought I could expect perfection on the first draft. :-)

...

Also we need to get a proper sorting system in (see my recent post on wikitech-l); for instance if I put this on the Esperanto wiki all the accented letters pile up at the end instead of in their proper places: http://eo.wikipedia.org/wiki/Speciala:Allpages

Would a table of equivalent values be workable, so that for sorting and searching purposes "à" and "ã" would be considered equivalent to "a", etc.? A second level of mini-sort would only be required when accents are the only thing distinguishing two entries. Entries in non-Latin scripts would still need to be handled separately.

There would still be the problem of those languages which really do consider some of those accented characters as special letters which sometimes belong at the end of the alphabet. But that should not be a worry for the English wikipedia, and modifications could be made for the others as required for their special rules.
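The equivalence table with a second-level "mini-sort" could look roughly like the sketch below. It assumes Unicode decomposition (Python's standard `unicodedata` module) can stand in for a hand-written table; the function names are hypothetical. Letters that do not decompose into a base letter plus a combining mark (for example "ø") pass through unchanged, and languages that treat accented letters as separate letters would need their own rules.

```python
import unicodedata
from typing import List, Tuple


def strip_accents(text: str) -> str:
    """Decompose characters (NFD) and drop combining marks, so 'à' -> 'a'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def two_level_key(title: str) -> Tuple[str, str]:
    # Primary key: accent-free, case-folded title.
    # Secondary key: the original title, which only matters when two
    # primary keys are identical (accents or case are the only difference).
    return (strip_accents(title).casefold(), title)


def sort_titles(titles: List[str]) -> List[str]:
    """Sort titles so accented letters file next to their base letters."""
    return sorted(titles, key=two_level_key)
```

With this key, "École" sorts beside "Ecole" instead of after "Zebra".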

...

Other things: currently the list includes redirects. Some redirects should definitely stay -- alternate names that would not appear near each other, for instance. Others (spelling, caps variations) could maybe be dropped, but that's harder to do consistently. Or more simply, we could just italicize redirects or something.

I agree, but for now it's a step ahead of where we are. Another interesting challenge will be reconciling this with the established policy of putting names as [[John Smith]] instead of [[Smith, John]].

...

The generation of the top level index is currently pretty inefficient; it makes a separate database query for each chunk of 480 articles, and takes a while to generate on a wiki with 100,000+ articles. Before putting it on the English wiki live, it'll need to have some sort of caching mechanism if it can't be made a lot faster.

As much as I find beauty in the proposal, there may be other ways to do this that are less demanding on the system but just as easy for the user. A tree that requires successively choosing the first, second and third letters of the first word might do this. So would a simple browse function that asks the user to supply the first few letters to begin his browse. This would still contain the opportunity to step back and forth to adjacent blocks.
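The browse-by-prefix idea can be sketched as follows, assuming the titles are available as a sorted in-memory list; in a database the same lookup would be a range query on an indexed title column. The names `find_start` and `page` are hypothetical.

```python
import bisect
from typing import List


def find_start(sorted_titles: List[str], prefix: str) -> int:
    """Index of the first title that sorts at or after the typed letters."""
    return bisect.bisect_left(sorted_titles, prefix)


def page(sorted_titles: List[str], start: int, size: int) -> List[str]:
    """Return one block of titles; start is clamped to the valid range."""
    start = max(0, min(start, len(sorted_titles)))
    return sorted_titles[start:start + size]
```

Stepping to the adjacent block is then `page(titles, start + size, size)` or `page(titles, start - size, size)`.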

Eclecticology

Brion Vibber

11:46 a.m.

New subject: Alphabetical article index (was [WikiEN-l] trying hard to do a list right)

On Mon, 2003-02-17 at 01:43, Ray Saintonge wrote:

...

Brion Vibber wrote:

...
Okay, very preliminary version (code is in CVS): http://test.wikipedia.org/wiki/Special:Allpages

It looks beautiful!

Woo-hoo!

...

...
There's probably some wiggle room in the ideal number of links per page and whatnot. It needs to be made prettier, with backlinks to the top level index and forward/back browsing, but the basic functionality is there.

And I thought I could expect perfection on the first draft. :-)

What, and leave you nothing to look forward to? ;)

...

Would a table of equivalent values be workable, so that for sorting and searching purposes "à" and "ã" would be considered equivalent to "a", etc.? A second level of mini-sort would only be required when accents are the only thing distinguishing two entries.

In order to be at all non-molasses-like, the sorting has to be ingrained into the indexes in the database.

To summarize my proposal on wikitech-l: this means either creating a suitable charset plugin for MySQL to work with when building indexes (can only be set on a server-wide basis -- not good enough for us, as each language must be treated distinctly) or adding a special sort-key field for each article. With a separate sort key, we can munge the titles on a per-language basis so that characters are equivalized or separated and rearranged such that a simple ASCII-style sort on the hidden field will turn up the right order.

For English and French, this means simply replacing accented characters with their base letters. For other languages this may involve adding dummy high or low ascii chars to force a letter to sort above or below an equivalent. But in all cases, it's the same basic mechanism -- make some replacements on a string, then store the result and let the database do a dumb sort with it.
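A rough sketch of the separate sort-key idea, under several assumptions: Python's built-in `sqlite3` stands in for MySQL, the `page` table and its `sort_key` column are hypothetical, and the Esperanto rule (append "~" so that "ĉ" files after every "c" plus lowercase letter) is just one possible munging, not a scheme taken from the wikitech-l proposal.

```python
import sqlite3
import unicodedata
from typing import Iterable, List

# Hypothetical Esperanto table: base letter plus a high ASCII character.
ESPERANTO_MAP = {
    "\u0109": "c~",  # c with circumflex
    "\u011d": "g~",  # g with circumflex
    "\u0125": "h~",  # h with circumflex
    "\u0135": "j~",  # j with circumflex
    "\u015d": "s~",  # s with circumflex
    "\u016d": "u~",  # u with breve
}


def fold_accents(title: str) -> str:
    """English/French rule: replace accented letters with base letters."""
    decomposed = unicodedata.normalize("NFD", title)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def esperanto_key(title: str) -> str:
    """Esperanto rule: accented letters sort right after their base letter."""
    lowered = unicodedata.normalize("NFC", title).lower()
    return "".join(ESPERANTO_MAP.get(ch, ch) for ch in lowered)


SORT_KEY_BUILDERS = {"en": fold_accents, "fr": fold_accents, "eo": esperanto_key}


def make_sort_key(title: str, lang: str) -> str:
    return SORT_KEY_BUILDERS.get(lang, str.lower)(title)


def build_index(titles: Iterable[str], lang: str) -> sqlite3.Connection:
    """Store each title with its munged key and index the key column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page (title TEXT, sort_key TEXT)")
    conn.executemany("INSERT INTO page VALUES (?, ?)",
                     [(t, make_sort_key(t, lang)) for t in titles])
    conn.execute("CREATE INDEX page_sort ON page (sort_key)")
    return conn


def titles_in_order(conn: sqlite3.Connection) -> List[str]:
    # The database does a plain byte-wise sort on the hidden key.
    rows = conn.execute("SELECT title FROM page ORDER BY sort_key, title")
    return [row[0] for row in rows]
```

The munging runs once when a title is saved, so reads only pay for an ordinary indexed sort.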

Not really related, but what might be useful too is a (per-language) list of index points; we might want the top level index to force distinct sections for each letter.

i.e., instead of:

Aardvark-Audio
Aural-Catapult
...

we force index breaks at the start of each letter:

Aardvark-Audio
Aural-Azimuth
Baal-Buzz
Cab-Cop
...
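Forcing a break at each letter can be sketched by grouping the sorted titles by first letter before chunking. This assumes the first character (upper-cased) is a good enough index point; a per-language list of index points would replace the `key` function. The function name is hypothetical.

```python
from itertools import groupby
from typing import List, Tuple


def index_ranges_by_letter(sorted_titles: List[str],
                           chunk_size: int) -> List[Tuple[str, str]]:
    """Chunk titles into (first, last) ranges that never span two letters."""
    ranges = []
    # groupby relies on the input already being sorted, so each initial
    # letter forms one contiguous run.
    for _, group in groupby(sorted_titles, key=lambda t: t[:1].upper()):
        letter_titles = list(group)
        for start in range(0, len(letter_titles), chunk_size):
            chunk = letter_titles[start:start + chunk_size]
            ranges.append((chunk[0], chunk[-1]))
    return ranges
```

The trade-off is that letters with few articles produce short chunks, so the top-level page gets somewhat longer.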

...

Another interesting challenge will be reconciling this with the established policy of putting names as [[John Smith]] instead of [[Smith, John]].

The simplest way is to abolish all alphabetical listings. ;)

...

As much as I find beauty in the proposal, there may be other ways to do this that are less demanding on the system but just as easy for the user. A tree that requires successively choosing the first, second and third letters of the first word might do this. So would a simple browse function that asks the user to supply the first few letters to begin his browse. This would still contain the opportunity to step back and forth to adjacent blocks.

Well, if you consider this to be a browse function:

CHDIR C:\WIKI\ESP
DIR ESP*.*

;)

Part of the usefulness of the all pages index is, at least in theory, providing a fairly direct path to all pages for search engine spiders; orphans and islands would still get linked to and indexed. So it's got to be linkable. That, and the funnest part of browsing is coming across things you wouldn't have thought of ahead of time -- this is helped by making some real words visible. Plus, it just establishes context more clearly to see whole words than a couple of letters jammed together.

-- brion vibber (brion @ pobox.com)

Toby Bartels

21 Feb

6:15 p.m.

New subject: Alphabetical article index

Brion VIBBER wrote in part:

...

Other things: currently the list includes redirects. Some redirects should definitely stay -- alternate names that would not appear near each other, for instance. Others (spelling, caps variations) could maybe be dropped, but that's harder to do consistently. Or more simply, we could just italicize redirects or something.

I favour using #DEPRECATED in addition to #REDIRECT. This would be used for misspellings, CamelCase, etc, while #REDIRECT would be kept for alternate titles that might well appear in page text but for whatever reason (naming conventions, arbitrary choices) aren't the page title.

For the most part, these would behave the same, but we can distinguish them in indexes of this sort. We can also do a special maintenance search for links to #DEPRECATED pages, since these will usually (not quite always, however) be a mistake in the page's text. To begin with, there will be no #DEPRECATED pages, but whenever we notice (say) CamelCase in the Allpages list, then we know that there's a #REDIRECT to change to #DEPRECATED.
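A sketch of how the proposal might work, assuming `#DEPRECATED [[Target]]` would mirror the existing `#REDIRECT [[Target]]` syntax. `#DEPRECATED` is only a proposal in this thread, and the function names and the dictionary-based page store are hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple

# Matches "#REDIRECT [[Target]]" and the proposed "#DEPRECATED [[Target]]".
DIRECTIVE_RE = re.compile(r"^#(REDIRECT|DEPRECATED)\s*\[\[([^\]|]+)",
                          re.IGNORECASE)


def parse_directive(page_text: str) -> Optional[Tuple[str, str]]:
    """Return (directive, target) if the page is a redirect of either kind."""
    match = DIRECTIVE_RE.match(page_text.lstrip())
    if match is None:
        return None
    return (match.group(1).upper(), match.group(2).strip())


def is_deprecated(page_text: str) -> bool:
    directive = parse_directive(page_text)
    return directive is not None and directive[0] == "DEPRECATED"


def allpages_titles(pages: Dict[str, str]) -> List[str]:
    """Titles for the index: ordinary redirects stay, deprecated ones drop."""
    return sorted(t for t, text in pages.items() if not is_deprecated(text))


def links_to_deprecated(pages: Dict[str, str],
                        links: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Maintenance report: (source, target) pairs that point at a
    #DEPRECATED page and so probably need fixing in the source text."""
    return [(source, target)
            for source, targets in sorted(links.items())
            for target in targets
            if is_deprecated(pages.get(target, ""))]
```

Both kinds of page would still forward the reader to the target; only the index and the maintenance report treat them differently.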

-- Toby

Lee Daniel Crocker

7:02 p.m.

New subject: Alphabetical article index

...

I favour using #DEPRECATED in addition to #REDIRECT. This would be used for misspellings, CamelCase, etc, while #REDIRECT would be kept for alternate titles that might well appear in page text but for whatever reason (naming conventions, arbitrary choices) aren't the page title.

For the most part, these would behave the same, but we can distinguish them in indexes of this sort. We can also do a special maintenance search for links to #DEPRECATED pages, since these will usually (not quite always, however) be a mistake in the page's text. To begin with, there will be no #DEPRECATED pages, but whenever we notice (say) CamelCase in the Allpages list, then we know that there's a #REDIRECT to change to #DEPRECATED.

This is an excellent suggestion; I invite the list to tell me if they see any problems with implementing it. If not, I'll do it.

-- Lee Daniel Crocker lee@piclab.com http://www.piclab.com/lee/ "All inventions or works of authorship original to me, herein and past, are placed irrevocably in the public domain, and may be used or modified for any purpose, without permission, attribution, or notification."--LDC

erik_moeller＠gmx.de

10:17 p.m.

New subject: Alphabetical article index

re #deprecated as alternative to #redirect:

...

This is an excellent suggestion; I invite the list to tell me if they see any problems with implementing it. If not, I'll do it.

I don't think we should add another #.. type command. Instead, the syntax of #REDIRECT should be expanded.

Please see http://www.wikipedia.org/pipermail/wikitech-l/2003-January/002345.html (search for "thought")

and

http://www.wikipedia.org/pipermail/wikitech-l/2003-February/002523.html

for an alternative way to improve the #REDIRECT syntax, namely by

1) allowing language-specific redirect reasons,
2) making it possible to turn a redirect into a disambiguation page by simply adding another #REDIRECT line.

Regards,

Erik

Ray Saintonge

23 Feb

10:31 p.m.

New subject: Alphabetical article index

Toby Bartels wrote:

...

Brion VIBBER wrote in part:

...
Other things: currently the list includes redirects. Some redirects should definitely stay -- alternate names that would not appear near each other, for instance. Others (spelling, caps variations) could maybe be dropped, but that's harder to do consistently. Or more simply, we could just italicize redirects or something.

I favour using #DEPRECATED in addition to #REDIRECT. This would be used for misspellings, CamelCase, etc, while #REDIRECT would be kept for alternate titles that might well appear in page text but for whatever reason (naming conventions, arbitrary choices) aren't the page title.

For the most part, these would behave the same, but we can distinguish them in indexes of this sort. We can also do a special maintenance search for links to #DEPRECATED pages, since these will usually (not quite always, however) be a mistake in the page's text. To begin with, there will be no #DEPRECATED pages, but whenever we notice (say) CamelCase in the Allpages list, then we know that there's a #REDIRECT to change to #DEPRECATED.

I see this as the sort of thing that a few people will use, but most of us will ignore. People would prefer not to cope with these subtle distinctions.

It could be used on a page that's a candidate for deletion, but which has been kept to avoid breaking the links in somebody's browser. A #DEPRECATED page could be deleted if it has not reached a predetermined level of accesses over a specified period of time.

Eclecticology

Fred Bauder

1 Mar

2:36 p.m.

New subject: Internet archive at the new library of alexandria

To access a web archive for about the last 8 years try:

http://www.bibalex.gov.eg/Start.asp?LangID=1

You'll need to search around a bit on the site for the "internet archive" link. You need the url to access the old pages.

Fred

Karl Eichwalder

3 Mar

5:44 a.m.

New subject: Internet archive at the new library of alexandria

Fred Bauder fredbaud@ctelco.net writes:

...

To access a web archive for about the last 8 years try:

http://www.bibalex.gov.eg/Start.asp?LangID=1

You'll need to search around a bit on the site for the "internet archive" link. You need the url to access the old pages.

Thanks for the pointer. Is there any other entry point? This URL just appears as a blue screen for me. And http://www.bibalex.gov.eg/ wants add-ons I don't have installed -- this site isn't user friendly at all :-(

-- ke@suse.de (work) / keichwa@gmx.net (home): | http://www.gnu.franken.de/ke/ | ,__o Free Translation Project: | _-_<, http://www.iro.umontreal.ca/contrib/po/HTML/ | (*)/'(*)

Fred Bauder

1:40 p.m.

New subject: Internet archive at the new library of alexandria

Something is wrong at the server; it won't load for me either in English or Arabic. Just wait till they get up again.

Fred

...

From: Karl Eichwalder ke@gnu.franken.de
Reply-To: wikipedia-l@wikipedia.org
Date: Mon, 03 Mar 2003 05:44:17 +0100
To: wikipedia-l@wikipedia.org
Subject: [Wikipedia-l] Re: Internet archive at the new library of alexandria

Fred Bauder fredbaud@ctelco.net writes:

...
To access a web archive for about the last 8 years try:

http://www.bibalex.gov.eg/Start.asp?LangID=1

You'll need to search around a bit on the site for the "internet archive" link. You need the url to access the old pages.

Thanks for the pointer. Is there any other entry point? This URL just appears as a blue screen for me. And http://www.bibalex.gov.eg/ wants add-ons I don't have installed -- this site isn't user friendly at all :-(

-- ke@suse.de (work) / keichwa@gmx.net (home): | http://www.gnu.franken.de/ke/ | ,__o Free Translation Project: | _-_<, http://www.iro.umontreal.ca/contrib/po/HTML/ | (*)/'(*)

Jens Frank

8:45 p.m.

New subject: Internet archive at the new library of alexandria

On Mon, Mar 03, 2003 at 05:40:52AM -0700, Fred Bauder wrote:

...

Something is wrong at the server; it won't load for me either in English or Arabic. Just wait till they get up again.

They said something about a fire at the library of Alexandria on the radio this morning. I don't remember the details; I was still half asleep.

Regards,

JeLuF

Andre Engels

4 Mar

10:29 a.m.

New subject: Internet archive at the new library of alexandria

On Mon, 3 Mar 2003, Karl Eichwalder wrote:

...

Fred Bauder fredbaud@ctelco.net writes:

...
To access a web archive for about the last 8 years try:

http://www.bibalex.gov.eg/Start.asp?LangID=1

You'll need to search around a bit on the site for the "internet archive" link. You need the url to access the old pages.

Thanks for the pointer. Is there any other entry point? This URL just appears as a blue screen for me. And http://www.bibalex.gov.eg/ wants add-ons I don't have installed -- this site isn't user friendly at all :-(

Another web site that gives access to the same archive is http://www.archive.org/

Andre Engels

Karl Eichwalder

5 Mar

8:02 p.m.

New subject: Internet archive at the new library of alexandria

Andre Engels engels@uni-koblenz.de writes:

...

Another web site that gives access to the same archive is http://www.archive.org/

I had already started wondering where they got the past pages from ;)

http://www.archive.org/ does not seem to crawl the net as eagerly as it did in recent years. In other words: it takes time until new pages find their way into the archive...

-- ke@suse.de (work) / keichwa@gmx.net (home): | http://www.gnu.franken.de/ke/ | ,__o Free Translation Project: | _-_<, http://www.iro.umontreal.ca/contrib/po/HTML/ | (*)/'(*)


wikipedia-l@lists.wikimedia.org
