In the process of writing some standards documents for the Wikipedia content model (some lower level behind-the-scenes stuff that needs to be done before working on the syntax and to beef up the test suite), I've come to the point where I need to decide exactly what characters are and are not allowed in page titles. I'd like to solicit input on this. Keep in mind here that what I'm specifying is the set of characters a page title can be chosen from; that is, what strings will be allowed between the brackets of a link, and displayed at the top of a page, regardless of whatever URL-encoding tricks we have to use to make that happen. _After_ we specify that, then we can specify exactly how to construct URLs from them. Here are my current thoughts:
* Cannot allow: # (sharp), | (pipe), " (quote), [] (brackets), {} (braces), <> (greater,less), + (plus), \ (backslash) because allowing them would interfere with link syntax and make the software more tricky to write. I can live without these, though I think + might be handy in some places (like C++), and might be worth the effort to allow.
* Should allow anything Unicode calls a letter, numeral, syllable, or ideograph.
* Should not allow Unicode diacriticals, combining forms, display forms (ligatures), controls, and other specials.
* Should allow most ASCII punctuation that might appear in a name or title in text, specifically - , . ( ) ' & : ; % ! ? / $ * (Note that some of these, like *, are not currently allowed, and that : is a special case that's allowed but only when the text before it doesn't match a namespace, etc.)
* Should not allow non-ASCII punctuation like em dash, curly quotes, etc., because they cause problems on machines with strict ISO character sets.
* Space is allowed. Underscore is allowed, but indistinguishable from space. No other controls (tab, etc.) are allowed.
Anyone have other ideas/suggestions?
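For concreteness, here is a rough sketch of the rules above as a character check, written in Python. The names (FORBIDDEN, ALLOWED_PUNCT, is_allowed_title) are made up for illustration, and mapping "letter, numeral, syllable, or ideograph" onto the Unicode general categories L* and N* is an assumption, not something the proposal spells out. The namespace special case for : is not modeled.

```python
import unicodedata

# Characters the proposal rules out because they clash with link syntax.
FORBIDDEN = set('#|"[]{}<>+\\')

# ASCII punctuation the proposal would permit in titles.
ALLOWED_PUNCT = set("-,.()'&:;%!?/$*")


def is_allowed_title_char(ch: str) -> bool:
    """Return True if a single character fits the proposed title rules."""
    if ch in FORBIDDEN:
        return False
    # Space and underscore are both accepted (and treated as equivalent).
    if ch in (" ", "_") or ch in ALLOWED_PUNCT:
        return True
    # Assumption: "letter, numeral, syllable, ideograph" is approximated by
    # the general categories starting with L (letters) or N (numbers).
    # Combining marks (M*), controls (C*) and other punctuation are rejected.
    return unicodedata.category(ch)[0] in ("L", "N")


def is_allowed_title(title: str) -> bool:
    """Return True if the title is non-empty and every character is allowed."""
    return bool(title) and all(is_allowed_title_char(c) for c in title)
```

Under this sketch "C++" and "A|B" are rejected, while a title made of precombined accented letters or ideographs passes.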
On Fri, 23 May 2003, Lee Daniel Crocker wrote:
- Cannot allow: # (sharp), | (pipe), " (quote), [] (brackets), {} (braces), <> (greater,less), + (plus), \ (backslash) because allowing them would interfere with link syntax and make the software more tricky to write. I can live without these, though I think + might be handy in some places (like C++), and might be worth the effort to allow.
Plus + and quote " are frequently asked for. These would not interfere with wiki syntax at all, though both would require escaping in URLs (as does the ampersand & when used in the query string and the percent % and question mark ? always, all of which we presently allow).
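As a small illustration of the escaping mentioned above, here is a sketch using Python's standard urllib.parse. The helper names title_to_url_path and url_path_to_title are hypothetical, and this is not how the wiki software itself builds URLs; it only shows that + " & % ? survive a round trip once percent-encoded.

```python
from urllib.parse import quote, unquote


def title_to_url_path(title: str) -> str:
    """Percent-encode a title for use in a URL path segment."""
    # Spaces become underscores, then every character outside the
    # unreserved set (letters, digits, "_", ".", "-", "~") is encoded
    # as UTF-8 percent escapes.
    return quote(title.replace(" ", "_"), safe="")


def url_path_to_title(path: str) -> str:
    """Reverse the encoding; underscores come back as spaces."""
    return unquote(path).replace("_", " ")


for title in ["C++", '"Heroes"', "AT&T", "100%", "Who?"]:
    encoded = title_to_url_path(title)
    print(title, "->", encoded, "->", url_path_to_title(encoded))
```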
- Should allow anything Unicode calls a letter, numeral, syllable, or ideograph.
Okay...
- Should not allow Unicode diacriticals, combining forms, display forms (ligatures), controls, and other specials.
Waitaminute... that would seem to exclude the use of accented characters that do not have a precombined form. This could be seriously detrimental to some languages.
(In any case, we ought to do a little fancier work with UTF-8 to make sure that canonical forms are used to prevent false non-matches. I don't know if there's a library we can link into PHP to do this or if we'd have to write something.)
-- brion vibber (brion @ pobox.com)
(Brion Vibber vibber@aludra.usc.edu):
- Should not allow Unicode diacriticals, combining forms, display forms (ligatures), controls, and other specials.
Waitaminute... that would seem to exclude the use of accented characters that do not have a precombined form. This could be seriously detrimental to some languages.
(In any case, we ought to do a little fancier work with UTF-8 to make sure that canonical forms are used to prevent false non-matches. I don't know if there's a library we can link into PHP to do this or if we'd have to write something.)
I confess ignorance here. Are there really languages for which the simplest canonical representation in Unicode requires combining forms? If so, then I remove the restriction, but we must then specify a specific canonical representation for titles in each language, as you suggest; perhaps something like a Stringprep profile would be needed.
On Tue, 27 May 2003, Lee Daniel Crocker wrote:
I confess ignorance here. Are there really languages for which the simplest canonical representation in Unicode requires combining forms?
Off the top of my head, one Aleutian language (Unangam Tunuu) uses x-with-circumflex; Guarani apparently uses g-with-tilde. Tone marks for Chinese Zhuyin phonetic script are combining characters; I think the Indian scripts are pretty dependent on this kind of thing as well.
Precombined characters are theoretically only included for round-trip conversion with legacy character sets, so they're not really making new ones for orthographies that are just getting started in the wonderful world of character encoding.
If so, then I remove the restriction, but we must then specify a specific canonical representation for titles in each language, as you suggest; perhaps something like a Stringprep profile would be needed.
They've thought of that already too, it seems. :) See Unicode Standard Annex #15, "Unicode normalization forms": http://www.unicode.org/unicode/reports/tr15/
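To make the "false non-match" problem concrete, here is a minimal sketch using Python's standard unicodedata module. The function name canonical_title is hypothetical, and picking NFC rather than one of the other normalization forms in the annex is an assumption for the example.

```python
import unicodedata


def canonical_title(title: str) -> str:
    """Normalize a title to NFC so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", title)


# The same visible word, typed two ways.
decomposed = "Cafe\u0301"   # "e" followed by COMBINING ACUTE ACCENT
precomposed = "Caf\u00e9"   # single precombined character

# Raw comparison fails; comparison after normalization succeeds.
print(decomposed == precomposed)
print(canonical_title(decomposed) == canonical_title(precomposed))

# x-with-circumflex and g-with-tilde have no precombined code point,
# so they still contain a combining mark after normalization.
for s in ["x\u0302", "g\u0303"]:
    print(len(canonical_title(s)))
```

This is why a blanket ban on combining characters would lock out those orthographies: normalization can fold combining sequences into precombined characters only where such characters exist.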
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
On Tue, 27 May 2003, Lee Daniel Crocker wrote:
I confess ignorance here. Are there really languages for which the simplest canonical representation in Unicode requires combining forms?
Off the top of my head, one Aleutian language (Unangam Tunuu) uses x-with-circumflex; Guarani apparently uses g-with-tilde. Tone marks for Chinese Zhuyin phonetic script are combining characters; I think the Indian scripts are pretty dependent on this kind of thing as well.
Also nasalized vowels for IPA.
Ec
On Fri, 23 May 2003, Lee Daniel Crocker wrote:
Date: Fri, 23 May 2003 13:46:28 -0500 From: Lee Daniel Crocker lee@piclab.com Subject: [Wikitech-l] Title characters
<snip>
Cannot allow: # (sharp), | (pipe), " (quote), [] (brackets), {} (braces), <> (greater,less), + (plus), \ (backslash) because allowing them would interfere with link syntax and make the software more tricky to write. I can live without these, though I think + might be handy in some places (like C++), and might be worth the effort to allow.
Should allow most ASCII punctuation that might appear in a name or title in text, specifically - , . ( ) ' & : ; % ! ? / $ * (Note that some of these, like *, are not currently allowed, and that : is a special case that's allowed but only when the text before it doesn't match a namespace, etc.)
Should not allow non-ASCII punctuation like em dash, curly quotes, etc., because they cause problems on machines with strict ISO character sets.
Space is allowed. Underscore is allowed, but indistinguishable from space. No other controls (tab, etc.) are allowed.
Anyone have other ideas/suggestions?
Missed one: the "at" symbol, @, is currently not allowed. I don't feel strongly one way or the other about it myself, but it's come up on the Village Pump recently when someone wanted to use it, so it should probably be on one of those lists.
"Lee Daniel Crocker" lee@piclab.com wrote in message news:20030523184628.GA22556@piclab.com...
...
Anyone have other ideas/suggestions?
Here's one that's almost unrelated, but your post has reminded me of it. We've had a number of requests for titles with an initial lowercase letter. I would like to see a checkbox next to "Watch this article" on the edit page, labelled "Title starts with lowercase letter". Clicking on "save" with this checked would set a flag in the cur table, instructing getPrefixedText() to set the initial letter to lowercase. So [[IMac]] and [[iMac]] still go to the same place, but when the software has to display the title, it comes out as [[iMac]].
Alternately you could just have a link in the sidebar, like for page protection -- but please make sure the change is registered in RC.
-- Tim Starling.
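A rough sketch of the flag idea above, in Python rather than the wiki's own code. The Page class, the lowercase_first field and both helper functions are hypothetical stand-ins; the real cur table and getPrefixedText() are not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Page:
    key: str                       # stored title, first letter uppercased
    lowercase_first: bool = False  # hypothetical per-page display flag


def make_key(link_text: str) -> str:
    """Uppercase the first letter, so [[iMac]] and [[IMac]] share one key."""
    return link_text[:1].upper() + link_text[1:]


def display_title(page: Page) -> str:
    """Return the title as it should be shown at the top of the page."""
    if page.lowercase_first:
        return page.key[:1].lower() + page.key[1:]
    return page.key


# Both link spellings resolve to the same stored page...
page = Page(key=make_key("iMac"), lowercase_first=True)
print(make_key("IMac") == page.key)
# ...and the flag only changes how the title is displayed.
print(display_title(page))
```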
On Sun, May 25, 2003 at 12:20:12AM +1000, Tim Starling wrote:
"Lee Daniel Crocker" lee@piclab.com wrote in message news:20030523184628.GA22556@piclab.com...
...
Anyone have other ideas/suggestions?
Here's one that's almost unrelated, but your post has reminded me of it. We've had a number of requests for titles with an initial lowercase letter. I would like to see a checkbox next to "Watch this article" on the edit page, labelled "Title starts with lowercase letter". Clicking on "save" with this checked would set a flag in the cur table, instructing getPrefixedText() to set the initial letter to lowercase. So [[IMac]] and [[iMac]] still go to the same place, but when the software has to display the title, it comes out as [[iMac]].
Alternately you could just have a link in the sidebar, like for page protection -- but please make sure the change is registered in RC.
IMHO it makes no sense to have 2 articles that differ only in capitalization - there should be some "canonical" form, but all links, no matter what capitalization they have, should go to the same article.
That would help a lot with computer stuff.
On Sun, 25 May 2003, Tomasz Wegrzanowski wrote:
On Sun, May 25, 2003 at 12:20:12AM +1000, Tim Starling wrote:
"Lee Daniel Crocker" lee@piclab.com wrote in message news:20030523184628.GA22556@piclab.com...
...
Anyone have other ideas/suggestions?
Here's one that's almost unrelated, but your post has reminded me of it. We've had a number of requests for titles with an initial lowercase letter. I would like to see a checkbox next to "Watch this article" on the edit page, labelled "Title starts with lowercase letter". Clicking on "save" with this checked would set a flag in the cur table, instructing getPrefixedText() to set the initial letter to lowercase. So [[IMac]] and [[iMac]] still go to the same place, but when the software has to display the title, it comes out as [[iMac]].
Alternately you could just have a link in the sidebar, like for page protection -- but please make sure the change is registered in RC.
IMHO it makes no sense to have 2 articles that differ only in capitalization - there should be some "canonical" form, but all links, no matter what capitalization they have, should go to the same article.
That's exactly what Tim is proposing - the only change is that the canonical form can be decided for each word separately rather than always being set to 'with capital'.
Andre Engels
On Mon, May 26, 2003 at 09:49:15AM +0200, Andre Engels wrote:
On Sun, 25 May 2003, Tomasz Wegrzanowski wrote:
On Sun, May 25, 2003 at 12:20:12AM +1000, Tim Starling wrote:
"Lee Daniel Crocker" lee@piclab.com wrote in message news:20030523184628.GA22556@piclab.com...
...
Anyone have other ideas/suggestions?
Here's one that's almost unrelated, but your post has reminded me of it. We've had a number of requests for titles with an initial lowercase letter. I would like to see a checkbox next to "Watch this article" on the edit page, labelled "Title starts with lowercase letter". Clicking on "save" with this checked would set a flag in the cur table, instructing getPrefixedText() to set the initial letter to lowercase. So [[IMac]] and [[iMac]] still go to the same place, but when the software has to display the title, it comes out as [[iMac]].
Alternately you could just have a link in the sidebar, like for page protection -- but please make sure the change is registered in RC.
IMHO it makes no sense to have 2 articles that differ only in capitalization - there should be some "canonical" form, but all links, no matter what capitalization they have, should go to the same article.
That's exactly what Tim is proposing - the only change is that the canonical form can be decided for each word separately rather than always being set to 'with capital'.
Well, I'm more concerned about "UNIX" vs. "Unix".
(Tomasz Wegrzanowski taw@users.sourceforge.net):
Well, I'm more concerned about "UNIX" vs. "Unix".
Or more generally, acronyms. "CAT" is computer assisted tomography, while "cat" is a furry creature. But if we did go to complete case-insensitivity, the problem would be merely another source of title ambiguity, which we are already used to dealing with (i.e., the "cat" page would deal with the creature and the machine just as the "Mercury" page deals with the metal, the planet, and the god), so that's not a major impediment.
We'd have to canonicalize the URLs in some way (for example, by making every character in the URL lowercase all the time), and then make a guess about what actual title to create for new pages.
I don't know if it's possible to make every case easy, so we have to settle for making the majority of cases easy. I think most page titles are still such that they should be capitalized as titles but not in running text, just like "cat". So the present system handles the common case well. True, it doesn't handle some other cases, but I'm not really sure we could do that without complicating the more common case.
I'd need to see more argument about exactly how to handle this before I'd be convinced to change it.
On Tue, May 27, 2003 at 02:18:39PM -0500, Lee Daniel Crocker wrote:
(Tomasz Wegrzanowski taw@users.sourceforge.net):
Well, I'm more concerned about "UNIX" vs. "Unix".
Or more generally, acronyms. "CAT" is computer assisted tomography, while "cat" is a furry creature. But if we did go to complete case-insensitivity, the problem would be merely another source of title ambiguity, which we are already used to dealing with (i.e., the "cat" page would deal with the creature and the machine just as the "Mercury" page deals with the metal, the planet, and the god), so that's not a major impediment.
We'd have to canonicalize the URLs in some way (for example, by making every character in the URL lowercase all the time), and then make a guess about what actual title to create for new pages.
I don't know if it's possible to make every case easy, so we have to settle for making the majority of cases easy. I think most page titles are still such that they should be capitalized as titles but not in running text, just like "cat". So the present system handles the common case well. True, it doesn't handle some other cases, but I'm not really sure we could do that without complicating the more common case.
I'd need to see more argument about exactly how to handle this before I'd be convinced to change it.
We need 2 canonical forms - a database canonical form for linking, which is always lowercase, and a presentation canonical form, which is by default ucfirst(title_of_link_that_created_article), and can be overridden by #CANONICALFORM iMac or something.
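Something like the following Python sketch is one way to read the two-form idea. The TitleIndex class and its methods are invented for illustration, and the override argument stands in for the suggested #CANONICALFORM directive, whose syntax is not defined anywhere in this thread.

```python
from typing import Dict, Optional


def database_key(link_text: str) -> str:
    """Database canonical form: always lowercase."""
    return link_text.lower()


def ucfirst(text: str) -> str:
    """Uppercase only the first character, leaving the rest untouched."""
    return text[:1].upper() + text[1:]


class TitleIndex:
    """Maps lowercase database keys to their presentation forms."""

    def __init__(self) -> None:
        self._display: Dict[str, str] = {}

    def create(self, link_text: str, override: Optional[str] = None) -> str:
        key = database_key(link_text)
        shown = override if override is not None else ucfirst(link_text)
        # The presentation form may differ from the key only in case.
        if database_key(shown) != key:
            raise ValueError("override must match the title except for case")
        self._display[key] = shown
        return key

    def display(self, link_text: str) -> str:
        return self._display[database_key(link_text)]


index = TitleIndex()
index.create("unix")                   # default presentation: "Unix"
index.create("imac", override="iMac")  # explicit presentation form
print(index.display("UNIX"), index.display("IMAC"))
```

Note that under this scheme "CAT" and "cat" necessarily land on the same page, which is the ambiguity discussed earlier in the thread.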
On Fri, 23 May 2003, Lee Daniel Crocker wrote:
- Cannot allow: # (sharp), | (pipe), " (quote), [] (brackets), {} (braces), <> (greater,less), + (plus), \ (backslash) because allowing them would interfere with link syntax and make the software more tricky to write. I can live without these, though I think + might be handy in some places (like C++), and might be worth the effort to allow.
(...)
- Should allow most ASCII punctuation that might appear in a name or title in text, specifically - , . ( ) ' & : ; % ! ? / $ * (Note that some of these, like *, are not currently allowed, and that : is a special case that's allowed but only when the text before it doesn't match a namespace, etc.)
Note that currently & is allowed, but not working - linking to a page with '&' in the title takes you to the page named by only the part before the '&'. Thus, in my opinion, this one also counts as 'interfering with link syntax'. It's a very useful one, so if you are going to do things to make some of these possible, this one should certainly be included. The same type of problem might exist with '?'; I don't know about that.
Andre Engels
On Monday 26 May 2003 00:52, Andre Engels wrote:
Note that currently & is allowed, but not working - linking to a page with '&' in the title takes you to the page named by only the part before the '&'.
No, that's just a bug in the rewrite rules; some of the wikis didn't get the proper escaping fix added in. Fixed on NL now.
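The symptom is easy to reproduce outside the wiki. The Python sketch below is only an analogy using the standard urllib.parse module, not the actual rewrite rules (which are not shown in this thread); title_from_query is a hypothetical helper. It shows how an unescaped & in a query string cuts the title short, and how escaping it avoids that.

```python
from urllib.parse import parse_qs, urlencode


def title_from_query(query: str) -> str:
    """Extract the 'title' parameter from a query string."""
    return parse_qs(query).get("title", [""])[0]


# Unescaped: the & is read as a parameter separator, so the title
# is cut off at that point.
broken_query = "title=AT&T"
print(title_from_query(broken_query))

# Escaped: urlencode turns & into %26, and the full title survives.
fixed_query = urlencode({"title": "AT&T"})
print(fixed_query)
print(title_from_query(fixed_query))
```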
-- brion vibber (brion @ pobox.com)
wikitech-l@lists.wikimedia.org