In the process of writing some standards documents for the Wikipedia content model (some lower-level, behind-the-scenes stuff that needs to be done before working on the syntax and to beef up the test suite), I've come to the point where I need to decide exactly which characters are and are not allowed in page titles. I'd like to solicit input on this. Keep in mind that what I'm specifying here is the set of characters from which a page title can be chosen; that is, which strings will be allowed between the brackets of a link and displayed at the top of a page, regardless of whatever URL-encoding tricks we have to use to make that happen. _After_ we specify that, we can specify exactly how to construct URLs from them. Here are my current thoughts:
* Cannot allow: # (sharp), | (pipe), " (quote), [] (brackets), {} (braces), <> (less, greater), + (plus), \ (backslash), because allowing them would interfere with link syntax and make the software trickier to write. I can live without these, though I think + might be handy in some places (like C++), and might be worth the effort to allow.
* Should allow anything Unicode calls a letter, numeral, syllable, or ideograph.
* Should not allow Unicode diacriticals, combining forms, display forms (ligatures), controls, and other specials.
* Should allow most ASCII punctuation that might appear in a name or title in text, specifically - , . ( ) ' & : ; % ! ? / $ * (Note that some of these, like *, are not currently allowed, and that : is a special case that's allowed only when the text before it doesn't match a namespace, etc.)
* Should not allow non-ASCII punctuation like em dash, curly quotes, etc., because they cause problems on machines with strict ISO character sets.
* Space is allowed. Underscore is allowed, but indistinguishable from space. No other whitespace or control characters (tab, etc.) are allowed.
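To make the rules above easier to poke holes in, here is a rough Python sketch of a checker. It is only an approximation under stated assumptions, not what the software does: the names `normalize_title`, `is_allowed_char` and `is_valid_title` are made up for illustration, "letter, numeral, syllable, or ideograph" is approximated by Unicode general categories starting with L or N, "display forms" are approximated by characters with a `<compat>` decomposition, and the namespace special case for : is not handled at all.

```python
import unicodedata

# Characters ruled out because they would clash with link syntax.
FORBIDDEN = set('#|"[]{}<>+\\')

# ASCII punctuation the proposal would allow in titles.
ALLOWED_PUNCT = set("-,.()'&:;%!?/$*")


def normalize_title(title: str) -> str:
    """Underscore is indistinguishable from space, so fold it to a space."""
    return title.replace("_", " ")


def is_allowed_char(ch: str) -> bool:
    """Approximate the proposed rules for a single character."""
    if ch == " ":
        return True
    if ch in FORBIDDEN:
        return False
    if ch in ALLOWED_PUNCT:
        return True
    # Assumption: letters, numerals, syllables and ideographs are the
    # characters whose general category starts with L or N.
    if unicodedata.category(ch)[0] in ("L", "N"):
        # Assumption: ligature-style display forms are recognizable by a
        # <compat> decomposition; this is a coarse test.
        return not unicodedata.decomposition(ch).startswith("<compat>")
    # Everything else (combining marks, controls, non-ASCII punctuation,
    # other ASCII symbols) is rejected.
    return False


def is_valid_title(title: str) -> bool:
    """True if every character of the normalized title is allowed."""
    return all(is_allowed_char(ch) for ch in normalize_title(title))
```

With this sketch, "Main_Page" and "Main Page" are treated as the same title and pass, while "C++" fails because of the + rule. A precomposed é passes but e followed by a combining acute accent does not, which is one consequence of the "no combining forms" rule that may deserve discussion.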
Anyone have other ideas/suggestions?