[Wikitech-l] Title characters

23 May 2003


In the process of writing some standards documents for the Wikipedia
content model (some lower-level behind-the-scenes stuff that needs to
be done before working on the syntax and to beef up the test suite),
I've come to the point where I need to decide exactly what characters
are and are not allowed in page titles. I'd like to solicit input on
this. Keep in mind here that what I'm specifying is the set of
characters a page title can be chosen from; that is, what strings
will be allowed between the brackets of a link, and displayed at the
top of a page, regardless of whatever URL-encoding tricks we have to
use to make that happen. _After_ we specify that, then we can specify
exactly how to construct URLs from them. Here are my current thoughts:
* Cannot allow: # (sharp), | (pipe), " (quote), [] (brackets),
  {} (braces), <> (less, greater), + (plus), \ (backslash) because
  allowing them would interfere with link syntax and make the
  software more tricky to write. I can live without these, though
  I think + might be handy in some places (like C++), and might be
  worth the effort to allow.
* Should allow anything Unicode calls a letter, numeral, syllable,
  or ideograph.
* Should not allow Unicode diacriticals, combining forms, display
  forms (ligatures), controls, and other specials.
* Should allow most ASCII punctuation that might appear in a name
  or title in text, specifically - , . ( ) ' & : ; % ! ? / $ *
  (Note that some of these, like *, are not currently allowed,
  and that : is a special case that's allowed but only when the
  text before it doesn't match a namespace, etc.)
* Should not allow non-ASCII punctuation like em dash, curly
  quotes, etc., because they cause problems on machines with
  strict ISO character sets.
* Space is allowed. Underscore is allowed, but indistinguishable
  from space. No other controls (tab, etc.) are allowed.
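
To make the proposal easier to poke at, here is a rough Python sketch of
the rules above. It is only an illustration under several assumptions:
the names (FORBIDDEN, ALLOWED_PUNCT, normalize_title, char_allowed,
is_valid_title) are hypothetical; "letter, numeral, syllable, or
ideograph" is read as Unicode general categories L* and N*; "display
forms" is read as any character with a compatibility decomposition;
any character not named in the lists is rejected; and the namespace
special case for : is not handled at all.

```python
import unicodedata

# Characters the proposal rules out because they clash with link syntax.
FORBIDDEN = set('#|"[]{}<>+\\')

# ASCII punctuation the proposal would allow in titles.
ALLOWED_PUNCT = set("-,.()'&:;%!?/$*")


def normalize_title(title):
    """Treat underscore as indistinguishable from space."""
    return title.replace("_", " ")


def char_allowed(ch):
    """Return True if a single character fits the proposed rules."""
    if ch == " ":
        return True
    if ch in FORBIDDEN:
        return False
    if ch in ALLOWED_PUNCT:
        return True
    category = unicodedata.category(ch)
    if category[0] in ("L", "N"):
        # Assumption: a compatibility decomposition (tagged with "<...>")
        # marks a display form such as a ligature, so reject it.
        return not unicodedata.decomposition(ch).startswith("<")
    # Combining marks, controls, non-ASCII punctuation, and anything
    # else not listed above are rejected.
    return False


def is_valid_title(title):
    """Check a whole title; an empty title is rejected (an assumption)."""
    title = normalize_title(title)
    return bool(title) and all(char_allowed(ch) for ch in title)
```

Under these assumptions, is_valid_title("Main_Page") accepts the title
and is_valid_title("C++") rejects it, which shows the cost of banning +.
Note that a precomposed letter such as U+00E9 passes as a letter, while
the same letter written as "e" plus a combining accent (U+0301) does not.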
Anyone have other ideas/suggestions?
-- 
Lee Daniel Crocker lee@piclab.com http://www.piclab.com/lee/
"All inventions or works of authorship original to me, herein and past,
are placed irrevocably in the public domain, and may be used or modified
for any purpose, without permission, attribution, or notification."--LDC
