(Brion Vibber vibber@aludra.usc.edu):
- Should not allow Unicode diacriticals, combining forms, display forms (ligatures), controls, and other specials.
Waitaminute... that would seem to exclude the use of accented characters that do not have a precombined form. This could be seriously detrimental to some languages.
(In any case, we ought to do somewhat fancier processing of UTF-8 to make sure that canonical forms are used, so that equivalent titles don't falsely fail to match. I don't know whether there's a library we can link into PHP to do this or whether we'd have to write something ourselves.)
I confess ignorance here. Are there really languages for which the simplest canonical representation in Unicode requires combining forms? If so, then I remove the restriction, but we must then specify a single canonical representation for titles in each language, as you suggest; perhaps something like a Stringprep profile would be needed.
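To make the "false non-matches" concern concrete, here is a minimal sketch in Python (not PHP, purely for illustration) using the standard library's `unicodedata.normalize`. The helper name `canonical_title` is hypothetical, and picking NFC as the canonical form is an assumption; nothing above settles on a particular form.

```python
import unicodedata


def canonical_title(title: str) -> str:
    """Return the title in Unicode Normalization Form C (NFC).

    Hypothetical helper: NFC is assumed here as the canonical form.
    """
    return unicodedata.normalize("NFC", title)


# The same visible title, encoded two ways.
precomposed = "caf\u00e9"   # 'e with acute' as one code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT (U+0301)

# Compared as raw strings the two differ, which is the false non-match.
print(precomposed == decomposed)

# After normalization they compare equal.
print(canonical_title(precomposed) == canonical_title(decomposed))

# 'q' + COMBINING TILDE has no precomposed code point, so even in NFC
# the combining mark remains in the title.
q_tilde = "q\u0303"
print(len(canonical_title(q_tilde)))
```

The last case bears on the question of whether combining forms can be banned outright: for sequences with no precomposed equivalent, normalization cannot remove the combining mark, so a blanket ban would reject such titles entirely.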