Unicode high characters versus MySQL 5


Brion Vibber

14 Oct 2005

10:23 p.m.

MySQL 5 is scheduled to come out of beta next month, and we're going to be looking at upgrading sometime in the coming months. Among other things we're probably going to want to start making use of the support for Unicode collation, so we can get better sorting and perhaps use it for case-insensitive matching.

There is however a compatibility issue: MySQL's Unicode support is limited to the 16-bit character range (basic multilingual plane), both for ucs2 and utf8 storage modes.
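To make the limit concrete, here is a minimal Python sketch (an illustration added to this thread, not the test that was run against MySQL; the sample characters are chosen arbitrarily). It shows that a character above U+FFFF needs four bytes in UTF-8 and two 16-bit units in UTF-16, which is what a column restricted to the BMP cannot represent:

```python
def is_beyond_bmp(ch):
    """True if the single character lies outside the Basic Multilingual Plane."""
    return ord(ch) > 0xFFFF


def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 code unit count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be")) // 2


# Illustrative samples: one ASCII letter, one BMP Han character,
# and one Gothic letter from plane 1.
samples = {
    "U+0061 LATIN SMALL LETTER A": "a",
    "U+4E2D CJK UNIFIED IDEOGRAPH": "\u4e2d",
    "U+10330 GOTHIC LETTER AHSA": "\U00010330",
}

for name, ch in samples.items():
    utf8_bytes, utf16_units = encoded_sizes(ch)
    print(name, "beyond BMP:", is_beyond_bmp(ch),
          "UTF-8 bytes:", utf8_bytes, "UTF-16 units:", utf16_units)
```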

Characters beyond the BMP are relatively rare, but they do occur. They are mostly ancient or dead scripts, some invented scripts, and a number of rare Han characters which sometimes turn up in Chinese and Japanese.

This won't affect page _contents_; our content is stored in binary blobs and can have any wacky characters we want. But to support these high characters in page titles, usernames, and such might require jumping through a lot of hoops.

It would be relatively simple to disable use of titles and usernames with these high characters; to assess possible impact I did a check through all our current wikis and found 99 extant pages:

43 in en.wiktionary.org
31 in got.wikipedia.org
10 in la.wiktionary.org
9 in zh.wikipedia.org
3 in so.wikipedia.org
1 in en.wikibooks.org
1 in ja.wikipedia.org
1 in nl.wikibooks.org

I've put the full list of pages here: http://meta.wikimedia.org/wiki/User:Brion_VIBBER/Unicode_high_chars
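A survey like the one above could be sketched roughly as follows. This is a hypothetical illustration in Python, not the script that was actually used; the function names and the sample wiki names and titles are made up for the example:

```python
from collections import Counter


def has_high_chars(title):
    """True if any character in the title lies above U+FFFF."""
    return any(ord(ch) > 0xFFFF for ch in title)


def count_affected(titles_by_wiki):
    """Count titles containing non-BMP characters, per wiki."""
    counts = Counter()
    for wiki, titles in titles_by_wiki.items():
        affected = sum(1 for title in titles if has_high_chars(title))
        if affected:
            counts[wiki] = affected
    return counts


# Made-up sample data, for illustration only.
sample = {
    "got.example": ["\U00010330\U00010331", "Main_Page"],
    "en.example": ["Unicode", "MySQL"],
}

for wiki, n in count_affected(sample).most_common():
    print(n, "in", wiki)
```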

Most of the en.wiktionary entries are individual letters in the Deseret and Shavian alphabets (invented alphabets for English; historical curiosities).

The Gothic alphabet is entirely in the high-character area, but it's a long-dead language and not exactly an active wiki. Perhaps we should just close it down...

Latin Wiktionary contains several Gothic terms...

The Chinese Wikipedia contains several apparently legitimate articles (from what I can tell) using high characters; these might have to be moved. The Japanese Wikipedia has one redirect with such a character.

The Somali Wikipedia contains three one-sentence stub pages using the Osmanya script; Omniglot's article on it says this script is no longer in use since adoption of the Latin alphabet in 1972.

English Wikibooks has a user account with a Gothic-script name, which has edited a number of pages about the Gothic language and has a user page.

Dutch Wikibooks has one Gothic-titled redirect.

-- brion vibber (brion @ pobox.com)


Ray Saintonge

15 Oct

12:11 a.m.

Brion Vibber wrote:

...

MySQL 5 is scheduled to come out of beta next month, and we're going to be looking at upgrading sometime in the coming months. Among other things we're probably going to want to start making use of the support for Unicode collation, so we can get better sorting and perhaps use it for case-insensitive matching.

(snip)

...

It would be relatively simple to disable use of titles and usernames with these high characters; to assess possible impact I did a check through all our current wikis and found 99 extant pages:

43 in en.wiktionary.org

Most of the en.wiktionary entries are individual letters in the Deseret and Shavian alphabets (invented alphabets for English; historical curiosities).

I've put this into the en:Wiktionary deletion process. Most of our material is the responsibility of one person [[User:Vladisdead]], who has not been active lately.

For some people, having this stuff available is the ONLY reason they need for using it. ;-)

Timwi

7:34 a.m.

...

There is however a compatibility issue: MySQL's Unicode support is limited to the 16-bit character range (basic multilingual plane), both for ucs2 and utf8 storage modes.

I can't believe you are seriously suggesting to allow this to limit us so badly that we would even have to close down an entire wiki (Gothic) and may possibly not be able to re-open it again in years. The same applies to high characters in article titles on Wikipedias such as Chinese that are actually active and may actually *need* them! If they are using them already, and they are working fine, and now you're taking it away from them, I don't think they are going to like you. I hope for strong opposition to such artificial limitations.

If MySQL's Unicode support is not up to the task, then DON'T USE IT.

Timwi

Tels

7:46 a.m.


Moin,

On Saturday 15 October 2005 14:34, Timwi wrote:

...

...
There is however a compatibility issue: MySQL's Unicode support is limited to the 16-bit character range (basic multilingual plane), both for ucs2 and utf8 storage modes.

I can't believe you are seriously suggesting to allow this to limit us so badly that we would even have to close down an entire wiki (Gothic) and may possibly not be able to re-open it again in years. The same applies to high characters in article titles on Wikipedias such as Chinese that are actually active and may actually *need* them! If they are using them already, and they are working fine, and now you're taking it away from them, I don't think they are going to like you. I hope for strong opposition to such artificial limitations.

If MySQL's Unicode support is not up to the task, then DON'T USE IT.

I have to agree - I thought that we had left the "nobody-needs-more-than-X-characters" age behind us with Unicode. Please do not add artificial constraints again.

Best wishes,

Tels

-- Signed on Sat Oct 15 14:45:03 2005 with key 0x93B84C15. Visit my photo gallery at http://bloodgate.com/photos/ PGP key on http://bloodgate.com/tels.asc or per email.

"Spammed if you do, spammed if you don't." - Murphy's Law


Magnus Manske

8:21 a.m.

Am Samstag, den 15.10.2005, 13:34 +0100 schrieb Timwi:

...

I can't believe you are seriously suggesting to allow this to limit us so badly that we would even have to close down an entire wiki (Gothic) and may possibly not be able to re-open it again in years. The same applies to high characters in article titles on Wikipedias such as Chinese that are actually active and may actually *need* them! If they are using them already, and they are working fine, and now you're taking it away from them, I don't think they are going to like you. I hope for strong opposition to such artificial limitations.

If MySQL's Unicode support is not up to the task, then DON'T USE IT.

Maybe we can use MySQL 5 Unicode support on all *but* Gothic and Chinese? It doesn't seem like, for example, en.wikipedia would be severely limited by the lack of high Unicode chars.

Magnus

Александр Сигачёв

9:17 a.m.

Unicode collation is essential and long-awaited for some wikis.

http://bugzilla.wikimedia.org/show_bug.cgi?id=164

-- Alexander Sigachov meta:user:ajvol

Neil Harris

10 a.m.

Is it that MySQL 5 cannot support characters outside the BMP at all, or just that it can't collate them properly? If it just handles > BMP UTF-8 sequences as binary data, might it simply sort them in Unicode code point order?

Or does it do something worse, and actively convert the Unicode characters into a 16-bit range, thus nuking characters outside the BMP, rather than storing, and largely processing, them as binary-encoded data for purposes other than collating?
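On the code point order question, a small Python sketch (an added illustration; it says nothing about what MySQL itself does) shows why byte-wise comparison of valid UTF-8 would give code point order, while comparing UTF-16 code units would not, because surrogates (0xD800 to 0xDFFF) sort below the BMP range 0xE000 to 0xFFFF:

```python
# Sample characters chosen for illustration: a plane 1 Gothic letter,
# a BMP character above the surrogate range, a Han character, and ASCII.
chars = ["\U00010330", "\uff21", "\u4e2d", "a"]

by_code_point = sorted(chars, key=ord)
by_utf8_bytes = sorted(chars, key=lambda c: c.encode("utf-8"))
by_utf16_units = sorted(chars, key=lambda c: c.encode("utf-16-be"))

# Binary order of UTF-8 matches code point order.
print(by_utf8_bytes == by_code_point)

# Binary order of UTF-16 places the Gothic letter (surrogate pair)
# before U+FF21, so it differs from code point order.
print(by_utf16_units == by_code_point)
print([hex(ord(c)) for c in by_utf16_units])
```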

-- Neil

Brion Vibber

1:55 p.m.

Neil Harris wrote:

...

Is it that MySQL 5 cannot support characters outside the BMP at all, or just that it can't collate them properly? If it just handles > BMP UTF-8 sequences as binary data, might it simply sort them in Unicode code point order?

Or does it do something worse, and actively convert the Unicode characters into a 16-bit range, thus nuking characters outside the BMP, rather than storing, and largely processing, them as binary-encoded data for purposes other than collating?

I tested this yesterday, hence my post. To summarize the results:

Using a literal UTF-8 4-byte character in SQL statement, with connection on 'SET NAMES utf8' mode:
* utf8 column: string is truncated at the problem character
* ucs2 column: "????" is stored in place of problem character
* blob column: works just fine (but no collation)

Using pseudo-UTF-8 with UTF-16 surrogate pair halves individually encoded:
* utf8 column: works, but now we have bad encoding
* ucs2 column: works, but now we have bad encoding
* blob column: works, but now we have bad encoding

They won't be properly collated I'm sure, either.
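For readers unfamiliar with the "pseudo-UTF-8" form, here is a minimal Python sketch of the idea: each character above U+FFFF is split into its UTF-16 surrogate halves, and each half is then encoded as if it were an ordinary 3-byte UTF-8 character. The function names are hypothetical and this is not the code used in the test above:

```python
def encode_surrogate_utf8(text):
    """Encode text so that non-BMP characters become two 3-byte sequences."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Split into UTF-16 surrogate halves.
            cp -= 0x10000
            high = 0xD800 | (cp >> 10)
            low = 0xDC00 | (cp & 0x3FF)
            for half in (high, low):
                # "surrogatepass" lets Python emit the normally illegal bytes.
                out += chr(half).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


def decode_surrogate_utf8(data):
    """Reverse the transformation, rejoining surrogate halves."""
    halves = data.decode("utf-8", "surrogatepass")
    return halves.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


fake = encode_surrogate_utf8("\U00010330")  # GOTHIC LETTER AHSA
print(fake.hex())  # eda080edbcb0: six bytes instead of four

try:
    fake.decode("utf-8")
except UnicodeDecodeError:
    print("a strict UTF-8 decoder rejects this form")

print(decode_surrogate_utf8(fake) == "\U00010330")
```

The rejection by a strict decoder is the "bad encoding" cost noted above: the stored bytes are no longer valid UTF-8 for any consumer that checks.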

In theory we could apply this transformation, but it would add a bunch of unnecessary and unreliable junk to the code. Automatically applying the transformation on all data could badly break binary storage (eg compressed text, the stuff we Really Don't Want To Lose).

If we apply it to page titles only, we might be able to get away with adding the transformation in eg the Title class:

* $title->getText() -> proper UTF-8, with spaces
* $title->getUrl() -> proper UTF-8, with underscores
* $title->getDbKey() -> fake UTF-8, with underscores

This of course means there's a nasssssty database dependency in the database-independent code, and could still break other things.
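The shape of that idea can be sketched in Python as below. This is only an illustration under stated assumptions: the class and method names mirror the three accessors listed above but are hypothetical Python stand-ins, not MediaWiki's actual Title class, and the helper `_fake_utf8` is invented for the example:

```python
import struct


def _fake_utf8(text):
    """Encode each UTF-16 code unit of text as its own UTF-8 sequence."""
    raw = text.encode("utf-16-be")
    units = struct.unpack(">%dH" % (len(raw) // 2), raw)
    # Surrogate halves become individual code points, then 3-byte sequences.
    return "".join(chr(u) for u in units).encode("utf-8", "surrogatepass")


class Title:
    """Hypothetical title wrapper keeping the transformation in one place."""

    def __init__(self, text):
        self._text = text.replace("_", " ")

    def get_text(self):
        # Proper Unicode, with spaces.
        return self._text

    def get_url(self):
        # Proper Unicode, with underscores (no percent-encoding in this sketch).
        return self._text.replace(" ", "_")

    def get_db_key(self):
        # "Fake" UTF-8 bytes, with underscores, for a BMP-only column.
        return _fake_utf8(self.get_url())


title = Title("\U00010330 test")
print(title.get_text() == "\U00010330 test")
print(title.get_url() == "\U00010330_test")
print(title.get_db_key())
```

Only `get_db_key` knows about the storage workaround, which is where the database dependency mentioned above would leak into otherwise database-independent code.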

My preference, if possible, would be to get MySQL to fix their Unicode support to allow for either storage of full UTF-8 or proper transformation of UTF-8 to UTF-16. UCS-2 collation with UTF-16 conversion semantics would be "good enough" for us, I think, and avoids the 4-byte-per-character index bloat of extending the UTF-8 support.

-- brion vibber (brion @ pobox.com)

Brion Vibber

5:52 p.m.

Brion Vibber wrote:

...

My preference, if possible, would be to get MySQL to fix their Unicode support to allow for either storage of full UTF-8 or proper transformation of UTF-8 to UTF-16. UCS-2 collation with UTF-16 conversion semantics would be "good enough" for us, I think, and avoids the 4-byte-per-character index bloat of extending the UTF-8 support.

Bug filed: http://bugs.mysql.com/bug.php?id=14052

-- brion vibber (brion @ pobox.com)

Mark Williamson

4:45 p.m.

Agree wholeheartedly with Timwi.

A small note, though... the proper name isn't "high characters" but rather "plane 1" characters or "supplementary characters" or... there's another more appropriate name for them, but I don't remember it.

It also seems to me that developers in general tend to write off Plane 1 nowadays just as they may have once for Unicode in general.

Osmaniyyah is not _official_, but then again in Somalia "official" doesn't mean much because they don't have a government. The country exists only /de jure/, de facto it is a number of smaller countries, territories under rival warlords, and vast areas of "no man's land". The Somali government exists pretty much only in the context of international organisations.

I made an Osmaniyyah font (MPH 2B Damase), and I got a few e-mails about it from Somalia, and those people certainly sounded like it would be useful in an everyday context.

Also, the BMP is nearly full now. Most _living_ writing systems added to Unicode from here on out will be coded outside of the BMP. This includes, possibly, such still-used scripts as Modi, Meithei Meeyek, CIS2, Paiute, Lisu, Mundari scripts, West African scripts, Tulu...

Mark

On 15/10/05, Timwi timwi@gmx.net wrote:

...

...
There is however a compatibility issue: MySQL's Unicode support is limited to the 16-bit character range (basic multilingual plane), both for ucs2 and utf8 storage modes.

I can't believe you are seriously suggesting to allow this to limit us so badly that we would even have to close down an entire wiki (Gothic) and may possibly not be able to re-open it again in years. The same applies to high characters in article titles on Wikipedias such as Chinese that are actually active and may actually *need* them! If they are using them already, and they are working fine, and now you're taking it away from them, I don't think they are going to like you. I hope for strong opposition to such artificial limitations.

If MySQL's Unicode support is not up to the task, then DON'T USE IT.

Timwi

Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l

-- SI HOC LEGERE SCIS NIMIVM ERVDITIONIS HABES QVANTVM MATERIAE MATERIETVR MARMOTA MONAX SI MARMOTA MONAX MATERIAM POSSIT MATERIARI ESTNE VOLVMEN IN TOGA AN SOLVM TIBI LIBET ME VIDERE

Brion Vibber

5:52 p.m.

Mark Williamson wrote:

...

Agree wholeheartedly with Timwi.

A small note, though... the proper name isn't "high characters" but rather "plane 1" characters or "supplementary characters" or... there's another more appropriate name for them, but I don't remember it.

"Plane 1" characters would refer only to plane 1, and not to plane 2, 3, 4, 5... 15. :)

-- brion vibber (brion @ pobox.com)

Jeremy Dunck

2:03 p.m.

On 10/14/05, Brion Vibber brion@pobox.com wrote:

...

There is however a compatibility issue: MySQL's Unicode support is limited to the 16-bit character range (basic multilingual plane), both for ucs2 and utf8 storage modes.

...

It would be relatively simple to disable use of titles and usernames with these high characters; to assess possible impact I did a check through all our current wikis and found 99 extant pages:

...

While I understand the reluctance of others to settle for content loss or missing collation, I just wanted to thank you for doing such a thorough analysis of the situation.


wikitech-l@lists.wikimedia.org
