On Wed, 2002-05-08 at 20:33, Lee Daniel Crocker wrote:
A lot of the character-encoding stuff in the present code is a mess too.
I appreciate your polite understatement.
I understand well and can handle all the details between server and
browser, but two things I don't know all the quirks of are PHP and
MySQL, so this is my attempt to pick the brains of those who have
already found those problems:
(1) Is MySQL 8-bit clean? If I store a chunk of 8-bit bytes in a
text field, will I get them back unmolested, or will MySQL try to be
"helpful" and fuck them up? If the latter, what are the limitations
of what can be stored in a text field and where is that documented?
As far as I know, yes (the former). MySQL has no direct support for
UTF-8, but it does have explicit support for a number of single-byte and
multibyte encodings, so one would expect it to be generally 8-bit clean,
and implementations of sticking UTF-8 strings into MySQL using the
default ISO-8859-1 setting abound. However, there are limitations --
MySQL doesn't know about proper case matching or accent folding, for
instance, which would be nice for searching. (This is get-aroundable if
we do our own case/accent-folding when saving a page and storing it in a
separate field just for searches, which has been discussed from time to
time.)
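A rough sketch of what that save-time folding could look like, in Python
rather than PHP purely for illustration. The function name
fold_for_search is hypothetical, and this assumes the page text is
already decoded to Unicode before folding; it is not existing code.

```python
import unicodedata


def fold_for_search(text: str) -> str:
    """Return a case- and accent-folded copy of text for a search field.

    Hypothetical helper: the result would be stored alongside the
    original page text and used only for matching, never for display.
    """
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep the base letters.
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # casefold() is a stronger lower() intended for caseless matching.
    return stripped.casefold()
```

A search term would be run through the same function before being
compared against the folded field, so that "Café" and "cafe" match.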
(2) Are PHP strings 8-bit clean? I'd be amazed if they weren't,
considering how much of PHP is modelled on Perl.
Claims to be.
(3) Is the PHP on wikipedia.com compiled with the "iconv" library
(an optional thing),
I don't believe it is. It really should be.
and does PHP use it as documented?
That's a good question!
-- brion vibber (brion @ pobox.com)