A lot of the character-encoding stuff in the present code is a mess too. I understand well and can handle all the details between server and browser, but two things I don't know all the quirks of are PHP and MySQL, so this is my attempt to pick the brains of those who have already found those problems:
(1) Is MySQL 8-bit clean? If I store a chunk of 8-bit bytes in a text field, will I get them back unmolested, or will MySQL try to be "helpful" and fuck them up? If the latter, what are the limitations of what can be stored in a text field and where is that documented?
(2) Are PHP strings 8-bit clean? I'd be amazed if they weren't, considering how much of PHP is modelled on Perl.
(3) Is the PHP on wikipedia.com compiled with the "iconv" library (an optional thing), and does PHP use it as documented?
On Wed, 2002-05-08 at 20:33, Lee Daniel Crocker wrote:
A lot of the character-encoding stuff in the present code is a mess too.
I appreciate your polite understatement.
I understand well and can handle all the details between server and browser, but two things I don't know all the quirks of are PHP and MySQL, so this is my attempt to pick the brains of those who have already found those problems:
(1) Is MySQL 8-bit clean? If I store a chunk of 8-bit bytes in a text field, will I get them back unmolested, or will MySQL try to be "helpful" and fuck them up? If the latter, what are the limitations of what can be stored in a text field and where is that documented?
As far as I know, yes (the former): you get the bytes back unmolested. MySQL has no direct support for UTF-8, but it does have explicit support for a number of single- and multibyte encodings, so one would expect it to be generally 8-bit clean, and implementations that stick UTF-8 strings into MySQL using the default ISO-8859-1 setting abound. There are limitations, however: MySQL doesn't know about proper case matching or accent folding, for instance, which would be nice for searching. (This is get-aroundable if we do our own case/accent-folding when saving a page and store the result in a separate field just for searches, which has been discussed from time to time.)
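To make the two ideas above concrete, here is a rough sketch in Python rather than PHP, purely as an illustration. The function names are hypothetical, and nothing here talks to MySQL: the first function only models why UTF-8 bytes can pass through an ISO-8859-1 interpretation unchanged (every one of the 256 byte values maps to a Latin-1 character and back), and the second shows one possible way to build a case/accent-folded copy of a page for a separate search field. The folding rule (decompose, drop combining marks, case-fold) is an assumption, not something the current code does.

```python
import unicodedata


def survives_latin1_column(text: str) -> bool:
    """Model a UTF-8 string being stored and read back as ISO-8859-1.

    Hypothetical helper: it only imitates the charset interpretation,
    not any real database behaviour.
    """
    raw = text.encode("utf-8")             # bytes the wiki would send
    as_latin1 = raw.decode("latin-1")      # what a Latin-1 field "sees"
    return as_latin1.encode("latin-1") == raw  # bytes read back


def fold_for_search(text: str) -> str:
    """Return a case- and accent-folded copy for a search-only field.

    Assumed folding rule: NFKD-decompose, drop combining marks,
    then apply Unicode case folding.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


if __name__ == "__main__":
    print(survives_latin1_column("Ĉu ŝi ŝatas ĉokoladon?"))
    print(fold_for_search("Crème Brûlée"))
```

The folded string would be written alongside the original page text whenever a page is saved, and search queries would be folded the same way before matching, so the original text stays untouched.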
(2) Are PHP strings 8-bit clean? I'd be amazed if they weren't, considering how much of PHP is modelled on Perl.
Claims to be.
(3) Is the PHP on wikipedia.com compiled with the "iconv" library (an optional thing),
I don't believe it is. It really should be.
and does PHP use it as documented?
That's a good question!
-- brion vibber (brion @ pobox.com)
wikitech-l@lists.wikimedia.org