I don't want to defend MySQL's development decisions (in fact, PHP made
some similarly bad ones), and it would be unfair to judge them too harshly
with the "power of hindsight" [0], but... /pedantic on
On Thu, Oct 14, 2021 at 7:37 PM Roy Smith <roy(a)panix.com> wrote:
What part of "universal" did they not
understand?
... several years ago, around the end of the last century and the start of
this one, hardly anyone used UTF-8 [1], and PHP didn't even support
multi-byte strings. The original spec for UTF-8 called for up to 6 bytes
per character [2]. The BMP, however, which needs at most 3 bytes,
contained characters for most modern languages [3]. Supporting all 6 bytes
would have been a waste of space and performance, because at the time MySQL
worked much faster with fixed-width columns, and those would have had to
reserve 6 bytes per character instead of 3 (double!).
My guess is that someone said "this is probably good enough". Would it be
too outrageous to think that we might not need as many extra characters as
there are stars in our galaxy, when fewer than 65K were practically needed?
3 things changed after that:
* In 2003, UTF-8 was limited to encoding 21 bits [4], requiring at most
4 bytes per character, only one more than MySQL's utf8
* Apple wanted to sell iPhones in Japan, so emoji (many of them outside the
BMP) were added to Unicode in 2010, and they subsequently became popular
* MySQL/InnoDB has been highly optimized for the fast handling of
variable-length strings
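
To make the byte counts above concrete, here is a minimal Python sketch. It
only uses Python's built-in UTF-8 encoder; the helper names (utf8_len,
fits_in_three_bytes) are made up for illustration and are not part of MySQL
or any library. It shows that characters in the BMP need at most 3 bytes,
while an emoji outside the BMP needs 4, which is exactly what a 3-byte
"utf8" cannot store.

```python
# Illustrative helpers only (hypothetical names, not a MySQL API).

def utf8_len(ch: str) -> int:
    """Return how many bytes one code point needs when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_three_bytes(text: str) -> bool:
    """True if every code point in text needs at most 3 UTF-8 bytes,
    i.e. the text would fit a 3-byte-per-character encoding limit."""
    return all(utf8_len(ch) <= 3 for ch in text)


# One sample per UTF-8 length: ASCII, Latin-1 range, CJK (BMP), emoji (non-BMP).
for ch in ["a", "\u00e9", "\u65e5", "\U0001F600"]:
    print(f"U+{ord(ch):04X} needs {utf8_len(ch)} byte(s)")

print(fits_in_three_bytes("\u65e5\u672c\u8a9e"))  # BMP-only text
print(fits_in_three_bytes("hi \U0001F600"))       # contains a non-BMP emoji
```

The last two lines print True and False respectively: the BMP-only text
fits within 3 bytes per character, the one with the emoji does not.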
However, you cannot just arbitrarily break backwards compatibility and
redefine the meaning of a configuration option, especially with storage
software that has continuously supported incremental upgrades for as long
as I can remember. You can only support the new standard, encourage its
usage, make it the default, etc.
This is a bit off-topic here (feel free to PM me to continue the
conversation), and just to be clear, I am _not fully justifying the
decisions_, just giving historical context. Still, I want to end with some
lessons relevant to the list:
* It is very difficult to build future-proof applications. PHP, MySQL and
MediaWiki all have a long history, and we should be gentle when we judge
them from the future. In my work, which involves backups, storing data
unchanged for over 5 years is sometimes challenging, because encryption
algorithms are found to be weak, or end up being unsupported/unavailable
within just 2 releases of the operating system!
* Standards also change; they are not as "universal" as we may want to
believe (there have been 32 extra Unicode versions since 1991). I expect
that new collations, not implemented today, will be needed in the future,
too.
* It is ok to make "mistakes", as long as we learn from them and improve
upon them :-)
Sorry for the text block.
[0] <url:https://powerlisting.fandom.com/wiki/Hindsight>
[1] <url:https://commons.wikimedia.org/wiki/File:Utf8webgrowth.svg>
[2] <url:https://www.rfc-editor.org/rfc/rfc2279>
[3] <url:https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane>
[4] <url:https://www.rfc-editor.org/rfc/rfc3629>