Use binary to represent varchar rather than UTF8

List overview All Threads
Download

newer

older

Re: [Wikitech-l] Wikitext isn't a...

Re: [Wikitech-l] [Foundation-l]...

howard chen

9 Jan 2007 9 Jan '07

5:14 p.m.

in all table creation statments, such as

CREATE TABLE `categorylinks` ( `cl_from` int(8) unsigned NOT NULL default '0', `cl_to` varchar(255) binary NOT NULL default '', `cl_sortkey` varchar(255) binary NOT NULL default '', `cl_timestamp` timestamp(14) NOT NULL, UNIQUE KEY `cl_from` (`cl_from`,`cl_to`), KEY `cl_sortkey` (`cl_to`,`cl_sortkey`(128)), KEY `cl_timestamp` (`cl_to`,`cl_timestamp`) ) TYPE=InnoDB;

why use binary to represent varchar, not UTF8?

Show replies by date

Ashar Voultoiz

9 Jan 9 Jan

6:45 p.m.

howard chen wrote:

...

in all table creation statments, such as

CREATE TABLE `categorylinks` ( `cl_from` int(8) unsigned NOT NULL default '0', `cl_to` varchar(255) binary NOT NULL default '', `cl_sortkey` varchar(255) binary NOT NULL default '', `cl_timestamp` timestamp(14) NOT NULL, UNIQUE KEY `cl_from` (`cl_from`,`cl_to`), KEY `cl_sortkey` (`cl_to`,`cl_sortkey`(128)), KEY `cl_timestamp` (`cl_to`,`cl_timestamp`) ) TYPE=InnoDB;

why use binary to represent varchar, not UTF8?

Brion answered that earlier:

http://lists.wikimedia.org/pipermail/mediawiki-l/2006-February/010267.html

To quote him, utf8 in database will mostly:

* Make indexes larger (3 bytes per character) * Cause failures if you use characters outside the BOM in page titles, usernames, etc.

cheers,

-- Ashar Voultoiz - WP++++ http://en.wikipedia.org/wiki/User:Hashar http://www.livejournal.com/community/wikitech/ IM: hashar@jabber.org ICQ: 15325080

howard chen

10 Jan 10 Jan

2:47 a.m.

On 1/10/07, Ashar Voultoiz hashar@altern.org wrote:

...

howard chen wrote:

...
in all table creation statments, such as

CREATE TABLE `categorylinks` ( `cl_from` int(8) unsigned NOT NULL default '0', `cl_to` varchar(255) binary NOT NULL default '', `cl_sortkey` varchar(255) binary NOT NULL default '', `cl_timestamp` timestamp(14) NOT NULL, UNIQUE KEY `cl_from` (`cl_from`,`cl_to`), KEY `cl_sortkey` (`cl_to`,`cl_sortkey`(128)), KEY `cl_timestamp` (`cl_to`,`cl_timestamp`) ) TYPE=InnoDB;

why use binary to represent varchar, not UTF8?

Brion answered that earlier:

http://lists.wikimedia.org/pipermail/mediawiki-l/2006-February/010267.html

To quote him, utf8 in database will mostly:

Make indexes larger (3 bytes per character)

Cause failures if you use characters outside the BOM in page titles,

usernames, etc.

cheers,

-- Ashar Voultoiz - WP++++ http://en.wikipedia.org/wiki/User:Hashar http://www.livejournal.com/community/wikitech/ IM: hashar@jabber.org ICQ: 15325080

Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l

i would like to know if `wikipedia` is going to use this method in the future?

Andrew Dunbar

4:31 a.m.

On 1/10/07, howard chen howachen@gmail.com wrote:

...

On 1/10/07, Ashar Voultoiz hashar@altern.org wrote:

...
howard chen wrote:

...
in all table creation statments, such as

CREATE TABLE `categorylinks` ( `cl_from` int(8) unsigned NOT NULL default '0', `cl_to` varchar(255) binary NOT NULL default '', `cl_sortkey` varchar(255) binary NOT NULL default '', `cl_timestamp` timestamp(14) NOT NULL, UNIQUE KEY `cl_from` (`cl_from`,`cl_to`), KEY `cl_sortkey` (`cl_to`,`cl_sortkey`(128)), KEY `cl_timestamp` (`cl_to`,`cl_timestamp`) ) TYPE=InnoDB;

why use binary to represent varchar, not UTF8?

Brion answered that earlier:

http://lists.wikimedia.org/pipermail/mediawiki-l/2006-February/010267.html

To quote him, utf8 in database will mostly:

Make indexes larger (3 bytes per character)

Cause failures if you use characters outside the BOM in page titles,

usernames, etc.

cheers,

-- Ashar Voultoiz - WP++++ http://en.wikipedia.org/wiki/User:Hashar http://www.livejournal.com/community/wikitech/ IM: hashar@jabber.org ICQ: 15325080

Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l

i would like to know if `wikipedia` is going to use this method in the future?

I'm sure mediawiki will make use of proper utf-8 support when mysql has proper utf-8 support. Our users want to use Unicode characters beyond plane 1 and the current mysql utf-8 support does not make this possible.

Andrew Dunbar (hippietrail)

...

Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l

-- http://linguaphile.sf.net

Lars Aronsson

8:59 a.m.

Andrew Dunbar wrote:

...

I'm sure mediawiki will make use of proper utf-8 support when mysql has proper utf-8 support. Our users want to use Unicode characters beyond plane 1 and the current mysql utf-8 support does not make this possible.

In addition to going beyond plane 1, is there any established practice for how to do Unicode "properly"?

Sorry if this is off topic, since Wikipedia is already doing UTF-8. I have a couple of other projects in Latin-1 that need to be converted to UTF-8. Is the time ripe? It seems that Perl 5.8 and MySQL 5 have far better Unicode support than earlier versions. But is it still a bit early? I'm sending my mail in UTF-8 with Pine 4.64, and I know some people have problems with this.

-- Lars Aronsson (lars@aronsson.se) Aronsson Datateknik - http://aronsson.se

howard chen

10:26 a.m.

On 1/10/07, Andrew Dunbar hippytrail@gmail.com wrote:

...

On 1/10/07, howard chen howachen@gmail.com wrote:

...
On 1/10/07, Ashar Voultoiz hashar@altern.org wrote:

...
howard chen wrote:

...
in all table creation statments, such as

CREATE TABLE `categorylinks` ( `cl_from` int(8) unsigned NOT NULL default '0', `cl_to` varchar(255) binary NOT NULL default '', `cl_sortkey` varchar(255) binary NOT NULL default '', `cl_timestamp` timestamp(14) NOT NULL, UNIQUE KEY `cl_from` (`cl_from`,`cl_to`), KEY `cl_sortkey` (`cl_to`,`cl_sortkey`(128)), KEY `cl_timestamp` (`cl_to`,`cl_timestamp`) ) TYPE=InnoDB;

why use binary to represent varchar, not UTF8?

Brion answered that earlier:

http://lists.wikimedia.org/pipermail/mediawiki-l/2006-February/010267.html

To quote him, utf8 in database will mostly:

Make indexes larger (3 bytes per character)

Cause failures if you use characters outside the BOM in page titles,

usernames, etc.

cheers,

-- Ashar Voultoiz - WP++++ http://en.wikipedia.org/wiki/User:Hashar http://www.livejournal.com/community/wikitech/ IM: hashar@jabber.org ICQ: 15325080

Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l

i would like to know if `wikipedia` is going to use this method in the future?

I'm sure mediawiki will make use of proper utf-8 support when mysql has proper utf-8 support. Our users want to use Unicode characters beyond plane 1 and the current mysql utf-8 support does not make this possible.

Andrew Dunbar (hippietrail)

...

Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l

-- http://linguaphile.sf.net

Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l

besides fully UTF8 support, using binary will have the advantages of saving space for indexes, so will it also alter the decision?

Gregory Maxwell

2:45 p.m.

On 1/10/07, howard chen howachen@gmail.com wrote:

...

besides fully UTF8 support, using binary will have the advantages of saving space for indexes, so will it also alter the decision?

It is possible that when MySQL gets unicode support beyond UCS-16 it will come in the form of UTF-8, which is what we currently use packed into those binary fields.

This isn't that all unlikely. Supporting non-BMP characters without dealing with variable length characters requires UCS-32, and I think that even MySQL users would balk at another needless 2x increase in the size of their ASCII data. Since the effort required to work with UTF-16 (i.e. the two byte variable length encoding) is similar to UTF-8, it may make sense to go all the way to UTF-8 and get the space savings for ASCII.

It's also possible that by the time MySQL has support for non-BMP characters, it may have support for functional indexes, allowing the index to operate on a different datatype than the row... (although a straightforward type conversion would kill one of the primary advantages of using a real string type: collation which isn't total nonsense)

(and please trim your replies)

Brion Vibber

8:30 a.m.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

howard chen wrote:

...

i would like to know if `wikipedia` is going to use this method in the future?

If MySQL supported UTF-8, we'd be happy to make use of it. Using proper character sets gives us warm, fuzzy feelings and makes it easier to work in terminals and other direct-database tools, as well as potentially making it easier to use built-in database support for case-insensitive lookups and proper sorting.

BUT... at the moment MySQL only supports a subset of UTF-8 which corresponds to UCS-2 (limited to the lower 16 bits of Unicode's encoding space).

Thus characters from a number of scripts encoded outside of that range cannot be represented without resorting to storing raw UTF-8 in binary fields. Since we already have data outside that range, we don't plan to reduce our functionality to shoehorn it into a broken UTF-8 implementation.

(There is an experimental MediaWiki mode for using UTF-8 collation, but since the functionality in MySQL is incomplete it doesn't fully work. It's also not properly integrated with the updaters, so an experimental database in this mode may not properly update on version upgrades.)

We have expressed our interest in full UTF-8 support (or UTF-16 with proper conversion should do fine!) to MySQL, but as far as I know it's still not on the roadmap as of 5.2. Maybe some more lobbying is in order. ;)

Since the forseeable future does not include full UTF-8 support in MySQL, when we upgrade to 5.0 or 5.1 we expect to continue using binary encoding.

We do though plan to 'formalize' that a bit more, with proper binary charset/collation labeling on the fields, rather than the ad-hoc method we've used since 4.0. The experimental 'binary' schema on MediaWiki 1.9 or later can be tested to play with this, but be warned it's even more experimental than the UTF-8 one at the moment.

- -- brion vibber (brion @ pobox.com)

...PGP SIGNATURE...

-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (Darwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFFpKQUwRnhpk1wk44RAr84AKCLte+paYICxJqwkZIVYVzEOx9AJwCgk1ao 0oWl2P0ac4rgPUwCMOU5y4c= =1Msn -----END PGP SIGNATURE-----

Edward Z. Yang

15 Jan 15 Jan

10:51 p.m.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

Brion Vibber wrote:

...

If MySQL supported UTF-8, we'd be happy to make use of it. Using proper character sets gives us warm, fuzzy feelings and makes it easier to work in terminals and other direct-database tools, as well as potentially making it easier to use built-in database support for case-insensitive lookups and proper sorting. [snip]

On a related tack, why isn't MediaWiki using MySQL 5 yet? (I remember there was a very good reason, but I can't find it now)

...PGP SIGNATURE...

-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.1 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFFrAVuqTO+fYacSNoRAsUjAJ93w5o3rBzH1+P76Uw0QQlt2OU78wCgiQYq wfMDBZbrp52Dz7Nz4mTOu0Q= =4q0c -----END PGP SIGNATURE-----

Brion Vibber

11:35 p.m.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

Edward Z. Yang wrote:

...

Brion Vibber wrote:

...
If MySQL supported UTF-8, we'd be happy to make use of it. Using proper character sets gives us warm, fuzzy feelings and makes it easier to work in terminals and other direct-database tools, as well as potentially making it easier to use built-in database support for case-insensitive lookups and proper sorting. [snip]

On a related tack, why isn't MediaWiki using MySQL 5 yet? (I remember there was a very good reason, but I can't find it now)

1) The biggest reason we'd want to upgrade would be to make use of native unicode support, but it doesn't work ;)

There are a few other niceties in 5, but not a lot of incentive to upgrade considering that...

2) To upgrade cleanly without using the (broken) unicode support, we'll want to finish working out and testing a proper binary schema and a migration to it.

The good news is that we've more or less got the hang of doing clean schema upgrades using replication and switching masters around. So once we get the schema details worked out, we *should* be able to do the upgrade with minimal downtime.

In theory. :)

- -- brion vibber (brion @ pobox.com / brion @ wikimedia.org)

...PGP SIGNATURE...

-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (Darwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFFrA/PwRnhpk1wk44RAnBFAKCloQv6n8PvHebk3vbnhd5gJKnBVgCg0drL x56ggVGtQA49T0GvFST0s2M= =iDVS -----END PGP SIGNATURE-----

Andrew Dunbar

17 Jan 17 Jan

4:46 a.m.

On 1/16/07, Brion Vibber brion@pobox.com wrote:

...

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

Edward Z. Yang wrote:

...
Brion Vibber wrote:

...
If MySQL supported UTF-8, we'd be happy to make use of it. Using proper character sets gives us warm, fuzzy feelings and makes it easier to work in terminals and other direct-database tools, as well as potentially making it easier to use built-in database support for case-insensitive lookups and proper sorting. [snip]

On a related tack, why isn't MediaWiki using MySQL 5 yet? (I remember there was a very good reason, but I can't find it now)

The biggest reason we'd want to upgrade would be to make use of

native unicode support, but it doesn't work ;)

There are a few other niceties in 5, but not a lot of incentive to upgrade considering that...

To upgrade cleanly without using the (broken) unicode support, we'll

want to finish working out and testing a proper binary schema and a migration to it.

The good news is that we've more or less got the hang of doing clean schema upgrades using replication and switching masters around. So once we get the schema details worked out, we *should* be able to do the upgrade with minimal downtime.

In theory. :)

1) Does mysql support UCS-2, UTF-8, or UCS-4 beyond the BMP? If so we can still use UTF-8 internally and rely on mysql's encoding conversion for the rest.

2) Has a bug report or feature request been filed on mysql's site? I'm fumbling through now trying to find one.

Andrew Dunbar (hippietrail)

...

-- brion vibber (brion @ pobox.com / brion @ wikimedia.org)

-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (Darwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFrA/PwRnhpk1wk44RAnBFAKCloQv6n8PvHebk3vbnhd5gJKnBVgCg0drL x56ggVGtQA49T0GvFST0s2M= =iDVS -----END PGP SIGNATURE-----

Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l

-- http://linguaphile.sf.net

Andrew Dunbar

5:29 a.m.

On 1/17/07, Andrew Dunbar hippytrail@gmail.com wrote:

...

On 1/16/07, Brion Vibber brion@pobox.com wrote:

...
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

Edward Z. Yang wrote:

...
Brion Vibber wrote:

...
If MySQL supported UTF-8, we'd be happy to make use of it. Using proper character sets gives us warm, fuzzy feelings and makes it easier to work in terminals and other direct-database tools, as well as potentially making it easier to use built-in database support for case-insensitive lookups and proper sorting. [snip]

On a related tack, why isn't MediaWiki using MySQL 5 yet? (I remember there was a very good reason, but I can't find it now)

The biggest reason we'd want to upgrade would be to make use of

native unicode support, but it doesn't work ;)

There are a few other niceties in 5, but not a lot of incentive to upgrade considering that...

To upgrade cleanly without using the (broken) unicode support, we'll

want to finish working out and testing a proper binary schema and a migration to it.

The good news is that we've more or less got the hang of doing clean schema upgrades using replication and switching masters around. So once we get the schema details worked out, we *should* be able to do the upgrade with minimal downtime.

In theory. :)

Does mysql support UCS-2, UTF-8, or UCS-4 beyond the BMP? If so we

can still use UTF-8 internally and rely on mysql's encoding conversion for the rest.

Has a bug report or feature request been filed on mysql's site? I'm

fumbling through now trying to find one.

I couldn't find anything so I've gone ahead and filed this:

http://bugs.mysql.com/bug.php?id=25666

...

Andrew Dunbar (hippietrail)

...

-- brion vibber (brion @ pobox.com / brion @ wikimedia.org)

-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (Darwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFrA/PwRnhpk1wk44RAnBFAKCloQv6n8PvHebk3vbnhd5gJKnBVgCg0drL x56ggVGtQA49T0GvFST0s2M= =iDVS -----END PGP SIGNATURE-----

Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l

-- http://linguaphile.sf.net

-- http://linguaphile.sf.net

Brion Vibber

7:55 a.m.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

Andrew Dunbar wrote:

...

Does mysql support UCS-2, UTF-8, or UCS-4 beyond the BMP?

UCS-2 does not include anything beyond the BMP, by definition.

(Further, UCS-2 is supported only for storage; connections to the server must use another charset such as UTF-8, and the conversion on the server end is indeed limited to UCS-2. While it is possible as of my testing last year to store UTF-16 data in UCS-2 fields by communicating with the server using invalid pseudo-UTF-8 which encodes the UTF-16 surrogate pair halves as separate UTF-8 characters, this would be very fragile and likely to lead to bugs, interoperability problems, and data corruption.)

UTF-8 support in MySQL is limited to 3-byte characters, which corresponds to the BMP.

UCS-4 is not supported at all.

- -- brion vibber (brion @ pobox.com)

...PGP SIGNATURE...

-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (Darwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFFrdZvwRnhpk1wk44RAriBAJ0RV54gepEz9+Wx2GRyvRZrz+SVnQCeLDem jtMObBzFrtZIeKfQQorT9yc= =vAh6 -----END PGP SIGNATURE-----

Petr Kadlec

19 Jan 19 Jan

10:24 a.m.

Hi!

On 17/01/07, Brion Vibber brion@pobox.com wrote:

...

[...] While it is possible as of my testing last year to store UTF-16 data in UCS-2 fields by communicating with the server using invalid pseudo-UTF-8 which encodes the UTF-16 surrogate pair halves as separate UTF-8 characters, this would be very fragile and likely to lead to bugs, interoperability problems, and data corruption.) [...]

I believe this CESU-8 thing has been discussed before a bit (see http://lists.wikimedia.org/pipermail/wikitech-l/2005-October/019766.html and http://lists.wikimedia.org/pipermail/wikitech-l/2005-October/019611.html), including filing a MySQL bug (see http://bugs.mysql.com/bug.php?id=14052, no comments since, it seems).

-- [[cs:User:Mormegil | Petr Kadlec]]

6520

Age (days ago)

6530

Last active (days ago)

wikitech-l@lists.wikimedia.org

13 comments

8 participants

tags (0)

participants (8)

Andrew Dunbar
Ashar Voultoiz
Brion Vibber
Edward Z. Yang
Gregory Maxwell
howard chen
Lars Aronsson
Petr Kadlec