On Mon, Jun 9, 2014 at 8:00 PM, Sean Pringle springle@wikimedia.org wrote:
On Tue, Jun 10, 2014 at 1:04 AM, Nuria Ruiz nuria@wikimedia.org wrote:
Just to narrow this down a little further from the DB server-side: the eventlogging tables do use utf-8, so the fix probably doesn't require laborious schema changes (if that's what you meant by changing database types).

To follow the structure on MediaWiki I think the easiest is to change db types from varchar to varbinary where utf-8 is being used. Please let us know if you do not think it is appropriate.
Ah, so long-term ecosystem consistency is also an aim. Sounds wise. I was only commenting in case it could make the current Python encoding fix easier and faster.
Were it a new system without ties to MW I'd push for solving character set issues properly with something like utf8mb4, depending on how you want to read/sort the data, but without that luxury varbinary is fine.
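To make the utf8 versus utf8mb4 distinction concrete, here is a rough Python sketch. It relies on the fact that MySQL's 'utf8' charset stores at most 3 bytes per character, while characters above the Base Multilingual Plane need 4 bytes in UTF-8. The helper names utf8_width and fits_mysql_utf8 are made up for illustration and are not part of EventLogging.

```python
# Illustrative helpers only; names are hypothetical, not EventLogging code.

def utf8_width(ch: str) -> int:
    """Number of bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_mysql_utf8(text: str) -> bool:
    """True if every character fits MySQL's 3-byte 'utf8' charset.

    Characters above the Base Multilingual Plane need 4 bytes and
    therefore require 'utf8mb4' (or a binary column type).
    """
    return all(utf8_width(ch) <= 3 for ch in text)


if __name__ == "__main__":
    for label, ch in [("ASCII", "a"), ("BMP", "\u20ac"), ("above BMP", "\U0001F600")]:
        print(label, hex(ord(ch)), utf8_width(ch), "bytes")
```

A varbinary column sidesteps the question because it stores the encoded bytes as-is, at the cost of leaving character-aware sorting and comparison to the application.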
commit 9cff78b7c6a9516611cfd055906fd0707c4d5b88
Author: Ori Livneh ori@wikimedia.org
Date: Sun Apr 28 14:46:28 2013 -0700
Default MariaDB character encoding for EL data: utf8 -> utf8mb4
This change sets the default character encoding for MySQL / MariaDB EventLogging data to 'utf8mb4' (was: 'utf8'), adding support for characters above the Base Multilingual Plane. Deployment will require manual migration of existing data in the database.
One of the consequences of this migration is that the previous default size for string columns is no longer appropriate, since the columns it generates are not indexable by InnoDB, which will not index columns beyond 767 bytes. This change therefore amends the default size to be 191, which is the maximum size a utf8mb4 string column can be and still remain indexable.
Finally, as a way of not being blocked on deployment of I8fdcc046d, this change adds a live hack that substitutes 'utf8mb4' for 'utf8' in database connection strings. The hack can be removed once I8fdcc046d is deployed.
FIXME: Database setup instructions and minimum requirements should be documented.
Change-Id: Ia94f2c2155de5fb9031a8164306720e06455cced
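For reference, the 191 figure in the commit message above follows from the 767-byte index limit it cites, divided by the worst-case bytes per character. A minimal Python sketch of that arithmetic (the constant and function names are mine, chosen for illustration):

```python
# Sketch of the arithmetic only; names are hypothetical.
INNODB_MAX_INDEX_BYTES = 767  # index limit cited in the commit message


def max_indexable_chars(bytes_per_char: int,
                        limit: int = INNODB_MAX_INDEX_BYTES) -> int:
    """Largest column length (in characters) whose worst-case byte size
    still fits within the index limit."""
    return limit // bytes_per_char


if __name__ == "__main__":
    print("utf8 (3 bytes/char):", max_indexable_chars(3))
    print("utf8mb4 (4 bytes/char):", max_indexable_chars(4))
```

With 3 bytes per character the result is 255, and with 4 bytes per character it is 191, which matches the new default size.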
commit 041cb2c34c540dfea05886368edc5d6209102aed
Author: Ori Livneh ori@wikimedia.org
Date: Sun Apr 28 15:13:26 2013 -0700
...and back to utf8 as default charset
The version of MySQLdb that is packaged for Precise does not know about utf8mb4. I (inexcusably) tested against the dev branch of MySQLdb.
Keeping the 191 limit to ease migration in the future.
Change-Id: I807e1d3a6f192b13e36811af376806d6a92e122d