On Mon, Jun 9, 2014 at 8:00 PM, Sean Pringle <springle@wikimedia.org> wrote:
On Tue, Jun 10, 2014 at 1:04 AM, Nuria Ruiz <nuria@wikimedia.org> wrote:
>Just to narrow this down a little further from the DB server-side: the eventlogging tables do use utf-8, so the fix probably doesn't require laborious schema changes (if that's what you meant by changing database types).
To follow the structure used in MediaWiki, I think the easiest approach is to change the DB column types from varchar to varbinary wherever utf-8 is being used. Please let us know if you think that is not appropriate.

Ah, so long-term ecosystem consistency is also an aim. Sounds wise. I was only commenting in case it could make the current python encoding fix easier and faster.

Were it a new system without ties to MW I'd push for solving character set issues properly with something like utf8mb4, depending on how you want to read/sort the data, but without that luxury varbinary is fine.
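
For anyone following along, here is a rough Python 3 sketch of why MySQL's 3-byte 'utf8' is the sticking point. The helper names (utf8_len, fits_mysql_utf8) are made up for illustration and are not part of EventLogging; the only assumption about MySQL is that 'utf8' stores at most three bytes per character while 'utf8mb4' allows four.

```python
# Illustrative sketch only; these helpers are hypothetical, not EventLogging code.
# Assumption: MySQL 'utf8' stores at most 3 bytes per character, so it can only
# hold code points inside the Base Multilingual Plane (up to U+FFFF).

def utf8_len(ch):
    """Return the number of bytes in the UTF-8 encoding of one character."""
    return len(ch.encode("utf-8"))


def fits_mysql_utf8(text):
    """Return True if every character in text lies within the BMP."""
    return all(ord(ch) <= 0xFFFF for ch in text)


if __name__ == "__main__":
    for ch in ["a", "\u00e9", "\u20ac", "\U0001F600"]:
        print("U+%04X needs %d byte(s)" % (ord(ch), utf8_len(ch)))
```

A varbinary column sidesteps the question because it stores the encoded bytes as-is, at the cost of byte-wise rather than collation-aware comparison and sorting.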


commit 9cff78b7c6a9516611cfd055906fd0707c4d5b88
Author: Ori Livneh <ori@wikimedia.org>
Date:   Sun Apr 28 14:46:28 2013 -0700

    Default MariaDB character encoding for EL data: utf8 -> utf8mb4

    This change sets the default character encoding for MySQL / MariaDB
    EventLogging data to 'utf8mb4' (was: 'utf8'), adding support for characters
    above the Base Multilingual Plane. Deployment will require manual migration of
    existing data in the database.

    One of the consequences of this migration is that the previous default size for
    string columns is no longer appropriate, since the columns it generates are
    not indexable by InnoDB, which will not index columns beyond 767 bytes. This
    change therefore amends the default size to be 191, which is the maximum size a
    utf8mb4 string column can be and still remain indexable.

    Finally, as a way of not being blocked on deployment of I8fdcc046d, this change
    adds a live hack that substitutes 'utf8mb4' for 'utf8' in database connection
    strings. The hack can be removed once I8fdcc046d is deployed.

    FIXME: Database setup instructions and minimum requirements should be
    documented.

    Change-Id: Ia94f2c2155de5fb9031a8164306720e06455cced

commit 041cb2c34c540dfea05886368edc5d6209102aed
Author: Ori Livneh <ori@wikimedia.org>
Date:   Sun Apr 28 15:13:26 2013 -0700

    ...and back to utf8 as default charset

    The version of MySQLdb that is packaged for Precise does not know about
    utf8mb4. I (inexcusably) tested against the dev branch of MySQLdb.

    Keeping the 191 limit to ease migration in the future.

    Change-Id: I807e1d3a6f192b13e36811af376806d6a92e122d
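
As a footnote to the first commit message, the 191 figure appears to fall straight out of the 767-byte InnoDB index limit it cites. The sketch below just does that arithmetic; the constant and function names are mine, and the bytes-per-character values are the worst-case sizes for each MySQL character set.

```python
# Illustrative arithmetic only; names here are hypothetical.
INNODB_MAX_INDEX_BYTES = 767  # per-column index limit cited in the commit message

# Worst-case storage per character for each MySQL character set.
MAX_BYTES_PER_CHAR = {"utf8": 3, "utf8mb4": 4}


def max_indexable_chars(charset):
    """Largest string column length (in characters) that stays within the limit."""
    return INNODB_MAX_INDEX_BYTES // MAX_BYTES_PER_CHAR[charset]


if __name__ == "__main__":
    for name in ("utf8", "utf8mb4"):
        print(name, max_indexable_chars(name))
```

With four bytes per character the limit works out to 191 characters, and with three bytes to 255, which is why keeping 191 under utf8 costs some column width but leaves room for a later move to utf8mb4.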