On Tue, Jun 10, 2014 at 1:04 AM, Nuria Ruiz <nuria@wikimedia.org> wrote:
>Just to narrow this down a little further from the DB server-side: the eventlogging tables do use utf-8, so the fix probably doesn't require laborious schema changes (if that's what you meant by changing database types).
To follow the structure on mediawiki I think the easiest is to change db types from varchar to varbinary where utf-8 is being used. Please let us know if you do not think it is appropriate.

Ah, so long-term ecosystem consistency is also an aim. Sounds wise. I was only commenting in case it could make the current Python encoding fix easier and faster.

Were it a new system without ties to MW I'd push for solving character set issues properly with something like utf8mb4, depending on how you want to read/sort the data, but without that luxury varbinary is fine.
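
For anyone following along, here is a rough Python sketch of the distinction I mean. It assumes MySQL's legacy "utf8" character set, which stores at most 3 bytes per character, whereas utf8mb4 allows 4. The helper names (utf8_len, fits_mysql_utf8) are made up for illustration and are not part of eventlogging or MediaWiki.

```python
# Illustration only: how many bytes characters need once encoded as UTF-8.
# Assumption: MySQL's legacy "utf8" charset stores at most 3 bytes per
# character; utf8mb4 stores up to 4. Helper names are hypothetical.

def utf8_len(ch: str) -> int:
    """Return the number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_mysql_utf8(text: str) -> bool:
    """True if every character fits within the 3-byte limit."""
    return all(utf8_len(ch) <= 3 for ch in text)


samples = {
    "ascii": "a",
    "latin": "\u00f1",        # n with tilde
    "cjk": "\u8a9e",          # a CJK ideograph
    "emoji": "\U0001F600",    # outside the Basic Multilingual Plane
}

for name, ch in samples.items():
    print(name, utf8_len(ch), fits_mysql_utf8(ch))

# A varbinary column just holds bytes, so the application encodes on
# write and decodes on read; the database never interprets the text.
raw = "caf\u00e9 \U0001F600".encode("utf-8")
assert raw.decode("utf-8") == "caf\u00e9 \U0001F600"
```

The trade-off with varbinary is that comparison and sorting are bytewise rather than collation-aware, which is why I said it depends on how you want to read/sort the data.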