On Tue, Jun 10, 2014 at 1:04 AM, Nuria Ruiz <nuria@wikimedia.org> wrote:
Just to narrow this down a little further from the DB server side: the eventlogging tables do use utf-8, so the fix probably doesn't require laborious schema changes (if that's what you meant by changing database types).
To follow the structure on MediaWiki, I think the easiest approach is to change DB types from varchar to varbinary where utf-8 is being used. Please let us know if you do not think this is appropriate.
Ah, so long-term ecosystem consistency is also an aim. Sounds wise. I was only commenting in case it could make the current Python encoding fix easier and faster.
Were it a new system without ties to MW, I'd push for solving character set issues properly with something like utf8mb4, depending on how you want to read/sort the data. Without that luxury, varbinary is fine.
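For illustration only, here is a rough Python sketch of the trade-off. It is not the actual eventlogging code, and the helper names (needs_utf8mb4, to_varbinary) are made up for this example. MySQL's legacy utf8 character set stores at most 3 bytes per character, so anything outside the Basic Multilingual Plane needs utf8mb4, whereas a varbinary column just holds the raw UTF-8 bytes and its length limit counts bytes rather than characters.

```python
# Hypothetical helpers, not part of eventlogging or MediaWiki.

def needs_utf8mb4(text: str) -> bool:
    """True if any character lies outside the BMP, i.e. takes 4 bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)


def to_varbinary(text: str) -> bytes:
    """Raw UTF-8 bytes, as they would be stored in a VARBINARY column."""
    return text.encode("utf-8")


if __name__ == "__main__":
    for sample in ["café", "日本語", "😀"]:
        raw = to_varbinary(sample)
        print(
            repr(sample),
            "chars:", len(sample),
            "bytes:", len(raw),
            "needs utf8mb4:", needs_utf8mb4(sample),
        )
```

With varbinary the application has to decode the bytes itself on read, and sorting and comparison are bytewise rather than collation-aware, which is what I meant by "depending on how you want to read/sort the data".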