>
I am hoping we can recover the garbled usernames from the raw JSON logs. Please keep in mind that we only have logs from the last 90 days.
We should be able to recover the user names from the logs as UTF-8. Note that the encoding issue is not limited to user names: it affects any string field that can hold a non-ASCII value, in all event logging schemas, not just this one.
See, for example, the following record from the logs:
{"clientValidated": true, "event": {"campaign": "", "displayMobile": true, "isSelfMade": true, "returnTo": "\u062e\u0627\u0635:\u0645\u0631\u0641\u0648\u0639\u0627\u062a", "token": "", "userBuckets": "", "userId": 725222, "userName": "<removed>"}, "recvFrom": "mw1087", "revision": 5487345, "schema": "ServerSideAccountCreation", "seqId": 53258317, "timestamp": 1389610463, "uuid": "013953cf77a2585e983b491f2d4a2388", "webHost": "ar.wikipedia.org", "wiki": "arwiki"}
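As a quick sanity check that the raw logs still carry the intended text, here is a minimal sketch in Python 3 (not the Python 2 code in production). It parses a trimmed copy of the record above and confirms that the `\uXXXX` escapes in `returnTo` decode to real Unicode characters that encode cleanly as UTF-8. The variable names are mine, and the trimmed record keeps only a few fields for brevity.

```python
import json

# Trimmed copy of the log record above; only a few fields are kept.
# The raw string (r'...') keeps the \uXXXX escapes literal, as they appear in the log file.
raw_line = r'{"event": {"returnTo": "\u062e\u0627\u0635:\u0645\u0631\u0641\u0648\u0639\u0627\u062a", "userId": 725222}, "schema": "ServerSideAccountCreation", "wiki": "arwiki"}'

# json.loads turns the \uXXXX escapes into actual Unicode code points.
record = json.loads(raw_line)
return_to = record["event"]["returnTo"]

# Encoding to UTF-8 is what we would want to store in the database.
utf8_bytes = return_to.encode("utf-8")

# 10 Arabic letters (2 bytes each in UTF-8) plus one ASCII colon.
print(len(return_to), len(utf8_bytes))
```

The point is only that the JSON logs are ASCII-safe on disk, so the original strings survive there even if they were garbled on the way into the database.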
Encoding in Python 2 is a notorious pain and hard to get right, so fixing this means more than "restoring" records from the logs: it also involves changing database connection arguments, bindings, and database types. Not a huge deal, but I want to point out that fixing the issue goes beyond repopulating the records.
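To illustrate why the connection settings matter and not just the data, here is a small Python 3 sketch of one common failure mode: UTF-8 bytes being read back as Latin-1 somewhere between the application and the database. This is an assumption for illustration only; I have not confirmed that this is the exact way our records were garbled.

```python
# Hypothetical illustration: a charset mismatch between writer and reader.
original = "\u062e\u0627\u0635"  # first three characters of the returnTo value above

# The writer sends UTF-8 bytes, but the other side interprets them as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")

# If the garbled text was stored losslessly, reversing the two steps repairs it.
repaired = garbled.encode("latin-1").decode("utf-8")

print(len(original), len(garbled), repaired == original)
```

If the stored values were truncated or had bytes replaced along the way, this kind of round trip will not work, which is why restoring from the logs is the safer route.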