Nuria
I am hoping we
can recover the garbled usernames from the raw JSON logs,
Please keep in mind that
we only have logs from the last 90 days.
This is not true: we have server-side data covering the whole lifespan of the latest
ServerSideAccountCreation in /a/eventlogging/archive.
I appreciate that we need to enforce the 90-day deletion/pruning for a subset of the logs,
but we do have the raw data for SSAC and I do not expect this to be a log that we will
delete/prune (we will have to drop the userAgent field, per the guidelines).
Now, we shall be able to recover from the logs the
user_names with character set UTF-8. Note that the encoding issue does not apply only
to user names but to any string that can hold a non-ASCII value, in all event
logging schemas, not just this one.
That’s correct, see also my comment on
https://bugzilla.wikimedia.org/show_bug.cgi?id=66123
See, for example, the following record from the logs:
{"clientValidated": true,
 "event": {"campaign": "", "displayMobile": true, "isSelfMade": true,
           "returnTo": "\u062e\u0627\u0635:\u0645\u0631\u0641\u0648\u0639\u0627\u062a",
           "token": "", "userBuckets": "",
           "userId": 725222, "userName": "<removed>"},
 "recvFrom": "mw1087", "revision": 5487345,
 "schema": "ServerSideAccountCreation", "seqId": 53258317,
 "timestamp": 1389610463,
 "uuid": "013953cf77a2585e983b491f2d4a2388",
 "webHost": "ar.wikipedia.org", "wiki": "arwiki"}
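To illustrate why the raw logs are enough to recover the original strings, here is a minimal sketch in Python 3. It assumes one JSON object per log line and uses an abbreviated copy of the record above; the helper name recover_strings is hypothetical and not part of any existing tooling.

```python
import json

# Abbreviated copy of the record above (only a few fields kept).
# Raw strings keep the \uXXXX escapes exactly as they appear in the log.
raw_line = (
    r'{"event": {"returnTo": '
    r'"\u062e\u0627\u0635:\u0645\u0631\u0641\u0648\u0639\u0627\u062a", '
    r'"userId": 725222, "userName": "<removed>"}, '
    r'"schema": "ServerSideAccountCreation", "wiki": "arwiki"}'
)


def recover_strings(line):
    """Parse one raw JSON log line and return (userId, userName, returnTo)."""
    # json.loads converts the \uXXXX escapes into real Unicode characters.
    record = json.loads(line)
    event = record["event"]
    return event["userId"], event["userName"], event["returnTo"]


user_id, user_name, return_to = recover_strings(raw_line)

# The recovered text can be encoded as UTF-8 bytes for storage.
utf8_bytes = return_to.encode("utf-8")
```

The escapes in the log are plain ASCII, so the original characters survive in the archive regardless of what happened further down the pipeline.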
Encoding in Python 2 is a notorious pain and hard to get right, so fixing this will mean
not just "restoring" records from logs but also changing database
connection args, bindings and database types. Not a huge deal, but I just want to point
out that fixing the issue goes beyond repopulating the records.
I appreciate that; however, non-ASCII characters replaced with ? are creating a large
number of artifacts in the data that at some point we’ll have to deal with. We should
figure out whether repopulating historical data is a priority or whether we can live
with that and only fix future data.
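As a rough illustration of why the ? artifacts cannot be repaired in place, the sketch below (Python 3) mimics a lossy conversion to ASCII. This is only an assumption about how the replacement behaves; the actual step in our pipeline that produces the ? characters may be different.

```python
# Two different Arabic fragments taken from the returnTo value in the record above.
first = "\u062e\u0627\u0635"
second = "\u0645\u0631\u0641"

# Assumed lossy step: every character outside ASCII is replaced with "?".
lossy_first = first.encode("ascii", errors="replace").decode("ascii")
lossy_second = second.encode("ascii", errors="replace").decode("ascii")

# Both inputs collapse to the same string, so the stored value alone
# cannot tell us which original it came from.
collapsed = lossy_first == lossy_second
```

Since distinct values collapse to the same run of ? characters, historical records can only be fixed by repopulating them from the raw logs.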
Dario