If you have ever used the ServerSideAccountCreation log to query cross-wiki account registrations, and in particular the event_userName field, please be aware of two issues we recently discovered.
• Non-ASCII characters in usernames are garbled and replaced with question marks. To mention only the most frequent examples, we have 25K account creation events with username “???” and 21K registrations with username “????”. [1] Counting distinct usernames will therefore underreport the actual number of accounts created, especially for projects with a large proportion of non-ASCII usernames.
• There’s a large number of new users registering with the same username on multiple projects, which seems to violate the principle that all new accounts are unified by default. These users don’t have a record in centralauth.globaluser and as a result they are treated as non-unified accounts. [2]
For these reasons, and until these issues are addressed, you should not assume that there is exactly one event per newly registered user globally.
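To make the first issue concrete, here is a minimal sketch in Python. The events are invented for illustration, and the `wiki` key is an assumption about how the project is recorded alongside each event; only event_userId and event_userName come from the log described above.

```python
# Hypothetical, hand-made events. The values are invented; "wiki" is an
# assumed key for the project on which the account was created.
events = [
    {"wiki": "ruwiki", "event_userId": 101, "event_userName": "???"},
    {"wiki": "ruwiki", "event_userId": 102, "event_userName": "???"},
    {"wiki": "jawiki", "event_userId": 101, "event_userName": "????"},
    {"wiki": "enwiki", "event_userId": 555, "event_userName": "Alice"},
]


def count_by_username(events):
    """Count distinct usernames; garbled names collapse into one."""
    return len({e["event_userName"] for e in events})


def count_by_wiki_and_id(events):
    """Count distinct (wiki, event_userId) pairs, i.e. local accounts."""
    return len({(e["wiki"], e["event_userId"]) for e in events})


print(count_by_username(events))     # two different "???" accounts merge
print(count_by_wiki_and_id(events))  # every local account is counted
```

Note that the pair (wiki, event_userId) identifies a local account, not a person: the same user registering on two projects still appears twice, which is the second issue.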
How to avoid this problem:
• Use event_userId whenever possible
• When querying across projects, JOIN against globaluser so that you don’t count the same user multiple times. The new analytics-store allows you to do that for any MediaWiki DB or EventLogging log, which is pretty awesome.
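The two suggestions above can be sketched roughly as follows. This is not a query against the real analytics-store: it uses an in-memory SQLite database, and the tables `account_log` and `local_user` and their columns are hypothetical stand-ins for the EventLogging table and for each wiki's user table. Only the globaluser name and event_userId are taken from this message; `gu_id` and `gu_name` are assumed column names. Accounts without a globaluser row fall back to their (wiki, event_userId) pair, so they are counted as separate non-unified accounts.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with invented rows (all hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE account_log (wiki TEXT, event_userId INTEGER);
        CREATE TABLE local_user  (wiki TEXT, user_id INTEGER, user_name TEXT);
        CREATE TABLE globaluser  (gu_id INTEGER, gu_name TEXT);

        -- One unified user with accounts on two wikis.
        INSERT INTO account_log VALUES ('ruwiki', 1), ('enwiki', 7);
        INSERT INTO local_user  VALUES ('ruwiki', 1, 'Алиса'),
                                       ('enwiki', 7, 'Алиса');
        INSERT INTO globaluser  VALUES (10, 'Алиса');

        -- Same name on two wikis, but no globaluser row (non-unified).
        INSERT INTO account_log VALUES ('dewiki', 3), ('frwiki', 4);
        INSERT INTO local_user  VALUES ('dewiki', 3, 'Bob'),
                                       ('frwiki', 4, 'Bob');
    """)
    return conn


def count_unique_users(conn):
    """Count users once per global account, else once per local account."""
    query = """
        SELECT COUNT(DISTINCT COALESCE(
                   'g:' || g.gu_id,
                   'l:' || l.wiki || ':' || l.event_userId))
        FROM account_log AS l
        JOIN local_user AS u
          ON u.wiki = l.wiki AND u.user_id = l.event_userId
        LEFT JOIN globaluser AS g
          ON g.gu_name = u.user_name
    """
    return conn.execute(query).fetchone()[0]


conn = build_demo_db()
print(count_unique_users(conn))  # 4 events, but fewer distinct users
```

The LEFT JOIN is a deliberate choice in this sketch: an inner join against globaluser would silently drop the accounts described in [2], since they have no global record.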
Dario
[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=66123
[2] https://bugzilla.wikimedia.org/show_bug.cgi?id=66101