If you have ever used the ServerSideAccountCreation log to run queries on cross-wiki account registrations, and in particular the event_userName field, please be aware of two issues we recently discovered.
• Non-ASCII characters in usernames are garbled and replaced with question marks: to mention only the most frequent examples, we have 25K account creation events with username “???” and 21K registrations with username “????”. [1] Counting usernames will therefore underreport the actual number of accounts created, especially on projects with a large proportion of non-ASCII usernames.
• A large number of new users are registering with the same username on multiple projects, which appears to violate the principle that all new accounts are unified by default. These users have no record in centralauth.globaluser and as a result are treated as non-unified accounts. [2]
For these reasons, and until these issues are addressed, you should not assume that there is a unique event per newly registered user globally.
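To make the first failure mode concrete, the corruption behaves like an ASCII-only encoding step with replacement characters. This is an illustration of the symptom only, not the confirmed pipeline code (the actual bug is tracked in [1]):

```python
# A 3-character Arabic username forced through an ASCII-only encoding
# step with replacement comes out as "???", matching the values seen
# in the table. (Illustrative sketch; see [1] for the real bug.)
name = "\u062d\u0633\u0646"  # 3 non-ASCII characters
garbled = name.encode("ascii", errors="replace").decode("ascii")
assert garbled == "???"
```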
How to avoid this problem:
• Use event_userId whenever possible
• When querying across projects, JOIN against centralauth.globaluser so that you don’t count the same user multiple times. The new analytics-store lets you do that for any MediaWiki DB or EventLogging log, which is pretty awesome.
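As a sketch of what the deduplication step buys you (all table contents and names below are invented for illustration; the real lookup goes through the CentralAuth tables on analytics-store):

```python
# Two ServerSideAccountCreation events for the same person on two
# wikis, plus one other user. Data is invented for illustration.
events = [
    {"event_userId": 11, "event_userName": "Alice", "wiki": "enwiki"},
    {"event_userId": 73, "event_userName": "Alice", "wiki": "dewiki"},
    {"event_userId": 12, "event_userName": "Bob", "wiki": "enwiki"},
]

# Stand-in for the (wiki, local name) -> global user id mapping that a
# JOIN against the CentralAuth tables would give you.
global_id = {
    ("enwiki", "Alice"): 100,
    ("dewiki", "Alice"): 100,
    ("enwiki", "Bob"): 101,
}

naive = len(events)  # counts Alice twice
unified = len({global_id[(e["wiki"], e["event_userName"])] for e in events})
assert (naive, unified) == (3, 2)
```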
Dario
[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=66123 [2] https://bugzilla.wikimedia.org/show_bug.cgi?id=66101
On Thu, Jun 5, 2014 at 1:24 PM, Dario Taraborelli < dtaraborelli@wikimedia.org> wrote:
• Use event_userId whenever possible
This is really a best practice everyone should follow in all analysis. Unless you're qualitatively interested in the contents of usernames, any analysis that uses unique names instead of ids should probably be treated as highly suspect.
Regretfully, looking up a user in Centralauth requires the use of a username. Then again, you'd need to join with a user table (with user_id) anyway since users can be renamed after they create their account and that name change won't be reflected in ServerSideAccountCreation.
On Thu, Jun 5, 2014 at 5:47 PM, Steven Walling swalling@wikimedia.org wrote:
This is really a best practice everyone should follow in all analysis. Unless you're qualitatively interested in the contents of usernames, any analysis that uses unique names instead of ids should probably be treated as highly suspect.
-- Steven Walling, Product Manager https://wikimediafoundation.org/
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
I am hoping we can recover the garbled usernames from the raw JSON logs, but you’re correct about username changes. For project level counts, though, they should not dramatically affect the accuracy of new registration numbers.
On Jun 5, 2014, at 3:51 PM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
Regretfully, looking up a user in Centralauth requires the use of a username. Then again, you'd need to join with a user table (with user_id) anyway since users can be renamed after they create their account and that name change won't be reflected in ServerSideAccountCreation.
and yes, I wish we had a gu_id included in ServerSideAccountCreation (assuming MediaWiki knows it by the time the event is generated)
On Jun 5, 2014, at 4:39 PM, Dario Taraborelli dario@wikimedia.org wrote:
I am hoping we can recover the garbled usernames from the raw JSON logs, but you’re correct about username changes. For project level counts, though, they should not dramatically affect the accuracy of new registration numbers.
It would be great if someone could document the reasons why userName is needed in this schema, e.g. on the schema talk page: http://meta.wikimedia.org/wiki/Schema_talk:ServerSideAccountCreation
When I looked at this issue early on, it was not at all obvious to me why user_names would be necessary at all if you have a user id.
Thanks,
Nuria
On Fri, Jun 6, 2014 at 1:41 AM, Dario Taraborelli < dtaraborelli@wikimedia.org> wrote:
and yes, I wish we had a gu_id included in ServerSideAccountCreation (assuming MediaWiki knows it by the time the event is generated)
I am hoping we can recover the garbled usernames from the raw JSON logs,
Please keep in mind that we only have logs from the last 90 days.
Now, we should be able to recover the user_names from the logs with character set utf-8. Note that the encoding issue does not apply just to user names: it affects any string that can have a non-ASCII value, in all EventLogging schemas, not only this one.
See, for example, the following record from the logs:
{"clientValidated": true, "event": {"campaign": "", "displayMobile": true, "isSelfMade": true, "returnTo": "\u062e\u0627\u0635:\u0645\u0631\u0641\u0648\u0639\u0627\u062a", "token": "", "userBuckets": "", "userId": 725222, "userName": "<removed>"}, "recvFrom": "mw1087", "revision": 5487345, "schema": "ServerSideAccountCreation", "seqId": 53258317, "timestamp": 1389610463, "uuid": "013953cf77a2585e983b491f2d4a2388", "webHost": "ar.wikipedia.org", "wiki": "arwiki"}
Encoding in Python 2 is a notorious pain and hard to get right, so fixing this will mean not just “restoring” records from the logs: it also involves changing database connection args, bindings, and database types. Not a huge deal, but I just want to point out that the fix goes beyond repopulating the records.
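One reason the raw logs are recoverable is visible in that record: the non-ASCII returnTo value is stored as plain-ASCII \uXXXX escapes, which json decodes losslessly. A quick sketch (Python 3 here, though the pipeline itself was Python 2):

```python
import json

# Fragment of the record above: \uXXXX escapes are plain ASCII in the
# raw log file, so the original Arabic text survives intact.
raw = '{"returnTo": "\\u062e\\u0627\\u0635:\\u0645\\u0631\\u0641\\u0648\\u0639\\u0627\\u062a"}'
return_to = json.loads(raw)["returnTo"]
assert return_to == "\u062e\u0627\u0635:\u0645\u0631\u0641\u0648\u0639\u0627\u062a"

# The lossy path, by contrast, keeps only the ASCII colon:
assert return_to.encode("ascii", errors="replace") == b"???:???????"
```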
Nuria
I am hoping we can recover the garbled usernames from the raw JSON logs,
Please keep in mind that we only have logs from the last 90 days.
this is not true: we have server-side data covering the whole lifespan of the latest ServerSideAccountCreation in /a/eventlogging/archive. I appreciate that we need to enforce the 90-day deletion/pruning for a subset of the logs, but we do have the raw data for SSAC and I do not expect this to be a log that we will delete/prune (we will have to drop the userAgent field, per the guidelines).
Now, we should be able to recover the user_names from the logs with character set utf-8. Note that the encoding issue does not apply just to user names: it affects any string that can have a non-ASCII value, in all EventLogging schemas, not only this one.
That’s correct, see also my comment on https://bugzilla.wikimedia.org/show_bug.cgi?id=66123
Encoding in python2 is a notorious pain and hard to get right so to fixing this will mean not just "restoring" records from logs but also it involves changing database connection args, bindings and database types. Not a huge deal, but I just want to point out that fixing the issue goes beyond repopulating the records.
I appreciate that; however, non-ASCII characters replaced with “?” are creating a large number of artifacts in the data that we will have to deal with at some point. We should figure out whether repopulating historical data is a priority or whether we can live with it and only fix future data.
Dario
On Fri, Jun 6, 2014 at 6:30 PM, Nuria Ruiz nuria@wikimedia.org wrote:
Encoding in python2 is a notorious pain and hard to get right so to fixing this will mean not just "restoring" records from logs but also it involves changing database connection args, bindings and database types.
Just to narrow this down a little further from the DB server-side: the eventlogging tables do use utf-8, so the fix probably doesn't require laborious schema changes (if that's what you meant by changing database types).
BR Sean
Just to narrow this down a little further from the DB server-side: the eventlogging tables do use utf-8, so the fix probably doesn't require laborious schema changes (if that's what you meant by changing database types).

To follow the structure in MediaWiki, I think the easiest is to change DB types from varchar to varbinary where utf-8 is being used. Please let us know if you do not think that is appropriate.
On Tue, Jun 10, 2014 at 1:04 AM, Nuria Ruiz nuria@wikimedia.org wrote:
To follow the structure in MediaWiki, I think the easiest is to change DB types from varchar to varbinary where utf-8 is being used. Please let us know if you do not think that is appropriate.
Ah, so long-term ecosystem consistency is also an aim. Sounds wise. I was only commenting in case it could make the current python encoding fix easier and faster.
Were it a new system without ties to MW I'd push for solving character set issues properly with something like utf8mb4, depending on how you want to read/sort the data, but without that luxury varbinary is fine.
commit 9cff78b7c6a9516611cfd055906fd0707c4d5b88
Author: Ori Livneh ori@wikimedia.org
Date: Sun Apr 28 14:46:28 2013 -0700
Default MariaDB character encoding for EL data: utf8 -> utf8mb4
This change sets the default character encoding for MySQL / MariaDB EventLogging data to 'utf8mb4' (was: 'utf8'), adding support for characters above the Base Multilingual Plane. Deployment will require manual migration of existing data in the database.
One of the consequences of this migration is that the previous default size for string columns is no longer appropriate, since the columns it generates are not indexable by InnoDB, which will not index columns beyond 767 bytes. This change therefore amends the default size to 191, which is the maximum size a utf8mb4 string column can have and still remain indexable.
Finally, as a way of not being blocked on deployment of I8fdcc046d, this change adds a live hack that substitutes 'utf8mb4' for 'utf8' in database connection strings. The hack can be removed once I8fdcc046d is deployed.
FIXME: Database setup instructions and minimum requirements should be documented.
Change-Id: Ia94f2c2155de5fb9031a8164306720e06455cced
commit 041cb2c34c540dfea05886368edc5d6209102aed
Author: Ori Livneh ori@wikimedia.org
Date: Sun Apr 28 15:13:26 2013 -0700
...and back to utf8 as default charset
The version of MySQLdb that is packaged for Precise does not know about utf8mb4. I (inexcusably) tested against the dev branch of MySQLdb.
Keeping the 191 limit to ease migration in the future.
Change-Id: I807e1d3a6f192b13e36811af376806d6a92e122d
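The 191 figure in the first commit (kept by the second) is just the InnoDB index prefix limit divided by utf8mb4's worst-case character width:

```python
# InnoDB (with the settings of the day) refused to index key prefixes
# longer than 767 bytes; utf8mb4 characters occupy up to 4 bytes each.
INNODB_INDEX_PREFIX_LIMIT = 767
UTF8MB4_MAX_BYTES_PER_CHAR = 4
max_indexable_chars = INNODB_INDEX_PREFIX_LIMIT // UTF8MB4_MAX_BYTES_PER_CHAR
assert max_indexable_chars == 191  # the default column size in the commit
```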
On Tue, Jun 10, 2014 at 1:12 PM, Ori Livneh ori@wikimedia.org wrote:
...and back to utf8 as default charset

The version of MySQLdb that is packaged for Precise does not know about utf8mb4. I (inexcusably) tested against the dev branch of MySQLdb.
Bet that was a fun day :) Somewhat like today for me...
We no longer use the precise packages. m2-master supports utf8mb4 if anyone wishes to use it.
However, as Nuria mentioned, consistency with MediaWiki may be safest if we expect that client packages will continue to have encoding challenges. Cross-db/charset joins also need to be considered.
However, as Nuria mentioned, consistency with MediaWiki may be safest if we expect that client packages will continue to have encoding challenges. Cross-db/charset joins also need to be considered.
My last reply on this thread, I promise. As I have come to learn of late (encoding in Python 2.7 is a world of joy), getting the right encoding in Python while using SQLAlchemy has a lot to do with "how" you connect to the DB.
If we use VARBINARY types, we also need to connect either by specifying convert_unicode=True or with connect_args={"charset": "utf8"}.
In the second case (specifying the charset), our DB types are VARBINARY but the SQLAlchemy column types can be strings and everything is happy; this seems like the easiest solution.
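To illustrate what the charset choice changes at read/write time, here is a stdlib-only sketch using a sqlite3 BLOB column as a stand-in for a MariaDB VARBINARY column (table and column names are invented; the real setup goes through SQLAlchemy against MariaDB):

```python
import sqlite3

# A BLOB column plays the role of VARBINARY: the database stores raw
# bytes and knows nothing about their character set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ssac (user_id INTEGER, user_name BLOB)")

name = "\u062d\u0633\u0646"  # a non-ASCII username
# The writer encodes to UTF-8 before insert -- the job that
# convert_unicode=True / connect_args={"charset": "utf8"} delegates to
# the driver in the SQLAlchemy setups discussed above.
conn.execute("INSERT INTO ssac VALUES (?, ?)", (725222, name.encode("utf-8")))

# The reader gets raw bytes back and must decode explicitly.
raw = conn.execute("SELECT user_name FROM ssac").fetchone()[0]
assert raw.decode("utf-8") == name
```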
As a data consumer, I'd prefer if columns matched between EventLogging and production DBs as closely as possible, so VARBINARY sounds like a win to me.
+1
On Jun 13, 2014, at 6:15 AM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
As a data consumer, I'd prefer if columns matched between EventLogging and production DBs as closely as possible, so VARBINARY sounds like a win to me.