fantastic, is there any chance we could get an even better performance if we allowed users to specify the field type in the upload form (if it’s just user_ids, validation will be faster and the app doesn’t need to check every single entry for a valid user_name too). I understand that by design the application makes no assumption about the type of that field (and in fact it accepts a mix of user_id’s and user_names, correct)?
On Nov 21, 2013, at 1:52 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
OK, I got 10k users to validate in about 30 seconds. Not instant but it does have to do a bunch of duplicate checks, multi-project batching, etc. Let me know how it works for you, and if there are any problems.
On Thu, Nov 21, 2013 at 1:53 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
On Thu, Nov 21, 2013 at 9:57 AM, Dario Taraborelli dtaraborelli@wikimedia.org wrote: I tried uploading a cohort from a recent A/B test (1,780 unique user_id’s). The async validation took about 5 minutes to complete.
If I create a temporary table with the data in my CSV and run a join with the user table against a slave, the query to validate that these users exist takes about 400ms if I use user_id (primary key in enwiki.user) and about 3s using user_name (unique in enwiki.user).
What’s the reason why it takes so long to validate a cohort in the application?
My understanding is that this is due to Labs being slow compared to stat1?
I don't think labs is that much slower though, we're talking orders of magnitude here. So, I think the reason is that currently it's validating one user at a time. Since for each record I have to check against a potential user_id and user_name match, this takes forever.
Two ways to make it much faster:
- batch every X users and do a where user_id in (...) or user_name in (...) query instead of checking each one
- create temporary tables just like Dario did
The problem is that cohorts can have users from multiple projects. That makes both approaches harder, but should still be doable. The reason I haven't done this yet is that when we scheduled 818 we broke out the performance issue and agreed we'd work on it later. Sounds important though, I'll look at it now.
Wikimetrics mailing list Wikimetrics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikimetrics