OK, I got 10k users to validate in about 30 seconds. Not instant, but it does have to do a bunch of duplicate checks, multi-project batching, etc. Let me know how it works for you, and whether there are any problems.

On Thu, Nov 21, 2013 at 1:53 PM, Dan Andreescu <dandreescu@wikimedia.org> wrote:
My understanding is that this is due to Labs being slow compared to stat1?

On Thu, Nov 21, 2013 at 9:57 AM, Dario Taraborelli <dtaraborelli@wikimedia.org> wrote:
I tried uploading a cohort from a recent A/B test (1,780 unique user_ids). The async validation took about 5 minutes to complete.

If I create a temporary table with the data in my CSV and run a join with the user table against a slave, the query to validate that these users exist takes about 400ms if I use user_id (the primary key in enwiki.user) and about 3s using user_name (unique in enwiki.user).

What's the reason it takes so long to validate a cohort in the application?

I don't think Labs is that much slower, though; we're talking orders of magnitude here. So I think the reason is that it currently validates one user at a time. Since each record has to be checked against both a potential user_id and a potential user_name match, this takes forever.

Two ways to make it much faster:

* batch every X users and run a single "WHERE user_id IN (...)" or "WHERE user_name IN (...)" query instead of checking each one (sketched below)
* create temporary tables and join against them, just like Dario did

The problem is that cohorts can contain users from multiple projects. That makes both approaches harder, but they should still be doable. The reason I haven't done this yet is that when we scheduled 818 we broke the performance issue out and agreed we'd work on it later. It sounds important though, so I'll look at it now.
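To make the first of those two approaches concrete, here is a minimal sketch of batched validation grouped by project, assuming SQLAlchemy against MediaWiki's user table. The function names, the BATCH_SIZE value, and the engine_for_project helper are illustrative assumptions, not the actual Wikimetrics code:

    # Hypothetical sketch of batched cohort validation; only the
    # MediaWiki `user` table and its columns are taken from the thread.
    from collections import defaultdict
    from sqlalchemy import text

    BATCH_SIZE = 500  # arbitrary; tune against the replica

    def validate_user_names(records, engine_for_project):
        """records: iterable of (project, user_name) pairs, since a
        cohort can span multiple projects. engine_for_project maps a
        project (e.g. 'enwiki') to an Engine for that project's
        database. Returns the set of pairs found in the user table."""
        by_project = defaultdict(list)
        for project, name in records:
            by_project[project].append(name)

        found = set()
        for project, names in by_project.items():
            engine = engine_for_project(project)
            with engine.connect() as conn:
                for start in range(0, len(names), BATCH_SIZE):
                    batch = names[start:start + BATCH_SIZE]
                    # One IN (...) query per batch instead of one round
                    # trip per user; a parallel query on user_id would
                    # cover cohorts uploaded as ids.
                    placeholders = ", ".join(
                        ":n%d" % i for i in range(len(batch)))
                    params = {"n%d" % i: n for i, n in enumerate(batch)}
                    rows = conn.execute(
                        text("SELECT user_name FROM user "
                             "WHERE user_name IN (%s)" % placeholders),
                        params)
                    found.update((project, row[0]) for row in rows)
        return found

Grouping by project first means a multi-project cohort still costs only a handful of queries per wiki. The temporary-table variant trades these IN (...) lists for a bulk insert plus a join, which Dario's timings suggest is comparably fast.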
_______________________________________________
Wikimetrics mailing list
Wikimetrics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimetrics