OK, I got 10k users to validate in about 30 seconds. Not instant, but it does have to do a bunch of duplicate checks, multi-project batching, etc. Let me know how it works for you, and if there are any problems.
On Thu, Nov 21, 2013 at 1:53 PM, Dan Andreescu <dandreescu@wikimedia.org> wrote:
On Thu, Nov 21, 2013 at 9:57 AM, Dario Taraborelli <dtaraborelli@wikimedia.org> wrote:
I tried uploading a cohort from a recent A/B test (1,780 unique user_ids). The async validation took about 5 minutes to complete.
If I create a temporary table with the data in my CSV and run a join with the user table against a slave, the query to validate that these users exist takes about 400ms if I use user_id (primary key in enwiki.user) and about 3s using user_name (unique in enwiki.user).
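Roughly the shape of the check, as a Python sketch (the connection details, CSV path, and temporary table name are just placeholders):

import csv
import os
import pymysql

# connect to a labsdb/analytics slave; host and credentials are placeholders
conn = pymysql.connect(host='analytics-slave', db='enwiki',
                       read_default_file=os.path.expanduser('~/.my.cnf'))
with conn.cursor() as cur:
    # temporary table holding the user_ids from the A/B test CSV
    cur.execute('CREATE TEMPORARY TABLE cohort_tmp (user_id INT UNSIGNED PRIMARY KEY)')
    with open('cohort.csv') as f:  # assumes one user_id per line, no header
        ids = [(int(row[0]),) for row in csv.reader(f)]
    cur.executemany('INSERT IGNORE INTO cohort_tmp (user_id) VALUES (%s)', ids)

    # single join against enwiki.user to see which ids actually exist
    # (~400ms joining on user_id, the primary key; ~3s when matching on user_name instead)
    cur.execute("""
        SELECT c.user_id, u.user_id IS NOT NULL AS valid
        FROM cohort_tmp c
        LEFT JOIN user u ON u.user_id = c.user_id
    """)
    results = cur.fetchall()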
Why does it take so long to validate a cohort in the application?
My understanding is that this is due to Labs being slow compared to stat1?
I don't think Labs is that much slower, though; we're talking orders of magnitude here. So I think the reason is that it's currently validating one user at a time. Since for each record I have to check against both a potential user_id and a potential user_name match, this takes forever.
Two ways to make it much faster:
- batch every X users and do a where user_id in (...) or user_name in (...) query instead of checking each one (rough sketch below)
- create temporary tables just like Dario did
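To make the first option concrete, it would look roughly like this per project (the function name, batch size, and plain DB-API cursor are illustrative, not the actual wikimetrics code):

# Sketch of the batched approach (not the current wikimetrics implementation).
def find_existing_users(cursor, raw_values, batch_size=500):
    """raw_values: the user_id / user_name strings from the uploaded cohort.
    Returns (user_id, user_name) rows for every user that matches on either column."""
    matches = []
    for start in range(0, len(raw_values), batch_size):
        batch = raw_values[start:start + batch_size]
        placeholders = ', '.join(['%s'] * len(batch))
        # one query per batch instead of one (or two) per user
        cursor.execute(
            'SELECT user_id, user_name FROM user '
            'WHERE user_id IN ({0}) OR user_name IN ({0})'.format(placeholders),
            batch + batch)
        matches.extend(cursor.fetchall())
    return matches

With batches of a few hundred, a cohort like Dario's 1,780 users would be a handful of queries per project instead of thousands.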
The problem is that cohorts can have users from multiple projects. That makes both approaches harder, but it should still be doable. The reason I haven't done this yet is that when we scheduled 818, we broke out the performance issue and agreed we'd work on it later. It sounds important though, so I'll look at it now.