On Fri, Oct 11, 2013 at 10:42 AM, Dario Taraborelli <dtaraborelli@wikimedia.org> wrote:
I think we should set the right expectations about working with large cohorts. 
I would not call 700 editors a large cohort :) this should just work fine. We don't know yet whether this was related to Labs maintenance or not, and I asked Steven to share the cohort with us so we can verify that it works.

I tried uploading a CSV with 500 user_ids and got the same 504 Gateway Time-out error as Steven.

Quoting Dan's response from September 17:

I just wanted to correct one small misunderstanding.  Running large cohorts does *not* work in wikimetrics at this time for two reasons:

1. You'll have a problem uploading them as Dario mentioned (because it validates each user individually against the database, as Dario guessed).  The best solution for this is to create a temp table of all the users we are trying to upload and verify them in one query.  This would be very fast and not too hard to implement.
Uploading a cohort of this size should work, but it's a blocking operation, which is not very user friendly; Mingle card 818 addresses this issue.

Thanks, bookmarked :)
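
For what it's worth, a rough sketch of the one-query validation Dan describes might look like the following, assuming a MySQL temporary table and the standard MediaWiki user table (names are illustrative, not the actual wikimetrics code):

    -- stage the uploaded ids in a temp table (bulk insert, e.g. from the parsed CSV)
    CREATE TEMPORARY TABLE cohort_upload (
        user_id INT UNSIGNED NOT NULL,
        PRIMARY KEY (user_id)
    );
    INSERT INTO cohort_upload (user_id) VALUES (123), (456), (789);

    -- one query finds every uploaded id with no match in the wiki's user table
    SELECT c.user_id
    FROM cohort_upload c
    LEFT JOIN user u ON u.user_id = c.user_id
    WHERE u.user_id IS NULL;

Anything that query returns is an invalid id, so the whole upload can be checked in a single round trip instead of one query per user.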


2. A large cohort will not fit in the "IN" clause of a SQL query.  This is a known limitation and we have to fix it by creating a temporary table from the cohort.  We can then join to the temp table for any metrics.  The reason I've delayed this is because the same mechanism could be used to implement dynamic cohorts, boolean cohort combinations, and project level cohorts.  We should prioritize these technically related features and then I can come up with a plan to do the minimally viable thing without shooting ourselves in the foot.
I did some calculations and it seems that this is only an issue with cohorts larger than 200k editors. 
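
To make that concrete, the metric queries would go from a giant IN list to a join against a temp cohort table, roughly like this (illustrative only, using the 2013-era revision.rev_user column; the real wikimetrics queries are more involved):

    -- today: one value per cohort member, which blows up for very large cohorts
    SELECT rev_user, COUNT(*) AS edits
    FROM revision
    WHERE rev_user IN (123, 456, 789 /* ... */)
    GROUP BY rev_user;

    -- with a temp table: populate it once from the stored cohort, then join
    CREATE TEMPORARY TABLE cohort (
        user_id INT UNSIGNED NOT NULL,
        PRIMARY KEY (user_id)
    );
    SELECT r.rev_user, COUNT(*) AS edits
    FROM revision r
    JOIN cohort c ON c.user_id = r.rev_user
    GROUP BY r.rev_user;

The same mechanism could then back the dynamic, boolean, and project-level cohorts Dan mentions, since they all reduce to joining the metric query to some set of user_ids.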

We will need to discuss and benchmark performance for these jobs; this is one of the issues I'd personally like to see prioritized over UX, as it's something the entire Product team would benefit from.

Dario