Dear Wikimetrics users,
I've just deployed asynchronous cohort upload. This is feature #818 (https://mingle.corp.wikimedia.org/projects/analytics/cards/818), and it lets you upload larger cohorts because validation now happens behind the scenes. I'll go over how the new functionality works here, and will rely on one of you to point me to the appropriate on-wiki place to update the documentation.
Visiting /cohorts and clicking "Upload Cohort" works as before. But once you click "Upload CSV", your form is validated and processed, and you're taken back to the cohorts page. Your new cohort is created immediately but is not yet validated. While it validates, you'll see the validation status and have a few options:
- Remove Cohort. This is destructive and will remove the cohort from your list. Use it if you made a mistake, uploaded the wrong file, etc.
- Validate Again. This runs validation again. One possible use: say you upload a cohort with some *very* newly registered users, and because of replication lag to the labsdb databases, most of them come up invalid. You can then run validation again.
- Refresh. This just refreshes the validation status and updates the counts that show up below.
You will not have the "Create Report" option until validation is done. And when you do create a report, only valid users will be considered and used in the output.
One caveat: validation is still slow, and the time limit for the asynchronous task is set to 1 hour. I have some ideas for making this faster by batching, and I can also increase the time limit per task (though that has other repercussions). For now, keep in mind that the theoretical maximum cohort size you should upload is roughly 18,000 users. I'd love feedback on whether it's OK to increase the time limit, or whether people would rather I focus on making validation faster.
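(If you're wondering where 18,000 comes from, it's just back-of-the-envelope math: a 3,600-second task limit at the implied rate of roughly 0.2 seconds per validated user gives 3,600 / 0.2 = 18,000 users. Treat both numbers as rough.)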
Dan
On Nov 21, 2013, at 7:45 AM, Dario Taraborelli <dario@wikimedia.org> wrote:

thanks Dan, this is awesome – I’ll give it a try this morning with some of the recent mobile cohorts.
Dan,
I tried uploading a cohort from a recent A/B test (1,780 unique user_ids). The async validation took about 5 minutes to complete.
If I create a temporary table with the data in my CSV and run a join with the user table against a slave, the query to validate that these users exist takes about 400ms if I use user_id (primary key in enwiki.user) and about 3s using user_name (unique in enwiki.user).
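Concretely, the check I ran was something like the following (the staging table is ad hoc; the column definitions follow enwiki.user):

    -- Scratch table holding the CSV contents (name is arbitrary).
    CREATE TEMPORARY TABLE cohort_staging (
        user_id   INT UNSIGNED,
        user_name VARBINARY(255)
    );
    -- ... populate cohort_staging from the CSV (LOAD DATA / INSERTs) ...

    -- Validate by user_id (primary key of enwiki.user): ~400ms.
    SELECT s.user_id
    FROM cohort_staging s
    JOIN user u ON u.user_id = s.user_id;

    -- Validate by user_name (unique index on enwiki.user): ~3s.
    SELECT s.user_name
    FROM cohort_staging s
    JOIN user u ON u.user_name = s.user_name;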
Why does it take so long to validate a cohort in the application?
Dario
On Nov 21, 2013, at 10:00 AM, Steven Walling <swalling@wikimedia.org> wrote:
My understanding is that this is due to Labs being slow compared to stat1?
Dario Taraborelli replied:

I have no evidence that this is the case. A scan of the user table using the same fields/keys as the ones I used on the private slaves takes less than a second on Tool Labs.
On Thu, Nov 21, 2013 at 1:53 PM, Dan Andreescu <dandreescu@wikimedia.org> wrote:
I don't think Labs is that much slower, though; we're talking orders of magnitude here. I think the reason is that it currently validates one user at a time, and since each record has to be checked against both a potential user_id match and a potential user_name match, this takes forever.
Two ways to make it much faster:
- batch every X users and do a single "where user_id in (...) or user_name in (...)" query instead of checking each one (see the sketch below)
- create temporary tables, just like Dario did
The problem is that cohorts can contain users from multiple projects. That makes both approaches harder, but they should still be doable. The reason I haven't done this yet is that when we scheduled 818, we broke out the performance issue and agreed to work on it later. It sounds important, though, so I'll look at it now.
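To sketch the batching idea (illustrative only: the batch size and the per-project grouping are the fiddly parts, and the literal values below are made up):

    -- One query per batch and per project, instead of one query per user.
    -- Anything that comes back is valid; the rest of the batch is marked invalid.
    SELECT user_id, user_name
    FROM user
    WHERE user_id IN (123, 456, 789)       -- up to X ids from the batch
       OR user_name IN ('Alice', 'Bob');   -- up to X names from the batch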
OK, I got 10k users to validate in about 30 seconds. Not instant, but it does have to do a bunch of duplicate checks, multi-project batching, etc. Let me know how it works for you, and if there are any problems.
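(For scale: that's about 3 ms per user, compared to the roughly 170 ms per user implied by Dario's 5-minute run on 1,780 users, so somewhere around a 50x improvement, give or take.)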
On Thu, Nov 21, 2013 at 6:00 PM, Dario Taraborelli <dtaraborelli@wikimedia.org> wrote:

fantastic, is there any chance we could get even better performance if we allowed users to specify the field type in the upload form? If the file is all user_ids, validation will be faster, since the app doesn't need to check every single entry for a valid user_name too. I understand that by design the application makes no assumption about the type of that field (and in fact it accepts a mix of user_ids and user_names, correct)?
On Nov 21, 2013, at 3:16 PM, Dan Andreescu <dandreescu@wikimedia.org> wrote:

absolutely, it would run up to 2x faster if the file were all user_ids and the user specified that up front. But currently, yes, you can mix user_ids and user_names.
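(To illustrate the 2x: declaring the type up front lets the validation query drop the OR branch and hit a single index. Literal values made up:)

    -- All-user_id upload: one primary-key lookup per batch, no name check.
    SELECT user_id
    FROM user
    WHERE user_id IN (123, 456, 789);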
is there any foreseeable use case for mixed cohorts? If not, it sounds like this would be a useful enhancement.
On Thu, Nov 21, 2013 at 3:45 PM, Edward Galvez <egalvez@wikimedia.org> wrote:

Thanks for this! Just did a cohort of 15K and it worked fine. I unintentionally hit the "back" button, but after clicking "forward" my cohort was validated, not even 10 seconds later.
Also, I can't seem to find the place that listed which users were not valid. Did we lose that ability?
- E
Also! Just to introduce myself, I'm one of the interns with the Program Evaluation & Design team - thus this upgrade is very timely. Thank you!
On Nov 21, 2013, at 5:24 PM, Dan Andreescu <dandreescu@wikimedia.org> wrote:

Hi Edward. Yes, we temporarily lost the UI that shows which users are invalid. I wasn't sure exactly what people needed here, so I didn't hazard a guess. The data is all there, though, and I can easily show you the invalid users and the reasons they were marked invalid for your cohort.
I just need you and someone else to say how you'd like it to work, and I can whip up a view for it tomorrow. Dario, any opinion on how invalid users should be displayed? The only weird part right now is that you can't upload again; you'd have to delete the whole cohort and start over...
— Sent from Mailbox for iPhone
On Nov 21, 2013, at 5:28 PM, Dario Taraborelli <dtaraborelli@wikimedia.org> wrote:

I (and by extension other people on the research team) will probably only ever use user_ids (which we know in advance are valid), so it's probably best to ask the Program Evaluation folks or community members who may rely on usernames.
Yes, we need user names, and I can imagine some cases for potential mixed cohorts, but I'm not sure about prevalence. - Jaime
OK Edward, I added a little link to see the invalid users for your validated cohort. Just click the "X are invalid" text on your cohort's tab. It will list the value it tried to validate and the reason it thought that value was invalid.
Dan