On Wed, Nov 27, 2013 at 12:18 PM, Dan Andreescu <dandreescu@wikimedia.org> wrote:
On Wed, Nov 27, 2013 at 2:41 PM, Kenan Wang <kwang@wikimedia.org> wrote:
Dan, here is what I'm looking for:

How many users registered on enwiki in month X and reached 5 edits within 30 days of registering?

I talked with Dario and we're hoping that restricting it to enwiki solves the cross-db join issue that you were facing.


Thank you. I'll see if I can tune the query to do this efficiently. The cross-db issue comes from joining the Event Logging table with the MediaWiki table. If my tuning doesn't yield results, the only viable solution is to import the event logging data into a temp table in labsdb/enwiki_p. Then both tables will be in the same database and the query should fly. Is that possible with the schema you're capturing for mobile registrations? In other words, can that data be shared publicly?

It doesn't make sense to do it that way. Instead of inferring that something must have happened by cross-referencing conditions across datasets, just do the following: in MediaWiki, every time a user makes an edit, check their registration date and edit count. If the date is within the last thirty days and the edit count is 5, log an event. Doing it this way will easily scale to the entire cluster, not just enwiki, and to any number of bins, not just 5 edits.
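
For illustration, the per-edit check could look roughly like the sketch below. It is Python rather than MediaWiki PHP, and it is not the code in the patch. The function name, the event fields, and the assumption that the edit count already includes the edit being saved are all hypothetical.

```python
from datetime import datetime, timedelta

# Values taken from the description above; the tuple lets more bins be added.
MILESTONES = (5,)
WINDOW = timedelta(days=30)


def milestone_event(registered_at, edit_count, now):
    """Return an event dict if this edit hits a milestone inside the window.

    Hypothetical helper: edit_count is assumed to include the edit that was
    just saved, so each milestone fires at most once per user.
    """
    if edit_count not in MILESTONES:
        return None
    if now - registered_at > WINDOW:
        return None
    return {
        "milestone": edit_count,
        "days_since_registration": (now - registered_at).days,
    }


if __name__ == "__main__":
    registered = datetime(2013, 11, 1)
    # Fifth edit, 19 days after registration: an event is produced.
    print(milestone_event(registered, 5, datetime(2013, 11, 20)))
    # Fifth edit, 40 days after registration: nothing is logged.
    print(milestone_event(registered, 5, datetime(2013, 12, 11)))
```

Counting the logged events per registration month would then answer the original question without any cross-database join.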

Patch at <https://gerrit.wikimedia.org/r/#/c/98079/>; you can take it from there if you like.