You can backfill the events according to Ori's new logic. Then your query is simple going forward.
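A minimal sketch of what that backfill could look like, assuming the MediaWiki `user` and `revision` tables of the time; the target table name `editor_milestones_backfill` is hypothetical:

    # Sketch only: a one-off backfill (MySQL dialect) that finds, for each
    # user, the timestamp of their 5th edit, keeping only those within 30
    # days of registration. The target table name is hypothetical; the
    # `user` and `revision` columns follow the MediaWiki schema of the time.
    BACKFILL_SQL = """
    INSERT INTO editor_milestones_backfill (user_id, milestone_timestamp)
    SELECT t.user_id, t.fifth_edit_ts
      FROM (SELECT u.user_id,
                   u.user_registration,
                   (SELECT r.rev_timestamp
                      FROM revision r
                     WHERE r.rev_user = u.user_id
                     ORDER BY r.rev_timestamp
                     LIMIT 1 OFFSET 4) AS fifth_edit_ts  -- the 5th edit
              FROM user u
             WHERE u.user_registration IS NOT NULL) t
     WHERE t.fifth_edit_ts IS NOT NULL
       -- MediaWiki timestamps are YYYYMMDDHHMMSS strings, so string
       -- comparison works here.
       AND t.fifth_edit_ts <= DATE_FORMAT(
             STR_TO_DATE(t.user_registration, '%Y%m%d%H%i%s')
               + INTERVAL 30 DAY, '%Y%m%d%H%i%s')
    """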
On Mon, Dec 2, 2013 at 6:55 PM, Jon Robson jrobson@wikimedia.org wrote:
If Kenan schedules a task, we can update the schema to record this for newly created data; given the issues discussed here, that seems like a good idea.
That said, we will have a lot of historic data that will still need to be joined and saved as a new table... via a UNION, I guess?
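Something like the sketch below, perhaps. Both table names are placeholders (EventLogging tables are conventionally named after the schema plus its revision id, and event fields get an `event_` prefix), and the backfill table matches the hypothetical one sketched earlier:

    # Sketch only: combining backfilled historic rows with rows from the new
    # EventLogging table. Table and column names here are placeholders.
    UNION_SQL = """
    SELECT user_id, milestone_timestamp
      FROM editor_milestones_backfill
    UNION ALL
    SELECT event_userId, timestamp
      FROM NewEditorMilestone_12345
    """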
On Mon, Dec 2, 2013 at 2:52 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
On Mon, Dec 2, 2013 at 5:45 PM, Kenan Wang kwang@wikimedia.org wrote:
It sounds good to me. Dario, Dan?
On Mon, Dec 2, 2013 at 1:35 PM, Arthur Richards arichards@wikimedia.org wrote:
On Thu, Nov 28, 2013 at 3:17 AM, Ori Livneh ori@wikimedia.org wrote:
It doesn't make sense to do it that way. Instead of inferring that something must have happened by cross-referencing conditions across datasets, just do the following: in MediaWiki, every time a user makes an edit, check their registration date and edit count. If the date is within the last thirty days and the edit count is 5, log an event. Doing it this way will easily scale to the entire cluster, not just enwiki, and to any number of bins, not just 5 edits.
Patch at https://gerrit.wikimedia.org/r/#/c/98079/; you can take it from there if you like.
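The actual patch is a PHP change to MediaWiki, but the check itself is simple; here is a minimal sketch of the same logic in Python (the 30-day window and 5-edit threshold are as described above, everything else is illustrative):

    from datetime import datetime, timedelta

    # Sketch of the check described above (the real patch is PHP, at the
    # gerrit link); function and constant names here are illustrative.
    MILESTONE_EDIT_COUNT = 5
    NEW_EDITOR_WINDOW = timedelta(days=30)

    def should_log_milestone(registered: datetime, edit_count: int,
                             now: datetime) -> bool:
        """True if this save is the user's 5th edit within 30 days of signup."""
        return (now - registered <= NEW_EDITOR_WINDOW
                and edit_count == MILESTONE_EDIT_COUNT)

    # Example: a user who registered 10 days ago saves their 5th edit.
    now = datetime(2013, 12, 2, 18, 55)
    print(should_log_milestone(now - timedelta(days=10), 5, now))  # True

Supporting more bins would just mean checking edit_count against a set of thresholds instead of a single value.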
Thanks Ori - this sounds and looks viable to me, and seems like a better solution. Kenan, Jon, Dario, Dan, et al - can we move forward with this?
I'm ok with this. I do see it as a temporary measure, though. What Ori says here, "inferring that something must have happened", is sort of the whole reason SQL exists. In my opinion, the problem is that these two data sources can't be joined efficiently to do analytics work on them. But since that's a harder problem at the moment, I agree with Ori's solution.
Jon/Arthur, who set up your Event Logging solution, and do you need help reviewing / merging this Change? I don't know much about Event Logging, but I'm happy to learn and help if you need.
Dan