If Kenan schedules a task, we can update the schema to record this for newly created data; given the issues we've seen with this, it seems like a good idea.
That said, we will have a lot of historic data that will still need to be joined and saved as a new table... via a UNION, I guess?
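Something like this sketch, maybe? (Python driving the SQL; every table and column name below is a guess at what our schema might look like, not the real thing.)

    import sqlite3  # stand-in here for our actual MySQL connection

    # One-off backfill: UNION the rows reconstructed from the historic data
    # with the events the new hook will log, into a single table.
    BACKFILL_SQL = """
    CREATE TABLE fifth_edit_events AS
    SELECT user_id, event_timestamp
      FROM new_hook_events          -- rows logged going forward
    UNION
    SELECT user_id, inferred_timestamp AS event_timestamp
      FROM historic_fifth_edits     -- rows reconstructed from existing data
    """

    def backfill(conn):
        conn.executescript(BACKFILL_SQL)
        conn.commit()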
On Mon, Dec 2, 2013 at 2:52 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
On Mon, Dec 2, 2013 at 5:45 PM, Kenan Wang kwang@wikimedia.org wrote:
It sounds good to me. Dario, Dan?
On Mon, Dec 2, 2013 at 1:35 PM, Arthur Richards arichards@wikimedia.org wrote:
On Thu, Nov 28, 2013 at 3:17 AM, Ori Livneh ori@wikimedia.org wrote:
It doesn't make sense to do it that way. Instead of inferring that something must have happened by cross-referencing conditions across datasets, just do the following: in MediaWiki, every time a user makes an edit, check their registration date and edit count. If the date is within the last thirty days and the edit count is 5, log an event. Doing it this way will easily scale to the entire cluster, not just enwiki, and to any number of bins, not just 5 edits.
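In rough pseudocode (Python just for illustration; the hook name, user fields, and logging call below are placeholders, not the actual MediaWiki or EventLogging API):

    from datetime import datetime, timedelta

    def on_edit_saved(user, log_event):
        # Fires after each saved edit. `user.edit_count` is the count
        # including this edit; `log_event` sends an EventLogging event.
        registered_recently = (
            datetime.utcnow() - user.registration_date <= timedelta(days=30)
        )
        if registered_recently and user.edit_count == 5:
            log_event({
                'userId': user.id,
                'editCount': user.edit_count,
                'registrationDate': user.registration_date.isoformat(),
            })

Generalizing to other bins is just swapping the == 5 for membership in a set of milestones.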
Patch at https://gerrit.wikimedia.org/r/#/c/98079/; you can take it from there if you like.
Thanks Ori - this sounds and looks viable to me, and seems like a better solution. Kenan, Jon, Dario, Dan, et al - can we move forward with this?
I'm ok with this. I do see it as a temporary measure, though. What Ori says here, "inferring that something must have happened", is sort of the whole reason SQL exists. In my opinion, the real problem is that these two data sources can't be joined efficiently to do analytics work on them. But since that's a harder problem at the moment, I agree with Ori's solution.
Jon/Arthur, who set up your Event Logging solution, and do you need help reviewing/merging this Change? I don't know much about Event Logging, but I'm happy to learn and help if you need it.
Dan