Hey folks,
I replied to all of the lists with the email below from my personal address. Turns out that I didn't sign up for analytics-l with that address so I'm sending again. Sorry for the duplicates.
----
I have a dataset containing estimated registration dates for editors who registered before Dec. 2005. My method assumes that user_id is monotonically increasing and assigns each user the lowest available upper bound on their registration date.
For example, let's assume the following rows:
user_id first_edit
12345 20040102030405
12344 NULL
12343 20040102050102
Since an editor couldn't have saved a revision before registering their account, we can assume that user 12345 registered their account on or before 20040102030405. If user_id is monotonically increasing, we also know that user 12344 must have registered on or before 20040102030405, which lets us fill in a NULL. Similarly, we have a first_edit timestamp for user 12343, but that edit happened later than user 12345's first edit, so it is a looser bound. We can just continue to propagate the 20040102030405 timestamp to this user too.
After performing this approximation, we'd have the following rows:
user_id first_edit user_registration_approx
12345 20040102030405 20040102030405
12344 NULL 20040102030405
12343 20040102050102 20040102030405
In effect, this is similar to the approximation discussed in https://bugzilla.wikimedia.org/show_bug.cgi?id=18638, but I'm not trying to interpolate probable registration times for users. In practice we're talking about a difference of seconds, so I haven't bothered with the extra work.
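For anyone who wants to try the propagation step on their own data, here is a minimal Python sketch of the idea described above. It is an illustration rather than the script used to build the dataset: the function name and the (user_id, first_edit) tuple layout are hypothetical, and it assumes timestamps are 14-digit strings so that plain string comparison matches chronological order.

```python
def approximate_registration(rows):
    """Propagate the lowest known first_edit downward through user_ids.

    rows: iterable of (user_id, first_edit) pairs, where first_edit is a
    14-digit timestamp string (YYYYMMDDHHMMSS) or None.
    Returns a list of (user_id, first_edit, user_registration_approx),
    ordered from highest user_id to lowest.
    """
    result = []
    bound = None  # lowest first_edit seen so far among higher-or-equal user_ids
    # Walk from the highest user_id down, assuming user_id is monotonically
    # increasing with registration time.
    for user_id, first_edit in sorted(rows, key=lambda r: r[0], reverse=True):
        if first_edit is not None and (bound is None or first_edit < bound):
            bound = first_edit
        result.append((user_id, first_edit, bound))
    return result


rows = [
    (12345, "20040102030405"),
    (12344, None),
    (12343, "20040102050102"),
]
for row in approximate_registration(rows):
    print(row)
```

Note that in this sketch any user with no first_edit at or above their own user_id gets no estimate at all (the bound stays None), since there is nothing to propagate from.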
I'm generating a datafile for English now that I should be able to share by the end of the day: