[Labs-l] [Analytics] User registration date on DB replicas

Aaron Halfaker aaron.halfaker at gmail.com
Fri Feb 14 14:59:33 UTC 2014


I have a dataset containing estimated registration dates for editors who
registered before Dec. 2005.  My method assumes that user_id is
monotonically increasing and sets the lowest upper-bound available.

For example.  Let's assume the following rows:

    user_id    first_edit
    12345      20040102030405
    12344      NULL
    12343      20040102050102

Since an editor couldn't have saved a revision before registering their
account, we can assume that user 12345 registered there account on or
before 20040102030405.  If user_id is monotonically increasing, we also
know that user 12344 must have registered on or before 20040102030405,
which lets us fill in a NULL.  Similarly, we have a first_edit timestamp
for user 12343, but that edit happened pretty late.  We can actually just
continue to propagate the 20040102030405 timestamp to this user too.

After performing this approximation, we'd have the following rows:

    user_id    first_edit        user_registration_approx
    12345      20040102030405    20040102030405
    12344      NULL              20040102030405
    12343      20040102050102    20040102030405

In effect, this is similar to the approximation discussed in
https://bugzilla.wikimedia.org/show_bug.cgi?id=18638, but I'm not trying to
interpolate probable registration timings on users.  In practice we're
talking about a difference of seconds, so I haven't bothered with the extra
work.

I'm generating a datafile for English now that I should be able to share
the the end of the day:

   - user_id
   - registration_type  (see
   https://meta.wikimedia.org/wiki/Research:Attached_user and
   https://meta.wikimedia.org/wiki/Research:Newly_registered_user)
   - user_registration (from user table)
   - first_edit (lowest timestamp from "revision" and "archive" for user_id)
   - registration_approx (my approximation based on the method described
   above)

-Aaron


On Fri, Feb 14, 2014 at 6:06 AM, Federico Leva (Nemo) <nemowiki at gmail.com>wrote:

> Felipe Ortega, 14/02/2014 12:05:
>
>  Thanks a lot. Then, I look forward to the confirmation and
>> implementation of this feature. In case it's better to open a new issue
>> on bugzilla or any other action on my side (lend a hand with value
>> reviewing/testing) just let me know.
>>
>
> You could help assess the correctness of and/or code the guesstimate
> method proposed in https://bugzilla.wikimedia.org/show_bug.cgi?id=18638 ,
> for the script to fill further blanks.
>
>
> Nemo
>
> _______________________________________________
> Labs-l mailing list
> Labs-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/labs-l
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.wikimedia.org/pipermail/labs-l/attachments/20140214/c432e147/attachment.html>


More information about the Labs-l mailing list