Hello all.
@Tim: By "feature" I mean having values for column user.user_registration filled
for DB replicas accessible from Tool-Labs, if possible. As Oliver has suggested, I
don't see any reason for this info not being available, as it is already public from
Special:ListUsers.
@Aaron: Thanks a lot. I believe that is a fairly decent approximation. In fact, I suspect
that daily or weekly aggregates would be enough for time-series characterization. My
actual goal is to compare trends between different languages and, eventually, to correlate
them with other known activity metrics.
Best regards,
Felipe.
On Friday, 14 February 2014 at 16:00, Aaron Halfaker <aaron.halfaker(a)gmail.com>
wrote:
I have a dataset containing estimated registration dates for editors who registered before
Dec. 2005. My method assumes that user_id is monotonically increasing and sets the lowest
upper-bound available.
For example, let's assume the following rows:
user_id first_edit
12345 20040102030405
12344 NULL
12343 20040102050102
Since an editor couldn't have saved a revision before registering their account, we
can assume that user 12345 registered their account on or before 20040102030405. If
user_id is monotonically increasing, we also know that user 12344 must have registered on
or before 20040102030405, which lets us fill in a NULL. Similarly, we have a first_edit
timestamp for user 12343, but that edit happened later than user 12345's first edit, so
it is a looser bound. We can just continue to propagate the 20040102030405 timestamp to
this user too.
After performing this approximation, we'd have the following rows:
user_id first_edit user_registration_approx
12345 20040102030405 20040102030405
12344 NULL 20040102030405
12343 20040102050102 20040102030405
In effect, this is similar to the approximation discussed in
https://bugzilla.wikimedia.org/show_bug.cgi?id=18638, but I'm not trying to
interpolate probable registration timings on users. In practice we're talking about a
difference of seconds, so I haven't bothered with the extra work.
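
The propagation described above can be sketched in a few lines of Python. This is only
an illustration under stated assumptions, not the script used to build the dataset: the
function name approximate_registration is hypothetical, rows are assumed to be
(user_id, first_edit) pairs, and timestamps are assumed to be 14-digit
YYYYMMDDHHMMSS strings (so plain string comparison orders them correctly).

```python
def approximate_registration(rows):
    """Return {user_id: registration_approx} for (user_id, first_edit) rows.

    Assumes user_id is monotonically increasing with registration time and
    that first_edit is a 14-digit timestamp string or None.
    """
    approx = {}
    lowest_bound = None  # lowest first_edit seen among equal-or-higher user_ids

    # Walk from the highest user_id down, carrying the lowest upper bound.
    for user_id, first_edit in sorted(rows, key=lambda r: r[0], reverse=True):
        if first_edit is not None and (lowest_bound is None or first_edit < lowest_bound):
            lowest_bound = first_edit
        # Stays None if no equal-or-higher user_id has any edit.
        approx[user_id] = lowest_bound
    return approx


rows = [
    (12345, "20040102030405"),
    (12344, None),
    (12343, "20040102050102"),
]
result = approximate_registration(rows)
for user_id in sorted(result, reverse=True):
    print(user_id, result[user_id])
```

For the three example rows this assigns 20040102030405 to all three users, matching the
table above.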
I'm generating a datafile for English now that I should be able to share by the end
of the day:
* user_id
* registration_type (see https://meta.wikimedia.org/wiki/Research:Attached_user
  and https://meta.wikimedia.org/wiki/Research:Newly_registered_user)
* user_registration (from user table)
* first_edit (lowest timestamp from "revision" and "archive" for user_id)
* registration_approx (my approximation based on the method described above)
-Aaron
On Fri, Feb 14, 2014 at 6:06 AM, Federico Leva (Nemo) <nemowiki(a)gmail.com> wrote:
Felipe Ortega, 14/02/2014 12:05:
> Thanks a lot. Then, I look forward to the confirmation and
> implementation of this feature. In case it's better to open a new issue
> on bugzilla or any other action on my side (lend a hand with value
> reviewing/testing) just let me know.
You could help assess the correctness of and/or code the guesstimate method
proposed in https://bugzilla.wikimedia.org/show_bug.cgi?id=18638, for the script
to fill further blanks.
Nemo