[Labs-l] [Analytics] User registration date on DB replicas

Aaron Halfaker aaron.halfaker at gmail.com
Fri Feb 14 23:16:10 UTC 2014


OK, so the dataset I described above will be located here within a few
minutes:
http://stat1001.wikimedia.org/public-datasets/analytics/new_user_info.enwiki.tsv

However, there's an issue I didn't foresee.  It looks like some rows in the
archive table have some dubious timestamps and are causing problems with
relying on first_edit.  I think I'm going to take another pass where I
disregard archive edits to see if it ends up producing a more sane result.
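
A rough sketch of what that second pass would change, assuming each user's
timestamps are available as 14-digit MediaWiki strings (which sort
chronologically when compared as text). The function name, the
include_archive flag, and the sample archive timestamp are all made up for
illustration; this is not the actual query behind the dataset.

```python
from typing import Iterable, Optional


def first_edit(
    revision_ts: Iterable[str],
    archive_ts: Iterable[str],
    include_archive: bool = True,
) -> Optional[str]:
    """Return the lowest timestamp among a user's edits, or None if none exist."""
    candidates = list(revision_ts)
    if include_archive:
        # Archive rows are the ones suspected of carrying dubious timestamps.
        candidates.extend(archive_ts)
    # Fixed-width digit strings compare in chronological order.
    return min(candidates) if candidates else None


# Hypothetical user: one normal revision and one suspiciously early archive row.
with_archive = first_edit(["20040102030405"], ["20010101000000"])
without_archive = first_edit(["20040102030405"], ["20010101000000"], include_archive=False)
```

With the archive row included, the suspicious value becomes first_edit; with
it excluded, the revision timestamp is used instead.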

-Aaron




On Fri, Feb 14, 2014 at 11:48 AM, Dario Taraborelli <
dtaraborelli at wikimedia.org> wrote:

> Felipe, for some context on the work the team is doing on standardizing
> user class definitions and supportive analysis, check out:
> https://meta.wikimedia.org/wiki/Research:Newly_registered_user
>
> On Feb 14, 2014, at 9:27 AM, Felipe Ortega <glimmer_phoenix at yahoo.es>
> wrote:
>
> Hello all.
>
> @Tim: By "feature" I mean having values for column user.user_registration
> filled for DB replicas accessible from Tool-Labs, if possible. As Oliver
> has suggested, I don't see any reason for this info not being available, as
> it is already public from Special:ListUsers.
>
> @Aaron: Thanks a lot. I believe that is a fairly decent approximation. In
> fact, I suspect that daily or weekly aggregates would be enough for
> time-series characterization. My actual goal is comparing trends between
> different languages, and eventually correlation with other known activity
> metrics.
>
> Best regards,
> Felipe.
>
>
>
>   On Friday, 14 February 2014 16:00, Aaron Halfaker <
> aaron.halfaker at gmail.com> wrote:
>
> I have a dataset containing estimated registration dates for editors who
> registered before Dec. 2005.  My method assumes that user_id is
> monotonically increasing and assigns each user the lowest upper bound
> available on their registration time.
>
> For example, let's assume the following rows:
>
>     user_id    first_edit
>     12345      20040102030405
>     12344      NULL
>     12343      20040102050102
>
> Since an editor couldn't have saved a revision before registering their
> account, we can assume that user 12345 registered their account on or
> before 20040102030405.  If user_id is monotonically increasing, we also
> know that user 12344 must have registered on or before 20040102030405,
> which lets us fill in a NULL.  Similarly, we have a first_edit timestamp
> for user 12343, but that edit happened pretty late.  We can actually just
> continue to propagate the 20040102030405 timestamp to this user too.
>
> After performing this approximation, we'd have the following rows:
>
>     user_id    first_edit        user_registration_approx
>     12345      20040102030405    20040102030405
>     12344      NULL              20040102030405
>     12343      20040102050102    20040102030405
>
> In effect, this is similar to the approximation discussed in
> https://bugzilla.wikimedia.org/show_bug.cgi?id=18638, but I'm not trying
> to interpolate probable registration timings on users.  In practice we're
> talking about a difference of seconds, so I haven't bothered with the extra
> work.
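
A minimal sketch of the propagation described above, assuming the rows are
available as (user_id, first_edit) pairs with first_edit either a 14-digit
timestamp string or None. The function name approximate_registration is
hypothetical, and this is an illustration of the idea rather than the script
used to build the dataset.

```python
from typing import Dict, Iterable, Optional, Tuple


def approximate_registration(
    rows: Iterable[Tuple[int, Optional[str]]],
) -> Dict[int, Optional[str]]:
    """Map each user_id to the lowest first_edit among users with an equal or higher id."""
    approx: Dict[int, Optional[str]] = {}
    lowest: Optional[str] = None
    # Walk from the highest user_id down, carrying the lowest timestamp seen so far.
    for user_id, first_edit in sorted(rows, key=lambda row: row[0], reverse=True):
        if first_edit is not None and (lowest is None or first_edit < lowest):
            lowest = first_edit
        # Stays None until some user at or above this id has a known first edit.
        approx[user_id] = lowest
    return approx


# The three example rows from the message.
rows = [
    (12345, "20040102030405"),
    (12344, None),
    (12343, "20040102050102"),
]
result = approximate_registration(rows)
```

For the example rows, every user ends up with 20040102030405, matching the
table above. Note that a single bad, too-early first_edit would be carried
down to every lower user_id in the same way.
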
>
> I'm generating a datafile for English now that I should be able to share
> by the end of the day, with the following columns:
>
>    - user_id
>    - registration_type  (see
>    https://meta.wikimedia.org/wiki/Research:Attached_user and
>    https://meta.wikimedia.org/wiki/Research:Newly_registered_user)
>    - user_registration (from user table)
>    - first_edit (lowest timestamp from "revision" and "archive" for
>    user_id)
>    - registration_approx (my approximation based on the method described
>    above)
>
> -Aaron
>
>
> On Fri, Feb 14, 2014 at 6:06 AM, Federico Leva (Nemo) <nemowiki at gmail.com> wrote:
>
> Felipe Ortega, 14/02/2014 12:05:
>
>  Thanks a lot. Then, I look forward to the confirmation and
> implementation of this feature. In case it's better to open a new issue
> on bugzilla or any other action on my side (lend a hand with value
> reviewing/testing) just let me know.
>
>
> You could help assess the correctness of and/or code the guesstimate
> method proposed in https://bugzilla.wikimedia.org/show_bug.cgi?id=18638 ,
> for the script to fill further blanks.
>
>
> Nemo