________________________________
From: Dario Taraborelli <dtaraborelli(a)wikimedia.org>
To: Felipe Ortega <glimmer_phoenix(a)yahoo.es>; A mailing list for the Analytics
Team at WMF and everybody who has an interest in Wikipedia and analytics.
<analytics(a)lists.wikimedia.org>
CC: Aaron Halfaker <aaron.halfaker(a)gmail.com>; Wikimedia Labs
<labs-l(a)lists.wikimedia.org>
Sent: Friday, 14 February 2014, 18:48
Subject: Re: [Analytics] [Labs-l] User registration date on DB replicas
Felipe, for some context on the work the team is doing on standardizing user class
definitions and supportive analysis, check out:
https://meta.wikimedia.org/wiki/Research:Newly_registered_user
Thanks a lot, Dario. This simplifies things a lot, as I already have the logging table
imported for all Wikipedias in the study.
BTW, regarding the graphs at the end of that page, I instantly recognized the plots
from the stl() function in R. Did you use s.window = 'periodic' in the call? The
loess method is fine for a first approximation, but the (daily?) time series are fairly
noisy in this case, and it may be quite sensitive to the selected window span. The residuals
show some noticeable patterns, e.g. in the case of Spanish (not a good thing).
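One rough way to check whether residuals still carry a pattern is to look at their autocorrelation at the suspected period. The sketch below is not what stl() does internally; the function name autocorrelation, the lag of 7 (a weekly cycle in daily data), and the toy residuals are all assumptions made for illustration.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation of `values` at the given positive lag."""
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(values) - 1")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values)
    if variance == 0:
        return 0.0  # a constant series has no pattern to detect
    covariance = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


# Invented residuals with a leftover weekly cycle (8 weeks of daily values).
toy_residuals = [1, 0, 0, 0, 0, 0, -1] * 8
weekly = autocorrelation(toy_residuals, 7)
daily = autocorrelation(toy_residuals, 1)
```

A value close to 1 at lag 7 would suggest that the seasonal component did not absorb all of the weekly cycle; values near 0 at every lag are what well-behaved residuals should look like.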
I'm also adding a comment on the talk page regarding a 4th type of entry for
log_type='newusers' in logging. At least in German (maybe also in other DBs),
there are more than 80K entries with log_action='newusers' (yes, same as log_type). It
shouldn't make a great difference; I mention it mostly for completeness of the case description.
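A count per log_action is enough to spot such entries. The sketch below runs the query against an in-memory SQLite table so that it is self-contained; the rows are invented, the helper name count_newusers_actions is hypothetical, and on the replicas the same SELECT would run against the real logging table instead.

```python
import sqlite3


def count_newusers_actions(conn):
    """Return {log_action: count} for rows with log_type = 'newusers'."""
    cur = conn.execute(
        "SELECT log_action, COUNT(*) FROM logging "
        "WHERE log_type = 'newusers' GROUP BY log_action"
    )
    return dict(cur.fetchall())


# Toy stand-in for the logging table; these rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logging (log_type TEXT, log_action TEXT)")
conn.executemany(
    "INSERT INTO logging VALUES (?, ?)",
    [
        ("newusers", "create"),
        ("newusers", "create2"),
        ("newusers", "autocreate"),
        ("newusers", "newusers"),
        ("newusers", "newusers"),
        ("delete", "delete"),
    ],
)
counts = count_newusers_actions(conn)
```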
Best,
Felipe.
On Feb 14, 2014, at 9:27 AM, Felipe Ortega <glimmer_phoenix(a)yahoo.es> wrote:
Hello all.
>
>@Tim: By "feature" I mean having values for column user.user_registration
filled for DB replicas accessible from Tool-Labs, if possible. As Oliver has suggested, I
don't see any reason for this info not being available, as it is already public from
Special:ListUsers.
>
>@Aaron: Thanks a lot. I believe that is a fairly decent approximation. In fact, I
suspect that daily or weekly aggregates would be enough for time-series characterization.
My actual goal is comparing trends between different languages, and eventually correlating
them with other known activity metrics.
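As a minimal sketch of the daily and weekly aggregation mentioned above, assuming registration timestamps arrive as 14-digit MediaWiki strings (YYYYMMDDHHMMSS). The function names daily_counts and weekly_counts are hypothetical, and weeks are ISO weeks, which is only one possible convention.

```python
from collections import Counter
from datetime import datetime


def daily_counts(timestamps):
    """Count timestamps per day, keyed by the YYYYMMDD prefix."""
    return Counter(ts[:8] for ts in timestamps)


def weekly_counts(timestamps):
    """Count timestamps per ISO week, keyed by (ISO year, ISO week)."""
    counts = Counter()
    for ts in timestamps:
        year, week, _ = datetime.strptime(ts, "%Y%m%d%H%M%S").isocalendar()
        counts[(year, week)] += 1
    return counts


# Invented sample of registration timestamps.
sample = [
    "20040102030405",
    "20040102050102",
    "20040103000000",
    "20040105000000",
]
per_day = daily_counts(sample)
per_week = weekly_counts(sample)
```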
>
>Best regards,
>Felipe.
>
>On Friday, 14 February 2014 at 16:00, Aaron Halfaker
<aaron.halfaker(a)gmail.com> wrote:
>
>I have a dataset containing estimated registration dates for editors who registered
before Dec. 2005. My method assumes that user_id is monotonically increasing and sets the
lowest upper bound available.
>>
>>For example, let's assume the following rows:
>>
>>    user_id    first_edit
>>    12345      20040102030405
>>    12344      NULL
>>    12343      20040102050102
>>
>>Since an editor couldn't have saved a revision before registering their
account, we can assume that user 12345 registered their account on or
before 20040102030405. If user_id is monotonically increasing, we also know that user
12344 must have registered on or before 20040102030405, which lets us fill in a NULL.
Similarly, we have a first_edit timestamp for user 12343, but that edit happened pretty
late. We can actually just continue to propagate the 20040102030405 timestamp to this user
too.
>>
>>After performing this approximation, we'd have the following rows:
>>
>>    user_id    first_edit        user_registration_approx
>>    12345      20040102030405    20040102030405
>>    12344      NULL              20040102030405
>>    12343      20040102050102    20040102030405
>>
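The propagation described above amounts to a running minimum over first_edit while walking user_id from highest to lowest. The following is a minimal sketch of that idea, not Aaron's actual script: the function name approximate_registration and the list-of-tuples input are assumptions, and timestamps are compared as 14-digit strings, which sort chronologically.

```python
def approximate_registration(rows):
    """Map each user_id to the lowest first_edit seen among users whose
    user_id is greater than or equal to it (None if nothing is known).

    rows: iterable of (user_id, first_edit), where first_edit is a
    14-digit timestamp string or None.
    """
    approx = {}
    lowest = None  # lowest upper bound found so far
    for user_id, first_edit in sorted(rows, key=lambda r: r[0], reverse=True):
        if first_edit is not None and (lowest is None or first_edit < lowest):
            lowest = first_edit
        approx[user_id] = lowest
    return approx


# The three rows from the example above.
rows = [
    (12345, "20040102030405"),
    (12344, None),
    (12343, "20040102050102"),
]
result = approximate_registration(rows)
```

With these rows, all three users end up with 20040102030405, matching the table above. Users with a higher user_id than any known first edit keep None in this sketch.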
>>In effect, this is similar to the approximation discussed in
https://bugzilla.wikimedia.org/show_bug.cgi?id=18638, but I'm not trying to
interpolate probable registration timings on users. In practice we're talking about a
difference of seconds, so I haven't bothered with the extra work.
>>
>>I'm generating a datafile for English now that I should be able to share by
the end of the day:
>> * user_id
>> * registration_type (see
https://meta.wikimedia.org/wiki/Research:Attached_user
and
https://meta.wikimedia.org/wiki/Research:Newly_registered_user)
>> * user_registration (from user table)
>> * first_edit (lowest timestamp from "revision" and
"archive" for user_id)
>> * registration_approx (my approximation based on the method described above)
>>-Aaron
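For the first_edit field, one way to take the lowest timestamp across "revision" and "archive" is a UNION ALL followed by MIN. The sketch below is an assumption about how such a query could look, not the query used for the datafile; it assumes the 2014-era column names rev_user, rev_timestamp, ar_user and ar_timestamp, and uses an in-memory SQLite database with invented rows so that it runs on its own.

```python
import sqlite3

# Lowest edit timestamp per user across live and deleted revisions.
FIRST_EDIT_SQL = """
SELECT user_id, MIN(ts) AS first_edit
FROM (
    SELECT rev_user AS user_id, rev_timestamp AS ts FROM revision
    UNION ALL
    SELECT ar_user AS user_id, ar_timestamp AS ts FROM archive
) AS edits
GROUP BY user_id
"""


def first_edits(conn):
    """Return {user_id: first_edit} for every user with at least one edit."""
    return dict(conn.execute(FIRST_EDIT_SQL).fetchall())


# Toy stand-ins for the two tables; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revision (rev_user INTEGER, rev_timestamp TEXT)")
conn.execute("CREATE TABLE archive (ar_user INTEGER, ar_timestamp TEXT)")
conn.executemany(
    "INSERT INTO revision VALUES (?, ?)",
    [(12345, "20040102030405"), (12343, "20040301000000")],
)
conn.executemany(
    "INSERT INTO archive VALUES (?, ?)",
    [(12343, "20040102050102")],
)
result = first_edits(conn)
```

Users without any edit, such as 12344 in the earlier example, do not appear in this result; a LEFT JOIN from the user table would yield the NULL shown there.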
>>
>>On Fri, Feb 14, 2014 at 6:06 AM, Federico Leva (Nemo) <nemowiki(a)gmail.com> wrote:
>>
>>Felipe Ortega, 14/02/2014 12:05:
>>>
>>>>Thanks a lot. Then, I look forward to the confirmation and
>>>>implementation of this feature. In case it's better to open a new issue
>>>>on bugzilla or any other action on my side (lend a hand with value
>>>>reviewing/testing) just let me know.
>>>>
>>>
You could help assess the correctness of and/or code the guesstimate
method proposed in
https://bugzilla.wikimedia.org/show_bug.cgi?id=18638 , for the script
to fill further blanks.
Nemo
_______________________________________________
Labs-l mailing list
Labs-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/labs-l
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics