Hello all.
@Tim: By "feature" I mean having values for column user.user_registration filled for DB replicas accessible from Tool-Labs, if possible. As Oliver has suggested, I don't see any reason for this info not being available, as it is already public from Special:ListUsers.
@Aaron: Thanks a lot. I believe that is a fairly decent approximation. In fact, I suspect that daily or weekly aggregates would be enough for time-series characterization. My actual goal is to compare trends between different languages and, eventually, to correlate them with other known activity metrics.
Best regards, Felipe.
On Friday, 14 February 2014, 16:00, Aaron Halfaker aaron.halfaker@gmail.com wrote:
I have a dataset containing estimated registration dates for editors who registered before Dec. 2005. My method assumes that user_id is monotonically increasing and sets the lowest upper-bound available.
For example, let's assume the following rows:
user_id  first_edit
12345    20040102030405
12344    NULL
12343    20040102050102
Since an editor couldn't have saved a revision before registering their account, we can assume that user 12345 registered their account on or before 20040102030405. If user_id is monotonically increasing, we also know that user 12344 must have registered on or before 20040102030405, which lets us fill in a NULL. Similarly, we have a first_edit timestamp for user 12343, but that edit happened pretty late. We can actually just continue to propagate the 20040102030405 timestamp to this user too.
After performing this approximation, we'd have the following rows:
user_id  first_edit      user_registration_approx
12345    20040102030405  20040102030405
12344    NULL            20040102030405
12343    20040102050102  20040102030405
In effect, this is similar to the approximation discussed in https://bugzilla.wikimedia.org/show_bug.cgi?id=18638, but I'm not trying to interpolate probable registration timings on users. In practice we're talking about a difference of seconds, so I haven't bothered with the extra work.
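The propagation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script actually used: the function name approximate_registration and the (user_id, first_edit) tuple layout are hypothetical, and timestamps are assumed to be 14-digit strings, which sort correctly as plain text.

```python
from typing import Dict, Iterable, Optional, Tuple


def approximate_registration(
    rows: Iterable[Tuple[int, Optional[str]]],
) -> Dict[int, Optional[str]]:
    """Map each user_id to the lowest first_edit seen at that id or any higher id."""
    approx: Dict[int, Optional[str]] = {}
    lowest: Optional[str] = None
    # Walk from the highest user_id down, carrying the smallest timestamp so far.
    for user_id, first_edit in sorted(rows, key=lambda row: row[0], reverse=True):
        if first_edit is not None and (lowest is None or first_edit < lowest):
            lowest = first_edit
        # Stays None until some user with an equal or higher id has an edit.
        approx[user_id] = lowest
    return approx


# The three example rows from the message above.
rows = [
    (12345, "20040102030405"),
    (12344, None),
    (12343, "20040102050102"),
]
result = approximate_registration(rows)
```

With the example rows, all three users end up with 20040102030405. Users whose user_id is higher than that of every user with an edit get no bound at all (None here), since there is nothing to propagate down to them.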
I'm generating a datafile for English now that I should be able to share by the end of the day:
- user_id
- registration_type (see https://meta.wikimedia.org/wiki/Research:Attached_user and https://meta.wikimedia.org/wiki/Research:Newly_registered_user)
- user_registration (from user table)
- first_edit (lowest timestamp from "revision" and "archive" for user_id)
- registration_approx (my approximation based on the method described above)
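As a rough illustration of the first_edit field, the lowest timestamp across "revision" and "archive" could be computed with a UNION ALL and a MIN per user. The sketch below runs against a toy in-memory SQLite database rather than a real replica; the column names rev_user, rev_timestamp, ar_user and ar_timestamp are assumed from the MediaWiki schema of the time, and the helper names are hypothetical.

```python
import sqlite3


def first_edits(conn: sqlite3.Connection) -> dict:
    """Return {user_id: earliest timestamp} across live and deleted revisions."""
    query = """
        SELECT user_id, MIN(ts) AS first_edit
        FROM (
            SELECT rev_user AS user_id, rev_timestamp AS ts FROM revision
            UNION ALL
            SELECT ar_user AS user_id, ar_timestamp AS ts FROM archive
        ) AS edits
        GROUP BY user_id
    """
    return dict(conn.execute(query).fetchall())


def demo_connection() -> sqlite3.Connection:
    """Build a toy database with made-up rows (illustrative data only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE revision (rev_user INTEGER, rev_timestamp TEXT);
        CREATE TABLE archive (ar_user INTEGER, ar_timestamp TEXT);
        INSERT INTO revision VALUES (12345, '20040301000000');
        INSERT INTO revision VALUES (12343, '20040102050102');
        INSERT INTO archive VALUES (12345, '20040102030405');
    """)
    return conn


edits = first_edits(demo_connection())
```

In this toy data the earliest edit of user 12345 survives only in "archive", which is why both tables are consulted. Users without any edit simply do not appear in the result and would show up as NULL after a join against the user table.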
-Aaron
On Fri, Feb 14, 2014 at 6:06 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Felipe Ortega, 14/02/2014 12:05:
Thanks a lot. Then, I look forward to the confirmation and implementation of this feature. If it would be better for me to open a new issue on Bugzilla or take any other action on my side (such as lending a hand with value reviewing/testing), just let me know.
You could help assess the correctness of, and/or code, the guesstimate method proposed in https://bugzilla.wikimedia.org/show_bug.cgi?id=18638, so that the script can fill further blanks.
Nemo
Labs-l mailing list Labs-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/labs-l
Felipe, for some context on the work the team is doing on standardizing user class definitions and supportive analysis, check out: https://meta.wikimedia.org/wiki/Research:Newly_registered_user
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
From: Dario Taraborelli dtaraborelli@wikimedia.org
To: Felipe Ortega glimmer_phoenix@yahoo.es; A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics. analytics@lists.wikimedia.org
CC: Aaron Halfaker aaron.halfaker@gmail.com; Wikimedia Labs labs-l@lists.wikimedia.org
Sent: Friday, 14 February 2014, 18:48
Subject: Re: [Analytics] [Labs-l] User registration date on DB replicas
Felipe, for some context on the work the team is doing on standardizing user class definitions and supportive analysis, check out: https://meta.wikimedia.org/wiki/Research:Newly_registered_user
Thanks a lot, Dario. This simplifies things a lot, as I already have the logging table imported for all Wikipedias in the study.
BTW, regarding the graphs at the end of that page, I instantly recognized the plots from the stl() function in R. Did you use s.window = 'periodic' in the call? The loess method is fine for a first approximation, but the (daily?) time series are fairly noisy in this case, and the result may be quite sensitive to the selected window span. The residuals show some noticeable patterns, e.g. in the case of Spanish (not a good thing).
I'm also adding a comment on the talk page regarding a fourth type of entry for log_type='newusers' in logging. At least in German (maybe also in other DBs), there are > 80K entries with log_action='newusers' (yes, same as log_type). It shouldn't make a great difference; I mention it mostly for completeness of the case description.
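For anyone who wants to check this on another wiki, a count of log_action values under log_type='newusers' is enough to reveal such entries. The sketch below uses a toy in-memory SQLite table with made-up rows, not real replica data; the helper name count_newuser_actions is hypothetical, and only the columns log_type and log_action from the logging table are assumed.

```python
import sqlite3


def count_newuser_actions(conn: sqlite3.Connection) -> dict:
    """Return {log_action: number of rows} for account-creation log entries."""
    query = """
        SELECT log_action, COUNT(*)
        FROM logging
        WHERE log_type = 'newusers'
        GROUP BY log_action
    """
    return dict(conn.execute(query).fetchall())


# Toy table with made-up rows, only to show the shape of the result.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logging (log_type TEXT, log_action TEXT)")
conn.executemany(
    "INSERT INTO logging VALUES (?, ?)",
    [
        ("newusers", "create"),
        ("newusers", "create"),
        ("newusers", "newusers"),
        ("delete", "delete"),
    ],
)
action_counts = count_newuser_actions(conn)
```

Any log_action value beyond the documented ones shows up as an extra key in the returned dictionary.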
Best, Felipe.
Felipe Ortega glimmer_phoenix@yahoo.es wrote:
@Tim: By "feature" I mean having values for column user.user_registration filled for DB replicas accessible from Tool-Labs, if possible. As Oliver has suggested, I don't see any reason for this info not being available, as it is already public from Special:ListUsers.
[...]
The information from Special:ListUsers is already available in column user.user_registration. Where this column is NULL, Special:ListUsers doesn't list any information either (cf. https://en.wikipedia.org/w/index.php?title=Special%3AListUsers&username=...).
Tim
I wrote:
@Tim: By "feature" I mean having values for column user.user_registration filled for DB replicas accessible from Tool-Labs, if possible. As Oliver has suggested, I don't see any reason for this info not being available, as it is already public from Special:ListUsers.
[...]
The information from Special:ListUsers is already available in column user.user_registration. Where this column is NULL, Special:ListUsers doesn't list any information either (cf. https://en.wikipedia.org/w/index.php?title=Special%3AListUsers&username=...).
Felipe?
Tim