Hello all.
I'm CCing the analytics list in case this question is also relevant for them. Not sure about research-l, so please forward this message if that's the case.
I have a question regarding the registration date of new user accounts (table user). The information is (apparently) public, since it can be retrieved from Special:ListUsers: http://en.wikipedia.org/w/index.php?title=Special%3AListUsers&username=D...
Furthermore, Dario uploaded to DataHub a CSV file with an hourly series of registration dates in 2008-2011, from enwiki: http://datahub.io/dataset/wikipedia-new-user-registrations
It would be quite interesting to study the whole series (say, back to 2004) and compare it with other languages. However, this information is not available on the DB replicas in Tool-Labs: the whole 'user_registration' column in table 'user' is NULL.
My question is: are there any reasons for redacting this (apparently public) info? I can't figure out why this could be sensitive data.
Thanks in advance, best regards. Felipe.
Felipe Ortega, 13/02/2014 14:57:
My question is: are there any reasons for redacting this (apparently public) info? I can't figure out why this could be sensitive data.
It's not redacted, it simply never existed. There aren't even log entries for old registrations; on some wiki(s) the field was populated with guesstimates. See also https://bugzilla.wikimedia.org/show_bug.cgi?id=18638 , https://bugzilla.wikimedia.org/show_bug.cgi?id=22097 depends on it/is a duplicate.
Nemo
Thanks, Nemo.
It is a shame. Does this mean that this information is also inaccurate for users created after r12207 (Dec. 2005)? At the very least, it would be useful to compare any differences between the periods 2006-2008 and 2009-present.
In fact, I remember that this information was available in the DB replicas on Toolserver, but I haven't had the chance to check it against log entries yet.
Regards, Felipe.
On Thursday, 13 February 2014 at 15:17, Federico Leva (Nemo) nemowiki@gmail.com wrote:
[...]
I can't see a reason for the data to not be available; the deficiencies were (iirc) pre-2004-ish. It's actually really trivial to tell when they started, because the "guesstimates" are the timestamp of the first revision associated with the user.
So I'm not sure that this was a deliberate design decision - and if it was, I can't imagine they'd nullify the entire field just because of some inaccuracies a decade ago ;p.
On 13 February 2014 09:13, Felipe Ortega glimmer_phoenix@yahoo.es wrote:
[...]
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Hi Oliver,
Thanks a lot. Then, I look forward to the confirmation and implementation of this feature. In case it's better to open a new issue on bugzilla or any other action on my side (lend a hand with value reviewing/testing) just let me know.
Regards, Felipe.
On Thursday, 13 February 2014 at 18:37, Oliver Keyes okeyes@wikimedia.org wrote:
[...]
--
Oliver Keyes Product Analyst Wikimedia Foundation
Felipe Ortega glimmer_phoenix@yahoo.es wrote:
Thanks a lot. Then, I look forward to the confirmation and implementation of this feature. In case it's better to open a new issue on bugzilla or any other action on my side (lend a hand with value reviewing/testing) just let me know.
[...]
What "feature" do you mean?
Tim
Felipe Ortega, 14/02/2014 12:05:
Thanks a lot. Then, I look forward to the confirmation and implementation of this feature. In case it's better to open a new issue on bugzilla or any other action on my side (lend a hand with value reviewing/testing) just let me know.
You could help assess the correctness of and/or code the guesstimate method proposed in https://bugzilla.wikimedia.org/show_bug.cgi?id=18638 , for the script to fill further blanks.
Nemo
Hey folks,
I replied to all of the lists with the email below from my personal address. Turns out that I didn't sign up for analytics-l with that address so I'm sending again. Sorry for the duplicates.
----
I have a dataset containing estimated registration dates for editors who registered before Dec. 2005. My method assumes that user_id is monotonically increasing and assigns each user the lowest available upper bound on their registration time.
For example, let's assume the following rows:
user_id  first_edit
12345    20040102030405
12344    NULL
12343    20040102050102
Since an editor couldn't have saved a revision before registering their account, we can assume that user 12345 registered their account on or before 20040102030405. If user_id is monotonically increasing, we also know that user 12344 must have registered on or before 20040102030405, which lets us fill in a NULL. Similarly, we have a first_edit timestamp for user 12343, but that edit happened later than user 12345's first edit, so we can keep propagating the tighter 20040102030405 bound to this user too.
After performing this approximation, we'd have the following rows:
user_id  first_edit      user_registration_approx
12345    20040102030405  20040102030405
12344    NULL            20040102030405
12343    20040102050102  20040102030405
In effect, this is similar to the approximation discussed in https://bugzilla.wikimedia.org/show_bug.cgi?id=18638, but I'm not trying to interpolate probable registration timings on users. In practice we're talking about a difference of seconds, so I haven't bothered with the extra work.
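The propagation described above amounts to a running minimum of first_edit, taken from the highest user_id downward. The following is a minimal Python sketch of that idea, not the script actually used to build the dataset. The function name approximate_registrations is hypothetical, and it assumes timestamps are fixed-width 14-digit strings (so plain string comparison orders them correctly) and that user_id is monotonically increasing with registration time.

```python
def approximate_registrations(rows):
    """Return {user_id: lowest known upper bound on registration time}.

    rows is an iterable of (user_id, first_edit) pairs, where first_edit is a
    14-digit timestamp string or None for users with no known edits.
    """
    approx = {}
    lowest_seen = None
    # Walk from the newest account to the oldest: any user with a lower
    # user_id must have registered no later than every bound seen so far.
    for user_id, first_edit in sorted(rows, key=lambda r: r[0], reverse=True):
        if first_edit is not None and (lowest_seen is None or first_edit < lowest_seen):
            lowest_seen = first_edit
        approx[user_id] = lowest_seen
    return approx


# The three example rows from the message above.
rows = [
    (12345, "20040102030405"),
    (12344, None),
    (12343, "20040102050102"),
]
result = approximate_registrations(rows)
```

With the example rows, all three users end up with the bound 20040102030405. Users whose user_id is higher than every user with a known edit keep a None bound, since nothing constrains them.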
I'm generating a datafile for English now that I should be able to share by the end of the day:
- user_id
- registration_type (see https://meta.wikimedia.org/wiki/Research:Attached_user and https://meta.wikimedia.org/wiki/Research:Newly_registered_user)
- user_registration (from user table)
- first_edit (lowest timestamp from "revision" and "archive" for user_id)
- registration_approx (my approximation based on the method described above)
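As an illustration of how the first_edit field in the list above could be derived, here is a rough sketch that takes the lowest timestamp across both tables. It runs against an in-memory SQLite database with made-up sample rows, not against the real replicas. The column names rev_user, rev_timestamp, ar_user and ar_timestamp are assumptions about the MediaWiki schema, and the query is not necessarily the one used for the dataset.

```python
import sqlite3

# Toy stand-ins for the "revision" and "archive" tables (sample data only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE revision (rev_user INTEGER, rev_timestamp TEXT);
CREATE TABLE archive (ar_user INTEGER, ar_timestamp TEXT);
INSERT INTO revision VALUES
    (12345, '20040102030405'),
    (12345, '20040301000000'),
    (12343, '20040102050102');
INSERT INTO archive VALUES
    (12343, '20040105000000'),
    (12342, '20040101000000');
""")

# Lowest timestamp per user across live and deleted (archived) revisions.
FIRST_EDIT_SQL = """
SELECT user_id, MIN(ts) AS first_edit
FROM (
    SELECT rev_user AS user_id, rev_timestamp AS ts FROM revision
    UNION ALL
    SELECT ar_user AS user_id, ar_timestamp AS ts FROM archive
)
GROUP BY user_id
ORDER BY user_id DESC
"""
first_edits = conn.execute(FIRST_EDIT_SQL).fetchall()
```

Users with no rows in either table do not appear in this result. Joining against the user table would be needed to get the NULL first_edit rows shown in the example.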
-Aaron
On Fri, Feb 14, 2014 at 6:06 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
[...]