[Foundation-l] Data retention

Charlotte Webb charlottethewebb at gmail.com
Wed Sep 17 16:17:37 UTC 2008


On 9/15/08, Joe Szilagyi <szilagyi at gmail.com> wrote:
>>>> CheckUser data used to be kept for 3 months, but Aaron recently
>>>> increased it to 5 months. I'm not sure why or on whose authority.
>>>>
>>>> <http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/CheckUser/CheckUser.php?r1=39734&r2=40620>
>>>
>>> I think Jon was inquiring about more than just checkuser (notice the
>>> "such as").  I would assume that anyone asking about data retention in
>>> general is not overly concerned with the specific modes of retention,
>>> but is more concerned with the maximum retention time (across all
>>> modes) of any particular type of private data.
>>
>> The other logs are not automatically rotated, and need to be manually
>> purged. The retention time is thus not consistent. Typically we have kept
>> around 6 months of data. There are error logs, and logs for various kinds
>> of special requests. They are not used for sockpuppet investigation.
>>
>> I've said in the past that I think 6 months would be a reasonable horizon
>> for all private data -- it would give us plenty of data for operations,
>> and would be a far shorter period than that used by the large commercial
>> websites.
>
> Is there a plan to detail all of this including time frames in a more public
> location, such as the privacy policy? If not, why not?

I think the basis for this change is that checkuser data older than 5
months has a 1% chance of being deleted with each passing edit[1], and
will [[almost surely]] be deleted before any of it becomes 6 months
old.
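
For anyone who wants to see the mechanism spelled out, here is a
minimal Python sketch of a purge that fires with 1% probability on each
edit. It is not the actual CheckUser code (that is PHP, linked at [1]);
the names on_edit, MAX_AGE and PURGE_PROBABILITY are made up for
illustration, and 150 days is only an assumed stand-in for "5 months".

```python
import random
from datetime import datetime, timedelta

PURGE_PROBABILITY = 0.01        # 1% per edit, as described in this message
MAX_AGE = timedelta(days=150)   # assumption: rough stand-in for "5 months"


def on_edit(rows, now, rng):
    """Record one edit, then purge old rows with probability 1%.

    rows is a list of timestamps standing in for checkuser rows.
    Returns True if a purge ran on this edit.
    """
    rows.append(now)
    if rng.random() < PURGE_PROBABILITY:
        cutoff = now - MAX_AGE
        # Keep only rows that are not older than the retention limit.
        rows[:] = [ts for ts in rows if ts >= cutoff]
        return True
    return False


if __name__ == "__main__":
    rng = random.Random(0)
    start = datetime(2008, 9, 17)
    rows = [start - timedelta(days=200)]  # one over-age row
    # Feed in 1000 edits, one per minute, and count how many purges ran.
    purges = sum(
        on_edit(rows, start + timedelta(minutes=i), rng) for i in range(1000)
    )
    print("purges in 1000 edits:", purges)
```

The point of the design is that no scheduled job is needed: the more
edits a wiki gets, the sooner old rows are cleaned up. The flip side is
that a quiet wiki can go a long time without a purge.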

To get a rough idea of how many edits we get per month I picked a
recentchanges diff from a few minutes ago and compared it to an edit
from a month earlier:

> Edit at 15:07, 17 September 2008 by User:Hapsala on [[2008 Russian financial crisis]] [2]
> Edit at 15:07, 17 August 2008 by User:118d on [[Buses in London]] [3]

6535756 edits in this selected one-month period.

I just realized August has 31 days so let's round it down to 6.5 million :P
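
The edit count above is just the difference between the two revision
IDs in the diff links [2] and [3]. A small sketch of that arithmetic,
assuming revision IDs are handed out sequentially across the whole wiki
(so the difference approximates the number of edits), with a rescaling
to a 30-day month added for comparison:

```python
# Revision IDs taken from the diff links [2] and [3] in this message.
newer_rev_id = 239037926  # 15:07, 17 September 2008
older_rev_id = 232502170  # 15:07, 17 August 2008

edits_in_period = newer_rev_id - older_rev_id

# 17 August to 17 September spans 31 days; rescale to a 30-day month.
days_in_period = 31
edits_per_30_days = edits_in_period * 30 / days_in_period

print("edits in the 31-day window:", edits_in_period)
print("scaled to 30 days:", round(edits_per_30_days))
```

Either way the figure is a bit over six million edits, which is all the
argument below needs.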

So each edit has a one percent chance of initiating a checkuser data
purge, and a 0.99 chance that all data will remain intact for the time
being. That 99% may seem high, but compounded over a month of edits it
becomes negligible. I don't know what the exact odds are, as I cannot
find a calculator that gives me a non-zero result for
"0.99 ^ (6.5 million)" [4].
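
The calculators return zero because the result is far smaller than the
smallest number a double-precision float can hold (around 10^-324).
Working in logarithms sidesteps that. A minimal sketch, assuming each
edit is an independent 1% draw:

```python
import math

p_no_purge_per_edit = 0.99
edits = 6_500_000

# log10(0.99 ** edits) == edits * log10(0.99), which is an ordinary number.
log10_p = edits * math.log10(p_no_purge_per_edit)

# Split into scientific notation: mantissa * 10 ** exponent.
exponent = math.floor(log10_p)
mantissa = 10 ** (log10_p - exponent)

print(f"0.99 ^ {edits} is about {mantissa:.2f}e{exponent}")
```

The exponent comes out at roughly -28,000, so "negligible" is putting
it mildly.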

Disclaimer: I don't know what kind of random number generator is being
used here so I can't comment on the integrity of it.

But I can say that the numbers game becomes less laughable on smaller
projects. Let's take the [[Hungarian Wikinews]] for example, which had
only 374 edits in an equal time-span[5][6].

So on this project there would be a 2.331 percent chance of going a
whole month without a purge, and thus of over-retaining checkuser data
in violation of the privacy policy[7].
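
The same calculation for a small wiki fits comfortably in a float. A
short sketch, again assuming independent 1% draws per edit; the helper
names prob_no_purge and edits_needed are made up for illustration:

```python
import math


def prob_no_purge(edits, p_purge=0.01):
    """Probability that none of `edits` edits triggers a purge."""
    return (1 - p_purge) ** edits


def edits_needed(target, p_purge=0.01):
    """Smallest number of edits for which prob_no_purge drops to `target` or below."""
    return math.ceil(math.log(target) / math.log(1 - p_purge))


# Hungarian Wikinews: 374 edits in the month.
print(f"{prob_no_purge(374):.3%}")

# Edits needed before the chance of no purge falls below one in a million.
print(edits_needed(1e-6))
```

In other words, a wiki needs well over a thousand edits inside the
grace period before the odds of a missed purge become truly remote.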

In any case $wgCUDMaxAge has been reverted to three months[8].
Probably a good call.

—C.W.

P.S. I love the edit summary[9]. Tim should run for arbcom, or something.

[1] http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/CheckUser/CheckUser.php?annotate=40847#l101
[2] http://en.wikipedia.org/w/index.php?title=2008_Russian_financial_crisis&diff=239037926
[3] http://en.wikipedia.org/w/index.php?title=Buses_in_London&diff=232502170
[4] http://www.google.com/search?q=.99+%5E+6500000
[5] http://hu.wikinews.org/w/index.php?title=Sablon:Olaj&diff=9244
[6] http://hu.wikinews.org/w/index.php?title=Szerkeszt%C5%91:Gondnok&diff=8870
[7] http://www.google.com/search?q=.99+%5E+374
[8] http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/CheckUser/CheckUser.php?r1=40740&r2=40847
[9] http://svn.wikimedia.org/viewvc/mediawiki?view=rev&revision=40847


