[Foundation-l] Thank you for discussing my Top 25 Nonprofit List

Matthew Britton matthew.britton at btinternet.com
Thu Oct 4 07:25:08 UTC 2007

Gregory Maxwell wrote:
> On 10/3/07, Matthew Britton <matthew.britton at btinternet.com> wrote:
>> I'll try to clarify what geni is saying. The Wikimedia Foundation relies
>> exclusively on donations and has a very tight budget. It can only buy as
>> much hardware as it can afford, and can only just afford enough to keep
>> the sites running. (The toolserver had to be donated separately).
> It's a bit misleading to characterize this as a poor-wikimedia issue.
> Handling this amount of data is hard for anyone.  It's just that while
> other sites need a higher degree of data for things like selling
> themselves to advertisers, we don't... so efforts have been spent
> elsewhere instead.

Sorry about that. It is indeed hard for anyone. It's just easier for other
sites with comparable traffic (e.g. ebay.com, microsoft.com) because
they have vastly more resources available. :)

> Historically we've only collected the information that we need for
> capacity planning.  I linked to that stuff up thread.
>> The resources just aren't available to completely log all site traffic -
>> it would require scripts to process the mess of data generated at a fast
>> enough pace to keep up without using up precious CPU time, and a whole
>> load of extra disk space to store this data.
> As of ~January, we send records of every access to an analysis system.
>  Prior to then technical issues prevented us from collecting that kind
> of data.
> On that system we log (to disk) 1:100 and 1:1000 samples of the
> traffic.  Logging all accesses to disk would result in, as I said
> before, about 0.6 TB of log data per day. We'd run out of disk rather
> quickly. ;)
> We can send the data (at a configurable sample rate) to other hosts,
> or to programs for analysis.
> We have at least some resources to run some analysis programs but they
> must be very efficient unless they are to be run only on infrequently
> sampled data.
> We just don't have the analysis programs.

That's more or less what I was trying to say. The Foundation's resources 
also limit the number of technical staff it can hire, and keeping the
site going has to be their first priority.
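
To put rough numbers on the sampling quoted above, here is a minimal
sketch in Python. It is not the actual logging code: the function names
are made up for illustration, the random 1-in-N selection is only one
possible way to take a sample, and I am assuming "0.6 TB" means decimal
terabytes.

```python
import random


def sample_one_in_n(records, n, rng=None):
    """Yield each record with probability 1/n (hypothetical helper).

    Random selection is an assumption; the real system may simply keep
    every n-th record instead.
    """
    rng = rng or random.Random()
    for record in records:
        if rng.randrange(n) == 0:
            yield record


def sampled_bytes_per_day(full_bytes_per_day, n):
    """Estimate disk use per day if only a 1:n sample is written."""
    return full_bytes_per_day / n


# 0.6 TB/day is the figure quoted above for logging every access.
FULL_LOG_BYTES_PER_DAY = 0.6e12

for rate in (100, 1000):
    gigabytes = sampled_bytes_per_day(FULL_LOG_BYTES_PER_DAY, rate) / 1e9
    print(f"1:{rate} sample -> about {gigabytes:.1f} GB/day")
```

Under those assumptions the 1:100 sample comes to about 6 GB a day and
the 1:1000 sample to about 0.6 GB a day, which is why the sampled logs
fit on disk while a complete log would not.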

> I checked a simple aggregator for pageview stats into SVN last night.
> http://svn.wikimedia.org/viewvc/mediawiki/trunk/tools/counter/fast_counter.c?view=markup
> I've got a unique viewers by project/country one almost done, I'll
> probably check it in tonight.
>> It's not possible to just "release all log data", because it doesn't exist.
> That's not correct anymore.
> Complete data is not stored, but it is now collected and can be
> transmitted. ... It's not possible to "release all log data" because
> there are ethical, legal, and procedural obligations to avoid
> endangering the privacy of readers/editors with sloppy disclosures.
