I'm checking in some code to deal with compressing data in the old
table. The primary motivation here is to decrease the amount of disk
and cache space necessary for storing page revision data, without the
complications and fragility of differential compression.[1]
Compression is done with gzdeflate() / gzinflate(), which requires zlib
support compiled into PHP. This is the same compression that would be
used in a gzip file, but without the header bytes.
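For illustration, the round trip with those two functions looks roughly
like this (just the idea, not the actual patch code):

  <?php
  # Raw DEFLATE -- the same stream a gzip file uses, minus the header bytes.
  $text = "some revision text";
  $compressed = gzdeflate( $text );

  # Later, on the way back out:
  assert( gzinflate( $compressed ) === $text );
  ?>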
Compressed revisions are marked with old_flags="gzip". The old_flags
column has existed unused for quite some time, so no schema change is
necessary. The compressed data goes back into old_text; I don't think
there is a problem with storing binary data in a TEXT field, as
supposedly TEXT and BLOB differ only in matching and sorting
characteristics.
Article::getRevisionText() accepts a row object (as from wfFetchObject)
containing both old_text and old_flags fields and returns the text,
decompressing it if necessary.
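In rough outline it behaves like the sketch below; this is a simplified
illustration of the idea, not the real MediaWiki code, and the function
name is made up:

  <?php
  # Simplified sketch -- not the actual Article::getRevisionText().
  function getRevisionTextSketch( $row ) {
      $text = $row->old_text;
      if ( false !== strpos( $row->old_flags, 'gzip' ) ) {
          # Needs zlib support in PHP; otherwise gzinflate() doesn't exist.
          $text = gzinflate( $text );
      }
      return $text;
  }
  ?>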
This scheme could also work in the archive table, but there are
probably problems with undeletion that need to be checked.
So far there's no on-the-fly compression; a maintenance script
compressOld.php is provided to batch-compress old revisions. It can be
given an arbitrary starting old_id, and will run until it reaches the
end of the table or you kill it. It should be safe to run in the
background while the wiki is live; it makes single-row UPDATEs keyed by
old_id. On my 2 GHz Athlon XP, otherwise unloaded, this runs at about
10,000 rows per minute.
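Schematically the loop looks something like this. The sketch uses plain
mysql_* calls and made-up connection details rather than MediaWiki's own
database wrappers, and the real script works in batches rather than one
giant result set:

  <?php
  # Rough sketch of the batch loop -- not the actual compressOld.php.
  $conn = mysql_connect( 'localhost', 'wikiuser', 'secret' );  # made-up credentials
  mysql_select_db( 'wikidb', $conn );

  $start = 0;  # or an arbitrary old_id to resume from
  $res = mysql_query(
      "SELECT old_id, old_text, old_flags FROM old
        WHERE old_id >= $start ORDER BY old_id", $conn );
  while ( $row = mysql_fetch_object( $res ) ) {
      if ( false !== strpos( $row->old_flags, 'gzip' ) ) {
          continue;  # already compressed, skip it
      }
      $gz = mysql_real_escape_string( gzdeflate( $row->old_text ), $conn );
      # Single-row UPDATE keyed by old_id, safe while the wiki is live.
      mysql_query( "UPDATE old SET old_text='$gz', old_flags='gzip'
                     WHERE old_id={$row->old_id}", $conn );
  }
  ?>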
I haven't done any comparative testing of load times, but the effect
should be dwarfed by parse/render times and will only come up on old
and diff views and a few other rare places.
I tested with the New Year's dump of the French Wikipedia (about 200k
rows in old).
Raw dump size:
old_table.sql 1,210,368,249
old_compressed.sql 485,536,046
Space saved: ~60%
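(That figure is just the ratio of the two file sizes:

  <?php
  # 1 - 485536046/1210368249 = 0.599, i.e. roughly 60% saved.
  printf( "%.1f%%\n", 100 * ( 1 - 485536046 / 1210368249 ) );
  ?>
)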
If these ratios hold, I estimate the total savings at about 14
gigabytes, bringing our total db usage to something more like 20 GB.
This is a reasonably big improvement for very small changes in code.
(Note that the InnoDB data storage space never shrinks; to reclaim disk
space for purposes other than storing the next couple million edits
would require dumping everything and reimporting it fresh.)
There are a couple of downsides. The SQL dumps become less legible, and
old revisions won't be loadable on a MediaWiki installation whose PHP
build lacks zlib support (the default configure options don't include
it). Also, recompressing the resultant dump doesn't work out so well:
old_table.sql.bz2 199,394,376
old_compressed.sql.bz2 416,208,437
This more than doubles the size of the compressed dumps. Ouch! Well,
we should be looking at a more usable dump format anyway.
-- brion vibber (brion @ pobox.com)
[1] Ultimately we'd probably save a lot of disk space by storing diffs
between revisions, but loading an individual revision then requires
sifting through multiple revisions from the last checkpoint, and
requires extra work to ensure that intermediate revisions are not
corrupted, reordered, removed, etc. By compressing each revision
separately, we still maintain the integrity of the rest of the history
if any one revision is corrupted, if histories are reordered or
recombined, if individual revisions are plucked out or blanked for
legal reasons, etc.
Hello,
I would like to invite people to join the Arabic wikipedia, but at the
moment editing is almost impossible there (at least for me). Try it
for yourself (for example, try to create a link):
http://ar.wikipedia.org/wiki/How_to_edit_Arabic_pages
http://ar.wikipedia.org/wiki/Wikipedia:Sandbox
With the help of Ibn Alnatheer I localized some parts of the GUI, but
after half an hour's work there I need an aspirin...
Can something be done to make editing there easier? How does Hebrew
wikipedia handle the RTL stuff?
If there are unsolved problems, maybe the people from arabeyes.org can help.
greetings,
elian
PS: For recentchanges, the order (same as Hebrew):
(comment) (talk) user time article hist diff
would look much better than the current mess. Could someone please change this?
So, this is kind of a sideways suggestion, but... we just moved
Wikitravel to a Web hosting service (xlinternet.com). It's working
pretty much great, right out of the box.
Considering that MediaWiki runs on some pretty standard software (PHP,
MySQL), I wonder if it wouldn't be a good idea to leave most of the
yucky sysadmining problems up to folks who make it their business. I'm
sure that a project as big as Wikimedia could get some special
treatment.
I don't know what kind of bandwidth and storage requirements Wikimedia
has, but I doubt that they'd be insurmountable with any given Web
hosting service.
Just a suggestion to consider.
~ESP
--
Evan Prodromou <evan(a)wikitravel.org>
Wikitravel - http://www.wikitravel.org/
The free, complete, up-to-date and reliable world-wide travel guide
I did some googling tonight and found some more information about ways to
do some magic distributed caching.
An example of a proprietary product that does this is Cisco
DistributedDirector ($19,000). Anybody?
Ok, so here is the open source alternative:
* Super Sparrow http://www.supersparrow.org/
Open source, runs on Linux, and tested. It runs vergenet, for example;
you can see it in action at http://www.vergenet.net/vergenet/.
In combination with Linux Virtual Server & Heartbeat plus distributed
squid 'mirrors', this looks like a nice way to handle future growth.
IMO this is not something for the immediate future, but it's good to
keep in mind and start playing with. Maybe it would also be possible to
ask Horms (Simon Horman, http://www.vergenet.net/~horms/) for advice;
he definitely is an expert in this field.
Have a nice new year!
Gabriel Wicke
I think it was my fault that Ursula went down... The time she went
down seems to correspond roughly with the time I arrived at the colo.
The last time I was in the colo, I had my laptop configured to use what
is now Ursula's IP address. So, when I plugged my laptop in, it must
have freaked Ursula out even after I changed the IP address on the
laptop. Sorry for the trouble.
Jason
Erik Moeller wrote:
> Brion-
> > Jason's got Ursula back up, and our new machine is also installed. I'm
> > copying files over so it can take over pliny's web work and let Ursula
> > do just the db.
>
> Out of curiosity, why did Ursula go down? If the cause is unknown, could
> there be an issue with our database that might cause such crashes?
>
> Regards,
>
> Erik
--
"Jason C. Richey" <jasonr(a)bomis.com>
Does anyone know offhand how easy/difficult it would be to import stuff
sent through the backup MX into the mailing list archives on the main
server?
-- brion vibber (brion @ pobox.com)
Enjoy! http://download.wikimedia.org/
Since Geoffrin is still out of service, Ursula is serving this up from
.204.
The December update to the Tomeraider archives isn't online yet. I'll
see if I kept a local copy; if not I'll either have to get them from
Erik again or wait until Geoffrin is back up.
Happy new year, everybody... and let's not forget that Wikipedia turns 3
on January 15! The terrible twos are coming to an end. :)
-- brion vibber (brion @ pobox.com)