Hi!
You didn't address his idea one iota. Isn't this the relevant doc? http://upload.wikimedia.org/wikipedia/commons/4/41/Mediawiki-database-schema...
It is relevant for the mediawiki-l@ audience, not for wikimedia-tech@ (when it comes to Wikimedia technology, we don't rely on the default settings).
Maybe you could explain how the storage class renders his idea irrelevant?
Tim can probably explain this much better, but 'text' just provides pointers into a "storage cloud", which can be whatever you want (different ES implementations can do different things).
It can point to sub-entries in bigger blobs, and supports two methods:
a) DiffHistoryBlob - differential storage with compression on top, plus some adjustments for page blankings, etc.
b) ConcatenatedGzipHistoryBlob - just plain concatenation of revisions, with compression on top
Both already deal efficiently not only with identical but also with similar text in subsequent revisions.
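To make the concatenation-plus-compression point concrete, here is a rough Python sketch (illustrative only, not MediaWiki's actual code) of why near-duplicate revisions cost almost nothing once they share a compressed blob:

```python
import zlib

# Three near-identical revisions of a page; only one trailing line differs.
revisions = [
    "== Intro ==\nSome article text that stays the same.\n" * 20
    + f"Revision note {i}\n"
    for i in range(3)
]

# Compressing each revision separately vs. concatenating first
# (roughly the ConcatenatedGzipHistoryBlob approach).
separate = sum(len(zlib.compress(r.encode())) for r in revisions)
together = len(zlib.compress("".join(revisions).encode()))

print(separate, together)
# The text shared between revisions compresses away almost entirely,
# so the single concatenated blob is far smaller than the sum of parts.
assert together < separate
```

The same effect is why storing each revision's full text separately, even compressed, wastes space compared to blob-level storage.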
There are some other optimizations we could do (optimized packing of pointers/flags in the text table), but keep in mind that every time you edit a page:
~180 bytes are added to the revision table (plus another 200 bytes of indexes)
~300 bytes are added to recentchanges (plus another 400 bytes of indexes)
~370 bytes are added to cu_changes (300 bytes of indexes; these two tables are ring buffers, though)
text gets just 85 bytes with no additional indexing (and even that figure was skewed by a few cases where we wrote directly to it)
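Adding those figures up shows how small text's share of per-edit growth really is; a quick back-of-the-envelope in Python, using the numbers quoted above:

```python
# Approximate per-edit growth in bytes: (row data, index data),
# taken from the figures above.
per_edit = {
    "revision":      (180, 200),
    "recentchanges": (300, 400),
    "cu_changes":    (370, 300),  # ring buffer, so not permanent growth
    "text":          (85,    0),
}

total = sum(row + idx for row, idx in per_edit.values())
text_share = sum(per_edit["text"]) / total

print(total)                  # 1835 bytes per edit overall
print(round(text_share, 3))   # 0.046 -- text is under 5% of the growth
```

So even eliminating text's pointer rows entirely would barely dent the per-edit footprint.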
Even if it were possible to reduce the number of pointers in text by reusing them (one can point multiple revisions at the same text entry, as was already noted), it would make maintenance/batch operations much more complicated. Also, as blobs can get migrated, transformed, etc., it is better to do that in a separate table, without touching the bigger 'revision' monster in the long run.
Also, if one wanted to know 'which revision does this text belong to', another index would have to be added to revision, which isn't needed with our one-direction join approach. There are lots and lots of things you really don't want to do for a 1/7 storage cut. If we had always put storage cuts first, MediaWiki would not be able to do what it can do now.
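For completeness, the proposed pointer reuse amounts to deduplicating text rows by content hash, roughly like this (a sketch of the idea only, with made-up table structures, not anything MediaWiki actually does):

```python
import hashlib

text_table = {}       # text_id -> content
text_by_hash = {}     # sha1 hex digest -> text_id
revision_table = {}   # rev_id -> text_id

def save_revision(rev_id, content):
    """Point identical revisions at the same text row instead of adding a new one."""
    digest = hashlib.sha1(content.encode()).hexdigest()
    text_id = text_by_hash.get(digest)
    if text_id is None:
        text_id = len(text_table) + 1
        text_table[text_id] = content
        text_by_hash[digest] = text_id
    revision_table[rev_id] = text_id

save_revision(1, "same text")
save_revision(2, "same text")   # e.g. a revert: reuses the existing text row
save_revision(3, "new text")

print(len(text_table))  # 2 text rows for 3 revisions
# ...but answering "which revisions use this text row?" now needs a
# reverse index on revision, which the one-direction join avoids.
```

The comments show exactly where the maintenance cost comes in: once rows are shared, batch migrations of blobs and any text-to-revision lookups get harder.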
I am not against efficiency overall, but there are always tradeoffs.
Anyway, here's a somewhat more visual representation of our data sizes within the core databases: http://spreadsheets.google.com/pub?key=pfjIQrTbpVkaIStok1hWAdg