Hello,
I am currently developing a test suite for the XML dumps, and I am curious about the specification of text.old_flags in MediaWiki's maintenance/tables.sql. The file describes the 'object' flag as:
text field contained a serialized PHP object. object either contains multiple versions compressed to achieve a better compression ratio, or it refers to another row where the text can be found.
Is the „multiple versions” part still used in some project? If so, how should this be set up [1]?
Kind regards, Christian
P.S.: In #wikimedia-dev I was told to bring up the question on this list. If there are other lists where I should ask, please let me know.
[1] Before r6138 (back then still in Article.php, not Revision.php), it seems the text was obtained by $object = unserialize( $text ); $text = $object->getItem( $hash ); . There it is fairly obvious how a single object may return different texts. However, beginning with r6138, it seems the text is simply fetched by $obj = unserialize( $text ); [...] $text = $obj->getText(); . If a single object should return different texts, how does it determine which text to return?
I don't know whether any texts stored directly in the text table on the production cluster contain multiple compressed revisions. There are certainly some revision texts not stored in external store that consist of serialized objects (perhaps broken) or gzipped data. I just verified this by looking at some older entries in the text table for eo.wikipedia.
Looking at http://wikitech.wikimedia.org/view/Text_storage_data it appears that we may indeed have a few problematic entries lying around:
216694 219570 2876 object/concatenatedgziphistoryblob
If it will help you for testing, I'll try to track down a few such revisions. What the test suite should do right now is whatever the current code does when asked to fetch the revision. At some point we need to go through all the old revision texts and patch up anything broken to the extent possible. It will be a lot of work.
Ariel
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
Hi Ariel,
On Wed, Feb 08, 2012 at 12:50:25PM +0200, Ariel T. Glenn wrote:
> I don't know if there are any texts stored in the text table directly that contain multiple compressed revisions, on the production cluster.
Ok. Thanks.
> There are certainly some revision texts not stored in external store, which consist of serialized objects (perhaps broken) or gzipped data.
Local objects (with a single text), local gzip, and combinations thereof are fine. I had already tested them.
By reading more code, I now see that I misunderstood the 'object' description in maintenance/tables.sql.
Just for the archives: The description
text field contained a serialized PHP object. object either contains multiple versions compressed to achieve a better compression ratio, or it refers to another row where the text can be found.
does not mean that the 'object' flag (for locally stored rows) allows a row to refer to a different row.
Rather, the situation is the following:
+--------+----------+--------------+
| old_id | old_text | old_flags    |
+--------+----------+--------------+
|      1 | SER_OBJA | object,utf-8 |
|      2 | SER_OBJB | object,utf-8 |
|      3 | SER_OBJC | object,utf-8 |
+--------+----------+--------------+
SER_OBJA is a serialized object, whose unserialized representation's method getText() yields the old_text for old_id 1.
SER_OBJB is a serialized object, whose unserialized representation's method getText() yields the old_text for old_id 2.
SER_OBJC is a serialized object, whose unserialized representation's method getText() yields the old_text for old_id 3.
Nothing fancy. End of story.
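To make the plain case concrete, here is a minimal sketch of what I understand the 'object' flag to imply for a single row. SingleTextBlob, decodeRow() and the $rows array are hypothetical names made up for illustration; they are not MediaWiki code, and the real lookup also handles other flags (gzip, external store) that are left out here.

```php
<?php
// Hypothetical stand-in for a class whose instances are stored serialized
// in old_text. This is NOT an actual MediaWiki class.
class SingleTextBlob {
    private $text;

    public function __construct( $text ) {
        $this->text = $text;
    }

    // The only method the 'object' flag relies on in this sketch.
    public function getText() {
        return $this->text;
    }
}

// Hypothetical helper: decode one row of a fake text table according to
// its old_flags. Only the 'object' flag is handled here.
function decodeRow( array $row ) {
    $flags = explode( ',', $row['old_flags'] );
    $text = $row['old_text'];
    if ( in_array( 'object', $flags ) ) {
        $obj = unserialize( $text );   // 1. unserialize
        $text = $obj->getText();       // 2. call getText()
    }
    return $text;
}

// Fake rows modelling the table above, keyed by old_id.
$rows = array(
    1 => array(
        'old_text'  => serialize( new SingleTextBlob( 'text of revision 1' ) ),
        'old_flags' => 'object,utf-8',
    ),
    2 => array(
        'old_text'  => serialize( new SingleTextBlob( 'text of revision 2' ) ),
        'old_flags' => 'object,utf-8',
    ),
);

foreach ( $rows as $id => $row ) {
    echo $id, ': ', decodeRow( $row ), "\n";
}
```

Each row is decoded on its own; nothing in the flag itself points at another row.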
However, it is up to the objects how to implement getText().
So for example, SER_OBJA may have further methods (e.g. getItem( int hash ) ) and can make further texts available through such an additional method.
Then SER_OBJB can act as a proxy, forwarding its (i.e. SER_OBJB's) own parameterless getText() call to (unserialize(SER_OBJA))->getItem( SOME_CONSTANT_HASH ). The magic (fetching SER_OBJA, unserializing, ...) is hidden within SER_OBJB.
Furthermore, SER_OBJC may again act as a proxy, just as SER_OBJB did, but SER_OBJC will likely use a different SOME_CONSTANT_HASH.
Typically, SER_OBJA would be a ConcatenatedGzipHistoryBlob, and SER_OBJB and SER_OBJC would be HistoryBlobStubs.
Hence, the description in tables.sql is somewhat accurate, but overly detailed, and it thereby hints in a different direction (for local objects). It seems that all the 'object' flag denotes is: 1. unserialize. 2. call getText().
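The proxy arrangement can be sketched roughly as follows. MultiTextBlob, StubBlob and FakeTextTable are hypothetical, simplified stand-ins that I made up for illustration; the real ConcatenatedGzipHistoryBlob and HistoryBlobStub classes differ in detail (for one thing, compression is left out here, and using md5 as the hash is an assumption of this sketch).

```php
<?php
// Hypothetical in-memory replacement for the text table, keyed by old_id.
class FakeTextTable {
    public static $rows = array();
}

// Hypothetical container holding several texts, addressed by hash.
class MultiTextBlob {
    private $items = array();
    private $defaultHash = null;

    // Store a text under its hash and return that hash.
    public function addItem( $text ) {
        $hash = md5( $text );
        $this->items[$hash] = $text;
        if ( $this->defaultHash === null ) {
            $this->defaultHash = $hash;
        }
        return $hash;
    }

    public function getItem( $hash ) {
        return isset( $this->items[$hash] ) ? $this->items[$hash] : false;
    }

    // Parameterless access yields the first text that was added.
    public function getText() {
        return $this->getItem( $this->defaultHash );
    }
}

// Hypothetical proxy: remembers which row holds the container and which
// hash to ask for, and hides the lookup behind getText().
class StubBlob {
    private $oldId;
    private $hash;

    public function __construct( $oldId, $hash ) {
        $this->oldId = $oldId;
        $this->hash = $hash;
    }

    public function getText() {
        $row = FakeTextTable::$rows[$this->oldId];
        $container = unserialize( $row['old_text'] );
        return $container->getItem( $this->hash );
    }
}

$blob = new MultiTextBlob();
$hash1 = $blob->addItem( 'text of revision 1' );
$hash2 = $blob->addItem( 'text of revision 2' );
$hash3 = $blob->addItem( 'text of revision 3' );

FakeTextTable::$rows = array(
    1 => array( 'old_text' => serialize( $blob ), 'old_flags' => 'object,utf-8' ),
    2 => array( 'old_text' => serialize( new StubBlob( 1, $hash2 ) ), 'old_flags' => 'object,utf-8' ),
    3 => array( 'old_text' => serialize( new StubBlob( 1, $hash3 ) ), 'old_flags' => 'object,utf-8' ),
);

// The caller treats every row identically: unserialize, then getText().
foreach ( FakeTextTable::$rows as $id => $row ) {
    $obj = unserialize( $row['old_text'] );
    echo $id, ': ', $obj->getText(), "\n";
}
```

The caller never needs to know whether a row holds the container or a stub; the only difference is what getText() does internally.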
That's a terrific page. Too bad I did not find it myself :(
Thanks a lot for your help.
Kind regards, Christian