I'm curious: what does SELECT COUNT(DISTINCT old_text), COUNT(*) FROM text; show on Wikipedia's database? On mine I get COUNT(DISTINCT old_text): 2913 and COUNT(*): 3560, i.e., about 18% of the rows are redundant.
Currently, undos (so frequent on wikis) just blindly create a duplicate row instead of checking whether the old one could be reused: https://bugzilla.wikimedia.org/show_bug.cgi?id=18333 . Maybe some hardware savings could even be achieved.
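For the curious, a query along these lines lists which texts are duplicated, not just how many (a sketch; MD5() is used purely as a compact GROUP BY key for the blob column):

  -- list duplicated revision texts, most-copied first
  SELECT MD5(old_text) AS text_hash, COUNT(*) AS copies
  FROM text
  GROUP BY text_hash
  HAVING copies > 1
  ORDER BY copies DESC;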
On Tue, Apr 7, 2009 at 8:57 AM, jidanni@jidanni.org wrote:
> Currently, undos (so frequent on wikis) just blindly create a duplicate row instead of checking whether the old one could be reused: https://bugzilla.wikimedia.org/show_bug.cgi?id=18333 . Maybe some hardware savings could even be achieved.
From my understanding they have to be kept within the system to keep us within the GFDL licensing terms.
2009/4/7 K. Peachey p858snake@yahoo.com.au:
>> Currently, undos (so frequent on wikis) just blindly create a duplicate row instead of checking whether the old one could be reused: https://bugzilla.wikimedia.org/show_bug.cgi?id=18333 . Maybe some hardware savings could even be achieved.
> From my understanding they have to be kept within the system to keep us within the GFDL licensing terms.
The text doesn't need to be stored twice, though; just a note that it is the same text. However, I believe the Wikipedia databases have the text compressed, so that is effectively what happens already.
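A minimal sketch of that idea against the stock schema (revision.rev_text_id points at text.old_id; @new_text and the 'utf-8' flag value are illustrative):

  -- before inserting, look for an existing identical row
  SELECT old_id FROM text
  WHERE old_text = @new_text
    AND old_flags = 'utf-8'   -- rows only compare equal under the same flags
  LIMIT 1;
  -- if found, point the new revision's rev_text_id at that old_id;
  -- otherwise insert a fresh text row as today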
2009/4/7 K. Peachey p858snake@yahoo.com.au:
> From my understanding they have to be kept within the system to keep us within the GFDL licensing terms.
Whuh? That doesn't make sense. If an edit is reverted, it's had no effect!
- d.
Hello Jidanni,
Isn't this a bit redundant with your 'store-by-sha1' topic?
> Currently, undos (so frequent on wikis) just blindly create a duplicate row instead of checking whether the old one could be reused,
mediawiki-l
Seriously, if you think that wikimedia operations is a bunch of cretins where your simplistic observations are so much needed, you should try other paths of influencing, like staging a rebellion or something :-)
If you actually paid attention to the actual mediawiki capabilities (like all the code in http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/maintenance/storage/ ) or any of our site docs, you'd realize that we have the code and methods to facilitate much more efficient compression than the one you are aiming at, way higher scalability, etc.
> Maybe some hardware savings could even be achieved.
You know that for a while we had virtually no hardware dedicated/needed for text storage? We were using free disks that came with application servers. If you ever paid attention to how Wikipedia runs, you'd know that. Since you don't, you end up being slightly too paternal.
Actually, we're discontinuing the 'hundreds of storage nodes' practice and will consolidate them back together (mostly for easier-to-manage reasons, but that may lead to way higher availability, etc) - but that already takes care of more than just entirely-same text, and still, even that ~18% wouldn't matter much, as hardware for that increments in multiple terabytes (and no, it isn't insanely priced).
Cheers,
You didn't address his idea one iota. Isn't this the relevant doc? http://upload.wikimedia.org/wikipedia/commons/4/41/Mediawiki-database-schema...
Maybe you could explain how the storage class renders his idea irrelevant?
Hi!
> You didn't address his idea one iota. Isn't this the relevant doc? http://upload.wikimedia.org/wikipedia/commons/4/41/Mediawiki-database-schema...
It is relevant for the mediawiki-l@ audience, not for wikimedia-tech@ (if we get to wikimedia technology, it doesn't rely on default settings).
> Maybe you could explain how the storage class renders his idea irrelevant?
Tim can probably explain this much better, but 'text' just provides pointers to a "storage cloud", which can be whatever you want (different ES implementations can do different things).
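To make that concrete: with External Storage, text rows hold addresses rather than wikitext. An illustrative peek (the values are invented, but the DB://cluster/id and DB://cluster/id/itemid forms are what the ExternalStoreDB backend uses):

  -- old_flags marks the row as a pointer; old_text holds the address
  SELECT old_id, old_flags, old_text FROM text LIMIT 2;
  -- old_id | old_flags       | old_text
  --  12345 | external,utf-8  | DB://cluster5/67890
  --  12346 | external,object | DB://cluster5/67891/8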
It can point to sub-entries in bigger blobs, and supports two methods:
a) DiffHistoryBlob - differential storage that has passed compression, with some adjustments for page blankings, etc.
b) ConcatenatedGzipHistoryBlob - just plain concatenation of revisions, with compression on top.
Both already guard against not only identical but also similar text in subsequent revisions.
There are some other optimizations we could do (optimized packing of pointers/flags in the text table), but keep in mind that every time you edit a page:
~180 bytes are added to the revision table (plus ~200 bytes of indexing)
~300 bytes are added to recentchanges (plus ~400 bytes of indexing)
~370 bytes are added to cu_changes (plus ~300 bytes of indexing; these two tables are ring buffers, though)
~85 bytes go to text, with no additional indexing (and even that figure was skewed by a few cases where we wrote directly to it)
That is roughly 935 bytes of row data plus ~900 bytes of index per edit, so the text pointer's 85 bytes is only about 5% of the total.
Even if it were possible to reduce the number of pointers in text by reusing them (one can point the same text entry at multiple revisions, as already noted), it could make maintenance/batch operations much more complicated. Also, as blobs can get migrated, transformed, etc., it is better to do that in a separate table, without touching the bigger 'revision' monster in the long run.
Also, if one wanted to know 'what revision does this text belong to', another index would have to be added to revision, which is not necessary with our one-direction join approach. There are lots and lots of things you really don't want to do for an ~18% storage cut. If we only ever cared about storage cuts, mediawiki would not be able to do what it can do now.
I am not against efficiency overall, but there are always tradeoffs.
Anyway, here's a somewhat more visual view of our data sizes within the core databases: http://spreadsheets.google.com/pub?key=pfjIQrTbpVkaIStok1hWAdg
Hi!
Domas gave a very interesting link above:
> Anyway, here's a somewhat more visual view of our data sizes within the core databases: http://spreadsheets.google.com/pub?key=pfjIQrTbpVkaIStok1hWAdg
I suppose in the s1 worksheet data_length and index_length are in bytes, so the enwiki database size is approximately (data_length + index_length) / (1024 * 1024 * 1024) GB, which comes to roughly 334 GB.
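Presumably the sheet was produced with something along these lines against information_schema (a guess at the method, not the actual script):

  -- per-schema data and index totals, in GB
  SELECT table_schema,
         ROUND(SUM(data_length)  / POW(1024, 3), 1) AS data_gb,
         ROUND(SUM(index_length) / POW(1024, 3), 1) AS index_gb
  FROM information_schema.TABLES
  WHERE table_schema = 'enwiki'
  GROUP BY table_schema;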
But how can I interpret the s2/s3 worksheets? What do the data and index columns mean, and in what unit?
I'm just curious :)
Farewell, Glanthor
Hi!
> But how can I interpret the s2/s3 worksheets? What do the data and index columns mean, and in what unit?
composite data for all tables + composite index for all tables, per database, in megabytes.
Domas Mituzas wrote:
> Even if it were possible to reduce the number of pointers in text by reusing them (one can point the same text entry at multiple revisions, as already noted), it could make maintenance/batch operations much more complicated.
It's already this way, as rollback reuses (reused?) the pointers, so making other cases also reuse the text entry doesn't add any complexity.
> Anyway, here's a somewhat more visual view of our data sizes within the core databases: http://spreadsheets.google.com/pub?key=pfjIQrTbpVkaIStok1hWAdg
How come commonswiki is on both s2 and s3? (The one on s2 is 10 times larger.)
> How come commonswiki is on both s2 and s3? (The one on s2 is 10 times larger.)
Ignore the smaller one; it's an old copy from when Commons was moved to the other cluster.
-- daniel
jidanni@jidanni.org wrote:
> I'm curious: what does SELECT COUNT(DISTINCT old_text), COUNT(*) FROM text; show on Wikipedia's database? On mine I get COUNT(DISTINCT old_text): 2913 and COUNT(*): 3560, i.e., about 18% of the rows are redundant.
On Wikimedia wikis, text is stored in compressed blobs on separate database clusters. There is no way to get this information efficiently. If you want it, walk through a full history dump and store hashes for each revision's text.
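If the hashes from such a dump walk were loaded into a scratch table, say rev_sha1(rev_id INT, sha1 CHAR(40)) (a hypothetical table, named here only for illustration), the redundancy figure falls out directly:

  -- same ratio as the original query, computed from dump-derived hashes
  SELECT COUNT(DISTINCT sha1) AS distinct_texts,
         COUNT(*)             AS revisions
  FROM rev_sha1;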
> Currently, undos (so frequent on wikis) just blindly create a duplicate row instead of checking whether the old one could be reused: https://bugzilla.wikimedia.org/show_bug.cgi?id=18333 . Maybe some hardware savings could even be achieved.
For *reverts* this is already done, and that is indeed the only situation where it is reliable. "Undo" can also be applied to older changes; it is basically a reverse patch. That is, the result of an undo may be a text that is different from all previous revisions. However, when undoing the *last* edit, it is indeed equivalent to a revert. Perhaps MediaWiki could make use of that.
Because multiple revisions are compressed together into one blob, redundant text is not so bad - but it's only "nice" if both copies end up in the same blob. This is increasingly the case since Tim Starling implemented the revision reordering thingy.
-- daniel
It turns out it is very easy, http://bug-attachment.wikimedia.org/attachment.cgi?id=5997 , to squeeze the current duplication out of the text table, for us little guys.
I suppose I'll just do that often, as there is little interest in stopping new duplication from coming in.
Who cares, when you have all those fancy intelligent backend storage systems.
But when all you use is MediaWiki, then you're on your own when you want to reduce the size of your mysqldumps, compressed or not.
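The linked script aside, the core of such a squeeze can be sketched in plain SQL (assumes the stock schema, uncompressed old_text, and that only revision references text; a real wiki would also need to repoint the archive table, and a paranoid one would compare old_text itself rather than trusting MD5 alone):

  -- 1) repoint every revision at the lowest old_id holding identical text
  UPDATE revision r
  JOIN text t ON r.rev_text_id = t.old_id
  JOIN (SELECT MIN(old_id) AS keep_id,
               MD5(old_text) AS h, old_flags AS f
        FROM text
        GROUP BY MD5(old_text), old_flags) k
    ON MD5(t.old_text) = k.h AND t.old_flags = k.f
  SET r.rev_text_id = k.keep_id;

  -- 2) drop the text rows nothing points at any more
  DELETE t FROM text t
  LEFT JOIN revision r ON r.rev_text_id = t.old_id
  WHERE r.rev_text_id IS NULL;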
> However, when undoing the *last* edit, it is indeed equivalent to a revert. Perhaps MediaWiki could make use of that.
Just check whether the sizes are equal, then whether the texts are equal: https://bugzilla.wikimedia.org/show_bug.cgi?id=18333
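A sketch of what that pre-save check could look like (hypothetical @page, @new_len, @new_text parameters; assumes uncompressed old_text; rev_len makes the cheap length test possible before touching the blob):

  -- does some revision of this page already hold the would-be text?
  SELECT r.rev_text_id
  FROM revision r
  JOIN text t ON t.old_id = r.rev_text_id
  WHERE r.rev_page = @page
    AND r.rev_len = @new_len      -- cheap size check first
    AND t.old_text = @new_text    -- full comparison only on a size match
  ORDER BY r.rev_timestamp DESC
  LIMIT 1;
  -- reuse the returned rev_text_id instead of inserting a duplicate row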
> But when all you use is MediaWiki, then you're on your own when you want to reduce the size of your mysqldumps, compressed or not.
that's a topic for mediawiki-l
On Wed, Apr 8, 2009 at 5:53 PM, Domas Mituzas midom.lists@gmail.com wrote:
> that's a topic for mediawiki-l
You know, wikitech-l is for MediaWiki development discussion as well, not just site operations. mediawiki-l is only for MediaWiki support, not for all MediaWiki discussion. At least, that's what I've always been told, and Meta agrees:
> You know, wikitech-l is for MediaWiki development discussion as well, not just site operations. mediawiki-l is only for MediaWiki support, not for all MediaWiki discussion.
It is for wikimedia's mediawiki development - all our wikis surely need lots of that. But when it comes to "when all you use is MediaWiki, then you're on your own": I'm not suggesting that you're on your own, I'm suggesting the nice people on mediawiki-l.
Oh, and those problems, they definitely sound like support questions.
If you feel there is no arena to "discuss" general mediawiki rather than "support", we may assist you with communications infrastructure for that.
You know, foundation-l should also be used for mediawiki discussion (oh, sometimes it is); still, mediawiki, wikimedia, what's the diff?
> At least, that's what I've always been told, and Meta agrees:
It is a wiki! http://meta.wikimedia.org/w/index.php?title=Mailing_lists%2Foverview&dif... ;-)
"Domas Mituzas" midom.lists@gmail.com wrote in message news:C1EFA9F5-0AF0-4203-BDB9-A3055FEC52A6@gmail.com...
>> You know, wikitech-l is for MediaWiki development discussion as well, not just site operations. mediawiki-l is only for MediaWiki support, not for all MediaWiki discussion.
> It is for wikimedia's mediawiki development - all our wikis surely need lots of that. But when it comes to "when all you use is MediaWiki, then you're on your own": I'm not suggesting that you're on your own, I'm suggesting the nice people on mediawiki-l.
>> At least, that's what I've always been told, and Meta agrees:
> It is a wiki! http://meta.wikimedia.org/w/index.php?title=Mailing_lists%2Foverview&dif... ;-)
That doesn't seem right to me, and it completely contradicts my expectations and the reason I've been using this list for the past however many years:
My understanding:
* mediawiki-l - support
* wikitech-l - development
That seems like a much clearer and more sensible division than:
* mediawiki-l - development
* wikitech-l - development, but only if it's relevant to Wikimedia
You're right - it is a wiki.... which is why I trust the information that's been there since 2006, rather than the information that was added this morning.
- Mark Clements (HappyDog)
Good catch
> * mediawiki-l - support
> * wikitech-l - development
*shrug*, probably we should be solving world hunger in wikitech too then, if we're going to discuss every mediawiki user's needs on this list. OK with me. I just said my opinion - none of this is appropriate to wikimedia technology, nor to wikimedia users.
small addendum - nor to mediawiki users who run compression scripts in maintenance/storage/
"Domas Mituzas" midom.lists@gmail.com wrote in message news:E231DB4D-6A58-494F-9314-77CCFEBE6E99@gmail.com...
>> * mediawiki-l - support
>> * wikitech-l - development
> *shrug*, probably we should be solving world hunger in wikitech too then, if we're going to discuss every mediawiki user's needs on this list. OK with me. I just said my opinion - none of this is appropriate to wikimedia technology, nor to wikimedia users.
I don't see how world hunger is connected to either Wikimedia or MediaWiki. If it is, then by all means discuss it here; otherwise it seems a little off-topic...
Wikimedia users: wikipedia-l, wikisource-l, etc.
Wikimedia technology: wikitech-l
MediaWiki users: mediawiki-l
MediaWiki technology: wikitech-l
I don't see what the confusion is.... oh, wait, you mean discussing two different but highly interconnected things in one newsgroup is confusing? Oh well, I'm sure you'll come to terms with it eventually.
- Mark Clements (HappyDog)
Hi!
> Oh well, I'm sure you'll come to terms with it eventually.
yup. sorry. that's all me.
On Mon, Apr 6, 2009 at 3:57 PM, jidanni@jidanni.org wrote:
> I'm curious: what does SELECT COUNT(DISTINCT old_text), COUNT(*) FROM text; show on Wikipedia's database? On mine I get COUNT(DISTINCT old_text): 2913 and COUNT(*): 3560, i.e., about 18% of the rows are redundant.
As others have noted, Wikimedia compresses everything and doesn't really store lots of redundant text.
That said, past analysis of edit summaries suggests that about 1 edit in 10 is a revert on enwiki.
-Robert Rohde