Hi,
I was reading about MySQL's Falcon engine (which appears to have reached an alpha branch), and was wondering if anyone had tried it with MediaWiki, and for that matter with the whole Wikipedia dataset. Curious to know how it behaves, and how much of an efficiency gain its compressed storage gives...
Alex
Alex,
> I was reading about MySQL's Falcon engine (which appears to have reached an alpha branch), and was wondering if anyone had tried it with MediaWiki, and for that matter with the whole Wikipedia dataset. Curious to know how it behaves, and how much of an efficiency gain its compressed storage gives...
Falcon isn't suitable for running the Wikipedia dataset at the moment. It doesn't have covering indexes (all reads hit the data rows), it doesn't have the 'ORDER BY ... LIMIT' optimization, it is hungry for filesorts, etc.
MediaWiki's primary engine is InnoDB; some sites may attempt to use MyISAM, though that isn't well supported...
For some wikipedias, though, BLACKHOLE seems to be the best engine MySQL has produced.
Domas
> Alex,
>> I was reading about MySQL's Falcon engine (which appears to have reached an alpha branch), and was wondering if anyone had tried it with MediaWiki, and for that matter with the whole Wikipedia dataset. Curious to know how it behaves, and how much of an efficiency gain its compressed storage gives...
> Falcon isn't suitable for running the Wikipedia dataset at the moment. It doesn't have covering indexes (all reads hit the data rows), it doesn't have the 'ORDER BY ... LIMIT' optimization, it is hungry for filesorts, etc.
> MediaWiki's primary engine is InnoDB; some sites may attempt to use MyISAM, though that isn't well supported...
> For some wikipedias, though, BLACKHOLE seems to be the best engine MySQL has produced.
Isn't it lacking replication too, which is rather important?
Jared
Interesting, if what you say is true. I assumed that limits and indexes were handled by the primary MySQL engine, not at the data-container level. I also noted from another email about data dumps that Wikipedia was moving to filesystem-based article storage - it makes sense to extend the message-handling-from-file code to cover articles too in the future. This would leave the DB primarily as an index.
Something that occurred to me that might be a good idea (TM) is to link up to Subversion for the file repositories. This would have big space benefits, as only deltas would be stored, and it would give a more powerful view of the data. Link this up with file-based articles and you have the potential to (relatively) easily produce a standalone wiki engine that could work remotely, much as devs do. I guess the benefit for Wikipedia is slight, so it probably wouldn't happen in MW...
On 1/13/07, Domas Mituzas midom.lists@gmail.com wrote:
> Alex,
>> I was reading about MySQL's Falcon engine (which appears to have reached an alpha branch), and was wondering if anyone had tried it with MediaWiki, and for that matter with the whole Wikipedia dataset. Curious to know how it behaves, and how much of an efficiency gain its compressed storage gives...
> Falcon isn't suitable for running the Wikipedia dataset at the moment. It doesn't have covering indexes (all reads hit the data rows), it doesn't have the 'ORDER BY ... LIMIT' optimization, it is hungry for filesorts, etc.
> MediaWiki's primary engine is InnoDB; some sites may attempt to use MyISAM, though that isn't well supported...
> For some wikipedias, though, BLACKHOLE seems to be the best engine MySQL has produced.
> Domas
Alex Powell wrote:
> Something that occurred to me that might be a good idea (TM) is to link up to Subversion for the file repositories. This would have big space benefits, as only deltas would be stored, and it would give a more powerful view of the data.
Storing diffs was discussed before. It wasn't an improvement, as the space used would be similar to that of the compressed text used now. And for some articles the deltas against the full article source could be large.
I didn't mean for text, though that was the logical conclusion. I meant for the image and media repositories. SVN seems to be quite good at managing deltas between binaries these days...
On 1/14/07, Platonides Platonides@gmail.com wrote:
> Alex Powell wrote:
>> Something that occurred to me that might be a good idea (TM) is to link up to Subversion for the file repositories. This would have big space benefits, as only deltas would be stored, and it would give a more powerful view of the data.
> Storing diffs was discussed before. It wasn't an improvement, as the space used would be similar to that of the compressed text used now. And for some articles the deltas against the full article source could be large.
On 1/15/07, Alex Powell alexp@exscien.com wrote:
> I didn't mean for text, though that was the logical conclusion. I meant for the image and media repositories. SVN seems to be quite good at managing deltas between binaries these days...
I suspect that for most MediaWiki applications, binary deltas are not very useful for media content. Revised versions of images tend not to have much binary similarity to prior or subsequent versions; it's more common for a revised version to be subjected to a global change (such as resizing, recropping, or a global color balance) than to a local change. The same is probably true of most audio media.
In my experience doing enterprise disaster recovery and backup, subfile incrementals are usually only useful for uncompressed, unencrypted, segmented files updated in a chunky manner (most database systems), and obviously for logfiles, which are always appended to or used in a circular manner. None of these file types is likely to be frequently uploaded to a MediaWiki installation in most applications I can envision. Any file which is completely rewritten every time it is touched (which includes virtually all media formats other than pure, uncompressed raw formats) will not benefit from incremental subfile versioning.
Kelly
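(A toy sketch of that effect, using synthetic bytes as a stand-in for media files, so the figures are only indicative. The "edit" here is global - every byte shifts, much as a re-encode or color-balance pass rewrites the whole file - which leaves a byte-level delta nothing to reuse.)

# Synthetic stand-in for a media file: incompressible bytes, much like
# already-encoded media. A "global" edit changes every byte, the way a
# re-encode or color adjustment rewrites the whole file.
import random, zlib

random.seed(1)
raw_v1 = bytes(random.randrange(256) for _ in range(64 * 1024))
raw_v2 = bytes((b + 1) % 256 for b in raw_v1)   # every byte differs

separate = len(zlib.compress(raw_v1)) + len(zlib.compress(raw_v2))
together = len(zlib.compress(raw_v1 + raw_v2))

print("two versions stored separately: %d bytes" % separate)
print("both versions compressed together: %d bytes" % together)
# The two figures come out nearly identical: there are no shared byte
# sequences for delta or dictionary compression to exploit, so versioning
# such files saves essentially nothing over keeping full copies.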
Hi!
> Interesting, if what you say is true. I assumed that limits and indexes were handled by the primary MySQL engine, not at the data-container level.
The container provides the access methods to the data in it, so you can have different kinds of indexes, access methods and optimizations inside.
Say a container can efficiently return rows ordered by an index (as InnoDB does); then MySQL doesn't have to do any sorting afterwards. Falcon, on the other hand, may in some cases provide better performance for unordered rows, but the way it reads data makes covering index reads and ordered index reads impossible.
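(To make the covering-index and 'ORDER BY ... LIMIT' point concrete, here is a rough sketch that EXPLAINs a MediaWiki-shaped query. It assumes the MySQLdb driver and MediaWiki's page table with its (page_namespace, page_title) index; the host, user, password and database name are placeholders.)

# Sketch only: EXPLAIN a query of the shape discussed above.
# Connection details are placeholders; point them at a real MediaWiki DB.
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="wikiuser",
                       passwd="secret", db="wikidb")
cur = conn.cursor()

# The page table's (page_namespace, page_title) index lets InnoDB satisfy
# the WHERE clause, the ORDER BY and the select list from the index alone:
# no row lookups, no filesort. An engine without covering-index or ordered
# index reads would have to touch the data rows (and possibly sort) instead.
cur.execute("""
    EXPLAIN
    SELECT page_title
      FROM page
     WHERE page_namespace = 0
     ORDER BY page_title
     LIMIT 50
""")
for row in cur.fetchall():
    print(row)   # on InnoDB, the Extra column should include "Using index"

cur.close()
conn.close()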
> This would leave the DB primarily as an index.
This is what we use the core DBs for now. Though articles are stored in external instances, those are still MySQL (it is easier to handle replication that way ;-)
> Something that occurred to me that might be a good idea (TM) is to link up to Subversion for the file repositories. This would have big space benefits, as only deltas would be stored, and it would give a more powerful view of the data.
Delta storage isn't that much different in terms of efficiency from compressing concatenated text. Moreover, the "view of the data" wouldn't be extended that much - we still have our own version of wiki diffing, which is somewhat different from binary deltas.
With replication added (it is trivial, of course, for the kind of operations we're doing now), it may be more difficult to maintain a Subversion-based repository (or some other kind of versioning system). Of course, it is possible.
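(A quick back-of-the-envelope check of the delta-versus-concatenated-compression point, using made-up revision text rather than real dump data, so the numbers are only indicative.)

# Two similar revisions of an invented article: compare the marginal cost
# of appending the new revision to the old one and recompressing with the
# size of a compressed unified diff.
import difflib, zlib

rev1 = "".join("Line %d of the article body, with some running prose.\n" % i
               for i in range(300))
rev2 = rev1.replace("Line 150 of the article body",
                    "Line 150 of the article body, recently edited", 1)

old, new = rev1.encode("utf-8"), rev2.encode("utf-8")
diff = "".join(difflib.unified_diff(rev1.splitlines(True),
                                    rev2.splitlines(True))).encode("utf-8")

print("new revision compressed on its own: %d bytes" % len(zlib.compress(new)))
print("extra bytes after appending it to the old revision and recompressing: %d"
      % (len(zlib.compress(old + new)) - len(zlib.compress(old))))
print("compressed unified diff: %d bytes" % len(zlib.compress(diff)))
# The last two figures usually land in the same ballpark, which is roughly
# why explicit deltas don't buy much over concatenate-and-compress.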
> Link this up with file-based articles and you have the potential to (relatively) easily produce a standalone wiki engine that could work remotely, much as devs do. I guess the benefit for Wikipedia is slight, so it probably wouldn't happen in MW...
MediaWiki isn't just storing articles. It is storing lots of information about relations between articles, tracking actions, changes, various metadata, etc.
I guess there are wiki engines with Subversion storage, but... we don't hear too much about them.
-- Domas http://dammit.lt/