Lee Daniel Crocker wrote:
Proposed solution: Let's create a standardized file format (probably something XML-ish) for storing the information contained in a wiki. All the text, revisions, meta-data, and so on would be stored in a well-defined format, so that, for example, upgrading the wiki software (from any version to any other--no need to do one at a time!) could be done by exporting the wiki into this format and then importing it into the new installation. The export format would be publishable
It sounds so easy. But would you accept this procedure if it requires that Wikipedia be unavailable or read-only for one hour? For one day? For one week? The conversion time should be a design requirement.
My experience is that software bandwidth (and sometimes hardware bandwidth) is the limiting factor. The dump will be X bytes big, and the export/import procedures will pump at most Y bytes/second, so the whole procedure takes X/Y seconds to complete. If you get acceptable numbers for the Estonian Wikipedia (say, 23 minutes), it will come as a surprise that the English Wikipedia is so many times bigger (say, 3 days). You might also hit an error (hardware problem, power outage, whatever) after 75% of the work is done and have to restart from the beginning.
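As a rough sketch of that argument: total time scales linearly with dump size, and every failed attempt throws away the work already done. The sizes and the 1 MB/s rate below are placeholders chosen only to reproduce the illustrative 23-minute and 3-day figures, not measured values.

```python
def conversion_time(x_bytes, y_bytes_per_s, restarts=0, failure_fraction=0.75):
    """Wall-clock seconds to pump x_bytes at y_bytes_per_s,
    counting work thrown away by `restarts` failures that each
    abort after `failure_fraction` of a pass."""
    one_pass = x_bytes / y_bytes_per_s
    return one_pass + restarts * failure_fraction * one_pass

Y = 1_000_000  # 1 MB/s software bandwidth (assumed)

small = conversion_time(1.4e9, Y)   # Estonian-sized dump: 1400 s, ~23 min
big = conversion_time(2.6e11, Y)    # English-sized dump: 260,000 s, ~3 days
retry = conversion_time(2.6e11, Y, restarts=1)  # one failure at 75%: ~5.3 days
```

The point of the `restarts` term: a single failure near the end of a multi-day run nearly doubles the downtime, which is why X/Y has to be small before the procedure is operationally acceptable.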
XML is very good at making X bigger, and it does nothing to increase Y. After a short introduction, everybody calls themselves an expert in designing an XML DTD, but who bothers to tune the performance of the export/import procedures? And how do you become an expert without having tried this several times?
Quoting myself from [[meta:MediaWiki_architecture]]:
"As of February 2005, the "cur" table of the English Wikipedia holds 3 GB data and 500 MB index (download as 500 MB compressed dump) while the "old" table holds 80 GB data and 3 GB index (download as 29 GB compressed dump)."
Assuming an XML dump would be the same size (ROTFL) and that export/import could run at 1 MB/second (optimistic), that is 3,500 seconds or about one hour for the "cur" table and 83,000 seconds or close to 24 hours for the "old" table. And these are the sizes of February 2005, not of May 2005 or July 2008. You do the math.
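For anyone who wants to check the arithmetic: the figures above are just (data + index) divided by the assumed 1 MB/s rate.

```python
def transfer_seconds(size_mb, rate_mb_per_s=1.0):
    """Seconds to pump size_mb megabytes at rate_mb_per_s MB/s."""
    return size_mb / rate_mb_per_s

# February 2005 table sizes, in MB (data + index):
cur = transfer_seconds(3_000 + 500)     # "cur": 3 GB data + 500 MB index
old = transfer_seconds(80_000 + 3_000)  # "old": 80 GB data + 3 GB index

print(f"cur: {cur:,.0f} s (~{cur / 3600:.1f} h)")  # 3,500 s, about an hour
print(f"old: {old:,.0f} s (~{old / 3600:.1f} h)")  # 83,000 s, close to a day
```

And that is a single error-free pass, before any restarts.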
Not converting the database is the fastest way to cut conversion time. Perhaps you can live with the legacy format? Consider it.