[Mediawiki-l] Re: [Wikitech-l] Long-term: Wiki import/export format

Lars Aronsson lars at aronsson.se
Mon Mar 28 15:51:20 UTC 2005


Lee Daniel Crocker wrote:
> Proposed solution: Let's create a standardized file format (probably
> something XML-ish) for storing the information contained in a wiki.
> All the text, revisions, meta-data, and so on would be stored in a
> well-defined format, so that, for example, upgrading the wiki software
> (from any version to any other--no need to do one at a time!) could
> be done by exporting the wiki into this format and then importing it
> into the new installation.  The export format would be publishable

It sounds so easy.  But would you accept this procedure if it required
Wikipedia to be unavailable or read-only for one hour?  For one day?
For one week?  The conversion time should be a design requirement.

My experience is that software bandwidth (and sometimes hardware
bandwidth) is the limiting factor.  The dump will be X bytes big, and
the export/import procedures will pump at most Y bytes/second, so the
whole procedure takes X/Y seconds to complete.  If you get acceptable
numbers for the Estonian Wikipedia (say, 23 minutes), it will come as
a surprise that the English one is so many times bigger (say, 3 days).
You might also hit an error (hardware problem, power outage, whatever)
after 75% of the work is completed, and need to restart it.
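A back-of-envelope sketch of that model, in Python (the throughput and
the failure point are invented numbers, not measurements):

  # Wall-time estimate: X bytes of dump pushed at Y bytes/second.
  def conversion_time(dump_bytes, bytes_per_second, failure_at=None):
      seconds = dump_bytes / bytes_per_second
      if failure_at is not None:
          # A crash after, say, 75% of the work means that part is redone.
          seconds += failure_at * seconds
      return seconds

  # Hypothetical numbers: a 3.5 GB dump at 1 MB/s, crashing at 75%.
  t = conversion_time(3.5e9, 1e6, failure_at=0.75)
  print(t / 3600, "hours")    # roughly 1.7 hours including the restart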

XML is very good at making X bigger, and does nothing to increase Y.
After only a short introduction, everybody calls themselves an expert
at designing an XML DTD, but who bothers to tune the performance of
the import/export procedures?  How do you become an expert at that
without having tried it several times?
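To make the first point concrete, a toy comparison (the element names
and sample data are invented for illustration, not the proposed format):

  from xml.sax.saxutils import escape

  revisions = ["typo fix in the intro paragraph"] * 10000   # fake revision texts
  plain = "\n".join(revisions)
  wrapped = "".join(
      "<revision><id>%d</id><text>%s</text></revision>" % (i, escape(r))
      for i, r in enumerate(revisions)
  )
  # The per-record markup makes X noticeably bigger; it does nothing for Y.
  print(len(plain), len(wrapped), round(len(wrapped) / len(plain), 1))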

Quoting myself from [[meta:MediaWiki_architecture]]:

 "As of February 2005, the "cur" table of the English Wikipedia holds 
  3 GB data and 500 MB index (download as 500 MB compressed dump)
  while the "old" table holds 80 GB data and 3 GB index (download as
  29 GB compressed dump)."

Assuming these sizes would be the same for an XML dump (ROTFL) and
that export/import could be done at 1 MB/second (optimistic), this is
3,500 seconds or about one hour for the "cur" table and 83,000 seconds
or close to 24 hours for the "old" table.  And this is for the sizes
of February 2005, not for May 2005 or July 2008.  You do the math.
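The arithmetic, spelled out under the same assumptions (the sizes
quoted above, 1 MB/second throughput):

  MB = 10**6
  GB = 10**9
  rate = 1 * MB                    # assumed export/import throughput, bytes/second

  cur = 3 * GB + 500 * MB          # "cur" data + index, February 2005
  old = 80 * GB + 3 * GB           # "old" data + index, February 2005

  print(cur / rate, "s =", cur / rate / 3600, "h")   # 3500 s, about one hour
  print(old / rate, "s =", old / rate / 3600, "h")   # 83000 s, about 23 hours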

Not converting the database is the fastest way to cut conversion time.
Perhaps you can live with the legacy format?  Consider it.


-- 
  Lars Aronsson (lars at aronsson.se)
  Aronsson Datateknik - http://aronsson.se


