Problems: The frequently-changing database schema in which the wiki information is stored makes it difficult to maintain data across upgrades (requiring conversion scripts), offers no easy backup functionality, makes it difficult to access the data with other tools, and is generally fragile.
Proposed solution: Let's create a standardized file format (probably something XML-ish) for storing the information contained in a wiki. All the text, revisions, meta-data, and so on would be stored in a well-defined format, so that, for example, upgrading the wiki software (from any version to any other--no need to do one at a time!) could be done by exporting the wiki into this format and then importing it into the new installation. The export format would be publishable and easier for other applications to use, and would be stored as simple files for which commonly-available backup tools could be used. A periodic export/import would serve to clean the database of any reference errors and fragmentation. Tools could be created to work with the new format to create subsets, mirrors, and so on.
I already have some idea of what is needed, but I solicit input.
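As a rough illustration only--the element names here are hypothetical placeholders, not a proposed standard--a page and its revision history might be written out along these lines, sketched in Python:

import xml.etree.ElementTree as ET

def page_element(title, revisions):
    # One <page> holding the title plus every stored revision of it.
    page = ET.Element("page")
    ET.SubElement(page, "title").text = title
    for rev in revisions:
        r = ET.SubElement(page, "revision")
        ET.SubElement(r, "timestamp").text = rev["timestamp"]
        ET.SubElement(r, "contributor").text = rev["contributor"]
        ET.SubElement(r, "comment").text = rev["comment"]
        ET.SubElement(r, "text").text = rev["text"]
    return page

root = ET.Element("wiki")
root.append(page_element("Sandbox", [{
    "timestamp": "2005-03-27T18:53:33Z",
    "contributor": "Example user",
    "comment": "test edit",
    "text": "Wikitext of this revision goes here.",
}]))
ET.ElementTree(root).write("export.xml", encoding="utf-8", xml_declaration=True)

The point is only that everything now scattered across database tables--text, revisions, meta-data--would sit in one well-defined, self-describing document.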
Hoi, As part of the Wikidata implementation we will have a shot at importing and exporting data using formats like XML. There are already some people interested in helping out with this. Exporting the current wiki data in XML could be part of that effort. It would be a good thing to combine these efforts, as we probably do not want to publish XML in too many different ways.
Thanks, GerardM
Lee Daniel Crocker wrote:
Proposed solution: Let's create a standardized file format (probably something XML-ish) for storing the information contained in a wiki. All the text, revisions, meta-data, and so on would be stored in a well-defined format, so that, for example, upgrading the wiki software (from any version to any other--no need to do one at a time!) could be done by exporting the wiki into this format and then importing it into the new installation. The export format would be publishable
It sounds so easy. But would you accept this procedure if it required Wikipedia to be unavailable or read-only for one hour? For one day? For one week? The conversion time should be a design requirement.
My experience is that software bandwidth (and sometimes hardware bandwidth) is the limiting factor. The dump will be X bytes big, and the export/import procedures will pump at most Y bytes per second, so the whole procedure takes X/Y seconds to complete. If you get acceptable numbers for the Estonian Wikipedia (say, 23 minutes), it will come as a surprise that the English Wikipedia is so many times bigger (say, 3 days). You might also hit an error (hardware problem, power outage, whatever) after 75% of the work is done and have to restart from the beginning.
XML is very good at making X bigger, and it doesn't help increase Y. After only a short introduction, everybody calls themselves an expert at designing an XML DTD, but who cares to tune the performance of the import/export procedures? And how do you become an expert at that before having tried it several times?
Quoting myself from [[meta:MediaWiki_architecture]]:
"As of February 2005, the "cur" table of the English Wikipedia holds 3 GB data and 500 MB index (download as 500 MB compressed dump) while the "old" table holds 80 GB data and 3 GB index (download as 29 GB compressed dump)."
Assuming these sizes would be the same for an XML dump (ROTFL) and that export/import could be done at 1 MB/second (optimistic), this is 3500 seconds or about one hour for the "cur" table and 83,000 seconds or close to 24 hours for the "old" table. And this is for the sizes of February 2005, not for May 2005 or July 2008. You do the math.
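Or let Python do the math, using the February 2005 sizes quoted above and the assumed 1 MB/second of export/import throughput:

MB = 1.0                  # work in megabytes
GB = 1000 * MB            # decimal units, to match the round figures above
rate = 1 * MB             # assumed throughput: 1 MB per second

cur = 3 * GB + 500 * MB   # "cur" table: data + index
old = 80 * GB + 3 * GB    # "old" table: data + index

print(cur / rate)         # 3500.0 seconds, about an hour
print(old / rate)         # 83000.0 seconds
print(old / rate / 3600)  # roughly 23 hours, close to a full day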
Not converting the database is the fastest way to cut conversion time. Perhaps you can live with the legacy format? Consider it.
On Mon, 2005-03-28 at 17:51 +0200, Lars Aronsson wrote:
It sounds so easy. But would you accept this procedure if it requires that Wikipedia is unavailable or read-only for one hour? for one day? for one week? The conversion time should be a design requirement. ... Not converting the database is the fastest way to cut conversion time. Perhaps you can live with the legacy format? Consider it.
A properly written export shouldn't need to have exclusive access to the database at all. The only thing that would need that is a complete reinstall and import, which is only one application of the format and should be needed very rarely (switching to a wholly new hardware or software base, for example). In those few cases (maybe once every few years or so), Wikipedia being uneditable for a few days would not be such a terrible thing--better than it being down completely because the servers are overwhelmed.
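Roughly the kind of read-only export loop I mean, sketched in Python against a MySQL-style DB-API connection; the table and column names (old, old_id, old_title, old_timestamp, old_text) are approximations for illustration, not the exact schema:

from xml.sax.saxutils import escape

def export_revisions(conn, out, batch=1000):
    # Stream revisions out in batches, keyed on the primary key, so the
    # export needs nothing more than ordinary read access to the database.
    cur = conn.cursor()
    last_id = 0
    out.write("<wiki>\n")
    while True:
        cur.execute(
            "SELECT old_id, old_title, old_timestamp, old_text "
            "FROM old WHERE old_id > %s ORDER BY old_id LIMIT %s",
            (last_id, batch))
        rows = cur.fetchall()
        if not rows:
            break
        for old_id, title, timestamp, text in rows:
            out.write("  <revision>\n")
            out.write("    <title>%s</title>\n" % escape(title))
            out.write("    <timestamp>%s</timestamp>\n" % timestamp)
            out.write("    <text>%s</text>\n" % escape(text))
            out.write("  </revision>\n")
            last_id = old_id
    out.write("</wiki>\n")

Since it never writes or takes locks, something like this could run against the live site (or a replica) while editing continues; only a full wipe-and-reimport would need the wiki frozen.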