Problems: The frequently-changing database schema in which the wiki information is stored makes it difficult to maintain data across upgrades (requiring conversion scripts), offers no easy backup functionality, makes it difficult to access the data with other tools, and is generally fragile.
Proposed solution: Let's create a standardized file format (probably something XML-ish) for storing the information contained in a wiki. All the text, revisions, meta-data, and so on would be stored in a well-defined format, so that, for example, upgrading the wiki software (from any version to any other--no need to do one at a time!) could be done by exporting the wiki into this format and then importing it into the new installation. The export format would be publishable, would be easier for other applications to use, and would be a simple set of files that commonly available backup tools could handle. A periodic export/import would serve to clean the database of any reference errors and fragmentation. Tools could be created to work with the new format to create subsets, mirrors, and so on.
I already have some idea of what is needed, but I solicit input.
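To make the idea concrete, here is a rough sketch of what one page record in such a format might look like, built with Python's standard library. Every element name here is hypothetical; a real interchange format would need a published, versioned schema.

```python
# Rough sketch of one page record in a hypothetical wiki export format.
# All element names (page, title, revision, ...) are illustrative only;
# a real interchange format would need a published, versioned schema.
import xml.etree.ElementTree as ET

page = ET.Element("page")
ET.SubElement(page, "title").text = "Sandbox"

rev = ET.SubElement(page, "revision")
ET.SubElement(rev, "timestamp").text = "2005-03-28T14:26:01Z"
ET.SubElement(rev, "contributor").text = "Example User"
# The wikitext itself is carried as opaque character data; its internal
# syntax is deliberately not part of the container format.
ET.SubElement(rev, "text").text = "'''Hello''', world."

record = ET.tostring(page, encoding="unicode")
print(record)
```

The point of the sketch is that revisions nest under pages and the wikitext rides along as plain character data, so the container can evolve without touching the markup.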
Lee Daniel Crocker wrote:
Problems: The frequently-changing database schema in which the wiki information is stored makes it difficult to maintain data across upgrades (requiring conversion scripts), offers no easy backup functionality, makes it difficult to access the data with other tools, and is generally fragile.
Proposed solution: Let's create a standardized file format (probably something XML-ish) for storing the information contained in a wiki. All the text, revisions, meta-data, and so on would be stored in a well-defined format, so that, for example, upgrading the wiki software (from any version to any other--no need to do one at a time!) could be done by exporting the wiki into this format and then importing it into the new installation. The export format would be publishable, would be easier for other applications to use, and would be a simple set of files that commonly available backup tools could handle.
See Special:Export and Special:Import.
I'd like to transition our public backups to this format at some point.
-- brion vibber (brion @ pobox.com)
See Special:Export and Special:Import. I'd like to transition our public backups to this format at some point.
A non-standardized, undocumented XML format is not really much better than the database; the purpose of the exercise is standardization, for now and the long term. It also needs to deal with such issues as properly recording the context of the data, the user database (with its attendant privacy issues), standardization of the wiki syntax itself, and other issues.
Lee Daniel Crocker wrote:
See Special:Export and Special:Import. I'd like to transition our public backups to this format at some point.
A non-standardized, undocumented XML format is not really much better than the database; the purpose of the exercise is standardization, for now and the long term.
Well in the meantime, we've got actual work to do, data to move, and a format already in use for it. It's even got an XML schema definition. ;) http://www.mediawiki.org/xml/export-0.1.xsd
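For what it's worth, consuming a dump in that general shape from outside tools is easy; here is a minimal sketch using Python's standard library. The namespace URI and the page/title/revision/text element names are assumed to match the export-0.1 XSD linked above; verify against the schema before relying on them.

```python
# Minimal sketch of reading an export-0.1-style dump with the standard
# library. The namespace URI and element names are assumed to follow the
# XSD linked above; verify against the schema before relying on them.
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.1/}"

dump = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.1/">
  <page>
    <title>Sandbox</title>
    <revision><text>'''Hello''', world.</text></revision>
  </page>
</mediawiki>"""

root = ET.fromstring(dump)
pages = {}
for page in root.iter(NS + "page"):
    title = page.find(NS + "title").text
    text = page.find(NS + "revision/" + NS + "text").text
    pages[title] = text
print(pages)
```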
(Semantic web my ass! XML is meaningless; just a container format. :P)
It also needs to deal with such issues as properly recording the context of the data, the user database (with its attendant privacy issues),
That might be needed in addition, perhaps, but does not supersede the existing requirements to shuffle page and revision data around.
standardization of the wiki syntax itself, and other issues.
Now, that's a whole nother issue...
-- brion vibber (brion @ pobox.com)
Brion Vibber brion@pobox.com writes:
(Semantic web my ass! XML is meaningless; just a container format. :P)
As long as you try to derive the XML from the current wiki markup, it is surely meaningless, because it is layout-driven. Let us use a well-defined schema and you will get the semantic web for free.
As people add "categories" we could also add semantic markup step by step.
On Mon, Mar 28, 2005 at 02:26:01PM +0200, Karl Eichwalder wrote:
Brion Vibber brion@pobox.com writes:
(Semantic web my ass! XML is meaningless; just a container format. :P)
As long as you try to derive the XML from the current wiki markup, it is surely meaningless, because it is layout-driven. Let us use a well-defined schema and you will get the semantic web for free.
Meaningless? What about AsciiDoc?
Yaroslav Fedevych jaroslaw@linux.org.ua writes:
Meaningless?
At least, it is ambiguous. Surrounding a phrase with ''apostrophes'' means you would like to see it in italics; the phrase might be a quotation, a "see also", a catchword, or something else. Worse, this kind of markup is very fragile (no line breaks are allowed), and if you make an error by accident, the wiki will ''happily'' accept it.
The good thing is, the always morphing wiki syntax keeps us busy, especially those who like to write parsers ;-)
On Mon, Mar 28, 2005 at 04:21:33PM +0200, Karl Eichwalder wrote:
The good thing is, the always morphing wiki syntax keeps us busy, especially those who like to write parsers ;-)
Oh... Could those *incredibly* good and generous people write an AsciiDoc parser, or at least point out quickly where and what to redefine so I can do it myself? (The sources are not as self-contained as they seem to be) :) Many thanks!
(Semantic web my ass! XML is meaningless; just a container format. :P)
As long as you try to derive the XML from the current wiki markup, it is surely meaningless, because it is layout-driven. Let us use a well-defined schema and you will get the semantic web for free.
As people add "categories" we could also add semantic markup step by step.
I don't think we should muck with the wikitext itself; that should simply be a piece of string data, perhaps in a CDATA section, and the semantics inside it are a matter for another standard entirely. I do, however, want to make sure that the format is extensible, so that new kinds of metadata can be added without breaking anything.
Hoi, as part of the Wikidata implementation we will have a shot at importing and exporting data using formats like XML. There are already some people interested in helping out with this. Exporting the current wiki data in XML could be part of that effort. It would be a good thing to combine efforts, as we probably do not want to publish XML in too many ways.
Thanks, GerardM
On Sun, 27 Mar 2005 18:53:33 -0800, Lee Daniel Crocker lee@piclab.com wrote:
Problems: The frequently-changing database schema in which the wiki information is stored makes it difficult to maintain data across upgrades (requiring conversion scripts), offers no easy backup functionality, makes it difficult to access the data with other tools, and is generally fragile.
Proposed solution: Let's create a standardized file format (probably something XML-ish) for storing the information contained in a wiki. All the text, revisions, meta-data, and so on would be stored in a well-defined format, so that, for example, upgrading the wiki software (from any version to any other--no need to do one at a time!) could be done by exporting the wiki into this format and then importing it into the new installation. The export format would be publishable, would be easier for other applications to use, and would be a simple set of files that commonly available backup tools could handle. A periodic export/import would serve to clean the database of any reference errors and fragmentation. Tools could be created to work with the new format to create subsets, mirrors, and so on.
I already have some idea of what is needed, but I solicit input.
-- Lee Daniel Crocker lee@piclab.com http://www.piclab.com/lee/ http://creativecommons.org/licenses/publicdomain/
Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l
GerardM wrote:
As part of the Wikidata implementation we will have a shot at importing and exporting data using formats like XML. There are already some people interested in helping out with this. Exporting the current wiki data in XML could be part of that effort. It would be a good thing to combine efforts, as we probably do not want to publish XML in too many ways.
XML by itself doesn't provide any structure. What sort of XML schema is used for Wikidata's import/export, or which method you use to pull in the old freeform text entries in the first place, are relatively minor details; the hard part is making structure out of the text, and there's not really a good general solution. That's going to be very specific to the task (converting Wiktionary to a structured database).
-- brion vibber (brion @ pobox.com)
Brion Vibber brion@pobox.com writes:
really a good general solution. That's going to be very specific to the task (converting Wiktionary to a structured database).
The obvious choice for Wiktionary would be a trimmed version of the TEI dictionary module. The resulting XML you could store in a database system like idzebra (indexdata.dk).
Lee Daniel Crocker wrote:
Proposed solution: Let's create a standardized file format (probably something XML-ish) for storing the information contained in a wiki. All the text, revisions, meta-data, and so on would be stored in a well-defined format, so that, for example, upgrading the wiki software (from any version to any other--no need to do one at a time!) could be done by exporting the wiki into this format and then importing it into the new installation. The export format would be publishable
It sounds so easy. But would you accept this procedure if it requires that Wikipedia is unavailable or read-only for one hour? for one day? for one week? The conversion time should be a design requirement.
My experience is that software bandwidth (and sometimes hardware bandwidth) is the limit. The dump will be X bytes big, and the export/import procedures will pump at most Y bytes/second, making the whole procedure take X/Y seconds to complete. If you get acceptable numbers for the Estonian Wikipedia (say, 23 minutes), it will come as a surprise that the English is so many times bigger (say, 3 days). You might also get an error (hardware problem, power outage, whatever) after 75 % of the work is completed, and need to restart it.
XML is very good at making X bigger, and doesn't help increase Y. After only a short introduction, everybody calls themselves an expert in designing an XML DTD, but who cares to tweak the performance of the interface procedures? How do you become an expert before having tried this several times?
Quoting myself from [[meta:MediaWiki_architecture]]:
"As of February 2005, the "cur" table of the English Wikipedia holds 3 GB data and 500 MB index (download as 500 MB compressed dump) while the "old" table holds 80 GB data and 3 GB index (download as 29 GB compressed dump)."
Assuming these sizes would be the same for an XML dump (ROTFL) and that export/import could be done at 1 MB/second (optimistic), this is 3500 seconds or about one hour for the "cur" table and 83,000 seconds or close to 24 hours for the "old" table. And this is for the sizes of February 2005, not for May 2005 or July 2008. You do the math.
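The arithmetic above is easy to reproduce; a quick back-of-the-envelope check using the same figures (table sizes including indexes, and the assumed 1 MB/s throughput):

```python
# Back-of-the-envelope check of the estimates above: dump size X divided
# by assumed throughput Y (1 MB/s) gives the transfer time in seconds.
def transfer_seconds(size_mb, rate_mb_per_s=1.0):
    return size_mb / rate_mb_per_s

cur_s = transfer_seconds(3000 + 500)    # "cur": 3 GB data + 500 MB index
old_s = transfer_seconds(80000 + 3000)  # "old": 80 GB data + 3 GB index

print(cur_s, cur_s / 3600)  # 3500 s, about an hour
print(old_s, old_s / 3600)  # 83000 s, close to 24 hours
```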
Not converting the database is the fastest way to cut conversion time. Perhaps you can live with the legacy format? Consider it.
On Mon, Mar 28, 2005 at 05:51:20PM +0200, Lars Aronsson wrote:
Not converting the database is the fastest way to cut conversion time. Perhaps you can live with the legacy format? Consider it.
What about having it (*not* necessarily XML) as an option, and making it customizable? Not every MediaWiki-powered site is that large, and some of them may in fact require something like that.
Yaroslav Fedevych wrote:
On Mon, Mar 28, 2005 at 05:51:20PM +0200, Lars Aronsson wrote:
Not converting the database is the fastest way to cut conversion time. Perhaps you can live with the legacy format? Consider it.
What about having it (*not* necessarily XML) as an option, and making it customizable? Not every MediaWiki-powered site is that large, and some of them may in fact require something like that.
Surely there are other features one could be working on. Although I'm not saying that any feature that won't be used in the Wikipedia project is useless, I think that we should definitely prioritize.
My two cents.
- -- Edward Z. Yang Personal: edwardzyang@thewritingpot.com SN:Ambush Commander Website: http://www.thewritingpot.com/ GPGKey:0x869C48DA http://www.thewritingpot.com/gpgpubkey.asc 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA
On Mon, 2005-03-28 at 11:05 -0500, Edward Z. Yang wrote:
Surely there are other features one could be working on. Although I'm not saying that any feature that won't be used in the Wikipedia project is useless, I think that we should definitely prioritize.
Fortunately we have Brion for that :-) Seriously, though, different people have different talents and concerns, and not everyone is best used in the same way. There are lots of programmers and sysadmins better than me who can keep things running, but I've always had my eye on the long term. A project like ours needs vision as much as it needs code.
Lee Daniel Crocker wrote:
On Mon, 2005-03-28 at 11:05 -0500, Edward Z. Yang wrote:
Surely there are other features one could be working on. Although I'm not saying that any feature that won't be used in the Wikipedia project is useless, I think that we should definitely prioritize.
Fortunately we have Brion for that :-) Seriously, though, different people have different talents and concerns, and not everyone is best used in the same way. There are lots of programmers and sysadmins better than me who can keep things running, but I've always had my eye on the long term. A project like ours needs vision as much as it needs code.
For the luddites among us, the ways of the developers are mysterious. Brion has always been quick to answer technical concerns, even if it's just to say, "Your idea is impossible." A luddite needs to accept such an answer when faced with the even less desirable alternative of learning something. Brion communicates well, but when he is the only one doing it, one quickly gets the misimpression that we have a one-man development staff. Communication with the masses is important, and if more people were doing it, perhaps the suggestion of a fork that has come from Wikinews would not be happening.
There has been much talk about XML, but it would be sheer pretence to suggest that I understand any of it. Nevertheless I will presume to speak as a democratically unelected representative of the luddites. Many of us are happy to carry on with our familiar wiki markup, and feel a great sense of unease with the kind of rampant innovationism that has recently been characterized by the use of templates. The question that this boils down to is, "How will adopting XML affect the average user?"
Ec
On Mon, 28 Mar 2005 11:22:20 -0800, Ray Saintonge saintonge@telus.net wrote:
The question that this boils down to is, "How will adopting XML affect the average user?"
None. This purely affects what a database dump file looks like (giant pile of unreadable XML crap vs giant pile of unreadable mysql crap). Mirrors and those mining mediawiki datadumps care, ordinary users won't see any difference.
John Fader wrote:
On Mon, 28 Mar 2005 11:22:20 -0800, Ray Saintonge saintonge@telus.net wrote:
The question that this boils down to is, "How will adopting XML affect the average user?"
None. This purely affects what a database dump file looks like (giant pile of unreadable XML crap vs giant pile of unreadable mysql crap). Mirrors and those mining mediawiki datadumps care, ordinary users won't see any difference.
The idea (in my understanding) is to standardize the dump format to increase forward/backward compatibility. I don't have any problems with standardization, but of course, as John puts it, we're still going to have a lot of crap when it comes to database dumps, perhaps even more so with XML.
- -- Edward Z. Yang edwardzyang@thewritingpot.com
Ray Saintonge wrote:
There has been much talk about XML, but it would be sheer pretence to suggest that I understand any of it. Nevertheless I will presume to speak as a democratically unelected representative of the luddites. Many of us are happy to carry on with our familiar wiki markup and
In *this* thread (see Subject: Wiki import/export format), the mention of XML was *not* related to the wiki markup. Some people apparently believed it was, but that was a mistake.
Lars Aronsson wrote:
Ray Saintonge wrote:
There has been much talk about XML, but it would be sheer pretence to suggest that I understand any of it. Nevertheless I will presume to speak as a democratically unelected representative of the luddites. Many of us are happy to carry on with our familiar wiki markup and
In *this* thread (see Subject: Wiki import/export format), the mention of XML was *not* related to the wiki markup. Some people apparently believed it was, but that was a mistake.
But an XML representation of the wiki markup is what is needed, while the database dump is much more comfortable to handle as raw SQL, as you showed above.
Jakob
Lars Aronsson lars@aronsson.se writes:
It sounds so easy. But would you accept this procedure if it requires that Wikipedia is unavailable or read-only for one hour? for one day? for one week?
It could be done on the fly, even if it takes some weeks. "Simply" start storing the converted articles in a second table (or database system) as well...
Assuming these sizes would be the same for an XML dump (ROTFL) and that export/import could be done at 1 MB/second (optimistic), this is 3500 seconds or about one hour for the "cur" table and 83,000 seconds or close to 24 hours for the "old" table. And this is for the sizes of February 2005, not for May 2005 or July 2008. You do the math.
Of course, you must use a database system designed for holding XML data ;) If we start using XML properly we can give up on a lot of hacks. We will also save resources, because the set of allowed tags is limited and their usage is well-defined :)
On Mon, Mar 28, 2005 at 06:12:10PM +0200, Karl Eichwalder wrote:
Of course, you must use a database system designed for holding XML data ;) If we start using XML properly we can give up on a lot of hacks. We will also save resources, because the set of allowed tags is limited and their usage is well-defined :)
BTW, I hate to type articles in XML, it sucks for that purpose badly. Structured text is the thing to go for, but XML... Arrrrgggghhh!!!!
On Mon, 2005-03-28 at 19:51 +0300, Yaroslav Fedevych wrote:
On Mon, Mar 28, 2005 at 06:12:10PM +0200, Karl Eichwalder wrote:
Of course, you must use a database system designed for holding XML data ;) If we start using XML properly we can give up on a lot of hacks. We will also save resources, because the set of allowed tags is limited and their usage is well-defined :)
BTW, I hate to type articles in XML, it sucks for that purpose badly. Structured text is the thing to go for, but XML... Arrrrgggghhh!!!!
Believe me, no one is suggesting such an evil thing. Wikitext syntax is too complex already--it needs to be simplified, not made worse. We're talking about under-the-hood functionality here, not user interface.
Yaroslav Fedevych jaroslaw@linux.org.ua writes:
BTW, I hate to type articles in XML, it sucks for that purpose badly.
You can use an XML editor or plug-in.
Structured text is the thing to go for, but XML... Arrrrgggghhh!!!!
I would not mind switching to SGML with some minimization feature switched on; it is possible to write SGML with minimal markup.
On Mon, Mar 28, 2005 at 07:53:32PM +0200, Karl Eichwalder wrote:
Yaroslav Fedevych jaroslaw@linux.org.ua writes:
BTW, I hate to type articles in XML, it sucks for that purpose badly.
You can use an XML editor or plug-in.
Structured text is the thing to go for, but XML... Arrrrgggghhh!!!!
I would not mind switching to SGML with some minimization feature switched on; it is possible to write SGML with minimal markup.
As I have mentioned -- if markup is minimal, there are worthier alternatives which permit 1:1 conversion... If {SG,X}ML is really needed.
Yaroslav Fedevych jaroslaw@linux.org.ua writes:
As I have mentioned -- if markup is minimal, there are worthier alternatives which permit 1:1 conversion... If {SG,X}ML is really needed.
I simply believe in standards - others do not. And the rest refuse to accept anything not invented by themselves.
On Mon, 2005-03-28 at 17:51 +0200, Lars Aronsson wrote:
It sounds so easy. But would you accept this procedure if it requires that Wikipedia is unavailable or read-only for one hour? for one day? for one week? The conversion time should be a design requirement. ... Not converting the database is the fastest way to cut conversion time. Perhaps you can live with the legacy format? Consider it.
A properly written export shouldn't need to have exclusive access to the database at all. The only thing that would need that is a complete reinstall and import, which is only one application of the format and should be needed very rarely (switching to a wholly new hardware or software base, for example). In those few cases (maybe once every few years or so), Wikipedia being uneditable for a few days would not be such a terrible thing--better than it being down completely because the servers are overwhelmed.
Lars Aronsson wrote:
Lee Daniel Crocker wrote:
Proposed solution: Let's create a standardized file format (probably something XML-ish) for storing the information contained in a wiki. All the text, revisions, meta-data, and so on would be stored in a well-defined format, so that, for example, upgrading the wiki software (from any version to any other--no need to do one at a time!) could be done by exporting the wiki into this format and then importing it into the new installation. The export format would be publishable
It sounds so easy. But would you accept this procedure if it requires that Wikipedia is unavailable or read-only for one hour? for one day? for one week? The conversion time should be a design requirement.
Of course, we've had unplanned downtime that has lasted for more than an hour. Then too, en:wikipedia has so far resisted converting to Unicode. :-)
Ec
Ray Saintonge wrote:
Of course, we've had unplanned downtime that has lasted for more than an hour. Then too, en:wikipedia has so far resisted converting to Unicode. :-)
All I'm saying is that time is an important parameter, and that you can calculate how much time will be needed, and set a limit to what is acceptable (if it means downtime).
If you have lots of data (e.g. the English "old" table), it is often an advantage if you can convert one piece at a time. I guess this could be done for converting the old table to Unicode. You could add a column (or a separate table) that indicates whether entry X of the old table is in Unicode, alter the software to look in this column, then start the conversion one article at a time as a low-priority background process. After the conversion is completed, the software can be altered back again.
That strategy would be the total opposite of defining a unified export dump format and stopping everything for a week to do a dump, convert, and import. Just as a pipeline and a tanker ship are two different ways to transport oil.
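The flag-column strategy described above can be sketched as a toy script. Here sqlite3 stands in for the real database, and the table and column names (old, is_unicode) are invented for illustration:

```python
# Toy sketch of the incremental-conversion strategy: a flag column marks
# converted rows, and a background job converts one small batch at a time
# while the wiki stays up. Table and column names are hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE old (id INTEGER PRIMARY KEY, text BLOB, "
    "is_unicode INTEGER DEFAULT 0)"
)
# Five legacy rows stored in Latin-1, the pre-Unicode encoding.
db.executemany("INSERT INTO old (text) VALUES (?)",
               [("caf\xe9".encode("latin-1"),)] * 5)

def convert_batch(db, batch_size=2):
    """Convert up to batch_size unconverted rows; return how many were done."""
    rows = db.execute(
        "SELECT id, text FROM old WHERE is_unicode = 0 LIMIT ?", (batch_size,)
    ).fetchall()
    for row_id, raw in rows:
        utf8 = raw.decode("latin-1").encode("utf-8")
        db.execute("UPDATE old SET text = ?, is_unicode = 1 WHERE id = ?",
                   (utf8, row_id))
    return len(rows)

# Low-priority background loop: repeat until nothing is left to convert.
while convert_batch(db):
    pass
remaining = db.execute(
    "SELECT COUNT(*) FROM old WHERE is_unicode = 0"
).fetchone()[0]
print(remaining)
```

Because each batch is small and the flag records progress, the job can be interrupted by a crash and restarted without redoing or corrupting already-converted rows, which addresses the restart problem raised earlier in the thread.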