As we were (OK: I was ;-) running into trouble integrating HTML-to-XML parsing into the Bison-based parser, I have written a specialized C++ class that can do this prior to the actual parsing. It outputs only well-formed XML *structure* and, as far as I can tell, also follows the XHTML nesting rules (<tr> inside <table>, etc.).
"Broken" HTML will be changed into &lt; / &gt; entities, so only valid XML will reach the output. However, I took some care to automagically fix the "usual suspects" (obligatory 21C3 reference) of HTML ugliness, like unclosed <li> and various table chaos. Even a lonely <caption> (not closed) somewhere in the text will generate a full table. It might not be pretty, but it will be valid XML.
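To illustrate the idea, here is a minimal sketch of such a pre-pass. It is not the code in html2xml.cpp: the function name sanitize_html, the tiny tag whitelist and the handling of attributes (none) are all assumptions made for the example. It escapes anything that is not a recognised tag and closes a dangling <li> when the next <li> or the end of the list arrives.

```cpp
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper, NOT taken from html2xml.cpp.
// Simplifications: no attributes, a four-tag whitelist, and existing
// entities such as "&amp;" are not recognised (every '&' is escaped).
std::string sanitize_html(const std::string& in) {
    static const std::set<std::string> allowed = {"b", "i", "ul", "li"};
    std::vector<std::string> open;  // stack of currently open elements
    std::string out;

    // Emit a closing tag for the innermost open element.
    auto close_top = [&]() {
        out += "</" + open.back() + ">";
        open.pop_back();
    };

    std::size_t i = 0;
    while (i < in.size()) {
        const char c = in[i];
        if (c == '&') { out += "&amp;"; ++i; continue; }
        if (c == '>') { out += "&gt;"; ++i; continue; }
        if (c != '<') { out += c; ++i; continue; }

        // Try to read a simple tag of the form <name> or </name>.
        std::size_t j = i + 1;
        const bool closing = (j < in.size() && in[j] == '/');
        if (closing) ++j;
        std::string name;
        while (j < in.size() && std::isalpha(static_cast<unsigned char>(in[j]))) {
            name += static_cast<char>(std::tolower(static_cast<unsigned char>(in[j])));
            ++j;
        }
        const bool known_tag = !name.empty() && j < in.size() &&
                               in[j] == '>' && allowed.count(name) > 0;
        if (!known_tag) { out += "&lt;"; ++i; continue; }  // broken HTML becomes text

        if (!closing) {
            // A new <li> implicitly closes a directly preceding open <li>.
            if (name == "li" && !open.empty() && open.back() == "li") close_top();
            open.push_back(name);
            out += "<" + name + ">";
        } else {
            bool found = false;
            for (const auto& n : open) if (n == name) found = true;
            if (!found) { out += "&lt;"; ++i; continue; }  // stray close tag becomes text
            while (open.back() != name) close_top();       // close inner elements first
            close_top();
        }
        i = j + 1;
    }
    while (!open.empty()) close_top();  // close whatever is still open at the end
    return out;
}
```

With this sketch, sanitize_html("<ul><li>one<li>two</ul>") yields "<ul><li>one</li><li>two</li></ul>", and an unknown tag such as "<foo>" comes out as "&lt;foo&gt;". It only guarantees well-formed nesting; the XHTML-specific rules (tables, captions) are left out.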
While this is primarily intended for the wiki-to-XML parser, it might work for enforcing XML output for the current parser as well. We'd only have to run the wiki source through it before actually parsing.
Source: CVS HEAD, Module "flexbisonparse", file "html2xml.cpp". (GPL, of course)
Magnus
On Monday 10 January 2005 17:23, Magnus Manske wrote:
While this is primarily intended for the wiki-to-XML parser, it might work for enforcing XML output for the current parser as well. We'd only have to run the wiki source through it before actually parsing.
just a thought that came to me: would it make any sense to store parsed XML in the DB instead of wiki-markup?
daniel
On Tuesday 11 January 2005 22:27, Daniel Wunsch wrote:
just a thought that came to me: would it make any sense to store parsed XML in the DB instead of wiki-markup?
My software (will) work like this.
NSK wrote:
On Tuesday 11 January 2005 22:27, Daniel Wunsch wrote:
just a thought that came to me: would it make any sense to store parsed XML in the DB instead of wiki-markup?
My software (will) work like this.
Will you use this (wikipedia's) to-be XML markup, or your own brand?
Might just as well create a standard here ;-)
Magnus
On Tuesday 11 January 2005 23:29, Magnus Manske wrote:
Will you use this (wikipedia's) to-be XML markup, or your own brand?
It will use a standard wiki markup which I will create.
It is my intention that my wiki software should provide lots of parsers/converters for other markups, including HTML, XHTML, OpenOffice, MediaWiki, TikiWiki, WikkaWiki et cetera, probably after the 1.0 version.
Find out more at http://maatworks.wikinerds.org/index.php/NGWP
On Wed, Jan 12, 2005 at 01:13:13AM +0200, NSK wrote:
Find out more at http://maatworks.wikinerds.org/index.php/NGWP
Shouldn't it be named NGMW ;-)
ciao, tom
On Wednesday 12 January 2005 12:01, Thomas R. Koll wrote:
On Wed, Jan 12, 2005 at 01:13:13AM +0200, NSK wrote:
Find out more at http://maatworks.wikinerds.org/index.php/NGWP
Shouldn't it be named NGMW ;-)
It has nothing to do with MW.
I already have a new MW here: http://maatworks.wikinerds.org/index.php/WikiAnt
They are totally independent programs. NGWP shares no code with any other project. NGWP is actually not just a wiki/CMS but also a new object-oriented platform.
Oh, I forgot to add that NGWP means New Generation Wiki Platform.
On Wednesday 12 January 2005 18:37, NSK wrote:
On Wednesday 12 January 2005 12:01, Thomas R. Koll wrote:
On Wed, Jan 12, 2005 at 01:13:13AM +0200, NSK wrote:
Find out more at http://maatworks.wikinerds.org/index.php/NGWP
Shouldn't it be named NGMW ;-)
It has nothing to do with MW.
I already have a new MW here: http://maatworks.wikinerds.org/index.php/WikiAnt
They are totally independent programs. NGWP shares no code with any other project. NGWP is actually not just a wiki/CMS but also a new object-oriented platform.
On Tue, 11 Jan 2005 21:27:23 +0100, Daniel Wunsch <the.gray@gmx.net> wrote:
just a thought that came to me: would it make any sense to store parsed XML in the DB instead of wiki-markup?
Well, it would have to be *as well as* wiki-markup, not instead - else what would you edit? But I seem to remember storing parsed XML representations as a form of caching being discussed as part of the architecture of an XML-based parsing system.
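A rough sketch of what "as well as" could look like: the wiki markup stays the editable source of truth, and the parsed XML is only a cache that is dropped on every edit. The class name Page, its members and the injected parser callback are hypothetical and not part of any existing code.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <utility>

// Hypothetical page record: wiki markup is stored, XML is derived and cached.
class Page {
public:
    // Stand-in for whatever wiki-to-XML parser is used (an assumption).
    using Parser = std::function<std::string(const std::string&)>;

    Page(std::string wikitext, Parser parser)
        : wikitext_(std::move(wikitext)), parser_(std::move(parser)) {}

    // Editing replaces the markup and invalidates the cached XML.
    void edit(std::string new_wikitext) {
        wikitext_ = std::move(new_wikitext);
        cached_xml_.reset();
    }

    // What the user sees in the edit box: always the original markup.
    const std::string& wikitext() const { return wikitext_; }

    // Parsed form: computed on first use, then served from the cache.
    const std::string& xml() {
        if (!cached_xml_) cached_xml_ = parser_(wikitext_);
        return *cached_xml_;
    }

private:
    std::string wikitext_;
    Parser parser_;
    std::optional<std::string> cached_xml_;
};
```

The parser runs at most once per revision, and nothing is lost if the cache is thrown away, since the markup is still there.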
Rowan Collins wrote:
Well, it would have to be *as well as* wiki-markup, not instead - else what would you edit? But I seem to remember storing parsed XML representations as a form of caching being discussed as part of the architecture of an XML-based parsing system.
It'd be possible to do XML-to-WikiMarkup for that, if it's acceptable to have some sort of "canonical" formatting of the Wikimarkup rather than the character-for-character original.
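As a minimal sketch of the "canonical" idea: once the text exists only as a tree, the serializer decides the spelling of the markup, so the output is the same no matter how the original was typed. The Node struct and the element names ("bold", "italics", "heading") are made up for this example and are not the actual XML vocabulary.

```cpp
#include <string>
#include <vector>

// Hypothetical in-memory form of the parsed XML.
struct Node {
    std::string tag;             // "text", "bold", "italics", "heading", or a container
    std::string text;            // only used when tag == "text"
    std::vector<Node> children;  // child elements, in document order
};

// Serialize a tree back to wiki markup in one fixed, canonical style.
std::string to_wikitext(const Node& n) {
    if (n.tag == "text") return n.text;

    std::string inner;
    for (const Node& child : n.children) inner += to_wikitext(child);

    if (n.tag == "bold")    return "'''" + inner + "'''";
    if (n.tag == "italics") return "''" + inner + "''";
    // Canonical heading: exactly one space inside the markers.
    if (n.tag == "heading") return "== " + inner + " ==\n";
    return inner;  // unknown containers just pass their content through
}
```

Whether the author originally typed "==Intro==" or "==  Intro  ==", a heading node holding the text "Intro" always comes back as "== Intro ==", which is the character-for-character loss being traded away.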
-Mark
wikitech-l@lists.wikimedia.org