HTML-to-XML


Magnus Manske

10 Jan 2005

10:23 a.m.

As we were (OK: I am;-) running into trouble integrating HTML-to-XML parsing into the Bison-based parser, I have written a specialized C++ class that can do this prior to the actual parsing. It will output only correct XML *structure*, and (as far as I can tell) correct XHTML rules (<tr> in <table> etc.) as well.

"Broken" HTML will be changed into &lt; / &gt; entities, so only valid XML will reach the output. However, I took some care to automagically fix the "usual suspects" (obligatory 21C3 reference) of HTML ugliness, like unclosed <li> and various table chaos. Even a lonely <caption> (not closed) somewhere in the text will generate a full table. It might not be pretty, but it will be valid XML.

While this is primarily intended for the wiki-to-XML parser, it might work for enforcing XML output for the current parser as well. We'd only have to run the wiki source through it before actually parsing.

Source: CVS HEAD, Module "flexbisonparse", file "html2xml.cpp". (GPL, of course)

Magnus
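
The behaviour described above (stray markup turned into entities, an unclosed <li> closed automatically) can be illustrated with a minimal C++ sketch. This is not the code from html2xml.cpp: the function names, the tiny tag whitelist, and the restriction to attribute-free tags are all assumptions made for illustration, and tables, captions, and '&' handling are left out entirely.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical whitelist; a real converter would know many more elements.
static const std::set<std::string> kAllowedTags = {"ul", "ol", "li", "b", "i"};

// Tries to read "<name>" or "</name>" (no attributes) starting at in[pos] == '<'.
// On success fills name/closing and returns the index just past '>'.
// Returns std::string::npos for anything malformed or not whitelisted.
static std::size_t readSimpleTag(const std::string& in, std::size_t pos,
                                 std::string& name, bool& closing) {
    std::size_t i = pos + 1;
    closing = (i < in.size() && in[i] == '/');
    if (closing) ++i;
    name.clear();
    while (i < in.size() && std::isalpha(static_cast<unsigned char>(in[i]))) {
        name += static_cast<char>(std::tolower(static_cast<unsigned char>(in[i])));
        ++i;
    }
    if (name.empty() || i >= in.size() || in[i] != '>') return std::string::npos;
    if (kAllowedTags.count(name) == 0) return std::string::npos;
    return i + 1;
}

// Emits well-nested output: unknown or broken markup becomes &lt; / &gt;,
// an open <li> is closed before the next <li>, and anything still open
// at the end of the input is closed.
std::string htmlToXmlSketch(const std::string& in) {
    std::string out;
    std::vector<std::string> open;  // stack of currently open elements
    auto closeTop = [&]() {
        out += "</" + open.back() + ">";
        open.pop_back();
    };
    for (std::size_t pos = 0; pos < in.size();) {
        const char c = in[pos];
        if (c == '>') { out += "&gt;"; ++pos; continue; }
        if (c != '<') { out += c; ++pos; continue; }
        std::string name;
        bool closing = false;
        const std::size_t next = readSimpleTag(in, pos, name, closing);
        if (next == std::string::npos) { out += "&lt;"; ++pos; continue; }
        if (!closing) {
            // Implicitly close a sibling <li> that was never closed.
            if (name == "li" && !open.empty() && open.back() == "li") closeTop();
            open.push_back(name);
            out += "<" + name + ">";
        } else if (std::find(open.begin(), open.end(), name) != open.end()) {
            // Close everything nested inside the element, then the element.
            while (open.back() != name) closeTop();
            closeTop();
        } else {
            // Closing tag with no matching open element: keep it as text.
            out += "&lt;" + in.substr(pos + 1, next - pos - 2) + "&gt;";
        }
        pos = next;
    }
    while (!open.empty()) closeTop();
    return out;
}
```

Under these assumptions, calling htmlToXmlSketch on the wiki source before the actual parse would play the role of the pre-pass mentioned in the message; the stack of open elements is what lets the sketch repair nesting rather than merely escape it.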


Daniel Wunsch

11 Jan 2005

2:27 p.m.

On Monday 10 January 2005 17:23, Magnus Manske wrote:

...

While this is primarily intended for the wiki-to-XML parser, it might work for enforcing XML output for the current parser as well. We'd only have to run the wiki source through it before actually parsing.

just a thought that came to me: would it make any sense to store parsed XML in the DB instead of wiki-markup?

daniel

NSK

3:23 p.m.

On Tuesday 11 January 2005 22:27, Daniel Wunsch wrote:

...

just a thought that came to me: would it make any sense to store parsed XML in the DB instead of wiki-markup?

My software (will) work like this.

-- NSK Come to see the new wikiprojects at http://portal.wikinerds.org

Magnus Manske

3:29 p.m.

NSK wrote:

...

On Tuesday 11 January 2005 22:27, Daniel Wunsch wrote:

...
just a thought that came to me: would it make any sense to store parsed XML in the DB instead of wiki-markup?

My software (will) work like this.

Will you use this (wikipedia's) to-be XML markup, or your own brand?

Might just as well create a standard here ;-)

Magnus

NSK

5:13 p.m.

On Tuesday 11 January 2005 23:29, Magnus Manske wrote:

...

Will you use this (wikipedia's) to-be XML markup, or your own brand?

It will use a standard wiki markup which I will create.

It is my intention that my wiki software should provide lots of parsers/converters for other markups, including HTML, XHTML, OpenOffice, MediaWiki, TikiWiki, WikkaWiki et cetera, probably after the 1.0 version.

Find out more at http://maatworks.wikinerds.org/index.php/NGWP

-- NSK Come to see the new wikiprojects at http://portal.wikinerds.org

Thomas R. Koll

12 Jan 2005

4:01 a.m.

On Wed, Jan 12, 2005 at 01:13:13AM +0200, NSK wrote:

...

Find out more at http://maatworks.wikinerds.org/index.php/NGWP

Shouldn't it be named NGMW ;-)

ciao, tom

-- == Web links == * http://shop.wikipedia.org - buy the WikiReader Internet * http://de.wikipedia.org/wiki/Benutzer:TomK32 * http://www.hammererlehen.de - holidays in Berchtesgaden

NSK

10:37 a.m.

On Wednesday 12 January 2005 12:01, Thomas R. Koll wrote:

...

On Wed, Jan 12, 2005 at 01:13:13AM +0200, NSK wrote:

...
Find out more at http://maatworks.wikinerds.org/index.php/NGWP

Shouldn't it be named NGMW ;-)

It has nothing to do with MW.

I already have a new MW here: http://maatworks.wikinerds.org/index.php/WikiAnt

They are totally independent programs. NGWP shares no code with any other project. NGWP is actually not just a wiki/CMS but also a new object-oriented platform.

-- NSK Visit my wiki at http://jnana.wikinerds.org

NSK

9:34 p.m.

Oh, I forgot to add that NGWP means New Generation Wiki Platform.

On Wednesday 12 January 2005 18:37, NSK wrote:

...

On Wednesday 12 January 2005 12:01, Thomas R. Koll wrote:

...
On Wed, Jan 12, 2005 at 01:13:13AM +0200, NSK wrote:

...
Find out more at http://maatworks.wikinerds.org/index.php/NGWP

Shouldn't it be named NGMW ;-)

It has nothing to do with MW.

I already have a new MW here: http://maatworks.wikinerds.org/index.php/WikiAnt

They are totally independent programs. NGWP shares no code with any other project. NGWP is actually not just a wiki/CMS but also a new object-oriented platform.

-- NSK Visit my wiki at http://jnana.wikinerds.org

Rowan Collins

11 Jan 2005

4:46 p.m.

On Tue, 11 Jan 2005 21:27:23 +0100, Daniel Wunsch the.gray@gmx.net wrote:

...

just a thought that came to me: would it make any sense to store parsed XML in the DB instead of wiki-markup?

Well, it would have to be *as well as* wiki-markup, not instead - else what would you edit? But I seem to remember storing parsed XML representations as a form of caching being discussed as part of the architecture of an XML-based parsing system.

-- Rowan Collins BSc [IMSoP]
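
The "as well as" arrangement described here, with the wiki markup kept as the editable source and the parsed XML held only as a cache, might look roughly like the C++ sketch below. The PageStore class, its method names, and the in-memory map standing in for the database are all hypothetical; nothing here reflects an actual MediaWiki schema.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical store: wikitext is the source of truth, XML is derived data.
class PageStore {
public:
    using Parser = std::function<std::string(const std::string&)>;

    explicit PageStore(Parser parser) : parser_(std::move(parser)) {}

    // Saving new wikitext invalidates any cached XML for that page.
    void save(const std::string& title, const std::string& wikitext) {
        pages_[title] = Entry{wikitext, std::string(), false};
    }

    // What an editor would load into the edit box.
    const std::string& wikitext(const std::string& title) const {
        return pages_.at(title).wikitext;
    }

    // Parses on first request only; later requests reuse the cached XML.
    const std::string& xml(const std::string& title) {
        Entry& e = pages_.at(title);
        if (!e.cached) {
            e.xml = parser_(e.wikitext);
            e.cached = true;
        }
        return e.xml;
    }

private:
    struct Entry {
        std::string wikitext;
        std::string xml;
        bool cached;
    };

    Parser parser_;
    std::unordered_map<std::string, Entry> pages_;
};
```

The point of the design is that the cache can always be thrown away and rebuilt, so the character-for-character original markup is never lost.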

Delirium

4:55 p.m.

Rowan Collins wrote:

...

Well, it would have to be *as well as* wiki-markup, not instead - else what would you edit? But I seem to remember storing parsed XML representations as a form of caching being discussed as part of the architecture of an XML-based parsing system.

It'd be possible to do XML-to-WikiMarkup for that, if it's acceptable to have some sort of "canonical" formatting of the Wikimarkup rather than the character-for-character original.

-Mark
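
To make the "canonical formatting" trade-off concrete, here is a small C++ sketch restricted to section headings, chosen only as an example. Two differently spaced headings parse to the same structure, and serializing that structure back yields a single canonical spelling, so the original spacing is lost. The Heading struct and both function names are assumptions for illustration, not part of any existing parser.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical parsed form of a heading line; level 0 means "not a heading".
struct Heading {
    int level;
    std::string title;
};

// Parses lines such as "==Title==" or "==  Title ==".
Heading parseHeading(const std::string& line) {
    std::size_t lead = 0;
    while (lead < line.size() && line[lead] == '=') ++lead;
    std::size_t trail = 0;
    while (trail < line.size() - lead && line[line.size() - 1 - trail] == '=') ++trail;
    const std::size_t level = std::min(lead, trail);
    if (level == 0) return Heading{0, line};
    const std::string inner = line.substr(level, line.size() - 2 * level);
    const std::size_t first = inner.find_first_not_of(" \t");
    if (first == std::string::npos) return Heading{0, line};
    const std::size_t last = inner.find_last_not_of(" \t");
    return Heading{static_cast<int>(level), inner.substr(first, last - first + 1)};
}

// Writes the heading back with exactly one space around the title.
std::string toCanonicalWikitext(const Heading& h) {
    const std::string marks(static_cast<std::size_t>(h.level), '=');
    return marks + " " + h.title + " " + marks;
}
```

Both "==Title==" and "==  Title ==" come back as "== Title ==" here, which is the kind of normalization an editor would have to accept if only the XML were stored.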


wikitech-l@lists.wikimedia.org
