A while ago I started some experimental client software that took the output from wiki2xml, I got sidetracked but now I've got some more time I'm wanting to get back to it.
A few questions:
I've searched the list and see there is now a proper flex/bison parser. The wiki2xml convertor has not had any checkins for a while so I presume it's now defunct?
Does the flex/bison parser produce roughly the same XML as wiki2xml? (same tag names, nesting etc)
Is there a DTD, XML schema for the wikiXML? How about a rough spec?
Jim
Jim Higson schrieb:
A while ago I started some experimental client software that took the output from wiki2xml, I got sidetracked but now I've got some more time I'm wanting to get back to it.
A few questions:
I've searched the list and see there is now a proper flex/bison parser. The wiki2xml convertor has not had any checkins for a while so I presume it's now defunct?
Yup. If you know Bison, we'd be glad if you could take a look at it. Especially the HTML parsing needs a lot of work.
In the flexbisonparse module, there is also a "preprocessor" of mine which tries to convert HTML to wiki text as far as possible, which might then ease the parser code. Using the preprocessor, basically only <div> and <font> need to be taken care of by the parser, and the usual wiki tags (<pre>, <nowiki>, <math> etc.).
Does the flex/bison parser produce roughly the same XML as wiki2xml? (same tag names, nesting etc)
No. But the new one is better! :-)
Is there a DTD, XML schema for the wikiXML? How about a rough spec?
No DTD or the like, but try the example at the end of this mail (can't attach files on the mailing list...)
Your help with the parser would be much appreciated.
Magnus
Example :
This is '''bold''' and ''italics'' and '''''both'''''.
List test
* dot
** two dots
# number
## two numbers
#* number, dot
Link test : [[solo link]], [[target|text]], [[image:test.jpg|thumb|100px|text]], [[target|]]
:Indent
{|
! a th-like element
|parameter| a cell
| another cell
|-
|parameter=something| another cell, another row
|}
== Heading 2 ==
=== Heading 3 ===
<nowiki>A nowiki text</nowiki>
XML: <article><paragraph>This is <bold>bold</bold> and <italics>italics</italics> and <italics><bold>both</bold></italics>.</paragraph><paragraph>List test</paragraph><list type='bullet'><listitem>dot<list type='bullet'><listitem>two dots</listitem></list></listitem></list><list type='numbered'><listitem>number<list type='numbered'><listitem>two numbers</listitem></list><list type='bullet'><listitem>number, dot</listitem></list></listitem></list><paragraph>Link test : <link><linktarget>solo link</linktarget></link>, <link><linktarget>target</linktarget><linkoption>text</linkoption></link>, <link><linktarget>image:test.jpg</linktarget> <linkoption>thumb</linkoption> <linkoption>100px</linkoption> <linkoption>text</linkoption></link>, <link emptypipeatend='yes'><linktarget>target</linktarget></link></paragraph><list type='indent'><listitem>Indent</listitem></list><table><tablerow><tablehead>a th-like element</tablehead><tablecell><attrs><attr name='parameter' isnull='yes'></attr></attrs> a cell</tablecell><tablecell>another cell</tablecell></tablerow><tablerow><tablecell><attrs><attr name='parameter'>something</attr></attrs> another cell, another row</tablecell></tablerow></table><heading level='2'> Heading 2 </heading><heading level='3'> Heading 3 </heading><paragraph><extension name='nowiki'>A nowiki text</extension></paragraph></article>
Magnus Manske wrote:
Jim Higson schrieb:
A while ago I started some experimental client software that took the output from wiki2xml, I got sidetracked but now I've got some more time I'm wanting to get back to it.
A few questions:
I've searched the list and see there is now a proper flex/bison parser. The wiki2xml convertor has not had any checkins for a while so I presume it's now defunct?
Yup. If you know Bison, we'd be glad if you could take a look at it. Especially the HTML parsing needs a lot of work.
I'm afraid not. I did a class last year in lex+yacc, so I mostly know my way round a spec, but I've no experience using it for a real language, especially one like wikitext which wasn't designed with formal grammars in mind.
A quick overview of what I'm doing: For my undergraduate dissertation I'm writing a partial reimplementation of the mediawiki interface without any dynamic component on the server. This isn't intended to replace the current PHP interface; I am running it as an experiment into what is possible using very low spec web servers.
At the moment what I've got uses a javascript half-port of wiki2xml. If the project were to be taken any further it would have to use a functionally identical parser to the Bison one, which as far as I can see would involve either modifying Bison to output javascript (very hard) or a C to javascript converter (also very hard!). As you can probably tell, I'll never fully reimplement the parsing process and don't intend this code to be used except as a neat demonstration. Still, I'd like my intermediate XML format to be near the 'official' one because it is possible my presentation layer might be teamed up with a server-side parser (using something like &action=parsedxml instead of &action=raw). Even so, it isn't trying to be a replacement interface because it places too many requirements on the client and for the /Special:foo pages it will probably always delegate to PHP. At best it might one day be possible to run this project in parallel to a mediawiki wiki.
In the flexbisonparse module, there is also a "preprocessor" of mine which tries to convert HTML to wiki text as far as possible, which might then ease the parser code. Using the preprocessor, basically only <div> and <font> need to be taken care of by the parser, and the usual wiki tags (<pre>, <nowiki>, <math> etc.).
Does the flex/bison parser produce roughly the same XML as wiki2xml? (same tag names, nesting etc)
No. But the new one is better! :-)
Good, except this means a bit more work for me ;)
Is there a DTD, XML schema for the wikiXML? How about a rough spec?
No DTD or the like, but try the example at the end of this mail (can't attach files on the mailing list...)
Your help with the parser would be much appreciated.
I wish I could give more help with it. I can't really do much of anything until this dissertation is done. After that, possibly.
The example was very helpful, thanks.
Jim
Jim Higson wrote:
A question regarding lists as XML:
It seems like with the current parser nested lists are always contained within listitems.
What should be the output from something like:
** foo
** bar
Should there be a 'phantom' listitem to contain the nested list, such as:
<list>
  <listitem>
    <list>
      <listitem>foo</listitem>
      <listitem>bar</listitem>
    </list>
  </listitem>
</list>
Or is it valid for a list to directly contain a list, i.e.:
<list>
  <list>
    <listitem>foo</listitem>
    <listitem>bar</listitem>
  </list>
</list>
Of course, starting two lists like that is bad syntax, but parsable. I can build a decent parse tree of list wikitext now, but I'm not sure which is the correct XML output.
Jim Higson wrote:
What should be the output from something like:
** foo
** bar
Should there be a 'phantom' listitem to contain the nested list, such as:
[snip]
Or is it valid for a List to directly contain a list, ie:
In HTML, a list cannot directly contain another list; the child list must sit inside a list item. If you break the list tree by skipping a level in this way you will get a phantom bullet point.
(If you just want to indent to second level, you can do this:
:* foo
:* bar
which will make the top-level list a definition list which has no visible item marker.)
Of course, starting two lists like that is bad syntax, but parsable. I can build a decent parse tree of list wikitext now, but I'm not sure which is the correct XML output.
It could be perfectly good and correct if that's how that schema is defined. However it may be better to stick with how the HTML lists work.
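If the schema does follow the HTML rule, it can be checked mechanically. Below is a minimal sketch in Python, assuming the `list` and `listitem` element names from the example XML earlier in the thread. The helper name `find_bare_nested_lists` is hypothetical and not part of any parser discussed here.

```python
import xml.etree.ElementTree as ET

def find_bare_nested_lists(root):
    """Return every <list> element whose direct parent is another <list>."""
    offenders = []
    for parent in root.iter("list"):      # includes root itself if it is a <list>
        for child in parent:
            if child.tag == "list":
                offenders.append(child)
    return offenders

# Nested list wrapped in a listitem (the HTML-style structure).
wrapped = ET.fromstring(
    "<list><listitem><list><listitem>foo</listitem>"
    "<listitem>bar</listitem></list></listitem></list>"
)
# Nested list sitting directly inside its parent list.
bare = ET.fromstring(
    "<list><list><listitem>foo</listitem><listitem>bar</listitem></list></list>"
)

print(len(find_bare_nested_lists(wrapped)))  # 0
print(len(find_bare_nested_lists(bare)))     # 1
```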
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
Jim Higson wrote:
What should be the output from something like:
** foo
** bar
Should there be a 'phantom' listitem to contain the nested list, such as:
[snip]
Or is it valid for a List to directly contain a list, ie:
In HTML, a list cannot directly contain another list; the child list must sit inside a list item. If you break the list tree by skipping a level in this way you will get a phantom bullet point.
I take the attitude that the wiki XML representation doesn't have to follow HTML too closely. Having said that, it does make semantic sense that a listitem would contain the nested list, because it will usually be a subclause of that item.
At the moment my 'parser' adds phantom items if two levels of list start at once, and gives them a phantom="true" attribute. This is ok, but maybe it is unnecessary because the XML-to-whatever formatter could just look to see if items have text nodes and if not format them without markers. I don't want to just invent attributes as I need them, in case my xml strays too far from the flex/bison parser's.
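The alternative mentioned here, letting the formatter detect marker-less items by the absence of text, might look roughly like the following Python sketch. It assumes the `list`/`listitem` element names and `type` values used in this thread. The mapping to `ul`/`ol` and the inline style are illustrative assumptions, not anything the existing parsers do.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from wiki-XML list types to HTML tags (illustrative only).
HTML_LIST_TAGS = {"bullet": "ul", "numbered": "ol"}

def has_own_text(item):
    """True if the listitem carries non-blank text outside its nested lists."""
    chunks = [item.text or ""] + [child.tail or "" for child in item]
    return any(chunk.strip() for chunk in chunks)

def render_list(wiki_list):
    """Convert a wiki-XML <list> element into an HTML list element."""
    html_list = ET.Element(HTML_LIST_TAGS.get(wiki_list.get("type"), "ul"))
    for item in wiki_list.findall("listitem"):
        li = ET.SubElement(html_list, "li")
        if has_own_text(item):
            # This sketch keeps only the leading text and drops tail text.
            li.text = (item.text or "").strip() or None
        else:
            # No text of its own: treat as a phantom item, hide the marker.
            li.set("style", "list-style-type: none")
        for nested in item.findall("list"):
            li.append(render_list(nested))
    return html_list

source = ET.fromstring(
    '<list type="bullet"><listitem><list type="bullet">'
    "<listitem>foo</listitem><listitem>bar</listitem>"
    "</list></listitem></list>"
)
print(ET.tostring(render_list(source), encoding="unicode"))
```

With this approach no `phantom` attribute is needed in the XML, at the cost of the formatter having to inspect each item's content.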
Maybe one way I could help would be drawing up a DTD? That way the validity of parser output could be easily checked.
My little parser does lists completely reliably now (I'm guessing the flex/bison one would get this right too?). Even though I'm not using a parser generator, I'm using a lot of 'proper parser' techniques. Some output:
* normal list...
* ...easy!
** nested
*#** bit odd here!
#*** huh?
#*#*:# who would write this awful wikitext?
#*# at least it parses!
* Oh, the joy of parse trees!
# foo

# bar
** baz
# bac
<list type="bullet">
  <listitem> normal list... </listitem>
  <listitem> ...easy!
    <list type="bullet">
      <listitem> nested </listitem>
    </list>
  </listitem>
  <listitem phantom="true">
    <list type="numbered">
      <listitem phantom="true">
        <list type="bullet">
          <listitem phantom="true">
            <list type="bullet">
              <listitem> bit odd here! </listitem>
            </list>
          </listitem>
        </list>
      </listitem>
    </list>
  </listitem>
</list>
<list type="numbered">
  <listitem phantom="true">
    <list type="bullet">
      <listitem phantom="true">
        <list type="bullet">
          <listitem phantom="true">
            <list type="bullet">
              <listitem> huh? </listitem>
            </list>
          </listitem>
        </list>
      </listitem>
      <listitem phantom="true">
        <list type="numbered">
          <listitem phantom="true">
            <list type="bullet">
              <listitem phantom="true">
                <list type="indent">
                  <listitem phantom="true">
                    <list type="numbered">
                      <listitem> who would write this awful wikitext? </listitem>
                    </list>
                  </listitem>
                </list>
              </listitem>
            </list>
          </listitem>
          <listitem> at least it parses! </listitem>
        </list>
      </listitem>
    </list>
  </listitem>
</list>
<list type="bullet">
  <listitem> Oh, the joy of parse trees! </listitem>
</list>
<list type="numbered">
  <listitem> foo </listitem>
</list>
<list type="numbered">
  <listitem> bar </listitem>
</list>
<list type="bullet">
  <listitem phantom="true">
    <list type="bullet">
      <listitem> baz </listitem>
    </list>
  </listitem>
</list>
<list type="numbered">
  <listitem> bac </listitem>
</list>
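For readers who want to experiment, here is a rough Python sketch of one way to build such a tree. It is not the converter described in this message. It is a reconstruction under assumptions: list types come from the `*`, `#` and `:` prefix characters, consecutive lines share the lists named by their common prefix, a nested list goes into the previous item only when that item sits exactly one level up, and every other gap is filled with a `phantom="true"` item. The function name `lists_to_xml` and the `article` wrapper are illustrative.

```python
import re
import xml.etree.ElementTree as ET

LIST_TYPES = {"*": "bullet", "#": "numbered", ":": "indent"}
PREFIX_RE = re.compile(r"^([*#:]+)\s*(.*)$")

def lists_to_xml(wikitext):
    """Build nested <list>/<listitem> elements from wikitext list lines."""
    root = ET.Element("article")
    stack = []  # open lists as (marker_char, list_element), outermost first
    for line in wikitext.splitlines():
        match = PREFIX_RE.match(line)
        if not match:
            stack.clear()  # a non-list line ends every open list
            continue
        prefix, text = match.groups()
        # How many open lists does this line share with the previous one?
        keep = 0
        while (keep < min(len(prefix), len(stack))
               and stack[keep][0] == prefix[keep]):
            keep += 1
        closed_some = len(stack) > keep
        del stack[keep:]
        if keep == len(prefix):
            # Same list as before: just append a sibling item.
            ET.SubElement(stack[-1][1], "listitem").text = text
            continue
        # Decide where the first newly opened list should live.
        if keep == 0:
            parent = root
        elif closed_some:
            parent = ET.SubElement(stack[-1][1], "listitem", phantom="true")
        else:
            parent = stack[-1][1][-1]  # the item written by the previous line
        remaining = prefix[keep:]
        for offset, char in enumerate(remaining):
            new_list = ET.SubElement(parent, "list", type=LIST_TYPES[char])
            stack.append((char, new_list))
            if offset == len(remaining) - 1:
                parent = ET.SubElement(new_list, "listitem")
                parent.text = text
            else:
                parent = ET.SubElement(new_list, "listitem", phantom="true")
    return root

print(ET.tostring(lists_to_xml("** foo\n** bar"), encoding="unicode"))
```

The `stack` mirrors the prefix of the previous line, so comparing prefixes character by character is enough to decide which lists stay open and which must be closed.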
Jim Higson wrote in gmane.science.linguistics.wikipedia.technical:
#*#*:# who would write this awful wikitext?
<list type="indent">
note that ":" is not "indent", but rather a definition list, i.e. <dl>, <dt> and <dd> in html. compare:
; term : definition
; term 2 : definition 2
: definition without term
the use of : for indentation is a side-effect of how definition lists are generally rendered in browsers.
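To make the mapping concrete, here is a deliberately simplistic Python sketch that turns lines like the ones above into `<dl>`, `<dt>` and `<dd>` elements. The function name is hypothetical, and the sketch assumes the first `:` after a `;` separates term from definition. It ignores complications such as colons inside links.

```python
import xml.etree.ElementTree as ET

def definition_list_to_html(lines):
    """Map ';' lines to <dt> and ':' parts to <dd> inside one <dl>."""
    dl = ET.Element("dl")
    for line in lines:
        line = line.strip()
        if line.startswith(";"):
            # "; term : definition" carries both parts on one line.
            term, separator, definition = line[1:].partition(":")
            ET.SubElement(dl, "dt").text = term.strip()
            if separator:
                ET.SubElement(dl, "dd").text = definition.strip()
        elif line.startswith(":"):
            # A definition with no term of its own.
            ET.SubElement(dl, "dd").text = line[1:].strip()
    return dl

result = definition_list_to_html([
    "; term : definition",
    "; term 2 : definition 2",
    ": definition without term",
])
print(ET.tostring(result, encoding="unicode"))
```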
kate.
Kate Turner wrote:
Jim Higson wrote in gmane.science.linguistics.wikipedia.technical:
#*#*:# who would write this awful wikitext?
<list type="indent">
note that ":" is not "indent", but rather a definition list, i.e. <dl>, <dt> and <dd> in html. compare:
; term : definition
; term 2 : definition 2
: definition without term
the use of : for indentation is a side-effect of how definition lists are generally rendered in browsers.
Thanks for the note.
I'd used : as a bauble[1] since that's how wiki2xml and (I think) flexbisonparse handle them.
I hadn't come across the ; and : notation before.
[1] My converter calls the list chars (# and *) 'baubles', which is nice when you're writing code to decorate (parse) trees. One of the things with parsers is you have to think of so many names for things!
A definition list is still a list - maybe <list type=definition> would be a good representation for items starting with ;
-- Jim
Jim Higson schrieb:
I'd used : as a bauble[1] since that's how wiki2xml and (I think) flexbisonparse handle them.
That is entirely my fault :-)
I used to dislike the ":"-";" notation and ignored the ";" part in favor of getting the parser ahead. Actually, I was about to suggest to get rid of the ";" from the wiki syntax altogether, but then I found it useful for my wikimaps implementation...
Magnus
Jim Higson wrote:
A while ago I started some experimental client software that took the output from wiki2xml, I got sidetracked but now I've got some more time I'm wanting to get back to it.
A few questions:
I've searched the list and see there is now a proper flex/bison parser. The
Speaking of which, why have a flex/bison parser? Wouldn't it be better if mediawiki created XML pages directly, like an "atom feed" or "rss" button? Mediawiki already carries a HTML engine for rendering wikitext to HTML, wouldn't it be easy to, with little modification, make it output XML (or even Docbook/XML) instead of HTML?
Cheers, Pedro.
Pedro Medeiros wrote:
Speaking of which, why have a flex/bison parser? Wouldn't it be better if mediawiki created XML pages directly, like an "atom feed" or "rss" button? Mediawiki already carries a HTML engine for rendering wikitext to HTML, wouldn't it be easy to, with little modification, make it output XML (or even Docbook/XML) instead of HTML?
Our present "parser" is a hack with a series of regexps and other horrors, whose steps often stomp on each other and produce hard to fix errors. It's not something to be emulated; rather it is our greatest shame. Currently we cannot guarantee that XHTML output will be well-formed, so changing it to a custom XML format would be a waste of time, as it would not be transformable.
A character-by-character parser that can go from the beginning to the end and churn something out that's guaranteed to be well-formed should be less error-prone and easier to maintain. Whether flex/bison is the best route I cannot say, but it's worth exploring.
Having this parser output an internal XML format instead of XHTML directly means a) we can maintain semantic information that would be lost in HTML and b) we can keep the base _parser_ separate from the code that does things like check for page existence, format the URLs for local links, and perhaps template transclusions. This allows transformation to other formats (XHTML, DocBook?) with less crap than eg trying to rewrite all the HTML into DocBook.
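As an illustration of point b), a later stage could take a `<link>` element from the intermediate XML and only then decide how it should look in XHTML. The Python sketch below assumes the `link`/`linktarget`/`linkoption` element names from the example earlier in the thread. The `page_exists` callback, the `/wiki/` base URL and the `new` class are hypothetical stand-ins for whatever the real site code would supply.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

def render_link(link, page_exists, base_url="/wiki/"):
    """Turn a wiki-XML <link> into an HTML <a>, resolving it at render time."""
    target = link.findtext("linktarget")
    options = [option.text or "" for option in link.findall("linkoption")]
    label = options[-1] if options else target  # last option used as link text
    anchor = ET.Element("a", href=base_url + quote(target.replace(" ", "_")))
    if not page_exists(target):
        anchor.set("class", "new")  # mark links to missing pages
    anchor.text = label
    return anchor

existing_pages = {"target"}  # stand-in for a database lookup
link = ET.fromstring("<link><linktarget>solo link</linktarget></link>")
rendered = render_link(link, page_exists=lambda title: title in existing_pages)
print(ET.tostring(rendered, encoding="unicode"))
```

The parser never needs to know which pages exist. That knowledge lives entirely in the callback passed to the rendering stage.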
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
Our present "parser" is a hack with a series of regexps and other horrors, whose steps often stomp on each other and produce hard to fix errors. It's not something to be emulated; rather it is our greatest shame. Currently we cannot guarantee that XHTML output will be well-formed, so changing it to a custom XML format would be a waste of time, as it would not be transformable.
But, still, a parser written in php is necessary. Albeit a better one.
A character-by-character parser that can go from the beginning to the end and churn something out that's guaranteed to be well-formed should be less error-prone and easier to maintain. Whether flex/bison is the best route I cannot say, but it's worth exploring.
A proof-of-concept implementation might be a good thing to have around. But if I may, I can't see how, for instance, a simple flex/bison parser could adequately parse a set of varying extension languages, like the one used in <math> tags, into valid XML (In this case, MathML, I guess).
The parser would have to be modular, so each parser module would be used to translate a language. Well, this sparks some ideas.
Having this parser output an internal XML format instead of XHTML directly means a) we can maintain semantic information that would be lost in HTML and b) we can keep the base _parser_ separate from the code that does things like check for page existence, format the URLs for local links, and perhaps template transclusions. This allows transformation to other formats (XHTML, DocBook?) with less crap than eg trying to rewrite all the HTML into DocBook.
I completely agree. My question was about the best way of doing that parsing.
Cheers, Pedro.
Pedro Medeiros wrote:
A proof-of-concept implementation might be a good thing to have around. But if I may, I can't see how, for instance, a simple flex/bison parser could adequately parse a set of varying extension languages, like the one used in <math> tags, into valid XML (In this case, MathML, I guess).
It wouldn't be expected to; that would be passed off to the extension in a subsequent stage. (Have you checked the mailing list archives for past discussion on this topic?)
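A later stage of this kind could be as simple as a lookup table keyed on the extension name. The Python sketch below assumes the `<extension name='...'>` element shown in the example XML earlier in the thread. The handler functions and the registry are hypothetical, and the math handler is only a placeholder that does no real formula rendering.

```python
import html
import xml.etree.ElementTree as ET

def render_nowiki(body):
    """Emit the body as literal text."""
    return html.escape(body)

def render_math(body):
    """Placeholder: a real handler would call a formula renderer here."""
    return '<span class="math">' + html.escape(body) + "</span>"

# Hypothetical registry mapping extension names to handler functions.
EXTENSION_HANDLERS = {"nowiki": render_nowiki, "math": render_math}

def render_extension(element):
    """Hand the raw body of an <extension> element to its handler."""
    body = element.text or ""
    handler = EXTENSION_HANDLERS.get(element.get("name"))
    if handler is None:
        return html.escape(body)  # unknown extension: show the body as text
    return handler(body)

node = ET.fromstring("<extension name='nowiki'>A nowiki text</extension>")
print(render_extension(node))  # A nowiki text
```

The base parser only has to find where an extension's content starts and ends. It never needs to understand the language inside.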
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
Pedro Medeiros wrote:
A proof-of-concept implementation might be a good thing to have around. But if I may, I can't see how, for instance, a simple flex/bison parser could adequately parse a set of varying extension languages, like the one used in <math> tags, into valid XML (In this case, MathML, I guess).
It wouldn't be expected to; that would be passed off to the extension in a subsequent stage. (Have you checked the mailing list archives for past discussion on this topic?)
Not thoroughly, but I got most of the idea. But I won't be using php (and mediawiki extensions) to work on the XML file produced by flexbisonparse. I intend to transform the XML file into Docbook/XML (and from there to anything else). Docbook/XML supports math formulae natively, so I guess I won't be needing libgd-generated PNG images of them, right?
Cheers, Pedro.
wikitech-l@lists.wikimedia.org