Hello all,
after so much talk about an alternate parser, here it is! Well, half of it, anyway.
Attached (only 8KB), you'll find my hand-written C++ source (GPL) to convert wiki markup to XML.
Remarks:
* This was hacked in a few days, so don't expect it to work perfectly, although it does surprisingly well so far.
* It doesn't check for invalid user XML constructs (only partially implemented so far).
* It doesn't recognize nowiki and pre tags.
* Some small wiki markup is not supported (ISBNs, for example).
All of these problems are due to the early stages of this project, not because of "can't do".
* At the moment, it reads the source from the file "test.txt". Switching this to piped command line input is dead easy, though (a sketch follows below).
* Because of strangeness in the Firebird XML display, I have to replace "&mdash;" with something else ("!mdash;" in this case...), otherwise it doesn't render as valid XML.
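For illustration, a minimal sketch of the piped-input variant (a hypothetical stand-alone snippet, not part of the attached source): it just slurps stdin into a string, which would then be handed to the converter instead of the contents of "test.txt".

    // Sketch only: read the whole wiki source from standard input.
    #include <iostream>
    #include <sstream>
    #include <string>

    int main() {
        std::ostringstream buf;
        buf << std::cin.rdbuf();            // slurp everything that was piped in
        std::string wikitext = buf.str();   // pass this to the converter instead of the file contents
        std::cout << "read " << wikitext.size() << " bytes\n";
        return 0;
    }

That would allow something like "cat article.txt | wiki2xml > article.xml".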
I have tested it against several "live" wikipedia articles, and benchmarked a few on my laptop:
* List_of_state_leaders_in_1839 : 200 pages per second (link-heavy page)
* Operation_Downfall : 380 pages per second ("normal" page of above-average length)
* List_of_Baronies : 3 pages per second (source is 233KB...)
* Sian_Lloyd : 2300 pages per second (stub)
For an average page (in number of links, tables, overall length etc.), I estimate above 500 conversions per second on an off-the-shelf laptop (with stuff running in the background). This ain't bad...
I am certain that further tweaking could enhance this some more.
I have not yet found a page that doesn't output correct XML; however, it crashes on [[Results of the Canadian federal election, 2004]], our longest page, for unknown reasons. I assume it voted for the other party ;-)
I would consider this a promising start. Please give it a try, and maybe look into the PHP XML parser.
Magnus
Hello Magnus,
Magnus Manske wrote:
after so much talk about an alternate parser, here it is! Well, half of it, anyway.
Attached (only 8KB), you'll find my hand-written C++ source (GPL) to convert wiki markup to XML.
wasn't there something about attachments? ;-)
At least, I didn't get the source.
Eckhart
<curse mode='censored'>!"§$%&/()</curse>
Try http://www.magnusmanske.de/wiki2xml.zip
Magnus
Eric Pöhlsen wrote:
At least, I didn't get the source.
Eckhart
I guess the list does not allow attachments
Eric
My 'unzip' command can't unzip this file.... strange.
Emmanuel Engelhart
I used 7zip in zip mode, which seems to be incompatible with the standard zip format :-(
New version up, try it now. All good things...
Magnus
Emmanuel Engelhart wrote:
My 'unzip' command can't unzip this file.... strange.
Emmanuel Engelhart
I have now created a new module in CVS: "wiki2xml". Check it out ;-)
Magnus
Emmanuel Engelhart wrote:
My 'unzip' command can't unzip this file.... strange.
Emmanuel Engelhart
I was very happy, because you coded this first parser very fast. And I support the idea of making a real wiki parser. But it seems that you didn't use flex/bison or any parser generator. I think continuing in this direction would be a very bad idea. The best specialists use parser generators, and we are not as good as parsing specialists...
Emmanuel
Magnus Manske magnus.manske@web.de wrote:
I have now created a new module in CVS: "wiki2xml". Check it out ;-)
Magnus
Emmanuel Engelhart wrote:
My 'unzip' command can't unzip this file.... strange.
Emmanuel Engelhart
Emmanuel Engelhart wrote:
I was very happy, because you coded this first parser very fast. And I support the idea of making a real wiki parser. But it seems that you didn't use flex/bison or any parser generator. I think continuing in this direction would be a very bad idea. The best specialists use parser generators, and we are not as good as parsing specialists...
I agree. The reasons I didn't use a parser generator are:
* I was already working on this when I wrote my initial mail to the list, and I didn't think of a parser generator then
* I don't know s**t about bison or flex
I *did* continue to develop this, because I believe we can use it as a black box. It already works OK (except for a headings bug, which I'm about to fix now), and someone (hint!;-) could start working on integrating it into a new PHP structure that will take the XML and output (X)HTML. If we later replace my manual version with an automatically generated one, it should not matter at all to the rest of the parser.
Magnus
Magnus Manske wrote:
Emmanuel Engelhart wrote:
I was very happy, because you coded this first parser very fast. And I support the idea of making a real wiki parser. But it seems that you didn't use flex/bison or any parser generator. I think continuing in this direction would be a very bad idea. The best specialists use parser generators, and we are not as good as parsing specialists...
I agree. The reasons I didn't use a parser generator are
- I was already working on this when I wrote my initial mail to the
list, and I didn't think of a parser generator then
- I don't know s**t about bison or flex
I *did* continue to develop this, because I believe we can use it as a black box. It already works OK (except for a headings bug, which I'm about to fix now), and someone (hint!;-) could start working on integrating it into a new PHP structure that will take the XML and output (X)HTML. If we later replace my manual version with an automatically generated one, it should not matter at all to the rest of the parser.
Magnus
I think it is time to start [[Wikimedia:WikiSyntax]] to hold the EBNF wiki syntax to use with Flex/Bison, Lex/Yacc, JavaCC, etc.
See http://bugzilla.wikipedia.org/show_bug.cgi?id=7
I've started a very small stub at http://fr.wikipedia.org/wiki/Wikip%C3%A9dia:WikiSyntaxe
PS: Of course I totally agree with the model
Wikisyntax (MUST STAY A NON XML SYNTAX)
        |
     Parsing
        |
  XMLTreeDumping
        |
   XML (Cached)
        |
  Postprocessing (for templates, magics, and links, and options)
        |
  XSLT-like Transformation
        |
   -----+----------+------------------+--------
   |               |                  |       |
(X)HTML           PDF          XML/MATHML/SVG ?????
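A minimal sketch of the same staging in code, assuming made-up function names and treating every stage as a plain string-to-string transformation (the real stages would obviously do actual work):

    #include <string>

    // Stubs only, for shape; each returns its input (nearly) unchanged.
    std::string parse_to_xml(const std::string &wikitext) { return "<article>" + wikitext + "</article>"; }
    std::string postprocess(const std::string &xml)       { return xml; } // templates, magics, links, options
    std::string to_xhtml(const std::string &xml)          { return xml; } // XSLT-like transformation

    std::string render(const std::string &wikitext) {
        std::string xml = parse_to_xml(wikitext);  // this intermediate XML is what would be cached
        xml = postprocess(xml);                    // templates, magics, links, options
        return to_xhtml(xml);                      // or a PDF / MathML / SVG back-end instead
    }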
Xmlizer
Magnus Manske wrote:
I *did* continue to develop this, because I believe we can use it as a black box. It already works OK (except for a headings bug, which I'm about to fix now), and someone (hint!;-) could start working on integrating it into a new PHP structure that will take the XML and output (X)HTML. If we later replace my manual version with an automatically generated one, it should not matter at all to the rest of the parser.
I think that for that to be true there would first have to be consensus about the interface of the black box, which means that it should be clear (1) what the formal grammar of the mark-up is and (2) what the corresponding XML is. I think (2) could be straightforward (simply the syntax tree in XML form, with e.g. non-terminals as tag names) but for (1) it is not just a matter of creating a grammar that accepts all current contents, but you need to decide how far you want to go with the functional decomposition, i.e., how detailed you want the syntax tree to be. One extreme would be to let all parsing be done by the parser, and no more matching should be necessary in the translation afterwards. There are also some other very deep and fundamental decisions that have to be made there. For example, do we accept all possible input or do we only accept well-formed markup? In the first case the grammar becomes very complex and the transformation very hard because it has to correct the non-wellformed input. In the second case the grammar and the transformation become much simpler but you would give the user a harsh SYNTAX ERROR if the input is not well-formed and possibly you would discover that most of the current pages are in fact not well-formed.
It is hard to have a discussion over (1) if you are not already using bison/yacc because ideally your grammar should be LALR(1) or nearly so with only a few ambiguities and it is nearly impossible to analyse that without these tools.
-- Jan Hidders
Jan Hidders wrote:
Magnus Manske wrote:
I *did* continue to develop this, because I believe we can use it as a black box. It already works OK (except for a headings bug, which I'm about to fix now), and someone (hint!;-) could start working on integrating it into a new PHP structure that will take the XML and output (X)HTML. If we later replace my manual version with an automatically generated one, it should not matter at all to the rest of the parser.
I think that for that to be true there would first have to be consensus about the interface of the black box, which means that it should be clear (1) what the formal grammar of the mark-up is and (2) what the corresponding XML is. I think (2) could be straightforward (simply the syntax tree in XML form, with e.g. non-terminals as tag names) but for (1) it is not just a matter of creating a grammar that accepts all current contents, but you need to decide how far you want to go with the functional decomposition, i.e., how detailed you want the syntax tree to be. One extreme would be to let all parsing be done by the parser, and no more matching should be necessary in the translation afterwards. There are also some other very deep and fundamental decisions that have to be made there. For example, do we accept all possible input or do we only accept well-formed markup? In the first case the grammar becomes very complex and the transformation very hard because it has to correct the non-wellformed input. In the second case the grammar and the transformation become much simpler but you would give the user a harsh SYNTAX ERROR if the input is not well-formed and possibly you would discover that most of the current pages are in fact not well-formed.
It is hard to have a discussion over (1) if you are not already using bison/yacc because ideally your grammar should be LALR(1) or nearly so with only a few ambiguities and it is nearly impossible to analyse that without these tools.
I think one point is beyond question: the parser should accept even malformed source, just as the current PHP-based parser does now (or ought to do ;-).
For the rest, I take the more practical approach here; (1) is what is allowed today, for the time being. I don't need a formal grammar for the manual parser, I hack away until it does what it should do. Once we replace my black box with a lex/bison black box, we have to worry about the grammar.
In the meantime, this black box will do fine, meaning it will suffice for writing everything that comes after the conversion to XML. If the black box doesn't work with all articles during this initial phase - so what? We'll just *assume* that it works correctly, leave problematic articles out during testing, and build an XML->XHTML component that works. By the time this is done, we should have that grammar finalized and can replace the black box with a bison/whatever version.
I just don't want to lose the momentum by establishing the grammar before doing anything else. There is plenty we can do already, based on what we have.
Magnus
On Mon, Aug 16, 2004 at 04:44:26PM +0200, Magnus Manske wrote:
It is hard to have a discussion over (1) if you are not already using bison/yacc because ideally your grammar should be LALR(1) or nearly so with only a few ambiguities and it is nearly impossible to analyse that without these tools.
I think one point is beyond question: the parser should accept even malformed source, just as the current PHP-based parser does now (or ought to do ;-).
For the rest, I take the more practical approach here; (1) is what is allowed today, for the time being. I don't need a formal grammar for the manual parser, I hack away until it does what it should do. Once we replace my black box with a lex/bison black box, we have to worry about the grammar.
In the meantime, this black box will do fine, meaning it will suffice for writing everything that comes after the conversion to XML. If the black box doesn't work with all articles during this initial phase - so what? We'll just *assume* that it works correctly, leave problematic articles out during testing, and build an XML->XHTML component that works. By the time this is done, we should have that grammar finalized and can replace the black box with a bison/whatever version.
I just don't want to lose the momentum by establishing the grammar before doing anything else. There is plenty we can do already, based on what we have.
Code first, think later - that's the only way that works. Too many promising projects have died from spending too much time on discussion before there was any code.
Magnus Manske wrote:
Jan Hidders wrote:
I think that for that to be true there would first have to be consensus about the interface of the black box, which means that it should be clear (1) what the formal grammar of the mark-up is and (2) what the corresponding XML is. [...] do we accept all possible input or do we only accept well-formed markup?
I think one point is beyond question: the parser should accept even malformed source, just as the current PHP-based parser does now (or ought to do ;-).
I'm happy either way as long as the decision is taken in full awareness of the increase in complexity of the code.
For the rest, I take the more practical approach here; (1) is what is allowed today, for the time being. I don't need a formal grammar for the manual parser, I hack away until it does what it should do.
But until you know the grammar you do not know what "it should do". Until you know the grammar you do not know what XML to produce, what the DTD is and what it means. That you can parse most of the current articles and produce well-formed XML is virtually meaningless unless you can tell the programmer of the code that processes the XML what the DTD is and what it means.
Once we replace my black box with a lex/bison black box, we have to worry about the grammar.
It's not that simple. You cannot generate a parser for *any* formal grammar, so there is probably going to be some tweaking of the grammar (apart from other changes such as moving some of the parsing from or to the transformation phase) which will mean that probably the output XML will also change. Note that by "hacking away" as you are doing now you are in fact already implicitly choosing a grammar by the way that you produce your XML. Unless you make those choices explicit it's hard to tell if they are the right ones.
I just don't want to lose the momentum by establishing the grammar before doing anything else.
We don't have to. Note that we do not need *the* grammar to start. It would probably be better anyway to take a gradual refinement approach by which I mean here the following:
1. You start with a very simple grammar that basically chops the source into big semantic chunks. So you get a very small parse tree with leaves that contain large pieces of unparsed text. These leaves are transformed by the old code and then combined into the final output (a rough sketch of this first pass follows below the list).
2. You test whether, for most pages, you generate the same output as before. If you are not happy with this you redefine the grammar and corresponding transformation until you are.
3. Until the leaves no longer contain unparsed text you refine the syntax such that the parse tree becomes bigger and the leaves smaller, then go back to step 2.
4. Remove all old transformation code.
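As a rough illustration of step 1 (the chunk kinds and the first-character heuristics below are invented for the example, not a proposal for the actual grammar), a first pass could do little more than this:

    #include <iostream>
    #include <string>
    #include <vector>

    // One leaf of the coarse first-pass parse; the text is still raw wiki markup.
    struct Chunk {
        enum Kind { Heading, ListLine, TableLine, Paragraph } kind;
        std::string text;
    };

    // Chop the source into block-level chunks by looking only at the first
    // character of each line; everything inside a chunk stays unparsed and
    // would be handed to the old transformation code.
    std::vector<Chunk> chop(const std::string &src) {
        std::vector<Chunk> out;
        size_t pos = 0;
        while (pos < src.size()) {
            size_t eol = src.find('\n', pos);
            if (eol == std::string::npos) eol = src.size();
            std::string line = src.substr(pos, eol - pos);
            Chunk c;
            if (!line.empty() && line[0] == '=')                          c.kind = Chunk::Heading;
            else if (!line.empty() && (line[0] == '*' || line[0] == '#')) c.kind = Chunk::ListLine;
            else if (!line.empty() && (line[0] == '{' || line[0] == '|')) c.kind = Chunk::TableLine;
            else                                                          c.kind = Chunk::Paragraph;
            c.text = line;
            out.push_back(c);
            pos = eol + 1;
        }
        return out;
    }

    int main() {
        std::string src = "== Heading ==\n* item with a [[link]]\nplain text\n";
        for (const Chunk &c : chop(src))
            std::cout << c.kind << ": " << c.text << "\n";
    }

In later refinement rounds the Paragraph and ListLine leaves would themselves be split further, until nothing unparsed remains.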
If you tell me what grammar you would like to start with I (and others) can probably help you with the bison code.
-- Jan Hidders
Jan Hidders wrote:
Magnus Manske wrote:
For the rest, I take the more practical approach here; (1) is what is allowed today, for the time being. I don't need a formal grammar for the manual parser, I hack away until it does what it should do.
But until you know the grammar you do not know what "it should do". Until you know the grammar you do not know what XML to produce, what the DTD is and what it means. That you can parse most of the current articles and produce well-formed XML is virtually meaningless unless you can tell the programmer of the code that processes the XML what the DTD is and what it means.
Well, the tags we can use are HTML (a limited set, most of them nestable), and the markup described at http://en.wikipedia.org/wiki/Wikipedia:How_to_edit_a_page
So, we can create internal and external links. Internal links can have parameters and sub-links, as can templates. IMHO, that pretty much defines what the XML should look like, or should be able to describe. Give my parser thingy a try, have a look at the output, and tell me what the XML it outputs cannot describe that our MediaWiki parser can (current bugs excluded :-).
If I know the input format (wiki markup) and the desired output (XML), I can create a black box that can do the translation.
Remember, we're not talking about the grammar of C# here. This is wiki markup. Links, bold, italics, lists, and tables, that's about it. I might not be able to solve a traveling salesman problem with 10000 cities in my head, but with four cities, it *is* rather trivial.
In the wiki2xml code, I don't even bother with a parse tree, just a little recursive call for nested links (images) and some lists for the table markup. Yeah! :-)
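Something along these lines, sketched from scratch for illustration (the parse() function and the <link> tag are assumptions, not the actual wiki2xml code or output):

    #include <iostream>
    #include <string>

    // Parses text starting at pos; inside a link it stops at the matching "]]".
    // Nested links, e.g. in image captions, are handled by the recursive call.
    std::string parse(const std::string &s, size_t &pos, bool inside_link) {
        std::string out;
        while (pos < s.size()) {
            if (s.compare(pos, 2, "[[") == 0) {
                pos += 2;
                out += "<link>" + parse(s, pos, true) + "</link>";
            } else if (inside_link && s.compare(pos, 2, "]]") == 0) {
                pos += 2;
                return out;   // closes the innermost open link
            } else {
                out += s[pos++];
            }
        }
        return out;
    }

    int main() {
        std::string src = "See [[Image:Foo.jpg|thumb|A [[link]] in the caption]] here.";
        size_t pos = 0;
        std::cout << parse(src, pos, false) << "\n";
    }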
Once we replace my black box with a lex/bison black box, we have to worry about the grammar.
It's not that simple. You cannot generate a parser for *any* formal grammar, so there is probably going to be some tweaking of the grammar (apart from other changes such as moving some of the parsing from or to the transformation phase) which will mean that probably the output XML will also change. Note that by "hacking away" as you are doing now you are in fact already implicitly choosing a grammar by the way that you produce your XML. Unless you make those choices explicit it's hard to tell if they are the right ones.
Good judgement comes from experience. Experience comes from bad judgement. ;-)
I just don't want to lose the momentum by establishing the grammar before doing anything else.
We don't have to. Note that we do not need *the* grammar to start. It would probably be better anyway to take a gradual refinement approach by which I mean here the following:
- You start with a very simple grammar that basically chops the
source into big semantic chunks. So you get a very small parse tree with leaves that contain large pieces of unparsed text. These leaves are transformed by the old code and then combined into the final output.
- You test whether, for most pages, you generate the same output as
before. If you are not happy with this you redefine the grammar and corresponding transformation until you are.
- Until the leaves no longer contain unparsed text you refine the
syntax such that the parse tree becomes bigger and the leaves smaller, then go back to step 2.
- Remove all old transformation code.
If you tell me what grammar you would like to start with I (and others) can probably help you with the bison code.
So, what you'd need is an EBNF representation or something? I've never worked with bison or the like. Is there a Windows version, by the way (yes, I *have* Linux, but currently I'm mostly under Windows)?
Magnus
On Monday 16 August 2004 19:35, Magnus Manske wrote:
So, what you'd need is an EBNF representation or something?
Yes, that is all I ask. A precise formal grammar in whatever notation you like, but preferably in the input format of bison. If only because that would document what it is that your parser exactly does. Note there are some requirements:
- it should accept all possible input strings,
- it should ideally do all the necessary parsing, i.e. from the parse tree we should be able to generate the output with a single tree-walk, and
- it should be unambiguous (and even LALR(1)) or have an explicit conflict resolution rule.
Just having the first two would already be nice, we could then work from there towards a grammar that satisfies the final requirement. To begin with you could simply give the grammar of what you think your current parser does.
I've never worked with bison or the like. Is there a Windows version, by the way (yes, I *have* Linux, but currently I'm mostly under Windows)?
Yes there is:
http://gnuwin32.sourceforge.net/packages/bison.htm
Are you familiar with the XML parser in PHP? That should give you an idea of how to place the code in the grammar.
-- Jan Hidders
Jan Hidders wrote:
On Monday 16 August 2004 19:35, Magnus Manske wrote:
So, what you'd need is an EBNF representation or something?
Yes, that is all I ask. A precise formal grammar in whatever notation you like, but preferably in the input format of bison. If only because that would document what it is that your parser exactly does. Note there are some requirements:
- it should accept all possible input strings
- it should ideally do all the necessary parsing i.e. from the parse tree we
should be able to generate the output with a single tree-walk, and
- it should be unambiguous (and even LALR(1)) or have an explicit conflict
resolution rule
I can't prove it at the moment, but after having played around with wikisyntax for a while, I'm *pretty* sure it's not LALR(1). There's too much reuse of symbols that requires potentially unbounded lookahead to resolve ambiguities. Even if it were, the LALR(1) grammar would be so unintuitive and unlike the functional intent of wikisyntax that it wouldn't be very useful as a reference.
I did some searching for general LR(n) (some call themselves GLR) parsers a while back, and none are too great. The two most promising are Elkhound from Berkeley [http://www.cs.berkeley.edu/~smcpeak/elkhound/] and Packrat parsing [http://www.pdos.lcs.mit.edu/~baford/packrat/]. The latter is much cleaner, and uses a "parsing expression grammar" (PEG), basically a CFG with ordered productions, to resolve ambiguity. However, it uses O(n) RAM where n is the size of the file to be parsed. Elkhound is more efficient, but I wasn't able to figure out how exactly to build something clean out of it.
So, good luck. =]
-Mark
Jan Hidders wrote:
Timwi wrote:
MediaWiki's syntax is perfectly LALR(1), with only one very important exception: lists.
?? Could you explain that last remark a bit more?
I am not sure what you are asking: do you want to know why I think MediaWiki's syntax is LALR(1), or do you want to know why lists are not?
I cannot give a definite answer to the first question other than "it has worked for me so far, and I have already thought about some of the things that are yet to come, and have not come up with any problems".
As for the second question, the problem with lists is that one token (e.g. '*') at the beginning of several lines can represent the same item:
* One
*# Two
*# Three
* Four
Although there are four lines and four '*'s, there are only three list items in the outermost list, with the second item containing a nested list spanning two lines. The inner list essentially has a '*' in the middle of it to indicate that it does not end here. In other words, there is no clearly-indicated beginning or end to the inner list, and the inner list has tokens inside it that do not belong to it.
I have solved this currently by first parsing it as
<listblock>
  <listline><listbullet /> One</listline>
  <listline><listbullet /><listnumbered /> Two</listline>
  <listline><listbullet /><listnumbered /> Three</listline>
  <listline><listbullet /> Four</listline>
</listblock>
and then calling a separate routine that transforms this into the desired structure.
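For illustration only, here is a from-scratch sketch of such a routine (the ListLine struct, the tag choice and the simplified nesting are assumptions, not Timwi's actual code): it walks the flat list lines and keeps a stack of the currently open bullet types.

    #include <iostream>
    #include <string>
    #include <vector>

    struct ListLine {
        std::string bullets;   // e.g. "*" or "*#"
        std::string text;      // already fully parsed inline content
    };

    void emit(const std::vector<ListLine> &lines) {
        std::string open;      // stack of open list levels, one bullet char per level
        auto tag = [](char b) { return b == '#' ? std::string("ol") : std::string("ul"); };
        for (const ListLine &l : lines) {
            // close levels whose bullet prefix no longer matches the current line
            while (open.size() > l.bullets.size() ||
                   (!open.empty() && open != l.bullets.substr(0, open.size()))) {
                std::cout << "</" << tag(open.back()) << ">\n";
                open.erase(open.size() - 1);
            }
            // open any new, deeper levels
            while (open.size() < l.bullets.size()) {
                char b = l.bullets[open.size()];
                std::cout << "<" << tag(b) << ">\n";
                open += b;
            }
            std::cout << "  <li>" << l.text << "</li>\n";
        }
        while (!open.empty()) {   // close whatever is still open at the end
            std::cout << "</" << tag(open.back()) << ">\n";
            open.erase(open.size() - 1);
        }
    }

    int main() {
        // simplified: a real renderer would nest the inner list inside the enclosing <li>
        emit({{"*", "One"}, {"*#", "Two"}, {"*#", "Three"}, {"*", "Four"}});
    }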
Notice that this does not defeat the purpose of a parser, in case you are thinking this. The above example uses "one" as an example for the contents of a "listline", but in reality it can contain bold, italics, links, etc. and the good thing about using a parser is that I have one node in my structure that represents this whole text block, already fully parsed. I do not need to care what it contains, and I do not need to worry that rearranging the nodes representing the list will in any way conflict or interfere with the contents of the node representing the text block.
I love parsing. Does it show? :-)
Timwi
Timwi wrote:
As for the second question, the problem with lists is that one token (e.g. '*') at the beginning of several lines can represent the same item:
* One
*# Two
*# Three
* Four
Although there are four lines and four '*'s, there are only three list items in the outermost list, with the second item containing a nested list spanning two lines. The inner list essentially has a '*' in the middle of it to indicate that it does not end here. In other words, there is no clearly-indicated beginning or end to the inner list, and the inner list has tokens inside it that do not belong to it.
Ah, of course. Never mind LALR(1). Recognizing a single block (items that all belong to the same list) is not possible with a context-free grammar. I think I'll keep proving that one as an exercise for my students. :-) But that tells us that a fancier parser generator is not going to solve that problem for us.
I have solved this currently by first parsing it as
<listblock>
  <listline><listbullet /> One</listline>
  <listline><listbullet /><listnumbered /> Two</listline>
  <listline><listbullet /><listnumbered /> Three</listline>
  <listline><listbullet /> Four</listline>
</listblock>
and then calling a separate routine that transforms this into the desired structure.
Afterwards? Could you not have a global variable that contains a stack of bullet types (of the preceding line) and do this while parsing? I suspect you now have to materialize the parse tree, no?
I love parsing. Does it show? :-)
It does. :-)
-- Jan Hidders
Jan Hidders wrote:
I have solved this currently by first parsing it as
<listblock>
  <listline><listbullet /> One</listline>
  <listline><listbullet /><listnumbered /> Two</listline>
  <listline><listbullet /><listnumbered /> Three</listline>
  <listline><listbullet /> Four</listline>
</listblock>
and then calling a separate routine that transforms this into the desired structure.
Afterwards? Could you not have a global variable that contains a stack of bullet types (of the preceding line) and do this while parsing? I suspect you now have to materialize the parse tree, no?
You would have to "materialise" some stuff in memory anyway. I don't think "materialising" the 'ListBlock', 'ListLine' and 'ListBullet' objects (and freeing them again later) is a particularly significant performance loss. If you think it is, of course, you are free to modify my parser accordingly when it's finished. :-)
Timwi
Delirium wrote:
I can't prove it at the moment, but after having played around with wikisyntax for a while, I'm *pretty* sure it's not LALR(1). There's too much reuse of symbols that requires potentially unbounded lookahead to resolve ambiguities. Even if it were, the LALR(1) grammar would be so unintuitive and unlike the functional intent of wikisyntax that it wouldn't be very useful as a reference.
It is my experience that you never really know until you have actually looked. We also have the option of going with the usual disambiguation rules in yacc/bison.
I did some searching for general LR(n) (some call themselves GLR) parsers a while back, and none are too great. The two most promising are Elkhound from Berkeley [http://www.cs.berkeley.edu/~smcpeak/elkhound/] and Packrat parsing [http://www.pdos.lcs.mit.edu/~baford/packrat/]. The latter is much cleaner, and uses a "parsing expression grammar" (PEG), basically a CFG with ordered productions, to resolve ambiguity. However, it uses O(n) RAM where n is the size of the file to be parsed. Elkhound is more efficient, but I wasn't able to figure out how exactly to build something clean out of it.
You missed one: bison. :-) And indeed that is also an option that we have: go with a GLR parser.
-- Jan Hidders
Jan Hidders wrote:
It is my experience that you never really know until you have actually looked. We also have the option of going with the usual disambiguation rules in yacc/bison.
The default disambiguation rule for shift/reduce conflicts in bison is to do the shift. So far, I have crafted my grammar in such a way that this is always the correct thing to do. I think this is always possible.
The default disambiguation rule for reduce/reduce conflicts is to reduce the rule that is mentioned earlier in the source file. So far, I have not had a reduce/reduce conflict, but clearly, whenever this disambiguation rule produces something other than the desired result, all you need to do is to change the order of the rules.
Timwi