-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Here it is! The millionth pseudo-parser I wrote for wiki(p|m)edia! :-)
Written as a single class, it takes a MediaWiki-style markuped (is that a word?) source and generates the XML flavor Timwi and I have been using in all our unfinished projects! ;-)
Try it out at
http://www.magnusmanske.de/wikipedia/wiki2xml.php
Just paste a wiki source text in, and get the XML. As you will notice, it wasn't written for speed.
It is not a "real" parser, but the structure is similar to what a parser generator would make, except for a few shortcuts here and there.
This could be the heart of a *real* export function. Just write an XML-to-PDF generator (and replace the templates, and get rid of the categories and language links) and you're done! :-)
Magnus
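The clean-up step mentioned above, dropping categories and language links before handing the text to an exporter, could look roughly like the following Python sketch. The function name `strip_meta_links` and both regular expressions are assumptions for illustration and not part of wiki2xml; in particular, treating any two- or three-letter lowercase prefix as a language code is a simplification.

```python
import re

# Assumption: a category link looks like [[Category:Name]] or
# [[Category:Name|sort key]], and a language link like [[xx:Title]] with a
# two- or three-letter lowercase prefix. Real wikis know more prefixes.
CATEGORY_LINK = re.compile(r"\[\[Category:[^\]|]*(?:\|[^\]]*)?\]\]\n?", re.IGNORECASE)
LANGUAGE_LINK = re.compile(r"\[\[[a-z]{2,3}:[^\]|]+\]\]\n?")


def strip_meta_links(wikitext: str) -> str:
    """Remove category and language links, leaving all other links alone."""
    without_categories = CATEGORY_LINK.sub("", wikitext)
    return LANGUAGE_LINK.sub("", without_categories)


if __name__ == "__main__":
    sample = "Mainz is a city.\n[[Category:Cities]]\n[[de:Mainz]]\nSee [[Rhine]].\n"
    print(strip_meta_links(sample))
```

Links written with a leading colon, such as [[:de:Mainz]], are left untouched by this sketch because they render as ordinary inline links.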
On Apr 7, 2005 8:58 PM, Magnus Manske magnus.manske@web.de wrote:
> Here it is! The millionth pseudo-parser I wrote for wiki(p|m)edia! :-)
> Written as a single class, it takes a MediaWiki-style markuped (is that a word?) source and generates the XML flavor Timwi and I have been using in all our unfinished projects! ;-)
> Try it out at
I get this for a longer text:
<?xml version='1.0' encoding='UTF-8' ?>
<article>ERROR!</article>
Is this a timeout or does the parser stop on unbalanced tags?
Axel wrote:
> I get this for a longer text:
> <?xml version='1.0' encoding='UTF-8' ?>
> <article>ERROR!</article>
> Is this a timeout or does the parser stop on unbalanced tags?
I just noticed I never wrote any code to handle the table row element "|--" ;-)
It was only a few lines and is online now; maybe it choked on that. The error message is a "last resort" of the parser, telling me I didn't catch all cases.
Please try again; if the error keeps coming, please let me know which text you used, so I can debug.
Disclaimer: The ";" markup and the table caption are not done yet.
Magnus
On Apr 7, 2005 10:15 PM, Magnus Manske magnus.manske@web.de wrote:
> Please try again; if the error keeps coming, please let me know which text you used, so I can debug.
Same problem again. You can use this old text: http://www.bliki.info/test.txt
Axel wrote:
> On Apr 7, 2005 10:15 PM, Magnus Manske magnus.manske@web.de wrote:
>> Please try again; if the error keeps coming, please let me know which text you used, so I can debug.
> Same problem again. You can use this old text: http://www.bliki.info/test.txt
Fixed. There was a heading like "== stuff == " (note the final space), which, strictly speaking, is broken (yes, I'll make it recognize it anyway soon).
So the parser tried to fall back to rendering it as a "normal" line (not as a heading), which it couldn't do because the line starts with a "=", which indicates a header line.
I've added a fallback to force it to render as normal text now. For what it's worth, I've just successfully converted [[de:Mainz]], >140KB of text. Took 52 seconds but, well...
Magnus
Magnus Manske wrote:
>> Same problem again. You can use this old text: http://www.bliki.info/test.txt
> Fixed. There was a heading like "== stuff == " (note the final space), which, strictly speaking, is broken (yes, I'll make it recognize it anyway soon).
> So the parser tried to fall back to rendering it as a "normal" line (not as a heading), which it couldn't do because the line starts with a "=", which indicates a header line.
> I've added a fallback to force it to render as normal text now. For what it's worth, I've just successfully converted [[de:Mainz]], >140KB of text. Took 52 seconds but, well...
A suggestion on this kind of error:
I think the best behaviour is to try to work out what the user intended, but not to correct it in the parser. Without a formal definition, the parser serves as the reference for the language, so anything it doesn't mark as an error becomes valid syntax.
In my parser I output errors like this:

This is <b>not well formed

<paragraph lineno="0" charno="0">
  This is
  <wikiwyg:syntax-error type="close missing">
    <html:b/>
  </wikiwyg:syntax-error>
  not well formed
</paragraph>

and:

There is no such character entity as &wumpus;

<paragraph lineno="0" charno="0">
  There is no such character entity as
  <syntax-error type="unknown entity" name="wumpus">
    <char-ref name="wumpus"/>
  </syntax-error>
</paragraph>
So I detect what the user was trying to do, but do not correct it. Correction is done in a layer between parser and presenter.
Firstly, this provides a separation of concerns that, if done right, can make the parser much simpler. It also allows a user to choose to have syntax errors left uncorrected and shown on the page, so that imperfect articles can very easily be seen and fixed. See:
http://users.aber.ac.uk/jqh1/err_bold.png
http://users.aber.ac.uk/jqh1/err_wumpus.png
Jim
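To make the second example concrete, here is a minimal Python sketch of a parser step that reports an unknown character entity without correcting it. The function name `mark_entities` is hypothetical, and the HTML entity table from Python's standard library stands in for whatever entity list the real parser uses.

```python
import re
from html.entities import name2codepoint
from xml.sax.saxutils import escape, quoteattr

ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def mark_entities(text: str) -> str:
    """Turn &name; references into <char-ref/> elements, flagging unknown names."""
    parts = []
    pos = 0
    for match in ENTITY.finditer(text):
        parts.append(escape(text[pos:match.start()]))
        name = match.group(1)
        char_ref = f"<char-ref name={quoteattr(name)}/>"
        if name in name2codepoint:
            parts.append(char_ref)
        else:
            # Report the problem but keep what the user typed; fixing it is
            # left to a later corrector stage.
            parts.append(
                f'<syntax-error type="unknown entity" name={quoteattr(name)}>'
                f"{char_ref}</syntax-error>"
            )
        pos = match.end()
    parts.append(escape(text[pos:]))
    return "".join(parts)


if __name__ == "__main__":
    print(mark_entities("There is no such character entity as &wumpus;"))
```

Because the unknown reference is wrapped rather than dropped, a later stage can still decide whether to repair it, ignore it, or highlight it.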
-- visit my new wiki engine - http://81.5.150.113/wysi
I have uploaded a new version with bugfixes and major speed improvements. It now parses about 20KB of wiki text per second on my machine (which is still slower than our current parser, as I am well aware).
Jim Higson wrote:
> A suggestion on this kind of error:
> I think the best behaviour is to try to work out what the user intended, but not to correct it in the parser. Without a formal definition, the parser serves as the reference for the language, so anything it doesn't mark as an error becomes valid syntax.
That is a good idea, but it depends on the user getting direct feedback from the parser. Unless I can make mine orders of magnitude faster, it probably won't become our "live" default parser.
That means that, whatever it ends up being used for, the result should look like the one from our "official" parser. When you export a nice-looking wiki page to (e.g.) PDF, you don't want to lose headings in the process. Otherwise you'd have to look through all of the output carefully (bugs might be less visible than missing headings) to verify the parser.
Of course, if this were ever adopted as the "official machine", it would be a different situation altogether.
Magnus
Answering myself, I now have a class variable that determines how these cases are handled. If it is set to "don't tolerate", it inserts
<error type="heading" reason="trailing blank"/>
for this specific error, and renders the line as plain text.
Magnus
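A switch like this might be sketched as follows in Python. The class name `HeadingParser`, the attribute `tolerate_errors`, and the regular expression are all assumptions for illustration; the actual wiki2xml class is not shown in this thread. That the tolerant setting accepts the heading despite the trailing blank is likewise an assumption.

```python
import re
from xml.sax.saxutils import escape

# "== stuff ==" with any trailing blanks captured separately in group 3.
HEADING = re.compile(r"^(={1,6})(.+?)\1([ \t]*)$")


class HeadingParser:
    # Hypothetical class variable: True accepts slightly broken headings,
    # False reports them and falls back to plain text.
    tolerate_errors = True

    def parse_line(self, line: str) -> str:
        match = HEADING.match(line)
        if match is None:
            return escape(line)  # not a heading at all
        level = len(match.group(1))
        title = escape(match.group(2).strip())
        heading = f'<heading level="{level}">{title}</heading>'
        if match.group(3) == "" or self.tolerate_errors:
            return heading
        # "Don't tolerate": emit the error marker, then the line as plain text.
        return '<error type="heading" reason="trailing blank"/>' + escape(line)


if __name__ == "__main__":
    strict = HeadingParser()
    strict.tolerate_errors = False
    print(strict.parse_line("== stuff == "))
```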
_______________________________________________ Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l
Magnus Manske wrote:
> Answering myself, I now have a class variable that determines how these cases are handled. If it is set to "don't tolerate", it inserts
> <error type="heading" reason="trailing blank"/>
This is a bit similar to my output. Given input such as:
==heading text== mess
blah blah

I output:

<heading-line>
  <heading level="2"> heading text </heading>
  <wikiwyg:syntax-error type="extra content"> mess </wikiwyg:syntax-error>
</heading-line>
<paragraph> blah blah </paragraph>
This is designed to try to capture what the user wanted to do. It could be corrected by appending the content of the syntax-error onto the next heading, or the presenter could ignore the error or highlight it. Either way, the problem is kept pretty local and nothing much is missing from the page.
Trailing space is just a special case in which the syntax-error tag would contain a text node with just whitespace.
My corrector changes the XML to this:

<heading level="2"> heading text </heading>
<paragraph>
  <wikiwyg:syntax-correction> blah blah </wikiwyg:syntax-correction>
</paragraph>
And then the presenter treats syntax-correction as DOM DocumentFragments.
-- Jim
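One way such a corrector stage could work is sketched below in Python with the standard `xml.etree.ElementTree` module. This is not the wikiwyg implementation: tag names are written without the `wikiwyg:` prefix to avoid namespace handling, and moving the stray text to the start of the following paragraph is just one possible correction policy chosen for the example.

```python
import xml.etree.ElementTree as ET


def correct(article: ET.Element) -> ET.Element:
    """Replace each heading-line by its heading and move stray text onward."""
    for position, node in enumerate(list(article)):
        if node.tag != "heading-line":
            continue
        heading = node.find("heading")
        error = node.find("syntax-error")
        if heading is None:
            continue  # nothing sensible to promote; leave the node as it is
        article[position] = heading  # one-for-one swap keeps positions stable
        stray = (error.text or "").strip() if error is not None else ""
        following = article[position + 1] if position + 1 < len(article) else None
        # Whitespace-only errors (a trailing blank) need no further action.
        # Stray text with no paragraph after it is simply dropped here.
        if stray and following is not None and following.tag == "paragraph":
            fix = ET.Element("syntax-correction")
            fix.text = stray
            fix.tail = " " + (following.text or "")
            following.text = None
            following.insert(0, fix)
    return article


if __name__ == "__main__":
    parsed = ET.fromstring(
        '<article><heading-line><heading level="2">heading text</heading>'
        '<syntax-error type="extra content">mess</syntax-error></heading-line>'
        "<paragraph>blah blah</paragraph></article>"
    )
    print(ET.tostring(correct(parsed), encoding="unicode"))
```

A presenter that prefers to highlight errors would simply skip this stage and render the `syntax-error` elements itself.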
Your many discussions are nice to read, but wouldn't it be better to open a page for this subject on the meta wiki?
KISS - keep it short and simple
Tom
Magnus Manske wrote:
> I have uploaded a new version with bugfixes and major speed improvements. It now parses about 20KB of wiki text per second on my machine (which is still slower than our current parser, as I am well aware).
> Jim Higson wrote:
>> A suggestion on this kind of error:
>> I think the best behaviour is to try to work out what the user intended, but not to correct it in the parser. Without a formal definition, the parser serves as the reference for the language, so anything it doesn't mark as an error becomes valid syntax.
> That is a good idea, but it depends on the user getting direct feedback from the parser. Unless I can make mine orders of magnitude faster, it probably won't become our "live" default parser.
On the subject of speed you might be interested in some of the optimisations I made to my parser, which made it 3 times as fast. I've uploaded a small section from my writing here:
http://users.aber.ac.uk/jqh1/optimisations.pdf
Basically, I'm subsetting the language based on some quick-to-run tests and only using possible grammars at each stage.
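The idea of running cheap tests first and then only trying the constructs that can actually occur might be illustrated like this. The Python sketch below uses regular-expression rewrites rather than grammars, and the rule table, element names, and function names are invented for the example; the linked PDF describes the actual optimisations.

```python
import re
from typing import List, NamedTuple, Pattern


class Rule(NamedTuple):
    trigger: str          # cheap substring test
    pattern: Pattern[str]  # more expensive rewrite, only run if triggered
    replacement: str


# Hypothetical rule table; bold must come before italics.
RULES = [
    Rule("'''", re.compile(r"'''(.+?)'''"), r"<bold>\1</bold>"),
    Rule("''", re.compile(r"''(.+?)''"), r"<italics>\1</italics>"),
    Rule("[[", re.compile(r"\[\[([^\]|]+)\]\]"), r'<link target="\1"/>'),
]


def active_rules(text: str) -> List[Rule]:
    """Keep only the rules whose trigger occurs somewhere in the text."""
    return [rule for rule in RULES if rule.trigger in text]


def convert(text: str) -> str:
    """Apply just the rules that passed the quick test."""
    for rule in active_rules(text):
        text = rule.pattern.sub(rule.replacement, text)
    return text


if __name__ == "__main__":
    print(convert("''i'' and [[Mainz]]"))
```

For a line of plain text, `active_rules` returns an empty list, so none of the regular expressions are run at all.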
> That means that, whatever it ends up being used for, the result should look like the one from our "official" parser. When you export a nice-looking wiki page to (e.g.) PDF, you don't want to lose headings in the process. Otherwise you'd have to look through all of the output carefully (bugs might be less visible than missing headings) to verify the parser.
I'm not saying errors shouldn't be corrected, just that they shouldn't be corrected in the parser. A "parser -> corrector -> PDF maker" architecture wouldn't lose the headings.
Marking syntax errors visually would just be for the website, and then only for people who want to see them (so being a janitor is easy!)
> Of course, if this were ever adopted as the "official machine", it would be a different situation altogether.
> Magnus
Thanks a lot for this tool. I'm not sure I've got a use for it yet (don't count on me for any PDF template), but I'm sure any XML fan will like it.
Do you mind if I put up a link to your site? Do you intend to change its location/address?
François
Magnus Manske wrote:
> Here it is! The millionth pseudo-parser I wrote for wiki(p|m)edia! :-)
> Try it out at
So whatever became of the flex/bison parser, then? Has it been abandoned because nobody can make sense of my code? :-)
Anyway, I've found a bug in your yanrap (Yet Another Not-Really-A-Parser):
Input:
* one
** one point one
* two

Output:
<?xml version='1.0' encoding='UTF-8' ?>
<article rendertime='0.0019750595092773 sec'><list type='bullet'><listitem>one<list type='bullet'><listitem>one point one</listitem></list>two</listitem></list></article>

Should have been:
<?xml version='1.0' encoding='UTF-8' ?>
<article rendertime='0.0019750595092773 sec'><list type='bullet'><listitem>one<list type='bullet'><listitem>one point one</listitem></list></listitem><listitem>two</listitem></list></article>
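For reference, the expected nesting can be produced with a simple depth counter. The following Python sketch is not taken from wiki2xml (the function name `bullets_to_xml` is made up) and only handles lines that start with one or more `*`.

```python
from typing import Iterable
from xml.sax.saxutils import escape


def bullets_to_xml(lines: Iterable[str]) -> str:
    """Convert '*'-prefixed lines into nested <list>/<listitem> elements."""
    out = []
    depth = 0  # number of <list> elements currently open
    for line in lines:
        text = line.lstrip("*")
        level = len(line) - len(text)
        if level == 0:
            raise ValueError(f"not a bullet line: {line!r}")
        if level > depth:
            # Open nested lists inside the list item that is still open.
            while depth < level:
                out.append("<list type='bullet'>")
                depth += 1
                if depth < level:
                    out.append("<listitem>")  # filler item for a skipped level
        else:
            # Close the previous item, then any deeper lists and their parents.
            out.append("</listitem>")
            while depth > level:
                out.append("</list></listitem>")
                depth -= 1
        out.append("<listitem>" + escape(text.strip()))
    while depth > 0:
        out.append("</listitem></list>")
        depth -= 1
    return "".join(out)


if __name__ == "__main__":
    print(bullets_to_xml(["* one", "** one point one", "* two"]))
```

The key point is that returning from a deeper level closes both the inner list and the item that contained it before the next item is opened.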
Timwi wrote:
> Magnus Manske wrote:
>> Here it is! The millionth pseudo-parser I wrote for wiki(p|m)edia! :-)
>> Try it out at
> So whatever became of the flex/bison parser, then? Has it been abandoned because nobody can make sense of my code? :-)
AFAIK, I was the only one who tried, and I'm scared :-)
Every time I changed something in the bison code, it didn't do as I expected, even when I was sure I got it right. Somehow Flex/Bison and I are incompatible...
> Anyway, I've found a bug in your yanrap (Yet Another Not-Really-A-Parser):
I'll get to that...
Thanks,
Magnus
wikitech-l@lists.wikimedia.org