Wiki-to-XML pseudo-parser in PHP


Magnus Manske

7 Apr 2005

8:58 p.m.


Here it is! The millionth pseudo-parser I wrote for wiki(p|m)edia! :-)

Written as a single class, it takes a MediaWiki-style markuped (is that a word?) source and generates the XML flavor Timwi and I have been using in all our unfinished projects! ;-)

Try it out at

http://www.magnusmanske.de/wikipedia/wiki2xml.php

Just paste a wiki source text in, and get the XML. As you will notice, it wasn't written for speed.

It is not a "real" parser, but the structure is similar to what a parser generator would make, except taking a few shortcuts here and there.

This could be the heart of a *real* export function. Just write an XML-to-PDF generator (and replace the templates, and get rid of the categories and language links) and you're done! :-)

Magnus


Axel

7 Apr

9:21 p.m.

On Apr 7, 2005 8:58 PM, Magnus Manske magnus.manske@web.de wrote:

...


Here it is! The millionth pseudo-parser I wrote for wiki(p|m)edia! :-)

Written as a single class, it takes a MediaWiki-style markuped (is that a word?) source and generates the XML flavor Timwi and I have been using in all our unfinished projects! ;-)

Try it out at

I get this for a longer text:

<?xml version='1.0' encoding='UTF-8' ?>
<article>ERROR!</article>

Is this a timeout or does the parser stop on unbalanced tags?

-- Axel Kramer http://www.phpeclipse.de - PHP Eclipse Plugin http://www.plog4u.org - Wikipedia Eclipse Plugin

Magnus Manske

10:15 p.m.

Axel wrote:

...

I get this for a longer text:

<?xml version='1.0' encoding='UTF-8' ?>

<article>ERROR!</article>

Is this a timeout or does the parser stop on unbalanced tags?

I just noticed I didn't enter any code for doing the table row element "|--" ;-)

It was only a few lines and is online now; maybe it choked on that. The error message is a "last resort" of the parser, telling me I didn't catch all cases.

Please try again; if the error keeps coming, please let me know which text you used, so I can debug.

Disclaimer: The ";" markup and the table caption are not done yet.

Magnus


Axel

10:28 p.m.

On Apr 7, 2005 10:15 PM, Magnus Manske magnus.manske@web.de wrote:

...

Please try again; if the error keeps coming, please let me know which text you used, so I can debug.

Same problem again. You can use this old text: http://www.bliki.info/test.txt

-- Axel Kramer http://www.phpeclipse.de - PHP Eclipse Plugin http://www.plog4u.org - Wikipedia Eclipse Plugin

Magnus Manske

11:17 p.m.

Axel wrote:

...

On Apr 7, 2005 10:15 PM, Magnus Manske magnus.manske@web.de wrote:

...
Please try again; if the error keeps coming, please let me know which text you used, so I can debug.

Same problem again. You can use this old text: http://www.bliki.info/test.txt

Fixed. There was a heading like "== stuff == " (note the final space), which, strictly speaking, is broken (yes, I'll make it recognize it anyway soon).

So the parser tried to fall back to rendering it as a "normal" line (not as a heading), which it couldn't do because the line starts with a "=", which indicates a header line.

I've added a fallback to force it to render as normal text now. For what it's worth, I've just successfully converted [[de:Mainz]], >140KB of text. Took 52 seconds but, well...
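The fallback described here can be illustrated with a minimal Python sketch. This is not the code of wiki2xml.php; the function name `render_line` and the `heading`/`paragraph` element names are assumptions made for illustration. A strict heading rule rejects a line with a trailing space, and the renderer then emits the line as plain text instead of giving up.

```python
import re
from xml.sax.saxutils import escape

# Strict heading rule: the closing "=" run must end the line exactly,
# so "== stuff == " (trailing space) does not qualify.
HEADING = re.compile(r"^(={1,6})(.+?)\1$")


def render_line(line: str) -> str:
    """Render one line of wiki text as an XML fragment (hypothetical element names)."""
    match = HEADING.match(line)
    if match:
        level = len(match.group(1))
        text = escape(match.group(2).strip())
        return f"<heading level='{level}'>{text}</heading>"
    # Fallback: anything that is not a well-formed heading is emitted as
    # plain text, even when it starts with "=", so no line is left unhandled.
    return f"<paragraph>{escape(line)}</paragraph>"
```

With this sketch, `render_line("== stuff ==")` yields a heading element, while `render_line("== stuff == ")` falls through to the plain-text branch.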

Magnus


Jim Higson

8 Apr

10:59 a.m.

Magnus Manske wrote:

...

...
Same problem again. You can use this old text: http://www.bliki.info/test.txt

Fixed. There was a heading like "== stuff == " (note the final space), which, strictly speaking, is broken (yes, I'll make it recognize it anyway soon).

So the parser tried to fall back to rendering it as a "normal" line (not as a heading), which it couldn't do because the line starts with a "=", which indicates a header line.

I've added a fallback to force it to render as normal text now. For what it's worth, I've just successfully converted [[de:Mainz]], >140KB of text. Took 52 seconds but, well...

A suggestion on this kind of error:

I think the best behaviour is to try to work out what the user intended, but not correct it in the parser, because without formal definition and when a parser is used as the reference of the language anything it doesn't mark as an error becomes valid syntax.

In my parser I output errors like this:

This is <b>not well formed

<paragraph lineno="0" charno="0">
  This is
  <wikiwyg:syntax-error type="close missing">
    <html:b/>
  </wikiwyg:syntax-error>
  not well formed
</paragraph>

and:

There is no such character entity as &wumpus;

<paragraph lineno="0" charno="0">
  There is no such character entity as
  <syntax-error type="unknown entity" name="wumpus">
    <char-ref name="wumpus"/>
  </syntax-error>
</paragraph>

So I detect what the user was trying to do, but not correct them. Correction is done in a layer between parser and presenter.

Firstly, this provides a separation of interests that if done right can make the parser much simpler, but it also allows a user to choose to have syntax errors not corrected and shown on the page so that imperfect articles can very easily be seen and fixed. See:

http://users.aber.ac.uk/jqh1/err_bold.png http://users.aber.ac.uk/jqh1/err_wumpus.png
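The "detect but do not correct" idea can be sketched in Python for the unknown-entity case. This is only an illustration, not Jim's parser: the function name `annotate_entities` is hypothetical, known entities are assumed to become bare `char-ref` elements, and the `lineno`/`charno` attributes are omitted. Known names are looked up in Python's standard `html.entities` table.

```python
import re
from html.entities import name2codepoint
from xml.sax.saxutils import escape, quoteattr

ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def annotate_entities(text: str) -> str:
    """Mark unknown named entities with a syntax-error element; repair nothing."""
    parts = []
    pos = 0
    for match in ENTITY.finditer(text):
        parts.append(escape(text[pos:match.start()]))
        name = match.group(1)
        ref = f"<char-ref name={quoteattr(name)}/>"
        if name in name2codepoint:
            parts.append(ref)
        else:
            # Record what the user tried to do; a later layer decides
            # whether to correct, ignore, or highlight it.
            parts.append(
                f'<syntax-error type="unknown entity" name={quoteattr(name)}>'
                f"{ref}</syntax-error>"
            )
        pos = match.end()
    parts.append(escape(text[pos:]))
    return "<paragraph>" + "".join(parts) + "</paragraph>"
```

Because the parser only annotates, a presenter can either show the error on the page or hand the tree to a separate corrector first.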

Jim

-- visit my new wiki engine - http://81.5.150.113/wysi

Magnus Manske

12:47 p.m.

I have uploaded a new version with bugfixes and major speed improvements. It now parses about 20KB of wiki text per second on my machine (which is still slower than our current parser, as I am well aware).

Jim Higson wrote:

...

A suggestion on this kind of error:

I think the best behaviour is to try to work out what the user intended, but not correct it in the parser, because without formal definition and when a parser is used as the reference of the language anything it doesn't mark as an error becomes valid syntax.

That is a good idea, but it depends on the user getting direct feedback from the parser. But unless I can make mine orders of magnitude faster, it probably won't become our "live" default parser.

That means for whatever purpose it will be used, the result should look like the one from our "official" parser. When you export a nice-looking wiki page into (e.g.) PDF, you don't want to lose headings in the process. You'd have to look through all of the output carefully (bugs might be less visible than missing headings), verifying the parser.

Of course, if this would ever be adopted as the "official machine", it would be a different situation altogether.

Magnus


Magnus Manske

1:11 p.m.


Answering myself, I now have a class variable to determine how to handle these cases. If it is set to "don't tolerate", it inserts

<error type="heading" reason="trailing blank"/>

for this specific error, and renders it as plain text.
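A class-level switch of this kind might look like the following Python sketch. The class name `HeadingParser`, the attribute name `tolerate_errors`, and the `heading` element are hypothetical; only the `<error type="heading" reason="trailing blank"/>` element is taken from this thread.

```python
import re
from xml.sax.saxutils import escape


class HeadingParser:
    # Class-level switch (hypothetical name): True accepts sloppy headings,
    # False flags them and renders the line as plain text.
    tolerate_errors = True

    _heading = re.compile(r"^(={1,6})(.+?)\1(\s*)$")

    def render(self, line: str) -> str:
        match = self._heading.match(line)
        if not match:
            return escape(line)
        marks, text, trailing = match.groups()
        if trailing and not self.tolerate_errors:
            # "Don't tolerate": emit the error marker, then the raw line.
            return '<error type="heading" reason="trailing blank"/>' + escape(line)
        return f"<heading level='{len(marks)}'>{escape(text.strip())}</heading>"
```

Setting `HeadingParser.tolerate_errors = False` changes the behaviour of every instance at once, which is the point of using a class variable for the policy.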

Magnus


_______________________________________________ Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l


Jim Higson

1:41 p.m.

Magnus Manske wrote:

...


Answering myself, I have now a class variable to determine how to handle these cases. If it is set to "don't tolerate", it inserts

<error type="heading" reason="trailing blank"/>

This is a bit similar to my output. Given input such as:

==heading text== mess
blah blah

I output:

<heading-line>
  <heading level="2"> heading text </heading>
  <wikiwyg:syntax-error type="extra content"> mess </wikiwyg:syntax-error>
</heading-line>
<paragraph> blah blah </paragraph>

This is designed to try to capture what the user wanted to do. This could be corrected by appending the content of the syntax-error onto the next heading, or the presenter could ignore the error or highlight it. Either way the problem is kept pretty local and nothing much is missing from the page.

Trailing space is just a special case in which the syntax-error tag would contain a text node with just whitespace.

My corrector changes the XML to this:

<heading level="2"> heading text </heading>
<paragraph>
  <wikiwyg:syntax-correction> blah blah </wikiwyg:syntax-correction>
</paragraph>

And then the presenter treats syntax-correction as DOM DocumentFragments.
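A corrector pass between parser and presenter could be sketched in Python with `xml.etree.ElementTree`. This is an illustration under loud assumptions, not Jim's corrector: the element names are used without the `wikiwyg:` prefix to avoid namespace setup, the function name `correct` is hypothetical, and folding the stray text into the start of the following paragraph's correction is only one possible reading of the correction described in this message.

```python
import xml.etree.ElementTree as ET


def correct(root: ET.Element) -> ET.Element:
    """Replace each heading-line by its heading; fold stray text into the next paragraph."""
    for node in list(root):
        if node.tag != "heading-line":
            continue
        siblings = list(root)
        position = siblings.index(node)
        heading = node.find("heading")
        error = node.find("syntax-error")
        stray = (error.text or "").strip() if error is not None else ""
        root.remove(node)
        if heading is not None:
            heading.tail = None
            root.insert(position, heading)
        following = siblings[position + 1] if position + 1 < len(siblings) else None
        if stray and following is not None and following.tag == "paragraph":
            # Assumption: the correction wraps the stray text plus the
            # paragraph's leading text; child elements are left untouched.
            fix = ET.Element("syntax-correction")
            fix.text = f"{stray} {(following.text or '').strip()}".strip()
            following.text = None
            following.insert(0, fix)
    return root
```

The parser output stays untouched on disk; the presenter only ever sees the corrected tree, or the uncorrected one if the reader asks to see syntax errors.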

-- Jim

Thomas Gries

5:27 p.m.

Your many discussions are nice to read, but wouldn't it be better to open a page for this subject on the meta wiki?

KISS - keep it short and simple

Tom

Jim Higson

1:25 p.m.

Magnus Manske wrote:

...

I have uploaded a new version with bugfixes and major speed improvements. It now parses about 20KB of wiki text per second on my machine (which is still slower than our current parser, as I am well aware).

Jim Higson wrote:

...
A suggestion on this kind of error:

I think the best behaviour is to try to work out what the user intended, but not correct it in the parser, because without formal definition and when a parser is used as the reference of the language anything it doesn't mark as an error becomes valid syntax.

...

That is a good idea, but it depends on the user getting direct feedback from the parser. But unless I can make mine orders of magnitude faster, it probably won't become our "live" default parser.

On the subject of speed you might be interested in some of the optimisations I made to my parser, which made it 3 times as fast. I've uploaded a small section from my writing here:

http://users.aber.ac.uk/jqh1/optimisations.pdf

Basically, I'm subsetting the language based on some quick-to-run tests and only using possible grammars at each stage.
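The general idea of subsetting the grammar with a cheap test can be illustrated with a small Python sketch. It does not reproduce the optimisations in the linked PDF; the rule functions, the dispatch table, and the output format are all hypothetical. Here the first character of a line decides which rules are even attempted.

```python
import re
from typing import Callable, Optional

Rule = Callable[[str], Optional[str]]


def heading(line: str) -> Optional[str]:
    m = re.match(r"^(=+)\s*(.+?)\s*\1\s*$", line)
    return f"heading:{len(m.group(1))}:{m.group(2)}" if m else None


def bullet(line: str) -> Optional[str]:
    m = re.match(r"^(\*+)\s*(.*)$", line)
    return f"bullet:{len(m.group(1))}:{m.group(2)}" if m else None


def paragraph(line: str) -> Optional[str]:
    return f"paragraph:{line}"


# Quick-to-run test: the first character selects the rules that can
# possibly match, so the other rules are never tried for this line.
RULES_BY_FIRST_CHAR = {
    "=": [heading, paragraph],
    "*": [bullet, paragraph],
}


def parse_line(line: str) -> str:
    candidates = RULES_BY_FIRST_CHAR.get(line[:1], [paragraph])
    for rule in candidates:
        result = rule(line)
        if result is not None:
            return result
    raise AssertionError("paragraph accepts every line")
```

For ordinary prose lines only the `paragraph` rule runs, which is where most of the saving would come from in a text that is mostly prose.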

...

That means for whatever purpose it will be used, the result should look like the one from our "official" parser. When you export a nice-looking wiki page into (e.g.) PDF, you don't want to lose headings in the process. You'd have to look through all of the output carefully (bugs might be less visible than missing headings), verifying the parser.

I'm not saying errors shouldn't be corrected, just that they shouldn't be corrected in the parser. A "parser -> corrector -> PDF maker" architecture wouldn't lose the headings.

Marking syntax errors visually would just be for the website, and then only for people who want to see them (so being a janitor is easy!)

...

Of course, if this would ever be adopted as the "official machine", it would be a different situation altogether.

Magnus

FxParlant

7 Apr

11:21 p.m.

Thanks a lot for this tool, I'm not sure I've got a use for it yet (don't count on me for any pdf template), but I'm sure any xml fan will like it.

Do you mind if I put up a link to your site? Do you intend to change its location/address?

François


Timwi

13 Apr

4:51 p.m.

Magnus Manske wrote:

...

Here it is! The millionth pseudo-parser I wrote for wiki(p|m)edia! :-)

Try it out at

http://www.magnusmanske.de/wikipedia/wiki2xml.php

So whatever became of the flex/bison parser, then? Has it been abandoned because nobody can make sense of my code? :-)

Anyway, I've found a bug in your yanrap (Yet Another Not-Really-A-Parser):

Input:

* one
** one point one
* two

Output:

<?xml version='1.0' encoding='UTF-8' ?>
<article rendertime='0.0019750595092773 sec'><list type='bullet'><listitem>one<list type='bullet'><listitem>one point one</listitem></list>two</listitem></list></article>

Should have been:

<?xml version='1.0' encoding='UTF-8' ?>
<article rendertime='0.0019750595092773 sec'><list type='bullet'><listitem>one<list type='bullet'><listitem>one point one</listitem></list></listitem><listitem>two</listitem></list></article>
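The expected nesting can be reproduced with a short Python sketch that tracks the current list depth. It is not the fix applied to wiki2xml.php; the function name `bullets_to_xml` is hypothetical, and the sketch assumes every input line starts with at least one "*". The key point is that returning to a shallower level must close the nested list and the enclosing item before opening a new item.

```python
from typing import Iterable
from xml.sax.saxutils import escape

OPEN = "<list type='bullet'><listitem>"
CLOSE = "</listitem></list>"


def bullets_to_xml(lines: Iterable[str]) -> str:
    """Convert consecutive '*' bullet lines into nested list XML."""
    out = []
    depth = 0
    for line in lines:
        # Number of leading "*" characters gives the nesting level (>= 1 assumed).
        level = len(line) - len(line.lstrip("*"))
        if level > depth:
            # Deeper: open new lists inside the still-open list item.
            out.append(OPEN * (level - depth))
        else:
            # Same or shallower: close deeper lists, then close the previous
            # item at this level before starting a sibling item.
            out.append(CLOSE * (depth - level))
            out.append("</listitem><listitem>")
        out.append(escape(line[level:].strip()))
        depth = level
    out.append(CLOSE * depth)
    return "".join(out)
```

For the three-line input in this message, the sketch produces the list markup from the "Should have been" output, without the surrounding `article` element.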

Magnus Manske

14 Apr

4:18 p.m.

Timwi wrote:

...

Magnus Manske wrote:

...
Here it is! The millionth pseudo-parser I wrote for wiki(p|m)edia! :-)

Try it out at

http://www.magnusmanske.de/wikipedia/wiki2xml.php

So whatever became of the flex/bison parser, then? Has it been abandoned because nobody can make sense of my code? :-)

AFAIK, I was the only one who tried, and I'm scared :-)

Every time I changed something in the bison code, it didn't do as I expected, even when I was sure I got it right. Somehow Flex/Bison and I are incompatible...

...

Anyway, I've found a bug in your yanrap (Yet Another Not-Really-A-Parser):

I'll get to that...

Thanks, Magnus


wikitech-l@lists.wikimedia.org
