Hi,
I'm currently maintaining wikilint (cf. URI:http://toolserver.org/~timl/cgi-bin/wikilint) that reviews Wikipedia articles for common problems. At the moment, it is a powerful but ugly mess of regular expressions galore. Fixing bugs is a nightmare.
Ideally, a redesign would parse the source into a tree-like structure and then work on that. So I went to CPAN and [[mw:Alternative parsers]] and found out that:
a) there are lots of "release early, release once" "implementations" that do not do anything useful and do not seem to be under active development, and b) for many people, "parser" seems to mean "converter".
So I'll probably have to start another try. Since for wikilint I do not have to be able to parse 100 % of all conceivable wiki markup (if an article cannot be parsed, it probably is broken anyway), I could go for a rather "lean" approach. For the tree structure, I would opt for DOM to maximize code reusability, with the wiki markup in a separate namespace. If there are no relevant foundations to build on, I would prefer Perl, ideally enhancing an existing CPAN module like WWW::Wikipedia::Entry.
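To make that a bit more concrete, here is the kind of skeleton I have in mind, a minimal, untested sketch that assumes XML::LibXML for the DOM side; the namespace URI and the element names are just placeholders, not a finished schema:

#!/usr/bin/perl
use strict;
use warnings;

use WWW::Wikipedia;
use XML::LibXML;

# Fetch the raw wiki source of an article (the title is just an example).
my $wiki  = WWW::Wikipedia->new( language => 'de' );
my $entry = $wiki->search('Perl')
    or die "article not found\n";
my $text = $entry->raw();

# Build a DOM document that keeps the wiki markup in its own namespace.
my $ns   = 'http://www.example.org/2009/wikimarkup';
my $doc  = XML::LibXML::Document->new( '1.0', 'UTF-8' );
my $root = $doc->createElementNS( $ns, 'wm:article' );
$doc->setDocumentElement($root);

# A real parser would build the tree from $text; for now, record a
# single heading just to show the intended structure.
my $heading = $doc->createElementNS( $ns, 'wm:heading' );
$heading->setAttribute( 'level', 2 );
$heading->appendText('Siehe auch');
$root->appendChild($heading);

print $doc->toString(1);

Whether that would end up as a subclass of WWW::Wikipedia::Entry or just a wrapper around it is an open question.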
Any pointers to things that I overlooked? Thoughts on interfaces & Co.? Volunteers? :-)
Tim
Tim Landscheidt wrote:
Any pointers to things that I overlooked? Thoughts on interfaces & Co.? Volunteers? :-)
Tim
It's a bit hard for me to understand what your tool does, since it gives a blank page when English is selected, and it takes the HTML source instead of the wiki source.
I get that you look for two kinds of bugs: "wiki text errors" (like an unclosed tag) and "Wikipedia errors" (the date doesn't conform to the manual of style).
For the first kind of error, I have long dreamed of a feature which actually reported errors in such cases. Wikitext must accept everything, but some things are only accepted in a "quirks mode". If it showed the errors (to power users who opt in), they could be fixed in a more specific way than just relying on the generic parser to "do the right thing".
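Even a toy check would already report something useful; a rough sketch in Perl (the tag list and the sample wikitext are made up, and a real check would have to cope with nesting, comments, <nowiki> and self-closing tags):

#!/usr/bin/perl
use strict;
use warnings;

# Report (instead of silently "fixing") one quirks-mode case:
# opening tags without a matching closing tag.
my $wikitext = "Text with a reference.<ref>Source that is never closed\n";

for my $tag (qw(ref gallery nowiki)) {
    my $opened = () = $wikitext =~ /<\Q$tag\E[\s>]/g;
    my $closed = () = $wikitext =~ /<\/\Q$tag\E\s*>/g;
    print "lint: <$tag> opened $opened time(s), but closed $closed time(s)\n"
        if $opened != $closed;
}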
The second class of errors could be checked on top of it. It's project specific, anyway.
It would provide a definition of correct wikitext, have some support upstream, fix things like bug 21798 or bug 21534, and hopefully even improve the parser. I have dealt with the parser a bit (see bug 18765) and I don't think we could make some things remotely sane as they are handled at completely different steps. But linting completely insane ones shouldn't be too hard. :)
On the other hand, going into the Parser is probably quite far from what you expected when wanting to leave your ugly mess of regexes. Also, I may have misunderstood your position and it may not be appropriate for your lint expectations.
Platonides platonides@gmail.com wrote:
Any pointers to things that I overlooked? Thoughts on interfaces & Co.? Volunteers? :-)
It's a bit hard for me to understand what your tool does, since it gives a blank page when English is selected, and it takes the HTML source instead of the wiki source.
Ah! Didn't notice that. It works (solely) on the wiki source, though.
I get that you look for two kinds of bugs: "wiki text errors" (like an unclosed tag) and "Wikipedia errors" (the date doesn't conform to the manual of style). [...]
It does mostly the latter, but I'm not looking for a grammar that defines an article complying with a manual of style; I'm looking for a parser that parses wikitext.
[...] I have dealt with the parser a bit (see bug 18765) and I don't think we could make some things remotely sane as they are handled at completely different steps. But linting completely insane ones shouldn't be too hard. :)
On the other hand, going into the Parser is probably quite far from what you expected when wanting to leave your ugly mess of regexes. Also, I may have misunderstood your position and it may not be appropriate for your lint expectations.
I think so :-). My use case with wikilint and some other tools is:
- Are there more than one and fewer than x images per article?
- Is there more than one link to another article?
- Are there links in a "See also" section that have already appeared in the article?
- If there are "Main article:" links, do they appear directly following a section heading, indented and italic?
- Does the {{Personendaten}} data have a fuzzy relationship to the introductory line of the article?
To address these, I'd like to parse the wiki source from a concatenation of characters into a logical structure. The MediaWiki parser does not seem to care about that, so I have not looked further into it (and don't plan to do so).
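To illustrate the kind of query I would like to end up writing, here is a rough, untested sketch of the "See also" check from the list above, run against a hand-made stand-in for such a structure; the namespace URI and the element names are again placeholders:

#!/usr/bin/perl
use strict;
use warnings;

use XML::LibXML;

# Stand-in for the DOM that a (yet to be written) wikitext parser
# would produce.
my $ns  = 'http://www.example.org/2009/wikimarkup';
my $doc = XML::LibXML->new->parse_string(<<"XML");
<wm:article xmlns:wm="$ns">
  <wm:link target="Foo"/>
  <wm:section title="Siehe auch">
    <wm:link target="Foo"/>
    <wm:link target="Bar"/>
  </wm:section>
</wm:article>
XML

my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( wm => $ns );

# Collect the link targets that appear outside the "See also" section ...
my %in_body = map { $_->getAttribute('target') => 1 }
    $xpc->findnodes('//wm:link[not(ancestor::wm:section[@title="Siehe auch"])]');

# ... and complain about "See also" entries that repeat one of them.
for my $link ( $xpc->findnodes('//wm:section[@title="Siehe auch"]/wm:link') ) {
    my $target = $link->getAttribute('target');
    print qq(lint: "Siehe auch" repeats a link to "$target"\n)
        if $in_body{$target};
}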
So, to emphasize: I'm looking for *a* parser; that's a lowercase "p".
Tim
On Sun, Dec 13, 2009 at 6:19 PM, Tim Landscheidt tim@tim-landscheidt.de wrote:
[...]
Any pointers to things that I overlooked? Thoughts on interfaces & Co.? Volunteers? :-)
This falls more into the "converter" group, but http://toolserver.org/~magnus/wiki2xml/w2x.php generates pretty usable XML output, especially when you use the option for API template resolution. You can use the source directly on the command line, or just query the tool via GET or POST.
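A minimal, untested sketch of driving it from a Perl script; the form field names below are placeholders, the real ones are whatever the HTML form on that page uses:

#!/usr/bin/perl
use strict;
use warnings;

use LWP::UserAgent;

# POST wiki source to w2x.php and print whatever comes back.
my $ua  = LWP::UserAgent->new;
my $res = $ua->post(
    'http://toolserver.org/~magnus/wiki2xml/w2x.php',
    {
        text   => "'''Bold''' text with a [[Link]].",    # placeholder field name
        format => 'xml',                                  # placeholder field name
    }
);

die 'request failed: ' . $res->status_line . "\n" unless $res->is_success;
print $res->decoded_content;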
Cheers, Magnus
Magnus Manske magnusmanske@googlemail.com wrote:
[...]
Any pointers to things that I overlooked? Thoughts on interfaces & Co.? Volunteers? :-)
This falls more into the "converter" group, but http://toolserver.org/~magnus/wiki2xml/w2x.php generates pretty usable XML output, especially when you use the option for API template resolution. You can use the source directly on the command line, or just query the tool via GET or POST.
I had looked at it (and tested that it produced valid XML, though its author's name left little doubt about that :-)), but converting wiki text to XML and then to a DOM tree seems to introduce too many points of failure, and there were several calls to str_replace() & Co. that looked difficult to translate into a tree model.
Tim