I am a new member of the list, but it seems the list has been quiet for months. Is there any new progress on, or plan for, the parsers?
I tried Steve Bennett's ANTLR grammar, but I failed to generate a parser from it. I also noticed that there has been no recent progress on that grammar.
Does the MediaWiki tech team have any roadmap for the parser? Who is pushing things forward?
Thanks for your reply.
Regards, Mingli
Hi Mingli
I guess everyone gave up on the dream of being able to define the current syntax in any sane, well-defined form ;)
I tried to build a parser similar to flexbisonparse a while ago, using flex and bison to create an XML parse tree. Of course, I failed miserably after two weeks of work and went back to the Perl regex monstrosity we use at the company. But I did find out the following things which may be useful for any future efforts:
I believe it's wrong to attempt to create a single parser for MediaWiki syntax (as flexbisonparse attempted). A better and much simpler way is to define a separate formal grammar for each step of the parsing. This way you can get around problems such as an XML-like tag being assembled from several different templates. My attempt included separate flex/bison parsers for the following (a rough sketch of the pipeline idea follows the list):
* <noinclude>, <includeonly>, ... parts
* template transclusion (e.g. the {{{ and {{ constructs)
* text formatting
* possibly more steps for tables, etc., but I didn't get that far.
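To make the multi-pass idea concrete, here is a minimal sketch in Python rather than flex/bison; the pass boundaries and regexes are illustrative assumptions, not the actual flexbisonparse rules:

import re

def strip_noinclude(text, transcluding=False):
    """Pass 1: handle <noinclude>/<includeonly> parts depending on context."""
    if transcluding:
        text = re.sub(r'<noinclude>.*?</noinclude>', '', text, flags=re.S)
        text = re.sub(r'</?includeonly>', '', text)
    else:
        text = re.sub(r'<includeonly>.*?</includeonly>', '', text, flags=re.S)
        text = re.sub(r'</?noinclude>', '', text)
    return text

def expand_templates(text, templates):
    """Pass 2: naive, non-nested {{...}} transclusion from a dict."""
    return re.sub(r'\{\{([^{}|]+)\}\}',
                  lambda m: templates.get(m.group(1).strip(), m.group(0)),
                  text)

def format_text(text):
    """Pass 3: a tiny subset of inline formatting."""
    text = re.sub(r"'''(.+?)'''", r'<b>\1</b>', text)
    text = re.sub(r"''(.+?)''", r'<i>\1</i>', text)
    return text

def render(text, templates):
    # Each pass only needs to understand the output of the previous one.
    return format_text(expand_templates(strip_noinclude(text), templates))

For example, render("{{greeting}} ''world''", {"greeting": "'''Hello'''"}) gives "<b>Hello</b> <i>world</i>" - the formatting pass never has to know that the bold markup came out of a template.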
The biggest problem defining these is graceful degradation on broken input. It's not that hard to get the parser to work in simple, well defined cases. But if you want to get anywhere near the way the current parser degrades on ambiguous input the parser definitions start to grow out of hand. And parsing speed ends up in the dumps. You're just trying to cram context into a context-free grammar.
From my observations I believe that the only possible way any formal grammar will replace the current PHP parser is if the MediaWiki team is prepared to change the current philosophy of desperately trying to make sense of any kind of broken string of characters the user provides - i.e. if MediaWiki could throw up a syntax error on invalid input and/or they significantly reduce the number of valid constructs (all the horrible combinations of bold/italics markup come to mind).
Given my understanding of the project I find this extremely unlikely. But then I'm not a MediaWiki developer, so I might be completely wrong here.
Best regards Tomaž Šolc
--- Tomaž Šolc, Research & Development, Zemanta Ltd, London, Ljubljana | www.zemanta.com | mail: tomaz@zemanta.com | blog: http://www.tablix.org/~avian/blog
2008/7/14 Tomaž Šolc tomaz.solc@zemanta.com:
From my observations I believe that the only possible way any formal grammar will replace the current PHP parser is if the MediaWiki team is prepared to change the current philosophy of desperately trying to make sense of any kind of broken string of characters the user provides - i.e. if MediaWiki could throw up a syntax error on invalid input and/or they significantly reduce the number of valid constructs (all the horrible combinations of bold/italics markup come to mind). Given my understanding of the project I find this extremely unlikely. But then I'm not a MediaWiki developer, so I might be completely wrong here.
I suspect it's highly unlikely that we'll ever have a situation where any wikitext will come up with "SYNTAX ERROR" or equivalent. (Some templates on en:wp do something like this for bad parameters, but they try to make the problem reasonably obvious to fix.) Basically, the stuff's gotta work for someone who can't work a computer or think in terms of this actually being a computer language rather than text with markup. I would *guess* that an acceptable failure mode would be just to render the text unprocessed.
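As a hedged illustration of that failure mode - the exception name and helper below are invented for the example, not anything in the real parser:

import html

class WikitextSyntaxError(Exception):
    """Hypothetical error a strict wikitext parser might raise."""

def render_block(block, strict_parse):
    # Try the strict parser; if the markup is broken or ambiguous,
    # fall back to showing the source text instead of a "SYNTAX ERROR".
    try:
        return strict_parse(block)
    except WikitextSyntaxError:
        return '<pre>' + html.escape(block) + '</pre>'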
The thing to do with particularly problematic "bad" constructs would be to go through the wikitext corpus and see how often they're actually used and how fixable they are.
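A rough sketch of that kind of corpus survey, assuming a directory of plain-text page dumps and a hand-picked list of suspect patterns (both are assumptions for illustration only):

import re
from collections import Counter
from pathlib import Path

# Illustrative "bad" constructs; a real survey would refine these.
SUSPECT = {
    'bold_italic_runs': re.compile(r"'{4,}"),            # runs of 4+ apostrophes
    'unclosed_table':   re.compile(r'\{\|(?![\s\S]*?\|\})'),
    'stray_param':      re.compile(r'\{\{\{(?!.*?\}\}\})'),  # {{{ with no }}} on the line
}

def survey(dump_dir):
    """Count how many pages contain each suspect construct."""
    counts = Counter()
    for page in Path(dump_dir).glob('*.txt'):
        text = page.read_text(encoding='utf-8', errors='replace')
        for name, pattern in SUSPECT.items():
            if pattern.search(text):
                counts[name] += 1
    return counts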
Remember also third-party users of MediaWiki, who may expect a given bug effect to work as a feature.
- d.
Thanks, Tomaž and David.
My concern is that the MediaWiki dev team should have some plan, whether the parser works in one pass or in many. Someone should push things forward gradually.
Wikimedia projects have accumulated a huge repository of knowledge, and that knowledge should be usable in wider circumstances. Can you imagine Wikipedia articles being forever bound to a PHP regexp parser? Any formal description of wikitext is welcome. We should free the knowledge from its format.
Thanks again.
Regards, Mingli
Don't worry Mingli,
My concern is that the MediaWiki dev team should have some plan, whether the parser works in one pass or in many.
It is almost certainly impossible to parse wikitext "in one pass" - it's too beautifully complex for that.
Someone should push things forward gradually.
In another 8 to 10 months, someone will try again, there will be a big flareup of activity regarding a standardized, formalized, perfectly context-free mediawiki grammar and subsequent language-agnostic parser. At the end of that struggle and strife, we'll be back here where we started.
I'm not being cynical here (nor am I trying to prematurely instigate another flamewar) - it's just the nature of the problem. A lot of really bright minds have attempted to fit wikitext into a traditional grammar mold. The problem is that it's not a traditional grammar.
My recommendation is to address the actual reason why someone might want a context-free grammar in the first place. Considering how much time and creative energy has been spent on trying to create the one-true-parser, I wonder whether it would be easier to simply port the existing Parser to other languages directly (regular expressions and all). I bet it would be.
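To make the "port the regexes directly" idea concrete, here is a hypothetical Python fragment of what such a port might look like; the link rule shown is a simplified stand-in for illustration, not the actual substitution used in Parser.php:

import re

# A direct port keeps the original's strategy: successive regex
# substitutions over the whole page text, quirks and all.
def replace_internal_links(text):
    # Simplified stand-in for the real link rule: [[Target|label]]
    return re.sub(
        r'\[\[([^\[\]|]+)\|([^\[\]]+)\]\]',
        r'<a href="/wiki/\1">\2</a>',
        text)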
-- Jim R. Wilson (jimbojw)
On Mon, Jul 14, 2008 at 01:18:27PM -0500, Jim R. Wilson wrote:
Someone should push things forward gradually.
In another 8 to 10 months, someone will try again, there will be a big flareup of activity regarding a standardized, formalized, perfectly context-free mediawiki grammar and subsequent language-agnostic parser. At the end of that struggle and strife, we'll be back here where we started.
I'm not being cynical here (nor am I trying to prematurely instigate another flamewar) - it's just the nature of the problem. A lot of really bright minds have attempted to fit wikitext into a traditional grammar mold. The problem is that it's not a traditional grammar.
My appraisal of Steve's work, as I watched it here, is that that's not actually true this time. Steve has gotten a lot closer than anyone else whose work I'd looked at -- it's actually functional right now for probably better than 75% of the mediawiki-alike uses you might want to put it to, I think, based on how he was describing it.
And more importantly, he picked a base that makes it easier to extend the work he already did.
My recommendation is to address the actual reason why someone might want a context-free grammar in the first place. Considering how much time and creative energy has been spent on trying to create the one-true-parser, I wonder whether it would be easier to simply port the existing Parser to other languages directly (regular expressions and all). I bet it would be.
Yeah, but that wasn't in fact the reason, I don't think.
My view of the two goals was:
1) create a replacement parser that will drop-in and actually be maintainable and understandable.
2) create a framework for parsers that will work with MW-compatible text, and which can be used for other things.
I, for example, want to be able to drop an MW compatible parser into WebGUI, so content people can completely avoid HTML.
Cheers, -- jra
2008/7/14 Jim R. Wilson wilson.jim.r@gmail.com:
My recommendation is to address the actual reason why someone might want a context-free grammar in the first place. Considering how much time and creative energy has been spent on trying to create the one-true-parser, I wonder whether it would be easier to simply port the existing Parser to other languages directly (regular expressions and all). I bet it would be.
The question is how to implement such highly desirable features as a good WYSIWYG editor without knowing what each of those bits of syntax actually means. A parseable grammar would get us just that bit closer. A grammar with a small amount of context might be workable (to minimise the recalculation needed at each keystroke). Etc.
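One way to read "a small amount of context" is block-level reparsing, where only the block under the cursor is reparsed on each keystroke. A sketch under the simplifying (and assumed) rule that blank lines delimit blocks:

def reparse_incrementally(old_blocks, old_html, new_text, parse_block):
    """Re-render only the blocks whose source changed since the last keystroke."""
    new_blocks = new_text.split('\n\n')   # assumption: blank-line block boundaries
    html = []
    for i, block in enumerate(new_blocks):
        if i < len(old_blocks) and block == old_blocks[i]:
            html.append(old_html[i])      # unchanged block: reuse cached render
        else:
            html.append(parse_block(block))
    return new_blocks, html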
The other use case is a 100% (once the really stupidly unnecessary emergent effects are ignored) provably correct reimplementation in another language, e.g. an optional C-based parser, a Java parser, etc., etc.
- d.
On Mon, Jul 14, 2008 at 07:37:15PM +0100, David Gerard wrote:
The other use case is a 100% (once the really stupidly unnecessary emergent effects are ignored) provably correct reimplementation in another language, e.g. an optional C-based parser, a Java parser, etc., etc.
If a parser could be created that was close enough to 100% that Brion and Tim were happy with it... in C... I suspect it might go on line pretty quickly.
Speed's an issue when you serve as many pages as we do...
Cheers, -- jr 'where, by "we", I mean...' a
Hi Jim and Jay, thanks for your replies.
Considering how much time and creative energy has been spent on trying to create the one-true-parser, I wonder whether it would be easier to simply port the existing Parser to other languages directly (regular expressions and all). I bet it would be.
I ported part of the existing parser to Java years ago. I still remember what a hard time it was: the existing code is hard to understand, and the resulting regexp-based parser was buggy.
I think at least two things are needed:
* an abstract description of what the parser does. The rules should be clear, and they should be independent of the implementation language - maybe regexp-based, or a context-free grammar plus something more, or something totally different.
* a test suite. I know we already have one in svn, and it is independent of the programming language. That is a good start.
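For example, a language-independent runner only has to read input/expected pairs and compare them against whatever implementation is under test. A sketch in Python, assuming a simplified case format rather than the exact parserTests.txt syntax:

def run_cases(cases_path, parse):
    """Count failures for a parse() implementation against a shared case file."""
    # NOTE: the '!! input' / '!! result' / '!! end' block format here is a
    # simplified assumption, not the real parserTests.txt grammar; the point
    # is that parse() written in any language can be checked against one file.
    failures = 0
    text = open(cases_path, encoding='utf-8').read()
    for case in text.split('!! end'):
        if '!! input' not in case or '!! result' not in case:
            continue
        _, rest = case.split('!! input', 1)
        wikitext, expected = rest.split('!! result', 1)
        if parse(wikitext.strip()) != expected.strip():
            failures += 1
    return failures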
On the other hand, I see other endeavors to standardize wiki markup outside the Wikimedia community - I mean WikiCreole from WikiSym 2006 and WikiModel from SourceForge. After taking a quick look, I just want to know: what makes WikiCreole different? Is it an endeavor to standardize wiki markup, or is it simply just another wiki markup?
mingli yuan wrote:
On the other hand, I see other endeavors to standardize wiki markup outside the Wikimedia community - I mean WikiCreole from WikiSym 2006 and WikiModel from SourceForge. After taking a quick look, I just want to know: what makes WikiCreole different? Is it an endeavor to standardize wiki markup, or is it simply just another wiki markup?
It *aimed* to standardize wiki markup, but by creating another markup with what they considered "fancy" features. I.e., things like being expressible in EBNF weren't considered in its design, so there are rules like "// toggles italics except if it appears as part of a URL". OTOH it has a set of rules defining the grammar, which is always good, although WikiCreole is smaller and much less powerful than MediaWiki's markup.
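That exception is exactly the kind of rule that doesn't fit plain EBNF. As a rough Python illustration (the URL pattern and the placeholder trick are simplifications, not the Creole spec):

import re

URL = re.compile(r'\bhttps?://\S+')

def creole_italics(line):
    # Stash URLs first so the '//' inside them doesn't toggle italics.
    urls = []
    def stash(m):
        urls.append(m.group(0))
        return f'\0{len(urls) - 1}\0'
    line = URL.sub(stash, line)
    line = re.sub(r'//(.*?)//', r'<i>\1</i>', line)
    # Restore the stashed URLs.
    return re.sub(r'\0(\d+)\0', lambda m: urls[int(m.group(1))], line)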
Hi
My recommendation is to address the actual reason why someone might want a context-free grammar in the first place. Considering how much time and creative energy has been spent on trying to create the one-true-parser, I wonder whether it would be easier to simply port the existing Parser to other languages directly (regular expressions and all). I bet it would be.
My experience is that it was easier to learn how the parser behaves by studying example outputs than to deduce it from the source code.
In the ideal case, a clean parser implementation (in any language) would be almost as good as a formal definition of the syntax. That's basically the reason I see for all of us trying to come up with a context-free grammar - it gives you a parser that is easy to understand and easy to port to other platforms.
Now the current MediaWiki parser is anything but clean. It's not a simple class with well-defined interfaces that you can stick into another PHP program. It also doesn't generate a clean parse tree - it mangles strings until it arrives at something HTML-like and then cleans it up.
Since the syntax includes all sorts of ways a page can interact with data outside of the current page, the interface of such a pluggable class would probably be pretty complex. Maybe one way of making some progress is to decide on this interface and push the existing parser towards it?
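For what it's worth, a sketch of what such an interface could look like; the class and method names are invented for illustration, not an existing MediaWiki API. The point is that templates, link existence, magic words, and so on all come in through one explicit boundary:

from abc import ABC, abstractmethod

class WikiEnvironment(ABC):
    """Everything the parser may ask about the outside world."""
    @abstractmethod
    def get_template(self, title: str) -> str: ...
    @abstractmethod
    def page_exists(self, title: str) -> bool: ...
    @abstractmethod
    def magic_word(self, name: str) -> str: ...   # e.g. CURRENTYEAR

class WikitextParser(ABC):
    @abstractmethod
    def parse(self, text: str, env: WikiEnvironment) -> str:
        """Return rendered HTML; all external lookups go through env."""

Pushing the existing parser to route its external lookups through something like this would be an incremental way to get there.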
By the way, are any of you attending Wikimania? I would love to participate in any discussion on this topic.
Best regards Tomaž