Soo Reams wrote:
I think the discussion of a clean grammar and a slick parser is among the most important I've ever read on here, and it's good to see it going somewhere.
I'm actually quite surprised it has gone on this long; to my recollection, these discussions are usually much shorter.
The first time I (personally) ever thought about the problem of formalizing the grammar was about two years ago, when I first started with MW (around version 1.5.1). The problems then were the same as they are now, and the same as they're going to be for the foreseeable future.
It's important to remember that MediaWiki syntax isn't a lightweight markup language in the "traditional" sense. That is, unlike Markdown, Textile, APT and the like, wikitext is inextricably part of a rich infrastructure of functionality, and that infrastructure very heavily affects the grammar.
For example, language specificity (as Simetrical mentioned) would probably require that the MediaWiki grammar be a conglomerate of myriad individual grammars for various language groups.
For another example, consider #REDIRECTs. When the #REDIRECT pattern is encountered at the beginning of a page, any subsequent content is ignored (stripped at submission time). And the "output" is variable: the rendered result depends on the viewing context; it either redirects the reader to another page or renders a link to it.
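To make that concrete, here's a toy Python sketch of the special case a parser has to apply before ordinary tokenization even begins. The regex is a crude approximation of my own, not MediaWiki's actual redirect matching, which is localized and configurable:

    import re

    # crude approximation of the redirect prefix; the real pattern is
    # localized and more permissive about whitespace and case
    REDIRECT_RE = re.compile(r'^#REDIRECT\s*\[\[([^\]|#]+)', re.IGNORECASE)

    def redirect_target(page_text):
        """Return the target page name if the page is a redirect, else None."""
        m = REDIRECT_RE.match(page_text)
        return m.group(1).strip() if m else None

    print(redirect_target('#REDIRECT [[Main Page]]\nEverything after this is stripped.'))
    # 'Main Page'
    print(redirect_target('An ordinary article.'))
    # None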
Also consider extension tags. If no extension has claimed a particular tag name, the angle brackets are converted into their HTML-encoded equivalents. That is, "<this>[[whatever]]</this>" becomes "<this><a href=...>whatever</a></this>". On the other hand, if an extension has hooked "this", then the [[whatever]] inside may be treated as a link, as plain text, or as something totally different, depending on the extension's implementation.
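Something like this dispatch, in a rough Python sketch (the registry and the stand-in parse_wikitext are made-up names for illustration, not MediaWiki's real API):

    import html
    import re

    EXTENSION_TAGS = {'nowiki'}  # hypothetical registry of hooked tag names

    def parse_wikitext(text):
        # stand-in for the real parser: just turn [[x]] into a link
        return re.sub(r'\[\[([^\]]+)\]\]', r'<a href="/wiki/\1">\1</a>', text)

    def render_tag(name, inner):
        if name in EXTENSION_TAGS:
            # a hooked tag receives the raw text; what happens next is
            # entirely up to the extension's implementation
            return inner
        # unregistered tag: brackets are escaped, but the contents are still parsed
        return html.escape('<%s>' % name) + parse_wikitext(inner) + html.escape('</%s>' % name)

    print(render_tag('this', '[[whatever]]'))
    # &lt;this&gt;<a href="/wiki/whatever">whatever</a>&lt;/this&gt;

The point being that the tokenizer can't know what to do with the tag's contents without consulting the registry first.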
Perhaps even more complex is the treatment of parser functions, which operate within the scope of page parsing (interpreting template parameters, etc.) but ultimately let the implementor conditionally disable those features. That is, although {{#this:param1|param2|param3}} would usually be parsed as a call to the 'this' parser function with three parameters, it doesn't have to be. It could be a single parameter containing "param1|param2|param3".
It may even be possible to use reserved MediaWiki template-processing characters in the input. So, continuing this example, say the 'this' parser function wanted all internal text left unparsed and treated as one string. Then "{{#this:{{whatever}}" might be treated as a call to 'this' with the parameter "{{whatever". I'm not absolutely sure this works, as I haven't tested it, but if so, it further complicates the tokenizer.
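Here's a toy Python dispatcher that shows the dilemma (the RAW_FUNCTIONS registry is an assumption for illustration): the argument count of a call can't be decided by the grammar alone, only by consulting the registry at parse time.

    # hypothetical registry: functions that want their input as one raw string
    RAW_FUNCTIONS = {'this'}

    def parse_function_call(text):
        """Parse '{{#name:body}}', splitting the body only if the function allows it."""
        assert text.startswith('{{#') and text.endswith('}}')
        name, _, body = text[3:-2].partition(':')
        args = [body] if name in RAW_FUNCTIONS else body.split('|')
        return name, args

    print(parse_function_call('{{#this:param1|param2|param3}}'))
    # ('this', ['param1|param2|param3'])  - one argument, because 'this' opted out

    print(parse_function_call('{{#this:{{whatever}}'))
    # ('this', ['{{whatever'])  - the inner braces never open a transclusion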
I'm not trying to be too defeatist here, but I sincerely doubt that these kinds of infrastructural ties can be captured by a grammar - much less one with limited lookahead and lookbehind. The best one could hope for might be to define the basic wikitext markup language, ignoring the meanings of namespaces, templating/transclusion, extension tags and parser functions. Even then, what use is such a grammar? It probably wouldn't simplify the MediaWiki parser significantly, since all the ignored features would still need to be accounted for - as they would in any other application that hopes to integrate with MW syntax (for example, an external WYSIWYG editor).
In all sincerity, I wish the best of luck to anyone who attempts to fully specify the wikitext syntax. As mentioned previously, the reward for such a feat could be as many as several beers. :)
-- Jim R. Wilson (jimbojw)
Another feature is multi-language support. The meaning …
On Nov 9, 2007 11:42 AM, Simetrical <Simetrical+wikilist@gmail.com> wrote:
On 11/9/07, Thomas Dalton <thomas.dalton@gmail.com> wrote:
Backwards compatibility. The main suggestion I've seen is rewriting the parser in such a way as to make it behave like the old one in everything except a few unavoidable corner cases (bold italics is not a corner case).
I would view bold italics with adjacent apostrophes as a corner case. The behavior in that case makes very little sense, and I doubt it's widely used.
On 11/9/07, Stephen Bain <stephen.bain@gmail.com> wrote:
Well then, should it just take everything until the next whitespace?
Remember that some languages (like the CJK languages) don't use whitespace to separate words. You would eat the entire paragraph. Regardless, I think we could probably do with eating all letter characters (and number characters? maybe not) from any alphabet that uses whitespace, for every language. That would be especially useful for sites like Commons or Meta or mediawiki.org. I've remarked on this before.
Anyway, if this behavior is not consistent across languages, we have the obvious problem that the parsing grammar depends on the language. This is probably not desirable. As I say, though, I suspect it would be entirely possible to make this behavior consistent across languages in this case.
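For what it's worth, here's a rough Python sketch of that language-independent rule. It uses the third-party 'regex' module, since the stdlib 're' has no Unicode script properties, and the excluded-script list is just an illustration, not an exhaustive proposal:

    import regex  # third-party module; stdlib 're' lacks script properties

    # Letters from whitespace-delimited scripts only; Han/Hiragana/Katakana/
    # Hangul are excluded so a CJK paragraph isn't swallowed whole.
    TRAIL = regex.compile(r'(?:(?![\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}])\p{L})+')

    def link_trail(text_after_link):
        """Characters that a preceding '[[page]]' would absorb into the link."""
        m = TRAIL.match(text_after_link)
        return m.group(0) if m else ''

    print(repr(link_trail('s are linked here')))  # 's'  ("[[page]]s" extends the link)
    print(repr(link_trail('漢字の文がずっと続く')))  # ''   (no trail eaten from CJK text)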