Soo Reams wrote:
I think the discussion of a clean grammar and a slick parser is among the most important I've ever read on here, and it's good to see it going somewhere.
I'm actually quite surprised it has gone on this long; to my recollection, these discussions are usually much shorter.
The first time I (personally) ever thought about the problem of formalizing the grammar was about two years ago, when I first started with MW (around version 1.5.1). The problems then were the same as they are now, and the same as they're going to be for the foreseeable future.
It's important to remember that MediaWiki syntax isn't a lightweight markup language in the "traditional" sense. That is, unlike Markdown, Textile, APT and the like, wikitext is inextricably part of a rich infrastructure of functionality, and that infrastructure very heavily affects the grammar.
For example, language specificity (as Simetrical mentioned) would probably require that the MediaWiki grammar be a conglomerate of myriad individual grammars for various language groups.
For another example, consider #REDIRECTs. When the #REDIRECT pattern is encountered at the beginning of a page, any subsequent content is ignored (stripped at submission time). And the "output" is variable: the rendered result depends on the viewing context; it either redirects the reader to another page or renders a link to it.
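To make that concrete, here's a toy Python sketch of the special case a parser has to apply before ordinary tokenization even begins. The regex is a crude approximation of my own, not MediaWiki's actual redirect matching, which is localized and configurable:

    import re

    # crude approximation of the redirect prefix; the real pattern is
    # localized and more permissive about whitespace and case
    REDIRECT_RE = re.compile(r'^#REDIRECT\s*\[\[([^\]|#]+)', re.IGNORECASE)

    def redirect_target(page_text):
        """Return the target page name if the page is a redirect, else None."""
        m = REDIRECT_RE.match(page_text)
        return m.group(1).strip() if m else None

    print(redirect_target('#REDIRECT [[Main Page]]\nEverything after this is stripped.'))
    # 'Main Page'
    print(redirect_target('An ordinary article.'))
    # None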
Also consider extension tags. If no extension has claimed a particular tag name, the angle brackets are converted into their HTML-encoded equivalents. That is, "<this>[[whatever]]</this>" becomes "<this><a href=...>whatever</a></this>". On the other hand, if an extension has hooked "this", then the [[whatever]] inside may be treated as a link, as plain text, or as something totally different, depending on the extension's implementation.
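Something like this dispatch, in a rough Python sketch (the registry and the stand-in parse_wikitext are made-up names for illustration, not MediaWiki's real API):

    import html
    import re

    EXTENSION_TAGS = {'nowiki'}  # hypothetical registry of hooked tag names

    def parse_wikitext(text):
        # stand-in for the real parser: just turn [[x]] into a link
        return re.sub(r'\[\[([^\]]+)\]\]', r'<a href="/wiki/\1">\1</a>', text)

    def render_tag(name, inner):
        if name in EXTENSION_TAGS:
            # a hooked tag receives the raw text; what happens next is
            # entirely up to the extension's implementation
            return inner
        # unregistered tag: brackets are escaped, but the contents are still parsed
        return html.escape('<%s>' % name) + parse_wikitext(inner) + html.escape('</%s>' % name)

    print(render_tag('this', '[[whatever]]'))
    # &lt;this&gt;<a href="/wiki/whatever">whatever</a>&lt;/this&gt;

The point being that the tokenizer can't know what to do with the tag's contents without consulting the registry first.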
Perhaps even more complex is the treatment of parser functions, which operate within the scope of page parsing (interpreting template parameters, etc.) but ultimately let the implementor conditionally disable those features. That is, although {{#this:param1|param2|param3}} would usually be parsed as a call to the 'this' parser function with three parameters, it doesn't have to be. It could be a single parameter containing "param1|param2|param3".
It may even be possible to use reserved MediaWiki template-processing characters in the input. So, continuing this example, say the 'this' parser function wanted all internal text left unparsed and treated as one string. Then "{{#this:{{whatever}}" might be treated as a call to 'this' with the parameter "{{whatever". I'm not absolutely sure this works, as I haven't tested it, but if so, it further complicates the tokenizer.
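Here's a toy Python dispatcher that shows the dilemma (the RAW_FUNCTIONS registry is an assumption for illustration): the argument count of a call can't be decided by the grammar alone, only by consulting the registry at parse time.

    # hypothetical registry: functions that want their input as one raw string
    RAW_FUNCTIONS = {'this'}

    def parse_function_call(text):
        """Parse '{{#name:body}}', splitting the body only if the function allows it."""
        assert text.startswith('{{#') and text.endswith('}}')
        name, _, body = text[3:-2].partition(':')
        args = [body] if name in RAW_FUNCTIONS else body.split('|')
        return name, args

    print(parse_function_call('{{#this:param1|param2|param3}}'))
    # ('this', ['param1|param2|param3'])  - one argument, because 'this' opted out

    print(parse_function_call('{{#this:{{whatever}}'))
    # ('this', ['{{whatever'])  - the inner braces never open a transclusion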
I'm not trying to be too defeatist here, but I sincerely doubt that these kinds of infrastructural ties can be captured by a grammar - much less one with limited lookahead and lookbehind. The best one could hope for might be to define the basic wikitext markup language, ignoring the meanings of namespaces, templating/transclusion, extension tags and parser functions. Even then, what use is such a grammar? It probably wouldn't simplify the MediaWiki parser significantly, since all the ignored features would still need to be accounted for - as they would in any other application that hopes to integrate with MW syntax (for example, an external WYSIWYG editor).
In all sincerity, I wish the best of luck to anyone who attempts to fully specify the wikitext syntax. As mentioned previously, the reward for such a feat could be as many as several beers. :)
-- Jim R. Wilson (jimbojw)
Another feature is multi-language support. The meaning …
On Nov 9, 2007 11:42 AM, Simetrical <Simetrical+wikilist@gmail.com> wrote:
On 11/9/07, Thomas Dalton <thomas.dalton@gmail.com> wrote:
Backwards compatibility. The main suggestion I've seen is rewriting the parser in such a way as to make it behave like the old one in everything except a few unavoidable corner cases (bold italics is not a corner case).
I would view bold italics with adjacent apostrophes as a corner case. The behavior in that case makes very little sense, and I doubt it's widely used.
On 11/9/07, Stephen Bain <stephen.bain@gmail.com> wrote:
Well then, should it just take everything until the next whitespace?
Remember that some languages (like the CJK languages) don't use whitespace to separate words. You would eat the entire paragraph. Regardless, I think we could probably do with eating all letter characters (and number characters? maybe not) from any alphabet that uses whitespace, for every language. That would be especially useful for sites like Commons or Meta or mediawiki.org. I've remarked on this before.
Anyway, if this behavior is not consistent across languages, we have the obvious problem that the parsing grammar depends on the language. This is probably not desirable. As I say, though, I suspect it would be entirely possible to make this behavior consistent across languages in this case.
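For what it's worth, here's a rough Python sketch of that language-independent rule. It uses the third-party 'regex' module, since the stdlib 're' has no Unicode script properties, and the excluded-script list is just an illustration, not an exhaustive proposal:

    import regex  # third-party module; stdlib 're' lacks script properties

    # Letters from whitespace-delimited scripts only; Han/Hiragana/Katakana/
    # Hangul are excluded so a CJK paragraph isn't swallowed whole.
    TRAIL = regex.compile(r'(?:(?![\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}])\p{L})+')

    def link_trail(text_after_link):
        """Characters that a preceding '[[page]]' would absorb into the link."""
        m = TRAIL.match(text_after_link)
        return m.group(0) if m else ''

    print(repr(link_trail('s are linked here')))  # 's'  ("[[page]]s" extends the link)
    print(repr(link_trail('漢字の文がずっと続く')))  # ''   (no trail eaten from CJK text)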