TL;DR: You get to a spec by paying down technical debt, which untangles
wikitext parsing from the internals of the MediaWiki implementation and
its state.
In discussions, there is far too much focus on the fact that you cannot
write a BNF grammar or use yacc / lex / bison / whatever, or that quote
parsing is context-sensitive. I don't think that is such a big deal.
For example, you could use Markdown for parsing, but that wouldn't change
much of the picture outlined below. I think all of that is less of an
issue compared to the following:
Right now, MediaWiki HTML output depends on the following:
* input wikitext
* wiki config (including installed extensions)
* installed templates
* media resources (images, audio, video)
* PHP parser hooks that expose parsing internals and implementation
details (not replicable in other parsers)
* wiki messages (ex: cite output)
* state of the corpus and other db state (ex: red links, bad images)
* user state (prefs, etc.)
So, one reason for the complexity in implementing a wikitext parser is
that the output HTML is not simply a straightforward transformation
of input wikitext (and some config). There is far too much other state
that gets in the way.
The second reason for complexity is that markup errors aren't bounded
to narrow contexts but can leak out and affect the output of the entire
page. Some user pages seem to exploit this as a feature even (unclosed
The third source of complexity is that some parser hooks expose
internals of the implementation (Before/After Strip/Tidy and other such
hooks). An implementation without Tidy, or one that handles wikitext
differently, might not have the same pipeline.
However, we can still get to a spec that is much more replicable if we
start cleaning up some of this incrementally and paying down technical
debt. Here are some things going on right now towards that.
* We are close to getting rid of Tidy, which removes it from the equation.
* There are RFCs that propose defining DOM scopes and propose that
output of templates (and extensions) be a DOM (vs a string), with some
caveats (that I will ignore here). If we can get to implementing
these, we immediately isolate the parsing of a top-level page from the
details of how extensions and transclusions are processed.
* There are RFCs that propose that things like red links, bad images,
user state, and site messages not be inputs into the core wikitext
parse. From a spec point of view, they should be viewed as post-processing
transformations. However, for efficiency reasons, an implementation
might choose to integrate that as part of the parse, but that is not a
Separately, here is one other thing we can consider:
* Deprecate and replace tag hooks that expose parser internals.
When all of these are done, it becomes far more feasible to think of
defining a spec for wikitext parsing that is not tied to the internals
of MediaWiki or its extensions. At that point, you could implement
templating via Lua or via JS or via Ruby ... the specifics are
immaterial. What matters is that those templating implementations and
extensions produce output with certain properties. You can then specify
that MediaWiki HTML is a series of transformations applied to the
output of the wikitext parser ... and there can be multiple
spec-compliant implementations of that parser.
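To make that layering concrete, here is a toy sketch in Python. It is
only an illustration of the idea, not how MediaWiki or Parsoid do it:
the function names (parse_wikitext, mark_red_links, render) and the
"new" class used to flag missing pages are assumptions made up for this
example, and the "parser" only understands [[Title]] links. The point is
that the core parse depends on the wikitext alone, while corpus state
(which pages exist) enters only through a separate post-processing pass.

```python
import re
from typing import Callable, Iterable, Set

# Hypothetical toy grammar: the only construct recognized is [[Title]].
LINK_RE = re.compile(r"\[\[([^\[\]|]+)\]\]")
ANCHOR_RE = re.compile(r'<a href="\./([^"]+)">')


def parse_wikitext(wikitext: str) -> str:
    """Core parse: a pure function of the input text.

    No database, user, or corpus state is consulted here. (No HTML
    escaping either; this is a sketch, not a safe renderer.)
    """
    return LINK_RE.sub(
        lambda m: f'<a href="./{m.group(1)}">{m.group(1)}</a>', wikitext
    )


def mark_red_links(existing_titles: Set[str]) -> Callable[[str], str]:
    """Build a post-processing pass that flags links to missing pages."""

    def transform(html: str) -> str:
        def flag(m: "re.Match[str]") -> str:
            title = m.group(1)
            if title in existing_titles:
                return m.group(0)  # page exists: leave the link alone
            return f'<a href="./{title}" class="new">'

        return ANCHOR_RE.sub(flag, html)

    return transform


def render(wikitext: str, passes: Iterable[Callable[[str], str]]) -> str:
    """Final HTML = core parse followed by a series of transformations."""
    html = parse_wikitext(wikitext)
    for apply_pass in passes:
        html = apply_pass(html)
    return html


if __name__ == "__main__":
    print(render("See [[Foo]] and [[Bar]].", [mark_red_links({"Foo"})]))
```

With this split, a second implementation only has to agree on what
parse_wikitext produces; an implementation is still free to fuse the
passes into the parse for efficiency, as long as the result is the same.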
I think it is feasible to get there. But, whether we want a spec for
wikitext and should work towards that is a different question.
On 08/01/2016 08:34 PM, Gergo Tisza wrote:
On Mon, Aug 1, 2016 at 5:27 PM, Rob Lanphier wrote:
Do you believe that declaring "the implementation is the spec" is a
sustainable way of encouraging contribution to our projects?
Reimplementing Wikipedia's parser (complete with template inclusions,
Wikidata fetches, Lua scripts, LaTeX snippets and whatever else) is
practically impossible. What we do or do not declare won't change that.
There are many other, more realistic ways to encourage contribution by
users who are interested in wikis, but not in Wikimedia projects.
(Supporting Markdown would certainly be one of them.) But historically the
WMF has shown zero interest in the wiki-but-not-Wikimedia userbase, and no
other actor has been both willing and able to step up in its place.
Wikitech-l mailing list