TL;DR: You get to a spec by paying down technical debt so that wikitext parsing is no longer intricately tied to the internals of mediawiki's implementation and state.
In discussions, there is far too much focus on the fact that you cannot write a BNF grammar (or use yacc / lex / bison / whatever), or that quote parsing is context-sensitive. I don't think that is as big a deal. For example, you could switch to Markdown and it wouldn't change much of the picture outlined below ... all of that is less of an issue compared to the following:
Right now, mediawiki HTML output depends on the following:
* input wikitext
* wiki config (including installed extensions)
* installed templates
* media resources (images, audio, video)
* PHP parser hooks that expose parsing internals and implementation details (not replicable in other parsers)
* wiki messages (ex: cite output)
* state of the corpus and other db state (ex: red links, bad images)
* user state (prefs, etc.)
* Tidy
So, one reason for the complexity of implementing a wikitext parser is that the output HTML is not simply a straightforward transformation of the input wikitext (and some config). There is far too much other state that gets in the way.
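To make that contrast concrete, here is a rough sketch in Python; all names and signatures are hypothetical and illustrative, not actual MediaWiki interfaces. The point is just how many inputs the "real" entry point effectively has:

# Hypothetical sketch: names and signatures are illustrative, not MediaWiki APIs.
from dataclasses import dataclass, field

@dataclass
class WikiConfig:
    """Site configuration, including which extensions are installed."""
    extensions: list[str] = field(default_factory=list)

def parse_spec_friendly(wikitext: str, config: WikiConfig) -> str:
    """What a spec wants: the same (wikitext, config) always yields the same HTML."""
    ...

def parse_today(wikitext: str, config: WikiConfig,
                templates: dict[str, str], media: dict[str, bytes],
                parser_hooks: list, messages: dict[str, str],
                db_state: dict, user_state: dict, tidy: object) -> str:
    """What we effectively have: every extra parameter is a piece of the state
    listed above, and the result is not replicable without reproducing all of it."""
    ...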
The second reason for complexity is that markup errors aren't confined to narrow contexts but can leak out and affect the output of the entire page. Some user pages even seem to exploit this as a feature (e.g. unclosed div tags).
The third source of complexity is that some parser hooks expose internals of the implementation (Before/After Strip/Tidy and other such hooks). An implementation without Tidy, or one that handles wikitext differently, might not have the same pipeline.
However, we can still get to a spec that is much more replicable if we start cleaning up some of this incrementally and paying down technical debt. Here are some things already underway towards that.
* We are close to getting rid of Tidy, which removes it from the equation.
* There are RFCs that propose defining DOM scopes and that the output of templates (and extensions) be a DOM (vs. a string), with some caveats that I will ignore here. If we can get these implemented, we immediately isolate the parsing of a top-level page from the details of how extensions and transclusions are processed.
* There are RFCs that propose that things like red links, bad images, user state, and site messages not be an input into the core wikitext parse. From a spec point of view, they should be viewed as post-processing transformations (see the sketch after this list). For efficiency reasons, an implementation might choose to integrate them into the parse, but that is not a requirement.
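As an illustration of the last two points, here is a hypothetical sketch; none of these names correspond to actual MediaWiki or RFC interfaces. Transclusions and extension tags hand back DOM fragments rather than strings, and red-link decoration runs as a separate pass over the parser's output:

# Hypothetical sketch of the separation the RFCs aim for; all names are made up.
from dataclasses import dataclass, field

@dataclass
class DomNode:
    """Minimal stand-in for a DOM node."""
    tag: str
    attrs: dict[str, str] = field(default_factory=dict)
    children: list["DomNode"] = field(default_factory=list)

def expand_transclusion(template_name: str, args: dict[str, str]) -> DomNode:
    """Templates/extensions return a well-formed DOM fragment, not a string
    that the top-level parse has to re-tokenize."""
    ...

def core_parse(wikitext: str, expand=expand_transclusion) -> DomNode:
    """Core parse: wikitext (+ config) in, DOM out. It only sees the DOM
    contract of transclusions, not how they were computed."""
    ...

def mark_red_links(doc: DomNode, page_exists) -> DomNode:
    """Post-processing pass: annotate links to missing pages. It needs db
    state, but the core parse above does not."""
    for node in doc.children:
        if node.tag == "a" and not page_exists(node.attrs.get("href", "")):
            node.attrs["class"] = "new"
        mark_red_links(node, page_exists)
    return doc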
Separately, here is one other thing we can consider:
* Deprecate and replace tag hooks that expose parser internals.
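For instance (purely hypothetical, not a proposed API, and reusing the DomNode stand-in from the sketch above), a replacement hook could operate on the parser's output DOM rather than on intermediate pipeline strings, so it cannot depend on strip markers or a Tidy stage:

# Hypothetical replacement surface for internals-exposing hooks.
from typing import Protocol

class OutputDomTransform(Protocol):
    """An extension hook that sees only the finished output DOM of a page
    (or fragment) and returns a transformed DOM. It cannot observe strip
    state, Tidy passes, or any other implementation detail of the parse."""
    def __call__(self, doc: "DomNode") -> "DomNode": ...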
When all of these are done, it becomes far more feasible to think of defining a spec for wikitext parsing that is not tied to the internals of mediawiki or its extensions. At that point, you could implement templating via Lua or via JS or via Ruby ... the specifics are immaterial. What matters is that those templating implementations and extensions produce output with certain properties. You can then specify mediawiki-HTML as a series of transformations applied to the output of the wikitext parser, where there can be multiple spec-compliant implementations of that parser.
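In code terms, that spec shape might look something like this (again a hypothetical sketch reusing the DomNode and core_parse stand-ins from earlier, not an actual interface): a template engine is anything satisfying a DOM-output contract, and mediawiki-HTML is the composition of post-processing transforms over the core parser output:

# Hypothetical sketch of the spec shape described above.
from typing import Protocol, Callable, Iterable

class TemplateEngine(Protocol):
    """Lua, JS, Ruby, ... -- the implementation is immaterial as long as the
    output satisfies the DOM contract."""
    def expand(self, name: str, args: dict[str, str]) -> "DomNode": ...

def mediawiki_html(wikitext: str,
                   engine: TemplateEngine,
                   post_transforms: Iterable[Callable[["DomNode"], "DomNode"]]) -> "DomNode":
    """mediawiki-HTML = post-processing transforms applied to the output of a
    spec-compliant core parse."""
    doc = core_parse(wikitext, expand=engine.expand)
    for transform in post_transforms:  # e.g. red links, bad images, user prefs
        doc = transform(doc)
    return doc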
I think it is feasible to get there. But whether we want a spec for wikitext, and should work towards one, is a different question.
Subbu.
On 08/01/2016 08:34 PM, Gergo Tisza wrote:
On Mon, Aug 1, 2016 at 5:27 PM, Rob Lanphier robla@wikimedia.org wrote:
Do you believe that declaring "the implementation is the spec" is a sustainable way of encouraging contribution to our projects?
Reimplementing Wikipedia's parser (complete with template inclusions, Wikidata fetches, Lua scripts, LaTeX snippets and whatever else) is practically impossible. What we do or do not declare won't change that.
There are many other, more realistic ways to encourage contribution by users who are interested in wikis, but not in Wikimedia projects. (Supporting Markdown would certainly be one of them.) But historically the WMF has shown zero interest in the wiki-but-not-Wikimedia userbase, and no other actor has been both willing and able to step up in its place.