TL;DR: You get to a spec by paying down the technical debt that keeps
wikitext parsing intricately tied to the internals of the mediawiki
implementation and its state.
In discussions, there is far too much focus on the fact that you cannot
write a BNF grammar or use yacc / lex / bison / whatever, or that quote
parsing is context-sensitive. I don't think that is as big a deal. For
example, you could switch to Markdown and it wouldn't change much of
the picture outlined below ... I think all of that is a lesser issue
compared to the following:
Right now, mediawiki HTML output depends on the following:
* input wikitext
* wiki config (including installed extensions)
* installed templates
* media resources (images, audio, video)
* PHP parser hooks that expose parsing internals and implementation
details (not replicable in other parsers)
* wiki messages (ex: cite output)
* state of the corpus and other db state (ex: red links, bad images)
* user state (prefs, etc.)
* Tidy
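To make the dependency problem concrete, here is a toy sketch (all the type and field names below are mine, invented for illustration; they are not actual mediawiki APIs) contrasting what today's HTML rendering effectively takes as input with what a spec-friendly parse would take:

```python
from dataclasses import dataclass, field

# Hypothetical types for illustration only; not actual MediaWiki code.

@dataclass
class WikiConfig:
    installed_extensions: list = field(default_factory=list)

@dataclass
class CurrentRenderInput:
    """Everything today's HTML output effectively depends on."""
    wikitext: str
    config: WikiConfig
    templates: dict        # installed templates
    media: dict            # images, audio, video
    parser_hooks: list     # PHP hooks exposing parser internals
    messages: dict         # wiki messages (e.g. cite output)
    db_state: dict         # corpus / db state: red links, bad images
    user_prefs: dict       # user state
    # ... plus Tidy's cleanup behavior on the output side

@dataclass
class SpecFriendlyParseInput:
    """What a replicable spec would want the core parse to depend on."""
    wikitext: str
    config: WikiConfig
```

The point of the sketch is the size difference between the two signatures: everything in `CurrentRenderInput` beyond the first two fields is state that a second parser implementation would have to replicate.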
So, one reason implementing a wikitext parser is complex is that the
output HTML is not simply a straightforward transformation of the input
wikitext (plus some config). There is far too much other state that
gets in the way.
The second source of complexity is that markup errors aren't confined
to narrow contexts but can leak out and affect the output of the entire
page. Some user pages even seem to exploit this as a feature (unclosed
div tags).
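A toy model of the leakage (this is not mediawiki's or Tidy's actual algorithm, just an illustration): a renderer that auto-closes unclosed tags at the end of the page, so a markup error in one section silently wraps every later section.

```python
import re

def render(page: str) -> str:
    """Toy renderer: balance <div> tags by closing leftovers at page end."""
    open_tags = []
    out = []
    for token in re.split(r'(<div[^>]*>|</div>)', page):
        if token.startswith('<div'):
            open_tags.append('div')
        elif token == '</div>':
            if open_tags:
                open_tags.pop()
            else:
                continue  # drop a stray close tag
        out.append(token)
    # An unclosed tag anywhere leaks to the very end of the page:
    out.extend('</div>' for _ in open_tags)
    return ''.join(out)
```

For example, `render('<div class="box">intro rest of page')` closes the div after "rest of page", so everything following the error ends up inside the box. An error in one user-edited section is thus visible in the output of all subsequent sections.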
The third source of complexity is that some parser hooks expose
internals of the implementation (Before/After Strip/Tidy and other such
hooks). An implementation without Tidy, or one that handles wikitext
differently, might not have the same pipeline.
However, we can still get to a spec that is much more replicable if we
start cleaning up some of this incrementally and paying down technical
debt. Here are some things going on right now towards that.
* We are close to getting rid of Tidy, which removes it from the equation.
* There are RFCs that propose defining DOM scopes and propose that the
output of templates (and extensions) be a DOM (vs. a string), with some
caveats (that I will ignore here). If we can get these implemented, we
immediately isolate the parsing of a top-level page from the details of
how extensions and transclusions are processed.
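Here is a minimal sketch of that isolation property (the function and class names are mine, not from any RFC): if a template expansion yields a well-formed subtree rather than a string, the top-level parse can compose subtrees without ever seeing, or being affected by, the template's internal markup.

```python
from dataclasses import dataclass, field

# Illustrative names only; not RFC or MediaWiki terminology.

@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    text: str = ""

def expand_template(name: str, registry: dict) -> Node:
    """Template expansion returns a DOM subtree, not a wikitext string."""
    return registry[name]()

def parse_page(tokens: list, registry: dict) -> Node:
    root = Node("body")
    for tok in tokens:
        if tok.startswith("{{") and tok.endswith("}}"):
            # The subtree is grafted in as-is; the top-level parse
            # never re-parses the template's output as markup, so a
            # malformed template cannot corrupt the rest of the page.
            root.children.append(expand_template(tok[2:-2], registry))
        else:
            root.children.append(Node("text", text=tok))
    return root
```

Under this model, how a template engine produces its subtree (Lua, JS, PHP, whatever) is invisible to the page-level parse; only the subtree matters.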
* RFCs that propose that things like red links, bad images, user state,
and site messages not be inputs to the core wikitext parse. From a spec
point of view, these should be viewed as post-processing
transformations. For efficiency reasons, an implementation might still
choose to integrate them into the parse, but that is not a
requirement.
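The red-link case above can be sketched as a two-stage pipeline (again, toy code with invented names, not the actual RFC design): a pure parse that emits link nodes without knowing whether their targets exist, followed by a separate pass that consults db state.

```python
import re

def core_parse(wikitext: str) -> list:
    """Pure parse: emits link nodes; target existence is unknown here."""
    nodes = []
    pos = 0
    for m in re.finditer(r'\[\[([^\]]+)\]\]', wikitext):
        if m.start() > pos:
            nodes.append(('text', wikitext[pos:m.start()]))
        nodes.append(('link', m.group(1), None))  # None = not yet resolved
        pos = m.end()
    if pos < len(wikitext):
        nodes.append(('text', wikitext[pos:]))
    return nodes

def annotate_red_links(nodes: list, existing_pages: set) -> list:
    """Post-processing pass: db state enters only here, after the parse."""
    out = []
    for n in nodes:
        if n[0] == 'link':
            out.append(('link', n[1], n[1] in existing_pages))
        else:
            out.append(n)
    return out
```

Because `core_parse` never touches db state, two spec-compliant parsers on two wikis produce identical parse output for the same wikitext; only the annotation pass differs.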
Separately, here is one other thing we can consider:
* Deprecate and replace tag hooks that expose parser internals.
Once all of these are done, it becomes far more feasible to define a
spec for wikitext parsing that is not tied to the internals of
mediawiki or its extensions. At that point, you could implement
templating via Lua or via JS or via Ruby ... the specifics are
immaterial. What matters is that those templating implementations and
extensions produce output with certain properties. You can then specify
mediawiki-HTML as a series of transformations applied to the output of
the wikitext parser ... and there can be multiple spec-compliant
implementations of that parser.
I think it is feasible to get there. But, whether we want a spec for
wikitext and should work towards that is a different question.
Subbu.
On 08/01/2016 08:34 PM, Gergo Tisza wrote:
On Mon, Aug 1, 2016 at 5:27 PM, Rob Lanphier
<robla(a)wikimedia.org> wrote:
Do you believe that declaring "the implementation is the spec" is a
sustainable way of encouraging contribution to our projects?
Reimplementing Wikipedia's parser (complete with template inclusions,
Wikidata fetches, Lua scripts, LaTeX snippets and whatever else) is
practically impossible. What we do or do not declare won't change that.
There are many other, more realistic ways to encourage contribution by
users who are interested in wikis, but not in Wikimedia projects.
(Supporting Markdown would certainly be one of them.) But historically the
WMF has shown zero interest in the wiki-but-not-Wikimedia userbase, and no
other actor has been both willing and able to step up in its place.
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l