2010-08-10 12:45, Thomas Dalton wrote:
> On 10 August 2010 11:09, David Gerard <dgerard(a)gmail.com> wrote:
>> On 9 August 2010 17:04, Mark Clements (HappyDog) <gmane(a)kennel17.co.uk> wrote:
>>> This kind of unexpected edge-case is arguably something that should be
>>> fixed in any formal markup specification.
>>
>> How prevalent is it in actual wikitext? Is it an edge case people
>> actually use much, or are all instances of it basically errors?
>> That'll be the question.
>
> Its only potential use is in making the wikitext more easily readable,
> which doesn't seem important enough to warrant just a weird edge-case.
> Any formal spec is going to end up breaking things, that can't really
> be helped (unless we just write down a spec for the current behaviour,
> bugs and all, which sounds like a lost opportunity to me).

If you consider the large body of information tied to MediaWiki
syntax, it is likely that for any border case, there is a revision of
some page that will trigger that border case.
Regarding strategy on how to replace the MediaWiki parser, I can
see two extremes:
1. Search out all weird edge cases and reproduce them in parser rules.
   Walk through the revisions of Wikipedia and, for each edge case, note
   all revisions for which the parser rule for the edge case is
   executed. Based on that data, determine which edge cases can be
   safely removed, or define a conversion for the affected content.
2. Don't support any edge cases. Just consider the content broken and
   let the wiki users fix it themselves. Historic revisions of pages
   will be permanently broken.
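
As a rough illustration of strategy 1, the sketch below counts which revisions trigger each edge case. The rule names, the regular expressions standing in for parser rules, the (revision id, wikitext) input shape, and the threshold are all assumptions made for this example. A real survey would instrument the parser itself and read revisions from a dump.

```python
import re
from collections import defaultdict

# Hypothetical detectors: each regex stands in for "the parser rule for
# this edge case was executed". They are illustrative, not exhaustive.
EDGE_CASE_RULES = {
    # Text between the "{|" line and the first row or cell of a table.
    "content_before_first_cell": re.compile(r"^\{\|.*\n(?![|!])\S", re.MULTILINE),
    # A run of two apostrophes inside an internal link.
    "quotes_inside_internal_link": re.compile(r"\[\[[^\]\n]*''[^\]\n]*\]\]"),
}


def edge_case_usage(revisions):
    """Map each edge-case name to the list of revision ids that trigger it."""
    hits = defaultdict(list)
    for rev_id, wikitext in revisions:
        for name, pattern in EDGE_CASE_RULES.items():
            if pattern.search(wikitext):
                hits[name].append(rev_id)
    return dict(hits)


def removable(usage, total_revisions, threshold=0.001):
    """Names of edge cases hit by less than `threshold` of all revisions."""
    return sorted(
        name
        for name in EDGE_CASE_RULES
        if len(usage.get(name, [])) / total_revisions < threshold
    )
```

The per-rule lists of revision ids are kept, not just counts, so that the affected content can later be converted instead of simply being declared broken.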
I am trying to support as many edge cases as is reasonable in my
attempt to write a new parser. It seems, however, that the MediaWiki
parser is still actively developed, and that backwards compatibility
with edge cases may not be much of a concern. For instance, in
1.16.0beta3 we have:
    $text = $this->doAllQuotes( $text );
    $text = $this->replaceInternalLinks( $text );
    $text = $this->replaceExternalLinks( $text );
which in trunk is:
    $text = $this->replaceInternalLinks( $text );
    $text = $this->doAllQuotes( $text );
    $text = $this->replaceExternalLinks( $text );
So it is now possible to have apostrophes in internal links, but
still not in external links.
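
To see why the order of the passes matters, here is a toy model in Python. The three functions are simplified stand-ins written for this example and are not MediaWiki's actual code. The placeholder scheme and the rule that a target containing "<" or ">" is rejected are assumptions chosen to make the effect visible.

```python
import re


def do_all_quotes(text):
    """Toy stand-in: turn paired '' into <i>...</i>."""
    return re.sub(r"''(.+?)''", r"<i>\1</i>", text)


def replace_internal_links(text, holders):
    """Toy stand-in: swap each valid [[target]] for an opaque placeholder."""
    def hold(match):
        target = match.group(1)
        if "<" in target or ">" in target:
            return match.group(0)  # markup in the target: not a valid link
        holders.append(
            '<a href="/wiki/%s">%s</a>' % (target.replace(" ", "_"), target)
        )
        return "\x7fLINK%d\x7f" % (len(holders) - 1)
    return re.sub(r"\[\[([^\[\]|]+)\]\]", hold, text)


def expand_holders(text, holders):
    """Put the finished links back in place of their placeholders."""
    return re.sub(
        "\x7fLINK(\\d+)\x7f", lambda m: holders[int(m.group(1))], text
    )


def render(text, links_first):
    """Run the two passes in either order, then expand the placeholders."""
    holders = []
    if links_first:
        text = replace_internal_links(text, holders)
        text = do_all_quotes(text)
    else:
        text = do_all_quotes(text)
        text = replace_internal_links(text, holders)
    return expand_holders(text, holders)
```

In this model, render("[[Rock ''n'' Roll]]", links_first=False) leaves the link unparsed, because the quote pass has already put <i> tags into the target. With links_first=True the target is hidden behind a placeholder before the quote pass runs, so the apostrophes survive and the link is produced.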
From the parser's point of view, the edge cases can be divided into
"harmless", where a rule to support it does not increase the
complexity of the parser significantly, and "harmful", where adding a
rule to support them would either dramatically increase the size of
the parser or make it possible to craft content that will take more
than linear time or memory to process. The edge cases surrounding
links definitely fall into the harmful category. I will be writing a
separate post about links later.
Maybe it would be a good idea to provide some feedback to the user
regarding bad syntax. In my parser implementation, I am considering
generating special events for syntax that should be avoided. For
instance:
    begin_table:
        begin = BEGIN_TABLE NEWLINE*
        (
            {
                X->beginGarbageBlock(X, "Unsupported syntax: content between "
                    "the {| and the first column in a table.");
            }
            ((inline_element)=> garbage_inline_text NEWLINE* )*
            block_elements?
            {
                X->endGarbageBlock(X);
            }
        )*
        {
            X->beginTable(X, $begin->custom);
        }
        ;
This could, for instance, be rendered in HTML as: <div class="garbage"
title="Unsupported syntax: content between ..."> </div>.
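
A minimal sketch of how a listener might turn such garbage events into that kind of HTML. The class and method names are hypothetical, loosely mirroring the beginGarbageBlock/endGarbageBlock callbacks in the grammar. How the real listener receives text and table events is not specified here, so text() is an assumption.

```python
import html


class HtmlListener:
    """Hypothetical event sink that collects HTML fragments."""

    def __init__(self):
        self.out = []

    def begin_garbage_block(self, message):
        # The warning goes into the title attribute, so escape quotes too.
        self.out.append(
            '<div class="garbage" title="%s">' % html.escape(message, quote=True)
        )

    def end_garbage_block(self):
        self.out.append("</div>")

    def text(self, content):
        # Assumed callback for plain text found inside the garbage block.
        self.out.append(html.escape(content))

    def result(self):
        return "".join(self.out)


listener = HtmlListener()
listener.begin_garbage_block(
    "Unsupported syntax: content between the {| and the first column in a table."
)
listener.text("stray text")
listener.end_garbage_block()
print(listener.result())
```

Keeping the offending content inside the div, instead of dropping it, lets a stylesheet decide whether to highlight or hide it.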
/Andreas