MediaWiki makes a general contract that it won't allow "dangerous"
HTML tags in its output. It does this with a final pass fairly late in
the process that cleans HTML tag attributes, escapes any tags it
doesn't like, and escapes unrecognised &entities;.
Question is: should the parser attempt to do this, or assume the
existence of that function?
For example, in this code:
<pre>
preformatted text with <nasty><html><characters> and &entities;
</pre>
Should it just treat the string as valid, passing it out literally
(and letting the security code go to work), or should it keep parsing
characters, stripping them, and attempting to reproduce all the work
that is currently done?
Would the developers (or users, for that matter) be likely to trust a
pure parser solution? It seems to me that it's a lot easier simply to
scan the resulting output looking for bad bits, than it is to attempt
to predict and block off all the possible routes to producing nasty
code.
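(For illustration only, here's a rough Python sketch of what that kind
of late whitelist-based pass might look like. The tag whitelist, entity
list and function names below are made up for the example; they're not
the real sanitizer code.)
--
import re

# Hypothetical whitelist; the real one is longer and also checks attributes.
ALLOWED_TAGS = {"b", "i", "pre", "div", "span", "table", "tr", "td"}
KNOWN_ENTITIES = {"amp", "lt", "gt", "quot", "nbsp"}

def escape_bad_bits(html):
    """Escape any tag not on the whitelist and any unrecognised &entity;."""
    def fix_tag(m):
        if m.group(1).lower() in ALLOWED_TAGS:
            return m.group(0)                   # recognised tag: leave it alone
        return m.group(0).replace("<", "&lt;")  # neutralise everything else

    def fix_entity(m):
        if m.group(1).lower() in KNOWN_ENTITIES:
            return m.group(0)
        return m.group(0).replace("&", "&amp;")

    html = re.sub(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>", fix_tag, html)
    html = re.sub(r"&([A-Za-z]+);", fix_entity, html)
    return html
--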
On the downside, if the HTML-stripping logic isn't present in the
grammar, then it doesn't exist in any non-PHP implementations...
What do people think?
Steve
The way the <nowiki> tag is currently implemented, any text inside the
tag is basically stripped out, held to one side, and reinserted at the
last minute. So this:
1: [[pipe.jpg|thumb|A <nowiki>|</nowiki> character]]
works because that stage of the parser never even sees the |
character, and it reappears magically after the text has been turned
into <div...><img...></div> (I think).
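Roughly, as I understand it (a Python sketch; the marker strings and
function names are made up, not the actual implementation):
--
import re

def strip_nowiki(text):
    """Replace each <nowiki>...</nowiki> span with a unique placeholder
    and stash the literal content for later reinsertion."""
    stash = {}
    def replace(m):
        marker = "\x07UNIQ-nowiki-%d\x07" % len(stash)
        stash[marker] = m.group(1)
        return marker
    text = re.sub(r"<nowiki>(.*?)</nowiki>", replace, text, flags=re.S)
    return text, stash

def unstrip(html, stash):
    """Put the literal content back at the last minute, after all other
    parsing has been done."""
    for marker, literal in stash.items():
        html = html.replace(marker, literal)
    return html
--
So in example 1, the link-parsing stage only ever sees something like
[[pipe.jpg|thumb|A \x07UNIQ-nowiki-0\x07 character]], and the | only
comes back after the link has been rendered.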
However, what it actually does in a given context is hard to pin down.
This doesn't work, for example:
2: [[Image:<nowiki>foo</nowiki>.jpg]] - the whole thing is rendered literally.
I was thinking perhaps it could be redefined thus:
"Text surrounded by a <nowiki> block is treated as a literal sequence
of characters with no special meaning ascribed to any character other
than its literal representation. A nowiki block is a token separator,
not whitespace."
That would mean example 2 above would render as if the nowiki tags
weren't there.
This would also work:
3: [[Image:foo.jpg|thumb|<nowiki>left</nowiki>]] (caption: "left")
This would be a tricky case:
4: <nowiki> <script badeviljavacode here> </nowiki>
This would render literally, because of the "token separator" aspect:
5: [<nowiki></nowiki>[not a link]].
It would be technically possible to link to pages with bad characters
in their names:
6: [[E<nowiki>|</nowiki>eet]]
Would any existing wikitext be broken by this redefinition? I'm not
really trying to change the meaning of nowiki, I'm trying to set it
down in words, given that the existing definition ("stuff gets
stripped out, then replaced at various times") is not really
implementable.
Steve
On 11/20/07, Mark Clements <gmane(a)kennel17.co.uk> wrote:
> Really, this isn't really about logic in that sense, it's about changing the
> way we refer to things. Part of this is to make a logical distinction between
> the 3 different entities that have a different syntax and a different
> purpose (parser directives, built-in variables and automatic links).
Put like that, it's a compelling argument: those three terms are very
descriptive.
> links, so the term 'automatic links' might be better, and again removes a
> bit of the mystery.
Yep. It could possibly be even better, but I'm not sure how. "Implicit
link"? "Bracketless link"? Dunno. I hate them anyway. :) ("Evil
bracketless link"?...hmm)
> Yes. It occurred to me, shortly after writing that post, that they should
> not be referred to as "built-in variables", but rather as "built-in
> templates". That is the term I shall use from now on.
Some of them really behave like variables in a template, but with two
braces instead of 3. "Template" really implies a calculation of some
kind to me. But either is ok.
> > Can "built-in variables" take arguments, and if so, how are they treated?
>
> This is resolved if we refer to them as "built-in templates" instead.
So, they can take arguments, and the syntax is with a pipe like a
normal template. So what is {{DEFAULTSORT:foo}}?
> Hmmm.... well here you've found an exception (and there may be others).
> DEFAULTSORT is a parser directive, but it uses the template-style syntax.
No, it uses a colon. It just happens that Wikipedia has a template
called {{DEFAULTSORT}} which calls {{DEFAULTSORT:{{{1}}}}}. *groan*
> This is partly a symptom of the problem I am describing (lack of a formal
> definition of these items) but is also probably down to the fact that the __
> syntax doesn't support arguments as easily (though I don't see why not).
>
> We need to think about how to resolve this e.g. can we re-define
> DEFAULTSORT:
> __DEFAULTSORT|Sort key__ seems plausible.
> __DEFAULTSORT|{{PAGENAME}}__ seems a little, well, odd... but maybe that's
> just because it's new.
Well, the page name is the default sort key anyway. But, yes.
__FOO|Arg|Arg__ looks ok to me. Would <nowiki> work, if you need to
pass in a __ somewhere?
> Or do we change the syntax to remove all double-underscore directives, and
> change them all to use template syntax? In this case parser-directives
> could be distinguished by being prefixed with an underscore (e.g.
> {{_NOTOC}}). Or we could make them into built-in parser functions
> {{#NOTOC}}.
Hmm...well, we're not supposed to be changing anything at all at the
moment. Perhaps we could at least draw up a list of all the current
magic words, set out a proposed syntax change and how they would all
map, and see what it looks like. Of course we could implement the
change in the current parser, to avoid the restriction on changing
syntax and parser at the same time :)
> These last ideas are just off the top of my head mind, and need further
> consideration. Any syntax changes would of course need to retain existing
> functionality (as 'deprecated' syntax) to preserve backwards compatibility.
That's a problem. I was originally asking about "anything being a
magic word" because if anything can be, then parsing is harder.
Changing to a more uniform structure but supporting the old terms
doesn't help much. Though in practice I don't think it will prove to
be a huge problem.
Steve
---------- Forwarded message ----------
From: Tim Starling <tstarling(a)wikimedia.org>
Date: 21 Nov 2007 02:34
Subject: [Wikitech-l] New preprocessor
To: wikitech-l(a)lists.wikimedia.org
Brion said to me a couple of weeks ago "the parser is slow for large
articles, fix it". So along these lines, I have rewritten the preprocessor
phase to make it faster in PHP. I also have plans for further speed
improvement via a partial port to C.
This work was planned and started before the recent parser discussions on
wikitech-l, by Steve Bennett et al. I chose to ignore those discussions to
improve my productivity. Apologies if I'm stepping on any toes.
I'll cover the technical side of this first, and then the impact for the
user in terms of wikitext syntax change.
This text is mostly adapted from my entry in RELEASE-NOTES.
== Technical viewpoint ==
The parser pass order has changed from
* Extension tag strip and render
* HTML normalisation and security
* Template expansion
* Main section...
to
* Template and extension tag parse to intermediate representation
* Template expansion and extension rendering
* HTML normalisation and security
* Main section...
The new two-pass preprocessor can skip "dead branches" in template
expansion, such as unfollowed #if cases and unused defaults for template
arguments. This provides a significant performance improvement in
template-heavy test cases taken from Wikipedia. Parser function hooks can
participate in this performance improvement by using the new
SFH_OBJECT_ARGS flag during registration.
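To illustrate the dead-branch idea (a Python sketch of the concept
only; the real interface is the PHP SFH_OBJECT_ARGS hook, and the
frame object and names here are invented): a lazy #if only asks for
the argument it actually needs, so the untaken branch is never expanded.
--
def expand_if(frame, args):
    """Sketch of a lazy #if: only the branch that is actually taken gets
    expanded. 'args' are unexpanded preprocessor nodes; frame.expand()
    turns a node into wikitext, expanding any templates inside it."""
    condition = frame.expand(args[0]).strip()
    if condition:
        return frame.expand(args[1]) if len(args) > 1 else ""
    # When the condition is empty, the "then" branch above is never
    # expanded, so templates inside it (a dead branch) are simply skipped.
    return frame.expand(args[2]) if len(args) > 2 else ""
--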
The intermediate representation I have used is a DOM document tree, taking
advantage of PHP's standard access to libxml's efficient tree structures.
I construct the tree via an XML text stage, although it could be done
directly with DOM. My gut feeling was that the XML implementation would be
faster, but I've made the interfaces such that it could be done either
way. The XML form is not exposed.
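As a rough analogy (Python and an invented element layout, just to show
the shape of the idea; the actual node names and XML text are internal):
--
import xml.etree.ElementTree as ET

# Hypothetical serialisation of {{cite web|url=http://example.org}}:
xml_text = ("<root><template><title>cite web</title>"
            "<part><name>url</name><value>http://example.org</value></part>"
            "</template></root>")

# Build the text first, then let the library construct the tree.
tree = ET.fromstring(xml_text)
for tmpl in tree.iter("template"):
    print(tmpl.findtext("title"))                       # -> cite web
    for part in tmpl.iter("part"):
        print(part.findtext("name"), "=", part.findtext("value"))
--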
One reason for using an intermediate representation is so that the parse
results for templates can be cached. The theory is that the cached results
can then be used to efficiently expand templates with changeable
arguments, such as {{cite web}}. ( There's also an expansion cache for
templates expanded with no arguments, such as {{•}}. )
Another reason is that I couldn't see any efficient (O(N) worst-case time
order) way to implement dead branch elimination without an intermediate
representation.
The pre-expand include size limit has been removed, since there's no
efficient way to calculate such a figure, and it would now be meaningless
for performance anyway. The "preprocessor node count" takes its place,
with a generous default limit.
The context in which XML-style extension tags are called has changed, so
extensions which make use of the parser state may need compatibility
changes. Since extension tags are now rendered simultaneously with
template expansion, there is a possibility for future improvement of the
extension tag interface. For example, we could have
preprocessor-transparent tags which act like parser functions, and we
could give extension tags access to the template arguments (i.e. triple
brace expansion).
== User viewpoint ==
The main effect of this for the user is that the rules for uncovered
syntax have changed.
Uncovered main-pass syntax, such as HTML tags, is now generally valid,
whereas previously in some cases it was escaped. For example, you could
have "<ta" in one template, and "ble>" in another template, and put them
together to make a valid <table> tag. Previously the result would have
been the escaped text "&lt;table&gt;".
Uncovered preprocessor syntax is generally not recognised. For example, if
you have "{{a" in Template:A and "b}}" in Template:B, then "{{a}}{{b}}"
will be converted to a literal "{{ab}}" rather than the contents of
Template:Ab. This was the case previously in HTML output mode, and is now
uniformly the case in the other modes as well. HTML-style comments
uncovered by template expansion will not be recognised by the preprocessor
and hence will not prevent template expansion within them, but they will
be stripped by the following HTML security pass.
The rules for template expansion during message transformation were
counterintuitive, mostly accidental and buggy. There are a few small
changes in this version: for example, templates with dynamic names, as in
"{{ {{a}} }}", are fully expanded as they are in HTML mode, whereas
previously only the inner template was expanded. I'd like to make some
larger breaking changes to message transformation, after a review of
typical use cases.
The header identification routines for section edit and for numbering
section edit links have been merged. This removes a significant failure
mode and fixes a whole category of bugs (tracked by bug #4899). Wikitext
headings uncovered by template expansion or comment removal will still be
rendered into a heading tag, and will get an entry in the TOC, but will
not have a section edit link. HTML-style headings will also not have a
section edit link. Valid wikitext headings present in the template source
text will get a template section edit link. This is a major break from
previous behaviour, but I believe the effects are almost entirely beneficial.
-- Tim Starling
Hello,
David Gerard wrote:
> http://lists.wikimedia.org/mailman/listinfo/wikitext-l
>
> Wikitext-l was formed from a recent discussion on wikitech-l about the
> need to sanely reimplement the current parser, which is a Horrible
> Mess and pretty much impossible to reimplement in another language.
>
> The MediaWiki parser definition is literally "whatever the PHP parser
> does." Some of what it does is arguably very wrong, pathological,
> magical or just a Stupid Parser Trick. So the list has been formed to
> come up with a grammar that defines all the useful parts of the
> present parser, and so can be used by anyone to implement a MediaWiki
> wikitext parser. This will be useful for other software, for WYSIWYG
> editing extensions ... all manner of things.
>
> Some of what some people would think of as a "stupid parser trick" is
> in fact important - e.g. L'''uomo'' which renders as L<i>uomo</i>
> (necessary for French and Italian).
Actually, the proper French apostrophe should be ’ (Unicode U+2019,
HTML entity &rsquo;), not '.
On the French Wikisource, we systematically replace ' with ’ in all
articles and titles with bots (keeping redirects). So actually, '''
should be ’'' in proper French typography.
The issue is that ’ is not on the standard French keyboard, and it does
not exist in Latin-1 (like œ for oe). There are also problems with
broken software, like copy-and-paste in a non-Unicode-compliant editor,
etc. That's why it is so rarely used.
> - d.
Regards,
Yann
--
http://www.non-violence.org/ | Collaborative site on non-violence
http://www.forget-me.net/ | Alternatives on the Net
http://fr.wikisource.org/ | Free library
http://wikilivres.info | Free documents
Hi,
Some time ago ICANN put up test IDNs. They set up a wiki using
MediaWiki: http://idn.icann.org
Soon after that, an issue concerning HTTP URLs for RTL scripts was
raised (actually I think I raised it, to be honest):
http://idn.icann.org/Talk:IDNwiki#RTL_scripts_URL_directionality_problem_.2…
Basically, for RTL scripts, URLs are displayed with "http://" on the
left. This looks awkward for us RTL users because it breaks the
hierarchy of the URL (it's better explained in the example above).
After some investigation, I found that when you use an RTL-localized
browser, URLs in the address bar are displayed with "://http" on the
right, which, I think, is an acceptable solution.
I think the MediaWiki parser should convert RTL IDN URLs to this format.
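Something along these lines, perhaps (a Python sketch of one possible
approach; the bidi handling is very simplified and the function names
are mine, not MediaWiki's):
--
import unicodedata

def host_is_rtl(host):
    """True if the first strongly-directional character in the host
    name is right-to-left."""
    for ch in host:
        direction = unicodedata.bidirectional(ch)
        if direction in ("R", "AL"):
            return True
        if direction == "L":
            return False
    return False

def display_url(url):
    """For RTL IDN hosts, wrap the rendered link text in an RTL
    embedding so that "http://" ends up visually on the right,
    the way RTL-localized browsers show it."""
    scheme, sep, rest = url.partition("://")
    host = rest.split("/", 1)[0]
    if sep and host_is_rtl(host):
        RLE, PDF = "\u202b", "\u202c"   # RIGHT-TO-LEFT EMBEDDING / POP
        return RLE + url + PDF
    return url
--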
--
Slim Amamou
http://NoMemorySpace.wordpress.com
Here's another one, at the bottom of
http://www.mediawiki.org/wiki/User:Stevage
(note, mw_img_thumbnail means "the magic word 'img_thumbnail', however that
is defined".)
The problem I have here is the options for the image: you'd like the word
"thumbnail" to be a token, but then if you get a case like:
[[image:finger.jpg|Note the impressive thumbnails.]]
you get one token for "thumbnail" rather than "t" and "h" etc.
Solutions I can think of so far:
1) Explicitly make the match for text to be 'a'..'z' | 'A'..'Z'
| MW_img_thumbnail | ...
2) Make tokens for individual letters (Aa, Bb...) then make the parser
recognise a pattern like Tt + Hh + Uu + Mm...
3) Make a token which is
'|thumbnail', then use some trick to distinguish '|thumbnailblah' from
'|thumbnail|'.
4) Like 1), but use a localised lexer so that those words are only tokens in
this specific context.
5) Just match text, then use special markup at the parser level to look into
the text that was matched.
I've tried 1) and 2) and they both work. I'll probably try 5) next because
3) is just ugly.
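To make 5) concrete, here's roughly what I mean, as a Python sketch (I
don't have the ANTLR version yet, and the option list below is just an
example): match each pipe-separated chunk as plain text, then decide
afterwards whether the whole chunk is a recognised image option.
--
# Hypothetical set of aliases for image option magic words.
IMG_OPTIONS = {"thumbnail", "thumb", "left", "right", "center"}

def classify_image_params(params):
    """Split image parameters into recognised options and a caption.
    Only a chunk that is *entirely* a magic word counts as an option,
    so "Note the impressive thumbnails." stays ordinary caption text."""
    options, caption = [], None
    for chunk in params:
        word = chunk.strip().lower()
        if word in IMG_OPTIONS:
            options.append(word)
        else:
            caption = chunk   # the last non-option chunk wins
    return options, caption

# [[image:finger.jpg|thumbnail|Note the impressive thumbnails.]]
opts, cap = classify_image_params(
    ["thumbnail", "Note the impressive thumbnails."])
# opts == ["thumbnail"], cap == "Note the impressive thumbnails."
--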
Anyone have any comments or suggestions?
I really think writing the grammar in ANTLR is our best bet at this point.
Advantages:
1) We're talking about actual, parseable grammar in an actual syntax, rather
than the half-arsed EBNF/BNF we've done so far.
2) We can use ANTLRWorks to play with the grammar, visualise it etc.
3) One of the goals is to allow third parties to generate parsers in
a variety of languages. ANTLR already has 5 code-generation targets,
and more (perhaps including PHP) are on the way.
Downsides:
1) ANTLR can't yet generate a parser in PHP. However, there may exist
Java->PHP or C->PHP translators or something.
Steve
That was quite amusing: I read the "Welcome to your new list" message before
the wikitech-l message. Anyway, a list just for parser discussion is good.
Here's a bit of ANTLR grammar I wrote to handle basic article structure:
paragraph blocks and "special blocks", where two consecutive blocks of the
same type need an extra linefeed. Since I haven't written any Lex or Yacc
before, I'm still wrestling a bit with what are probably fairly basic
problems. In this case, I found the requirement of an extra linefeed quite
challenging to implement without ambiguity problems.
As it is, this does work, but spews out a huge number of warnings and even
an apparently non-fatal "fatal error". I presume some of these problems can
be avoided through semantic and syntactic predicates, if not backtracking
or memoization (no, that's not a typo). Any ANTLR experts here?
Steve
--
grammar paras;
// An article is an optional run of paragraphs followed by any number of
// special-block runs, each followed by either EOF or more paragraphs.
article : pseries? (sseries (EOF| pseries))*;
// Runs of the same block type need at least one extra linefeed between them.
pseries : para (N+ para)* N*;
sseries : specialblock (N+ specialblock)* N*;
specialblock
: (spaceblock|listblock)+;
spaceblock
: spaceline+;
spaceline
: SPECIALCHAR char* N;
listblock
: (listitem)+;
listitem: (bulletitem | numberitem | indentitem | defitem);
bulletitem
: BULLETCHAR (listitem | (nonlistchar char*)? N);
numberitem
: NUMBERCHAR (listitem | (nonlistchar char*)? N);
indentitem
: INDENTCHAR (listitem | (nonlistchar char*)? N);
defitem
: DEFCHAR (nonindentchar)* (definition | INDENTCHAR? N );
definition
: ':' char+ N;
BULLETCHAR: '*';
NUMBERCHAR: '#';
INDENTCHAR: ':';
DEFCHAR : ';';
para : (nonspecialchar char* N)+;
listchar: BULLETCHAR | NUMBERCHAR | INDENTCHAR | DEFCHAR;
// A leading space starts a preformatted (space-indented) line.
SPECIALCHAR
: ' ';
nonlistchar
: SPECIALCHAR | nonspecialchar;
char : nonlistchar | listchar;
nonindentchar
: nonlistchar | BULLETCHAR | NUMBERCHAR | DEFCHAR;
N : '\r'? '\n' ;
nowiki : NOWIKI;
NOWIKI : '<nowiki>'( options {greedy=false;} : . )*'</nowiki>';
nonspecialchar
: NONSPECIALCHAR | nowiki;
NONSPECIALCHAR
: ('A'..'Z'| 'a'..'z' | '0'..'9' | '\'' | '"' | '(' | ')')+;
--
PS you might notice the above grammar implements two "improvements" to the
;definition:term notation:
1. The ;definition has to be the last item in the list. Constructs like
##;## are worthless.
2. A trailing : is treated literally.
Just sent this to wikipedia-l and foundation-l - I figured they would
be good places to ask.
- d.
---------- Forwarded message ----------
From: David Gerard <dgerard(a)gmail.com>
Date: 17 Nov 2007 12:05
Subject: New parser in the works - please help
To: wikipedia-l(a)lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wikitext-l
Wikitext-l was formed from a recent discussion on wikitech-l about the
need to sanely reimplement the current parser, which is a Horrible
Mess and pretty much impossible to reimplement in another language.
The MediaWiki parser definition is literally "whatever the PHP parser
does." Some of what it does is arguably very wrong, pathological,
magical or just a Stupid Parser Trick. So the list has been formed to
come up with a grammar that defines all the useful parts of the
present parser, and so can be used by anyone to implement a MediaWiki
wikitext parser. This will be useful for other software, for WYSIWYG
editing extensions ... all manner of things.
Some of what some people would think of as a "stupid parser trick" is
in fact important - e.g. L'''uomo'' which renders as L<i>uomo</i>
(necessary for French and Italian).
So: we need to know what MediaWiki quirks are supporting important
constructs in languages other than English (which is the language the
list is in, and is the native language of most of the participants),
and particularly in non-European languages.
This list is unlikely to implement new features, e.g. (an example
brought up by GerardM) the double-apostrophe in Neapolitan. But we
really need to know about present important features that wouldn't be
obvious to an English-speaker going through the present parser code.
- d.