MediaWiki makes a general contract that it won't allow "dangerous"
HTML tags in its output. It does this with a final pass fairly late in
the process that cleans HTML tag attributes, escapes any tags it
doesn't like, and escapes unrecognised &entities;.
Question is: should the parser attempt to do this, or assume the
existence of that function?
For example, in this code:
<pre>
preformatted text with <nasty><html><characters> and &entities;
</pre>
Should it just treat the string as valid, passing it out literally
(and letting the security code go to work), or should it keep parsing
characters, stripping them, and attempting to reproduce all the work
that is currently done?
Would the developers (or users, for that matter) be likely to trust a
pure parser solution? It seems to me that it's a lot easier simply to
scan the resulting output looking for bad bits, than it is to attempt
to predict and block off all the possible routes to producing nasty
code.
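(For illustration only, here's a rough Python sketch of what that kind
of late whitelist-based pass might look like. The tag whitelist, entity
list and function names below are made up for the example; they're not
the real sanitizer code.)
--
import re

# Hypothetical whitelist; the real one is longer and also checks attributes.
ALLOWED_TAGS = {"b", "i", "pre", "div", "span", "table", "tr", "td"}
KNOWN_ENTITIES = {"amp", "lt", "gt", "quot", "nbsp"}

def escape_bad_bits(html):
    """Escape any tag not on the whitelist and any unrecognised &entity;."""
    def fix_tag(m):
        if m.group(1).lower() in ALLOWED_TAGS:
            return m.group(0)                   # recognised tag: leave it alone
        return m.group(0).replace("<", "&lt;")  # neutralise everything else

    def fix_entity(m):
        if m.group(1).lower() in KNOWN_ENTITIES:
            return m.group(0)
        return m.group(0).replace("&", "&amp;")

    html = re.sub(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>", fix_tag, html)
    html = re.sub(r"&([A-Za-z]+);", fix_entity, html)
    return html
--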
On the downside, if the HTML-stripping logic isn't present in the
grammar, then it doesn't exist in any non-PHP implementations...
What do people think?
Steve
The way the <nowiki> tag is currently implemented, any text inside the
tag is basically stripped out, held to one side, and reinserted at the
last minute. So this:
1: [[pipe.jpg|thumb|A <nowiki>|</nowiki> character]]
works because that stage of the parser never even sees the |
character, and it reappears magically after the text has been turned
into <div...><img...></div> (I think).
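Roughly, as I understand it (a Python sketch; the marker strings and
function names are made up, not the actual implementation):
--
import re

def strip_nowiki(text):
    """Replace each <nowiki>...</nowiki> span with a unique placeholder
    and stash the literal content for later reinsertion."""
    stash = {}
    def replace(m):
        marker = "\x07UNIQ-nowiki-%d\x07" % len(stash)
        stash[marker] = m.group(1)
        return marker
    text = re.sub(r"<nowiki>(.*?)</nowiki>", replace, text, flags=re.S)
    return text, stash

def unstrip(html, stash):
    """Put the literal content back at the last minute, after all other
    parsing has been done."""
    for marker, literal in stash.items():
        html = html.replace(marker, literal)
    return html
--
So in example 1, the link-parsing stage only ever sees something like
[[pipe.jpg|thumb|A \x07UNIQ-nowiki-0\x07 character]], and the | only
comes back after the link has been rendered.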
However, what it actually does in a given context is hard to pin down.
This doesn't work, for example:
2: [[Image:<nowiki>foo</nowiki>.jpg]] - the whole thing is rendered literally.
I was thinking perhaps it could be redefined thus:
"Text surrounded by a <nowiki> block is treated as a literal sequence
of characters with no special meaning ascribed to any character other
than its literal representation. A nowiki block is a token separator,
not whitespace."
That would mean example 2 above would render as if the nowiki tags
weren't there.
This would also work:
3: [[Image:foo.jpg|thumb|<nowiki>left</nowiki>]] (caption: "left")
This would be a tricky case:
4: <nowiki> <script badeviljavacode here> </nowiki>
This would render literally, because of the "token separator" aspect:
5: [<nowiki></nowiki>[not a link]].
It would be technically possible to link to pages with bad characters
in their names:
6: [[E<nowiki>|</nowiki>eet]]
Would any existing wikitext be broken by this redefinition? I'm not
really trying to change the meaning of nowiki, I'm trying to set it
down in words, given that the existing definition ("stuff gets
stripped out, then replaced at various times") is not really
implementable.
Steve
On 11/20/07, Mark Clements <gmane(a)kennel17.co.uk> wrote:
> Really, this isn't really about logic in that sense, it's about changing the
> way we refer to things. Part of this is to make a logical distinction between
> the 3 different entities that have a different syntax and a different
> purpose (parser directives, built-in variables and automatic links).
Put like that, it's a compelling argument: those three terms are very
descriptive.
> links, so the term 'automatic links' might be better, and again removes a
> bit of the mystery.
Yep. It could possibly be even better, but I'm not sure how. "Implicit
link"? "Bracketless link"? Dunno. I hate them anyway. :) ("Evil
bracketless link"?...hmm)
> Yes. It occurred to me, shortly after writing that post, that they should
> not be referred to as "built-in variables", but rather as "built-in
> templates". That is the term I shall use from now on.
Some of them really behave like variables in a template, but with two
braces instead of 3. "Template" really implies a calculation of some
kind to me. But either is ok.
> > Can "built-in variables" take arguments, and if so, how are they treated?
>
> This is resolved if we refer to them as "built-in templates" instead.
So, they can take arguments, and the syntax is with a pipe like a
normal template. So what is {{DEFAULTSORT:foo}}?
> Hmmm.... well here you've found an exception (and there may be others).
> DEFAULTSORT is a parser directive, but it uses the template-style syntax.
No, it uses a colon. It just happens that Wikipedia has a template
called {{DEFAULTSORT}} which calls {{DEFAULTSORT:{{{1}}}}}. *groan*
> This is partly a symptom of the problem I am describing (lack of a formal
> definition of these items) but is also probably down to the fact that the __
> syntax doesn't support arguments as easily (though I don't see why not).
>
> We need to think about how to resolve this e.g. can we re-define
> DEFAULTSORT:
> __DEFAULTSORT|Sort key__ seems plausible.
> __DEFAULTSORT|{{PAGENAME}}__ seems a little, well, odd... but maybe that's
> just because it's new.
Well, the page name is the default sort key anyway. But, yes.
__FOO|Arg|Arg__ looks ok to me. Would <nowiki> work, if you need to
pass in a __ somewhere?
> Or do we change the syntax to remove all double-underscore directives, and
> change them all to use template syntax? In this case parser-directives
> could be distinguished by being prefixed with an underscore (e.g.
> {{_NOTOC}}). Or we could make them into built-in parser functions
> {{#NOTOC}}.
Hmm...well, we're not supposed to be changing anything at all at the
moment. Perhaps we could at least draw up a list of all the current
magic words, set out a proposed syntax change and how they would all
map, and see what it looks like. Of course we could implement the
change in the current parser, to avoid the restriction on changing
syntax and parser at the same time :)
> These last ideas are just off the top of my head mind, and need further
> consideration. Any syntax changes would of course need to retain existing
> functionality (as 'deprecated' syntax) to preserve backwards compatibility.
That's a problem. I was originally asking about "anything being a
magic word" because if anything can be, then parsing is harder.
Changing to a more uniform structure but supporting the old terms
doesn't help much. Though in practice I don't think it will prove to
be a huge problem.
Steve
---------- Forwarded message ----------
From: Tim Starling <tstarling(a)wikimedia.org>
Date: 21 Nov 2007 02:34
Subject: [Wikitech-l] New preprocessor
To: wikitech-l(a)lists.wikimedia.org
Brion said to me a couple of weeks ago "the parser is slow for large
articles, fix it". So along these lines, I have rewritten the preprocessor
phase to make it faster in PHP. I also have plans for further speed
improvement via a partial port to C.
This work was planned and started before the recent parser discussions on
wikitech-l, by Steve Bennett et al. I chose to ignore those discussions to
improve my productivity. Apologies if I'm stepping on any toes.
I'll cover the technical side of this first, and then the impact for the
user in terms of wikitext syntax change.
This text is mostly adapted from my entry in RELEASE-NOTES.
== Technical viewpoint ==
The parser pass order has changed from
* Extension tag strip and render
* HTML normalisation and security
* Template expansion
* Main section...
to
* Template and extension tag parse to intermediate representation
* Template expansion and extension rendering
* HTML normalisation and security
* Main section...
The new two-pass preprocessor can skip "dead branches" in template
expansion, such as unfollowed #if cases and unused defaults for template
arguments. This provides a significant performance improvement in
template-heavy test cases taken from Wikipedia. Parser function hooks can
participate in this performance improvement by using the new
SFH_OBJECT_ARGS flag during registration.
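To illustrate the dead-branch idea (a Python sketch of the concept
only; the real interface is the PHP SFH_OBJECT_ARGS hook, and the
frame object and names here are invented): a lazy #if only asks for
the argument it actually needs, so the untaken branch is never expanded.
--
def expand_if(frame, args):
    """Sketch of a lazy #if: only the branch that is actually taken gets
    expanded. 'args' are unexpanded preprocessor nodes; frame.expand()
    turns a node into wikitext, expanding any templates inside it."""
    condition = frame.expand(args[0]).strip()
    if condition:
        return frame.expand(args[1]) if len(args) > 1 else ""
    # When the condition is empty, the "then" branch above is never
    # expanded, so templates inside it (a dead branch) are simply skipped.
    return frame.expand(args[2]) if len(args) > 2 else ""
--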
The intermediate representation I have used is a DOM document tree, taking
advantage of PHP's standard access to libxml's efficient tree structures.
I construct the tree via an XML text stage, although it could be done
directly with DOM. My gut feeling was that the XML implementation would be
faster, but I've made the interfaces such that it could be done either
way. The XML form is not exposed.
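As a rough analogy (Python and an invented element layout, just to show
the shape of the idea; the actual node names and XML text are internal):
--
import xml.etree.ElementTree as ET

# Hypothetical serialisation of {{cite web|url=http://example.org}}:
xml_text = ("<root><template><title>cite web</title>"
            "<part><name>url</name><value>http://example.org</value></part>"
            "</template></root>")

# Build the text first, then let the library construct the tree.
tree = ET.fromstring(xml_text)
for tmpl in tree.iter("template"):
    print(tmpl.findtext("title"))                       # -> cite web
    for part in tmpl.iter("part"):
        print(part.findtext("name"), "=", part.findtext("value"))
--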
One reason for using an intermediate representation is so that the parse
results for templates can be cached. The theory is that the cached results
can then be used to efficiently expand templates with changeable
arguments, such as {{cite web}}. ( There's also an expansion cache for
templates expanded with no arguments, such as {{•}}. )
Another reason is that I couldn't see any efficient (O(N) worst-case time
order) way to implement dead branch elimination without an intermediate
representation.
The pre-expand include size limit has been removed, since there's no
efficient way to calculate such a figure, and it would now be meaningless
for performance anyway. The "preprocessor node count" takes its place,
with a generous default limit.
The context in which XML-style extension tags are called has changed, so
extensions which make use of the parser state may need compatibility
changes. Since extension tags are now rendered simultaneously with
template expansion, there is a possibility for future improvement of the
extension tag interface. For example, we could have
preprocessor-transparent tags which act like parser functions, and we
could give extension tags access to the template arguments (i.e. triple
brace expansion).
== User viewpoint ==
The main effect of this for the user is that the rules for uncovered
syntax have changed.
Uncovered main-pass syntax, such as HTML tags, is now generally valid,
whereas previously in some cases it was escaped. For example, you could
have "<ta" in one template, and "ble>" in another template, and put them
together to make a valid <table> tag. Previously the result would have
been the escaped text "&lt;table&gt;".
Uncovered preprocessor syntax is generally not recognised. For example, if
you have "{{a" in Template:A and "b}}" in Template:B, then "{{a}}{{b}}"
will be converted to a literal "{{ab}}" rather than the contents of
Template:Ab. This was the case previously in HTML output mode, and is now
uniformly the case in the other modes as well. HTML-style comments
uncovered by template expansion will not be recognised by the preprocessor
and hence will not prevent template expansion within them, but they will
be stripped by the following HTML security pass.
The rules for template expansion during message transformation were
counterintuitive, mostly accidental and buggy. There are a few small
changes in this version: for example, templates with dynamic names, as in
"{{ {{a}} }}", are fully expanded as they are in HTML mode, whereas
previously only the inner template was expanded. I'd like to make some
larger breaking changes to message transformation, after a review of
typical use cases.
The header identification routines for section edit and for numbering
section edit links have been merged. This removes a significant failure
mode and fixes a whole category of bugs (tracked by bug #4899). Wikitext
headings uncovered by template expansion or comment removal will still be
rendered into a heading tag, and will get an entry in the TOC, but will
not have a section edit link. HTML-style headings will also not have a
section edit link. Valid wikitext headings present in the template source
text will get a template section edit link. This is a major break from
previous behaviour, but I believe the effects are almost entirely beneficial.
-- Tim Starling
Hello,
David Gerard wrote:
> http://lists.wikimedia.org/mailman/listinfo/wikitext-l
>
> Wikitext-l was formed from a recent discussion on wikitech-l about the
> need to sanely reimplement the current parser, which is a Horrible
> Mess and pretty much impossible to reimplement in another language.
>
> The MediaWiki parser definition is literally "whatever the PHP parser
> does." Some of what it does is arguably very wrong, pathological,
> magical or just a Stupid Parser Trick. So the list has been formed to
> come up with a grammar that defines all the useful parts of the
> present parser, and so can be used by anyone to implement a MediaWiki
> wikitext parser. This will be useful for other software, for WYSIWYG
> editing extensions ... all manner of things.
>
> Some of what some people would think of as a "stupid parser trick" is
> in fact important - e.g. L'''uomo'' which renders as L<i>uomo</i>
> (necessary for French and Italian).
Actually, the proper French apostrophe should be ’ (Unicode U+2019,
HTML entity &rsquo;), not '.
On the French Wikisource, we systematically replace ' with ’ in all
articles and titles with bots (keeping redirects). So actually, '''
should be ’'' in proper French typography.
The issue is that ’ is not on the standard French keyboard, and it does
not exist in Latin-1 (like œ for oe). There are also problems with
broken software, like copy-and-paste in a non-Unicode-compliant editor,
etc. That's why it is so rarely used.
> - d.
Regards,
Yann
--
http://www.non-violence.org/ | Collaborative site on non-violence
http://www.forget-me.net/ | Alternatives on the Net
http://fr.wikisource.org/ | Free library
http://wikilivres.info | Free documents
Hi,
Some time ago ICANN put up test IDNs. They set up a wiki using
MediaWiki: http://idn.icann.org
Soon after that, an issue concerning HTTP URLs for RTL scripts was
raised (actually I think I raised it, to be honest):
http://idn.icann.org/Talk:IDNwiki#RTL_scripts_URL_directionality_problem_.2…
Basically, for RTL scripts, URLs are displayed with "http://" on the
left. This looks awkward for us RTL users because it breaks the
hierarchy of the URL (it's better explained in the example above).
After some investigation, I found that when you use an RTL-localized
browser, URLs in the address bar are displayed with "://http" on the
right, which, I think, is an acceptable solution.
I think the MediaWiki parser should convert RTL IDN URLs to this format.
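Something along these lines, perhaps (a Python sketch of one possible
approach; the bidi handling is very simplified and the function names
are mine, not MediaWiki's):
--
import unicodedata

def host_is_rtl(host):
    """True if the first strongly-directional character in the host
    name is right-to-left."""
    for ch in host:
        direction = unicodedata.bidirectional(ch)
        if direction in ("R", "AL"):
            return True
        if direction == "L":
            return False
    return False

def display_url(url):
    """For RTL IDN hosts, wrap the rendered link text in an RTL
    embedding so that "http://" ends up visually on the right,
    the way RTL-localized browsers show it."""
    scheme, sep, rest = url.partition("://")
    host = rest.split("/", 1)[0]
    if sep and host_is_rtl(host):
        RLE, PDF = "\u202b", "\u202c"   # RIGHT-TO-LEFT EMBEDDING / POP
        return RLE + url + PDF
    return url
--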
--
Slim Amamou
http://NoMemorySpace.wordpress.com
Here's another one, at the bottom of
http://www.mediawiki.org/wiki/User:Stevage
(note, mw_img_thumbnail means "the magic word 'img_thumbnail', however that
is defined".)
The problem I have here is the options for the image: you'd like the word
"thumbnail" to be a token, but then if you get a case like:
[[image:finger.jpg|Note the impressive thumbnails.]]
you get one token for "thumbnail" rather than "t" and "h" etc.
Solutions I can think of so far:
1) Explicitly make the match for text to be 'a'..'z' | 'A'..'Z'
| MW_img_thumbnail | ...
2) Make tokens for individual letters (Aa, Bb...) then make the parser
recognise a pattern like Tt + Hh + Uu + Mm...
3) Make a token which is
'|thumbnail', then use some trick to distinguish '|thumbnailblah' from
'|thumbnail|'.
4) Like 1), but use a localised lexer so that those words are only tokens in
this specific context.
5) Just match text, then use special markup at the parser level to look into
the text that was matched.
I've tried 1) and 2) and they both work. I'll probably try 5) next because
3) is just ugly.
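To make 5) concrete, here's roughly what I mean, as a Python sketch (I
don't have the ANTLR version yet, and the option list below is just an
example): match each pipe-separated chunk as plain text, then decide
afterwards whether the whole chunk is a recognised image option.
--
# Hypothetical set of aliases for image option magic words.
IMG_OPTIONS = {"thumbnail", "thumb", "left", "right", "center"}

def classify_image_params(params):
    """Split image parameters into recognised options and a caption.
    Only a chunk that is *entirely* a magic word counts as an option,
    so "Note the impressive thumbnails." stays ordinary caption text."""
    options, caption = [], None
    for chunk in params:
        word = chunk.strip().lower()
        if word in IMG_OPTIONS:
            options.append(word)
        else:
            caption = chunk   # the last non-option chunk wins
    return options, caption

# [[image:finger.jpg|thumbnail|Note the impressive thumbnails.]]
opts, cap = classify_image_params(
    ["thumbnail", "Note the impressive thumbnails."])
# opts == ["thumbnail"], cap == "Note the impressive thumbnails."
--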
Anyone have any comments or suggestions?
I really think writing the grammar in ANTLR is our best bet at this point.
Advantages:
1) We're talking about actual, parseable grammar in an actual syntax, rather
than the half-arsed EBNF/BNF we've done so far.
2) We can use ANTLRWorks to play with the grammar, visualise it etc.
3) One of the goals is to allow third parties to generate parsers in
a variety of languages. ANTLR already has 5 code-generation targets,
and more (perhaps including PHP) are on the way.
Downsides:
1) ANTLR can't yet generate a parser in PHP. However, there may exist
Java->PHP or C->PHP translators or something.
Steve
That was quite amusing: I read the "Welcome to your new list" message before
the wikitech-l message. Anyway, a list just for parser discussion is good.
Here's a bit of ANTLR grammar I wrote to handle basic article structure:
paragraph blocks and "special blocks", where two consecutive blocks of the
same type need an extra linefeed. Since I haven't written any Lex or Yacc
before, I'm still wrestling a bit with what are probably fairly basic
problems. In this case, I found the requirement of an extra linefeed quite
challenging to implement without ambiguity problems.
As it is, this does work, but spews out a huge number of warnings and even
an apparently non-fatal "fatal error". I presume some of these problems can
be avoided through semantic and syntactic predicates, if not backtracking
or memoization (no, that's not a typo). Any ANTLR experts here?
Steve
--
grammar paras;
// An article is an optional run of paragraphs followed by any number of
// special-block runs, each followed by either EOF or more paragraphs.
article : pseries? (sseries (EOF| pseries))*;
// Runs of the same block type need at least one extra linefeed between them.
pseries : para (N+ para)* N*;
sseries : specialblock (N+ specialblock)* N*;
specialblock
: (spaceblock|listblock)+;
spaceblock
: spaceline+;
spaceline
: SPECIALCHAR char* N;
listblock
: (listitem)+;
listitem: (bulletitem | numberitem | indentitem | defitem);
bulletitem
: BULLETCHAR (listitem | (nonlistchar char*)? N);
numberitem
: NUMBERCHAR (listitem | (nonlistchar char*)? N);
indentitem
: INDENTCHAR (listitem | (nonlistchar char*)? N);
defitem
: DEFCHAR (nonindentchar)* (definition | INDENTCHAR? N );
definition
: ':' char+ N;
BULLETCHAR: '*';
NUMBERCHAR: '#';
INDENTCHAR: ':';
DEFCHAR : ';';
para : (nonspecialchar char* N)+;
listchar: BULLETCHAR | NUMBERCHAR | INDENTCHAR | DEFCHAR;
// A leading space starts a preformatted (space-indented) line.
SPECIALCHAR
: ' ';
nonlistchar
: SPECIALCHAR | nonspecialchar;
char : nonlistchar | listchar;
nonindentchar
: nonlistchar | BULLETCHAR | NUMBERCHAR | DEFCHAR;
N : '\r'? '\n' ;
nowiki : NOWIKI;
NOWIKI : '<nowiki>'( options {greedy=false;} : . )*'</nowiki>';
nonspecialchar
: NONSPECIALCHAR | nowiki;
NONSPECIALCHAR
: ('A'..'Z'| 'a'..'z' | '0'..'9' | '\'' | '"' | '(' | ')')+;
--
PS you might notice the above grammar implements two "improvements" to the
;definition:term notation:
1. The ;definition has to be the last item in the list. Constructs like
##;## are worthless.
2. A trailing : is treated literally.
Just sent this to wikipedia-l and foundation-l - I figured they would
be good places to ask.
- d.
---------- Forwarded message ----------
From: David Gerard <dgerard(a)gmail.com>
Date: 17 Nov 2007 12:05
Subject: New parser in the works - please help
To: wikipedia-l(a)lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wikitext-l
Wikitext-l was formed from a recent discussion on wikitech-l about the
need to sanely reimplement the current parser, which is a Horrible
Mess and pretty much impossible to reimplement in another language.
The MediaWiki parser definition is literally "whatever the PHP parser
does." Some of what it does is arguably very wrong, pathological,
magical or just a Stupid Parser Trick. So the list has been formed to
come up with a grammar that defines all the useful parts of the
present parser, and so can be used by anyone to implement a MediaWiki
wikitext parser. This will be useful for other software, for WYSIWYG
editing extensions ... all manner of things.
Some of what some people would think of as a "stupid parser trick" is
in fact important - e.g. L'''uomo'' which renders as L<i>uomo</i>
(necessary for French and Italian).
So: we need to know what MediaWiki quirks are supporting important
constructs in languages other than English (which is the language the
list is in, and is the native language of most of the participants),
and particularly in non-European languages.
This list is unlikely to implement new features, e.g. (an example
brought up by GerardM) the double-apostrophe in Neapolitan. But we
really need to know about present important features that wouldn't be
obvious to an English-speaker going through the present parser code.
- d.