Wikitext-l November 2007

wikitext-l@lists.wikimedia.org

12 participants
18 discussions

From Wikitech-l (Determining the behavior of apostrophes)

by Steve Bennett

On 11/29/07, Mark Jaroski <mark(a)geekhive.net> wrote:
> I thought <code> was an extension-like preprocessor tag.

Afaik it's just a whitelisted HTML tag. It's treated that way by setupAttributeWhitelist() and removeHTMLtags() in Sanitizer.php.

Interestingly, something (Tidy?) goes out of its way to support this behaviour. This code:

---
fooblah blah
---

renders as:

----
fooblah blah
----

So whereas an unclosed ''' is closed at the end of the paragraph (by doAllQuotes()), an unclosed <code> is closed then explicitly re-opened in the next paragraph (and then closed, re-opened etc). I hadn't appreciated that difference between ''' and <code> before.

Steve

16 years, 4 months

So, a better algorithm for apostrophes?

by Steve Bennett

How about this:

Word'''word -> always apostrophe+italics
Word''''word -> always apostrophe+bold

Advantages:
* French and Italian examples work correctly all the time
* You can parse it with single-token lookahead.
* No need to count matched/mismatched bold/italics
* Broken wikitext at the end of the line does not interfere with correct wikitext at the start of the line
* Simpler to understand than the current rule.

Disadvantages:
* You lose the ability to easily apply bold mid-word.
* The ambiguity will arise more often, so more people will have to know that ''' is not always bold. Not that it's always bold at the moment, but you know...

Thoughts?

Steve
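The proposed rule can be sketched in a few lines of Python. This is only an illustration: the name classify_quote_runs is made up, and the meanings given to runs that are not in the middle of a word (two = italics, three = bold, four = apostrophe+bold, five = bold+italics) are an assumption, since the proposal only covers the mid-word case.

```python
import re

# Any run of two or more apostrophes is a candidate markup token.
_QUOTE_RUN = re.compile(r"'{2,}")

# ASSUMPTION: meanings for runs that are not in the middle of a word.
_DEFAULT_MEANING = {
    2: "italics",
    3: "bold",
    4: "apostrophe+bold",
    5: "bold+italics",
}


def classify_quote_runs(line):
    """Return (offset, run, meaning) for every apostrophe run in a line."""
    results = []
    for match in _QUOTE_RUN.finditer(line):
        run = match.group()
        before = line[match.start() - 1] if match.start() > 0 else ""
        after = line[match.end()] if match.end() < len(line) else ""
        # Only the characters next to the run are inspected; nothing
        # elsewhere on the line can change the outcome.
        mid_word = before.isalnum() and after.isalnum()
        if mid_word and len(run) == 3:
            meaning = "apostrophe+italics"
        elif mid_word and len(run) == 4:
            meaning = "apostrophe+bold"
        else:
            meaning = _DEFAULT_MEANING.get(len(run), "other")
        results.append((match.start(), run, meaning))
    return results


for entry in classify_quote_runs("L'''idée'' et L'''Etoile''"):
    print(entry)
```

Because each run is classified from its immediate neighbours alone, two L'''...'' constructs on the same line do not interfere with each other, which is the case that fails under the current rule.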

16 years, 4 months

How it could be...

by Steve Bennett

[This has nothing to do with the parser grammar I'm working on. This is fantasy...]

Quite aside from the ambiguity problems of apostrophes, the difficulty of writing curved single quotes has been mentioned. Here's how a totally different, unrelated wiki program could work:

//italics//
**bold**
' - straight apostrophe
''single quotes''
'''backwards single quotes'''

In other words, ' is rendered like the current apostrophe, and '' is a curved apostrophe that leans either way depending on what text is immediately adjacent. And for those cases where you want the quotes to lean the *other* way, ''' just does the opposite of whatever '' does.

I wonder if there'd be the same problems:

* I said: '''''twas the last thing we needed!'' (seems reasonable?)

The good news is that at least '' and ''' always render as exactly one character, and no matter what you do in one part of a paragraph, it would never affect anything else. Of course, there's no easy way to write combinations of straight and curved apostrophes, but that seems like a less likely situation than apostrophes and bold or italics.

I guess an alternative could use the "backquote":

``Single quotes''
I said: ``''twas the last thing we needed'!''

Yay, unambiguous.

Steve

16 years, 4 months

Fwd: Parser : italics and apostrophes

by David Gerard

Here's the word from a fr:wp user on apostrophes. Summary: I dunno if we *can* make it work "as expected" (more than once in a paragraph) but it appears it would be a win for French-language MediaWiki users.

- d.

---------- Forwarded message ----------
From: Rémi Kaupp <kaupp.remi(a)gmail.com>
Date: 27 Nov 2007 18:23
Subject: Parser : italics and apostrophes
To: dgerard(a)gmail.com

Hello,

I have seen your message on foundation-l. I answer privately as I have not subscribed to this list.

On fr.wikipedia, we have this use of mixing apostrophes and italics. This happens quite often, for instance when quoting book titles, ship names, etc., all of which are usually put in italics. You often find stuff such as L'''idée'', or L'''Etoile'' (a ship name), etc. This works fine unless you do it twice in a paragraph, as you can see on http://fr.wikipedia.org/wiki/Utilisateur:Korrigan/Brouillon6 .

In such cases, our "workaround" is either to leave some space between the apostrophe and the italics (this is not satisfying, but this is what beginners do by trial-and-error), or to use the curved apostrophe: ’ . Ideally, we should always use this sort of apostrophe, but, as it is not on French keyboards (not on any keyboard I know, actually), 99% of contributors use the straight apostrophe.

We place the curved apostrophe in the "special characters" box, and some people actually use it, but we prohibit it for page titles on fr.wikipedia, and leave it up to the user to choose between normal and curved apostrophes for article content. Not all wikis do this: on fr.wiktionary, where typography is more important, the straight apostrophe has been dropped in favor of the curved one. But on most other wikis, the straight apostrophe is the main one used.

I notice that the curved apostrophe is used on fr.wikipedia in two main instances:
1) when someone creates an article on a word processor and then pastes it into WP,
2) when the straight apostrophe conflicts with the italics / bold markup.

In MediaWiki messages, where I've contributed a lot for translation, I have used only curved apostrophes so that it does not conflict with PHP / wiki markup.

Ideally, I think the parser should include this use of apostrophes + italics. I have seen that you (or other guys) are working on this for a new parser... good luck with this :-)

Cheers,

Rémi Kaupp (User:Korrigan)

16 years, 4 months

Intriguing parser discovery for the day.

by Steve Bennett

Ever wanted an easy way to get literal colons in a definition? I thought you had to do this:

;Notes<nowiki>:</nowiki> :And here's the definition.

;There's an easier way''': :That's right.

Turning on bold disables the : behaviour. And you don't even have to close it. I don't really know why it works yet, but obviously it *is* turning on bold, you just can't see it because the definitions render in bold anyway (under the normal skin).

I find this odd, because bold/unbold doesn't normally behave like a block, in this sort of code:

Bold '''and [[link|now '''unbold]].

How confusing. I have to say though, I'm grateful that thus far, the only nesting language element I've discovered is [[image:]]. Fingers crossed.

Steve

16 years, 4 months

A pathological case

by Steve Bennett

I'm pleased to report that my ANTLR grammar outperforms* the current mediawiki parser on the following pathological text:

[[[[image:foo.jpg|thumb|[[[o]]][[foo||]]|[[image:bar.jpg|thumb|[[roo my doo|zoo|]]]]]]]]]

It's really amazing what you discover about Wikitext when you sit down to analyse it like this. For example, a square bracket - [ - is:
- the start of an external link, if the rest of it is present, and not in a context where external links are forbidden (notably, captions of internal links or other external links), and not inside a nowiki tag
- part of the start of an internal link, as long as the rest is present, and it couldn't be interpreted as an internal link, and in an appropriate context
- a literal otherwise - that is, in any non-linkable context, not followed by the appropriate tags to make it a link, or inside a nowiki

A pipe - | - is:
- an option separator for an image, provided that it's not within an embedded object such as internal link or another image, and provided that it's not within a nowiki
- a link caption separator, provided that it's not in nowiki tags
- any of a dozen other cases that I haven't dealt with yet, like tables, templates, parser functions, categories, ...
- literal otherwise.

It's fun! I think...

Steve

* The current parser gives up. ANTLR, after a monumental struggle involving 21 levels of method call and a bit of backtracking, parses it correctly.

16 years, 4 months

Spaces in French

by Yann Forget

Hello,

Me again. ;o)

In French there should be a non-breaking space after « and before » ; : ! ?
The current parser replaces the space with &nbsp;
Actually before ; it should be U+202F (NARROW NO-BREAK SPACE).
Also &nbsp; for spaces in "1 000 000" (equivalent to "1,000,000").
That should be taken into account.

Regards,

Yann
--
http://www.non-violence.org/ | Collaborative site on non-violence
http://www.forget-me.net/ | Alternatives on the Net
http://fr.wikisource.org/ | Free library
http://wikilivres.info | Free documents
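The spacing rules listed above can be sketched as a small Python post-processing pass. This is a rough illustration, not what the MediaWiki parser does: the function name french_spaces is made up, the output uses the literal characters U+00A0 and U+202F instead of the &nbsp; entity, and the digit-grouping pattern (a space between a digit and a group of exactly three digits) is an assumption about how "1 000 000" should be recognised.

```python
import re

NBSP = "\u00a0"   # NO-BREAK SPACE
NNBSP = "\u202f"  # NARROW NO-BREAK SPACE


def french_spaces(text):
    """Replace ordinary spaces with non-breaking ones per the rules above."""
    # After an opening guillemet.
    text = text.replace("« ", "«" + NBSP)
    # Before a semicolon, use the narrow no-break space.
    text = text.replace(" ;", NNBSP + ";")
    # Before a closing guillemet, colon, exclamation or question mark.
    text = re.sub(r" ([»:!?])", NBSP + r"\1", text)
    # ASSUMPTION: a space between a digit and a group of exactly three
    # digits is a thousands separator.
    text = re.sub(r"(?<=\d) (?=\d{3}(?!\d))", NBSP, text)
    return text


print(repr(french_spaces("« Bonjour » : 1 000 000 !")))
```

A real implementation would have to skip contexts where a space before a colon or semicolon is markup (definition lists, for example); the sketch ignores that.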

16 years, 5 months

Determining the behaviour of apostrophes

by Steve Bennett

I've written up an account of how the current parser treats apostrophes here:

http://www.mediawiki.org/wiki/Markup_spec/BNF/Inline_text#Determining_the_b…

All I've done is read the code of doAllQuotes() and translate it from a procedural style (first replace blah, then iterate through...) into a more declarative style (four apostrophes end up getting rendered as X if the following is the case...).

The most interesting case is this one:

Take ''''four''' apostrophes and then throw '''''five unclosed apostrophes at them.

Normally, four apostrophes is treated as apostrophe followed by bold. But when the parser finds unbalanced bold *and* italics on the line, it goes looking for a bold to split. The first bold, which is now preceded by an apostrophe, is seen as a good candidate because it seems to be a single letter followed by a bold (as in the l'''idee'' case). So that bold gets split *again*. Meaning that the four apostrophes end up getting rendered as two apostrophes followed by italics.

I suspect this was not planned behaviour.

Steve
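The behaviour described above can be reproduced with a short Python sketch. It is reconstructed from the description in this message and is not a port of doAllQuotes(): the function name resolve_quote_runs is made up, and the order in which candidates are preferred (a bold preceded by a single letter, then one preceded by a longer word, then one preceded by a space) is an assumption.

```python
import re


def resolve_quote_runs(line):
    """Split a line on apostrophe runs and reduce each run to 2, 3 or 5.

    Returns a list that alternates text and apostrophe runs, text first.
    """
    parts = re.split(r"(''+)", line)  # odd indices hold the runs
    num_bold = num_italics = 0
    for i in range(1, len(parts), 2):
        n = len(parts[i])
        if n == 4:
            # Four apostrophes: one literal apostrophe followed by bold.
            parts[i - 1] += "'"
            parts[i] = "'''"
        elif n > 5:
            parts[i - 1] += "'" * (n - 5)
            parts[i] = "'''''"
        n = len(parts[i])
        if n in (2, 5):
            num_italics += 1
        if n in (3, 5):
            num_bold += 1

    # Unbalanced bold *and* italics: pick one bold and split it into
    # a literal apostrophe plus italics.
    if num_bold % 2 == 1 and num_italics % 2 == 1:
        single = multi = space = None
        for i in range(1, len(parts), 2):
            if parts[i] != "'''":
                continue
            last, second_last = parts[i - 1][-1:], parts[i - 1][-2:-1]
            if last == " ":
                space = i if space is None else space
            elif second_last == " ":
                single = i if single is None else single
            else:
                multi = i if multi is None else multi
        # ASSUMPTION: preference order single letter, multi-letter, space.
        target = next((c for c in (single, multi, space) if c is not None), None)
        if target is not None:
            parts[target - 1] += "'"
            parts[target] = "''"
    return parts


line = "Take ''''four''' apostrophes and then throw '''''five unclosed apostrophes at them."
print(resolve_quote_runs(line)[:2])
```

On the example line, the text before the first run ends up as "Take" followed by two literal apostrophes and the run itself becomes italics, which matches the outcome described above.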

16 years, 5 months

Double-bracketed external links

by Steve Bennett

Here's a controversial proposal: allow external links to be wrapped in double brackets: [[http://foo.com]].

Rationale: My grammar (and the existing parser, I believe) have to go out of their way to detect these malformed external links, then explicitly treat them as a normal external link wrapped in extra brackets:

[<a href="http://foo.com">[1]</a>]

However, what the user almost certainly wanted was:

<a href="http://foo.com">[1]</a>

So this is a case of existing wikitext probably being written with a different intention to how the parser is actually rendering it. Since we have to explicitly detect this situation, wouldn't it make more sense to render it how the user wants it while we're at it?

Steve

PS This is a somewhat idle query - compared to the amount of work involved in rewriting the parser, this is a trivial change either way. I'm just curious what kinds of improvements could be made that fit within the "don't break existing wikitext"/"mimic what people are expecting" guidelines.
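One way to picture the proposal is as a rewrite step that turns a double-bracketed URL into an ordinary single-bracketed external link before normal link handling runs. The Python sketch below is hypothetical: the function name, the list of URL schemes and the decision to ignore links that carry a caption are all assumptions made for illustration.

```python
import re

# ASSUMPTION: only these schemes count as external links in this sketch.
_DOUBLE_BRACKET_URL = re.compile(r"\[\[((?:https?|ftp)://[^\s\[\]]+)\]\]")


def unwrap_double_bracket_links(wikitext):
    """Rewrite [[http://...]] as [http://...]; leave everything else alone."""
    return _DOUBLE_BRACKET_URL.sub(r"[\1]", wikitext)


print(unwrap_double_bracket_links("See [[http://foo.com]] and [[Main Page]]."))
```

Internal links such as [[Main Page]] never match because they do not start with a URL scheme, so existing wikitext that uses double brackets correctly is untouched.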

16 years, 5 months

HTML security

by Steve Bennett

MediaWiki makes a general contract that it won't allow "dangerous" HTML tags in its output. It does this by making a final parse fairly late in the process to clean HTML tag attributes, and to escape any tags it doesn't like, and unrecognised &entities;.

Question is: should the parser attempt to do this, or assume the existence of that function? For example, in this code:

<pre> preformatted text with <nasty><html><characters> and &entities; </pre>

Should it just treat the string as valid, passing it out literally (and letting the security code go to work), or should it keep parsing characters, stripping them, and attempting to reproduce all the work that is currently done?

Would the developers (or users, for that matter) be likely to trust a pure parser solution? It seems to me that it's a lot easier simply to scan the resulting output looking for bad bits, than it is to attempt to predict and block off all the possible routes to producing nasty code.

On the downside, if the HTML-stripping logic isn't present in the grammar, then it doesn't exist in any non-PHP implementations...

What do people think?

Steve
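The "scan the resulting output" option can be illustrated with a toy Python pass that escapes every tag not on a whitelist. This is a sketch of the idea only and must not be mistaken for a real sanitizer: the whitelist contents and the function name are invented, all attributes are dropped instead of being cleaned, and unrecognised entities are not handled at all.

```python
import html
import re

# ASSUMPTION: an illustrative whitelist, not MediaWiki's actual one.
ALLOWED_TAGS = {"b", "i", "code", "pre", "p"}

_TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>")


def escape_disallowed_tags(markup):
    """Keep whitelisted tags (without attributes); escape all others."""

    def replace(match):
        closing, name, _attributes = match.groups()
        if name.lower() in ALLOWED_TAGS:
            # Simplification: attributes are discarded, not validated.
            return "<" + closing + name.lower() + ">"
        return html.escape(match.group(0))

    return _TAG.sub(replace, markup)


print(escape_disallowed_tags("<pre>text with <nasty><html> and &entities; </pre>"))
```

Because the pass works on the final output, it does not matter which route produced a tag, which is the attraction of keeping this step separate from the grammar.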

16 years, 5 months
