On 11/29/07, Mark Jaroski <mark(a)geekhive.net> wrote:
> I thought code was an extension-like preprocessor tag.
Afaik it's just a whitelisted HTML tag. It's treated that way by
setupAttributeWhitelist() and removeHTMLtags() in Sanitizer.php.
Interestingly, something (Tidy?) goes out of its way to support this
behaviour. This code:
---
foo<b>blah
blah
---
renders as:
----
<p>foo<b>blah</b></p>
<p><b>blah</b></p>
----
So whereas an unclosed ''' is closed at the end of the paragraph (by
doAllQuotes()), an unclosed <b> is closed then explicitly re-opened in
the next paragraph (and then closed, re-opened etc).
I hadn't appreciated that difference between ''' and <b> before.
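The close-then-reopen behaviour can be sketched in a few lines. This is only an illustration of the observed output, not Tidy's or MediaWiki's actual code; the function name `close_and_reopen` is made up, and tags are matched by naive string counting.

```python
def close_and_reopen(paragraphs, tag="b"):
    """Wrap each paragraph in <p>, closing an unclosed tag at the end of
    the paragraph and re-opening it at the start of the next one."""
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    rendered = []
    carried = False  # True while the tag is still "open" across paragraphs
    for body in paragraphs:
        if carried:
            body = open_tag + body
        # Naive balance check: more opens than closes means still open.
        carried = body.count(open_tag) > body.count(close_tag)
        if carried:
            body += close_tag
        rendered.append(f"<p>{body}</p>")
    return rendered


for line in close_and_reopen(["foo<b>blah", "blah"]):
    print(line)
```

An unclosed ''' would instead reset at each paragraph boundary, i.e. `carried` would never be set.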
Steve
How about this:
Word'''word -> always apostrophe+italics
Word''''word -> always apostrophe+bold
Advantages:
* French and Italian examples work correctly all the time
* You can parse it with single-token lookahead.
* No need to count matched/mismatched bold/italics
* Broken wikitext at the end of the line does not interfere with
correct wikitext at the start of the line
* Simpler to understand than the current rule.
Disadvantages:
* You lose the ability to easily apply bold mid-word.
* The ambiguity will arise more often, so more people will have to
know that ''' is not always bold. Not that it's always bold at the
moment, but you know...
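A rough sketch of how such a rule could be tokenised, in Python. It assumes "mid-word" means the apostrophe run has a letter or digit immediately before and after it; that reading, the token names, and the function `tokenize_quotes` are illustrative assumptions, not part of any existing parser.

```python
import re

APOSTROPHE_RUN = re.compile(r"'{2,}")


def tokenize_quotes(text):
    """Split text into (kind, value) tokens under the proposed rule."""
    tokens = []
    pos = 0
    for match in APOSTROPHE_RUN.finditer(text):
        if match.start() > pos:
            tokens.append(("text", text[pos:match.start()]))
        run = len(match.group())
        # One character of context on each side is all that is needed.
        before = text[match.start() - 1] if match.start() > 0 else ""
        after = text[match.end()] if match.end() < len(text) else ""
        mid_word = before.isalnum() and after.isalnum()
        if mid_word and run == 3:
            tokens += [("text", "'"), ("italic", "''")]
        elif mid_word and run == 4:
            tokens += [("text", "'"), ("bold", "'''")]
        elif run == 2:
            tokens.append(("italic", "''"))
        elif run == 3:
            tokens.append(("bold", "'''"))
        elif run == 5:
            tokens.append(("bold_italic", "'''''"))
        else:
            # Anything else is passed through as literal text in this sketch.
            tokens.append(("text", match.group()))
        pos = match.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens


print(tokenize_quotes("L'''idée''"))
```

Each decision looks only at the run itself and its two neighbouring characters, so nothing later on the line can change how an earlier run is read.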
Thoughts?
Steve
[This has nothing to do with the parser grammar I'm working on. This
is fantasy...]
Quite aside from the ambiguity problems of apostrophes, the
difficulty of writing curved single quotes has been mentioned. Here's
how a totally different, unrelated wiki program could work:
//italics//
**bold**
' - straight apostrophe
''single quotes''
'''backwards single quotes'''
In other words, ' is rendered like the current apostrophe, and '' is a
curved apostrophe that leans either way depending on what text is
immediately adjacent. And for those cases where you want the quotes to
lean the *other* way, ''' just does the opposite of whatever '' does.
I wonder if there'd be the same problems:
* I said: '''''twas the last thing we needed!'' (seems reasonable?)
The good news is that at least '' and ''' always render as exactly one
character, and no matter what you do in one part of a paragraph, it
would never affect anything else.
Of course, there's no easy way to write combinations of straight and
curved apostrophes, but that seems like a less likely situation than
apostrophes and bold or italics.
I guess an alternative could use the "backquote":
``Single quotes''
I said: ``''twas the last thing we needed'!''
Yay, unambiguous.
Steve
Here's the word from a fr:wp user on apostrophes.
Summary: I dunno if we *can* make it work "as expected" (more than
once in a paragraph) but it appears it would be a win for
French-language MediaWiki users.
- d.
---------- Forwarded message ----------
From: Rémi Kaupp <kaupp.remi(a)gmail.com>
Date: 27 Nov 2007 18:23
Subject: Parser : italics and apostrophes
To: dgerard(a)gmail.com
Hello,
I have seen your message on foundation-l. I answer privately as I have
not subscribed to this list.
On fr.wikipedia, we have this use of mixing apostrophes and italics.
This happens quite often, for instance when quoting book titles, ship
names, etc. all of which are usually put in italics. You often find
stuff such as L'''idée'', or L'''Etoile'' (a ship name), etc.
This works fine unless you do it twice in a paragraph, as you can see
on http://fr.wikipedia.org/wiki/Utilisateur:Korrigan/Brouillon6 . In
such cases, our "workaround" is either to leave some space between the
apostrophe and the italics (this is not satisfactory, but it is what
beginners do by trial and error), or to use the curved apostrophe : ’
. Ideally, we should always use this sort of apostrophe, but, as it is
not on French keyboards (not on any keyboard I know, actually), 99% of
contributors use the straight apostrophe. We place the curved
apostrophe in the "special characters" box, and some people actually
use it, but we prohibit it for page titles on fr.wikipedia, and leave
it up to the user to choose between normal and curved apostrophes for
article content.
Not all wikis do this : on fr.wiktionary, where typography is more
important, the straight apostrophe has been abandoned in favor of the
curved one. But on most other wikis, the straight apostrophe is the
main one used. I notice that the curved apostrophe is used on
fr.wikipedia in two main instances : 1) when someone creates an
article on a word processor and then pastes it into WP, 2) when the
straight apostrophe conflicts with the italics / bold markup. In
MediaWiki messages, where I've contributed a lot for translation, I
have used only curved apostrophes so that it does not conflict with
PHP / wiki markup.
Ideally, I think the parser should include this use of apostrophes +
italics. I have seen that you (or other guys) are working on this for
a new parser... good luck with this :-)
Cheers,
Rémi Kaupp
(User:Korrigan)
Ever wanted an easy way to get literal colons in a definition? I
thought you had to do this:
;Notes<nowiki>:</nowiki>
:And here's the definition.
;There's an easier way''':
:That's right. Turning on bold disables the : behaviour. And you don't
even have to close it.
I don't really know why it works yet, but obviously it *is* turning on
bold, you just can't see it because the definitions render in bold
anyway (under the normal skin).
I find this odd, because bold/unbold doesn't normally behave like a
block, in this sort of code:
Bold '''and [[link|now '''unbold]].
How confusing.
I have to say though, I'm grateful that thus far, the only nesting
language element I've discovered is [[image:]]. Fingers crossed.
Steve
I'm pleased to report that my ANTLR grammar outperforms* the current
MediaWiki parser on the following pathological text:
[[[[image:foo.jpg|thumb|[[[o]]][[foo||]]|[[image:bar.jpg|thumb|[[roo
my doo|zoo|]]]]]]]]]
It's really amazing what you discover about Wikitext when you sit down
to analyse it like this. For example, a square bracket - [ - is:
- the start of an external link, if the rest of it is present, and not
in a context where external links are forbidden (notably, captions of
internal links or other external links), and not inside a nowiki tag
- part of the start of an internal link, as long as the rest is
present, and it couldn't be interpreted as an external link, and in an
appropriate context
- a literal otherwise - that is, in any non-linkable context, not
followed by the appropriate tags to make it a link, or inside a nowiki
A pipe - | - is:
- an option separator for an image, provided that it's not within an
embedded object such as internal link or another image, and provided
that it's not within a nowiki
- a link caption separator, provided that it's not in nowiki tags
- any of a dozen other cases that I haven't dealt with yet, like
tables, templates, parser functions, categories, ...
- literal otherwise.
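To make the context dependence concrete, here is a toy classifier for the pipe cases listed above. It assumes the parser keeps a stack of the constructs it is currently inside; the stack representation, the labels, and the function name `classify_pipe` are invented for illustration, and the tables/templates/parser-function cases are deliberately left out.

```python
def classify_pipe(context):
    """Classify a '|' given the enclosing constructs, outermost first,
    e.g. ["image", "link"] for a link inside an image caption."""
    if "nowiki" in context:
        return "literal"
    if not context:
        return "literal"
    innermost = context[-1]
    if innermost == "image":
        # Only the innermost construct counts: a pipe inside a link that
        # is itself inside an image belongs to the link, not the image.
        return "image option separator"
    if innermost == "link":
        return "link caption separator"
    return "literal"


print(classify_pipe(["image"]))          # image option separator
print(classify_pipe(["image", "link"]))  # link caption separator
print(classify_pipe(["link", "nowiki"])) # literal
```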
It's fun! I think...
Steve
* The current parser gives up. ANTLR, after a monumental struggle
involving 21 levels of method call and a bit of backtracking, parses
it correctly.
Hello,
Me again. ;o)
In French there should be a non-breaking space
after « and before » ; : ! ?
The current parser replaces the space with &nbsp;.
Actually before ; it should be U+202F (NARROW NO-BREAK SPACE).
Non-breaking spaces are also needed in numbers such as "1 000 000"
(equivalent to "1,000,000").
That should be taken into account.
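As a minimal sketch of these rules in Python: the function name `french_spaces` is hypothetical, the choice of U+00A0 for digit groups is an assumption, and the sketch ignores wiki syntax (for instance, it would also touch a space before ":" inside a template).

```python
import re

NBSP = "\u00a0"   # NO-BREAK SPACE
NNBSP = "\u202f"  # NARROW NO-BREAK SPACE


def french_spaces(text):
    """Replace ordinary spaces with non-breaking ones around French punctuation."""
    text = text.replace("« ", "«" + NBSP)
    text = re.sub(r" ([»:!?])", NBSP + r"\1", text)
    # Narrow no-break space before the semicolon, as suggested above.
    text = text.replace(" ;", NNBSP + ";")
    # Digit grouping such as "1 000 000" (assumption: plain U+00A0).
    text = re.sub(r"(?<=\d) (?=\d{3}\b)", NBSP, text)
    return text


print(repr(french_spaces("« Bonjour » : oui ; non ! 1 000 000")))
```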
Regards,
Yann
--
http://www.non-violence.org/ | Site collaboratif sur la non-violence
http://www.forget-me.net/ | Alternatives sur le Net
http://fr.wikisource.org/ | Bibliothèque libre
http://wikilivres.info | Documents libres
I've written up an account of how the current parser treats apostrophes here:
http://www.mediawiki.org/wiki/Markup_spec/BNF/Inline_text#Determining_the_b…
All I've done is read the code of doAllQuotes() and translate it from
a procedural style (first replace blah, then iterate through...) into
a more declarative style (four apostrophes end up getting rendered as
X if the following is the case...).
The most interesting case is this one:
Take ''''four''' apostrophes and then throw '''''five unclosed
apostrophes at them.
Normally, four apostrophes is treated as apostrophe followed by bold.
But when the parser finds unbalanced bold *and* italics on the line,
it goes looking for a bold to split. The first bold, which is now
preceded by an apostrophe, is seen as a good candidate because it
seems to be a single letter followed by a bold (as in the l'''idee''
case). So that bold gets split *again*. Meaning that the four
apostrophes end up getting rendered as two apostrophes followed by
italics.
I suspect this was not planned behaviour.
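The steps described above can be approximated in Python as follows. This is a loose re-telling of the behaviour for a single line, not a port of doAllQuotes(); the function name and the candidate ordering (single letter, then multi-letter word, then space) are stated as assumptions.

```python
import re


def split_unbalanced_bold(line):
    """Return alternating [text, run, text, run, ..., text] after
    normalising apostrophe runs and splitting one bold if needed."""
    parts = re.split(r"('{2,})", line)
    runs = range(1, len(parts), 2)
    for i in runs:
        n = len(parts[i])
        if n == 4:
            # Four apostrophes: literal apostrophe followed by bold.
            parts[i - 1] += "'"
            parts[i] = "'''"
        elif n > 5:
            parts[i - 1] += "'" * (n - 5)
            parts[i] = "'''''"
    italics = sum(1 for i in runs if len(parts[i]) in (2, 5))
    bolds = sum(1 for i in runs if len(parts[i]) in (3, 5))
    if italics % 2 == 1 and bolds % 2 == 1:
        single = multi = space = None
        for i in runs:
            if len(parts[i]) != 3:
                continue
            prev = parts[i - 1]
            last, second_last = prev[-1:], prev[-2:-1]
            if last == " ":
                space = i if space is None else space
            elif second_last == " ":
                single = i if single is None else single
            else:
                multi = i if multi is None else multi
        candidates = [c for c in (single, multi, space) if c is not None]
        if candidates:
            target = candidates[0]
            # Split the bold: one literal apostrophe plus italics.
            parts[target - 1] += "'"
            parts[target] = "''"
    return parts


parts = split_unbalanced_bold("Take ''''four''' apostrophes and then throw '''''five")
print(parts[:2])
```

On the example line, the text before the first run ends in a space plus the apostrophe that was just peeled off, so it looks like a single-letter word and that bold is the one that gets split.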
Steve
Here's a controversial proposal: allow external links to be wrapped in
double brackets: [[http://foo.com]].
Rationale: My grammar (and the existing parser, I believe) have to go
out of their way to detect these malformed external links, then
explicitly treat them as a normal external link wrapped in extra
brackets:
[<a href="http://foo.com">[1]</a>]
However, what the user almost certainly wanted was:
<a href="http://foo.com">[1]</a>
So this is a case of existing wikitext probably being written with a
different intention to how the parser is actually rendering it.
Since we have to explicitly detect this situation, wouldn't it make
more sense to render it how the user wants it while we're at it?
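For illustration only, the proposed handling could be prototyped as a pre-pass that unwraps the extra brackets before normal link processing. The regular expression, the scheme list, and the name `unwrap_double_bracketed_urls` are assumptions made for this sketch; neither the grammar nor the existing parser works this way.

```python
import re

# [[scheme://target optional caption]] with no nested brackets.
DOUBLE_BRACKETED_URL = re.compile(
    r"\[\[((?:https?|ftp)://[^\s\[\]]+(?: [^\[\]]*)?)\]\]"
)


def unwrap_double_bracketed_urls(text):
    """Rewrite [[http://...]] as [http://...], leaving internal links alone."""
    return DOUBLE_BRACKETED_URL.sub(r"[\1]", text)


print(unwrap_double_bracketed_urls("See [[http://foo.com]] and [[Main Page]]."))
```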
Steve
PS This is a somewhat idle query - compared to the amount of work
involved in rewriting the parser, this is a trivial change either way.
I'm just curious what kinds of improvements could be made that fit
within the "don't break existing wikitext"/"mimic what people are
expecting" guidelines.
MediaWiki makes a general contract that it won't allow "dangerous"
HTML tags in its output. It does this by making a final parse fairly
late in the process to clean HTML tag attributes, and to escape any
tags it doesn't like, and unrecognised &entities;.
Question is: should the parser attempt to do this, or assume the
existence of that function?
For example, in this code:
<pre>
preformatted text with <nasty><html><characters> and &entities;
</pre>
Should it just treat the string as valid, passing it out literally
(and letting the security code go to work), or should it keep parsing
characters, stripping them, and attempting to reproduce all the work
that is currently done?
Would the developers (or users, for that matter) be likely to trust a
pure parser solution? It seems to me that it's a lot easier simply to
scan the resulting output looking for bad bits, than it is to attempt
to predict and block off all the possible routes to producing nasty
code.
On the downside, if the HTML-stripping logic isn't present in the
grammar, then it doesn't exist in any non-PHP implementations...
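The "scan the output afterwards" option might look roughly like this. It is a deliberately tiny sketch, not the logic in Sanitizer.php: the whitelist is an arbitrary example, attributes are simply dropped rather than cleaned, and unrecognised &entities; are not handled at all.

```python
import html
import re

ALLOWED_TAGS = {"b", "i", "p", "pre", "code"}  # example whitelist only
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>")


def escape_disallowed_tags(markup):
    """Keep whitelisted tags (without attributes) and escape all others."""
    def replace(match):
        slash, name = match.group(1), match.group(2).lower()
        if name in ALLOWED_TAGS:
            return f"<{slash}{name}>"
        return html.escape(match.group(0))
    return TAG.sub(replace, markup)


print(escape_disallowed_tags("<pre>text with <nasty><html><characters></pre>"))
```

A pass like this is independent of how the markup was produced, which is the appeal; the cost, as noted above, is that every implementation needs its own copy.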
What do people think?
Steve