On 5/13/05, Timwi <timwi(a)gmx.net> wrote:
Andrew Rodland wrote:
I had guessed that it might find some use in fr -- It's too bad to
hear that it's "widely" used. However, I should note that it's not
_required_.
Of *course* it's required. By saying it isn't, you're thinking too
technically. Humans aren't like that, humans just want to write their
text and not ugly tags and syntax elements just for a single apostrophe.
I didn't know that it was too technical of me to think that "required"
should mean "required".
<nowiki> resolves the ambiguity nicely.
Again -- "nicely" only in the technical sense, but not in the human
usability sense.
It's nice in the human-usability sense that you can say exactly what
you mean, instead of having to guess how it's going to be interpreted
(speaking of which, can you show me any document, preferably in
English, which explains this behavior?). I agree that <nowiki> is
rather unwieldy, but that in itself doesn't make the existing solution
a good one.
The workaround, on the other hand, does bad things to the language, and
makes the implementation of a more advanced parser exceedingly difficult.
You are making two assumptions here that are both false.
Firstly, you are assuming that the language becomes more ambiguous this
way. This is false, because by handling this case explicitly, I have
actually made it *less* ambiguous. Previously, it was only a side-effect
of the way regular expressions match text that three apostrophes were
rendered as <i> followed by an apostrophe. Now I have specifically
written code to define three apostrophes to mean "an apostrophe followed
by open-italics, unless there is another triple-apostrophe in the line,
in which case it's open-bold". No ambiguity there.
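
For illustration only, the rule as stated in the paragraph above can be
sketched in Python. This is a minimal sketch and not the MediaWiki code:
the function name classify_triple_quotes and the labels "bold" and
"apostrophe+italic" are hypothetical, and the rule is simplified to runs
of exactly three apostrophes. It shows that no run can be labelled until
the whole line has been scanned.

```python
import re


def classify_triple_quotes(line: str) -> list[tuple[int, str]]:
    """Label every run of exactly three apostrophes in a single line.

    Simplified reading of the rule described above (an assumption, not
    the real parser): a lone ''' is an apostrophe followed by italics;
    if the line holds another ''', each one is a bold marker.
    """
    # Match ''' only when it is not part of a longer run of apostrophes.
    starts = [m.start() for m in re.finditer(r"(?<!')'''(?!')", line)]
    # The label depends on how many runs the *whole line* contains.
    kind = "bold" if len(starts) >= 2 else "apostrophe+italic"
    return [(pos, kind) for pos in starts]
```

With these assumptions, classify_triple_quotes("l'''homme''") labels the
single run as "apostrophe+italic", while both runs in "'''bold''' text"
are labelled "bold".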
How does "a side-effect of the way regular expressions match text"
turn the markup for bold into an apostrophe and the markup for italic?
The second assumption you are making (explicitly, even) is that it is
more difficult to implement, when in fact you really just mean that you
found it harder because it is not the way regular expressions normally
work (and because you find the behaviour confusing because you don't
normally think of French). I didn't find this particularly difficult to
do -- neither in the current parser, nor in flexbisonparse.
If you had read my messages, you might have noticed that my reasoning
was based neither on anything to do with regexes at all, nor on
linguistic prejudice, but on a simple consideration. It is impossible,
at the time that the parser sees a ''', to resolve what type of token
it is, without looking ahead to the end of the line (an unbounded and
unknown distance away). _That's_ what I called ambiguity. The
alternative is that '' means '', and ''' means '''. My current feeling
is that the "cleanest" solution to the problem would be to introduce a
separator which produces no output, but breaks up tokens; then you
could write (with ∙ as sequence operator) ' ∙ '', '' ∙ ', ' ∙ ''',
''' ∙ ', and even '' ∙ '' all you want, with no ambiguity to the parser
and no considerable hassle to the user. "Otherwise how is the computer
supposed to know what you mean?" is an argument anyone can understand.
The existing code in doQuotes() simply operates by logically
_separating_ the consecutive quotes, so automatic conversion wouldn't
be overly taxing, nor time-critical. I haven't seen flexbisonparse,
but the reason it's "easy" in the current parser is, as I'm sure you
know, that it makes N passes over the entire string, with the benefit
of unlimited lookahead. You're right that it _can_ be done -- I think
I've got it down. But it's still not pretty. And it's still, I think,
a violation of expectations. Nonetheless, I'll shut up about it.
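
The separator idea described above can be sketched as a small tokenizer.
This is a hedged illustration, not a proposal for the actual parser: the
token names (BOLD, ITALIC, APOS, TEXT), the function name tokenize, and
the use of U+2219 as the separator are all assumptions made for the
example. The point it shows is that each token is decided by looking at
most two characters ahead, never to the end of the line.

```python
SEPARATOR = "\u2219"  # hypothetical separator character; produces no output


def tokenize(text: str) -> list[str]:
    """Split text into tokens with a fixed meaning for each quote run.

    Assumed rules: ''' is always BOLD, '' is always ITALIC, a single '
    is a literal apostrophe (APOS), and SEPARATOR only breaks up tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        if text.startswith("'''", i):
            tokens.append("BOLD")
            i += 3
        elif text.startswith("''", i):
            tokens.append("ITALIC")
            i += 2
        elif text[i] == "'":
            tokens.append("APOS")
            i += 1
        elif text[i] == SEPARATOR:
            i += 1  # separator: emit nothing, just end the current token
        else:
            # Collect literal text up to the next quote or separator.
            j = i
            while j < len(text) and text[j] not in ("'", SEPARATOR):
                j += 1
            tokens.append("TEXT:" + text[i:j])
            i = j
    return tokens
```

Under these assumptions, "l'" + SEPARATOR + "''homme''" tokenizes to
TEXT:l, APOS, ITALIC, TEXT:homme, ITALIC, whereas the same input without
the separator starts with a BOLD token after the "l".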
Andrew