On 5/13/05, Timwi <timwi@gmx.net> wrote:
Andrew Rodland wrote:
I had guessed that it might find some use in fr -- It's too bad to hear that it's "widely" used. However, I should note that it's not _required_.
Of *course* it's required. By saying it isn't, you're thinking too technically. Humans aren't like that, humans just want to write their text and not ugly tags and syntax elements just for a single apostrophe.
I didn't know that it was too technical of me to think that "required" should mean "required".
<nowiki> resolves the ambiguity nicely.
Again -- "nicely" only in the technical sense, but not in the human usability sense.
It's nice in the human-usability sense that you can say exactly what you mean, instead of having to guess how it's going to be interpreted (speaking of which, can you show me any document, preferably in English, which explains this behavior?). I agree that <nowiki> is rather unwieldy, but that in itself doesn't make the existing solution a good one.
The workaround, on the other hand, does bad things to the language, and makes the implementation of a more advanced parser exceedingly difficult.
You are making two assumptions here that are both false.
Firstly, you are assuming that the language becomes more ambiguous this way. This is false, because by handling this case explicitly, I have actually made it *less* ambiguous. Previously, it was only a side-effect of the way regular expressions match text that three apostrophes were rendered as <i> followed by an apostrophe. Now I have specifically written code to define three apostrophes to mean "an apostrophe followed by open-italics, unless there is another triple-apostrophe in the line, in which case it's open-bold". No ambiguity there.
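For illustration only, here is a minimal Python sketch of the rule as it is worded above, not the actual parser code. The function name `classify_triples` and the "exactly three apostrophes" regex are assumptions made for this example. The decision for any one ''' depends on scanning the whole line for another.

```python
import re

# A run of exactly three apostrophes, not part of a longer run (assumed reading).
TRIPLE = re.compile(r"(?<!')'''(?!')")

def classify_triples(line):
    """Classify the ''' runs on one line under a literal reading of the stated rule."""
    count = len(TRIPLE.findall(line))  # needs the whole line before deciding
    if count == 0:
        return None
    # A lone ''' is an apostrophe plus open-italics; two or more are bold markers.
    return "bold" if count >= 2 else "apostrophe+italic"
```

Under these assumptions, `classify_triples("l'''homme''")` gives "apostrophe+italic", while `classify_triples("'''gras'''")` gives "bold".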
How does "a side-effect of the way regular expressions match text" turn the markup for bold into an apostrophe and the markup for italic?
The second assumption you are making (explicitly, even) is that it is more difficult to implement, when in fact you really just mean that you found it harder because it is not the way regular expressions normally work (and because you find the behaviour confusing because you don't normally think of French). I didn't find this particularly difficult to do -- neither in the current parser, nor in flexbisonparse.
If you had read my messages, you might have noticed that my reasoning was based neither on anything to do with regexes at all, nor on linguistic prejudice, but on a simple consideration. It is impossible, at the time that the parser sees a ''', to resolve what type of token it is, without looking ahead to the end of the line (an unbounded and unknown distance away). _That's_ what I called ambiguity.
The alternative is that '' means '', and ''' means '''. My current feeling is that the "cleanest" solution to the problem would be to introduce a separator which produces no output, but breaks up tokens; then you could write (with ∙ as sequence operator) ' ∙ '', '' ∙ ', ' ∙ ''', ''' ∙ ', and even '' ∙ '' all you want, with no ambiguity to the parser and no considerable hassle to the user. "Otherwise how is the computer supposed to know what you mean?" is an argument anyone can understand. The existing code in doQuotes() simply operates by logically _separating_ the consecutive quotes, so automatic conversion wouldn't be overly taxing, nor time-critical.
I haven't seen flexbisonparse, but the reason it's "easy" in the current parser is, as I'm sure you know, that it makes N passes over the entire string, with the benefit of unlimited lookahead.
You're right that it _can_ be done -- I think I've got it down. But it's still not pretty. And it's still, I think, a violation of expectations. Nonetheless, I'll shut up about it.
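To make the separator idea concrete, here is a rough Python sketch, not code from the current parser or from flexbisonparse. The name `tokenize`, the token labels, and the use of "∙" as the actual separator character are all assumptions for the example. Each quote token is decided by looking at no more than three characters, so nothing depends on the rest of the line.

```python
# Longest marker first, so ''' always means ''' and '' always means ''.
MARKS = (("'''", "BOLD"), ("''", "ITALIC"), ("'", "APOS"))

def tokenize(line, sep="\u2219"):
    """Split a line into tokens; `sep` (hypothetical) emits nothing but breaks quote runs."""
    tokens, text, i = [], [], 0

    def flush_text():
        # Emit any pending plain text as a single TEXT token.
        if text:
            tokens.append(("TEXT", "".join(text)))
            text.clear()

    while i < len(line):
        if line[i] == sep:
            i += 1  # the separator produces no token of its own
            continue
        for mark, name in MARKS:
            if line.startswith(mark, i):  # bounded lookahead: at most 3 characters
                flush_text()
                tokens.append((name, mark))
                i += len(mark)
                break
        else:
            text.append(line[i])
            i += 1
    flush_text()
    return tokens
```

With this sketch, `tokenize("l'∙''homme''")` yields TEXT, APOS, ITALIC, TEXT, ITALIC, whereas `tokenize("l'''homme''")` yields TEXT, BOLD, TEXT, ITALIC. The writer says which one is meant, and the parser never has to guess.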
Andrew