Steve Bennett has been writing a parser grammar, and investigating how the present parser *actually* works.
Turns out the apostrophe-italic combination only works once a para. Is this expected?
- d.
---------- Forwarded message ----------
From: Steve Bennett stevagewp@gmail.com
Date: 27 Nov 2007 15:05
Subject: Re: [Wikitext-l] Determining the behaviour of apostrophes
To: Wikitext-l wikitext-l@lists.wikimedia.org
On 11/28/07, Jared Williams jared.williams1@ntlworld.com wrote:
The code is still missing the search for a single-letter word preceding a bold to split at. It seems none of the tests exercise that particular bit of code.
That's a relief. Now that I understand this rule, I think it's a complete load of bollocks, and should be removed from any notion of "correct" treatment of wikitext. Mismatched apostrophe groupings should be considered erroneous input whose rendering is undefined.
Why?
For starters, as discussed, the French wikipedia doesn't even use this construct. Worse, it only works *once* per paragraph. Look at how this renders:
* L'''amour'' is great the first time. But l'''amour'' fails the second time.
You guessed it, bold from the first ''' to the second ''', and italics from the first '' to the second ''. And why would it be any different?
The treatment of 4 apostrophes is much less offensive. This renders correctly:
* L''''amour''' is bold the first time. And l''''amour''' is still bold the second time.
The 4 apostrophes -> apostrophe, bold rule is at least consistent, though it's still not intuitive that this: ''''blah'''' puts the first apostrophe in normal text, while the second one is bold. Hard to believe the user really wants that...
Of course, the only time 4 apostrophes ever renders as anything *other* than apostrophe followed by bold is when the crazy rule above is invoked, turning it into two apostrophes followed by italics.
Steve (rambly late at night)
_______________________________________________ Wikitext-l mailing list Wikitext-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitext-l
On 11/28/07, David Gerard dgerard@gmail.com wrote:
Steve Bennett has been writing a parser grammar, and investigating how the present parser *actually* works.
Turns out the apostrophe-italic combination only works once a para. Is this expected?
To clarify, this behaviour (converting exactly one occurrence of three apostrophes to apostrophe+italics if the paragraph as a whole has mismatched italics/bold) is pretty evident from looking at the code:
    # If there is a single-letter word, use it!
    if ($firstsingleletterword > -1) {
        $arr [ $firstsingleletterword ] = "''";
        $arr [ $firstsingleletterword-1 ] .= "'";
    }
So, the writer of this code (Magnus?) definitely knows about this limitation. The question is really:
1) Does anyone really use this construct? We've heard that the French use a curved apostrophe instead of the straight one in this situation. It's hard to believe anyone relies on it as it's so flaky: once per paragraph only? Eep.
2) Can it either be removed from the current parser or not implemented in the spec/future parser?
It's particularly noxious as there is no way to parse it in any reasonable fashion. Four apostrophes is always apostrophe+bold (parseable), except that this rule means that if at the end of the paragraph you encounter other unclosed italics and bold, you have to go back to the start and convert one of these new "apostrophe+bold" sequences into "apostrophe+apostrophe+italics" (nightmare).
I should also point out that whenever this situation (bold and italics both unbalanced) arises, the parser always attempts to recover by converting a bold into an italics, not just if there is a single letter word - that's just the one it splits first.
Steve (not subscribed to foundation-l)
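
The rebalancing behaviour described in this message can be sketched roughly as follows. This is a Python approximation written from the description in the thread, not the parser's actual PHP; the function name `rebalance_quotes` and the returned list layout are assumptions made for illustration, and runs longer than five apostrophes are simply left uncounted here.

```python
import re


def rebalance_quotes(line):
    """Split a line into text and apostrophe runs, then apply the
    rebalancing heuristic described in the thread.

    Returns a list in which even indexes hold text and odd indexes hold
    markup runs ('' = italic, ''' = bold, ''''' = both).
    """
    parts = re.split(r"(''+)", line)
    num_bold = num_italic = 0
    for i in range(1, len(parts), 2):
        # Four apostrophes: one literal apostrophe followed by bold.
        if len(parts[i]) == 4:
            parts[i - 1] += "'"
            parts[i] = "'''"
        run = len(parts[i])
        if run == 2:
            num_italic += 1
        elif run == 3:
            num_bold += 1
        elif run == 5:
            num_bold += 1
            num_italic += 1

    # Only when BOTH counts are odd is one bold reinterpreted.
    if num_bold % 2 == 1 and num_italic % 2 == 1:
        first_single = first_multi = first_space = -1
        for i in range(1, len(parts), 2):
            if len(parts[i]) != 3:
                continue
            before = parts[i - 1]
            x1 = before[-1:]    # character just before the run
            x2 = before[-2:-1]  # character before that
            if x1 == " ":
                if first_space == -1:
                    first_space = i
            elif x2 == " ":
                if first_single == -1:
                    first_single = i
            else:
                if first_multi == -1:
                    first_multi = i
        # Preference order: single-letter word, multi-letter word, space.
        for target in (first_single, first_multi, first_space):
            if target > -1:
                # Turn exactly ONE bold into apostrophe + italic.
                parts[target] = "''"
                parts[target - 1] += "'"
                break
    return parts
```

Because the conversion only fires when both counts are odd, and then only on one run, a paragraph containing l'''amour'' twice has balanced counts and is left as bold plus italic, which is the "once per paragraph" effect.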
It's particularly noxious as there is no way to parse it in any reasonable fashion. Four apostrophes is always apostrophe+bold (parseable), except that this rule means that if at the end of the paragraph you encounter other unclosed italics and bold, you have to go back to the start and convert one of these new "apostrophe+bold" sequences into "apostrophe+apostrophe+italics" (nightmare).
Would it help to change the definition of "single letter word" to not include a single apostrophe? It would just be a small change to one line of code and can be made straight away (rather than waiting for the new parser), as far as I can tell, and would fix at least one of the problems. Something still needs to be done about the feature in the long term, but in the short term, I see no reason for an apostrophe counting as a word.
On 11/28/07, Thomas Dalton thomas.dalton@gmail.com wrote:
Would it help to change the definition of "single letter word" to not include a single apostrophe? It would just be a small change to one line of code and can be made straight away (rather than waiting for the new parser), as far as I can tell, and would fix at least one of the problems. Something still needs to be done about the feature in the long term, but in the short term, I see no reason for an apostrophe counting as a word.
It needs to go slightly further than that. In the following sentence:
Blah four'''' then three''' then '''''five
The four will still be split, even though it's not next to a single letter word. So the change should be from:
- split the first bold after a single letter word, else after a multi-letter word, else anywhere
to:
- split the first bold after a single letter word that is not an apostrophe, else after a multi-letter word that does not end in an apostrophe, else anywhere that is not immediately after an apostrophe.
Basically, any bold that is immediately after an apostrophe could only arise from it already being split once, and splitting it *twice* is definitely not helping anyone.
Hmm, actually I tried this, and it's not really a great improvement. The line:
* Take ''''four''' apostrophes and then throw '''''five unclosed apostrophes at them.
Previously rendered as something like: Take ''<i>four<b>...
Now you get: Take '<b>four' <i>apostrophes...
But at least '''' never renders as two apostrophes, I guess. Here's the code change:
    if ($x1 == ' ') {
        if ($firstspace == -1) $firstspace = $i;
    } else if ($x2 == ' ' && $x1 != "'") {
        if ($firstsingleletterword == -1) $firstsingleletterword = $i;
    } else if ($x1 != "'") {
        if ($firstmultiletterword == -1) $firstmultiletterword = $i;
    } else {
        /* Bold already split from '''', don't split again. */
    }

Steve
On Wed, Nov 28, 2007 at 11:47:56AM +1100, Steve Bennett wrote:
So, the writer of this code (Magnus?) definitely knows about this limitation. The question is really:
- Does anyone really use this construct? We've heard that the French use a curved apostrophe instead of the straight one in this situation.
Yup; that's what I keep saying. You're making a list of the "does anyone actually use this"'s, right?
Cheers, -- jra
On Nov 28, 2007 3:47 AM, Jay R. Ashworth jra@baylink.com wrote:
On Wed, Nov 28, 2007 at 11:47:56AM +1100, Steve Bennett wrote:
So, the writer of this code (Magnus?) definitely knows about this limitation. The question is really:
- Does anyone really use this construct? We've heard that the French use a curved apostrophe instead of the straight one in this situation.
Yup; that's what I keep saying. You're making a list of the "does anyone actually use this"'s, right?
It's a very common construct in Italian. See for example the first sentence of this article:
http://it.wikipedia.org/wiki/Amore
Apostrophes can be used to truncate articles and some other compound words:
un' l' dell' dall'
...etc. Any of these can be combined with italics or bold:
dell'''amore'' (apostrophe + italics)
dell''''amore''' (apostrophe + bold)
Italian rules actually prescribe the curved apostrophe, but it's rarely used because it's not found on normal Italian keyboards. The Italian version of MS Word automagically transforms the regular apostrophe into the curved one in the right places, and sometimes you'll find articles in the Italian Wikipedia which were copied and pasted from a Word file and contain a curved apostrophe. But most of the time, the regular one is used in "online" text.
Alfio
On 11/28/07, Alfio Puglisi alfio.puglisi@gmail.com wrote:
It's a very common construct in Italian. See for example the first sentence of this article:
http://it.wikipedia.org/wiki/Amore
Apostrophes can be used to truncate articles and some other compound words:
On the Wikitext list I proposed a simple change which would make parsing easier, and make this construct always work, irrespective of how many times it's used per paragraph:
Word'''word -> always Word'<i>word (or </i>...)
Word''''word -> always Word'<b>word (or </b>...)
Would this not be better than the current rule, which only allows that construct if the total number of bolds and italics is otherwise unbalanced? That is, at present, it doesn't work in these cases:
L''''amore''' e blah blah l''''informazione'''...
L''''amore''' e .... blah '' blah...
I'm not immediately sure how one would represent word<b>word</b> with this rule. Perhaps using actual <b> tags, I guess. Or perhaps this would work: word<nowiki></nowiki>'''word'''. Hmm. Either way, it must be less common than the Italian/French apostrophe situation?
Steve
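
One way to picture the proposed "always" rule is as a preprocessing pass that runs before any bold/italic counting. The sketch below is only an illustration in Python: the name `protect_intraword_apostrophes`, the use of `\w` to decide what counts as a word character, and the choice of `&#39;` as the stand-in for the literal apostrophe are all assumptions, not anything specified in the thread.

```python
import re

# A run of exactly three or four apostrophes with a word character on
# both sides. The lookbehind and lookahead exclude longer runs, since
# neighbouring apostrophes are not word characters.
INTRAWORD_RUN = re.compile(r"(?<=\w)'{3,4}(?=\w)")


def protect_intraword_apostrophes(line):
    """Replace the first apostrophe of each intra-word run with &#39;,
    leaving '' (italic) or ''' (bold) behind as ordinary markup."""
    return INTRAWORD_RUN.sub(lambda m: "&#39;" + m.group(0)[1:], line)
```

Since every intra-word run is rewritten independently, the result no longer depends on how many times the construct appears in a paragraph. The cost is the case raised in this message: a bold that opens or closes directly between two word characters would also be rewritten, so it would need explicit tags or a nowiki separator.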
On Nov 28, 2007 12:30 PM, Steve Bennett stevagewp@gmail.com wrote:
On 11/28/07, Alfio Puglisi alfio.puglisi@gmail.com wrote:
It's a very common construct in Italian. See for example the first sentence of this article:
http://it.wikipedia.org/wiki/Amore
Apostrophes can be used to truncate articles and some other compound words:
On the Wikitext list I proposed a simple change which would make parsing easier, and make this construct always work, irrespective of how many times it's used per paragraph:
Word'''word -> always Word'<i>word (or </i>...)
Word''''word -> always Word'<b>word (or </b>...)
Would this not be better than the current rule, which only allows that construct if the total number of bolds and italics is otherwise unbalanced? That is, at present, it doesn't work in these cases:
L''''amore''' e blah blah l''''informazione'''...
L''''amore''' e .... blah '' blah...
Actually, those examples work correctly :-) The three words are in bold, the right apostrophes appear and the last "blah" is italics.
Alfio
L''''amore''' e blah blah l''''informazione'''...
L''''amore''' e .... blah '' blah...
Actually, those examples work correctly :-) The three words are in bold, the right apostrophes appear and the last "blah" is italics.
I think bold normally works fine, it's italics that are a problem because "apostrophe-italic" looks just like "bold". "apostrophe-bold" doesn't look like anything else.
On 11/29/07, Thomas Dalton thomas.dalton@gmail.com wrote:
L''''amore''' e blah blah l''''informazione'''...
L''''amore''' e .... blah '' blah...
Actually, those examples work correctly :-) The three words are in bold, the right apostrophes appear and the last "blah" is italics.
I think bold normally works fine, it's italics that are a problem because "apostrophe-italic" looks just like "bold". "apostrophe-bold" doesn't look like anything else.
Err, yeah, that's probably what I meant. Replace my 4-apostrophes with 3-apostrophes above, and the 3s with 2s.
Here are two sentences to demonstrate the fragility of the current system:
L''''amore''' e blah blah l'''informazione'' oeui ...
(works)
L''''amore''' e blah blah del'''informazione'' oeui ...
(explodes, turning the very first set of 4 apostrophes into two apostrophes and italics...)
Steve (with apologies for my total lack of understanding of Italian grammar)
David Gerard wrote:
Steve Bennett has been writing a parser grammar, and investigating how the present parser *actually* works.
Turns out the apostrophe-italic combination only works once a para. Is this expected?
Usually bugs are reported at bugzilla.wikimedia.org, not cross-posted to two mailing lists, one unrelated and the other so tired of reading Steve Bennett's posts that we gave him his own list.
-- Tim Starling
On 28/11/2007, Tim Starling tstarling@wikimedia.org wrote:
two mailing lists, one unrelated and the other so tired of reading Steve Bennett's posts that we gave him his own list.
Um, hooray. So is this a declaration that the parser grammar effort is officially a waste of time?
- d.
David Gerard wrote:
On 28/11/2007, Tim Starling tstarling@wikimedia.org wrote:
two mailing lists, one unrelated and the other so tired of reading Steve Bennett's posts that we gave him his own list.
Um, hooray. So is this a declaration that the parser grammar effort is officially a waste of time?
Whether or not you think it's a waste of time, there's no excuse for broadcasting every parser bug you find to three mailing lists. There's no shortage of parser bugs, and no need to act surprised when you find one.
If we want to talk about the parser grammar effort, we all know which list to subscribe to.
-- Tim Starling
On 29/11/2007, Tim Starling tstarling@wikimedia.org wrote:
Whether or not you think it's a waste of time,
No, I was asking if you were declaring it was one.
there's no excuse for broadcasting every parser bug you find to three mailing lists. There's no shortage of parser bugs, and no need to act surprised when you find one.
It's hardly every bug, and in this case it was something which was widely touted as an important behaviour which turned out not to be as advertised; there's then a question as to whether or not it is in fact a bug. Since any replacement parser would have to implement useful quirks of the present parser, bug-for-bug compatibility is actually important. What course of action would you suggest for such cases? (Speaking as one of those moving the parser behaviour goalposts.)
- d.
David Gerard wrote:
On 29/11/2007, Tim Starling tstarling@wikimedia.org wrote:
Whether or not you think it's a waste of time,
No, I was asking if you were declaring it was one.
there's no excuse for broadcasting every parser bug you find to three mailing lists. There's no shortage of parser bugs, and no need to act surprised when you find one.
It's hardly every bug, and in this case it was something which was widely touted as an important behaviour which turned out not to be as advertised; there's then a question as to whether or not it is in fact a bug. Since any replacement parser would have to implement useful quirks of the present parser, bug-for-bug compatibility is actually important. What course of action would you suggest for such cases? (Speaking as one of those moving the parser behaviour goalposts.)
I'll take this offlist. It might seem hypocritical to have a public flame war about excessive posting.
-- Tim Starling
wikitech-l@lists.wikimedia.org