Earlier: "... Whether or not you think it's a waste of time, there's no excuse for broadcasting every parser bug you find to three mailing lists. There's no shortage of parser bugs, and no need to act surprised when you find one ... If we want to talk about the parser grammar effort, we all know which list to subscribe to ..."
Peter Blaise responds: Oh? Which one? I do not know, and you do not mention it in your post, so, help me out here, please - which list? If you're gonna type something, why not make it unambiguously accurate and complete, anyway? Otherwise, what's the point?
Additionally, I personally find cross posting very important. Of course anyone NOT interested can just scroll on or delete - there's no such thing as too much information in my book (on topic - I'm not talking about spam or off topic posts). Parser behavior = wiki tech in my book. Very often I find spirited discussions ensue because of cross-posted ideas - it tends to freshen otherwise stale meeting places.
More to the point here, wiki markup parser wise, my point is twofold:
One, I would prefer NOT to have my wiki end users get error messages when they edit. I'd prefer that any editing just go in and be saved, and later we'll deal with formatting surprises. I'm a firm believer in separating the tasks of content creation and content presentation. Someone adding content to a wiki should never be delayed by presentation formatting error messages. Let the text land however it lands, clean it up later.
Two, we tend to discover how things work in spite of erroneous, presumptive, naive instructions. I already have begun to discard the "rule" that bold happens between three apostrophes. Instead, I've discovered a hierarchy of toggles. Three apostrophes toggle bold to the other state. Two apostrophes toggle italics to the other state. The parser makes its decisions on how to interpret duplicate punctuation at the END of any code that matches its look-up-table, or at the first "word barrier" transition. Or does it? Cut and paste this into any sandbox page and explore:
'1text = apostrophe one text; no duplicate punctuation, no wiki markup.
''2text = italics two text; duplicate punctuation matched wiki markup, and the parser toggles the state of the matching function, here, italics.
'''3text = bold three text; duplicate punctuation matched wiki markup, and the parser toggles the state of the matching function, here, bold.
''''4text = bold apostrophe four text; duplicate punctuation matched wiki markup, and the parser toggles the state of the matching function, here, bold (3 apostrophes) was the superior interpretable state before the 4th apostrophe, so bold toggles (on or off), and the final apostrophe is interpreted as mere punctuation or text. Alternatively, four apostrophes could be considered as two toggles of the italics function. But, since the first word barrier occurs only after the 4th apostrophe, and there is no text between the apostrophes, that interpretation would not have any visible effect on the display. It probably makes sense to have the parser continue interpreting up to the third apostrophe before making a decision, and consider it a call for a bold-toggle, rather than consider the first two apostrophes as an italics-toggle, and then start looking for a subsequent wiki markup instruction. Otherwise, the parser would never find bold (3 apostrophes) if it always gave precedence to interpreting the first 2 apostrophes as italics. The parser seems to read left to right, and interprets according to (what we hope are) discoverable hierarchies: word transitions, or, matching duplicate punctuation to wiki markup code, whichever it finds first, such as knowing that three apostrophes toggles bold.
'''''5text = italics bold five text; which toggled first? Who knows? I presume bold toggles first, then italics toggled. Let's test:
'''bold '''''5text = italics five text (no bold); implying the five apostrophes were interpreted bold as highest in the hierarchy, so, of the five apostrophes, the first three were considered a bold-toggle, and final two were considered an italics-toggle.
''italics '''''5text = bold five text (no italics); implying bold again wins, and the subsequent italics-toggle turns off italics as expected, the pattern is, so far, predictable. But let's revisit four apostrophes:
'''bold ''''4text = bold apostrophe normal four text, this makes no sense. In the four apostrophe grouping, the first three should have toggled bold, and the final should have been interpreted as text, displaying '4text normal, but when actually displayed, the ' was bold and the 4text was normal. Huh? THERE'S THE BUG!
''italics ''''4text = normal two apostrophe four text, again, in the four apostrophe group, the first three should have been a bold toggle and the subsequent apostrophe should have been text. Apparently the parser holds the existing state of wiki markup toggle in its head and raises that in the hierarchy. Who does this programming, anyway? Let's test italics on and off first, not just on first:
''real italics'' ''''4text = normal apostrophe, bold four text. This SHOULD be the same as above, but isn't. Apparently we need to add one more item to our expected parser function hierarchy, THIS IS WHAT THE PARSER SEEMS TO ASK:
1 - is there a bold or italics toggle ON outstanding? (This surprised me, I thought toggles ON and OFF were hierarchically equivalent, but apparently a toggle ON creates a pressing need to look for a toggle OFF before interpreting anything else!)
Then:
2 - have we reached the superior matching wiki markup text? (In other words, ''' is superior to '' in the look-up-table.)
3 - have we reached a text word barrier or paragraph barrier? (Supposedly, paragraph markers reset all toggles to OFF, but apparently some wiki markup survives paragraph markers, or is it only HTML-style markup using <markup></markup>-style coding that ignores paragraph markers?)
Continuing the test:
''''''6text = italics bold apostrophe six text
'''''''7text = italics bold 2 apostrophe seven text
''''''''8text = italics bold 3 apostrophe eight text
... and so on.
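The run lengths tested above can be summarised as a small lookup. This is only a sketch of the pattern the sandbox tests suggest when a run stands at the start of otherwise unmarked text, not the actual parser code; the function name `classify_run` and the markup labels are made up for illustration.

```python
from typing import Tuple


def classify_run(n: int) -> Tuple[int, str]:
    """Return (literal apostrophes shown as text, markup toggled) for a run of n apostrophes.

    Hypothetical helper: it encodes only the observations in this post.
    """
    if n < 2:
        return n, "none"          # a lone apostrophe is just punctuation
    if n == 2:
        return 0, "italic"        # '' toggles italics
    if n == 3:
        return 0, "bold"          # ''' toggles bold
    if n == 4:
        return 1, "bold"          # one literal apostrophe plus a bold toggle
    return n - 5, "bold+italic"   # anything beyond five is shown as text


for n in range(1, 9):
    print(n, classify_run(n))
```

Whether the literal apostrophes come before or after the toggle, and what happens when a toggle is already outstanding, is exactly what the mixed tests above are probing; this lookup says nothing about that.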
Monahon, Peter B. wrote :
Earlier: "... Whether or not you think it's a waste of time, there's no excuse for broadcasting every parser bug you find to three mailing lists. There's no shortage of parser bugs, and no need to act surprised when you find one ... If we want to talk about the parser grammar effort, we all know which list to subscribe to ..."
Peter Blaise responds: Oh? Which one? I do not know, and you do not mention it in your post, so, help me out here, please - which list? If you're gonna type something, why not make it unambiguously accurate and complete, anyway? Otherwise, what's the point?
see http://lists.wikimedia.org/pipermail/wikitech-l/2007-November/035050.html
On 12/1/07, Monahon, Peter B. Peter.Monahon@uspto.gov wrote:
'''bold ''''4text = bold apostrophe normal four text, this makes no sense. In the four apostrophe grouping, the first three should have toggled bold, and the final should have been interpreted as text, displaying '4text normal, but when actually displayed, the ' was bold and the 4text was normal. Huh? THERE'S THE BUG!
4 apostrophes is always converted to 1 apostrophe then bold. (Then possibly reconverted again, which is probably a bug, but it's hard to define what a bug is in this context...)
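To make that rule concrete, here is a minimal Python sketch of a left-to-right toggle pass in which a run of four apostrophes becomes one literal apostrophe followed by a bold toggle. It is an illustration under stated assumptions, not the code in parser.php: the name `render` and the segment representation are invented here, and the later "reconversion" step mentioned above is deliberately left out, so cases that depend on it (such as ''italics ''''4text) will not match what a real wiki shows.

```python
import re
from typing import List, Tuple

Segment = Tuple[str, bool, bool]  # (text, bold, italic)


def render(line: str) -> List[Segment]:
    """Split a line into text segments tagged with the bold/italic state in force.

    Hypothetical sketch: 2 = italic toggle, 3 = bold toggle,
    4 = literal apostrophe then bold toggle, 5+ = extra literals then both toggles.
    """
    bold = italic = False
    segments: List[Segment] = []
    # The capturing group makes re.split keep the apostrophe runs in the result.
    for piece in re.split(r"('{2,})", line):
        if not piece:
            continue
        if not piece.startswith("''"):
            segments.append((piece, bold, italic))  # ordinary text
            continue
        n = len(piece)
        if n == 2:
            literal, flip_bold, flip_italic = 0, False, True
        elif n == 3:
            literal, flip_bold, flip_italic = 0, True, False
        elif n == 4:
            literal, flip_bold, flip_italic = 1, True, False
        else:
            literal, flip_bold, flip_italic = n - 5, True, True
        if literal:
            # Literal apostrophes are emitted in the state before the toggle.
            segments.append(("'" * literal, bold, italic))
        bold ^= flip_bold
        italic ^= flip_italic
    return segments


print(render("'''bold ''''4text"))
print(render("''real italics'' ''''4text"))
```

Under these assumptions the first call leaves the stray apostrophe bold and 4text normal, and the second gives a normal apostrophe followed by bold 4text, which is what the sandbox tests reported for those two lines.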
THIS IS WHAT THE PARSER SEEMS TO ASK:
I think you will find it more enlightening just to read the source code (parser.php). It's well written and nicely commented. Failing that, I think I've captured the behaviour at http://www.mediawiki.org/wiki/Markup_spec/BNF/Inline_text .
If you'd like to keep discussing this, take it to wikitext-l.
Steve
On Fri, Nov 30, 2007 at 10:52:28AM -0500, Monahon, Peter B. wrote:
Two, we tend to discover how things work in spite of erroneous, presumptive, naive instructions. I already have begun to discard the "rule" that bold happens between three apostrophes. Instead, I've discovered a hierarchy of toggles. Three apostrophes toggle bold to the other state. Two apostrophes toggle italics to the other state. The parser makes its decisions on how to interpret duplicate punctuation at the END of any code that matches its look-up-table, or at the first "word barrier" transition. Or does it? Cut and paste this into any sandbox page and explore:
And here, I believe you pin down precisely what Steve's on about (me, too): decreasing the effort necessary for users to build a mental model of how Wikitext works, which will make it easier for them to use.
That this will make it easier to parse as well, is merely a side effect.
My personal assertion, which Steve was wise enough to stay clear of for the moment, was that the number of people who *will* learn wikitext far exceeds the number who already have, and that therefore this regularization should be much steeper than might otherwise be indicated... but this thought hasn't carried the day.
Yet. :-)
Cheers, -- jra
On 12/3/07, Jay R. Ashworth jra@baylink.com wrote:
My personal assertion, which Steve was wise enough to stay clear of for the moment, was that the number of people who *will* learn wikitext far exceeds the number who already have, and that therefore this regularization should be much steeper than might otherwise be indicated... but this thought hasn't carried the day.
It comes down to this:
1. It would be good to have a well defined grammar for the wikitext recognised by the parser in use.
2. It would be good for wikitext to be a sensible grammar, easy to use.
3. It would be good for MediaWiki to use a recursive descent parser rather than the current one.
4. It would be bad if MediaWiki could not recognise all the existing wikitext on the WMF sites.
As Brion has pointed out, it would be very difficult to achieve all these aims simultaneously. It turns out that 3 is hard to achieve without 1. 3 is actually easiest to achieve with 2 and 1, but without 4. So the order of implementation has to be 1, 3, 2.
Steve
On Mon, Dec 03, 2007 at 12:16:51PM +1100, Steve Bennett wrote:
On 12/3/07, Jay R. Ashworth jra@baylink.com wrote:
My personal assertion, which Steve was wise enough to stay clear of for the moment, was that the number of people who *will* learn wikitext far exceeds the number who already have, and that therefore this regularization should be much steeper than might otherwise be indicated... but this thought hasn't carried the day.
It comes down to this:
1. It would be good to have a well defined grammar for the wikitext recognised by the parser in use.
2. It would be good for wikitext to be a sensible grammar, easy to use.
3. It would be good for MediaWiki to use a recursive descent parser rather than the current one.
4. It would be bad if MediaWiki could not recognise all the existing wikitext on the WMF sites.
As Brion has pointed out, it would be very difficult to achieve all these aims simultaneously. It turns out that 3 is hard to achieve without 1. 3 is actually easiest to achieve with 2 and 1, but without 4. So the order of implementation has to be 1, 3, 2.
I'm down with that. :-)
Cheers, -- jra
wikitech-l@lists.wikimedia.org