Parser practicum

Re: [Wikitech-l] [MediaWiki-CVS]...

Jay R. Ashworth

11 Nov 2007

3:55 p.m.

It has been proposed, informally, that wikitext be modified to prefer, and then eventually require, new markers for bold and italic text inline.

Two suggested, and very similar, approaches are currently on the table:

//italics//, **bold**, //**bold** italics//

and

/italics/, *bold*, /*bold* italics/

Could you each please post your personal favorite hobby-horse counter case which you feel would make parsing these constructs difficult so we can all pick it apart?

I'll start:

Everyone says that *bold* (i.e., using the single character versions in general) would conflict with the use of asterisks for list marking. Seeing how big a problem this would actually be entails finding out how many bold markings occur at the beginning of hard paragraphs, since list items *must* be at the beginning of a hard paragraph, and then determining how hard it would be to distinguish them.

I see three cases:

*List item

Easy: only one asterisk, beginning of graf. Obviously list item.

*Bold sentence.*

Also easy: the asterisk at the beginning of the graf is matched by one that's just before white space. This one's probably the hardest, though; you have to look ahead a fair piece to find the matching bold-off to be sure.

*list item with a *bold* word

Similarly easy; the bold word tags are matched. This one would be harder if list items were regularly very long; in my experience, they're not.

No, four:

The only thing that makes this difficult, as far as I can see, is if you want to permit turning off bold mid-word, like this:

But can you really call it *truth*iness?

I know we probably permit that now, but it does deprive us of the "bold-off is an asterisk followed by a \W token" rule that makes other things easy.

So again: is "turning bold and italics off between two alphanumeric characters" a thing which actually *happens*, much?
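A minimal Python sketch of the line-initial cases above, assuming the "bold-off is an asterisk followed by a \W token" rule. The names BOLD_ON, BOLD_OFF and classify_leading_asterisk are made up for illustration and do not come from any existing parser.

```python
import re

# Assumed rules, taken from the message above rather than from any real parser:
# bold-on is '*' at a word start, bold-off is '*' followed by \W or end of line.
BOLD_ON = re.compile(r"(?<!\w)\*(?=\w)")
BOLD_OFF = re.compile(r"\*(?=\W|$)")

def classify_leading_asterisk(line: str) -> str:
    """Guess whether a paragraph-initial '*' starts a list item or bold text."""
    if not line.startswith("*"):
        return "plain"
    body = line[1:]
    opens = len(BOLD_ON.findall(body))
    closes = len(BOLD_OFF.findall(body))
    # An unmatched bold-off in the body must pair with the leading asterisk.
    return "bold" if closes > opens else "list"

for sample in ("*List item", "*Bold sentence.*", "*list item with a *bold* word"):
    print(sample, "->", classify_leading_asterisk(sample))
```

Under these assumptions the three samples come out as list, bold and list. The mid-word case is exactly where the rule stops helping: neither asterisk in *truth*iness is followed by a non-word character, so BOLD_OFF never fires.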

Cheers, -- jra

-- Jay R. Ashworth Baylink jra@baylink.com Designer The Things I Think RFC 2100 Ashworth & Associates http://baylink.pitas.com '87 e24 St Petersburg FL USA http://photo.imageinc.us +1 727 647 1274

Thomas Dalton

11 Nov

4:08 p.m.

...

So again: is "turning bold and italics off between two alphanumeric characters" a thing which actually *happens*, much?

Two alphanumeric characters, probably not very often in English. Two non-whitespace characters, more likely - for example, having the first part of a hyphenated word in bold.

More importantly - what about languages without whitespace between words?

Steve Bennett

11:59 p.m.

On 11/12/07, Jay R. Ashworth jra@baylink.com wrote:

...

Could you each please post your personal favorite hobby-horse counter case which you feel would make parsing these constructs difficult so we can all pick it apart?

IMHO it's utter madness to replace one set of ambiguous, difficult-to-parse formatting markers with a different set of ambiguous, difficult-to-parse formatting markers.

"Jet* is an Australian airline. Jet* was founded in 2000 as a spin-off of Qantas."

"The biggest problems are bold and/or italics, and parsing and/or regular expressions".

"files are most often written to the /etc or /bin directories"

"The sounds /z/ and /s/ are distinct phonemes in English, but allophones in Spanish."

"Quasi-emoticons like *smiles* and *hugs* are used to..."
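To make the ambiguity concrete, here is a small Python sketch of a deliberately naive converter for the single-character markers. The regexes and the <b>/<i> output are assumptions for illustration only; nobody in the thread proposed this exact implementation.

```python
import re

def naive_inline(text: str) -> str:
    """Treat every /.../ pair as italics and every *...* pair as bold."""
    # Italics first, so the slashes in the generated closing tags are never rescanned.
    text = re.sub(r"/(.+?)/", r"<i>\1</i>", text)
    return re.sub(r"\*(.+?)\*", r"<b>\1</b>", text)

samples = [
    "Jet* is an Australian airline. Jet* was founded in 2000.",
    "bold and/or italics, and parsing and/or regular expressions",
    "written to the /etc or /bin directories",
]
for sample in samples:
    print(naive_inline(sample))
```

Each sample gets mangled: the two trade-name asterisks pair up across a sentence boundary, and the slashes in "and/or" and in the directory names are taken as italics delimiters.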

I think even if you can make rules that can tell them apart, you're adding unnecessary complexity both to the parser and to the user. Compare:

1) All text between ** and ** is shown in bold.

2) All text between * and * is shown in bold. Except if the first * is at the start of the line, in which case it's a list. Or if the * is in the middle of a word, in which case it's shown literally. Of course, if you actually do want bold in the middle of a word, do X...

Actually there's a flaw here: ** at the start of a line is going to be ambiguous as well. Bugger.

Strangely enough, with my 2 line hack to parser.php, the current text renders exactly correctly:

**Melbourne** is a great city.

**This is a list.

...

So again: is "turning bold and italics off between two alphanumeric characters" a thing which actually *happens*, much?

Dunno. My parents' company used to be spelt with the first part of the word in bold and the second part in italics, no space in between. There are bound to be a few techie companies spelt like that.

Steve

Steve Bennett

12 Nov

12:24 a.m.

On 11/12/07, Steve Bennett stevagewp@gmail.com wrote:

...

Strangely enough, with my 2 line hack to parser.php, the current text renders exactly correctly:

**Melbourne** is a great city.

**This is a list.

But it totally mangles:

** **Melbourne** is a great city.

Oh well.

Steve

Jay R. Ashworth

4:40 a.m.

On Mon, Nov 12, 2007 at 10:59:49AM +1100, Steve Bennett wrote:

...

On 11/12/07, Jay R. Ashworth jra@baylink.com wrote:

...
Could you each please post your personal favorite hobby-horse counter case which you feel would make parsing these constructs difficult so we can all pick it apart?

IMHO it's utter madness to replace one set of ambiguous, difficult-to-parse formatting markers with a different set of ambiguous, difficult-to-parse formatting markers.

"Jet* is an Australian airline. Jet* was founded in 2000 as a spin-off of Qantas."

The common solution to Tradenames with silly spelling or rendering is to do your best once, and then ignore them for the rest of the article, IME.

...

"The biggest problems are bold and/or italics, and parsing and/or regular expressions".

"files are most often written to the /etc or /bin directories"

"The sounds /z/ and /s/ are distinct phonemes in English, but allophones in Spanish."

"Quasi-emoticons like *smiles* and *hugs* are used to..."

Ok; you've convinced me: the singletons are too ambiguous. :-)

...

I think even if you can make rules that can tell them apart, you're adding unnecessary complexity both to the parser and to the user. Compare:

All text between ** and ** is shown in bold.

All text between * and * is shown in bold. Except if the first * is at the start of the line, in which case it's a list. Or if the * is in the middle of a word, in which case it's shown literally. Of course, if you actually do want bold in the middle of a word, do X...

Actually there's a flaw here: ** at the start of a line is going to be ambiguous as well. Bugger.

And 2**5 (exponentiation) is a potential problem as well, yes.

Any in-band approach will have this problem; the trick is to choose a token that reduces it to an acceptable level -- where by "acceptable" I mean "causes fewer problems in the Real World than What We Have Now".

:-)

...

Strangely enough, with my 2 line hack to parser.php, the current text renders exactly correctly:

**Melbourne** is a great city.

**This is a list.

Well, an unadorned second level list item renders poorly just now anyway, right?

...

...
So again: is "turning bold and italics off between two alphanumeric characters" a thing which actually *happens*, much?

Dunno. My parents' company used to be spelt with the first part of the word in bold and the second part in italics, no space in between. There are bound to be a few techie companies spelt like that.

That's not "spelling". That's "rendering", and a policy decision has to be taken as to how much of that is required to be representable.

Cheers, -- jra

Steve Bennett

1:29 p.m.

On 11/12/07, Jay R. Ashworth jra@baylink.com wrote:

...

The common solution to Tradenames with silly spelling or rendering is to do your best once, and then ignore them for the rest of the article, IME.

Yes, well I cheated anyway. It's usually spelt JetStar. The programming language Brainf*ck suddenly comes to mind though...

...

And 2**5 (exponentiation) is a potential problem as well, yes.

Sure. But that kind of sequence is really only likely to occur when quoting programming source, which pretty much has to be nowiki'ed by definition.
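A sketch of why nowiki protection sidesteps the 2**5 case, assuming a hypothetical ** bold rule implemented as a plain regex. The placeholder scheme below is an illustrative assumption, not a description of how MediaWiki actually protects nowiki sections.

```python
import re

NOWIKI = re.compile(r"<nowiki>(.*?)</nowiki>", re.S)
BOLD = re.compile(r"\*\*(.+?)\*\*")   # hypothetical ** bold rule

def render(text: str) -> str:
    """Hide nowiki spans, apply the bold rule, then restore the hidden text."""
    stash = []

    def hide(match):
        stash.append(match.group(1))
        return f"\x00{len(stash) - 1}\x00"   # placeholder the bold rule cannot match

    masked = NOWIKI.sub(hide, text)
    marked = BOLD.sub(r"<b>\1</b>", masked)
    return re.sub(r"\x00(\d+)\x00", lambda m: stash[int(m.group(1))], marked)

print(render("<nowiki>2**5</nowiki> is **thirty-two**"))
```

With the nowiki wrapper the exponent survives and only "thirty-two" is bolded. Applying the same BOLD regex to the unprotected text "2**5 is **thirty-two**" pairs the exponent operator with the first real marker instead.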

Any in-band approach will have this problem; the trick is to choose a token that reduces it to an acceptable level -- where by "acceptable" I mean "causes fewer problems in the Real World than What We Have Now".

Yeah. I think ** and // will do a lot better than ''' and ''.

...

**Melbourne** is a great city.

...
**This is a list.

Well, an unadorned second level list item renders poorly just now anyway, right?

Yes, but I don't like the way you're thinking. If you're thinking that the parser should render this:

**This** is bold

whereas

*Foo **This** is the word This followed by two asterisks...

Well...let's not do that. This might be an acceptable disambiguation rule:

**This** is always bold because there is no space.

** This** is a second-level list because there is a space.

Then again, why not just make the rule that it's *always* bold:

**This** is bold.

**This<nowiki>**</nowiki> is a second-level list.

That's what people will do anyway when they see the problem. It definitely could arise, if the ** is some sort of footnotey thing, but it's going to be pretty rare.

That's not "spelling". That's "rendering", and a policy decision has to be taken as to how much of that is required to be representable.

Yeah, but let's not even think about Wikipedia policy yet. Keep it technical...

Steve

Jay R. Ashworth

2:21 p.m.

On Tue, Nov 13, 2007 at 12:29:42AM +1100, Steve Bennett wrote:

...

On 11/12/07, Jay R. Ashworth jra@baylink.com wrote:

...
The common solution to Tradenames with silly spelling or rendering is to do your best once, and then ignore them for the rest of the article, IME.

Yes, well I cheated anyway. It's usually spelt JetStar. The programming language Brainf*ck suddenly comes to mind though...

[[Brainfuck]] strongly suggests, though not saying it outright, that that's not the formal name of the language. That page *does* seem to need the technical limitations tag, though.

...

...
And 2**5 (exponentiation) is a potential problem as well, yes.

Sure. But that kind of sequence is really only likely to occur when quoting programming source, which pretty much has to be nowiki'ed by definition.

Exactly.

...

...
Any in-band approach will have this problem; the trick is to choose a token that reduces it to an acceptable level -- where by "acceptable" I mean "causes fewer problems in the Real World than What We Have Now".

Yeah. I think ** and // will do a lot better than ''' and ''.

The Neapolitans will certainly think so -- and that's an excellent point: we're currently appropriating a character sequence *which is a valid punctuation character in a non-pictographic language which we serve* as a markup sequence.

...

...
**Melbourne** is a great city.

...
**This is a list.

Well, an unadorned second level list item renders poorly just now anyway, right?

Yes, but I don't like the way you're thinking. If you're thinking that the parser should render this:

**This** is bold

whereas

*Foo **This** is the word This followed by two asterisks...

Well...let's not do that. This might be an acceptable disambiguation rule:

**This** is always bold because there is no space.

** This** is a second-level list because there is a space.

Again: I would be perfectly comfortable ruling that "list markup *must* be followed by whitespace".

IE: <nl>**<sp>item
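The whitespace rule can be sketched in a few lines of Python. This assumes the rule exactly as stated here (asterisks at the start of a line, then at least one space or tab, then the item); LIST_ITEM and line_kind are hypothetical names.

```python
import re

# Assumed rule: <nl>, one or more '*', mandatory whitespace, then the item text.
LIST_ITEM = re.compile(r"^(\*+)[ \t]+(\S.*)$")

def line_kind(line: str):
    """Return ('list', depth, item) or ('text', 0, line) for one source line."""
    match = LIST_ITEM.match(line)
    if match:
        return ("list", len(match.group(1)), match.group(2))
    return ("text", 0, line)

print(line_kind("** This** is a second-level list"))
print(line_kind("**This** is always bold"))
print(line_kind("** **Melbourne** is a great city."))
```

The first line is a depth-2 list item, the second is left as ordinary text for the inline pass, and the third (the case that mangled the two-line hack) becomes a depth-2 item whose body still carries its own ** markers. One consequence worth noting: under this rule "*Foo" with no space would no longer be a list item.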

...

Then again, why not just make the rule that it's *always* bold:

**This** is bold.

**This<nowiki>**</nowiki> is a second-level list.

That's what people will do anyway when they see the problem. It definitely could arise, if the ** is some sort of footnotey thing, but it's going to be pretty rare.

That's consonant with my thinking.

...

...
That's not "spelling". That's "rendering", and a policy decision has to be taken as to how much of that is required to be representable.

Yeah, but let's not even think about Wikipedia policy yet. Keep it technical...

I'm trying, Steve. :-)

Cheers, -- jra

Brion Vibber

13 Nov

7:02 p.m.

Jay R. Ashworth wrote:

...

It has been proposed, informally, that wikitext be modified to prefer, and then eventually require, new markers for bold and italic text inline.

I would recommend against considering this at this time (if ever).

Hopping around changing basic syntax is probably not the thing to do when in the middle of changing the parser mechanics.

-- brion vibber (brion @ wikimedia.org)

Jay R. Ashworth

14 Nov

12:27 a.m.

On Tue, Nov 13, 2007 at 02:02:12PM -0500, Brion Vibber wrote:

...

Jay R. Ashworth wrote:

...
It has been proposed, informally, that wikitext be modified to prefer, and then eventually require, new markers for bold and italic text inline.

I would recommend against considering this at this time (if ever).

Hopping around changing basic syntax is probably not the thing to do when in the middle of changing the parser mechanics.

Since, as nearly as I can determine, *reliably* parsing the apostrophe related markup is unspecifiable in a formal fashion, this pretty much kills the idea completely, as far as I can see.

Cheers, -- jra

Steve Sanbeg

5:03 p.m.

On Tue, 13 Nov 2007 19:27:41 -0500, Jay R. Ashworth wrote:

...

On Tue, Nov 13, 2007 at 02:02:12PM -0500, Brion Vibber wrote:

...
Jay R. Ashworth wrote:

...
It has been proposed, informally, that wikitext be modified to prefer, and then eventually require, new markers for bold and italic text inline.

I would recommend against considering this at this time (if ever).

Hopping around changing basic syntax is probably not the thing to do when in the middle of changing the parser mechanics.

Since, as nearly as I can determine, *reliably* parsing the apostrophe related markup is unspecifiable in a formal fashion, this pretty much kills the idea completely, as far as I can see.

Parsing of the pathological cases doesn't seem specifiable, but a simplified version probably would be.

What if we only allowed ''italic'', '''bold''' and ''''bold italic'''', and required a separator between consecutive markup. I.e. ''a''<s/>'''b''' => ab; ''a'''''b''' => a'''</a>b..?

What if we didn't allow nesting, so ''italic and '''bold''''' had to be written as ''italic and ''<s/>''''bold''''?

That would probably go a long way toward making it specifiable, without affecting 99% of the current text.

Thomas Dalton

5:17 p.m.

...

Parsing of the pathological cases doesn't seem specifiable, but a simplified version probably would be.

What if we only allowed ''italic'', '''bold''' and ''''bold italic'''', and required a separator between consecutive markup. I.e. ''a''<s/>'''b''' => ab; ''a'''''b''' => a'''</a>b..?

What if we didn't allow nesting, so ''italic and '''bold''''' had to be written as ''italic and ''<s/>''''bold''''?

That would probably go a long way toward making it specifiable, without affecting 99% of the current text.

I think it's been agreed that outright rejecting any wikitext is a bad idea. Error messages or not, the parser has to at least try.

Steve Sanbeg

9:24 p.m.

On Wed, 14 Nov 2007 17:17:48 +0000, Thomas Dalton wrote:

...

...
Parsing of the pathological cases doesn't seem specifiable, but a simplified version probably would be.

What if we only allowed ''italic'', '''bold''' and ''''bold italic'''', and required a separator between consecutive markup. I.e. ''a''<s/>'''b''' => ab; ''a'''''b''' => a'''</a>b..?

What if we didn't allow nesting, so ''italic and '''bold''''' had to be written as ''italic and ''<s/>''''bold''''?

That would probably go a long way toward making it specifiable, without affecting 99% of the current text.

I think it's been agreed that outright rejecting any wikitext is a bad idea. Error messages or not, the parser has to at least try.

We don't need error messages; just a way to interpret the syntax without too much lookahead. The combination of ambiguous syntax and nesting is what makes this hard. It's already been decided that we can't change the ambiguous syntax. But it seems like things that aren't much used, like the nesting, may still be on the table.

MinuteElectron

9:31 p.m.

Steve Sanbeg wrote:

...

On Wed, 14 Nov 2007 17:17:48 +0000, Thomas Dalton wrote:

...
...
Parsing of the pathological cases doesn't seem specifiable, but a simplified version probably would be.

What if we only allowed ''italic'', '''bold''' and ''''bold italic'''', and required a separator between consecutive markup. I.e. ''a''<s/>'''b''' => ab; ''a'''''b''' => a'''</a>b..?

What if we didn't allow nesting, so ''italic and '''bold''''' had to be written as ''italic and ''<s/>''''bold''''?

That would probably go a long way toward making it specifiable, without affecting 99% of the current text.

I think it's been agreed that outright rejecting any wikitext is a bad idea. Error messages or not, the parser has to at least try.

We don't need error messages; just a way to interpret the syntax without too much lookahead. The combination of ambiguous syntax and nesting is what makes this hard. It's already been decided that we can't change the ambiguous syntax. But it seems like things that aren't much used, like the nesting, may still be on the table.

Why would you remove the nesting? It is highly useful, saves a lot of time, and forcing it to be done without nesting would confuse non-technical users. What would be the purpose of removing a useful feature? The discussion seems to be swaying more towards ease of documentation/programming rather than usability, which should be the primary goal.

MinuteElectron.

Mark Clements

15 Nov

2:32 p.m.

"MinuteElectron" minuteelectron@googlemail.com wrote in message news:473B6914.1050305@googlemail.com...

...

Steve Sanbeg wrote:

...
On Wed, 14 Nov 2007 17:17:48 +0000, Thomas Dalton wrote:

...
...
Parsing of the pathological cases doesn't seem specifiable, but a simplified version probably would be.

What if we only allowed ''italic'', '''bold''' and ''''bold italic'''', and required a separator between consecutive markup. I.e. ''a''<s/>'''b''' => ab; ''a'''''b''' => a'''</a>b..?

What if we didn't allow nesting, so ''italic and '''bold''''' had to be written as ''italic and ''<s/>''''bold''''?

That would probably go a long way toward making it specifiable, without affecting 99% of the current text.

I think it's been agreed that outright rejecting any wikitext is a bad idea. Error messages or not, the parser has to at least try.

We don't need error messages; just a way to interpret the syntax without too much lookahead. The combination of ambiguous syntax and nesting is what makes this hard. It's already been decided that we can't change the ambiguous syntax. But it seems like things that aren't much used, like the nesting, may still be on the table.

Why would you remove the nesting? It is highly useful, saves a lot of time, and forcing it to be done without nesting would confuse non-technical users. What would be the purpose of removing a useful feature? The discussion seems to be swaying more towards ease of documentation/programming rather than usability, which should be the primary goal.

Agreed. Nesting is used a lot. E.g.

'''''Note:''' this has not yet been verified.''

Removing it is, imho, unthinkable.

- Mark Clements (HappyDog)

Thomas Dalton

2:49 p.m.

...

Agreed. Nesting is used a lot. E.g.

'''''Note:''' this has not yet been verified.''

Removing it is, imho, unthinkable.

That's the kind of thing my idea of a tidier could deal with - it just has to add four apostrophes before the word "this" and all is solved. (Maybe with some separators - depends on the details of the new parser.)

Jay R. Ashworth

4:49 p.m.

On Thu, Nov 15, 2007 at 02:49:49PM +0000, Thomas Dalton wrote:

...

...
Agreed. Nesting is used a lot. E.g.

'''''Note:''' this has not yet been verified.''

Removing it is, imho, unthinkable.

That's the kind of thing my idea of a tidier could deal with - it just has to add four apostrophes before the word "this" and all is solved.

And that's... um, better? ;-)

Cheers, -- jra

Thomas Dalton

5:25 p.m.

On 15/11/2007, Jay R. Ashworth jra@baylink.com wrote:

...

On Thu, Nov 15, 2007 at 02:49:49PM +0000, Thomas Dalton wrote:

...
...
Agreed. Nesting is used a lot. E.g.

'''''Note:''' this has not yet been verified.''

Removing it is, imho, unthinkable.

That's the kind of thing my idea of a tidier could deal with - it just has to add four apostrophes before the word "this" and all is solved.

And that's... um, better? ;-)

For a user, no, for a parser, yes. So a tidier is perfect.

Steve Bennett

16 Nov

12:04 a.m.

On 11/16/07, Mark Clements gmane@kennel17.co.uk wrote:

...

Agreed. Nesting is used a lot. E.g.

'''''Note:''' this has not yet been verified.''

Removing it is, imho, unthinkable.

Fortunately, no one is thinking about removing it. Can we move on?

Steve

Mark Clements

9:11 a.m.

"Steve Bennett" stevagewp@gmail.com wrote in message news:b8ceeef70711151604m234243b2v321b420adffc75b@mail.gmail.com...

...

On 11/16/07, Mark Clements gmane@kennel17.co.uk wrote:

...

...
Agreed. Nesting is used a lot. E.g. '''''Note:''' this has not yet been verified.'' Removing it is, imho, unthinkable.

Fortunately, no one is thinking about removing it. Can we move on?

You clearly haven't been reading closely enough:

"Steve Sanbeg" ssanbeg@ask.com wrote in message news:pan.2007.11.14.17.03.36.769684@ask.com...

...

What if we didn't allow nesting, so ''italic and '''bold''''' had to be written as ''italic and ''<s/>''''bold''''?

That would probably go a long way toward making it specifiable, without affecting 99% of the current text.

- Mark Clements (HappyDog)

George Herbert

14 Nov

8:47 p.m.

On Nov 14, 2007 9:03 AM, Steve Sanbeg ssanbeg@ask.com wrote:

...

On Tue, 13 Nov 2007 19:27:41 -0500, Jay R. Ashworth wrote:

...
On Tue, Nov 13, 2007 at 02:02:12PM -0500, Brion Vibber wrote:

...
Jay R. Ashworth wrote:

...
It has been proposed, informally, that wikitext be modified to prefer, and then eventually require, new markers for bold and italic text inline.

I would recommend against considering this at this time (if ever).

Hopping around changing basic syntax is probably not the thing to do when in the middle of changing the parser mechanics.

Since, as nearly as I can determine, *reliably* parsing the apostrophe related markup is unspecifiable in a formal fashion, this pretty much kills the idea completely, as far as I can see.

Parsing of the pathological cases doesn't seem specifiable, but a simplified version probably would be.

The current syntax is a balance of "user friendly" and "easy to render".

We can introduce unambiguous syntax, but it's probably hard for people to use.

This has been a repeating problem with all embedded text based markups - there aren't enough keys on a keyboard. You can put in a rapidly parsable unambiguous syntax like SGML or HTML or so forth, or any of the other markup languages, but they're all a lot harder to teach people to use.

What if we only allowed ''italic'', '''bold''' and ''''bold italic'''', and required a separator between consecutive markup. I.e. ''a''<s/>'''b''' => ab; ''a'''''b''' => a'''</a>b..?

What if we didn't allow nesting, so ''italic and '''bold''''' had to be written as ''italic and ''<s/>''''bold''''?

That would probably go a long way toward making it specifiable, without affecting 99% of the current text.

Useful thoughts.

-- -george william herbert george.herbert@gmail.com

Steve Bennett

12:57 a.m.

On 11/14/07, Brion Vibber brion@wikimedia.org wrote:

...

I would recommend against considering this at this time (if ever).

Hopping around changing basic syntax is probably not the thing to do when in the middle of changing the parser mechanics.

I would say this:

Some text with '''bold''' and some ''italics'' and even some '''''bold italics'''''.

is basic syntax. We're not changing that.

This:

Some text with '''''bold italics''' then just italics''. Oh and did I mention ''''bold preceded by apostrophes'''' and who knows, some ''''''random'''' combinations''''' of '''' apostrophes ''''''''' and bold/italics''' that no one ''''' can ''''''''''predict the behaviour '''of'...

is not basic syntax. It can't be EBNF'ed. It can't be translated exactly according to the whims of the current parser.

I could accept that the first sentence of my second part is "basic syntax". But not this kind of madness:

# If there is an odd number of both bold and italics, it is likely
# that one of the bold ones was meant to be an apostrophe followed
# by italics. Which one we cannot know for certain, but it is more
# likely to be one that has a single-letter word before it.

That's why we're proposing *adding* ** and //, to provide alternative mechanisms for these complicated situations.

A perhaps simpler solution would be to add a _ character which is rendered as nothing if found in an apostrophe jungle. '''_'' is definitely bold then italics. '''_' is definitely bold beginning with an apostrophe. '''_''' is definitely not going to render the way you think...

Steve

Brion Vibber

4:33 p.m.

Steve Bennett wrote:

...

On 11/14/07, Brion Vibber brion@wikimedia.org wrote:

...
I would recommend against considering this at this time (if ever).

Hopping around changing basic syntax is probably not the thing to do when in the middle of changing the parser mechanics.

I would say this:

Some text with '''bold''' and some ''italics'' and even some '''''bold italics'''''.

is basic syntax. We're not changing that.

This:

Some text with '''''bold italics''' then just italics''. Oh and did I mention ''''bold preceded by apostrophes'''' and who knows, some ''''''random'''' combinations''''' of '''' apostrophes ''''''''' and bold/italics''' that no one ''''' can ''''''''''predict the behaviour '''of'...

is not basic syntax. It can't be EBNF'ed. It can't be translated exactly according to the whims of the current parser.

Note that EBNF is not necessarily desired or desirable; if EBNF can't describe the grammar of the language, then it's not a suitable tool.

Note also that it *is* a requirement to have sane behavior with this sort of construction:

L'''idée'' <- apostrophe followed by italics

L''''idée''' <- apostrophe followed by bold

That's a *requirement* to continue to properly handle French and Italian text. The current apostrophe pass handler uses I believe a lookahead and then goes backwards, which is a fairly sane way of doing this. If EBNF can't handle it, then forget EBNF.
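A simplified Python sketch of the look-ahead-then-go-back idea, assuming the run lengths mean what the thread says (2 = italics, 3 = bold, 5 = both, 4 = a literal apostrophe plus bold) and borrowing the single-letter-word preference from the code comment quoted in this thread. It is not the actual parser code; resolve_apostrophes and the [i]/[b] toggle markers are invented for illustration.

```python
import re

def resolve_apostrophes(line: str) -> str:
    """Replace apostrophe runs with [i]/[b]/[bi] toggle markers."""
    parts = re.split(r"('{2,})", line)      # odd indexes are apostrophe runs
    tokens = []                             # [literal apostrophes, marker]
    for run in parts[1::2]:
        n = len(run)
        if n == 2:
            tokens.append(["", "i"])
        elif n == 3:
            tokens.append(["", "b"])
        elif n == 4:
            tokens.append(["'", "b"])       # one literal apostrophe, then bold
        else:
            tokens.append(["'" * (n - 5), "bi"])

    bolds = sum("b" in marker for _, marker in tokens)
    italics = sum("i" in marker for _, marker in tokens)
    if bolds % 2 == 1 and italics % 2 == 1:
        # Go back over the line: one bold run was probably apostrophe + italics.
        plain_bold = [k for k, (_, marker) in enumerate(tokens) if marker == "b"]
        after_letter = [k for k in plain_bold
                        if re.search(r"(?:^|\W)\w$", parts[2 * k])]
        if plain_bold:
            pick = (after_letter or plain_bold)[0]
            tokens[pick][0] += "'"
            tokens[pick][1] = "i"

    out = [parts[0]]
    for k, (literal, marker) in enumerate(tokens):
        out.append(f"{literal}[{marker}]{parts[2 * k + 2]}")
    return "".join(out)

print(resolve_apostrophes("L'''idée''"))     # L'[i]idée[i]
print(resolve_apostrophes("L''''idée'''"))   # L'[b]idée[b]
```

The point of the sketch is that the decision for the first run cannot be made until the whole line has been counted, which is the part that resists a purely left-to-right description. It makes no attempt to repair nesting order or to cover every combination of runs.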

...

I could accept that the first sentence of my second part is "basic syntax". But not this kind of madness:

# If there is an odd number of both bold and italics, it is likely
# that one of the bold ones was meant to be an apostrophe followed
# by italics. Which one we cannot know for certain, but it is more
# likely to be one that has a single-letter word before it.

That's why we're proposing *adding* ** and //, to provide alternative mechanisms for these complicated situations.

Let me be very very clear here.

Whether or not we ever add ** and // as bold and italic syntax is completely unrelated to the actual task of rebuilding the parser or speccing out a grammar for the wiki syntax.

If you want to play with alternate syntax (adding different markup such as "**" or "//" or "$*^#&*^"), feel free to do so on your own, but please don't mix it into any discussion or work or planning or decision-making about the parser.

New alternates aren't even needed; old alternates already exist ( and ; use of <nowiki></nowiki> as a hidden separator, etc). Other sorts of magic characters might also be neat additions, but they should not be considered at this time because it's just going to sidetrack things.

Don't complicate the situation by tossing in new stuff. Then the conversation goes from something manageable (does this proposed parser technically accomplish the job?) to something unmanageable (should we make a large number of changes to markup?) and we'll never get anywhere.

-- brion vibber (brion @ wikimedia.org)

Steve Bennett

15 Nov

2:49 a.m.

On 11/15/07, Brion Vibber brion@wikimedia.org wrote:

...

Note also that it *is* a requirement to have sane behavior with this sort of construction:

L'''idée'' <- apostrophe followed by italics

L''''idée''' <- apostrophe followed by bold

That's a *requirement* to continue to properly handle French and Italian

Excellent, actual concrete requirements. I wish we could have more of these. What other combinations of bold/

text. The current apostrophe pass handler uses I believe a lookahead and then goes backwards, which is a fairly sane way of doing this.

If EBNF can't handle it, then forget EBNF.

EBNF recognises its own limitations. I believe the correct approach is "If EBNF can't handle it, then extend EBNF or handle that part of the grammar another way".

EBNF describes context-free grammars, and Wikitext is not context-free. We said from the start that EBNF would never be able to describe the whole thing.

Let me be very very clear here.

...

Whether or not we ever add ** and // as bold and italic syntax is completely unrelated to the actual task of rebuilding the parser or speccing out a grammar for the wiki syntax.

I understand. I mention them only as a possible disambiguating solution to the weird corner cases that the current parser does not handle well.

If you say that L''''arc de triomphe''' has to parse as apostrophe+bold, that's good. If you say that every possible combination of apostrophes has to render in exactly the same way as whatever random, unplanned, arbitrary, unpredictable way the current parser does, I'm going to protest.

If you want to play with alternate syntax (adding different markup such as "**" or "//" or "$*^#&*^"), feel free to do so on your own, but please don't mix it into any discussion or work or planning or decision-making about the parser.

Ok. You're the boss :)

New alternates aren't even needed; old alternates already exist ( and ; use of <nowiki></nowiki> as a hidden separator, etc). Other sorts of magic characters might also be neat additions, but they should not be considered at this time because it's just going to sidetrack things.

You did once say, "A formal grammar is something we really need (and it may require fixes to the grammar as well)", and it's a very obvious thing to attempt to improve a grammar when one is analysing and describing it. I'll keep those improvements out of the main discussion though.

Steve

Jay R. Ashworth

4:09 a.m.

On Wed, Nov 14, 2007 at 11:33:39AM -0500, Brion Vibber wrote:

...

Note also that it *is* a requirement to have sane behavior with this sort of construction:

L'''idée'' <- apostrophe followed by italics

L''''idée''' <- apostrophe followed by bold

That's a *requirement* to continue to properly handle French and Italian text. The current apostrophe pass handler uses I believe a lookahead and then goes backwards, which is a fairly sane way of doing this. If EBNF can't handle it, then forget EBNF.

Can someone tell me why bold and italics are considered *part of the spelling of the word* (which seems to be what you're implying here)?

I've never seen that to be the case in any character-based natural language.

...

...
I could accept that the first sentence of my second part is "basic syntax". But not this kind of madness:

# If there is an odd number of both bold and italics, it is likely
# that one of the bold ones was meant to be an apostrophe followed
# by italics. Which one we cannot know for certain, but it is more
# likely to be one that has a single-letter word before it.

That's why we're proposing *adding* ** and //, to provide alternative mechanisms for these complicated situations.

Let me be very very clear here.

Whether or not we ever add ** and // as bold and italic syntax is completely unrelated to the actual task of rebuilding the parser or speccing out a grammar for the wiki syntax.

If you want to play with alternate syntax (adding different markup such as "**" or "//" or "$*^#&*^"), feel free to do so on your own, but please don't mix it into any discussion or work or planning or decision-making about the parser.

New alternates aren't even needed; old alternates already exist ( and ; use of <nowiki></nowiki> as a hidden separator, etc). Other sorts of magic characters might also be neat additions, but they should not be considered at this time because it's just going to sidetrack things.

You're correct.

Failing to replace the apostrophe mangle with something less ambiguous pretty much dooms the entire project of attempting to build a formal spec of the language, which seems prerequisite to implementing a newer, less complex parser therefor.

...

Don't complicate the situation by tossing in new stuff. Then the conversation goes from something manageable (does this proposed parser technically accomplish the job?) to something unmanageable (should we make a large number of changes to markup?) and we'll never get anywhere.

Circular argument: we cannot decide if a replacement parser "technically accomplishes the job" until we decide *what the job is*, which requires *some sort of formal specification of the markup*, whether it be in BNF, EBNF, or something more complicated than that.

Since it does not appear to be possible to *create* such a spec to describe even the *desired* interpretation of the current behaviour, much less what the current parser actually *does*, I'd say we're back to dead in the water.

But let's be clear on why, shall we?

Cheers, -- jra

-- Jay R. Ashworth Baylink jra@baylink.com Designer The Things I Think RFC 2100 Ashworth & Associates http://baylink.pitas.com '87 e24 St Petersburg FL USA http://photo.imageinc.us +1 727 647 1274

Steve Bennett

4:56 a.m.

On 11/15/07, Jay R. Ashworth jra@baylink.com wrote:

...

...
L'''idée'' <- apostrophe followed by italics
L''''idée''' <- apostrophe followed by bold

That's a *requirement* to continue to properly handle French and Italian text. The current apostrophe pass handler uses I believe a lookahead and then goes backwards, which is a fairly sane way of doing this. If EBNF can't handle it, then forget EBNF.

Can someone tell me why bold and italics are considered *part of the spelling of the word* (which seems to be what you're implying here)?

I've never seen that to be the case in any character-based natural language.

I think it's more that L''''idee'' is a commonly used idiom. It's not part of the "spelling of the word", whatever that means.

I wonder whether it's possible to handle some of these idioms directly:

Foo'''bar: definitely a bold-toggle.
A'''bar: definitely an apostrophe followed by an italic-toggle.

That means that [ '''hello a'''bar ] will render as hello a'bar which is surprising, but if no one currently uses that construct, maybe we can get away with it.
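To make the two idioms concrete, here is a minimal Python sketch of that rule. It assumes that "A'''bar" generalises to "any single-letter word before the three apostrophes", which is only one reading of the proposal. The name classify_triples and the labels it returns are made up for illustration and are not part of MediaWiki.

```python
import re

# A run of exactly three apostrophes, plus whatever word characters
# immediately precede it (possibly none).
TRIPLE = re.compile(r"(\w*)(?<!')'{3}(?!')")

def classify_triples(text):
    """Label each ''' run using the two proposed idiom rules (hypothetical)."""
    results = []
    for match in TRIPLE.finditer(text):
        word = match.group(1)
        if len(word) == 1:
            # A'''bar: literal apostrophe, then an italic toggle.
            label = "apostrophe + italic-toggle"
        else:
            # Foo'''bar, or no preceding word at all: bold toggle.
            label = "bold-toggle"
        results.append((word, label))
    return results

print(classify_triples("'''hello a'''bar"))
```

On the '''hello a'''bar example this reports a bold toggle followed by an apostrophe plus italic toggle, which is the surprising reading described above.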

Similarly, it might be worth investigating exactly what mid-word multi-apostrophic constructs are used (yes, Jay, like you suggested...). In French, d'* and l'* are used, and I guess an arbitrary number of others with diminishing likelihood: qu'*, jusqu'*, s'*, and even m'*, t'*, etc.

I hate the parser's (doQuotes()) current approach of trying to second-guess what the user wants: we should be dictating the grammar, and either they are using a rule we specify, or they aren't. I don't really care how complicated the rules get, but we should be able to define them, stick them on a wall, and tell people: if you're not using one of these rules, you're going to get garbage.

Anyway. If we follow the approach I mentioned earlier, then all we have to do is parse apostrophe clumps as a single unit, then sort them out in a second step. Hopefully we can call that function something cute like resolveApostrophalypticChaos() or something. It doesn't really have much impact on the grammar apart from that, so we've probably discussed it enough.
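A minimal Python sketch of that first step, assuming a "clump" means any run of two or more consecutive apostrophes. The token names TEXT and QUOTES and the function name are invented here; the second, sorting-out step is deliberately left out.

```python
import re

def tokenize_apostrophe_clumps(text):
    """Split text into plain-text tokens and apostrophe-clump tokens.

    A clump is kept as one unit carrying only its length, so a later
    pass can decide what each clump means.
    """
    tokens = []
    # With a capturing group, re.split keeps the separators:
    # even indexes are plain text, odd indexes are apostrophe runs.
    for index, part in enumerate(re.split(r"('{2,})", text)):
        if not part:
            continue  # skip empty strings at the edges
        if index % 2 == 1:
            tokens.append(("QUOTES", len(part)))
        else:
            tokens.append(("TEXT", part))
    return tokens

print(tokenize_apostrophe_clumps("L'''idée''"))
```

A lone apostrophe, as in L'idée, is not a clump under this assumption and stays inside its TEXT token.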

Steve

Jay R. Ashworth

1:44 p.m.

On Thu, Nov 15, 2007 at 03:56:21PM +1100, Steve Bennett wrote:

...

On 11/15/07, Jay R. Ashworth jra@baylink.com wrote:

...
...
L'''idée'' <- apostrophe followed by italics
L''''idée''' <- apostrophe followed by bold

That's a *requirement* to continue to properly handle French and Italian text. The current apostrophe pass handler uses I believe a lookahead and then goes backwards, which is a fairly sane way of doing this. If EBNF can't handle it, then forget EBNF.

Can someone tell me why bold and italics are considered *part of the spelling of the word* (which seems to be what you're implying here)?

I've never seen that to be the case in any character-based natural language.

I think it's more that L''''idee'' is a commonly used idiom. It's not part of the "spelling of the word", whatever that means.

If it's a *requirement* that we be able to produce a certain text rendering of a word, then it is no longer merely a rendering, it's part of the spelling of the word -- something without which it's not the same word.

...

Similarly, it might be worth investigating exactly what mid-word multi-apostrophic constructs are used (yes, Jay, like you suggested...). In French, d'* and l'* are used, and I guess an arbitrary number of others with diminishing likelihood: qu'*, jusqu'*, s'*, and even m'*, t'*, etc.

I hate the parser's (doQuotes()) current approach of trying to second-guess what the user wants: we should be dictating the grammar, and either they are using a rule we specify, or they aren't. I don't really care how complicated the rules get, but we should be able to define them, stick them on a wall, and tell people: if you're not using one of these rules, you're going to get garbage.

Well, it will be interesting to see how that plays in Peoria, yes. :-)

Cheers, -- jra

Brion Vibber

17 Nov 2007

1:41 a.m.

Jay R. Ashworth wrote:

...

On Wed, Nov 14, 2007 at 11:33:39AM -0500, Brion Vibber wrote:

...
Note also that it *is* a requirement to have sane behavior with this sort of construction:

L'''idée'' <- apostrophe followed by italics
L''''idée''' <- apostrophe followed by bold

That's a *requirement* to continue to properly handle French and Italian text. The current apostrophe pass handler uses I believe a lookahead and then goes backwards, which is a fairly sane way of doing this. If EBNF can't handle it, then forget EBNF.

Can someone tell me why bold and italics are considered *part of the spelling of the word* (which seems to be what you're implying here)?

I think you'll find what I'm referring to is the fact that the apostrophe is used in languages such as French and Italian to elide the vowels and space between a definite article and a following substantive.

An example:

L'idée ("The idea")

Further, it's frequent for formatting on the substantive to *not* apply to its preceding article.

This means that if we want "idée" italicized because it's important, or a title, or a ship name, or whatever; we'd format it in HTML something like this:

L'<i>idée</i>

When using the double-apostrophe italics markup (inherited from Ward Cunningham's WikiWikiWeb via UseModWiki), this leads to a need to handle markup that looks like this:

L'''idée''

Perhaps Ward wouldn't have picked this syntax if his original wiki were in French or Italian, since this case doesn't come up as often in English (though it can, with contractions and possessives), but he did pick that and we ended up with it, and over the years we've tweaked the implementation to handle these sorts of common cases pretty well.

So, we have a markup, and we have an implementation which uses a fairly straightforward back-facing search to produce behavior which handles the important common cases the way we want.

I strongly doubt that it's impossible to make a specification of that algorithm.
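As a rough illustration of how such a rule could be written down, here is a Python sketch of the odd-count heuristic quoted earlier in the thread. It is not the actual doQuotes() code: the function name, the regex used to detect a single-letter word, the fallback to the first bold run, and the choice to leave runs of five or more apostrophes as literal text are all simplifying assumptions.

```python
import re

def resolve_quotes(line):
    """Simplified sketch: rebalance apostrophe runs, then emit <i>/<b>."""
    parts = re.split(r"('{2,})", line)  # odd indexes are apostrophe runs
    runs = range(1, len(parts), 2)

    # Four apostrophes: one literal apostrophe followed by a bold toggle.
    for i in runs:
        if len(parts[i]) == 4:
            parts[i - 1] += "'"
            parts[i] = "'''"

    italics = sum(1 for i in runs if len(parts[i]) == 2)
    bold_runs = [i for i in runs if len(parts[i]) == 3]

    # Odd number of both: assume one bold run was really apostrophe + italics,
    # preferring a run that follows a single-letter word.
    if italics % 2 == 1 and len(bold_runs) % 2 == 1:
        single = [i for i in bold_runs
                  if re.search(r"(?:^|\s)\w$", parts[i - 1])]
        chosen = (single or bold_runs)[0]
        parts[chosen - 1] += "'"
        parts[chosen] = "''"

    # Emit tags by simple toggling; overlapping bold/italic spans are
    # not re-nested in this sketch.
    out, open_tags = [], []
    for index, part in enumerate(parts):
        tag = {2: "i", 3: "b"}.get(len(part)) if index % 2 else None
        if tag is None:
            out.append(part)  # plain text, or a run this sketch ignores
        elif tag in open_tags:
            open_tags.remove(tag)
            out.append(f"</{tag}>")
        else:
            open_tags.append(tag)
            out.append(f"<{tag}>")
    out.extend(f"</{tag}>" for tag in reversed(open_tags))
    return "".join(out)

print(resolve_quotes("L'''idée''"))
print(resolve_quotes("L''''idée'''"))
```

Under these assumptions the two French examples from the thread come out as an apostrophe followed by italics and an apostrophe followed by bold, respectively.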

-- brion vibber (brion @ wikimedia.org)

Steve Bennett

4:41 a.m.

On 11/17/07, Brion Vibber brion@wikimedia.org wrote:

...

I strongly doubt that it's impossible to make a specification of that algorithm.

Me too. And the more I read about ANTLR, the more convinced I am that we can actually resolve it in a single pass using a recursive descent parser.

(Yes, that's a very big claim to make. But ANTLR has a lot of very cool ways of dealing with context sensitivity.)

In any case, even if we can't duplicate exactly all the behaviour of the current parser in this area, I think we can cover all the important behaviour. L''''idée''' is clearly important behaviour.

Steve

Jay R. Ashworth

2:20 p.m.

New subject: Parser practicum: OOPS :-)

On Sat, Nov 17, 2007 at 03:41:20PM +1100, Steve Bennett wrote:

...

On 11/17/07, Brion Vibber brion@wikimedia.org wrote:

...
I strongly doubt that it's impossible to make a specification of that algorithm.

Me too. And the more I read about ANTLR, the more convinced I am that we can actually resolve it in a single pass using a recursive descent parser.

[ reads Steve's reply ]

[ rereads Brion's posting ]

[ realizes he missed an "im-" ]

Sorry, Brion; nevermind. :-)

Cheers, -- jra

Jay R. Ashworth

2:19 p.m.

New subject: Parser practicum: PINCH

On Fri, Nov 16, 2007 at 08:41:06PM -0500, Brion Vibber wrote:

...

Jay R. Ashworth wrote:

...
Can someone tell me why bold and italics are considered *part of the spelling of the word* (which seems to be what you're implying here)?

I think you'll find what I'm referring to is the fact that the apostrophe is used in languages such as French and Italian to elide the vowels and space between a definite article and a following substantive.

An example:

L'idée ("The idea")

Further, it's frequent for formatting on the substantive to *not* apply to its preceding article.

Hey! Waitaminnit!

That's... an actual *reason*!

You can't do that here! :-)

...

This means that if we want "idée" italicized because it's important, or a title, or a ship name, or whatever; we'd format it in HTML something like this:

L'<i>idée</i>

When using the double-apostrophe italics markup (inherited from Ward Cunningham's WikiWikiWeb via UseModWiki), this leads to a need to handle markup that looks like this:

L'''idée''

In other words, the issue isn't that the bold or ital is part of the spelling, the issue is that if you *need* to emphasize the word, you *don't* emphasize the prefix, by convention.

Got it now. Thanks.

...

Perhaps Ward wouldn't have picked this syntax if his original wiki were in French or Italian, since this case doesn't come up as often in English (though it can, with contractions and possessives), but he did pick that and we ended up with it, and over the years we've tweaked the implementation to handle these sorts of common cases pretty well.

So, we have a markup, and we have an implementation which uses a fairly straightforward back-facing search to produce behavior which handles the important common cases the way we want.

I strongly doubt that it's impossible to make a specification of that algorithm.

Well, happily, it sounds as if Steve Bennett may think otherwise; let's see how he comes along with that.

Cheers, -- jr 'PINCH' a
