Parsing italics/bold


Steve Bennett

13 Nov 2007

8:05 a.m.

What's the best way to approach parsing a long string of formatted text:

1) Treat each incidence of ''' or '' as an element to be translated into <b>, </b>, <i>, or </i>, using state ("context"?) to determine which
2) Have a rule that treats an entire run of '''........''' as a single element, to be transformed into <b>........</b>

I'm not even considering the much-discussed ambiguities of apostrophes. Assuming simple, possibly well-formed but at least not pathological input, which way is best?

A lot of our assumptions about how to parse come from parsing programming languages, but I can't think of an analogous programming language feature: ''' doesn't nest, so it's not like an if-block, and its contents have to be parsed, so it's not like a comment. At best it seems vaguely like an inline compiler flag, a #define/#undef in C, or an OPTION BASE statement in VB, all of which clearly change state and don't require block terminators.

The downside of 1) is it seems to tie us to HTML, and rely on this external entity (the browser) to make sense of the begin/end tokens we spit out. It also requires keeping track of state...

The downside of 2) is it seems difficult to fail gracefully if there is no closing token or if overlapping bold/italics are found. At best, a section of text might have to be parsed twice. At worst, it will be much more pedantic than our current parser, and will ignore improper bold/italics altogether.

Suggestions?

Steve


Steve Bennett

13 Nov

1:27 p.m.

On 11/13/07, Steve Bennett <stevagewp(a)gmail.com> wrote:

...

To answer my own question, I don't think 2) is possible, due to the legitimacy of constructs like:

Here is some ''italics with a [[link|that switches ''off]] the italics.

I think '' and ''' will have to be parsed as rather ambiguous "toggle state of bold/italics" tokens, whose meaning can be made more clear by walking the AST afterwards. It's a pity, because the existing work on the EBNF assumed that they could be treated as blocks.

http://www.mediawiki.org/wiki/Markup_spec (was at meta)

Unless someone wants to jump in and claim that the above construct is a mistake and that ''..'' *should* be a block of some kind.

Steve

PS http://www.usemod.com/cgi-bin/mb.pl?ConsumeParseRenderVsMatchTransform is useful for describing the parser transformation we're trying to achieve. Apparently we're trying to convert a "match-transform" parser into a "consume-parse-render" parser.

Jay R. Ashworth

2:35 p.m.

On Wed, Nov 14, 2007 at 12:27:51AM +1100, Steve Bennett wrote:

...

On 11/13/07, Steve Bennett <stevagewp(a)gmail.com> wrote:

Right here, Steve, you're hitting on the underlying problem with this project: some behavior of the current parser is defined and intentional, and some of it is probably an accident of the implementation. Distinguishing these is probably a) important and b) impossible. Cheers, -- jra -- Jay R. Ashworth Baylink jra(a)baylink.com Designer The Things I Think RFC 2100 Ashworth & Associates http://baylink.pitas.com '87 e24 St Petersburg FL USA http://photo.imageinc.us +1 727 647 1274

Steve Bennett

3:56 p.m.

On 11/14/07, Jay R. Ashworth <jra(a)baylink.com> wrote:

...

Well, of course. The initial brief was to document the current behaviour of the parser precisely. I don't think that's either possible or desirable. What's far more useful is to document the current behaviour over a useful, used subset. I guess ultimately if the correctness of the parser is only going to be judged by the regression tests and the Wikipedia corpus, then a happy medium should be findable. Steve

Steve Sanbeg

5:38 p.m.

On Wed, 14 Nov 2007 00:27:51 +1100, Steve Bennett wrote:

...

On 11/13/07, Steve Bennett <stevagewp(a)gmail.com> wrote:

You can't treat it as a toggle unless you know what you're toggling, which depends on matching open/close delimiters over an entire paragraph, since the end of the paragraph implicitly ends any bold/italic. The behavior you're seeing is an artifact of the multi-stage parsing. That sort of thing really should go if we want to migrate to a recursive descent parser.

Thomas Dalton

5:59 p.m.

...

To answer my own question, I don't think 2) is possible, due to the legitimacy of constructs like: Here is some ''italics with a [[link|that switches ''off]] the italics.

That's a problem we're going to come across a lot. Most parsers solve it by requiring the syntax to be well-formed. Since wikitext is meant to be as foolproof as possible, we don't want to make that requirement, which means we have to write a parser that can understand a terrible mess of tokens. I think toggles are the only way to do it, although even then it's hard since the result isn't going to be a tree.

The only alternative I can think of is running the wikitext through a tidier first that detects that kind of mess and adds the appropriate close and reopen tags. It requires an extra pass through the text, but might be unavoidable. Basically, we accept that wikitext can't be described by EBNF, so start by parsing the wikitext into a more restrictive form of wikitext which can be described by EBNF, and then parsing that. It's a mess, but it's probably better than what we have at the moment.

David Gerard

6:05 p.m.

On 13/11/2007, Thomas Dalton <thomas.dalton(a)gmail.com> wrote:

...

> To answer my own question, I don't think 2) is possible, due to the
> legitimacy of constructs like:
> Here is some ''italics with a [[link|that switches ''off]] the italics.

That's arguably pathological and confusing even as wikitext. If I saw that I'd probably make it clearer by hand.

...

Yes. Throwing an error is absolutely unacceptable if we're going to put this in front of the technophobes who muddle through at present. All strings must be "valid", even if kept from doing wacky things.

...

The only alternative I can think of is running the wikitext through a tidier first that detects that kind of mess and adds the appropriate close and reopen tags. It requires an extra pass through the text, but might be unavoidable.

Definitely, and arguably enhances comprehension of the text. We need such a pass to keep [[text (bracket)|]] and ~~~~ expansion working in any case.

...

Basically, we accept that wikitext can't be described by EBNF, so start by parsing the wikitext into a more restrictive form of wikitext which can be described by EBNF, and then parsing that. It's a mess, but it's probably better than what we have at the moment.

Or that way around, yes :-) - d.

Thomas Dalton

6:16 p.m.

...

Definitely, and arguably enhances comprehension of the text. We need such a pass to keep [[text (bracket)|]] and ~~~~ expansion working in any case.

I wasn't intending the output of the tidier to replace the wikitext, I intended it as a 2-stage parsing process. As other people have said, expanding anything other than ~~~~ is a bad idea. (Proof by anecdote: I only found out about the [[text (bracket)|]] syntax a couple of days ago on this mailing list. Having never read the appropriate help files (I'm male - I don't read manuals ;)), I had no way to know it existed.)

Brion Vibber

7:16 p.m.

Steve Bennett wrote:

...

On 11/13/07, Steve Bennett <stevagewp(a)gmail.com> wrote:

To answer my own question, I don't think 2) is possible, due to the legitimacy of constructs like: Here is some ''italics with a [[link|that switches ''off]] the italics.

That's something of a scary-looking case we might label pathological, but we might well see something like this:

''See also [[HMS Pinafore|the operetta ''HMS Pinafore'']] for some stuff.''

where we want to toggle italic state _entirely_ within link text.

Even if we only handle start-end pairs on the same level of the parse tree, it's necessary to keep track of the parent's state to know how to handle it in the child.

An acceptable rendering might be:

See also <a>the operetta HMS Pinafore<a> for some stuff.

(That empty <i></i> at the end of the link can then be elided.)

Alternatively we can go totally crazy and use CSS to create some kind of anti-italic element.... >:D

See also <a>the operetta HMS Pinafore<a> for some stuff.

But that way may lead madness.

-- brion vibber (brion @ wikimedia.org)
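For readers following along, here is a minimal Python sketch of the state tracking described in the message above. It is not MediaWiki's code (the real parser is PHP); the names `flatten` and `render` are hypothetical, only `''` and piped `[[target|label]]` links are recognised, and the link target is dropped into `href` unchanged purely for illustration. The italic flag is carried into the link text and back out again, and every run of text is emitted with its own well-formed `<i>` wrapper, so the empty italic span at the end of the link never appears.

```python
import re
from html import escape
from itertools import groupby

# Delimiters: an opening "[[target|", a closing "]]", or an italic toggle "''".
TOKEN = re.compile(r"(\[\[[^\]|]*\||\]\]|'')")

def flatten(wikitext):
    """Return (text, italic, link_target) runs for one paragraph."""
    runs = []
    italic = False   # inherited by link text, and carried back out of it
    link = None
    for piece in TOKEN.split(wikitext):
        if not piece:
            continue                      # adjacent delimiters leave empty strings
        if piece == "''":
            italic = not italic
        elif piece == "]]":
            link = None
        elif piece.startswith("[[") and piece.endswith("|"):
            link = piece[2:-1]
        else:
            runs.append((piece, italic, link))
    return runs

def render(runs):
    """Emit nested, well-formed HTML: one <a> per link, one <i> per italic run."""
    out = []
    for link, group in groupby(runs, key=lambda run: run[2]):
        inner = "".join(
            f"<i>{escape(text)}</i>" if italic else escape(text)
            for text, italic, _ in group
        )
        out.append(f'<a href="{escape(link)}">{inner}</a>' if link is not None else inner)
    return "".join(out)

example = "''See also [[HMS Pinafore|the operetta ''HMS Pinafore'']] for some stuff.''"
print(render(flatten(example)))
```

Because nothing is emitted until a text run is seen, a toggle immediately followed by `]]` produces no output at all, which is one way of getting the elision mentioned above for free.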

Jay R. Ashworth

14 Nov

12:28 a.m.

On Wed, Nov 14, 2007 at 02:56:41AM +1100, Steve Bennett wrote:

...

On 11/14/07, Jay R. Ashworth <jra(a)baylink.com> wrote:

Apparently not; see Brion's reply in the Practicum thread.

...

I guess ultimately if the correctness of the parser is only going to be judged by the regression tests and the Wikipedia corpus, then a happy medium should be findable.

Yeah; good luck with that. Cheers, -- jra

Jay R. Ashworth

12:31 a.m.

On Tue, Nov 13, 2007 at 05:59:59PM +0000, Thomas Dalton wrote:

...

I think toggles is the only way to do it, although even then it's hard since the result isn't going to be a tree. The only alternative I can think of is running the wikitext through a tidier first that detects that kind of mess and adds the appropriate close and reopen tags. It requires an extra pass through the text, but might be unavoidable. Basically, we accept that wikitext can't be described by EBNF, so start by parsing the wikitext into a more restrictive form of wikitext which can be described by EBNF, and then parsing that. It's a mess, but it's probably better than what we have at the moment.

Note that this totally screws the people who are hoping for a clean WT-XML parser. Cheers, -- jra

Steve Bennett

1 a.m.

On 11/14/07, Steve Sanbeg <ssanbeg(a)ask.com> wrote:

...

> You can't treat it as a toggle unless you know what you're toggling, which

I don't understand.

> depends on matching open/close delimiters over an entire paragraph, since
> the end of the paragraph implicitly ends any bold/italic.

What I'm suggesting is simply storing it in the tree as a toggle - neither "bold on" nor "bold off", but just "bold toggle". Then a secondary stage walks the tree and matches them. And obviously in the walking it would only walk within a paragraph block.

Steve
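A minimal Python sketch of the two-stage idea in the message above, under simplifying assumptions: the "tree" is just a flat list of nodes for one paragraph, only `''` and `'''` are recognised, and the names `Node`, `first_pass`, `second_pass`, and `render` are hypothetical. The first pass stores bare toggles; the second pass decides whether each one opens or closes, and closes anything still open at the end of the paragraph.

```python
import re
from dataclasses import dataclass

@dataclass
class Node:
    kind: str    # "text", "toggle", "open", or "close"
    value: str   # literal text, or "b" / "i" for markup nodes

def first_pass(paragraph):
    """Split one paragraph into text nodes and unresolved toggle nodes."""
    nodes = []
    for piece in re.split(r"('''|'')", paragraph):
        if piece == "'''":
            nodes.append(Node("toggle", "b"))
        elif piece == "''":
            nodes.append(Node("toggle", "i"))
        elif piece:
            nodes.append(Node("text", piece))
    return nodes

def second_pass(nodes):
    """Replace each toggle with an explicit open or close node."""
    open_order = []   # styles currently open, oldest first
    out = []
    for node in nodes:
        if node.kind != "toggle":
            out.append(node)
        elif node.value in open_order:
            open_order.remove(node.value)
            out.append(Node("close", node.value))
        else:
            open_order.append(node.value)
            out.append(Node("open", node.value))
    # The end of the paragraph implicitly ends any open bold/italic.
    for style in reversed(open_order):
        out.append(Node("close", style))
    return out

def render(nodes):
    """Render resolved nodes as HTML tags; text is passed through unescaped."""
    parts = []
    for node in nodes:
        if node.kind == "open":
            parts.append(f"<{node.value}>")
        elif node.kind == "close":
            parts.append(f"</{node.value}>")
        else:
            parts.append(node.value)
    return "".join(parts)

print(render(second_pass(first_pass("Here is ''italic'' and '''bold"))))
# Here is <i>italic</i> and <b>bold</b>
```

This sketch does not repair overlapping spans such as `''a'''b''c'''`, which would come out misnested; handling that, and the apostrophe ambiguities, would belong in the second pass.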

Steve Bennett

1:05 a.m.

On 11/14/07, Thomas Dalton <thomas.dalton(a)gmail.com> wrote:

...

I think this is a terrible solution. Running a tidier *first* means we have to parse all the text, and I mean *all* of it, twice. All the nowikis, all the pre's, all the places that ''' and '' *don't* get interpreted - all of that logic will have to be coded in the pre-tidier as well as in the parser proper.

Much better IMHO to do your bold/italic (and apostrophe if necessary) logic *afterwards* when you have a logical tree. You could even treat all sequences of 2+ apostrophes (if found in appropriate places) as "to be dealt with later", storing the number of apostrophes in the tree. In a secondary walk, you could probably even reproduce the current pathological behaviour - if we really wanted to.

> Basically, we accept that wikitext can't be described by EBNF,

Yes.

> so start by parsing the wikitext into a more restrictive form of wikitext

No.

> It's a mess,

Yes.

> but it's probably better than what we have at the moment.

Barely.

Steve
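As a rough illustration of "storing the number of apostrophes" for later, here is a small Python sketch. It is an assumption-laden toy, not the MediaWiki tokenizer: the function name `tokenize` is hypothetical, and only `<nowiki>` and `<pre>` are treated as places where apostrophes are left alone.

```python
import re

# <nowiki> and <pre> bodies are captured whole so apostrophes inside stay literal;
# any other run of two or more apostrophes is deferred for a later pass.
TOKEN = re.compile(r"(<nowiki>.*?</nowiki>|<pre>.*?</pre>|'{2,})", re.DOTALL)

def tokenize(wikitext):
    """Return ("text", str), ("verbatim", str), and ("quotes", count) tokens."""
    tokens = []
    for index, piece in enumerate(TOKEN.split(wikitext)):
        if not piece:
            continue
        if index % 2 == 0:
            tokens.append(("text", piece))          # even slots are plain text
        elif piece.startswith("'"):
            tokens.append(("quotes", len(piece)))   # meaning decided later
        else:
            tokens.append(("verbatim", piece))
    return tokens

for token in tokenize("It's '''bold''' but <nowiki>'''not'''</nowiki> here"):
    print(token)
```

The single apostrophe in "It's" stays in the text token, and the run inside `<nowiki>` never becomes a `quotes` token, so a later walk only has to reason about the counts that are actually candidates for markup.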

Steve Bennett

1:07 a.m.

On 11/14/07, David Gerard <dgerard(a)gmail.com> wrote:

...

Definitely, and arguably enhances comprehension of the text. We need such a pass to keep [[text (bracket)|]] and ~~~~ expansion working in any case.

That stuff happens at save time. I assume Thomas was still talking about render time? Personally I'd be inclined to move [[Text (pipetrick)|]] to the real parser, but there was some reason not to (compatibility with older parsers or something). Steve

Thomas Dalton

3:37 p.m.

...

Note that this totally screws the people who are hoping for a clean WT-XML parser.

No, it wouldn't. You would run the tidier before running any parser. You can then parse the tidied wikitext into whatever you like.

Thomas Dalton

3:51 p.m.

...

But that wouldn't be a tree. There is no way of storing toggles in a tree, at least not conceptually. You would end up with something like this:

....................wikitext
......___________|__________
.....|...................|..................|
....B................text...............B

Where "B" means a bold toggle, and "text" is arbitrary text. Things at the same level of a tree shouldn't depend on each other, that's how trees work (and is why you can use CSS to move HTML div tags anywhere you like on the screen, regardless of the order they appear in the source). Your method would probably work, but it's just as much a mess as my idea. And they can't be stored as bold toggle and italic toggles, they'll have to be stored as "x apostrophes" in order for more complicated combinations to work. Your final walk of the tree is going to end up just as complicated as my first pass through the wikitext (it's easy to exclude the few places where bold and italics aren't parsed - it's just pre and nowiki as far as I know, the code just needs to be exploded in the right places in the same way the current parser works).

In summary, the syntax is a complete mess, so both our solutions are complete messes. I'm really not sure which is better, but I don't think there's much in it. My idea does allow for saving the tidied version if people want (I'd prefer it to be an option, rather than happening automatically as someone else suggested), which would be a nice feature, but far from a vital one. It also allows for tidying more than just bold and italics if we find anything else that needs similar treatment (lists, perhaps). Does your idea have any similar added benefits?

David Gerard

4:34 p.m.

On 14/11/2007, Thomas Dalton <thomas.dalton(a)gmail.com> wrote:

...

In summary, the syntax is a complete mess, so both our solutions are complete messes. I'm really not sure which is better, but I don't think there's much in it. My idea does allow for saving the tidied version if people want (I'd prefer it to be an option, rather than happening automatically as someone else suggested), which would be a nice feature, but far from a vital one. It also allows for tidying more than just bold and italics if we find anything else that needs similar treatment (lists, perhaps). Does your idea have any similar added benefits?

I'd pick the one that better enables third-party implementations from the spec. If there's nasty stuff to do as well, at least it can be specified nasty stuff, with public-domain pseudocode to work from and so forth. - d.

Steve Sanbeg

4:48 p.m.

On Wed, 14 Nov 2007 12:00:45 +1100, Steve Bennett wrote:

...

On 11/14/07, Steve Sanbeg <ssanbeg(a)ask.com> wrote:

You can't treat it as a toggle unless you know what you're toggling, which

I don't understand.

Yes, when you see a sequence of multiple apostrophes, that means to toggle something. Referring to it as a bold toggle assumes that you know what you're toggling, which you can't until you've read the paragraph. The bold toggle you're referring to could essentially be 4 things:

* bold on
* bold off
* ' + italic on
* ' + italic off

Longer sequences only make things more complex.
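The ambiguity listed above can be made concrete with a small Python sketch. The function name `readings` is hypothetical, and the enumeration assumes that any surplus apostrophes are literal and come before the marker, as in the "' + italic" reading; it lists candidates only and makes no attempt to pick the right one, which is exactly the part that needs the whole paragraph.

```python
def readings(run_length, bold_open, italic_open):
    """List the candidate meanings of a run of apostrophes.

    Each candidate is (literal_apostrophes, actions), where the marker is
    read as '' (italic), ''' (bold), or ''''' (both).
    """
    state = {"bold": bold_open, "italic": italic_open}
    markers = ((2, ("italic",)), (3, ("bold",)), (5, ("bold", "italic")))
    candidates = []
    for marker_length, styles in markers:
        if marker_length > run_length:
            continue
        literal = "'" * (run_length - marker_length)
        actions = [f"{style} {'off' if state[style] else 'on'}" for style in styles]
        candidates.append((literal, actions))
    return candidates

# With nothing open, ''' is either an apostrophe plus "italic on", or "bold on".
print(readings(3, bold_open=False, italic_open=False))
# With both open, the same three characters mean "italic off" or "bold off".
print(readings(3, bold_open=True, italic_open=True))
```

Across the two calls the four possibilities from the list above all appear, and a run of five already has three candidate readings.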

Thomas Dalton

4:55 p.m.

...

I'd say they're equally good from that perspective. Steve's idea is probably slightly easier to specify simply. My idea would allow 3rd parties to simply not allow invalid syntax and then not worry about tidying it (basically, we have two specs, a strict one and a tolerant one). If their users are slightly more tech-savvy than ours, requiring they use well-formed wikisyntax would be acceptable, if not, then they'll have to do something just as messy as us.

David Gerard

5:02 p.m.

On 14/11/2007, Thomas Dalton <thomas.dalton(a)gmail.com> wrote:

...

> I'd pick the one that better enables third-party implementations from
> the spec. If there's nasty stuff to do as well, at least it can be
> specified nasty stuff, with public-domain pseudocode to work from and
> so forth.

...

Third party parsers will first be used by third party MediaWiki installations. For things like office intranets, whose users will make a technophobic Wikipedian look like a geek by comparison. I beg you, think of the sysadmins! ANY error will make more work for them. Then they'll install MoinMoin instead. Then their lives will suck. Does *any* wiki engine throw errors at bad wikitext, or just give glaringly malformed results? I really think glaringly malformed results are all the error message a user needs. And anything that doesn't cause a glaringly malformed result is not an "error" in the first place. That's the "tag soup is a feature" theory. I realise something that throws errors is easier to implement, but that's not enough justification IMO. - d.

Thomas Dalton

5:15 p.m.

...

I realise something that throws errors is easier to implement, but that's not enough justification IMO.

My suggestion is harder to implement than just making a best guess, since it includes just making a best guess, so I have no idea what you're talking about.

David Gerard

5:40 p.m.

On 14/11/2007, Thomas Dalton <thomas.dalton(a)gmail.com> wrote:

...

> I realise something that throws errors is easier to implement, but
> that's not enough justification IMO.

...

My suggestion is harder to implement than just making a best guess, since it includes just making a best guess, so I have no idea what you're talking about.

Then I have your idea backwards, sorry about that :-) - d.

Thomas Dalton

8:21 p.m.

On 14/11/2007, David Gerard <dgerard(a)gmail.com> wrote:

...

On 14/11/2007, Thomas Dalton <thomas.dalton(a)gmail.com> wrote:

> I realise something that throws errors is easier to implement, but
> that's not enough justification IMO.

My suggestion is harder to implement than just making a best guess, since it includes just making a best guess, so I have no idea what you're talking about.

Then I have your idea backwards, sorry about that :-)

Let me try to explain it again a little more simply: the parser makes its best guess at parsing everything; however, when in debug mode (which it automatically would be when displaying the page after saving) it also displays inline warnings (I was calling them error messages, which isn't very accurate and may have been the cause of some of the confusion).

Steve Bennett

15 Nov

2:38 a.m.

On 11/15/07, Thomas Dalton <thomas.dalton(a)gmail.com> wrote:

...

I think I would propose storing them like that in the first pass, then replacing them with "open B" or "close B" in the second pass, then rendering those out the obvious way.

> Where "B" means a bold toggle, and "text" is arbitrary text. Things at
> the same level of a tree shouldn't depend on each other, that's how
> trees work (and is why you can use CSS to move HTML div tags anywhere

Right, and *after* that second pass, that's how the tree will work.

> you like on the screen, regardless of the order they appear in the
> source). Your method would probably work, but it's just as much a mess
> as my idea. And they can't be stored as bold toggle and italic

I suspect not, but I wouldn't stake my life on it. Parsing the bold/italics at the moment is 200 lines of PHP (see doAllQuotes() and doQuotes() ) - I think a pre-parse "tidy" is going to have to include all that logic, I don't see how you can avoid it. To my mind, walking a tree is much cleaner and simpler than parsing text, and probably faster, too. But I could just be biased :)

> toggles, they'll have to be stored as "x apostrophes" in order for
> more complicated combinations to work. Your final walk of the tree is

Depends how many "complicated combinations" we support.

> going to end up just as complicated as my first pass through the
> wikitext (it's easy to exclude the few places where bold and italics
> aren't parsed - it's just pre and nowiki as far as I know, the code

Yeah, but that does mean you're parsing pre and nowiki (and math, hiero, and possibly others) twice.

> In summary, the syntax is a complete mess, so both our solutions are
> complete messes. I'm really not sure which is better, but I don't
> think there's much in it. My idea does allow for saving the tidied
> version if people want (I'd prefer it to be an option, rather than
> happening automatically as someone else suggested), which would be a

What would you tidy it to? At the moment, there is no unambiguous syntax for mixed apostrophes and bold/italics, other than <nowiki>.

> nice feature, but far from a vital one. It also allows for tidying
> more than just bold and italics if we find anything else that needs
> similar treatment (lists, perhaps). Does your idea have any similar
> added benefits?

Nope, other than it doesn't require processing the text again, and is more akin to the model of context-free grammar we're theoretically aspiring to. It seems cleaner to me to clearly define the exception to the EBNF this way, but that could be a bias not based on much real evidence.

The only reason it matters at this point which solution will be used is for the grammar. I'm currently working on the basis that ''' is just a bold toggle which is processed and finished with. If you already know that all the bolds and italics have been normalised, then it would be possible to create a '''...''' openbold/closebold block.

In general I don't really like the idea of treating blocks as genuine blocks in wikitext because of the error-handling problem. In a programming language, if a block is malformed, you pretty much just abort the compile. We have to carry on as sensibly as possible, so it's better not to find yourself 500 characters into three nested blocks before suddenly discovering a non-permissible new line.

Steve

Thomas Dalton

10:39 a.m.

...

Yeah, but that does mean you're parsing pre and nowiki (and math, hiero, and possibly others) twice.

I wouldn't call it parsing them. It's just one line of php...

...

What would you tidy it to? At the moment, there is no unambiguous syntax for mixed apostrophes and bold/italics, other than <nowiki>.

Well, things like:

This '''bold text stops [[page|halfway through''' a link]].

Would become:

This '''bold text stops '''[[page|'''halfway through''' a link]].

More complicated things might be a little harder, but pretty much anything is possible to describe unambiguously in wikisyntax, so once we determine what each weird sequence should mean, it can be rewritten more clearly (for example, don't allow any nesting at all, and treat bold italic as a separate 3rd format).
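A minimal Python sketch of the rewrite shown in the message above, under narrow assumptions: it handles only `'''` and piped `[[target|label]]` links within a single paragraph, it does not skip `<nowiki>` or `<pre>`, and the name `tidy_bold` is hypothetical. Whenever bold is open at a link boundary, it closes bold on one side of the boundary and reopens it on the other.

```python
import re

# Delimiters: a bold toggle, an opening "[[target|", or a closing "]]".
TOKEN = re.compile(r"('''|\[\[[^\]|]*\||\]\])")

def tidy_bold(wikitext):
    """Rewrite bold spans so that none of them crosses a link boundary."""
    out = []
    bold = False
    for index, piece in enumerate(TOKEN.split(wikitext)):
        if index % 2 == 0:
            out.append(piece)            # even slots are plain text
        elif piece == "'''":
            bold = not bold
            out.append(piece)
        elif bold:
            # Link boundary inside bold: close before it, reopen after it.
            out.append(f"'''{piece}'''")
        else:
            out.append(piece)
    return "".join(out)

messy = "This '''bold text stops [[page|halfway through''' a link]]."
print(tidy_bold(messy))
# This '''bold text stops '''[[page|'''halfway through''' a link]].
```

Running the function on its own output leaves it unchanged for this input, which is the property a tidier needs if the tidied text might ever be saved and tidied again.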

...

I think the amount of processing is the same either way. With my way, we do end up with a pure EBNF parser, just with something tacked on the beginning. With your way, the EBNF part and the exception are all mixed together. Any third opinions?

Jay R. Ashworth

1:50 p.m.

On Thu, Nov 15, 2007 at 10:39:55AM +0000, Thomas Dalton wrote:

...

More complicated things might be a little harder, but pretty much anything is possible to describe unambiguously in wikisyntax, so once we determine what each weird sequence should mean, it can be rewritten more clearly (for example, don't allow any nesting at all, and treat bold italic as a separate 3rd format).

So, to be clear, you're suggesting that we subst: complicated constructs by their clear, simple equivalents, and then define the grammar based on the target of that //subst//itution?

Ok, yeah, that sounds like it might be slightly more possible. Helps pedagogically as well, as long as users are informed that the system "slightly modified their markup to make it easier to understand."

Of course, some people might get annoyed by *that*, and they'll be power users... You really can't win, here, can you?

Cheers, -- jra

Thomas Dalton

2:19 p.m.

...

My original idea was not to save the tidied version but to tidy the page every time it's parsed. Saving it is an option. Whether it's a good idea or not probably depends on how much the markup gets changed - if it's a lot it might confuse users too much. It could easily be made a configuration option, though.

David Gerard

3:27 p.m.

On 15/11/2007, Thomas Dalton <thomas.dalton(a)gmail.com> wrote:

...

Oh, definitely. Don't scare the n00bs (or corporate intranet users), but do make it an option for Wikipedians and other MediaWiki power users. A very nice one, I think. - d.

Thomas Dalton

5:21 p.m.

On 15/11/2007, David Gerard <dgerard(a)gmail.com> wrote:

...

On 15/11/2007, Thomas Dalton <thomas.dalton(a)gmail.com> wrote:

Oh, definitely. Don't scare the n00bs (or corporate intranet users), but do make it an option for Wikipedians and other MediaWiki power users. A very nice one, I think.

I was thinking a site config option rather than a user preference, but both ideas have merit.


wikitech-l@lists.wikimedia.org


28 comments


participants (6)

Brion Vibber
David Gerard
Jay R. Ashworth
Steve Bennett
Steve Sanbeg
Thomas Dalton