dividing front-end from back-end grammar and parsers


William Allen Simpson

11 Nov 2007

3:42 a.m.

I've just read the past couple of days of discussion, and would like to agree with Merlijn.

One of the points missed is that the pipe trick and many of the other "end cases" are actually pre-processed, not stored in the database.

The easy examples being:
* [[turkey (bird)|]] is stored as [[turkey (bird)|turkey]]
* [[stuff]]ing is stored as [[stuff|stuffing]]

Other such behaviors could be regularized, and not affect the existing articles. Some years back, I made some suggestions in this wise, but they were not accepted.

A case I was concerned with at the time was normalized pre-processing of [[stuff:]] versus [[:stuff]], and [[|stuff]] versus [[stuff|]], and their combinations -- [[:stuff (action)|]]. This is the kind of thing that could most easily be formalized.
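The on-save expansion of the pipe trick can be sketched roughly as follows. This is a simplified Python illustration, not MediaWiki's actual code: the names `pipe_label` and `expand_pipe_trick` are hypothetical, and the label rules (drop a leading colon, a namespace-like prefix, and a trailing parenthetical) are an assumed subset of the real behaviour.

```python
import re

# Matches [[target|]], i.e. a link whose label after the pipe is empty.
PIPE_TRICK = re.compile(r"\[\[([^\[\]|]+)\|\]\]")


def pipe_label(target: str) -> str:
    """Derive a display label from a link target (assumed, simplified rules)."""
    label = target.lstrip(":")
    # Drop a namespace-like prefix such as "Help:".
    if ":" in label:
        label = label.split(":", 1)[1]
    # Drop a trailing parenthetical such as " (bird)".
    label = re.sub(r"\s*\([^()]*\)\s*$", "", label)
    return label.strip()


def expand_pipe_trick(wikitext: str) -> str:
    """Rewrite every [[target|]] as [[target|label]] before saving."""
    return PIPE_TRICK.sub(
        lambda m: f"[[{m.group(1)}|{pipe_label(m.group(1))}]]", wikitext
    )
```

Under these assumptions, `[[turkey (bird)|]]` becomes `[[turkey (bird)|turkey]]` and `[[:stuff (action)|]]` becomes `[[:stuff (action)|stuff]]`, while `[[stuff]]ing` is left untouched because it contains no empty pipe.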

In regularizing the grammar, think about how the back-end data could be normalized to a new grammar for editing, and then stored again in the back-end form. For example, the // and ** ideas we've talked about multiple times over the years. No reason that the database couldn't continue to store them as '' and '''. Or better as <i> and <b>!
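The round trip between an editing grammar and the stored form could look something like the sketch below. It assumes `//` and `**` as the editing markers for italics and bold, uses the hypothetical names `to_storage` and `to_editing`, and is deliberately naive: it would mangle URLs containing `//` and does not handle ambiguous runs of apostrophes.

```python
import re


def to_storage(edit_text: str) -> str:
    """Convert editing-form markup (** and //) to stored '' and '''."""
    text = re.sub(r"\*\*(.+?)\*\*", r"'''\1'''", edit_text)
    text = re.sub(r"//(.+?)//", r"''\1''", text)
    return text


def to_editing(stored_text: str) -> str:
    """Convert stored '' and ''' back to the editing form."""
    # Handle the longer marker first so ''' is not read as '' plus '.
    text = re.sub(r"'''(.+?)'''", r"**\1**", stored_text)
    text = re.sub(r"''(.+?)''", r"//\1//", text)
    return text
```

For simple input such as `**bold** and //italic//` the two functions are inverses of each other; anything beyond that would need a real grammar rather than regular expressions.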

If we stick to just front-end parsing, the project might be doable in our lifetimes.

===

And as a final note for the computer scientists, remember that we often use LR(1) and LALR(1) grammars, but RL(1) is also possible! MW syntax has often seemed to me more like RL....

(Yes, back in university we were all required to write a parser -- a year-long project. I've written several for later projects, too. But university was a very long time ago.)


Simetrical

11 Nov

3:48 a.m.

On 11/10/07, William Allen Simpson william.allen.simpson@gmail.com wrote:

...

I've just read the past couple of days of discussion, and would like to agree with Merlijn.

One of the points missed is that the pipe trick and many of the other "end cases" are actually pre-processed, not stored in the database.

The easy examples being:

[[turkey (bird)|]] is stored as [[turkey (bird)|turkey]]

[[stuff]]ing is stored as [[stuff|stuffing]]

Other such behaviors could be regularized, and not affect the existing articles. Some years back, I made some suggestions in this wise, but they were not accepted.

Because they tend to result in features that are hard to discover. Really, the pipe trick shouldn't be an on-save transform. About the only legitimate things that should be are substing (which is explicitly on-save) and signatures/timestamps.

Steve Bennett

7:17 a.m.


On 11/11/07, William Allen Simpson william.allen.simpson@gmail.com wrote:

...

In regularizing the grammar, think about how the back-end data could be normalized to a new grammar for editing, and then stored again in the back-end form. For example, the // and ** ideas we've talked about multiple times over the years. No reason that the database couldn't continue to store them as '' and '''.

Heh, take an unambiguous syntax, and save it as an ambiguous syntax. That's genius...

Seriously though, the general idea is sound. IMHO it would apply equally to ISBN: it would make more sense to detect ISBN 123456789 and *at save time* (and possibly with user warning) replace it with {{ISBN|123456789}}, which would make it clear in the code that it's treated specially.

I think Simetrical is wrong that such "features...are hard to discover". How on earth would you ever discover the current ISBN behaviour?

Steve

Simetrical

5:32 p.m.


On 11/11/07, Steve Bennett stevagewp@gmail.com wrote:

...

I think Simetrical is wrong that such "features...are hard to discover". How on earth would you ever discover the current ISBN behaviour?

By seeing an existing ISBN link, saying "Hey, how did they do that?", and looking at the page source. That's how people (at least techy people) normally learn how to use various languages: they do it by copying examples. If it got magically converted to a template, you would say "Oh, it's just a template", and use the template. Unless you happened to stumble across a specific discussion of the feature, you would never know about it.

There's a reason, you know, that talk pages tend to have a message at the top telling everyone about ~~~~.

Steve Bennett

12 Nov

12:21 a.m.

On 11/12/07, Simetrical Simetrical+wikilist@gmail.com wrote:

...

By seeing an existing ISBN link, saying "Hey, how did they do that?", and looking at the page source. That's how people (at least techy people) normally learn how to use various languages: they do it by copying examples. If it got magically converted to a template, you would say "Oh, it's just a template", and use the template. Unless you happened to stumble across a specific discussion of the feature, you would never know about it.

If typing "ISBN xxx" is automagically converted to "{{ISBN|xxx}}" then there's no harm in people only knowing the latter form, is there...

The same goes for pipe tricks - ultimately all it can do is save you a few keystrokes. You don't get any new functionality by typing [[Foo (blah)|]] that you couldn't get just by typing [[Foo (blah)|Foo]].

...

There's a reason, you know, that talk pages tend to have a message at the top telling everyone about ~~~~.

That one's a bit different. There's no other way to produce the signature (short of manually typing your linked username and the date and time), and the transformed output bears no relation to the input.

Steve

Simetrical

12:24 a.m.


On 11/11/07, Steve Bennett stevagewp@gmail.com wrote:

...

If typing "ISBN xxx" is automagically converted to "{{ISBN|xxx}}" then there's no harm in people only knowing the latter form, is there...

The same goes for pipe tricks - ultimately all it can do is save you a few keystrokes. You don't get any new functionality by typing [[Foo (blah)|]] that you couldn't get just by typing [[Foo (blah)|Foo]].

Right, but if the functionality exists, we should not go out of our way to obscure it by pre-save transformation. There's no reason not to leave it in the page text itself.

...

That one's a bit different. There's no other way to produce the signature (short of manually typing your linked username and the date and time), and the transformed output bears no relation to the input.

Exactly why it's impossible in that case to leave it in the page text. There are no other excuses, I don't think.

Steve Bennett

12:41 a.m.


On 11/12/07, Simetrical Simetrical+wikilist@gmail.com wrote:

...

Right, but if the functionality exists, we should not go out of our way to obscure it by pre-save transformation. There's no reason not to leave it in the page text itself.

This thread is about separating the logic into two places. "The functionality" could exist in the save logic, or the rendering logic, or conceivably both. I'm suggesting we move it strictly to the save logic. Typing ISBN xxx looks and feels like shorthand, rather than some genuine syntactic rule. Making it a save-time feature akin to autocorrect would be appropriate.
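A save-time "autocorrect" of this kind might be sketched as below. The function name `autocorrect_isbn` is hypothetical, the `{{ISBN|...}}` template is the one proposed in this thread rather than a confirmed existing template, and the pattern is an assumed, loose one (digits and hyphens, optional trailing X) with no checksum validation.

```python
import re

# Bare "ISBN <number>" as a user would naturally type it.
# Text already written as {{ISBN|...}} does not match, because the
# keyword must be followed by whitespace.
BARE_ISBN = re.compile(r"\bISBN\s+(\d[\d-]*[\dXx])\b")


def autocorrect_isbn(wikitext: str) -> str:
    """Rewrite bare ISBN mentions as explicit template calls on save."""
    return BARE_ISBN.sub(r"{{ISBN|\1}}", wikitext)
```

Because already-templated text is skipped, running the transform again on saved text changes nothing, which is the property a save-time step needs if pages are edited repeatedly.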

Steve

Simetrical

1:16 a.m.


On 11/11/07, Steve Bennett stevagewp@gmail.com wrote:

...

This thread is about separating the logic into two places. "The functionality" could exist in the save logic, or the rendering logic, or conceivably both. I'm suggesting we move it strictly to the save logic. Typing ISBN xxx looks and feels like shorthand, rather than some genuine syntactic rule. Making it a save-time feature akin to autocorrect would be appropriate.

I disagree that shorthand features should not be saved in the wikitext, and I know at least a couple of other devs have expressed vaguely similar sentiments in the past.

Steve Bennett

1:21 a.m.


On 11/12/07, Simetrical Simetrical+wikilist@gmail.com wrote:

...

I disagree that shorthand features should not be saved in the wikitext, and I know at least a couple of other devs have expressed vaguely similar sentiments in the past.

It might simplify the task of people who want to reuse Wikipedia content in other forms if we could keep the grammar as small as possible.

Steve

Jim Wilson

9:50 p.m.

Simetrical wrote:

...

I disagree that shorthand features should not be saved in the wikitext, and I know at least a couple of other devs have expressed vaguely similar sentiments in the past.

I agree with that. All wikitext is "shorthand" for something which could otherwise be represented in HTML. Pre-save transforms should be limited.

-- Jim R. Wilson (jimbojw)


Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l

Steve Bennett

13 Nov

1:14 a.m.

On 11/13/07, Jim Wilson wilson.jim.r@gmail.com wrote:

...

I agree with that. All wikitext is "shorthand" for something which could otherwise be represented in HTML. Pre-save transforms should be limited.

That's not the "shorthand" sense I mean. Think about any of the language features, whether it's italics, lists, links, __TOC__, parser functions, <gallery>, tables, whatever. They all work on the basis that someone deliberately typed some weird punctuation to tell the parser to treat the text differently.

ISBN is different: the parser deliberately tries to detect text that the user typed naturally ("ISBN 123456789" being the normal, unmarked formatting used in the real world) and treat it specially. It's not a real grammatical feature, it's a deliberate effort to achieve markup with no effort.

The only other feature I can think of that works that way is bare urls: http://foo.com

Anyway, it's not a major issue. There are bigger fish to fry.

Steve

Platonides

2:05 p.m.


Steve Bennett wrote:

...

ISBN is different: the parser deliberately tries to detect text that the user typed naturally ("ISBN 123456789" being the normal, unmarked formatting used in the real world) and treat it specially. It's not a real grammatical feature, it's a deliberate effort to achieve markup with no effort.

The only other feature I can think of that works that way is bare urls: http://foo.com

Anyway, it's not a major issue. There are bigger fish to fry.

Steve

It also works for RFCs.

Jim Hu

14 Nov

6:16 p.m.

And PMIDs. The RFC bit me and I had to disable it. In E. coli we have a gene named rfc, which caused all sorts of interesting problems when we tried to make wiki pages about it. It would be nice if these were less Easter-egg like and more configurable for us non WMF sites.

Come to think of it, upgrades probably turn this back on. Sigh.
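For illustration, a render-time magic-link pass with a per-site switch might look like the sketch below. Everything here is an assumption made for the example: the function `link_magic_words`, the `enabled` parameter, and the URL templates are not taken from the MediaWiki parser, which (as noted later in the thread) hardcodes its strings.

```python
import re

# Hypothetical keyword -> URL template table, for illustration only.
MAGIC_LINKS = {
    "RFC": "https://www.rfc-editor.org/rfc/rfc{id}",
    "PMID": "https://pubmed.ncbi.nlm.nih.gov/{id}/",
}


def link_magic_words(text, enabled=frozenset(MAGIC_LINKS)):
    """Turn 'RFC 1234'-style mentions into external links for enabled keywords."""
    for keyword in sorted(enabled):
        url = MAGIC_LINKS[keyword]
        pattern = re.compile(rf"\b{keyword} (\d+)\b")
        # Bind url and keyword as defaults so each lambda keeps its own values.
        text = pattern.sub(
            lambda m, u=url, k=keyword: f"[{u.format(id=m.group(1))} {k} {m.group(1)}]",
            text,
        )
    return text
```

A site that wants no RFC linking would call `link_magic_words(text, enabled={"PMID"})`, and the setting would survive upgrades as long as it lived in site configuration rather than in the parser source.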


===================================== Jim Hu Associate Professor Dept. of Biochemistry and Biophysics 2128 TAMU Texas A&M Univ. College Station, TX 77843-2128 979-862-4054

Thomas Dalton

6:24 p.m.


On 14/11/2007, Jim Hu jimhu@tamu.edu wrote:

...

And PMIDs. The RFC bit me and I had to disable it. In E. coli we have a gene named rfc, which caused all sorts of interesting problems when we tried to make wiki pages about it. It would be nice if these were less Easter-egg like and more configurable for us non WMF sites.

Come to think of it, upgrades probably turn this back on. Sigh.

Ouch! I just looked at that part of the parser - I hadn't realised the strings were hardcoded... they should be in the messages files like any other string...

Steve Bennett

15 Nov

7:42 a.m.

On 11/15/07, Jim Hu jimhu@tamu.edu wrote:

...

And PMIDs. The RFC bit me and I had to disable it. In E. coli we have a gene named rfc, which caused all sorts of interesting problems when we tried to make wiki pages about it. It would be nice if these were less Easter-egg like and more configurable for us non WMF sites.

Yep. There's all sorts of reasons it's a bad idea. Take a look at this page:

http://en.wikipedia.org/wiki/Private_network

Here the RFC magic link is working to its full extent. But notice:
- Every RFC is linked, even the ones that are used more than once
- The RFC's are automatically linked like this: [http:... RFC 1234], rather than RFC 1234[http:...] or even RFC 1234<ref>http...</ref>

Having a piece of hardcoded parser magic dictate style and presentation over the Manual of Style is crappy. And that's in a best-case scenario. We have lots of other templates that do a similar job ({{imdb}} for instance), so it makes very little sense to me to give ISBN's, RFC's and PMID's (whatever they are) this special treatment.

</rant>

Our new parser will dutifully recognise them, of course. :)

Steve

Simetrical

16 Nov

7:28 p.m.

On 11/15/07, Steve Bennett stevagewp@gmail.com wrote:

...

Yep. There's all sorts of reasons it's a bad idea. Take a look at this page:

http://en.wikipedia.org/wiki/Private_network

Here the RFC magic link is working to its full extent. But notice:

Every RFC is linked, even the ones that are used more than once

The RFC's are automatically linked like this: [http:... RFC 1234], rather than RFC 1234[http:...] or even RFC 1234<ref>http...</ref>

Having a piece of hardcoded parser magic dictate style and presentation over the Manual of Style is crappy. And that's in a best-case scenario. We have lots of other templates that do a similar job ({{imdb}} for instance), so it makes very little sense to me to give ISBN's, RFC's and PMID's (whatever they are) this special treatment.

</rant>

Would anyone object if I just deleted this functionality, so people could use templates like for everything else? Or would that be a little too drastic? As a historical thing, this stuff was introduced well before templates, I believe, and was subsequently obsoleted by them.

Thomas Dalton

7:50 p.m.


...

Would anyone object if I just deleted this functionality, so people could use templates like for everything else? Or would that be a little too drastic? As a historical thing, this stuff was introduced well before templates, I believe, and was subsequently obsoleted by them.

I'm sure somebody somewhere would object. At the very least, you should run a bot over the WMF wikis switching any use of the feature to templates. Better yet, provide a conversion script so anyone can run it when updating their installations.

MinuteElectron

7:56 p.m.


Simetrical wrote:

...

Would anyone object if I just deleted this functionality, so people could use templates like for everything else? Or would that be a little too drastic? As a historical thing, this stuff was introduced well before templates, I believe, and was subsequently obsoleted by them.

While esoteric this feature is very nice and saves lots of work for users. Removing it makes no sense and will just generate more edits due to bots and people having to add such templates/links manually. At the very most, introduce a global variable that disables this; however, that seems too much effort for such an insignificant feature.

MinuteElectron.

Steve Bennett

17 Nov

6:35 a.m.

On 11/17/07, Simetrical Simetrical+wikilist@gmail.com wrote:

...

Would anyone object if I just deleted this functionality, so people could use templates like for everything else? Or would that be a little too drastic? As a historical thing, this stuff was introduced well before templates, I believe, and was subsequently obsoleted by them.

I once created {{ISBN}} which wrapped the ISBN magic word on en. It was deleted for redundancy. If this does happen, you should give people a bit of warning...

I can't agree with MinuteElectron that "While esoteric this feature is very nice and saves lots of work for users." How much work can be saved by eliminating 4 keystrokes every time you've looked up and typed in an ISBN?

Steve

MinuteElectron

10:12 a.m.


Steve Bennett wrote:

...

users." How much work can be saved by eliminating 4 keystrokes every time you've looked up and typed in an ISBN?

It is not necessarily the amount of work, but an inexperienced editor may neglect to put the ISBN template tag in at all, meaning there will either never be one, or someone else has to realise it and edit the article to add one. Many users do not know how to use templates and so by forcing them to use them for such basic functionality, many would either ignore the fact, or just not investigate how to fix it. Regardless, why would one remove a feature from the parser? It makes no sense.

MinuteElectron.

Steve Bennett

2:19 p.m.


On 11/17/07, MinuteElectron minuteelectron@googlemail.com wrote:

...

It is not necessarily the amount of work, but an inexperienced editor may neglect to put the ISBN template tag in at all, meaning there will either never be one, or someone else has to realise it and edit the

This is terrible justification. An experienced editor may forget to add:
* Sources
* Categories
* Assertion of notability
* Navbox
* Infobox
* Image
* Link to Commons category
* See also links
* Links to other articles in the text
* Stub template
* Footnotes
* Unit conversion templates

The software can't magically fix those. Forgetting to add the *link* to a correctly cited ISBN? Pfft. The least of our worries.

...

Regardless, why would one remove a feature from the parser? It makes no sense.

Because features that add complexity without a major benefit aren't good. Because you don't always want to link to absolutely every RFC and every ISBN.

Steve

Simetrical

18 Nov

1:04 a.m.

Regardless, Brion's answer was "not at this time".

Steve Bennett

3:07 a.m.


On 11/18/07, Simetrical Simetrical+wikilist@gmail.com wrote:

...

Regardless, Brion's answer was "not at this time".

Yep. It doesn't mean that *other* arguments in favour of that construct aren't lousy :) The reasons for retaining the construct are basically a) to avoid changing the grammar we're trying to define, b) to avoid breaking existing wikitext. Not c) because the construct itself is actually valuable.

Steve

Platonides

11 Nov

11:34 p.m.

William Allen Simpson wrote:

...

I've just read the past couple of days of discussion, and would like to agree with Merlijn.

One of the points missed is that the pipe trick and many of the other "end cases" are actually pre-processed, not stored in the database.

The easy examples being:

[[turkey (bird)|]] is stored as [[turkey (bird)|turkey]]

[[stuff]]ing is stored as [[stuff|stuffing]]

Only the first is actually expanded on save. http://es.wikipedia.org/w/index.php?diff=12747472&oldid=12747041&dif...


wikitech-l@lists.wikimedia.org

23 comments

8 participants


participants (8)

Jim Hu
Jim Wilson
MinuteElectron
Platonides
Simetrical
Steve Bennett
Thomas Dalton
William Allen Simpson