What's the status of the project to create a grammar for Wikitext in EBNF?
There are two pages: http://meta.wikimedia.org/wiki/Wikitext_Metasyntax http://www.mediawiki.org/wiki/Markup_spec
Nothing seems to have happened since January this year. Also the comments on the latter page seem to indicate a lack of clear goal: is this just a fun project, is it to improve the existing parser, or is it to facilitate a new parser? It's obviously a lot of work, so it needs to be of clear benefit. Brion requested the grammar IIRC (and there's a comment to that effect at http://bugzilla.wikimedia.org/show_bug.cgi?id=7 ), so I'm wondering what became of it.
Is there still a goal of replacing the parser? Or is there some alternative plan?
Steve
"Steve Bennett" stevagewp@gmail.com wrote in message news:b8ceeef70711071509q14ef0fbdx123cae126940b271@mail.gmail.com...
What's the status of the project to create a grammar for Wikitext in EBNF?
There are two pages: http://meta.wikimedia.org/wiki/Wikitext_Metasyntax http://www.mediawiki.org/wiki/Markup_spec
Nothing seems to have happened since January this year. Also the comments on the latter page seem to indicate a lack of clear goal: is this just a fun project, is it to improve the existing parser, or is it to facilitate a new parser? It's obviously a lot of work, so it needs to be of clear benefit. Brion requested the grammar IIRC (and there's a comment to that effect at http://bugzilla.wikimedia.org/show_bug.cgi?id=7 ), so I'm wondering what became of it.
Is there still a goal of replacing the parser? Or is there some alternative plan?
The plan is normally to document the parser so it is properly defined.
This is normally abandoned when it is realised that this is 'very hard' (impossible in BNF/EBNF form according to some people).
Hence several attempts with no current activity.
- Mark Clements (HappyDog)
On 11/8/07, Mark Clements gmane@kennel17.co.uk wrote:
This is normally abandoned when it is realised that this is 'very hard' (impossible in BNF/EBNF form according to some people).
Not to mention that BNF is not really suited to the task. BNF is supposed to answer the question "does text A match grammar B?" However, essentially all wikitext is "valid" - so we're really looking for something that answers the question "how should text A be rendered" or "what is the meaning of text A" or even "how should text A be converted into a decorated* syntax tree".
For example, this text is "valid":
#foo
##boo
#**moo
##woo
Is BNF capable of expressing that the second-level numbering restarts at "woo"? Do we even care - is our grammar attempting to describe an actual rendering of input text into output text, or does it consider the output to be HTML (<OL> etc).
So:
- What do we mean by a "grammar" of Wikitext?
- Are we also attempting to define the semantics of that grammar?
- Do the semantics incorporate the semantics of the HTML elements themselves?
- Why are we doing this anyway?
Steve
On 11/7/07, Steve Bennett stevagewp@gmail.com wrote:
What exactly is the "goal"? If it's just "formally defining whatever it is that the code currently does", is that a worthy goal?
Probably not. The best we can hope for is likely something like:
1) A BNF grammar is developed that fits almost all the commonly-used features in. This will probably require unlimited lookahead, but I do think (without, admittedly, much of any formal grounding in the theory of all this) it's possible if that's allowed, keeping in mind the "almost all" caveat.
2) Now that we have a grammar, a yacc parser is compiled, and appropriate rendering bits are added to get it to render to HTML.
3) The stuff the BNF grammar doesn't cover is tacked on with some other methods. In practice, it seems like a two-pass parser would be ideal: one recursive pass to deal with templates and other substitution-type things, then a second pass with the actual grammar of most of the language. The first pass is of necessity recursive, so there's probably no point in having it spend the time to repeatedly parse italics or whatever, when it's just going to have to do it again when it substitutes stuff in. Further rendering passes are going to be needed, e.g., to insert the table of contents. Further parsing passes may or may not be needed. (A rough sketch of this two-pass split appears at the end of this message.)
4) All of this breaks a thousand different corner cases and half the parser tests. The implementers carefully go through every failed parser test, rewrite it to the actual output, and carefully justify why this is the correct course of action. Or just assume it is, depending on the level of care.
5) A PHP implementation of the exact same grammar is implemented. How practical this is, I don't know, but it's critical unless we want pretty substantially different behavior for people using the PHP module versus not. It is not acceptable to force third parties to use a PHP module, nor to grind their parser to a halt (which a naive compilation of the grammar into PHP would probably do).
6) Everything is rolled out live. Pages break left and right. Large complaint threads are started on the Village Pump, people fix it, and everyone forgets about it. Developers get a warm fuzzy feeling for having finally succeeded at destroying Parser.php.
This is if it's to be done properly. A semi-formal specification that's not directly useful for parsing pages would involve a lot less work and perhaps correspondingly less benefit. It could still improve operability with third parties dramatically; perhaps that's the only goal other people have in mind, not the ability to compile a parser with some yacc equivalent. I don't know.
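For illustration only, here is a minimal sketch of the two-pass split described in step 3, in Python rather than PHP; the function names and the toy template/bold handling are invented for the example and are nothing like the real Parser.php:

    import re

    def expand_templates(text, templates, depth=0):
        # First pass: recursively substitute {{name}} calls before any
        # markup parsing happens.  Real template syntax (parameters,
        # parser functions, noinclude, ...) is ignored here.
        if depth > 10:  # crude recursion guard
            return text
        def substitute(match):
            body = templates.get(match.group(1).strip(), '')
            return expand_templates(body, templates, depth + 1)
        return re.sub(r'\{\{([^{}|]+)\}\}', substitute, text)

    def parse_markup(text):
        # Second pass: the "actual grammar of most of the language"
        # would run here; this stand-in only converts bold.
        return re.sub(r"'''(.+?)'''", r'<b>\1</b>', text)

    def render(text, templates):
        return parse_markup(expand_templates(text, templates))

    print(render("Hello '''{{who}}'''", {'who': 'world'}))
    # -> Hello <b>world</b>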
On 11/7/07, Steve Bennett stevagewp@gmail.com wrote:
Not to mention that BNF is not really suited to the task. BNF is supposed to answer the question "does text A match grammar B?" However, essentially all wikitext is "valid" - so we're really looking for something that answers the question "how should text A be rendered" or "what is the meaning of text A" or even "how should text A be converted into a decorated* syntax tree".
BNF does that. The *language* generated by a grammar is distinct from the grammar itself: two grammars can be different but generate the same language. In this case, the language might be the set of all strings, but applying the grammar to a string gets us a parse tree, which is what we want. Specifically, yacc and similar programs (e.g., bison) will execute provided code snippets every time they encounter a particular terminal symbol from the grammar, or something like that, I gather. This should be able to include appending to an HTML output string.
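To make the "actions attached to grammar rules" idea concrete, here is a hedged sketch as a tiny hand-written recursive-descent parser (not a generated yacc/bison one); the grammar fragment and function are invented for the example and only handle ''italics'':

    def parse_inline(text):
        # inline ::= ( italic | plain-char )*
        # italic ::= "''" inline "''"
        # The "action" for each rule appends to an HTML output string,
        # roughly what the code snippets in yacc/bison actions would do.
        out, i = [], 0
        while i < len(text):
            if text.startswith("''", i):
                end = text.find("''", i + 2)
                if end == -1:  # unmatched quotes: emit them literally
                    out.append(text[i:])
                    break
                out.append('<i>' + parse_inline(text[i + 2:end]) + '</i>')
                i = end + 2
            else:
                out.append(text[i])
                i += 1
        return ''.join(out)

    print(parse_inline("plain ''italic'' plain"))
    # -> plain <i>italic</i> plain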
On 11/7/07, Simetrical Simetrical+wikilist@gmail.com wrote:
- A BNF grammar is developed that fits almost all the commonly-used features in. This will probably require unlimited lookahead, but I do think (without, admittedly, much of any formal grounding in the theory of all this) it's possible if that's allowed, keeping in mind the "almost all" caveat.
Doing a little more idle reading, I see that bison at least evidently only allows one-token lookahead (and since it's supposed to be strictly superior to yacc, presumably that does too). On the other hand, maybe a lexing specification alone would be of considerable use. If you tokenize things like apostrophes correctly, it seems to me you're halfway done . . . but I've talked about considerably more than I understand, as usual, and should shut up now.
On Wed, 07 Nov 2007 22:43:31 -0500, Simetrical wrote:
On 11/7/07, Simetrical Simetrical+wikilist@gmail.com wrote:
- A BNF grammar is developed that fits almost all the commonly-used features in. This will probably require unlimited lookahead, but I do think (without, admittedly, much of any formal grounding in the theory of all this) it's possible if that's allowed, keeping in mind the "almost all" caveat.
Doing a little more idle reading, I see that bison at least evidently only allows one-token lookahead (and since it's supposed to be strictly superior to yacc, presumably that does too). On the other hand, maybe a lexing specification alone would be of considerable use. If you tokenize things like apostrophes correctly, it seems to me you're halfway done . . . but I've talked about considerably more than I understand, as usual, and should shut up now.
I think that's true, if you tokenize correctly, that would go a long way. Unfortunately, there are a few constructs that make tokenization tricky. Apostrophe is the most obvious case; but {'s, and to a lesser extent ['s could have similar problems, since they would require substantial lookahead to tokenize.
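One hedged way to sidestep part of the apostrophe problem is to have the lexer emit apostrophe runs as counted tokens and defer the bold/italic decision (where the real lookahead problem lives) to a later pass. A rough Python sketch, not how the real parser works:

    import re

    def tokenize_quotes(line):
        # Emit runs of two or more apostrophes as counted QUOTES tokens
        # and everything else as TEXT; deciding which runs mean bold,
        # italic, or literal apostrophes is left to a later pass.
        tokens = []
        for match in re.finditer(r"'{2,}|[^']+|'", line):
            run = match.group(0)
            if run.startswith("''"):
                tokens.append(('QUOTES', len(run)))
            else:
                tokens.append(('TEXT', run))
        return tokens

    print(tokenize_quotes("l'''Uomo'' e ''cosi''"))
    # [('TEXT', 'l'), ('QUOTES', 3), ('TEXT', 'Uomo'), ('QUOTES', 2),
    #  ('TEXT', ' e '), ('QUOTES', 2), ('TEXT', 'cosi'), ('QUOTES', 2)]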
On 11/8/07, Steve Sanbeg ssanbeg@ask.com wrote:
I think that's true, if you tokenize correctly, that would go a long way. Unfortunately, there are a few constructs that make tokenization tricky. Apostrophe is the most obvious case; but {'s, and to a lesser extent ['s could have similar problems, since they would require substantial lookahead to tokenize.
According to flex documentation, it's perfectly happy to accept any regex for tokens, and will use unlimited lookahead and backtracking if necessary. It provides debug info allowing you to check for and eliminate backtracking, if you want to speed it up, but that's optional. Clearly it's not possible to tokenize MW markup with one-character lookahead: you sure can't tell the difference between a second- and sixth-level heading, and of course that's even ignoring stuff like ISBN handling that's less basic and more disposable.
On 11/9/07, Simetrical Simetrical+wikilist@gmail.com wrote:
According to flex documentation, it's perfectly happy to accept any regex for tokens, and will use unlimited lookahead and backtracking if necessary. It provides debug info allowing you to check for and eliminate backtracking, if you want to speed it up, but that's optional. Clearly it's not possible to tokenize MW markup with one-character lookahead: you sure can't tell the difference between a second- and sixth-level heading, and of course that's even ignoring
Yes you can, if ====== is a token. Which at first glance, it should be. The fact that == looks like === looks like ==== is neither here nor there to the grammar - it's a handy mnemonic for humans, that's all.
stuff like ISBN handling that's less basic and more disposable.
What's wrong with ISBN handling? I don't see anything problematic in an "ISBN" token that consumes a following sequence of digits, possibly with hyphens and crap.
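For what it's worth, a hedged sketch of such an ISBN token as a single regex; the real magic-link rules in MediaWiki are fussier (spacing, the X check digit, and so on), so this is illustrative only:

    import re

    # "ISBN" followed by a digit, then more digits, hyphens, or an X
    # check digit; trailing punctuation is correctly left out.
    ISBN_TOKEN = re.compile(r'ISBN\s+[0-9][0-9Xx-]*')

    print(ISBN_TOKEN.findall("See ISBN 0-306-40615-2 and ISBN 1234567890."))
    # ['ISBN 0-306-40615-2', 'ISBN 1234567890']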
Is there a definitive list of the real problems with the current "grammar"? We've mentioned two so far: bold/italic apostrophes, and nested lists. I imagine many of the ambiguities relating to | in templates, parser functions, tables and the like would be another. What else is there?
Steve
"Steve Bennett" stevagewp@gmail.com wrote in message news:b8ceeef70711082024p38e9e612pf8063a4379114ca6@mail.gmail.com...
Is there a definitive list of the real problems with the current "grammar"? We've mentioned two so far: bold/italic apostrophes, and nested lists. I imagine many of the ambiguities relating to | in templates, parser functions, tables and the like would be another. What else is there?
[[link suffix]]es?
- Mark Clements (HappyDog)
Mark Clements schreef:
[[link suffix]]es?
For some obscure reason, the parser's behavior here is pretty weird as well. The suffix can only contain A-Za-z, nothing else. This means that:
[[link suffix]]es is rendered as <link suffixes>
[[link suffix]]es are good is rendered as <link suffixes> are good
[[Firefox 2]].0 is rendered as <Firefox 2>.0 (although I can use [[Firefox 2|Firefox 2.0]] if I really want to)
[[Brion Vibber]]'s is rendered as <Brion Vibber>'s (although there's still the piped link option)
I'm not saying this is a big problem ('cause one can always use the pipe trick), but I fail to see the logic. Even digits can't be part of a suffix, just letters.
Roan Kattouw (Catrope)
On Nov 9, 2007 10:44 PM, Roan Kattouw roan.kattouw@home.nl wrote:
Mark Clements schreef:
[[link suffix]]es?
For some obscure reason, the parser's behavior here is pretty weird as well. The suffix can only contain A-Za-z, nothing else. This means that:
Well then, should it just take everything until the next whitespace?
On 11/9/07, Stephen Bain stephen.bain@gmail.com wrote:
For some obscure reason, the parser's behavior here is pretty weird as well. The suffix can only contain A-Za-z, nothing else. This means that:
Well then, should it just take everything until the next whitespace?
No, that's bad too, for this [[reason]]. Do you want to link that last full stop? No? Why do you want to link the dot in [[Firefox 2]].0? Ok, so you want to link [[Brion]]'s apostrophe, but do you want to link a '[[quoting]]' apostrophe? Why not?
These types of syntax rules always run into grey-area problems. Plural [[link]]s were a good idea, but unless you can come up with a hard rule like "everything up until the next whitespace" as suggested, you're always going to end up in this murky area.
Perhaps it would be better to have an unambiguous syntax like:
[[Link||ing]], [[Firefox 2||.0]], [[Brion||'s]] apostrophe etc. It's no wordier (save an extra pipe) than at present, and you can see exactly what is and isn't covered by the link. With a bit of effort we could even clean up the pipe trick:
[[Sydney|, Australia|]]. [[Nice| (programming language)|]]. Or something. Again, rather than relying on dodgy rules like detecting "(context)" and ", location", make the user specify it without ambiguity.
But I'm digressing a bit.
Steve
Perhaps it would be better to have an unambiguous syntax like:
[[Link||ing]], [[Firefox 2||.0]], [[Brion||'s]] apostrophe etc. It's no wordier (save an extra pipe) than at present, and you can see exactly what is and isn't covered by the link. With a bit of effort we could even clean up the pipe trick:
[[Sydney|, Australia|]]. [[Nice| (programming language)|]]. Or something. Again, rather than relying on dodgy rules like detecting "(context)" and ", location", make the user specify it without ambiguity.
But I'm digressing a bit.
Where a double pipe works like the current pipe trick, but includes the page title at the beginning, and having text in between two pipes is like that, but backwards - it includes the extra text in the page title, but not the link text? I like it. Probably not the best time to be introducing new syntax though - once we have the current syntax sorted out, it should be easy to add new stuff (at least that's the plan, as I understand it).
Steve Bennett wrote:
Perhaps it would be better to have an unambiguous syntax like:
[[Link||ing]], [[Firefox 2||.0]], [[Brion||'s]] apostrophe etc. It's no wordier (save an extra pipe) than at present, and you can see exactly what is and isn't covered by the link. With a bit of effort we could even clean up the pipe trick:
[[Sydney|, Australia|]]. [[Nice| (programming language)|]]. Or something. Again, rather than relying on dodgy rules like detecting "(context)" and ", location", make the user specify it without ambiguity.
I think work on a clean grammar and a slick parser are among the most important discussions I've ever read on here, and it's good to see it going somewhere. In particular I think the business with apostrophes is horrible and I have no idea how it ever got passed as intuitive. Nevertheless, please don't throw out the baby with the bathwater - the [[inner li]]nk syntax is one of the bits of the Wiki syntax that I think works really well, and I'd hate to see it get bogged down with yet more of the dreaded pipes, which honestly many users will not even know how to type.
Soo
Nevertheless, please don't throw out the baby with the bathwater - the [[inner li]]nk syntax is one of the bits of the Wiki syntax that I think works really well, and I'd hate to see it get bogged down with yet more of the dreaded pipes, which honestly many users will not even know how to type.
Very true. I should have been clearer when voicing my support - I'd like this new syntax *in addition* to the current one.
On 09/11/2007, Soo Reams soo@sooreams.com wrote:
I think work on a clean grammar and a slick parser are among the most important discussions I've ever read on here, and it's good to see it going somewhere. In particular I think the business with apostrophes is horrible and I have no idea how it ever got passed as intuitive.
Some of the apostrophe stuff is important in languages other than English, e.g. Italian, where a construct like l'''Uomo'' being parsed as l<i>Uomo</i> is expected and useful behaviour. Stuff like this in the parser has good reason to be there. Take care when deciding which bits are useful.
- d.
On Fri, Nov 09, 2007 at 06:21:42PM +0000, David Gerard wrote:
On 09/11/2007, Soo Reams soo@sooreams.com wrote:
I think work on a clean grammar and a slick parser are among the most important discussions I've ever read on here, and it's good to see it going somewhere. In particular I think the business with apostrophes is horrible and I have no idea how it ever got passed as intuitive.
Some of the apostrophe stuff is important in languages other than English, e.g. Italian, where a construct like l'''Uomo'' being parsed as l<i>Uomo</i> is expected and useful behaviour. Stuff like this in the parser has good reason to be there. Take care when deciding which bits are useful.
And, as has been pointed out one of the last 3 times we did this,
''
is a valid punctuation character in at least one language.
Cheers, -- jra
On 09/11/2007, Roan Kattouw roan.kattouw@home.nl wrote:
Mark Clements schreef:
[[link suffix]]es?
For some obscure reason, the parser's behavior here is pretty weird as well. The suffix can only contain A-Za-z, nothing else.
That's not true. The suffix syntax is dependent on the content language (because of e.g. accented letters in some languages), and it is defined by a regular expression in $linkTrail in MessagesXx.php.
(And BNF that! :-x)
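For illustration, a hedged sketch of how a link-trail regex of that kind might be applied; the pattern below is an assumption modelled on the English letters-only trail, and the exact expression in the Messages file may differ:

    import re

    # Assumed English-style link trail: lowercase letters only.
    LINK_TRAIL = re.compile(r'^([a-z]+)(.*)$', re.S)

    def apply_trail(target, following_text):
        match = LINK_TRAIL.match(following_text)
        trail, rest = match.groups() if match else ('', following_text)
        return '<a href="/wiki/%s">%s%s</a>%s' % (target, target, trail, rest)

    print(apply_trail('link suffix', 'es are good'))
    # <a href="/wiki/link suffix">link suffixes</a> are good
    print(apply_trail('Firefox 2', '.0'))
    # <a href="/wiki/Firefox 2">Firefox 2</a>.0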
-- [[cs:User:Mormegil | Petr Kadlec]]
I'm not saying this is a big problem ('cause one can always use the pipe trick), but I fail to see the logic. Even digits can't be part of a suffix, just letters.
I think to understand the logic, you need to understand the intention. I think the feature is intended to make plurals easier, and it works fine for those. Anything else should use the pipe trick, since it's too difficult for the parser to guess what is wanted.
On Fri, 09 Nov 2007 15:24:10 +1100, Steve Bennett wrote:
On 11/9/07, Simetrical Simetrical+wikilist@gmail.com wrote:
According to flex documentation, it's perfectly happy to accept any regex for tokens, and will use unlimited lookahead and backtracking if necessary. It provides debug info allowing you to check for and eliminate backtracking, if you want to speed it up, but that's optional. Clearly it's not possible to tokenize MW markup with one-character lookahead: you sure can't tell the difference between a second- and sixth-level heading, and of course that's even ignoring
Yes you can, if ====== is a token. Which at first glance, it should be. The fact that == looks like === looks like ==== is neither here nor there to the grammar - it's a handy mnemonic for humans, that's all.
Well, that's exactly the point. At first glance, === is obviously a token, which will perfectly handle 99% of the headings out there. But if we want a complete grammar, we really need sane handling for the last 1%.
To get those into one token would require the tokenizer to do a bit of parsing to match things up; however, if the tokenizer just recognises that it is a token and passes a value along to the parser, so the parser can deal with the values, that would probably be a cleaner implementation.
I'm not sure if there's a notation for values in EBNF, so to invent one for this example, treating
===head==
as "==" TEXT("=head") "==" would be nice, but tricky. as "="(3) TEXT("head") "="(2) would make for a cleaner lexer, and the parser should be able to handle that without too much trouble.
On Thu, 08 Nov 2007 20:26:33 -0500, Simetrical wrote:
On 11/8/07, Steve Sanbeg ssanbeg@ask.com wrote:
I think that's true, if you tokenize correctly, that would go a long way. Unfortunately, there are a few constructs that make tokenization tricky. Apostrophe is the most obvious case; but {'s, and to a lesser extent ['s could have similar problems, since they would require substantial lookahead to tokenize.
According to flex documentation, it's perfectly happy to accept any regex for tokens, and will use unlimited lookahead and backtracking if necessary. It provides debug info allowing you to check for and eliminate backtracking, if you want to speed it up, but that's optional. Clearly it's not possible to tokenize MW markup with one-character lookahead: you sure can't tell the difference between a second- and sixth-level heading, and of course that's even ignoring stuff like ISBN handling that's less basic and more disposable.
But some constructs in MW require an FSM to tokenize, not a regex. Clearly, properly tokenizing bold/italics requires complex processing on an entire paragraph of text. Even templates and links are a little complex, but should be doable by maintaining states with a stack.
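A rough sketch of the "states with a stack" idea for {{ }} and [[ ]], ignoring everything else wikitext allows inside them; a real lexer would emit tokens rather than a boolean:

    def brackets_balanced(text):
        # Track nested {{ }} and [[ ]] with a stack; a real tokenizer
        # would emit OPEN/CLOSE tokens here instead of just reporting
        # whether everything was closed.
        closers = {'{{': '}}', '[[': ']]'}
        stack, i = [], 0
        while i < len(text):
            pair = text[i:i + 2]
            if pair in closers:
                stack.append(closers[pair])
                i += 2
            elif stack and pair == stack[-1]:
                stack.pop()
                i += 2
            else:
                i += 1
        return not stack

    print(brackets_balanced('{{infobox|image=[[Image:foo.jpg|thumb]]}}'))  # True
    print(brackets_balanced('[[unclosed link'))                            # False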
On 11/9/07, Steve Sanbeg ssanbeg@ask.com wrote:
But some constructs in MW require an FSM to tokenize, not a regex. Clearly, properly tokenizing bold/italics requires complex processing on an entire paragraph of text. Even templates and links are a little complex, but should be doable by maintaining states with a stack.
FSMs accept regular languages by definition, so the set of things an FSM can recognize is precisely equal to that which can be specified by a regex. :) Regardless, I take your point, and don't know enough about the subject matter to address it. I can look at the flexbisonparse lexer and parser:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/flexbisonparse/wikilex.l?rev... http://svn.wikimedia.org/viewvc/mediawiki/trunk/flexbisonparse/wikiparse.y?r...
but I can't really understand what it does, or whether it works properly, at least not without figuring out how to install the thing and run the parser tests on it.
Simetrical wrote:
On 11/9/07, Steve Sanbeg ssanbeg@ask.com wrote:
But some constructs in MW require an FSM to tokenize, not a regex. Clearly, properly tokenizing bold/italics requires complex processing on an entire paragraph of text. Even templates and links are a little complex, but should be doable by maintaining states with a stack.
FSMs accept regular languages by definition, so the set of things an FSM can recognize is precisely equal to that which can be specified by a regex. :)
In fact regexes as seen in PHP etc are more powerful than FSMs, since they can include back references and suchlike. But I presume PHP compiles regexes down to efficient FSMs if they don't include such constructs, so it probably doesn't make much difference in performance terms.
Soo Reams
On Fri, 09 Nov 2007 19:23:44 +0000, Soo Reams wrote:
Simetrical wrote:
On 11/9/07, Steve Sanbeg ssanbeg@ask.com wrote:
But some constructs in MW require an FSM to tokenize, not a regex. Clearly, properly tokenizing bold/italics requires complex processing on an entire paragraph of text. Even templates and links are a little complex, but should be doable by maintaining states with a stack.
FSMs accept regular languages by definition, so the set of things an FSM can recognize is precisely equal to that which can be specified by a regex. :)
In fact regexes as seen in PHP etc are more powerful than FSMs, since they can include back references and suchlike. But I presume PHP compiles regexes down to efficient FSMs if they don't include such constructs, so it probably doesn't make much difference in performance terms.
Soo Reams
IIRC, accept means that if the language is tokenized correctly, it can give a yes/no whether the input stream is valid. I don't think this helps much when trying to tokenize it to begin with.
Wouldn't regexes always be compiled to FSMs, regardless of language or constructs?
On 11/9/07, Steve Sanbeg ssanbeg@ask.com wrote:
Wouldn't regexes always be compiled to FSMs, regardless of language or constructs?
Not if they include back-references, assertions, or other things not part of "real" regexes. A real regex can be defined as follows:
1) Any character is a regex.
2) Any regex followed by a * is a regex.
3) Any two regexes concatenated together form a regex.
4) Any two regexes joined with a | character form a regex.
Stuff like [a-z] is just shorthand for (a|b|c|...|z), so that's still a proper regex, but things like back-references aren't possible to translate into those four rules, so you have to use something more sophisticated than an FSM.
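A concrete illustration (in Python) of why a back-reference steps outside what an FSM can recognise: the pattern below accepts exactly the strings of the form ww, a classic non-regular language.

    import re

    # \1 must repeat whatever group 1 matched, i.e. the language
    # { ww : w is a non-empty string of a's and b's }, which is not regular.
    DOUBLED = re.compile(r'^([ab]+)\1$')

    print(bool(DOUBLED.match('abab')))   # True  (w = "ab")
    print(bool(DOUBLED.match('abaab')))  # False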
Steve Sanbeg wrote:
IIRC, accept means that if the language is tokenized correctly, it can give a yes/no whether the input stream is valid. I don't think this helps much when trying to tokenize it to begin with.
That can only happen at a level above tokenization, i.e. parsing. To take a C example, "! &&;" is a perfectly legal set of tokens, but clearly not in the language. Also, as noted elsewhere, wikitext is basically the set of all strings, since we don't want to generate "compilation errors".
Wouldn't regexes always be compiled to FSMs, regardless of language or constructs?
Not FSMs, no. Perl-style regexes can do things that no FSM can do. For example, since FSMs are memoryless, they can't include back-references. I imagine they are compiled to something, but I couldn't say what. My argument was that PHP is probably smart enough to recognize regexes which don't include these extra features, and compile them to FSMs, since an FSM is such an efficient implementation.
Anyway this is getting off topic, since the discussion was over whether an FSM is adequate to tokenize wikitext. I don't think this question has been answered yet, but if the answer is yes then even a true regex (Kleene-style) is also powerful enough. So that's fine.
Soo Reams
Also, as noted elsewhere, wikitext is basically the set of all strings, since we don't want to generate "compilation errors".
Are we sure we don't? Making certain sequences simply invalid would solve quite a few problems. It would also be in keeping with the principle of least surprise - if we can't be sure what the user is trying to do, it's best to ask rather than guess and get it wrong.
On Fri, 09 Nov 2007 22:25:34 +0000, Thomas Dalton wrote:
Also, as noted elsewhere, wikitext is basically the set of all strings, since we don't want to generate "compilation errors".
Are we sure we don't? Making certain sequences simply invalid would solve quite a few problems. It would also be in keeping with the principle of least surprise - if we can't be sure what the user is trying to do, it's best to ask rather than guess and get it wrong.
There are some sequences that are invalid, such as unmatched XML tags. Currently, this just does the most natural thing, which causes the tag to grab the whole rest of the article. An error that caused the save to fail quickly and obviously, rather than the subtle failures during rendering, would solve a few problems.
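A hedged sketch of the kind of pre-save check that could flag this, using an invented whitelist of paired extension tags and simply comparing open/close counts:

    import re

    def unmatched_tags(text, tag_names=('ref', 'nowiki', 'pre')):
        # Flag paired tags whose open/close counts differ; self-closing
        # forms like <ref /> are skipped entirely, so they count as balanced.
        problems = []
        for name in tag_names:
            opens = len(re.findall(r'<%s\b[^>/]*>' % name, text))
            closes = len(re.findall(r'</%s\s*>' % name, text))
            if opens != closes:
                problems.append((name, opens, closes))
        return problems

    print(unmatched_tags('a<ref>b</ref>c<ref name="x">d'))
    # [('ref', 2, 1)]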
On Nov 12, 2007 3:26 PM, Steve Sanbeg ssanbeg@ask.com wrote:
On Fri, 09 Nov 2007 22:25:34 +0000, Thomas Dalton wrote:
Also, as noted elsewhere, wikitext is basically the set of all strings, since we don't want to generate "compilation errors".
Are we sure we don't? Making certain sequences simply invalid would solve quite a few problems.
There are some sequences that are invalid, such as unmatched XML tags.
Also unmatched HTML comments. <!-- like this
I use them all the time to quickly "comment out" the remainder of an article.
-- Jim
On 11/10/07, Steve Sanbeg ssanbeg@ask.com wrote:
But some constructs in MW require an FSM to tokenize
Our Noodly Master can tokenise anything. I think we manage to tokenise them ourselves in the current parser, though, so I don't think he's REQUIRED to tokenise them.
On Sat, 10 Nov 2007 10:28:32 +1100, Andrew Garrett wrote:
On 11/10/07, Steve Sanbeg ssanbeg@ask.com wrote:
But some constructs in MW require an FSM to tokenize
Our Noodly Master can tokenise anything. I think we manage to tokenise them ourselves in the current parser, though, so I don't think he's REQUIRED to tokenise them.
OK, but the point is that some constructs are too complicated for flex to parse with simple regular expressions, mostly because they require some state information to find the tokens. So adding an FSM seems like the most reasonable way to tokenize for a bison parser.
Yeah, maybe not required; technically, even flex isn't required.
Wikitext is impossible to formulate in EBNF (and therefore BNF too). That's because BNF can only handle languages falling under certain constraints, and Wikitext fails these constraints. To prove this:
The operating constraints are:
1. there are a limited number of basic syntax constructs
2. all valid syntax is either basic syntax, or complex syntax built from basic syntax which it can be broken down into
It's clear that BNF can't formulate outside these constraints because if (1) was false, you'd never get to the end of the specification, and (2) can't be false simply because it restates (1) but also allows for languages with nested syntax. If it were false that the nested/complex syntax could be broken down into basic syntax, then that complex syntax would simply BE basic syntax, and you are left with just rule (1).
Wikitext fails these constraints because the construct:
**bullet 1
**bullet 2
which are two successive level 2 bullet points, can't be broken down into less complex syntax. Therefore the "**" construct for a level 2 bullet point must be BASIC syntax, not complex syntax. But seeing as Wikitext, just like XHTML etc., allows for bulleted lists with infinite nesting levels, there would be an infinite number of basic constructs (for level 2 list items, level 3, 4, 5, 17, 234, etc). We thus fail constraint number (1).
To prove that the above Wikitext can't be broken down, observe the following. If there is to be any breaking down of a level 2 bullet point, it will be comprised in some way of two syntax constructs for level 1 bullet points. For instance, the XHTML version of a level 2 bullet, "<ul><li><ul><li>level 2 bullet</li></ul></li></ul>", involves two sets of "<li>...</li>" pairs, each of which is the syntax for a level 1 bullet point. Now, whilst the Wikitext version, "**bullet" can similarly be broken down into a level 1 bullet containing a second level 1 bullet (*[*..]), this doesn't hold for the case where we have two successive level 2 bullets. "**bullet <newline> **bullet" means "ONE level 1 bullet, containing TWO level 2 bullets". But our previous breakdown of the "**" construct would have to interpret this as "one level 1 bullet containing ONE level 2 bullet, followed by ANOTHER level 1 bullet, containing a level 2 bullet." This is wrong. Our problem is simply that the syntax for adding a bullet at the SAME level has changed now that we're dealing with another level of bullets - at level 1, another bullet at the same level is "*", but at level 2, another bullet at the same level is "**" - but this means that each level of bullets has its own syntax, meaning each bullet level construct is a basic construct, and therefore that there are an infinite number of constructs.
This is the only major problem that makes Wikitext beyond BNF. To express Wikitext in BNF, we'd have to introduce a limitation on the number of levels of bullet points you could use - this might, of course, not be a problem. However, we would still be stuck on the yet trickier problem of expressing in BNF nested lists of different types. Thus if we wish to have an unordered bullet containing an ordered bullet containing an unordered bullet ("*#*"), the syntax for a further unordered bullet at the same level changes AGAIN to "*#*" rather than the expected "***". There is thus no hope of defining Wikitext in BNF unless we exhaustively specify every combination of bulleting constructs as basic constructs, but this would quickly become too long a specification to be useful as a markup validator. We have 4 list types (definition, ordered, unordered, and mere indentation) that can be combined - going up to just ten levels, we have 1,048,576 basic constructs.
What this means is that we can't use a basic EBNF parser in all the usual useful ways. We need new solutions.
On 11/13/07, Virgil Ierubino virgil.ierubino@gmail.com wrote:
It's clear that BNF can't formulate outside these constraints because if (1) was false, you'd never get to the end of the specification, and (2) can't be false simply because it restates (1) but also allows for languages with nested syntax. If it were false that the nested/complex syntax could be broken down into basic syntax, then that complex syntax would simply BE basic syntax, and you are left with just rule (1).
Wikitext fails these constraints because the construct:
**bullet 1
**bullet 2
which are two successive level 2 bullet points, can't be broken down into less complex syntax. Therefore the "**" construct for a level 2 bullet point must be BASIC syntax, not complex syntax. But seeing as Wikitext, just like XHTML etc., allows for bulleted lists with infinite nesting levels, there would be an infinite number of basic constructs (for level 2 list items, level 3, 4, 5, 17, 234, etc). We thus fail constraint number (1).
This was my initial reaction. However, I don't think it's actually that important. Because in fact this:
**foo
##blah

*is* valid syntax. As is this:

*foo
**blah
#*blah
Which means that each line can be parsed on its own merits, then a subsequent pass can perform the code generation. This will likely be the general story for the new parser: a traditional parser model with a couple of hacks to cope with the nuances of wikitext, as opposed to a parser built with hacks from the ground up.
wrong. Our problem is simply that the syntax for adding a bullet at the SAME level has changed now that we're dealing with another level of bullets - at level 1, another bullet at the same level is "*", but at level 2, another bullet at the same level is "**" - but this means that each level of bullets has its own syntax, meaning each bullet level construct is a basic construct, and therefore that there are an infinite number of constructs.
I think honestly a list element will just be defined as an arbitrary sequence of :, # and *, followed by text. EBNF is incapable of expressing how that sequence should be rendered, but that's not a showstopper.
"*#*" rather than the expected "***". There is thus no hope of defining
Wikitext in BNF unless we exhaustively specify every combination of
There is no hope of *fully defining* Wikitext in BNF...
What this means is that we can't use a basic EBNF parser in all the usual useful ways. We need new solutions.
What this means is that we can't use *just* a basic EBNF parser. We will need an EBNF parser with some hacks/tweaks/special cases.
Steve
On 11/13/07, Steve Bennett stevagewp@gmail.com wrote:
On 11/13/07, Virgil Ierubino virgil.ierubino@gmail.com wrote:
It's clear that BNF can't formulate outside these constraints because if (1) was false, you'd never get to the end of the specification, and (2) can't be false simply because it restates (1) but also allows for languages with nested syntax. If it were false that the nested/complex syntax could be broken down into basic syntax, then that complex syntax would simply BE basic syntax, and you are left with just rule (1).
Wikitext fails these constraints because the construct:
**bullet 1
**bullet 2
I think there's a pretty simple solution.
Given some input:
*a
*#b
*#c
***d
*##e
First, naively turn *a into <ul><li>a</li></ul> etc on a line by line basis:
<ul><li>a</li></ul>
<ul><ol><li>b</li></ol></ul>
<ul><ol><li>c</li></ol></ul>
<ul><ul><ul><li>d</li></ul></ul></ul>
<ul><ol><ol><li>e</li></ol></ol></ul>
This can be done with a BNF-based parser, I think.
Then, simply repeatedly collapse adjacent pairs of </ul><ul> and </ol><ol>:
<ul><li>a</li>
<ol><li>b</li>
<li>c</li></ol>
<ul><ul><li>d</li></ul></ul>
<ol><ol><li>e</li></ol></ol></ul>
I think this will always yield the result we want. Not that I've tried to prove it or anything ;)
Interestingly, the second phase is only actually required for ordered lists. Unordered lists and indented lists (:) would probably render ok after the first phase.
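For concreteness, a minimal Python sketch of those two phases, assuming every input line is a list line and handling only the * and # markers:

    import re

    def render_lists(lines):
        # Phase 1: translate each line naively, one <ul>/<ol> per marker.
        html = []
        for line in lines:
            markers, text = re.match(r'([*#]*)(.*)', line).groups()
            opens = ''.join('<ul>' if ch == '*' else '<ol>' for ch in markers)
            closes = ''.join('</ul>' if ch == '*' else '</ol>'
                             for ch in reversed(markers))
            html.append(opens + '<li>' + text + '</li>' + closes)
        out = ''.join(html)
        # Phase 2: repeatedly collapse adjacent close/open pairs.
        changed = True
        while changed:
            changed = False
            for pair in ('</ul><ul>', '</ol><ol>'):
                if pair in out:
                    out = out.replace(pair, '')
                    changed = True
        return out

    print(render_lists(['*a', '*#b', '*#c']))
    # <ul><li>a</li><ol><li>b</li><li>c</li></ol></ul>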
Steve
On 13/11/2007, Steve Bennett stevagewp@gmail.com wrote:
I think there's a pretty simple solution.
Given some input:
*a
*#b
*#c
***d
*##e
First, naively turn *a into <ul><li>a</li></ul> etc on a line by line basis:
<ul><li>a</li></ul>
<ul><ol><li>b</li></ol></ul>
<ul><ol><li>c</li></ol></ul>
<ul><ul><ul><li>d</li></ul></ul></ul>
<ul><ol><ol><li>e</li></ol></ol></ul>
This can be done with a BNF-based parser, I think.
Then, simply repeatedly collapse adjacent pairs of </ul><ul> and </ol><ol>
We already know there is a solution to the problem of "converting Wikitext into XHTML" - which is what you solve here. We already do it! The problem I am claiming is impossible to solve is that of *expressing Wikitext in EBNF*. Whilst this is true, as you point out, it may not be a problem as long as we use a quirky parser that understands this quirk.
If we do manage to circumvent this problem of lists, then I am fairly confident that we DO in fact already have a complete expression of Wikitext in EBNF - to my knowledge this was the only major barrier. So imagining we do have this solution, what's next?
After a long and complicated analysis, Virgil Ierubino concluded:
What this means is that we can't use a basic EBNF parser in all the usual useful ways. We need new solutions.
And the answer is: No, we don't need any of this. Because the current solution works fine.
This whole discussion is without point. There is no need for change, is there? If it ain't broken, don't fix it.
Or, to put that logic in reverse: For this discussion to become meaningful, a purpose is required. Can you find a purpose? Bring it here. Show us a real problem that needs a solution.
Two years ago I had the idea that if I could just parse the wikitext of the database dumps, I could extract useful information in tabular form from the templates used as infoboxes. One year ago, I wrote a very simple Perl script with regexps to do this. It isn't the full wikitext parser, since it only cares for template calls, {{this|kind|of=syntax}}. It uses regexps rather than BNF. And even with this limited task, it only does a half-hearted job. My naive parser gets confused as soon as [[image: ...|thumb]] syntax is mixed with {{template}} syntax. So it's a pretty primitive (dumb!) parser for wikitext. But it does 98% of the job. It helps me find useful information and it helps me to locate errors that I can correct.
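Something in the spirit of that script, as a Python sketch; it shares the same weakness with nested brackets that Lars describes, since it only matches the innermost {{...}}:

    import re

    # Pull the name and parameters out of the innermost {{...}} calls;
    # nested templates or [[Image:...|thumb]] links will confuse it,
    # just like the Perl original described above.
    TEMPLATE = re.compile(r'\{\{([^{}]+)\}\}')

    def template_calls(wikitext):
        calls = []
        for body in TEMPLATE.findall(wikitext):
            name, *params = [part.strip() for part in body.split('|')]
            calls.append((name, params))
        return calls

    print(template_calls('{{Infobox city|name=Example|population=0}}'))
    # [('Infobox city', ['name=Example', 'population=0'])]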
What I have is a purpose. I needed a parser, and I wrote one.
But you guys are all talking about creating a parser, the ultimate one, for which you really don't have any need. You were talking about this five years ago. You still haven't produced any useful parser. And you probably will discuss this five years from now.
-- Lars Aronsson (lars@aronsson.se) Aronsson Datateknik - http://aronsson.se
On 13/11/2007, Lars Aronsson lars@aronsson.se wrote:
For this discussion to become meaningful, a purpose is required. Can you find a purpose? Bring it here. Show us a real problem that needs a solution.
There is no clear purpose to expressing Wikitext in EBNF, simply because the possibilities of the use of such an expression are undefined and large. The EBNF expression could be used to create a validator, and a validator could be used to warn users when they've accidentally typed bad syntax - or simply to determine when a page is and is not written in Wikitext. It could be used to build a parser in a sense that generated a DOM (Document Tree) for an entire wikitext document, which would allow it to be very easily converted into XHTML without the need for a complex parser/compiler ad-hoc combination as is currently used. Furthermore, the generated DOM would allow easy conversion of the Wikitext into ANY language, particularly any XML language, facilitating easy syndication, mashups, presentation in other mediums, etc. Having a defined DOM would also allow things like very, very fine-grained editing or other actions (i.e. actions applied to very small segments of a page), DOM alteration similar to JavaScript in HTML, etc.
I'm no computer scientist, so my evaluation of the use of the EBNF is limited, but I can definitely see strong uses. EBNF is a well-recognised standard for defining languages, it has open-source tools already built around it (parsers, etc), and it's a step towards clearing up the mess that is currently wikitext.
On 13/11/2007, Steve Bennett stevagewp@gmail.com wrote:
I don't think it's actually that important. Because in fact this:
**foo
##blah

*is* valid syntax. As is this:

*foo
**blah
#*blah
[...]
I think honestly a list element will just be defined as an arbitrary sequence of :, # and *, followed by text
This is not true. It is not arbitrary what symbols come before the last one. Each symbol defines a list-type, and it's not only the final list-type that counts. A string of symbols beginning with an asterisk will output a nested list beginning with a bullet point - a string beginning with a hash will output a list beginning with a number.
Virgil Ierubino wrote:
There is no clear purpose to expressing Wikitext in EBNF, simply because the possibilities of the use of such an expression are undefined and large. The EBNF expression could be used to create a validator, and a validator could be used to warn users when they've accidentally typed bad syntax - or simply to determine when a page is and is not written in Wikitext.
This is a valid purpose or problem, but an EBNF parser is not a good solution to it. The kind of parsers you build with YACC or Bison, or even hand-written recursive descent parsers, are good at parsing correct language, but not very good at reporting syntax errors in a way that is useful for corrections. One difference between GCC (the GNU C compiler) and many (early) commercial C/C++ compilers is that GCC gives very useful error messages, because it knows what mistakes developers typically make. This wisdom is normally not encoded in a BNF grammar.
Suppose you wanted to solve the problem described above. You'd start by downloading a Wikipedia database dump containing the complete edit history. For practical reasons, you'd start with one of the smaller languages, such as Latin or Faroese. You'd then go through every edit to find patterns of the most common minor corrections. Perhaps mismatching ''' and '' or === and ==, which result in <i>'and</i> or <h2>=and</h2>. Then you'd go through a database dump of the current versions to find such error patterns. Obviously, regexp matching is superior to any BNF parser for this. You can either post a list of errors found, or make a toolserver application where a user can click for alternative corrections that are semi-automatically applied. This is hard work. It is useful work. You could be busy for a year doing this. And it would make Wikipedia better. The actual implementation would use whatever languages and tools that are best fit to solve the problem. You'd be the hero at the next Wikimania conference when you present your paper about this work. But the project doesn't start with creating an EBNF parser.
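A hedged sketch of one such regexp scan, looking only for headings whose opening and closing runs of = differ; the pattern is illustrative, not what any existing tool does:

    import re

    HEADING = re.compile(r'^(=+)[^=\n].*?[^=\n](=+) *$', re.M)

    def suspicious_headings(page_text):
        # Report headings where the runs of '=' on each side differ,
        # e.g. "===History==", which renders with a stray '=' in the title.
        return [m.group(0) for m in HEADING.finditer(page_text)
                if len(m.group(1)) != len(m.group(2))]

    print(suspicious_headings('==Intro==\n===History==\n==See also=='))
    # ['===History==']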
On 11/13/07, Lars Aronsson lars@aronsson.se wrote:
Bison, or even hand-written recursive descent parsers, are good at parsing correct language, but not very good at reporting syntax errors in a way that is useful for corrections. One difference
I'm curious how this will turn out in practice. As has been pointed out, there aren't really "syntax errors" in Wikitext, just text that renders differently from what you expected. I can't really visualise what effect switching from a regexp-style parser to a recursive one will make on the appearance of errors. Hopefully not a great one.
I think my main motivation for working on a new and better parser is to give MediaWiki room to grow. It's currently really stifled by the many layers of text replacement that take place, and developers continually express a fear of "breaking the parser". That inhibits the development of new features.
Steve
On 13/11/2007, Steve Bennett stevagewp@gmail.com wrote:
I think my main motivation for working on a new and better parser is to give MediaWiki room to grow. It's currently really stifled by the many layers of text replacement that take place, and developers continually express a fear of "breaking the parser". That inhibits the development of new features.
IMO the main problem is a lack of interoperability and valid alternate implementations. The syntax is currently literally defined as whatever falls out of the PHP parser. That's not good enough for the big world and people are routinely horrified when I explain that's the reason MediaWiki doesn't have all manner of obvious useful things like a WYSIWYG editor, format converters, etc.
- d.
IMO the main problem is a lack of interoperability and valid alternate implementations. The syntax is currently literally defined as whatever falls out of the PHP parser. That's not good enough for the big world and people are routinely horrified when I explain that's the reason MediaWiki doesn't have all manner of obvious useful things like a WYSIWYG editor, format converters, etc.
Agreed. But I would also add that a nice (but low priority) characteristic of an alternate parser would be that it can be easily reused in other apps besides MediaWiki. Lots of web apps seem to reinvent some HTML-lite syntax, and it would be beneficial if multiple apps could just drop in our parser (thus saving users from learning yet another syntax, and saving programmers from having to reinvent the wheel). Having experimented once a long long time ago with extracting the parser, and having found that it required lots of setup and global variables to work, I can understand why this doesn't happen currently. So a loosely-coupled parser that encouraged reuse could be nice, and is related somewhat to encouraging interoperability.
-- All the best, Nick.
On Wed, Nov 14, 2007 at 12:01:07PM +1100, Nick Jenkins wrote:
Agreed. But I would also add that a nice (but low priority) characteristic of an alternate parser would be that it can be easily reused in other apps besides MediaWiki. Lots of web apps seem to reinvent some HTML-lite syntax, and it would be beneficial if multiple apps could just drop in our parser (thus saving users from learning yet another syntax, and saving programmers from having to reinvent the wheel). Having experimented once a long long time ago with extracting the parser, and having found that it required lots of setup and global variables to work, I can understand why this doesn't happen currently. So a loosely-coupled parser that encouraged reuse could be nice, and is related somewhat to encouraging interoperability.
Yup, it sure would be, wouldn't it?
Cheers, -- jra
On 11/13/07, Virgil Ierubino virgil.ierubino@gmail.com wrote:
This is not true. It is not arbitrary what symbols come before the last one. Each symbol defines a list-type, and it's not only the final list-type that counts. A string of symbols beginning with an asterisk will output a nested list beginning with a bullet point - a string beginning with a hash will output a list beginning with a number.
It's arbitrary in the sense that every combination is valid.
This text:
* foo
#*#*#*#*#***#*###*# bloo
* foo
may be odd, but it's valid and can be rendered.
We already know there is a solution to the problem of "converting Wikitext into XHTML" - which is what you solve here. We already do it!
If you like, my solution could be considered a minimal departure from a straight EBNF-based parser. EBNF will come very close to parsing and treating it correctly, as follows:
<list-item> ::= <bullet-item> | <enumerated-item> | <indented-item>
<bullet-item> ::= "*" (<list-item> | <text>)
<enumerated-item> ::= "#" (<list-item> | <text>)
<indented-item> ::= ":" (<list-item> | <text>)
Then just a tiny bit of cleanup later on :)
Steve
Steve Bennett wrote:
This text:
* foo
#*#*#*#*#***#*###*# bloo
* foo
may be odd, but it's valid and can be rendered.
But maybe a parser (EBNF or not) should warn against "odd" things rather than just categorizing wikitext as being valid/invalid. How odd is this syntax (47 points on the "odd" scale) and how many places in language X of Wikipedia have this level of oddness? Perhaps they should be replaced with something less odd?
It's easy enough to say "we should write a parser", but when you want to achieve a certain goal (e.g. suggest replacement wikitext that is more user/editor friendly), the problem becomes a lot more interesting.
On 11/13/07, Lars Aronsson lars@aronsson.se wrote:
Steve Bennett wrote:
It's easy enough to say "we should write a parser", but when you want to achieve a certain goal (e.g. suggest replacement wikitext that is more user/editor friendly), the problem becomes a lot more interesting.
I think *not* changing the grammar is the immediate goal. However, a decent parser with a well-defined grammar would make it easy to have constructive discussions about future changes to the grammar.
Steve
On Tue, Nov 13, 2007 at 11:49:45AM +0100, Lars Aronsson wrote:
It's easy enough to say "we should write a parser", but when you want to achieve a certain goal (e.g. suggest replacement wikitext that is more user/editor friendly), the problem becomes a lot more interesting.
And yet you question why we're interested in it? ;-)
Cheers, - jra
This whole discussion is without point. There is no need for change, is there? If it ain't broken, don't fix it.
The only reason the parser "ain't broke" is because we define the correct rendering of wikitext to be whatever the parser says it is. Under that definition, it's impossible for the parser to be broken, so your argument breaks down. (by something akin to falsifiability in science)
Hoi,
Count me out when you say that "*we* define the correct rendering of wikitext to be whatever the parser says it is". For some languages it just does not work properly. Consequently, either you are right and it is not for me to suggest that the parser is not 100%, or there is indeed room for improvement.
Thanks,
GerardM
On Nov 13, 2007 1:33 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
This whole discussion is without point. There is no need for change, is there? If it ain't broken, don't fix it.
The only reason the parser "ain't broke" is because we define the correct rendering of wikitext to be whatever the parser says it is. Under that definition, it's impossible for the parser to be broken, so your argument breaks down. (by something akin to falsifiability in science)
On 13/11/2007, GerardM gerard.meijssen@gmail.com wrote:
Hoi, Count me out when you say that "*we* define the correct rendering of wikitext to be whatever the parser says it is". For some languages it just does not work properly. Consequently, either you are right and it is not for me to suggest that the parser is not 100%, or there is indeed room for improvement. Thanks, GerardM
I never said it was a good definition, but it *is* the definition. That's why we're talking about changing that definition.
My point was that saying "The parser isn't broken" isn't a good argument because it is tautologous: the definition of a broken parser is one that doesn't render text according to the definition of how it should be rendered, and that definition in this case is simply how the parser renders it. So it is true, but irrelevant, that the parser isn't broken; what's broken is the definition of wikitext.
(And yes, that paragraph is hard to understand - I struggle to describe circular logic without ending up with a circular description... Sorry.)
On Tue, Nov 13, 2007 at 01:01:51PM +0000, Thomas Dalton wrote:
On 13/11/2007, GerardM gerard.meijssen@gmail.com wrote:
Hoi, Count me out when you say that "*we* define the correct rendering of wikitext to be whatever the parser says it is". For some languages it just does not work properly. Consequently, either you are right and it is not for me to suggest that the parser is not 100%, or there is indeed room for improvement. Thanks, GerardM
I never said it was a good definition, but it *is* the definition. That's why we're talking about changing that definition.
My point was that saying "The parser isn't broken" isn't a good argument because it is tautologous: the definition of a broken parser is one that doesn't render text according to the definition of how it should be rendered, and that definition in this case is simply how the parser renders it. So it is true, but irrelevant, that the parser isn't broken; what's broken is the definition of wikitext.
(And yes, that paragraph is hard to understand - I struggle to describe circular logic without ending up with a circular description... Sorry.)
Your assertion is that it is not possible to declare that the parser is broken because no formal standard exists to compare it against. That's pretty much akin to something I said last night, and that's the major reason people are positing EBNF: because it's a framework designed for the purpose of building such external standards as grammars.
Cheers, -- jra
On 13/11/2007, GerardM gerard.meijssen@gmail.com wrote:
Count me out when you say that "*we* define the correct rendering of wikitext to be whatever the parser says it is". For some languages it just does not work properly. Consequently, either you are right and it is not for me to suggest that the parser is not 100%, or there is indeed room for improvement.
You're conflating two things:
1. that the present parser "works", but is arguably not a robust or reusable design for the problem of parsing wikitext.
2. that the present parser not letting you just type '' in Neapolitan is not ideal.
2. is completely inappropriate in a discussion of 1. And the fact that it's important doesn't change that. A redesigned parser must present an almost-identical interface to the present implementation to get in; this is NOT the place to argue for syntax changes, and any attempts to change the syntax will in fact doom the effort.
If you want the parser changed, get it changed in the present version first. This is not the thread for you.
- d.
If you want the parser changed, get it changed in the present version first. This is not the thread for you.
Surely it's better to wait and get it changed in the new version. Once the parser is actually comprehensible it should be fairly easy to make changes. At the moment, changing the parser is akin to playing Russian Roulette.
On 13/11/2007, Thomas Dalton thomas.dalton@gmail.com wrote:
If you want the parser changed, get it changed in the present version first. This is not the thread for you.
Surely it's better to wait and get it changed in the new version. Once the parser is actually comprehensible it should be fairly easy to make changes. At the moment, changing the parser is akin to playing Russian Roulette.
Possibly. However, it remains the case that the arguable problems with wikitext for Neapolitan are completely disjoint from, and a distraction from, doing a cleaner redesign of the back end.
- d.
On 11/14/07, David Gerard dgerard@gmail.com wrote:
2. is completely inappropriate in a discussion of 1. And the fact that it's important doesn't change that. A redesigned parser must present an almost-identical interface to the present implementation to get in; this is NOT the place to argue for syntax changes, and any attempts to change the syntax will in fact doom the effort.
The exception to that is where crazy, unuseful syntax is actually a hindrance to the definition of an EBNF grammar and its implementation, as we've discussed earlier.
I have to say, I'm finding some really whacky things that work. Try this code:
* * * o
Or how about an [[Image:foo.jpg|With an [[Image:foo.jpg]] in its caption...]] ?
Or even one with the table of contents: [[Image:foo.jpg|__TOC__]]
How do you think this <pretty> little piece of text renders?
Incidentally, I'm making good progress on the grammar. I've merged in most of what was at meta, so at least there is only *one* grammar now (though part of it is EBNF and the rest is BNF). http://www.mediawiki.org/wiki/Markup_spec/BNF/Article
A recurring question is who is actually going to write this parser though. Parser.php is 5000 lines and sanitizer.php another 1300. And probably other files I don't even know about. We're talking about months of work in the dark, with no feedback, and no guarantee that it will even get used. We're going to have to come up with a better coding process than "you write the code and when you're done we'll look at it".
Steve
On 13/11/2007, Steve Bennett stevagewp@gmail.com wrote:
The exception to that is where crazy, unuseful syntax is actually a hindrance to the definition of an EBNF grammar and its implementation, as we've discussed earlier.
Oh yeah, that's wacky.
Incidentally, I'm making good progress on the grammar. I've merged in most of what was at meta, so at least there is only *one* grammar now (though part of it is EBNF and the rest is BNF). http://www.mediawiki.org/wiki/Markup_spec/BNF/Article A recurring question is who is actually going to write this parser though. Parser.php is 5000 lines and sanitizer.php another 1300. And probably other files I don't even know about. We're talking about months of work in the dark, with no feedback, and no guarantee that it will even get used. We're going to have to come up with a better coding process than "you write the code and when you're done we'll look at it".
If you can write a spec code can be generated from, that passes the current parser tests, people will *dive* upon it to do cool things with (WYSIWYG, C-based parsers, format converters, etc., etc.), even if it doesn't go into MediaWiki as used on Wikimedia. I've cc'ed this to mediawiki-l for greater outside interest.
- d.
At the moment, changing the parser is akin to playing Russian Roulette.
I tend to think of it as doing brain surgery with live electrical wires - but to each his own I suppose ;)
-- Jim
P.S. zzzzZZZAP!
On Nov 13, 2007 9:45 AM, David Gerard dgerard@gmail.com wrote:
On 13/11/2007, Steve Bennett stevagewp@gmail.com wrote:
The exception to that is where crazy, unuseful syntax is actually a hindrance to the definition of an EBNF grammar and its implementation, as we've discussed earlier.
Oh yeah, that's wacky.
Incidentally, I'm making good progress on the grammar. I've merged in most of what was at meta, so at least there is only *one* grammar now (though part of it is EBNF and the rest is BNF). http://www.mediawiki.org/wiki/Markup_spec/BNF/Article A recurring question is who is actually going to write this parser though. Parser.php is 5000 lines and sanitizer.php another 1300. And probably other files I don't even know about. We're talking about months of work in the dark, with no feedback, and no guarantee that it will even get used. We're going to have to come up with a better coding process than "you write the code and when you're done we'll look at it".
If you can write a spec code can be generated from, that passes the current parser tests, people will *dive* upon it to do cool things with (WYSIWYG, C-based parsers, format converters, etc., etc.), even if it doesn't go into MediaWiki as used on Wikimedia. I've cc'ed this to mediawiki-l for greater outside interest.
- d.
I originally wrote the EBNF situated at Meta for personal amusement - I had a vague idea that it might be useful to someone, but, as I say, I'm no computer scientist. I do believe a full EBNF expression is impossible, but this appears not to be a problem given our actual goals for such an expression.
I'm assuming our problem is this: currently we "parse" wikitext by immediately converting, via regex, into XHTML. This is not really "parsing", because parsing usually means the creation of an abstract Document Object Model which is then iterated through to generate XHTML, XML, FooBar or whatever (or so I have learnt). Because we're missing this DOM, Wikitext can't expand beyond being used by the current parser (so we can't do WYSIWYG, etc.). However, there appears to be no way of creating a DOM from Wikitext because this would be to standardise the way syntax converts to output, but any kind of standardisation will cause backwards incompatibility.
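For what it's worth, the difference can be shown in a few lines of illustrative Python (none of this is MediaWiki code, and the names are made up): the first function goes straight from markup to XHTML with a regex, the second builds a tiny tree that more than one renderer can then consume.

    import re

    # Direct approach: regex straight to XHTML, roughly the shape of the current parser.
    def regex_render(text):
        return re.sub(r"'''(.*?)'''", r"<b>\1</b>", text)

    # Tree approach: build a small list of nodes first, then render it to any target.
    def parse_bold(text):
        nodes, pos = [], 0
        for m in re.finditer(r"'''(.*?)'''", text):
            if m.start() > pos:
                nodes.append(("text", text[pos:m.start()]))
            nodes.append(("bold", m.group(1)))
            pos = m.end()
        if pos < len(text):
            nodes.append(("text", text[pos:]))
        return nodes

    def render_html(nodes):
        return "".join("<b>%s</b>" % v if k == "bold" else v for k, v in nodes)

    def render_plain(nodes):
        return "".join(v for k, v in nodes)   # a second consumer of the same tree

    src = "some '''bold''' text"
    assert regex_render(src) == render_html(parse_bold(src))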
So our problem is the dilemma: either we standardise, and lose backwards compatibility, or we don't, and lose extensibility. And in the long run, I think the first option is better. However, in standardising the language we'd lose the feature of it that all syntax is valid (useful, as then people can't ever be presented with error messages on their pages) so ideally the move toward standardisation would have to be accompanied by a switch to WYSIWYG editing, such that the code becomes beyond reach and is forced to be always valid.
On the point of immutable validity, it is perhaps less useful for all text to be valid than for there to be "invalid markup" error messages. Whilst the former ensures users can never really "go wrong", it is still true that bad markup will lead to results they very much didn't intend - and it seems to me more useful to give them an error message than a wildly unintended result.
On 13/11/2007, Steve Bennett stevagewp@gmail.com wrote:
On 11/14/07, David Gerard dgerard@gmail.com wrote:
2. is completely inappropriate in a discussion of 1. And that it's
important doesn't change that. A redesigned parser must present an almost-identical interface to the present implementation to get in; this is NOT the place to argue for syntax changes, and any attempts to change the syntax will in fact doom the effort.
The exception to that is where crazy, unuseful syntax is actually a hindrance to the definition of an EBNF grammar and its implementation, as we've discussed earlier.
I have to say, I'm finding some really whacky things that work. Try this code:
o
Or how about an [[Image:foo.jpg|With an [[Image:foo.jpg]] in its caption...]] ?
Or even one with the table of contents: [[Image:foo.jpg|__TOC__]]
How do you think this <pretty> little piece of text renders?
Incidentally, I'm making good progress on the grammar. I've merged in most of what was at meta, so at least there is only *one* grammar now (though part of it is EBNF and the rest is BNF). http://www.mediawiki.org/wiki/Markup_spec/BNF/Article
A recurring question is who is actually going to write this parser though. Parser.php is 5000 lines and sanitizer.php another 1300. And probably other files I don't even know about. We're talking about months of work in the dark, with no feedback, and no guarantee that it will even get used. We're going to have to come up with a better coding process than "you write the code and when you're done we'll look at it".
Steve
On 13/11/2007, Virgil Ierubino virgil.ierubino@gmail.com wrote:
However, in standardising the language we'd lose the feature of it that all syntax is valid (useful, as then people can't ever be presented with error messages on their pages) so ideally the move toward standardisation would have to be accompanied by a switch to WYSIWYG editing, such that the code becomes beyond reach and is forced to be always valid.
Let's assume this would doom the endeavour.
On the point of immutable validity, it is perhaps less useful for all text to be valid than for there to be "invalid markup" error messages. Whilst the former ensures users can never really "go wrong", it is still true that bad markup will lead to results they very much didn't intend - and it seems to me more useful to give them an error message than a wildly unintended result.
Not for editing by humans. Working tag soup is a *feature*, not a *bug*. Else we'd just make everyone use perfect XML.
- d.
On 11/14/07, Virgil Ierubino virgil.ierubino@gmail.com wrote:
I'm assuming our problem is this: currently we "parse" wikitext by immediately converting, via regex, into XHTML. This is not really "parsing", because parsing usually means the creation of an abstract Document Object Model which is then iterated through to generate XHTML, XML, FooBar or whatever (or so I have learnt). Because we're missing this DOM, Wikitext can't expand beyond being used by the current parser (so we can't do WYSIWYG, etc.). However, there appears to be no way of creating a DOM from Wikitext because this would be to standardise the way syntax converts to output, but any kind of standardisation will cause backwards incompatibility.
Your "DOM" is usually called an AST ("abstract syntax tree"). But yes. However, "backwards incompatibility" is not so much the issue as "sudden, drastic misrendering of existing wikitext".
I do think it's impossible to produce a meaningful traditional parser that could replicate exactly the behaviour of the current parser. I think it's very possible to produce a good parser that will cover all the most useful cases.
So our problem is the dilemma: either we standardise, and lose backwards
compatibility, or we don't, and lose extensibility. And in the long run, I think the first option is better. However, in standardising the language we'd lose the feature of it that all syntax is valid (useful, as then people can't ever be presented with error messages on their pages) so ideally the
The "all syntax is valid" thing really arises because of the nature of browsers rather than because of the parser itself. I don't think we're in danger of losing that - the parser will just have to fail gracefully when it comes up against malformed wikitext.
On the point of immutable validity, it is perhaps less useful for all text to be valid than for there to be "invalid markup" error messages. Whilst the former ensures users can never really "go wrong", it is still true that bad markup will lead to results they very much didn't intend - and it seems to me more useful to give them an error message than a wildly unintended result.
Wildly unintended is fine, at least they see that (or someone else does). What's more dangerous is when stuff silently breaks, making a sentence or two just disappear off the page.
Steve
Wildly unintended is fine, at least they see that (or someone else does). What's more dangerous is when stuff silently breaks, making a sentence or two just disappear off the page.
As long as they see it *and* can see how to fix it. Error messages (good ones, at least) tell you why something hasn't worked, not just that it hasn't (which you can usually see for yourself).
On 13/11/2007, Steve Bennett stevagewp@gmail.com wrote:
Wildly unintended is fine, at least they see that (or someone else does). What's more dangerous is when stuff silently breaks, making a sentence or two just disappear off the page.
Breakages wouldn't ever be silent - they'd produce error messages - either inline ones, ones that prevent the actual saving of the page, or fatal ones.
Breakages wouldn't ever be silent - they'd produce error messages - either inline ones, ones that prevent the actual saving of the page, or fatal ones.
Says who? What people are saying is that they don't want any error messages and want the parser to accept anything that's thrown at it and just do the best it can.
On 13/11/2007, Thomas Dalton thomas.dalton@gmail.com wrote:
Breakages wouldn't ever be silent - they'd produce error messages - either inline ones, ones that prevent the actual saving of the page, or fatal ones.
Says who? What people are saying is that they don't want any error messages and want the parser to accept anything that's thrown at it and just do the best it can.
Yep. The present parser doesn't throw errors, it just gives you something odd. Throwing errors in a new parser would be unacceptable. "Everything I can't process just gets put through" would be acceptable, I'd think.
- d.
Yep. The present parser doesn't throw errors, it just gives you something odd. Throwing errors in a new parser would be unacceptable. "Everything I can't process just gets put through" would be acceptable, I'd think.
Why is it unacceptable? I would think users would prefer to be told what's wrong rather than have something unexpected happen. Just passing through the wikitext unchanged requires the parser to work out exactly which bit has gone wrong, which is often impossible (for example, in <tag><tag></tag>, which tag has been left unclosed? Should it output "<tag>" as the contents of a tag tag, or should it output "<tag>" followed by an empty tag tag? That's assuming it can find some way of working out that it isn't meant to swallow the rest of the article. A simple error message saying "tag tag not closed", either inline when displaying the page, or as an error on save, would be much easier and much clearer.)
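To make the ambiguity concrete: any recovery from unbalanced markup is a policy choice, not a deduction of what the author meant. A toy sketch (illustrative Python, using a made-up <tag> element) in which two equally defensible policies give different output:

    import re

    def recover(text, policy="close_at_end"):
        # Naive recovery for unbalanced <tag>...</tag> markup; illustrative only.
        tokens = re.findall(r"</?tag>|[^<]+", text)
        out, depth = [], 0
        for tok in tokens:
            if tok == "<tag>":
                depth += 1
                out.append(tok)
            elif tok == "</tag>":
                if depth:
                    depth -= 1
                    out.append(tok)
                # else: a stray closing tag; dropping it is one arbitrary choice among several
            else:
                out.append(tok)
        if policy == "close_at_end":
            out.append("</tag>" * depth)   # close everything still open at the end
        # policy == "drop_unclosed": an equally defensible choice is to do nothing here
        return "".join(out)

    print(recover("<tag><tag></tag>"))                    # <tag><tag></tag></tag>
    print(recover("<tag><tag></tag>", "drop_unclosed"))   # <tag><tag></tag>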
On 13/11/2007, Thomas Dalton thomas.dalton@gmail.com wrote:
Why is it unacceptable? I would think users would prefer to be told what's wrong rather than have something unexpected happen.
You are not a technophobe.
Just passing through the wikitext unchanged requires the parser to work out exactly which bit has gone wrong, which is often impossible (for example, in <tag><tag></tag>, which tag has been left unclosed? Should it output "<tag>" as the contents of a tag tag, or should it output "<tag>" followed by an empty tag tag? That's assuming it can find some way of working out that it isn't meant to swallow the rest of the article. A simple error message saying "tag tag not closed", either inline when displaying the page, or as an error on save, would be much easier and much clearer.)
And if they don't already understand the jargon and don't understand what it is they haven't done according to the definition of wikitext?
- d.
On 13/11/2007, Thomas Dalton thomas.dalton@gmail.com wrote:
And if they don't already understand the jargon and don't understand what it is they haven't done according to the definition of wikitext?
That's why we write good error messages that don't use jargon and explain what is wrong in a way a typical user can understand.
Apart from en:wp's spectacular track record in instruction-creeping the hell out of each and every MediaWiki: space message it gets its hands on ...
The "good" error message is worse than no error message at all if at all avoidable.
We are not writing XML here and having token-soup be parsable as *something* is superior to that as a user interface.
- d.
On 13/11/2007, Thomas Dalton thomas.dalton@gmail.com wrote:
We are not writing XML here and having token-soup be parsable as *something* is superior to that as a user interface.
Why is the parser outputting something other than what the user wants ever desirable?
You assume the user knows what they want before they start. They don't. They're not *programming* here.
- d.
On 13/11/2007, David Gerard dgerard@gmail.com wrote:
On 13/11/2007, Thomas Dalton thomas.dalton@gmail.com wrote:
We are not writing XML here and having token-soup be parsable as *something* is superior to that as a user interface.
Why is the parser outputting something other than what the user wants ever desirable?
You assume the user knows what they want before they start. They don't. They're not *programming* here.
You're not making sense. Of course the user knows what they want... they want to put certain content with certain formatting into a page. If they get anything other than that, something has gone wrong. We may as well admit to it rather than pretending they wanted something else.
On 13/11/2007, Thomas Dalton thomas.dalton@gmail.com wrote:
On 13/11/2007, David Gerard dgerard@gmail.com wrote:
You assume the user knows what they want before they start. They don't. They're not *programming* here.
You're not making sense. Of course the user knows what they want... they want to put certain content with certain formatting into a page. If they get anything other than that, something has gone wrong. We may as well admit to it rather than pretending they wanted something else.
A new parser that issues the user with error messages from what they typed in - rather than just producing rubbish as now - will, I predict, be considered unacceptable.
- d.
Thomas Dalton wrote:
Why is the parser outputting something other than what the user wants ever desirable?
You're talking about some very different things when you use the word "user" here. The current parser (PHP regexp) is used both when an editor saves an article and when a reader browses an article (unless it was still in the cache). Producing an error message for a reader isn't meaningful. Some editors only fix a single spelling error and are not interested in learning about syntax errors in other parts of the article. Any kind of error message must be opt-in, that is what "lint" was to old C programmers, or what "-Wall -pedantic" (activate all warning messages) is to the GCC compiler. This could be useful, though, which suggests the parser should be able to run in different modes: Just-show-it mode and lint mode.
http://en.wikipedia.org/wiki/Lint_(software)
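The two modes could literally be the same parse, with warnings either surfaced or swallowed. A minimal sketch (illustrative Python; the recovery policy and the warning text are made up):

    import re

    def parse(text, lint=False):
        # Best-effort rendering plus warnings that are only surfaced in lint mode,
        # so readers never see them. Illustrative only.
        warnings = []
        if text.count("'''") % 2:
            warnings.append("unbalanced ''' (bold) markup")
            text += "'''"                   # one possible best-effort recovery
        html = re.sub(r"'''(.*?)'''", r"<b>\1</b>", text, flags=re.S)
        return html, (warnings if lint else [])

    html, _ = parse("some '''bold text")                  # reader view: no warnings
    html, notes = parse("some '''bold text", lint=True)   # opt-in lint/debug view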
On 13/11/2007, Lars Aronsson lars@aronsson.se wrote:
Producing an error message for a reader isn't meaningful.
It is much more meaningful than a load of gibberish being outputted by bad syntax.
Besides, I suggest that error messages be appended to an article (in some way) and not necessarily inserted inline where they occur.
On 13/11/2007, Lars Aronsson lars@aronsson.se wrote:
Thomas Dalton wrote:
Why is the parser outputting something other than what the user wants ever desirable?
You're talking about some very different things when you use the word "user" here. The current parser (PHP regexp) is used both when an editor saves an article and when a reader browses an article (unless it was still in the cache). Producing an error message for a reader isn't meaningful. Some editors only fix a single spelling error and are not interested in learning about syntax errors in other parts of the article. Any kind of error message must be opt-in, that is what "lint" was to old C programmers, or what "-Wall -pedantic" (activate all warning messages) is to the GCC compiler. This could be useful, though, which suggests the parser should be able to run in different modes: Just-show-it mode and lint mode.
How about always allowing saves (I think most people are agreed that refusing to save invalid syntax is a bad idea), but showing error messages when the page is displayed immediately after saving (I'm thinking ?action=debug; this would allow people to get the error messages manually at other times as well, and shouldn't be too hard to implement). Normal readers of the page would just see the parser's best guess, but when someone clicks "save" they would see it with inline error messages (appending them just makes it impossible to tell where the error is - articles don't have line numbers), probably with a big red warning at the top to make sure people don't just move on without looking at the article they've just saved.
On 14/11/2007, Thomas Dalton thomas.dalton@gmail.com wrote:
How about always allowing saves (I think most people are agreed that refusing to save invalid syntax is a bad idea), but showing error messages when the page is displayed immediately after saving (I'm thinking ?action=debug; this would allow people to get the error messages manually at other times as well, and shouldn't be too hard to implement). Normal readers of the page would just see the parser's best guess, but when someone clicks "save" they would see it with inline error messages (appending them just makes it impossible to tell where the error is - articles don't have line numbers), probably with a big red warning at the top to make sure people don't just move on without looking at the article they've just saved.
I'm still utterly unconvinced "errors" are ever a good idea.
As I wrote on the subject to this list last year (re WikiCreole):
Every combination of wikitext has to be able to do something (even if that's "just let it through"), because it is a *language*.
If people who can't work computers put a character out of place and the wiki engine just spits back "INVALID CONTENT", are they going to edit an article ever again? *Hell* no.
HTML was invented as a human-editable page language. However, the humans it was invented for were nuclear physicists, rather than the general public.
When the general public started writing web pages, they didn't write well-formed HTML - they would bash out something, preview it in Netscape 1.1 and put it up. The geeks derided this as "tagsoup".
But I submit that the "tagsoup" approach was the natural one for people to use, because that's how people learn a language: test, change, test again, repeat. That's how people learn wikitext on Wikipedia: if you just write plain text, it'll more or less work. If you want to do something fancier, you'll look and see what other people do and try something like that and see if it works. No-one reads a syntax manual unless they're a geek.
People care much more about what they're putting on the page than constructing an immaculate piece of XML. If wikitext can't cope sensibly with *anything and everything* people throw at it, then we might as well just be requiring immaculate XML.
So "error messages" being output at all (except maybe in "action=debug", and even then that should be more "lint"-like) is fundamentally erroneous.
- d.
I'm still utterly unconvinced "errors" are ever a good idea.
[...]
So "error messages" being output at all (except maybe in "action=debug", and even then that should be more "lint"-like) is fundamentally erroneous.
That argument works for arguing against refusing to save invalid content, but I was explicitly not suggesting that. My suggestion was to let everything through, but display error messages (or, rather, warnings) so people can easily tell if something isn't displaying right.
On 14/11/2007, Thomas Dalton thomas.dalton@gmail.com wrote:
So "error messages" being output at all (except maybe in "action=debug", and even then that should be more "lint"-like) is fundamentally erroneous.
That argument works for arguing against refusing to save invalid content, but I was explicitly not suggesting that. My suggestion was to let everything through, but display error messages (or, rather, warnings) so people can easily tell if something isn't displaying right.
No, outputting them at all except when expressly requested will scare the living crap out of people afraid of computers. Wikitext is hardly wonderful in this regard, but error messages will make it much worse.
- d.
No, outputting them at all except when expressly requested will scare the living crap out of people afraid of computers. Wikitext is hardly wonderful in this regard, but error messages will make it much worse.
Not if the people writing the error messages keep in mind who they are for. "ERROR 574B: ''' not expected" will send most users running for the hills, sure, but "Please put ''' where you want the bold text to finish" should just help people learn. The best way to learn a new language (spoken, programming, markup, whatever) is by trial and error - that doesn't work if you can't identify errors.
On 15/11/2007, Thomas Dalton thomas.dalton@gmail.com wrote:
No, outputting them at all except when expressly requested will scare the living crap out of people afraid of computers. Wikitext is hardly wonderful in this regard, but error messages will make it much worse.
Not if the people writing the error messages keep in mind who they are for. "ERROR 574B: ''' not expected" will send most users running for the hills, sure, but "Please put ''' where you want the bold text to finish" should just help people learn.
Imagine if you find this error while editing a long article (say any en.wp country article). Is the error going to be smart enough to locate where the wrong symbols are? The edit box doesn't have line numbers. I have to agree with David on this issue.
BTW ''' is valid. ;) It bolds until the end of the line. I find this every time I try to bold two paragraphs:
'''Bold one
bold two'''
Paragraph 1 is all bolded, and Paragraph 2 has nothing.
regards, Brianna
Not if the people writing the error messages keep in mind who they are for. "ERROR 574B: ''' not expected" will send most users running for the hills, sure, but "Please put ''' where you want the bold text to finish" should just help people learn.
Imagine if you find this error while editing a long article (say any en.wp country article). Is the error going to be smart enough to locate where the wrong symbols are? The edit box doesn't have line numbers.
It can theoretically be done. Here's an example for the "Bold one / bold two" test: http://can-we-link-it.nickj.org/suggest-links/suggester.php?page=User:Nickj/... (i.e. it prints the start of the line with the problem - but it's still hard to find things in a long article)
For this malformed input: http://en.wikipedia.org/w/index.php?title=User:Nickj/sandbox&action=edit
... but whether something like this should be done or not, and if so how to present it to the user, is debatable.
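The line-locating part is simple enough to sketch (illustrative Python, not the code behind the tool above): report the start of each line carrying an odd number of ''' markers, since the edit box has no line numbers to point at.

    def report_unclosed_bold(wikitext, context=40):
        # Illustrative only: flag lines with an odd number of ''' markers.
        problems = []
        for line in wikitext.splitlines():
            if line.count("'''") % 2:
                problems.append("Unclosed bold on line starting: %r" % line[:context])
        return problems

    print(report_unclosed_bold("'''Bold one\nbold two'''"))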
BTW ''' is valid. ;)
Personally I would include being balanced in the definition of valid, which this example isn't - but it (and all input) is basically valid at the moment, since the parser never complains about anything (well, <math> extension tags can and do complain and generate red errors on bad input, but they're an extension and an exception, which does not affect most people).
I have to agree with David on this issue.
David's point is a good one. We already have technical and social barriers-to-entry, with guidelines, rules, processes, a particular style of doing things (etiquette, social norms, etc), combined with some knowledge of wikitext required to do things. The last thing we want is to make it even harder on new users. So any warnings about malformed input either have to be opt-in, or there needs to be no errors. Either seems fine to me, just don't foist technical warnings or scary-looking red messages onto people who don't want them, or know how to resolve them.
-- All the best, Nick.
On 11/15/07, Nick Jenkins nickpj@gmail.com wrote:
we want is to make it even harder on new users. So any warnings about malformed input either have to be opt-in, or there needs to be no errors. Either seems fine to me, just don't foist technical warnings or scary-looking red messages onto people who don't want them, or know how to resolve them.
Again, I'm not sure why we're even having this discussion as it seems to be tangential to the grammar/new parser project. But hypothetically if there *were* warnings, I would make them green and bubble shaped for the reasons you mention. Sort of a "hey, did you mean to leave that bold open? :) :) :)" rather than a "ZOMG U IJIT U DIDNT CLOSE UR BOLD DIE DIE DIE" type of thing.
Steve
So any warnings about malformed input either have to be opt-in, or there needs to be no errors. Either seems fine to me, just don't foist technical warnings or scary-looking red messages onto people who don't want them, or know how to resolve them.
That's why we make the warnings neither technical, nor red. And we include in them instructions (and links to more instructions) on how to resolve them.
Perhaps someone should talk to some real users about what they'd prefer...
BTW ''' is valid. ;)
Everything is valid, that's the point. My suggestion is just to warn. I imagine a significant amount of the time, people failing to close bolds is an accident, not an intentional use of the fact that they stop at the end of the paragraph, and a warning could be useful.
On Thu, Nov 15, 2007 at 10:53:09AM +0000, Thomas Dalton wrote:
BTW ''' is valid. ;)
Everything is valid, that's the point. My suggestion is just to warn.
If everything is valid, why then, there's nothing to warn *about*. :-)
Cheers, -- jra
On 15/11/2007, Jay R. Ashworth jra@baylink.com wrote:
On Thu, Nov 15, 2007 at 10:53:09AM +0000, Thomas Dalton wrote:
BTW ''' is valid. ;)
Everything is valid, that's the point. My suggestion is just to warn.
If everything is valid, why then, there's nothing to warn *about*. :-)
By definition, a warning is about something which is valid, but probably unintentional. If it wasn't valid, you would get an error message and the parsing would be aborted.
On Thu, Nov 15, 2007 at 01:49:39PM +0000, Thomas Dalton wrote:
On 15/11/2007, Jay R. Ashworth jra@baylink.com wrote:
On Thu, Nov 15, 2007 at 10:53:09AM +0000, Thomas Dalton wrote:
BTW ''' is valid. ;)
Everything is valid, that's the point. My suggestion is just to warn.
If everything is valid, why then, there's nothing to warn *about*. :-)
By definition, a warning is about something which is valid, but probably unintentional. If it wasn't valid, you would get an error message and the parsing would be aborted.
Ah. Right. I spend most of my time in 4GL's where the line's more blurry.
Cheers, -- jra
Thomas Dalton wrote:
My suggestion is just to warn. I imagine a significant amount of the time, people failing to close bolds is an accident, not an intentional use of the fact that they stop at the end of the paragraph, and a warning could be useful.
There were a significant number of unbalanced <small> tags, where the closing tag was either missing or was an opening tag instead of a closing one. The parser produced invalid syntax, and tidy then closed them at the end of the paragraph, which is imho the correct way.
But then tidy was upgraded and decided to reopen the inline tags at the next block element, breaking quite a lot of pages, especially Village pump and user talk pages, which needed to be fixed months after being archived. This bug is still open. :(
On 11/15/07, David Gerard dgerard@gmail.com wrote:
I'm still utterly unconvinced "errors" are ever a good idea.
As I wrote on the subject to this list last year (re WikiCreole):
Every combination of wikitext has to be able to do something (even if that's "just let it through"), because it is a *language*.
If people who can't work computers put a character out of place and the wiki engine just spits back "INVALID CONTENT", are they going to edit an article ever again? *Hell* no.
David, you seem to be railing against some argument to suddenly change the behaviour of the parser from rendering everything, to suddenly balking on bad input and spitting out "INVALID CONTENT" instead.
Has anyone suggested such a thing? Whence cometh thy fury?
FWIW, I anticipate that the new parser will behave much like the current one, but it could also generate useful warning messages if requested. These would presumably be shown to the person saving the page, at preview time, and also to the viewer if some option were enabled.
Steve
On 15/11/2007, Steve Bennett stevagewp@gmail.com wrote:
David, you seem to be railing against some argument to suddenly change the behaviour of the parser from rendering everything, to suddenly balking on bad input and spitting out "INVALID CONTENT" instead. Has anyone suggested such a thing? Whence cometh thy fury?
You suggest it in the next paragraph:
FWIW, I anticipate that the new parser will behave much like the current one, but it could also generate useful warning messages if requested. These would presumably be shown to the person saving the page, at preview time, and also to the viewer if some option were enabled.
Spitting out warnings in any situation by default is bad. Any unrequested warning at all will come across as "INVALID CONTENT COMPUTER CALLS YOU FAILURE".
That said, (a) having a lint mode (b) having an option to switch it on in user prefs would be very nice.
- d.
On 11/15/07, David Gerard dgerard@gmail.com wrote:
You suggest it in the next paragraph:
Heh, I asked what made you angry, and you reply "you do in the next paragraph"?
Spitting out warnings in any situation by default is bad. Any unrequested warning at all will come across as "INVALID CONTENT COMPUTER CALLS YOU FAILURE".
Ok, I think we got the point. Not that I think anyone was really suggesting this to begin with. Even my post was just "it could". Not "it should", "it will" etc etc.
(to semi-quote Brion, let's not lose focus with side issues here)
Steve
I second this suggestion!
On 14/11/2007, Thomas Dalton thomas.dalton@gmail.com wrote:
On 13/11/2007, Lars Aronsson lars@aronsson.se wrote:
Thomas Dalton wrote:
Why is the parser outputting something other than what the user wants ever desirable?
You're talking about some very different things when you use the word "user" here. The current parser (PHP regexp) is used both when an editor saves an article and when a reader browses an article (unless it was still in the cache). Producing an error message for a reader isn't meaningful. Some editors only fix a single spelling error and are not interested in learning about syntax errors in other parts of the article. Any kind of error message must be opt-in, that is what "lint" was to old C programmers, or what "-Wall -pedantic" (activate all warning messages) is to the GCC compiler. This could be useful, though, which suggests the parser should be able to run in different modes: Just-show-it mode and lint mode.
How about always allowing saves (I think most people are agreed that refusing to save invalid syntax is a bad idea), but showing error messages when the page is displayed immediately after saving (I'm thinking ?action=debug; this would allow people to get the error messages manually at other times as well, and shouldn't be too hard to implement). Normal readers of the page would just see the parser's best guess, but when someone clicks "save" they would see it with inline error messages (appending them just makes it impossible to tell where the error is - articles don't have line numbers), probably with a big red warning at the top to make sure people don't just move on without looking at the article they've just saved.
On 13/11/2007, David Gerard dgerard@gmail.com wrote:
On 13/11/2007, Thomas Dalton thomas.dalton@gmail.com wrote:
Why is it unacceptable? I would think users would prefer to be told what's wrong rather than have something unexpected happen.
You are not a technophobe.
[...]
And if they don't already understand the jargon and don't understand what it is they haven't done according to the definition of wikitext?
Error messages can be verbose and helpful. Look at how helpful the W3C XHTML markup validator is when it finds errors.
A good solution here would be to always "do the best you can" and output something, no matter how unexpected, but APPEND to the output (in some fashion) some warning messages about invalid or likely-invalid syntax. Technophiles can clear these up, technophobes can ignore them.
Thomas Dalton wrote:
Why is it unacceptable?
My understanding of the wiki wiki idea is that we're trying to lower barriers to entry as far as possible, and capture every single good-faith contribution regardless of how "wrong" it might be. The "wantedness" or "unwantedness" of an edit is for fellow editors to decide, not the parser.
My understanding of the wiki wiki idea is that we're trying to lower barriers to entry as far as possible, and capture every single good-faith contribution regardless of how "wrong" it might be. The "wantedness" or "unwantedness" of an edit is for fellow editors to decide, not the parser.
That's why I'm suggesting warnings, not error messages that prevent saving.
On 11/16/07, Mark mark@geekhive.net wrote:
My understanding of the wiki wiki idea is that we're trying to lower barriers to entry as far as possible, and capture every single good-faith contribution regardless of how "wrong" it might be. The "wantedness" or "unwantedness" of an edit is for fellow editors to decide, not the parser.
Don't tell me that when my table has exploded due to an extraneous | somewhere...
If a user's "error" is to put an extra space where it technically doesn't matter, of course we wouldn't bug them about it. If a user's error makes half the article disappear or turn bold or something, they're going to want to know about. Let's not patronise here.
Steve
Steve Bennett wrote:
Don't tell me that when my table has exploded due to an extraneous | somewhere...
?? so? It's still your decision that the table is exploded.
If a user's error makes half the article disappear or turn bold or something, they're going to want to know about. Let's not patronise here.
It's not patronizing to use the wiki-wiki method, it's *human*. I realise that Wikipedia and some other Wikimedia projects have long since moved to a hybrid approach using some degree of machine intervention, but on the whole wiki-wiki works because of editorial vigilance, not machine error detection.
We don't know for a given editor whether seeing a warning or seeing a page blow-out is going to be more scary. Therefore it should be a setting. I think the default should be not to show the error, or perhaps to show a little widget that links to the warnings from text like "Why is this page messed up?".
Templates complicate the "is this valid" test. Multiple templates can contain invalid syntax which when used together form proper syntax.
Consider a template that starts off a table - say Template:TableStart which contains:
{|
|-
|
And a Template:TableStop which contains
|}
The former contains invalid syntax since the table is not closed, the latter contains meaningless characters since the table wasn't started. But used together, the two form a table. For example:
{{TableStart}}
Some text
|-
Perhaps another row
{{TableStop}}
That example also contains individually meaningless markup characters (the |-), but after transclusion would render a complete table.
So where do you place the validation logic? If you validate the Template pages themselves, they'll fail - so do you omit Templates from conformance? If so, what about transcluding pages from other Namespaces?
I'm not trying to say that a validator is a bad thing, or that it wouldn't be useful. I'm just wondering how all the false positives can be accounted for.
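One way to avoid most of those false positives is to run the check only on the fully transcluded text, never on a template page by itself. A toy sketch (illustrative Python; the template contents follow the example above, everything else is made up):

    import re

    TEMPLATES = {                        # hypothetical wiki contents, as in the example above
        "TableStart": "{|\n|-\n|",
        "TableStop": "|}",
    }

    def transclude(text):
        return re.sub(r"\{\{(\w+)\}\}",
                      lambda m: TEMPLATES.get(m.group(1), m.group(0)), text)

    def table_balanced(text):
        return text.count("{|") == text.count("|}")

    page = "{{TableStart}}\nSome text\n|-\nPerhaps another row\n{{TableStop}}"
    assert not table_balanced(TEMPLATES["TableStart"])   # the template alone looks "broken"
    assert table_balanced(transclude(page))              # the transcluded page is fine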
-- Jim R. Wilson (jimbojw)
On Nov 13, 2007 12:01 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
Breakages wouldn't ever be silent - they'd produce error messages - either inline ones, ones that prevent the actual saving of the page, or fatal ones.
Says who? What people are saying is that they don't want any error messages and want the parser to accept anything that's thrown at it and just do the best it can.
Jim Wilson wrote:
That example also contains individually meaningless markup characters (the |-), but after transclusion would render a complete table.
So where do you place the validation logic? If you validate the Template pages themselves, they'll fail - so do you omit Templates from conformance? If so, what about transcluding pages from other Namespaces?
What about <includeonly>|}</includeonly>?
I agree with you; also, I'm not sure users would be willing to "fix" a template which will never be used alone. Even if it works like that, changing tablestop to a different valid syntax would break the including pages. There, you can't forbid saving. Moreover, the editor is not going to notice that the change to template:tablestop made blinking red warnings appear on several different articles.
I agree with you; also, I'm not sure users would be willing to "fix" a template which will never be used alone. Even if it works like that, changing tablestop to a different valid syntax would break the including pages. There, you can't forbid saving. Moreover, the editor is not going to notice that the change to template:tablestop made blinking red warnings appear on several different articles.
Inline error messages which are generated *after* templates are included should work. Anyone viewing Template:Tablestop would see an error (and can ignore it), but pages correctly transcluding the template would be valid and would not have any error messages.
Templates complicate the "is this valid" test. Multiple templates can contain invalid syntax which when used together form proper syntax.
[..snip..]
The former contains invalid syntax since the table is not closed, the latter contains meaningless characters since the table wasn't started. But used together, the two form a table. For example:
{{TableStart}}
Some text
|-
Perhaps another row
{{TableStop}}
That example also contains individually meaningless markup characters (the |-), but after transclusion would render a complete table.
So where do you place the validation logic? If you validate the Template pages themselves, they'll fail - so do you omit Templates from conformance? If so, what about transcluding pages from other Namespaces?
I'm not trying to say that a validator is a bad thing, or that it wouldn't be useful. I'm just wondering how all the false positives can be accounted for.
We had this exact problem at http://en.wikipedia.org/wiki/WP:WS (a discontinued project for "validating" wiki syntax, for a very very simplistic definition of "validate").
It was all about balance - that if you opened something you should close it, or if you close something you should have opened it - including for templates, tables, headers, links, bold and italics.
It only looked at a single page (i.e. no template transclusion) so this would pass, since the number of "{{" equals number of "}}" :
{{TableStart}}
Some text
|-
Perhaps another row
{{TableStop}}
... and this would pass, since number of "{|" equals number of "|}" :
{|
Some text
|-
Perhaps another row
|}
... but this would fail, because number of "{|" does not equal number of "|}" :
{|
Some text
|-
Perhaps another row
{{TableStop}}
... and the resolution was to change it to one of the first two forms (i.e. you had to close a table in the same way that you opened it).
People forgetting to close tables did happen, but it was not a common problem.
Far more common was "[[unclosed brackets" or "an unclosed ''bold or '''italics".
And yes, for the record, false positives were a problem, such as for things like "[a,b)" as a mathematical notation, as opposed to someone mistyping an external link.
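The purely textual balance check described above fits in a few lines, which also makes the false positives easy to see. A sketch (illustrative Python; the real project's rules were more involved):

    PAIRS = [("{{", "}}"), ("{|", "|}"), ("[[", "]]"), ("[", "]"), ("'''", "'''")]

    def balance_report(text):
        # Count opening vs closing markers for a single page, with no transclusion.
        report = []
        for opener, closer in PAIRS:
            if opener == closer:                     # ''' closes itself
                ok = text.count(opener) % 2 == 0
            else:
                ok = text.count(opener) == text.count(closer)
            if not ok:
                report.append("unbalanced %s ... %s" % (opener, closer))
        return report

    print(balance_report("[[unclosed brackets"))          # flags the unclosed brackets
    print(balance_report("the interval [a,b) in maths"))  # the [a,b) false positive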
A simple error message saying "tag tag not closed", either inline when displaying the page, or as an error on save, would be much easier and much clearer.)
And if they don't already understand the jargon and don't understand what it is they haven't done according to the definition of wikitext?
The above project was opt-in, which worked well. So an integrated validator could maybe work as a preference that people had to explicitly enable - ( "[X] Show me verbose errors when saving invalid wiki text." )
Also rather than just saying "there was this error", what would be really nice would be if you could get a list of common solutions for a problem, with a point and click interface to apply a canned solution. Probably around 95% of the solutions to the problems found were quite mechanical transformations, and if these transformations were available from a pick-list, then that would be nice.
For example:
Problem: Unclosed bold on line: "an unclosed ''bold on a line".
Possible solutions (with radio boxes shown to select preferred solution):
1) Remove unclosed bold from line (i.e. "an unclosed bold on a line".)
2) Close bold after first word (i.e. "an unclosed ''bold'' on a line".)
3) Close bold at end of line (i.e. "an unclosed ''bold on a line''".)
4) I will manually edit to resolve this.
Same thing could potentially be used for headings, link brackets, etc.
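Generating those canned transformations is mostly mechanical. A rough sketch for the unclosed-bold case (illustrative Python; the four options mirror the list above):

    def bold_fix_candidates(line):
        # Offer mechanical fixes for a line with one unclosed '''. Illustrative only.
        before, _, after = line.partition("'''")
        first_word, _, rest = after.partition(" ")
        return [
            before + after,                                       # 1) remove the stray '''
            before + "'''" + first_word + "'''"
                + (" " + rest if rest else ""),                   # 2) close after first word
            line + "'''",                                         # 3) close at end of line
            line,                                                 # 4) leave it for manual editing
        ]

    for option in bold_fix_candidates("an unclosed '''bold on a line"):
        print(option)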
A new parser that issues the user with error messages from what they typed in - rather than just producing rubbish as now - will, I predict, be considered unacceptable.
If it's opt-in for the current behaviour + warning messages, and behaves the same as now by default, then surely that should be acceptable?
-- All the best, Nick.
On 14/11/2007, Platonides Platonides@gmail.com wrote:
very interesting. I wasn't aware that such a thing existed.
Now, I'm all for minimal quoting, but I think you've taken it a little too far... what's interesting?
Thomas Dalton wrote:
On 14/11/2007, Platonides wrote:
very interesting. I wasn't aware that such a thing existed.
Now, I'm all for minimal quoting, but I think you've taken it a little too far... what's interesting?
Nick's message - previous quoting :P. That validating wikiproject, validating approaches, that it seemed to work...
On 11/14/07, Jim Wilson wilson.jim.r@gmail.com wrote:
Templates complicate the "is this valid" test. Multiple templates can contain invalid syntax which when used together form proper syntax.
I don't think there's any question that templates will continue to be parsed by a preprocessor. Just like in C, there's really no way to do a one-pass parse if elements can expand to literally anything.
So where do you place the validation logic? If you validate the
Template pages themselves, they'll fail - so do you omit Templates from conformance? If so, what about transcluding pages from other Namespaces?
I'm not sure where all this discussion about a validator came from. The answer seems obvious though: don't validate template text, and validate a page after transcluding and parsing.
The discussion of error messages is a bit misplaced though - the parser could generate helpful information which the skin could display in a useful way as a warning to the user. I don't think you'd consider them error messages though.
Consider this: '''Some bold''' and then '''' what?
The stray apostrophes at the end aren't an error. Nor does the parser really know what to do with them, so it'll guess (render literally) and warn.
Steve
On 11/13/07, Jim Wilson wilson.jim.r@gmail.com wrote:
Templates complicate the "is this valid" test. Multiple templates can contain invalid syntax which when used together form proper syntax.
The same applies to a variety of other constructs, like <noinclude> and probably some parser hooks. It's therefore already been concluded that the parser must be two-pass: one to substitute all conditional and remote stuff, like templates, and the second to convert to HTML (or whatever).
On 11/14/07, Simetrical Simetrical+wikilist@gmail.com wrote:
The same applies to a variety of other constructs, like <noinclude> and probably some parser hooks. It's therefore already been concluded that the parser must be two-pass: one to substitute all conditional and remote stuff, like templates, and the second to convert to HTML (or whatever).
I'm picturing 4 passes:
1) Preprocessor: substitute templates (and parser functions?) into the text stream
2) Parser: build an AST by reading the text stream
3) Contexter: manipulate the AST, disambiguating ''' and '' by using the built context. Tidy up nested lists.
4) Renderer: convert the AST into XHTML or whatever.
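For concreteness, the shape of that pipeline in a dozen lines (illustrative Python; the stage names follow the list above, and the markup handled is deliberately trivial):

    import re

    def preprocess(text, templates):
        # Pass 1: substitute templates into the text stream (parser functions omitted).
        return re.sub(r"\{\{(\w+)\}\}", lambda m: templates.get(m.group(1), ""), text)

    def parse(text):
        # Pass 2: build a (very) small AST -- here just alternating text and bold runs.
        return [("bold", t) if i % 2 else ("text", t)
                for i, t in enumerate(text.split("'''"))]

    def contextify(ast):
        # Pass 3: context fix-ups (closing dangling bold, tidying lists) -- a no-op here.
        return ast

    def render(ast):
        # Pass 4: serialise the AST to XHTML, or to any other target format.
        return "".join("<b>%s</b>" % t if kind == "bold" else t for kind, t in ast)

    html = render(contextify(parse(preprocess("{{greet}} '''world'''", {"greet": "hello"}))))
    print(html)   # hello <b>world</b>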
Steve
Steve Bennett wrote:
I'm picturing 4 passes:
1) Preprocessor: substitute templates (and parser functions?) into the text stream
And custom tags.
Actually I have a strong preference for using extractTagsAndParams as a *public* function in order to extract extension tags rather than adding parser Hooks.
2) Parser: build an AST by reading the text stream
3) Contexter: manipulate the AST, disambiguating ''' and '' by using the built context. Tidy up nested lists.
4) Renderer: convert the AST into XHTML or whatever.
On Tue, Nov 13, 2007 at 01:17:24PM +0000, David Gerard wrote:
2. is completely inappropriate in a discussion of 1. And that it's
important doesn't change that. A redesigned parser must present an almost-identical interface to the present implementation to get in; this is NOT the place to argue for syntax changes, and any attempts to change the syntax will in fact doom the effort.
Are you speculating, or speaking ex cathedra as a dev/staffer?
Cheers, -- jra
Jay R. Ashworth wrote:
On Tue, Nov 13, 2007 at 01:17:24PM +0000, David Gerard wrote:
2. is completely inappropriate in a discussion of 1. And that it's
important doesn't change that. A redesigned parser must present an almost-identical interface to the present implementation to get in; this is NOT the place to argue for syntax changes, and any attempts to change the syntax will in fact doom the effort.
Are you speculating, or speaking ex cathedra as a dev/staffer?
It is common knowledge among those who have followed MediaWiki development discussions for a while that changing wikitext syntax is considered A Bad Thing (tm) and will most likely get any commits reverted.
MinuteElectron.
On 13/11/2007, Jay R. Ashworth jra@baylink.com wrote:
On Tue, Nov 13, 2007 at 01:17:24PM +0000, David Gerard wrote:
2. is completely inappropriate in a discussion of 1. And that it's
important doesn't change that. A redesigned parser must present an almost-identical interface to the present implementation to get in; this is NOT the place to argue for syntax changes, and any attempts to change the syntax will in fact doom the effort.
Are you speculating, or speaking ex cathedra as a dev/staffer?
Speculating, since I'm neither of those ;-)
- d.
On Tue, Nov 13, 2007 at 05:58:30PM +0000, David Gerard wrote:
Are you speculating, or speaking ex cathedra as a dev/staffer?
Speculating, since I'm neither of those ;-)
Brion, who *was* speaking ex-cathedra, has in fact spoken.
I'm out.
Cheers, -- jra
On Tue, Nov 13, 2007 at 02:49:39AM +0100, Lars Aronsson wrote:
But you guys are all talking about creating a parser, the ultimate one, for which you really don't have any need. You were talking about this five years ago. You still haven't produced any useful parser. And you probably will discuss this five years from now.
"The current parser is a crawling horror, very difficult to maintain and impossible to extract for other purposes" seems like a perfectly serviceable motivation to me, and clearly, to Steve and some other people.
I gather that having a sufficiently clear spec to allow it to be implemented in a faster language than PHP for WMF would not be a bad thing either.
Cheers, -- jra
1) there are a limited number of basic syntax constructs
2) all valid syntax is either basic syntax, or complex syntax built from basic syntax, which it can be broken down into
While it's not ideal, I see no reason (possibly because I don't know what I'm talking about) that we can't extend (1) to:
1a) There are countably many basic syntax constructs, only finitely many of which are used in any valid syntax.
(See [[Countably infinite]] if that doesn't make sense to you)
That would then allow our list syntax. It makes the parser slightly more complicated (it requires defining tokens iteratively), but should be possible (well, the fact that we do it shows it's possible, I mean possible without straying too far from an EBNF parser).
On 11/8/07, Simetrical Simetrical+wikilist@gmail.com wrote:
2) Now that we have a grammar, a yacc parser is compiled, and
appropriate rendering bits are added to get it to render to HTML.
People have already done this, at least once, haven't they? Do we have a list of attempts?
3) The stuff the BNF grammar doesn't cover is tacked on with some
other methods. In practice, it seems like a two-pass parser would be ideal: one recursive pass to deal with templates and other substitution-type things, then a second pass with the actual grammar of most of the language. The first pass is of necessity recursive, so there's probably no point in having it spend the time to repeatedly parse italics or whatever, when it's just going to have to do it again when it substitutes stuff in. Further rendering passes are going to be needed, e.g., to insert the table of contents. Further parsing passes may or may not be needed.
Ouch, now you're up to about 4 passes, which isn't far off the current version. Two passes would be good, like a C compiler: once for meta-markup (templates, parser functions), and once for content. Would it be possible to perhaps have an in-place pattern-based parser for the first phase, then a proper recursive descent for the content?
Unfortunately the deliberate apparent similarity of lots of very different language features ({{foo}} vs {{foo:blah}}, [[Project:Link]] vs [[Category:Link]] etc) makes much of this very complex.
I guess there's no possibility of making wholesale changes to the grammar then implementing a migration script?
4) All of this breaks a thousand different corner cases and half the
parser tests. The implementers carefully go through every failed parser test, rewrite it to the actual output, and carefully justify why this is the correct course of action. Or just assume it is, depending on the level of care.
Sounds good to me. I wonder also if there is any chance of implementing two parsers and migrating slowly from one to the next. Perhaps all Wikipedia pages starting with Ab... could be rendered with the new parser while others use the old? Pages using the new parser could have a warning displayed like "Are there problems with the way the content is displayed? Click here...". And wait for people to actually report perceived problems - as opposed to the page failing a regression test.
5) A PHP implementation of the exact same grammar is implemented. How
practical this is, I don't know, but it's critical unless we want pretty substantially different behavior for people using the PHP module versus not. It is not acceptable to force third parties to use a PHP module, nor to grind their parser to a halt (which a naive compilation of the grammar into PHP would probably do).
Wasn't there a move to get away from PHP for the parser? Is that not feasible?
6) Everything is rolled out live. Pages break left and right. Large
complaint threads are started on the Village Pump, people fix it, and everyone forgets about it. Developers get a warm fuzzy feeling for having finally succeeded at destroying Parser.php.
I have trouble picturing this. It could be horrendous. But if it could be managed so there were perhaps a few dozen complaints a day and not more, that might be doable.
This is if it's to be done properly. A semi-formal specification
that's not directly useful for parsing pages would involve a lot less work and perhaps correspondingly less benefit. It could still improve operability with third parties dramatically; perhaps that's the only goal other people have in mind, not the ability to compile a parser with some yacc equivalent. I don't know.
The parser moves though. I don't see a semi-formal grammar which isn't used for anything keeping pace.
Steve
Wouldn't the most sensible way to come up with a formal specification be to write a dirty great big page with all the details of the way certain constructs are parsed?
BNF/EBNF/Bison/etc isn't really suited to the task at all: as Steve has mentioned, it's designed to answer "Does text Y match grammar Z?", rather than "what output should be created when text Y is parsed with grammar Z?"
On 11/8/07, Andrew Garrett andrew@epstone.net wrote:
Wouldn't the most sensible way to come up with a formal specification be to write a dirty great big page with all the details of the way certain constructs are parsed?
What format are you thinking of? A set of rules with precedence like:
*x --> <UL><LI>x</LI></UL>
'''x''' -> <B>x</B>
perhaps? It has the advantage that it would more closely match the current parser implementation and would be easier to write. Which of course gives the disadvantage that it's less like a spec for a "real" parser, so is less helpful if/when we come to write one.
However, it does look easier to maintain.
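That "ordered set of rules" idea can be prototyped as a table of pattern-to-replacement pairs applied in precedence order (illustrative Python; the patterns are made up, and real list handling would have to merge adjacent items rather than wrapping each line in its own <ul>):

    import re

    RULES = [
        (re.compile(r"'''(.*?)'''", re.S), r"<b>\1</b>"),
        (re.compile(r"''(.*?)''", re.S),   r"<i>\1</i>"),
        (re.compile(r"^\* *(.*)$", re.M),  r"<ul><li>\1</li></ul>"),
    ]

    def apply_rules(text):
        for pattern, replacement in RULES:
            text = pattern.sub(replacement, text)
        return text

    print(apply_rules("* '''bold''' item"))   # <ul><li><b>bold</b> item</li></ul>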
Steve
"Steve Bennett" stevagewp@gmail.com wrote in message news:b8ceeef70711071955m74ce9ac7q992cbe751fc9415d@mail.gmail.com...
On 11/8/07, Simetrical Simetrical+wikilist@gmail.com wrote:
4) All of this breaks a thousand different corner cases and half the
parser tests. The implementers carefully go through every failed parser test, rewrite it to the actual output, and carefully justify why this is the correct course of action. Or just assume it is, depending on the level of care.
Sounds good to me. I wonder also if there is any chance of implementing two parsers and migrating slowly from one to the next. Perhaps all Wikipedia pages starting with Ab... could be rendered with the new parser while others use the old? Pages using the new parser could have a warning displayed like "Are there problems with the way the content is displayed? Click here...". And wait for people to actually report perceived problems - as opposed to the page failing a regression test.
Well - we'll need some kind of flag in the DB to indicate which of the parsers should be used to render the page, otherwise the entire history will be borked. Therefore any conversion needn't be so formal.
E.g.
* Bot goes through existing pages, parses each with the new parser and the old parser, and passes the results through tidy. Any identical pages get silently 'upgraded' to the new parser.
* Any pages that don't match get added to [[Category:Pages that need fixup for the new parser]] but don't get upgraded automatically. Possibly limit this so the bot only runs when the size of the category is <1000 or something.
* An option when saving to 'upgrade' to the new parser (no reverse option) for pages still on the old one. In these cases it's up to the user to check the page - the preview function will show the result of the new parser.
* We could also provide a tool to show the diff between the two different outputs (or to display the two versions for visual comparison).
- Mark Clements (HappyDog)
On 11/7/07, Steve Bennett stevagewp@gmail.com wrote:
- The stuff the BNF grammar doesn't cover is tacked on with some other methods. In practice, it seems like a two-pass parser would be ideal: one recursive pass to deal with templates and other substitution-type things, then a second pass with the actual grammar of most of the language. The first pass is of necessity recursive, so there's probably no point in having it spend the time to repeatedly parse italics or whatever, when it's just going to have to do it again when it substitutes stuff in. Further rendering passes are going to be needed, e.g., to insert the table of contents. Further parsing passes may or may not be needed.
Ouch, now you're up to about 4 passes, which isn't far off the current version. Two passes would be good, like a C compiler: once for meta-markup (templates, parser functions), and once for content. Would it be possible to perhaps have an in-place pattern-based parser for the first phase, then a proper recursive descent for the content?
I don't know why it couldn't be two-pass. You obviously need some extra steps to insert things like the TOC, but those don't really count as a "pass". It would just require the parser to, say, parse all the stuff up to the TOC location as one string, then start a new string for the stuff after where the TOC should go, and concatenate them all together when it's finished actually parsing.
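Roughly the shape I mean, with everything here (the function names, the template array, the TOC placeholder) invented just to show the structure - this is not how Parser.php is organised:

<?php
// Pass 1 (recursive): expand {{...}} only. Pass 2 (non-recursive): the real
// grammar. The TOC is spliced in afterwards rather than counted as a pass.
function expandTemplates($text, array $templates, $depth = 0) {
    if ($depth > 40) {
        return $text; // guard against template loops
    }
    $expanded = preg_replace_callback('/\{\{([^{}]+)\}\}/',
        function ($m) use ($templates) {
            $name = trim($m[1]);
            return isset($templates[$name]) ? $templates[$name] : $m[0];
        }, $text);
    return ($expanded === $text)
        ? $expanded
        : expandTemplates($expanded, $templates, $depth + 1);
}

function parseMarkup($text) {
    // The non-recursive grammar (headings, lists, links, apostrophes...)
    // would live here; stubbed out for the sketch.
    return "<p>" . htmlspecialchars($text) . "</p>\n<!--TOC-->";
}

function renderPage($wikitext, array $templates) {
    $html = parseMarkup(expandTemplates($wikitext, $templates));
    return str_replace('<!--TOC-->', '', $html);  // or splice in a real TOC here
}

echo renderPage('Hello {{planet}}', array('planet' => '{{qualifier}} world',
                                          'qualifier' => 'cruel'));
// <p>Hello cruel world</p>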
I suspect a major problem might arise if there are constructs that require more than one-token lookahead. There probably are, and apparently bison et al. can't parse those. But again, I would defer on this to someone who actually knows something. :) This kind of construct might be a good candidate for removal in any kind of grammar overhaul.
I guess there's no possibility of making wholesale changes to the grammar then implementing a migration script?
Honestly, that's what I'd like. I've suggested it before in the past. My suggestion was to use XML, and make up for the difficulty of direct editing with good WYSIWYG. A one-way conversion to XML should be relatively easy to make lossless, since we can just adapt the existing Parser with all its quirks. Writing a WYSIWYG editor good enough to effectively replace wikitext editing would be trickier, but also doable if it had the manpower. Of course it wouldn't work on non-JS-supporting clients, but XML is human-readable, right? :P Converting from XML back to wikitext losslessly is quite likely impossible, so editing in wikitext would no longer be an option. That, of course, met with substantial opposition.
Anyway, no one seems to have persevered with this task for long enough to present a list of recommended changes to make the grammar easier to parse, as far as I know. Although maybe I just haven't seen it. Brion has indicated in the past that reasonable changes to accommodate a formal grammar would be okay.
Wasn't there a move to get away from PHP for the parser? Is that not feasible?
Most people who use MediaWiki use it on shared hosting, and do not have the ability to install PHP modules or other non-PHP code. So some kind of PHP-based solution really needs to exist.
I have trouble picturing this. It could be horrendous. But if it could be managed so there were perhaps a few dozen complaints a day and not more, that might be doable.
I'm not worried. It's a wiki, it'll get fixed. As long as the effects don't impede the ability to read the text, which they probably won't, nothing will break down and die.
The parser moves though. I don't see a semi-formal grammar which isn't used for anything keeping pace.
The parser doesn't move very quickly. Anyone who fiddles with parser output without updating the parser tests gets a tongue-lashing already (as I know quite well); the same could be done for the parser description if one actually existed. But as of now, a comprehensive description of the current parser is probably impossible. My favorite symptom of how ad-hoc it is as a parser (I may have cited this before here) is this comment, line 1204:
# If there is an odd number of both bold and italics, it is likely
# that one of the bold ones was meant to be an apostrophe followed
# by italics. Which one we cannot know for certain, but it is more
# likely to be one that has a single-letter word before it.
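For what it's worth, what that heuristic boils down to is something like the following - a paraphrase of the comment in code, not the actual Parser.php logic, just to show the kind of guesswork involved:

<?php
// Given the offsets of the ''' runs on a line, pick the one to demote to
// "apostrophe + italics" when the bold and italic counts are both odd.
// Prefer a run with a single-letter word in front of it, e.g. "l'''avion".
function chooseBoldToDemote(array $boldOffsets, $line) {
    foreach ($boldOffsets as $offset) {
        $before = substr($line, 0, $offset);
        if (preg_match('/(?:^| )\w$/', $before)) {
            return $offset;   // a single-letter word precedes this run
        }
    }
    return count($boldOffsets) ? $boldOffsets[0] : -1;  // fall back to the first
}

$line = "l'''avion''' est '''super''' bien";
// offsets of the ''' runs (found by whatever tokenizer pass precedes this)
$offsets = array(1, 9, 17, 25);
echo chooseBoldToDemote($offsets, $line);  // 1 - the run right after the "l"

Whether the guess was right is unknowable from the markup alone, which is the ad-hoc part.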
On 11/8/07, Andrew Garrett andrew@epstone.net wrote:
Wouldn't the most sensible way to come up with a formal specification be to write a dirty great big page with all the details of the way certain constructs are parsed?
No, because 1) it's not possible to be sufficiently comprehensive for reliable interoperability, and 2) even if it were, the resulting explanation would be so complicated that probably no one would bother.
BNF/EBNF/Bison/etc isn't really suited to the task at all: as Steve has mentioned, it's designed to answer "Does text Y match grammar Z?", rather than "what output should be created when text Y is parsed with grammar Z?"
Not BNF alone, but Bison sure is designed to create output of some kind. At least that's the very strong impression I got from reading the Bison manual: e.g., step 1 in using Bison is given as
"Formally specify the grammar in a form recognized by Bison (see section Bison Grammar Files). For each grammatical rule in the language, describe the action that is to be taken when an instance of that rule is recognized. The action is described by a sequence of C statements."
These C statements could, of course, include direct output, or preparation of some data structure for later use.
Anyway, maybe someone who's actually tried to start on something like this should chip in with their thoughts. timwi (with later stuff by Magnus) did something called flexbisonparse a couple of years ago:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/flexbisonparse/
The main Bison grammar/input file seems to be this:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/flexbisonparse/wikiparse.y?v...
On 11/9/07, Simetrical Simetrical+wikilist@gmail.com wrote:
I suspect a major problem might arise if there are constructs that require more than one-token lookahead. There probably are, and apparently bison et al. can't parse those. But again, I would defer
I think it would be a good idea to formalise and improve the grammar so that wasn't the case. Does any sane grammar need more than one token look ahead?
# If there is an odd number of both bold and italics, it is likely
# that one of the bold ones was meant to be an apostrophe followed
# by italics. Which one we cannot know for certain, but it is more
# likely to be one that has a single-letter word before it.
This is a good example. There is no grammar, therefore no spec, therefore the parser can do whatever it wants. However it tries to guess. No one has ever really defined the answer to the question: What is represented by the following string: '''''
There are many answers to the question, depending on the context. It's horrible. It shouldn't be like that. There are solutions:
- Distinct sequences for italics and bold (**this** being the obvious choice for bold)
- Specific tokens for bold, italics, and bold-italics, so that this: '''''Some''' word'' is no longer valid. Instead you would write '''''Some''''' ''word''.
- Strong escaping mechanisms such that the parser deliberately gives up very early on, and if you want bold-italic apostrophes, you're going to have to escape them. Making ''''''foo'''''' deliberately render as bold-italics 'foo' is madness*. Cute for a lolcode or Intercal, but for MediaWiki?
Steve
* Well, it would be logical madness if it actually rendered like that. For some reason the first apostrophe renders as neither bold nor italics. So it's illogical madness :)
On 11/8/07, Steve Bennett stevagewp@gmail.com wrote:
I think it would be a good idea to formalise and improve the grammar so that wasn't the case. Does any sane grammar need more than one token look ahead?
Certainly, unless your definition of "sane" is very narrow. I believe neither C++ nor Perl have LALR(1) grammars. I saw at least one syntax suggestion for Python one time that was rejected on the basis of requiring multi-token lookahead.
On 11/8/07, Steve Bennett stevagewp@gmail.com wrote:
On 11/9/07, Simetrical Simetrical+wikilist@gmail.com wrote:
Clearly it's not possible to tokenize MW markup with one-character lookahead: you sure can't tell the difference between a second- and sixth-level heading, and of course that's even ignoring
Yes you can, if ====== is a token. Which at first glance, it should be. The fact that == looks like === looks like ==== is neither here nor there to the grammar - it's a handy mnemonic for humans, that's all.
The point is that if those are different token types (rather than the same token type -- which I guess they could be), you can't tokenize them without lookahead. Or at least I don't think you can: maybe I misunderstand lookahead. I guess it doesn't require backtracking, regardless.
Certainly apostrophes require more than one character lookahead, and backtracking.
What's wrong with ISBN handling? I don't see anything problematic in an "ISBN" token that consumes a following sequence of digits, possibly with hyphens and crap.
ISBN 123456789X is parsed as an ISBN. ISBN 123456789 is not, because it doesn't have enough digits. That means you need quite a lot of lookahead and backtracking for ISBNs, at least in the tokenizer. Which was my point, the tokenizer will need to be able to backtrack. It's not a big issue, I don't think, judging by the flex docs, which was the reason for my post: responding to Steve Sanbeg's remark about how much lookahead is needed by the tokenizer.
I may be completely misunderstanding all this, but doesn't ISBN just require lookahead, not backtracking? Nothing that comes before the "ISBN" bit is relevant.
On 11/9/07, Simetrical Simetrical+wikilist@gmail.com wrote:
Certainly, unless your definition of "sane" is very narrow. I believe neither C++ nor Perl have LALR(1) grammars. I saw at least one syntax suggestion for Python one time that was rejected on the basis of requiring multi-token lookahead.
Ok, it's been a while since I've studied this.
Certainly apostrophes require more than one character lookahead, and backtracking.
Yeah. Apostrophes again. Saying apostrophes "require" special treatment is like saying Paris Hilton "requires" special treatment.
ISBN 123456789X is parsed as an ISBN. ISBN 123456789 is not, because it doesn't have enough digits. That means you need quite a lot of lookahead and backtracking for ISBNs, at least in the tokenizer. Which was my point, the tokenizer will need to be able to backtrack. It's not a big issue, I don't think, judging by the flex docs, which was the reason for my post: responding to Steve Sanbeg's remark about how much lookahead is needed by the tokenizer.
Ok, there you're assuming that if the sequence of digits doesn't match an ISBN, then you want to reparse it as something else entirely. IMHO it's better to just parse it as an invalid ISBN. And if someone is really unhappy that the string "ISBN 23415[[link please]]" wasn't rendered as a link, then they can wrap the relevant bit with a <nowiki>
*grumble* I still think recognising "ISBN xxx" is a bad idea. *grumble*
My point in reply to yours is that although it may be feasible to backtrack, it might be a good idea to avoid it anyway, for the sake of simplicity in coding and user experience.
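To make the "invalid ISBN" idea concrete, the tokenizer can commit to an ISBN token with nothing but lookahead over the upcoming characters - something like this sketch (not the real magic-link code, and ISBN-10 only):

<?php
// Turn "ISBN <ten digits, last may be X>" into a book-source link; leave
// anything else alone. The decision is made by looking ahead at what follows
// the word ISBN - nothing already emitted ever has to be re-tokenized.
function linkIsbns($text) {
    return preg_replace_callback('/\bISBN[ \t]+((?:\d[- ]?){9}[\dXx])\b/',
        function ($m) {
            $clean = strtoupper(str_replace(array('-', ' '), '', $m[1]));
            return '<a href="/wiki/Special:BookSources/' . $clean . '">ISBN ' . $m[1] . '</a>';
        }, $text);
}

echo linkIsbns("See ISBN 1-23456-789-X but not ISBN 123456789.\n");
// Only the first becomes a link; the nine-digit one stays plain text.

Whether the nine-digit case should instead render as a visibly invalid ISBN is a one-line change either way - the point is that neither choice needs the tokenizer to back up.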
Steve
- Well, it would be logical madness if it actually rendered like that. For some reason the first apostrophe renders as neither bold nor italics. So it's illogical madness :)
It's logical madness. The parser does everything backwards for some reason - when it finds a long sequence of tokens, it starts at the end and works backwards finding the longest possible token, so when it sees 6 apostrophes it starts at the end and works back, until it has 3, which is the most that is meaningful, and replaces them with <B>, then keeps going until it has 2 more (which is now the most that is meaningful, since <B><B> is redundant), and turns them into <I>, and then is left with just 1, which is meaningless, so leaves it as a literal apostrophe. It's completely mad, but it does make sense once you understand it. (Actually, I think it might be even more complicated than that - I think I heard someone say it actually looks for closing tags and then tries to find a corresponding opening tag, so it's doubly backwards. Although, the opening and closing tags for emphasis are the same, so it might just be [[]] and {{}} that are doubly backwards...)
So, I would imagine implementing a new parser that has the same behaviour as the current one will require enormous amounts of backtracking, simply because the current behaviour is defined in terms of working backwards.
(This is all assuming I'm understanding the current parser correctly, which is a big assumption.)
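In code, the behaviour I'm describing for a single run of apostrophes would come out something like this (an illustration of my description above, definitely not the actual implementation):

<?php
// Split a run of $n apostrophes working from the right: the 3 nearest the
// text become the bold marker, the next 2 the italic marker, and anything
// left over at the front stays as literal apostrophes.
function splitApostropheRun($n) {
    $pieces = array();
    if ($n >= 3) { array_unshift($pieces, 'BOLD');   $n -= 3; }
    if ($n >= 2) { array_unshift($pieces, 'ITALIC'); $n -= 2; }
    while ($n-- > 0) { array_unshift($pieces, "'"); }  // leftover literals
    return $pieces;
}

print_r(splitApostropheRun(6)); // Array ( [0] => ' [1] => ITALIC [2] => BOLD )
print_r(splitApostropheRun(4)); // Array ( [0] => ' [1] => BOLD )

So six apostrophes come out as a stray apostrophe plus bold-italic, and four as a stray apostrophe plus bold, which at least matches what Steve observed.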
On 11/9/07, Thomas Dalton thomas.dalton@gmail.com wrote:
So, I would imagine implementing a new parser that has the same behaviour as the current one will require enormous amounts of
Why on earth would we want to do that?
Steve
On 09/11/2007, Steve Bennett stevagewp@gmail.com wrote:
On 11/9/07, Thomas Dalton thomas.dalton@gmail.com wrote:
So, I would imagine implementing a new parser that has the same behaviour as the current one will require enormous amounts of
Why on earth would we want to do that?
Backwards compatibility. The main suggestion I've seen is rewriting the parser in such a way as to make it behave like the old one in everything except a few unavoidable corner cases (bold italics is not a corner case). If we're going to make significant changes to the way wikitext is rendered, then we don't want to start with writing a grammar for the existing parser - we should start with writing a grammar for how we'd like the parser to be, and forget about the existing parser completely. That would, however, break a large number of existing pages on a large number of websites and would quite likely result in us all being lynched. Depending on what changes we make, we might be able to implement automated conversion to the new syntax, but people still aren't going to like it.
On 11/9/07, Thomas Dalton thomas.dalton@gmail.com wrote:
Backwards compatibility. The main suggestion I've seen is rewriting
Sure, with the bulk of sensible syntax that is being used. Attempting to duplicate the behaviour of the parser in dubious, ill-defined areas is a bad idea.
the parser in such a way as to make it behave like the old one in everything except a few unavoidable corner cases (bold italics is not a corner case). If we're going to make significant changes to the way
Bold italics, as in a sentence with some '''''bold italics''''' in it, is not a corner case. Some'''really ''really'' '''''we'ird''''''...uses of apostrophes are beyond corner cases. A corner implies the meeting of two well-defined edges. These are more like fuzzy grey area cases.
wikitext is rendered, then we don't want to start with writing a grammar for the existing parser - we should start with writing a grammar for how we'd like the parser to be, and forget about the existing parser completely. That would, however, break a large number of existing pages on a large number of websites and would quite likely
Let's say that 90% of articles are composed of just 20% of the syntax. So let's write a grammar for 50% of the syntax, and build a parser to match. That way we break few pages but can still trim off dead wood for future growth.
result in us all being lynched. Depending on what changes we make, we might be able to implement automated conversion to the new syntax, but people still aren't going to like it.
They might love it.
Steve
On 11/9/07, Thomas Dalton thomas.dalton@gmail.com wrote:
Backwards compatibility. The main suggestion I've seen is rewriting the parser in such a way as to make it behave like the old one in everything except a few unavoidable corner cases (bold italics is not a corner case).
I would view bold italics with adjacent apostrophes as a corner case. The behavior in that case makes very little sense and I doubt it's being widely used.
On 11/9/07, Stephen Bain stephen.bain@gmail.com wrote:
Well then, should it just take everything until the next whitespace?
Remember that some languages (like CJK) don't use whitespace to separate words. You would eat the entire paragraph. Regardless, I think we could probably do with eating all letter-characters (and number-characters? maybe not) from any alphabet that uses whitespace, for every language. Especially useful for sites like Commons or Meta or mediawiki.org. I've remarked on this before.
Anyway, if this behavior is not consistent across languages, we have the obvious problem that the parsing grammar depends on the language. This is probably not desirable. I suspect it would be entirely possible to make this behavior consistent across languages in this case, however, as I say.
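If the rule really is just "eat letter-characters", a Unicode-aware character class expresses that with no per-language configuration at all - e.g. this sketch for consuming a trail of letters after some construct (link trails being the obvious candidate; the function name is made up):

<?php
// Consume the run of letters starting at $offset, using the Unicode letter
// property instead of a per-language class like [a-z]. Sketch only.
function eatLetterTrail($text, $offset) {
    if (preg_match('/\G\p{L}+/u', $text, $m, 0, $offset)) {
        return $m[0];
    }
    return '';
}

echo eatLetterTrail("[[Zug]]fahrt nach Bern", 7), "\n";   // "fahrt"
echo eatLetterTrail("[[Razi]]al-Kindi", 8), "\n";         // "al" (stops at the hyphen)

Whether number-characters should count too is then a one-character change to the class, not a per-language setting.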
Soo Reams wrote:
I think work on a clean grammar and a slick parser is among the most important discussions I've ever read on here, and it's good to see it going somewhere.
I'm actually quite surprised it has gone on this long - usually these discussions are much shorter to my recollection.
The first time I (personally) ever thought about the problem of formalizing the grammar was about two years ago, when I first started with MW (around version 1.5.1). The problems then were the same as they are now, and they're the same as they're going to be onwards into the foreseeable future.
It's important to remember that MediaWiki syntax isn't a light-markup-language in the "traditional" sense. That is, unlike Markdown, Textile, APT and the like, wikitext is inextricably part of a rich infrastructure of functionality, and that infrastructure very heavily affects the grammar.

For example, language specificity (as Simetrical mentioned) would probably require that the MediaWiki grammar be a conglomerate of myriad individual grammars for various language groups.
For another example, consider #REDIRECTs. When the #REDIRECT pattern is encountered at the beginning of a page, any subsequent content is ignored (stripped at submission time). And the "output" is variable. That is, it has an effect on the system whereby the rendered output depends on the viewing context - either it redirects to another page, or renders a link thereto.
Also consider extension tags. If no extension tag has claimed a particular handle, then the angle brackets are converted into their HTML-encoded equivalents. That is, "<this>[[whatever]]</this>" becomes "&lt;this&gt;<a href=...>whatever</a>&lt;/this&gt;". On the other hand, if an extension had hooked "this", then the [[whatever]] inside may be treated as a link, plaintext, or something totally different depending on the extension's implementation.
Perhaps even more complex is the treatment of parser functions, which continue to operate within the scope of page parsing (interpreting template parameters, etc), but ultimately give the option to the implementor to conditionally disable these features. That is, although {{#this:param1|param2|param3}} would usually be parsed as a call to the 'this' parser function with three parameters, it doesn't have to be. It could be a single parameter containing "param1|param2|param3".
It may even be possible to use reserved MediaWiki template processing characters in the input. So continuing this example, say the 'this' parser function wanted all internal text to be unparsed - treated as one string. Then "{{#this:{{whatever}}" may be treated as a call to 'this' with the parameter "{{whatever". I'm not absolutely sure if this works, as I haven't tested it, but if so, then that further complicates the tokenizer.
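The tag-handling half of that is at least easy to sketch: the parser only needs to know whether something claimed the tag, and then either hand over the raw inner text or fall back to encoding the brackets. The hook array and callback shape below are invented for illustration, not MediaWiki's actual extension API:

<?php
// "Claimed vs unclaimed" extension tags, as described above. $handlers maps
// tag names to callbacks that receive the raw, unparsed inner text.
function handleExtensionTags($text, array $handlers) {
    return preg_replace_callback('/<(\w+)>(.*?)<\/\1>/s',
        function ($m) use ($handlers) {
            list(, $tag, $inner) = $m;
            if (isset($handlers[$tag])) {
                // Claimed: the extension decides what, if anything, gets parsed.
                return call_user_func($handlers[$tag], $inner);
            }
            // Unclaimed: the brackets become literal text, and the inner
            // wikitext stays in the stream to be parsed normally later.
            return '&lt;' . $tag . '&gt;' . $inner . '&lt;/' . $tag . '&gt;';
        }, $text);
}

echo handleExtensionTags('<this>[[whatever]]</this>', array(
    'this' => function ($inner) { return '<pre>' . htmlspecialchars($inner) . '</pre>'; }
)), "\n";
// <pre>[[whatever]]</pre> -- drop the 'this' handler and you get
// &lt;this&gt;[[whatever]]&lt;/this&gt; instead.

And the parser-function case is the same shape, just with {{#...}} delimiters and the extra wrinkle of deciding who splits on the pipes.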
I'm not trying to be too defeatist here, though I sincerely doubt that these kinds of infrastructural ties will be explainable via a grammar - much less one with limited lookahead and lookbehind. The best one could hope for might be to define the basic wikitext markup language, ignoring the meanings of Namespaces, templating/transclusion, extension tags and parser functions. Even then, what use is such a grammar? It probably won't help simplify the MediaWiki Parser significantly since all the ignored features would still need to be accounted for, as they would be in any other application that hopes to integrate with MW syntax (for example an external WYSIWYG editor).
In all sincerity, I wish the best of luck to anyone who attempts to fully specify the wikitext syntax. As mentioned previously, the reward for such a feat could be as many as several beers. :)
-- Jim R. Wilson (jimbojw)
Another feature is multi-language support. The meaning .
On 09/11/2007, Jim Wilson wilson.jim.r@gmail.com wrote:
Even then, what use is such a grammar? It probably won't help simplify the MediaWiki Parser significantly since all the ignored features would still need to be accounted for, as they would be in any other application that hopes to integrate with MW syntax (for example an external WYSIWYG editor).
It would allow it to be implemented in a testable manner in other languages, and hence in many other programs than MediaWiki. Other programs with cause to process wikitext presently have to reverse-engineer the parser (always, so far, bodgily) or run the actual PHP parser code. Neither is a really good idea.
- d.
On Fri, Nov 09, 2007 at 06:24:45PM +0000, David Gerard wrote:
On 09/11/2007, Jim Wilson wilson.jim.r@gmail.com wrote:
Even then, what use is such a grammar? It probably won't help simplify the MediaWiki Parser significantly since all the ignored features would still need to be accounted for, as they would be in any other application that hopes to integrate with MW syntax (for example an external WYSIWYG editor).
It would allow it to be implemented in a testable manner in other languages, and hence in many other programs than MediaWiki.
Notably, I'd like to use wikitext for users creating pages in a CMS like WebGUI, and for that, the first 90% is mostly enough.
Though bold and italic just *have* to be changed. :-)
Cheers, -- jra
On 11/9/07, Jim Wilson wilson.jim.r@gmail.com wrote:
For another example, consider #REDIRECTs. When the #REDIRECT pattern is encountered at the beginning of a page, any subsequent content is ignored (stripped at submission time). And the "output" is variable. That is, it has an effect on the system whereby the rendered output depends on the viewing context - either it redirects to another page, or renders a link thereto.
Redirects are a special case that should be handled before passing off to the parser. This kind of thing does add extra complexity, yes, but it could be incorporated into a clear specification nonetheless.
Also consider extension tags. . . . Perhaps even more complex is the treatment of parser functions . . .
Ah, but those aren't going to be part of the "main" parser. Any parser would have to go through two main passes: one recursive pass to substitute templates and parser functions, and another non-recursive pass to deal with the resulting markup. The first pass would use only a very simple grammar, but would inherently require database access and knowledge of configuration. The second pass will not need to consider any parser tags or curly-brace stuff, but will need to know the bulk of the grammar, and that's the only place that the difficulties of formal specification arise. Defining the formal grammar for the first part is all but trivial, since it only needs to parse two different constructs (curly braces, which behave very straightforwardly; and XML-ish stuff, which for the most part also does).
The best one could hope for might be to define the basic wikitext markup language, ignoring the meanings of Namespaces, templating/transclusion, extension tags and parser functions. Even then, what use is such a grammar? It probably won't help simplify the MediaWiki Parser significantly since all the ignored features would still need to be accounted for, as they would be in any other application that hopes to integrate with MW syntax (for example an external WYSIWYG editor).
The grammar is only part of what you need to know to make sense of a page, if you want to do anything other than syntax highlighting. The grammar of C is only a small part of the C standard; if we standardize and clarify the meaning of the MediaWiki parser, the actual BNF grammars will only constitute part of the resulting document.
On 11/9/07, Simetrical Simetrical+wikilist@gmail.com wrote:
Defining the formal grammar for the first part is all but trivial, since it only needs to parse two different constructs (curly braces, which behave very straightforwardly; and XML-ish stuff, which for the most part also does).
Well, I should clarify. You're correct that parser functions are free to define their own input conventions, at least to some extent. But the specification should then be that the parser itself ignores the arguments to the tag, and just passes them on unparsed to the appropriate authority.
On Fri, Nov 09, 2007 at 12:42:24PM -0500, Simetrical wrote:
On 11/9/07, Thomas Dalton thomas.dalton@gmail.com wrote:
Backwards compatibility. The main suggestion I've seen is rewriting the parser in such a way as to make it behave like the old one in everything except a few unavoidable corner cases (bold italics is not a corner case).
I would view bold italics with adjacent apostrophes as a corner case. The behavior in that case makes very little sense and I doubt it's being widely used.
I believe that all of these arguments about what to do and how to do it would be *very* well served by a) defining a list of the corner cases/pinch points and b) surveying the WMF wikis to see on how many pages they *actually* appear.
The fundamental recurring argument seems to me to be "we can't do that; we'll break too much stuff."
Data is not the plural of anecdote; we need some.
Anyone want to contribute on either point?
Cheers, -- jra
Jay Ashworth wrote:
The fundamental recurring argument seems to me to be "we can't do that; we'll break too much stuff."
The last time this came up (or maybe it was five or ten times ago; I can't keep track) I think I remember Brion stating pretty emphatically that no change to the parser could be contemplated if it broke *any* stuff. And on one level I think I agree: a cavalier change, that might break stuff and take some time to clean up after, is a very different prospect on a project with 100 pages, or even 10,000 pages, than it is on one with 2,000,000 pages.
On 09/11/2007, Steve Summit scs@eskimo.com wrote:
Jay Ashworth wrote:
The fundamental recurring argument seems to me to be "we can't do that; we'll break too much stuff."
The last time this came up (or maybe it was five or ten times ago; I can't keep track) I think I remember Brion stating pretty emphatically that no change to the parser could be contemplated if it broke *any* stuff. And on one level I think I agree: a cavalier change, that might break stuff and take some time to clean up after, is a very different prospect on a project with 100 pages, or even 10,000 pages, than it is on one with 2,000,000 pages.
So, where are we now?
* Document weird-arse constructs
* Test against odd cases
* See what would make our lives WAY easier to remove
* See how often said annoying bit is actually used, i.e. if it's important
* If not too much, is it fixable?
* goto 1
- d.
On Fri, Nov 09, 2007 at 08:05:14PM +0000, David Gerard wrote:
On 09/11/2007, Steve Summit scs@eskimo.com wrote:
Jay Ashworth wrote:
The fundamental recurring argument seems to me to be "we can't do that; we'll break too much stuff."
The last time this came up (or maybe it was five or ten times ago; I can't keep track) I think I remember Brion stating pretty emphatically that no change to the parser could be contemplated if it broke *any* stuff. And on one level I think I agree: a cavalier change, that might break stuff and take some time to clean up after, is a very different prospect on a project with 100 pages, or even 10,000 pages, than it is on one with 2,000,000 pages.
So, where are we now?
- Document weird-arse constructs
- Test against odd cases
- See what would make our lives WAY easier to remove
- See how often said annoying bit is actually used, i.e. if it's important
- If not too much, is it fixable?
- goto 1
I believe that's where we are, yes.
Installed base, for our purposes, is two things:
1) pages in the databases.
2) rules in people's heads.
The first is easy to fix, you just grind.
The second... well, I submit for your approval that in corner cases, users are either looking them up, or praying and trying again *anyway*, so you don't break anything by changing them.
That is, I suspect that //**this** wouldn't be any harder// for people to write, and in fact, quite a bit easier, and it would be *much* easier to parse. In point of fact, I suspect that on point 2 above, if we changed that from '''''this''' wouldn't be any harder'', that people would *cheer*, and not grumble.
(I, personally, think that *bold*, /italics/ and _underline_ would parse just fine, and that they wouldn't be nearly as difficult to disambig as people assert, but I've never tried to write a parser.)
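For what it's worth, the disambiguation I'm asserting is easy really does seem to be a couple of anchored regexes: only treat the marker as markup when it hugs a word on the inside and a non-word character on the outside. Purely a sketch of that claim, not a syntax proposal:

<?php
// *bold*, /italics/ and _underline_, but only when the markers wrap a word,
// so 3*4*5 and /usr/bin are left alone. Illustrative only.
function renderSimpleEmphasis($text) {
    $map = array('*' => 'b', '/' => 'i', '_' => 'u');
    foreach ($map as $marker => $tag) {
        $m = preg_quote($marker, '~');
        $text = preg_replace(
            "~(?<![\\w$m])$m(\\w(?:[^$m]*?\\w)?)$m(?![\\w$m])~",
            "<$tag>\$1</$tag>",
            $text
        );
    }
    return $text;
}

echo renderSimpleEmphasis("This is *bold* and /italic/, but 3*4*5 and /usr/bin survive.\n");
// This is <b>bold</b> and <i>italic</i>, but 3*4*5 and /usr/bin survive.

I'm sure someone can construct a case that fools it, but the point stands: it's a far smaller disambiguation problem than five flavours of apostrophe.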
Cheers, -- jra
On 09/11/2007, Jay R. Ashworth jra@baylink.com wrote:
That is, I suspect that //**this** wouldn't be any harder// for people to write, and in fact, quite a bit easier, and it would be *much* easier to parse. In point of fact, I suspect that on point 2 above, if we changed that from '''''this''' wouldn't be any harder'', that people would *cheer*, and not grumble. (I, personally, think that *bold*, /italics/ and _underline_ would parse just fine, and that they wouldn't be nearly as difficult to disambig as people assert, but I've never tried to write a parser.)
Maybe, but I think your chances of getting '' and ''' removed from MediaWiki wikitext are zero.
- d.
On Fri, Nov 09, 2007 at 08:41:34PM +0000, David Gerard wrote:
On 09/11/2007, Jay R. Ashworth jra@baylink.com wrote:
That is, I suspect that //**this** wouldn't be any harder// for people to write, and in fact, quite a bit easier, and it would be *much* easier to parse. In point of fact, I suspect that on point 2 above, if we changed that from '''''this''' wouldn't be any harder'', that people would *cheer*, and not grumble. (I, personally, think that *bold*, /italics/ and _underline_ would parse just fine, and that they wouldn't be nearly as difficult to disambig as people assert, but I've never tried to write a parser.)
Maybe, but I think your chances of getting '' and ''' removed from MediaWiki wikitext are zero.
They certainly are, if no one ever examines the corpus. I've just banged up a new server in the office, if no one else who already *has* a mirror of, say, en.wp set up steps up, I may do the testing myself, in my Copious Free Time.
I was hoping Jeff Merkey would volunteer, though. :-)
Cheers -- jra
On 11/10/07, Jay R. Ashworth jra@baylink.com wrote:
They certainly are, if no one ever examines the corpus. I've just banged up a new server in the office, if no one else who already *has* a mirror of, say, en.wp set up steps up, I may do the testing myself, in my Copious Free Time.
What are you proposing, autobotically replacing ''' with **?
Steve
On Sat, Nov 10, 2007 at 05:30:53PM +1100, Steve Bennett wrote:
On 11/10/07, Jay R. Ashworth jra@baylink.com wrote:
They certainly are, if no one ever examines the corpus. I've just banged up a new server in the office, if no one else who already *has* a mirror of, say, en.wp set up steps up, I may do the testing myself, in my Copious Free Time.
What are you proposing, autobotically replacing ''' with **?
Specifically, I was proposing defining the combinations of the current parser tokens which are difficult to interpret (primarily, combinations of bold, italics, and apostrophes), and determining how frequently they appear in the live corpus.
This will delimit the *actual* size of the Installed Base problem, in both meanings I gave it earlier. If in 2 megapages, there are only 100 occurrences, you fix them by hand. If 1000, you grind a robot. If 500K, then you take a different approach to the overall problem.
(To USAdians, this is referred to as "Dropping back 10, and punting".)
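The surveying half is a throwaway script over a pages-articles dump - something along these lines, where the regex for "difficult" apostrophe runs is only a first guess at the list of constructs we'd actually agree on:

<?php
// Count pages whose text contains apostrophe runs other than the plain
// '' / ''' / ''''' forms (i.e. exactly four, or six and more, in a row).
// Reads a MediaWiki XML dump on stdin. Rough survey tool, nothing official.
$pages = 0;
$suspicious = 0;
$reader = new XMLReader();
$reader->open('php://stdin');
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'text') {
        $wikitext = $reader->readString();
        $pages++;
        if (preg_match("/(?<!')''''(?!')|'{6,}/", $wikitext)) {
            $suspicious++;
        }
    }
}
printf("%d of %d pages contain awkward apostrophe runs\n", $suspicious, $pages);

Run it as "bzcat pages-articles.xml.bz2 | php survey.php" and you have the number that settles the Installed Base argument one way or the other.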
Cheers, -- jra
On 11/10/07, Jay R. Ashworth jra@baylink.com wrote:
Specifically, I was proposing defining the combinations of the current parser tokens which are difficult to interpret (primarily, combinations of bold, italics, and apostrophes), and determining how frequently they appear in the live corpus.
This will delimit the *actual* size of the Installed Base problem, in both meanings I gave it earlier. If in 2 megapages, there are only 100 occurrences, you fix them by hand. If 1000, you grind a robot. If 500K, then you take a different approach to the overall problem.
Ok, it's still backwards from how I would picture it:
1) Come up with a solution (ie, new parser)
2) See how many pages that solution fits, call it X%.
3) If X% is too small, either extend the parser by adding more rules, or update pages.
But this is probably just philosophy at this point: I'd rather be focussing on the grammar that we want to implement, than the grammar that we don't want to implement.
Steve
Steve Bennett wrote:
Ok, it's still backwards from how I would picture it:
- Come up with a solution (ie, new parser)
- See how many pages that solution fits, call it X%.
- If X% is too small, either extend the parser by adding more rules, or update pages.
But this is probably just philosophy at this point: I'd rather be focussing on the grammar that we want to implement, than the grammar that we don't want to implement.
Actually I think that's a good place to start. Here's my idea, which I invite you to critique:
We agree a limit for X% of articles parsed correctly, such that if > X% are correct then we say the parser is good and do some amount of hand-editing on the remaining (100-X)%. Then someone* knocks up a very rough "new wave" parser and runs a copy of, say, en: on top of it. We all try it out and see for ourselves how much stuff breaks. If too much, refine the parser and repeat. Hopefully, we eventually reach X% correctness; then we are happy and can think about how to roll it out more widely. If, however, we just can't reach X% using our optimistic two-pass approach, then we debate whether a more complex parser is necessary. If it is, the two-pass version will likely make a good basis for it, and the work is not wasted.
This plan allows us to actually do something, which is probably preferable to arguing about whether something hypothetical is doable.
Thoughts?
Soo Reams
* I would volunteer but I probably lack both skill and time.
On 11/11/07, Soo Reams soo@sooreams.com wrote:
We agree a limit for X% of articles parsed correctly, such that if > X% are correct then we say the parser is good and do some amount of hand-editing on the remaining (100-X)%. Then someone* knocks up a very
Ok, I'll bite: 99%, where "correct" means "renders sufficiently well that no one would bother editing the wikitext to fix it".
rough "new wave" parser and runs a copy of, say, en: on top of it. We
all try it out and see for ourselves how much stuff breaks. If too much, refine the parser and repeat. Hopefully, we eventually reach X% correctness; then we are happy and can think about how to roll it out
Let's think about rolling it out before we start. Let's not implement a parser and then find out that it's impractical to roll it out.
Steve
On Sun, Nov 11, 2007 at 12:57:01PM +1100, Steve Bennett wrote:
Let's think about rolling it out before we start. Let's not implement a parser and then find out that it's impractical to roll it out.
I don't think we have any choice in the matter.
The only way this will sell to Brion and Tim, much less the WMF and Wikipedia, is as a fait accompli; new parser, grammar, equivalence documentation, page grinder, and userlevel doco. All with a red ribbon.
Might not sell even then, but that's the way I see it.
Cheers, -- jra
On 11/12/07, Jay R. Ashworth jra@baylink.com wrote:
The only way this will sell to Brion and Tim, much less the WMF and Wikipedia, is as a fait accompli; new parser, grammar, equivalence documentation, page grinder, and userlevel doco. All with a red ribbon.
I'm hoping we might be able to sell it off the plan:
"If we implement a parser that renders 99% of the current corpus of wikitext correctly, and we come up with a reasonable process for rolling it out without too much disruption, would you let us do it?"
I guess the answer would be yes. I'm not sure who all the angry comments in parser.php belong to, but they weren't kidding.
Steve
On 11/11/07, Steve Bennett stevagewp@gmail.com wrote:
I'm hoping we might be able to sell it off the plan:
"If we implement a parser that renders 99% of the current corpus of wikitext correctly, and we come up with a reasonable process for rolling it out without too much disruption, would you let us do it?"
I guess the answer would be yes.
I'm guessing it would be "Sure, maybe, let's see the code first." One way to find out. :)
On 11/11/07, Steve Bennett stevagewp@gmail.com wrote:
I'm hoping we might be able to sell it off the plan:
"If we implement a parser that renders 99% of the current corpus of wikitext correctly, and we come up with a reasonable process for rolling it out without too much disruption, would you let us do it?"
If you want a list of hypothetical acceptance requirements, then I would add:
* Should render 99% of the articles in the English Wikipedia(*) identically to the current parser.
* For the 1% that doesn't render the same, provide a list of what constructs don't render the same, and an explanation of whether support for that construct is planned to be added, or whether you think it should not be supported because it's a corner-case or badly-thought-out construct, or something else.
* Should have a total runtime for rendering the entire English Wikipedia equal to or better than the total render time with the current parser.
* Should be implemented in the same language (i.e. PHP) so that any comparisons are comparing-apples-with-apples, and so that it can run on the current installed base of servers as-is. Having other implementations in other languages is fine (e.g. you could have a super-fast version in C too); just provide one in PHP that can be directly compared with the current parser for performance and backwards-compatibility.
* Should have a worst-case render time no worse than 2x slower on any given input.
* Should use as much run-time memory as the current parser or less on average, and no more than 2x more in the worst case.
* Any source code should be documented. The grammar used should be documented (since this relates to the core driving reason for implementing a new parser).
* When running parserTests, should introduce a net total of no more than (say) 2 regressions (e.g. if you break 5 parser tests, then you have to fix 3 or more parser tests that are currently broken).
(*) = I'm using the English Wikipedia here as a test corpus as it's a large enough body of work, written by enough people, that it's statistically useful when comparing average and worst-case performance and compatibility of wiki text as used by people in the real world. Any other large body of human-generated wikitext of equivalent size, with an equivalent number of authors, would do equally well for comparison purposes.
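For the render-time comparison, the harness itself is the easy bit - something like this, with oldParse() and newParse() standing in for whatever the real entry points end up being (both names made up):

<?php
// Crude wall-clock comparison of two parser implementations over the same
// set of pages. oldParse()/newParse() are placeholders, not real functions;
// memory would need separate runs, since memory_get_peak_usage() is
// cumulative for the whole process.
function timeParser($label, $parse, array $pages) {
    $start = microtime(true);
    foreach ($pages as $wikitext) {
        $parse($wikitext);
    }
    printf("%s: %.3f seconds for %d pages\n",
        $label, microtime(true) - $start, count($pages));
}

$pages = array("''Hello'' '''world'''", "* a list\n* of things");
timeParser('old', function ($t) { return strlen($t); }, $pages); // stand-in for oldParse()
timeParser('new', function ($t) { return strlen($t); }, $pages); // stand-in for newParse()

The hard part is agreeing on the corpus and the thresholds, not the measuring.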
I guess the answer would be yes.
I'm guessing it would be "Sure, maybe, let's see the code first." One way to find out. :)
If you can provide an implementation that has the above characteristics, and which has a documented grammar, then I think it's reasonable to assume that people would be willing to take a good look at that implementation.
I'm not sure who all the angry comments in parser.php belong to
svn praise includes/Parser.php | less
-- All the best, Nick.
On 11/12/07, Nick Jenkins nickpj@gmail.com wrote:
- For the 1% that doesn't render the same, provide a list of what constructs don't render the same, and an explanation of whether support for that construct is planned to be added, or whether you think it should not be supported because it's a corner-case or badly-thought-out construct, or something else.
That seems reasonable.
* Should be implemented in the same language (i.e. PHP) so that any comparisons are comparing-apples-with-apples, and so that it can run on the current installed base of servers as-is. Having other implementations in other languages is fine (e.g. you could have a super-fast version in C too); just provide one in PHP that can be directly compared with the current parser for performance and backwards-compatibility.
That condition seems bizarre. The parser is either faster or it's slower. Whether it's faster because it's implemented in C is irrelevant: it's faster.
In any case I thought it had been decided that it had to be in PHP?
* Should have a worst-case render time no worse than 2x slower on any given input.
Any given? That's not reasonable. Perhaps "Any given existing Wikipedia page"? It would be too easy to find some construct that is rendered quickly by the existing parser but is slow with the new one, then create a page that contained 5000 examples of that construct.
* Should use as much run-time memory as the current parser or less on average, and no more than 2x more in the worst case.
As above.
* Any source code should be documented. The grammar used should be documented (since this relates to the core driving reason for implementing a new parser).
Err, yes. I have to say, the current parser is very nicely written and very well commented.
* When running parserTests, should introduce a net total of no more than (say) 2 regressions (e.g. if you break 5 parser tests, then you have to fix 3 or more parser tests that are currently broken).
I'm not familiar enough with the current set of tests to comment on that.
(*) = I'm using the English Wikipedia here as a test corpus as it's a large enough body of work, written by enough people, that it's statistically useful when comparing average and worst-case performance and compatibility of wiki text as used by people in the real world. Any other large body of human-generated wikitext of equivalent size, with an equivalent number of authors, would do equally well for comparison purposes.
Ok I think that answers my concerns there.
Thanks for the feedback.
Steve
On Mon, Nov 12, 2007 at 08:35:18PM +1100, Steve Bennett wrote:
- Should be implemented in the same language (i.e. PHP) so that any comparisons are comparing-apples-with-apples, and so that it can run on the current installed base of servers as-is. Having other implementations in other languages is fine (e.g. you could have a super-fast version in C too); just provide one in PHP that can be directly compared with the current parser for performance and backwards-compatibility.
That condition seems bizarre. The parser is either faster or it's slower. Whether it's faster because it's implemented in C is irrelevant: it's faster.
In any case I thought it had been decided that it had to be in PHP?
No, I think if we can get a 20:1 speedup for a C version, they'd take it. :-0
- Should have a worst-case render time no worse than 2x slower on any given input.
Any given? That's not reasonable. Perhaps "Any given existing Wikipedia page"? It would be too easy to find some construct that is rendered quickly by the existing parser but is slow with the new one, then create a page that contained 5000 examples of that construct.
Sure; pathological cases are always possible. Let's say "on any 10 randomly chosen already extant pages of wikitext."
Cheers, -- jra
- Should be implemented in the same language (i.e. PHP) so that any comparisons are comparing-apples-with-apples, and so that it can run on the current installed base of servers as-is. Having other implementations in other languages is fine (e.g. you could have a super-fast version in C too); just provide one in PHP that can be directly compared with the current parser for performance and backwards-compatibility.
That condition seems bizarre. The parser is either faster or it's slower. Whether it's faster because it's implemented in C is irrelevant: it's faster.
In any case I thought it had been decided that it had to be in PHP?
No, I think if we can get a 20:1 speedup for a C version, they'd take it. :-0
I don't doubt it in the case of most large wiki farms - but numerically most installations of MediaWiki are on small wikis, probably running on shared hosts, and in those situations using a C-based parser is either not possible, or significantly more complicated than running a PHP script. So for those installs, if the speed of a PHP parser suddenly gets much worse, then I expect those admins would complain. So whilst a faster parser is a faster parser, if it requires running code that you can't run, then it ain't going to do you much good. A custom super-fast wiki-farm parser is great, but the general-case parser should have similar performance characteristics and the same software requirements (i.e. the test is that nobody should be noticeably worse off).
- Should have a worst-case render time no worse than 2x slower on any given input.
Any given? That's not reasonable. Perhaps "Any given existing Wikipedia page"? It would be too easy to find some construct that is rendered quickly by the existing parser but is slow with the new one, then create a page that contained 5000 examples of that construct.
Sure; pathological cases are always possible. Let's say "on any 10 randomly chosen already extant pages of wikitext."
The current parser (from my perspective) seems to cope quite well with malformed input. So all I'm saying is that if a replacement parser could behave similarly then that would be good - although I take your point that the input that is considered pathological could be different for different parsers, so let's say that the render time on randomly generated malformed input should be equivalent on average.
The English Wikipedia does provide an excellent environment for testing the English language. It does not do the same for other languages. Remember that MediaWiki supports over 250 languages?
Indeed - it's only intended as a test for performance and most functionality. For a more complete compatibility test with a variety of languages, you'd probably need to test against all the database dumps at: http://download.wikimedia.org/
- When running parserTests, should introduce a net total of no more than (say) 2 regressions (e.g. if you break 5 parser tests, then you have to fix 3 or more parser tests that are currently broken).
I'm not familiar enough with the current set of tests to comment on that.
The core tests are in maintenance/parserTests.txt ( http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/maintenance/parserTes... ) and generally follow a structure with name of the test, wiki text input, and the expected XHTML output, for example:
!! test
Preformatted text
!! input
 This is some
 Preformatted text
 With ''italic''
 And '''bold'''
 And a [[Main Page|link]]
!! result
<pre>This is some
Preformatted text
With <i>italic</i>
And <b>bold</b>
And a <a href="/wiki/Main_Page" title="Main Page">link</a>
</pre>
!! end
It's probably a pretty good place to start with writing a parser, in terms of what the expected behaviour is. Then probably after that comes testing against user-generated input versus the current parser.
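The block format is simple enough that reading it yourself is a small job - roughly this (a sketch of the !! block structure shown above; the canonical runner is maintenance/parserTests.php):

<?php
// Minimal reader for the parserTests.txt block format: sections start with
// "!! test", "!! input", "!! result" and a case is closed by "!! end".
// Leading whitespace in the input is significant (e.g. preformatted text),
// so nothing is trimmed. Sketch only.
function readParserTests($filename) {
    $tests = array();
    $current = array();
    $section = null;
    foreach (file($filename) as $line) {
        if (preg_match('/^!!\s*(\w+)/', $line, $m)) {
            $keyword = strtolower($m[1]);
            if ($keyword === 'end') {
                $tests[] = $current;
                $current = array();
                $section = null;
            } else {
                $section = $keyword;
                $current[$section] = '';
            }
        } elseif ($section !== null) {
            $current[$section] .= $line;
        }
    }
    return $tests;   // each entry has 'test', 'input' and 'result' keys
}

foreach (readParserTests('maintenance/parserTests.txt') as $case) {
    // feed $case['input'] to the candidate parser, diff against $case['result']
}

That plus a dump of real pages covers both halves of "does it behave" and "does it behave like the old one".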
-- All the best, Nick.
You're right Steve - I missed the pound symbol in my retort.
But, some good came out of it. These two constructs produce identical HTML:
;#what does: this render?
;#how about this?
;#or this?
;what does
:# this render
:# how about this?
:# or this?
So we conclude that:
;#A: B
is shorthand for:
;A
:# B
-- Jim
On 11/13/07, Jim Wilson wilson.jim.r@gmail.com wrote:
So we conclude that:
;#A: B
is shorthand for:
;A :# B
Yeah, see http://bugzilla.wikipedia.org/show_bug.cgi?id=11894 for the full range of craziness, if you're interested. IMHO, ;definition:term is a dud syntactic element. It only works for a narrow range of constructions, and its actual uses are pretty limited. We'd be better off getting rid of it and just making ;definition <newline>:definition the norm.
Steve
Hoi,
The English Wikipedia does provide an excellent environment for testing the English language. It does not do the same for other languages. Remember that MediaWiki supports over 250 languages?
We do know that the current parser makes it impossible to write Neapolitan words like l'' improve this is because '' gives you italic. In my opinion the parser should not be in the way of people writing the language as is usual for that language.
Thanks, GerardM
On Nov 12, 2007 6:49 AM, Nick Jenkins nickpj@gmail.com wrote:
On 11/11/07, Steve Bennett stevagewp@gmail.com wrote:
I'm hoping we might be able to sell it off the plan:
"If we implement a parser that renders 99% of the current corpus of
wikitext
correctly, and we come up with a reasonable process for rolling it out without too much disruption, would you let us do it?"
If you want a list of hypothetical acceptance requirements, then I would add:
- Should render 99% of the articles in the English Wikipedia(*) identically to the current parser.
- For the 1% that doesn't render the same, provide a list of what constructs don't render the same, and an explanation of whether support for that construct is planned to be added, or whether you think it should not be supported because it's a corner-case or badly-thought-out construct, or something else.
- Should have a total runtime for rendering the entire English Wikipedia equal to or better than the total render time with the current parser.
- Should be implemented in the same language (i.e. PHP) so that any comparisons are comparing apples with apples, and so that it can run on the current installed base of servers as-is. Having other implementations in other languages is fine (e.g. you could have a super-fast version in C too), just provide one in PHP that can be directly compared with the current parser for performance and backwards-compatibility.
- Should have a worst-case render time no worse than 2x slower on any given input.
- Should use as much run-time memory as the current parser or less on average, and no more than 2x more in the worst case.
- Any source code should be documented. The grammar used should be documented (since this relates to the core driving reason for implementing a new parser).
- When running parserTests, it should introduce a net total of no more than (say) 2 regressions (e.g. if you break 5 parser tests, then you have to fix 3 or more parser tests that are currently broken).
(*) = I'm using the English Wikipedia here as a test corpus as it's a large enough body of work, written by enough people, that it's statistically useful when comparing average and worst-case performance and compatibility of wiki text as used by people in the real world. Any other large body of human-generated wikitext of equivalent size, with an equivalent number of authors, would do equally well for comparison purposes.
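The 99% figure above is measurable with a fairly small harness. A sketch, assuming hypothetical renderOld() / renderNew() wrappers around the two parsers and a getPageTexts() iterator over a dump (none of these are real MediaWiki functions):
--
<?php
// Sketch of a corpus comparison: count how many pages render differently
// under the old and new parsers. renderOld(), renderNew() and
// getPageTexts() are placeholders for the purposes of illustration.
$total = 0;
$different = array();
foreach ( getPageTexts() as $title => $wikitext ) {
    $total++;
    if ( renderOld( $wikitext ) !== renderNew( $wikitext ) ) {
        $different[] = $title;
    }
}
printf( "%d of %d pages (%.2f%%) render differently\n",
    count( $different ), $total,
    100 * count( $different ) / max( 1, $total ) );
--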
I guess the answer would be yes.
I'm guessing it would be "Sure, maybe, let's see the code first." One way to find out. :)
If you can provide an implementation that has the above characteristics, and which has a documented grammar, then I think it's reasonable to assume that people would be willing to take a good look at that implementation.
I'm not sure who all the angry comments in parser.php belong to
svn praise includes/Parser.php | less
-- All the best, Nick.
On 11/12/07, GerardM gerard.meijssen@gmail.com wrote:
We do know that the current parser makes it impossible to write Neapolitan words like l''improve; this is because '' gives you italic. In my opinion the parser should not be in the way of people writing the language as is usual for that language.
Looking through the code, I find this weird. There is already an extension to parse link trails backwards for Arabic words (if I understand correctly, al[[Razi]] is rendered <alRazi>). It would seem straightforward to alter the grammar for that installation so that '' wasn't rendered as italics, and so that there was another way of doing italics.
Of course it's not ideal that different Wikipedias use different dialects of wikitext, but the '' problem is pretty severe.
Steve
Hoi, Wikipedia is not the only project using MediaWiki .. :) The thing that should trigger behaviour is the tag that indicates that a text is in a particulary language. This means that the behaviour should be available whenever a special service is required. Thanks, Gerard
On Nov 12, 2007 2:13 PM, Steve Bennett stevagewp@gmail.com wrote:
On 11/12/07, GerardM gerard.meijssen@gmail.com wrote:
We do know that the current parser makes it impossible to write Neapolitan words like l''improve; this is because '' gives you italic. In my opinion the parser should not be in the way of people writing the language as is usual for that language.
Looking through the code, I find this weird. There is already an extension to parse link trails backwards for Arabic words (if I understand correctly, al[[Razi]] is rendered <alRazi>). It would seem straightforward to alter the grammar for that installation so that '' wasn't rendered as italics, and so that there was another way of doing italics.
Of course it's not ideal that different Wikipedias use different dialects of wikitext, but the '' problem is pretty severe.
Steve
On Mon, Nov 12, 2007 at 11:30:28AM +0100, GerardM wrote:
We do know that the current parser makes it impossible to write Neapolitan words like l''improve; this is because '' gives you italic. In my opinion the parser should not be in the way of people writing the language as is usual for that language.
As I noted in another thread on this topic, we're thinking about that -- to the extent we can. If you know of other languages with specific punctuation requirements that interact here, encourage them to get on the Parser Practicum thread.
Cheers, -- jra
No one knows how ;#foo:blaa renders, so no one is put out by us changing it.
;I know what it renders: and fyi, I use that construct all the time
-- Jim R. Wilson (jimbojw)
On Nov 12, 2007 9:56 AM, Jay R. Ashworth jra@baylink.com wrote:
On Mon, Nov 12, 2007 at 11:30:28AM +0100, GerardM wrote:
We do know that the current parser makes it impossible to write Neapolitan words like l''improve; this is because '' gives you italic. In my opinion the parser should not be in the way of people writing the language as is usual for that language.
As I noted in another thread on this topic, we're thinking about that -- to the extent we can. If you know of other languages with specific punctuation requirements that interact here, encourage them to get on the Parser Practicum thread.
Cheers,
-- jra
Jay R. Ashworth Baylink jra@baylink.com Designer The Things I Think RFC 2100 Ashworth & Associates http://baylink.pitas.com '87 e24 St Petersburg FL USA http://photo.imageinc.us +1 727 647 1274
On 11/13/07, Jim Wilson wilson.jim.r@gmail.com wrote:
No one knows how ;#foo:blaa renders, so no one is put out by us changing it.
;I know what it renders: and fyi, I use that construct all the time
Really? You use ;#term:definition - not just ;term:definition? Heh.
Steve
On Mon, Nov 12, 2007 at 12:50:51PM +1100, Steve Bennett wrote:
On 11/12/07, Jay R. Ashworth jra@baylink.com wrote:
The only way this will sell to Brion and Tim, much less the WMF and Wikipedia, is as a fait accompli; new parser, grammar, equivalence documentation, page grinder, and userlevel doco. All with a red ribbon.
I'm hoping we might be able to sell it off the plan:
"If we implement a parser that renders 99% of the current corpus of wikitext correctly, and we come up with a reasonable process for rolling it out without too much disruption, would you let us do it?"
I guess the answer would be yes. I'm not sure who all the angry comments in parser.php belong to, but they weren't kidding.
I have the (in this case) distinct advantage of never having seen that code. ;-)
Cheers, -- jra
Steve Bennett wrote:
"If we implement a parser that renders 99% of the current corpus of wikitext correctly, and we come up with a reasonable process for rolling it out without too much disruption, would you let us do it?"
I guess the answer would be yes. I'm not sure who all the angry comments in parser.php belong to, but they weren't kidding.
I think probably the new parser would also have to be faster than the old one.
On 11/16/07, Mark Jaroski mark@geekhive.net wrote:
I think probably the new parser would also have to be faster than the old one.
I think it's more important that it not be slower. But obviously at this stage of the process I couldn't even hazard a guess as to what the performance would be like. I suspect it's easier to fine-tune a grammar for performance than it is a bunch of regexes.
Steve
That is, I suspect that //**this** wouldn't be any harder// for people to write, and in fact, quite a bit easier, and it would be *much* easier to parse. In point of fact, I suspect that on point 2 above, if
As an experiment, I added support for // and **:
Replace:
$outtext .= $this->doQuotes ( $line ) . "\n";
with:
--
$line = $this->doQuotes ( $line );
$line = preg_replace( "#//(.*?)//#", "<I>$1</I>", $line );
$line = preg_replace( "/\*\*(.*?)\*\*/", "<B>$1</B>", $line );
$outtext .= $line . "\n";
--
Whoever came up with '' / ''' made a *lot* of work for themselves.
Steve PS In that implementation, //mismatching **bold//and** italics seems to work ok. But the output is not correct HTML...
Steve Bennett wrote:
As an experiment, I added support for // and **:
Replace:
$outtext .= $this->doQuotes ( $line ) . "\n";
with:
--
$line = $this->doQuotes ( $line );
$line = preg_replace( "#//(.*?)//#", "<I>$1</I>", $line );
$line = preg_replace( "/\*\*(.*?)\*\*/", "<B>$1</B>", $line );
$outtext .= $line . "\n";
--
Whoever came up with '' / ''' made a *lot* of work for themselves.
Steve PS In that implementation, //mismatching **bold//and** italics seems to work ok. But the output is not correct HTML...
Then it's wrong. You see it "working" just because your browser html parser is better than your syntax parser. Try sending the pages as strict xhtml. ;)
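One way a token-based replacement could keep the output well-formed even when ** and // are interleaved is to track open tags on a stack and close/reopen them as needed. A minimal sketch of that technique (an illustration only, not proposed MediaWiki code):
--
<?php
// Produce well-formed output for ** and // even when the markers overlap:
// when closing a tag that is not on top of the stack, close the inner tags
// first and reopen them afterwards.
function renderInline( $text ) {
    $map = array( '**' => 'b', '//' => 'i' );
    $out = '';
    $open = array();    // stack of currently open tags
    $parts = preg_split( '/(\*\*|\/\/)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE );
    foreach ( $parts as $part ) {
        if ( !isset( $map[$part] ) ) {
            $out .= htmlspecialchars( $part );
            continue;
        }
        $tag = $map[$part];
        if ( !in_array( $tag, $open ) ) {
            $open[] = $tag;
            $out .= "<$tag>";
        } else {
            // Close everything above $tag, close $tag, then reopen the rest.
            $reopen = array();
            while ( ( $top = array_pop( $open ) ) !== $tag ) {
                $out .= "</$top>";
                $reopen[] = $top;
            }
            $out .= "</$tag>";
            foreach ( array_reverse( $reopen ) as $t ) {
                $open[] = $t;
                $out .= "<$t>";
            }
        }
    }
    while ( $t = array_pop( $open ) ) {   // close anything left unbalanced
        $out .= "</$t>";
    }
    return $out;
}

// "//mismatching **bold//and** italics"
// -> "<i>mismatching <b>bold</b></i><b>and</b> italics"
echo renderInline( '//mismatching **bold//and** italics' );
--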
On 11/10/07, Jay R. Ashworth jra@baylink.com wrote:
- rules in people's heads.
The first is easy to fix, you just grind.
The second... well, I submit for your approval that in corner cases, users are either looking them up, or praying and trying again *anyway*, so you don't break anything by changing them.
No one knows how ''''foo''' renders, so no one is put out by us changing it.
No one knows how ;#foo:blaa renders, so no one is put out by us changing it.
Etc.
That is, I suspect that //**this** wouldn't be any harder// for people
to write, and in fact, quite a bit easier, and it would be *much* easier to parse. In point of fact, I suspect that on point 2 above, if
That is ridiculously readable. I know it's just an arbitrary example, but it's extremely easy to know exactly what you meant. Which sick individual ever came up with '''''this''' crazy'' syntax anyway?
we changed that from '''''this''' wouldn't be any harder'', that people
would *cheer*, and not grumble.
*Steve cheers*.
(I, personally, think that *bold*, /italics/ and _underline_ would
parse just fine, and that they wouldn't be nearly as difficult to disambig as people assert, but I've never tried to write a parser.)
We don't really need underline. And I don't agree: people use * and / all the time in normal text, whereas ** and // are almost unheard of. Yes, if people are going to quote C comments, they'll have to escape it, but that's basically the case now anyway with an empty string in Pascal etc. Let's not be biased towards quoting source code. And let's minimise the amount of escaping needed.
Steve
On Sat, Nov 10, 2007 at 05:29:11PM +1100, Steve Bennett wrote:
On 11/10/07, Jay R. Ashworth jra@baylink.com wrote:
- rules in people's heads.
The first is easy to fix, you just grind.
The second... well, I submit for your approval that in corner cases, users are either looking them up, or praying and trying again *anyway*, so you don't break anything by changing them.
No one knows how ''''foo''' renders, so no one is put out by us changing it.
I know, just from conversation on this list today: it will render as a boldfaced foo, preceded by an apostrophe.
No one knows how ;#foo:blaa renders, so no one is put out by us changing it.
But your point here is pretty close to the one I was trying to make.
That is, I suspect that //**this** wouldn't be any harder// for people to write, and in fact, quite a bit easier, and it would be *much* easier to parse. In point of fact, I suspect that on point 2 above, if
That is ridiculously readable. I know it's just an arbitrary example, but it's extremely easy to know exactly what you meant. Which sick individual ever came up with '''''this''' crazy'' syntax anyway?
Someone whose legacy I wish to demolish, yes.
we changed that from '''''this''' wouldn't be any harder'', that people would *cheer*, and not grumble.
*Steve cheers*.
You were grumbling two messages ago.... :-)
(I, personally, think that *bold*, /italics/ and _underline_ would parse just fine, and that they wouldn't be nearly as difficult to disambig as people assert, but I've never tried to write a parser.)
We don't really need underline. And I don't agree: people use * and / all the time in normal text, whereas ** and // are almost unheard of. Yes, if people are going to quote C comments, they'll have to escape it, but that's basically the case now anyway with an empty string in Pascal etc. Let's not be biased towards quoting source code. And let's minimise the amount of escaping needed.
Conceded. And, of course __this__ is already in use, anyway.
Cheers, -- jra
We don't really need underline. And I don't agree: people use * and / all the time in normal text, whereas ** and // are almost unheard of. Yes, if people are going to quote C comments, they'll have to escape it, but that's basically the case now anyway with an empty string in Pascal etc. Let's not be biased towards quoting source code. And let's minimise the amount of escaping needed.
// appears in every single URL. It might be possible to have the parser recognise urls and automatically escape the token, but it's a definite complication.
On Sat, Nov 10, 2007 at 03:32:39PM +0000, Thomas Dalton wrote:
We don't really need underline. And I don't agree: people use * and / all the time in normal text, whereas ** and // are almost unheard of. Yes, if people are going to quote C comments, they'll have to escape it, but that's basically the case now anyway with an empty string in Pascal etc. Let's not be biased towards quoting source code. And let's minimise the amount of escaping needed.
// appears in every single URL. It might be possible to have the parser recognise urls and automatically escape the token, but it's a definite complication.
No it's not. Or at least, not much of one. If it immediately follows \W[a-z]+: then you assume it's part of a URL. That will make it difficult to impossible to italicize the domain name in a URL, but you know? I can live with that. :-)
Cheers, -- jra
// appears in every single URL. It might be possible to have the parser recognise urls and automatically escape the token, but it's a definite complication.
No it's not. Or at least, not much of one. If it immediately follows \W[a-z]+: then you assume it's part of a URL. That will make it difficult to impossible to italicize the domain name in a URL, but you know? I can live with that. :-)
It's a complication, just a resolvable one. Part of the idea of sorting out the grammar is to make the syntax make sense and not be full of special cases where the parser makes it up as it goes along. In this case, it may well be worth making the very safe guess, but it's a complication that has to be considered. (You could still italicise a domain name: http:////www.domain.com///index.html The only question is whether or not the / between the domain name and the file name is in italics or not...)
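A sketch in the spirit of Jay's heuristic: shield anything that looks like scheme:// behind placeholders, apply the italics rule, then restore. The URL pattern and the \x01 placeholders are assumptions for illustration, not the real parser's link recognition:
--
<?php
// Treat // as italics only outside things that look like URLs.
function italicsOutsideUrls( $text ) {
    // 1. Hide anything that looks like scheme://... behind placeholders.
    preg_match_all( '!\b[a-z][a-z0-9+.-]*://\S+!i', $text, $m );
    foreach ( $m[0] as $i => $url ) {
        $text = str_replace( $url, "\x01$i\x01", $text );
    }
    // 2. With URLs out of the way, // can safely mean italics.
    $text = preg_replace( '!//(.*?)//!', '<i>$1</i>', $text );
    // 3. Put the URLs back.
    foreach ( $m[0] as $i => $url ) {
        $text = str_replace( "\x01$i\x01", $url, $text );
    }
    return $text;
}

echo italicsOutsideUrls( 'See http://example.org/a//b and //this// is italic.' );
// -> See http://example.org/a//b and <i>this</i> is italic.
--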
On Nov 11, 2007 2:32 AM, Thomas Dalton thomas.dalton@gmail.com wrote:
We don't really need underline. And I don't agree: people use * and / all the time in normal text, whereas ** and // are almost unheard of. Yes, if people are going to quote C comments, they'll have to escape it, but that's basically the case now anyway with an empty string in Pascal etc. Let's not be biased towards quoting source code. And let's minimise the amount of escaping needed.
// appears in every single URL. It might be possible to have the parser recognise urls and automatically escape the token, but it's a definite complication.
Why would we replace apostrophes with some other markup for *both* bold and italics? Why not just have say, asterisks for bold and leave apostrophes for italics?
On 11/11/07, Stephen Bain stephen.bain@gmail.com wrote:
Why would we replace apostrophes with some other markup for *both* bold and italics? Why not just have say, asterisks for bold and leave apostrophes for italics?
If we're *replacing* (err, **replacing**) them, you're right. If we're
supplementing then we would want new symbols for both.
If ''' is replaced by **, then ''''' renders as an italic apostrophe. If ''' is still bold but supplemented by **, then ''''' renders as bold-italics.
I actually quite like the idea of just supplementing, and possibly toning down some of the arcane apostrophe disambiguation. In case anyone hasn't seen the code, it's about 150 lines to process '' and '''. It's one line each to process // and **. We could probably reduce the 150 to 30 or so by only handling simple cases and forcing people to use // and ** for anything more complex.
The complex cases don't seem to work very well anyway (evidenced by people writing a bold apostrophe as ''' ' ''' and using a stylised apostrophe for L''''arc de triomphe''', and elsewhere in my cursory glance). So we'd lose nothing, and gain something by having a simple, unambiguous syntax **in addition** to the current syntax.
**'** is the apostrophe character. It could also refer to... L'**arc de triomphe** signifie...
(It must also be said that if people simply wanted an unambiguous way of saying bold, they could just make a template: {{b|'}} is the..., L'{{b|arc de triomphe}}.... )
Steve
On Nov 11, 2007 6:11 AM, Steve Bennett stevagewp@gmail.com wrote:
(It must also be said that if people simply wanted an unambiguous way of saying bold, they could just make a template: {{b|'}} is the..., L'{{b|arc de triomphe}}.... )
Or just use XHTML: L'<strong>arc de triomphe</strong>
On 11/11/07, Minute Electron minuteelectron@googlemail.com wrote:
Or just use XHTML: L'<strong>arc de triomphe</strong>
Or for that matter, <I>. But we have to assume there is some value in
having our own special syntax. Or we wouldn't have one. Or something.
Steve
On Sun, Nov 11, 2007 at 11:02:48PM +1100, Steve Bennett wrote:
On 11/11/07, Minute Electron minuteelectron@googlemail.com wrote:
Or just use XHTML: L'<strong>arc de triomphe</strong>
Or for that matter, <I>. But we have to assume there is some value in having our own special syntax. Or we wouldn't have one. Or something.
There are two possible values: typing angle brackets is hard and geeky.
No, really. Ask around: it's one notch scarier.
And secondly, we aren't a *native* HTML environment, mostly to avoid <P>, which makes *me* run screaming into the night every time I post a blog entry...
I was going to observe that it makes it easier to lock down dangerous HTML, but it doesn't really.
Cheers, -- jra
Installed base, for our purposes, is two things:
pages in the databases.
rules in people's heads.
The first is easy to fix, you just grind.
Grind what? Is there an exhaustive list of all MediaWiki installs somewhere? I very much doubt it. We can test against WMF and Wikia databases, and maybe a handful of others, but there's no way we can check every install.
On Sat, Nov 10, 2007 at 03:34:17PM +0000, Thomas Dalton wrote:
Installed base, for our purposes, is two things:
pages in the databases.
rules in people's heads.
The first is easy to fix, you just grind.
Grind what? Is there an exhaustive list of all MediaWiki installs somewhere? I very much doubt it. We can test against WMF and Wikia databases, and maybe a handful of others, but there's no way we can check every install.
Sure there isn't.
As I've pointed out myself, several times, it's really easy to come up with reasons why parser/markup reform is a *bad* idea, that *won't* work.
But there are reasonable solutions for almost every posited problem, for suitable values of reasonable.
Cheers, -- jra
On Sat, Nov 10, 2007 at 04:04:00PM +0000, Thomas Dalton wrote:
But there are reasonable solutions for almost every posited problem, for suitable values of reasonable.
It all boils down the deciding what values of "reasonable" are suitable.
You bet, which is why I hope someone better suited to the task than I volunteers to scan the corpus to round up some non-anecdotal data on construct usage.
Cheers, -- jra
On 11/10/07, Steve Summit scs@eskimo.com wrote:
The last time this came up (or maybe it was five or ten times ago; I can't keep track) I think I remember Brion stating pretty emphatically that no change to the parser could be contemplated if it broke *any* stuff. And on one level I think I agree: a cavalier change, that might break stuff and take some time to clean up after, is a very different prospect on a project with 100 pages, or even 10,000 pages, than it is on one with 2,000,000 pages.
That's why I'd like input from Brion and/or other "senior developers" on
ideas like a gradual transition from one parser to another. The current grammar has a lot of fat we don't need. Let's trim it.
Steve
Steve Summit wrote:
Jay Ashworth wrote:
The fundamental recurring argument seems to me to be "we can't do that; we'll break too much stuff."
The last time this came up (or maybe it was five or ten times ago; I can't keep track) I think I remember Brion stating pretty emphatically that no change to the parser could be contemplated if it broke *any* stuff.
That's obviously an exaggeration... after all, our existing parser breaks some stuff itself. ;)
But if we do break additional things, we should be careful about what we break, considering why it's broken, if it was wrong in the first place, and what the impact would be. Obviously any major parser change needs to be well tested against the existing tests and a lot of live content to see where the differences in behavior are.
And on one level I think I agree: a cavalier change, that might break stuff and take some time to clean up after, is a very different prospect on a project with 100 pages, or even 10,000 pages, than it is on one with 2,000,000 pages.
*nod*
Steve Bennett wrote:
That's why I'd like input from Brion and/or other "senior developers" on ideas like a gradual transition from one parser to another. The current grammar has a lot of fat we don't need. Let's trim it.
Not going to happen. :) A new parser will need to be a drop-in which works as correctly as possible.
While that likely would mean changing some corner-case behavior (as noted above, the existing parser doesn't always do what's desired), it would not be a different *syntax* from the human perspective.
(Though a computer scientist would consider it distinct as it wouldn't be identical.)
-- brion vibber (brion @ wikimedia.org)
On Tue, Nov 13, 2007 at 03:12:24PM -0500, Brion Vibber wrote:
Steve Summit wrote:
Jay Ashworth wrote:
The fundamental recurring argument seems to me to be "we can't do that; we'll break too much stuff."
The last time this came up (or maybe it was five or ten times ago; I can't keep track) I think I remember Brion stating pretty emphatically that no change to the parser could be contemplated if it broke *any* stuff.
That's obviously an exaggeration... after all, our existing parser breaks some stuff itself. ;)
But if we do break additional things, we should be careful about what we break, considering why it's broken, if it was wrong in the first place, and what the impact would be. Obviously any major parser change needs to be well tested against the existing tests and a lot of live content to see where the differences in behavior are.
Certainly.
I thought I'd suggested that in pretty nauseating detail, in... oh, one of these five threads. ;-)
Steve Bennett wrote:
That's why I'd like input from Brion and/or other "senior developers" on ideas like a gradual transition from one parser to another. The current grammar has a lot of fat we don't need. Let's trim it.
Not going to happen. :) A new parser will need to be a drop-in which works as correctly as possible.
While that likely would mean changing some corner-case behavior (as noted above, the existing parser doesn't always do what's desired), it would not be a different *syntax* from the human perspective.
Well, the fundamental point of these five threads was *there is no formal specification*. That means that any replacement has to *exactly match* the behaviour of the current code.
If that's not what we want, then we need to decide on a specced behavior for wikitext, and we can't back one out of the current parser, since as you note, there are already corner cases.
(Though a computer scientist would consider it distinct as it wouldn't be identical.)
Computer scientists? Here?
Cheers, -- jra
On 14/11/2007, Jay R. Ashworth jra@baylink.com wrote:
Well, the fundamental point of these five threads was *there is no formal specification*. That means that any replacement has to *exactly match* the behaviour of the current code. If that's not what we want, then we need to decide on a specced behavior for wikitext, and we can't back one out of the current parser, since as you note, there are already corner cases.
We can iteratively approach it. "Here's 99%. What's the breakage level?" "This, this and this need to work." *fixfixfix*
- d.
On 11/14/07, David Gerard dgerard@gmail.com wrote:
We can iteratively approach it. "Here's 99%. What's the breakage level?" "This, this and this need to work." *fixfixfix*
Yep. Except that the 99% we've been bandying about is effectively the "non-breakage" level - the proportion of corpus pages that render correctly. As opposed to the portion of the imaginary grammar we think we've implemented.
Well, I'm chugging along here on the grammar, feeling pretty optimistic. The current parser can certainly do crazy things, but if we accept that those crazy things are not for the most part useful, and that no one will mind if they go away, the task becomes easier.
It's interesting to note the language features that were actually difficult to implement in the pattern-transform parser.
For example, __TOC__ is not particularly straightforward:
1. search for the first occurrence
2. replace it with a special token
3. remove all other occurrences
...
4. find the special token and replace it with the actual contents.
Dealing with <nowiki> is even more tortured.
However, in a consume-parse-render parser, it's simpler:
1. Find the token just like any other token.
2. If the "magicword-TOC-found" flag is already set, skip further processing.
3. Set the magicword-TOC-found flag.
4. Store the token as normal.
Then at render time:
1. When you hit the token, write out the contents. Presumably you already have enough information to do this.
So not all of the parser conversion is bad news.
Steve
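A minimal sketch of the flag-based __TOC__ handling just described; the class and method names are illustrative, not real MediaWiki code:
--
<?php
// Consume-parse-render handling of __TOC__: keep only the first occurrence
// as a token, then substitute the actual contents at render time.
class TocAwareTokenizer {
    private $tokens = array();
    private $tocSeen = false;

    public function addMagicWord( $word ) {
        if ( $word === 'TOC' ) {
            if ( $this->tocSeen ) {
                return;                // later __TOC__ occurrences are dropped
            }
            $this->tocSeen = true;
        }
        $this->tokens[] = array( 'type' => 'magicword', 'value' => $word );
    }

    public function addText( $text ) {
        $this->tokens[] = array( 'type' => 'text', 'value' => $text );
    }

    public function render( $tocHtml ) {
        $out = '';
        foreach ( $this->tokens as $token ) {
            if ( $token['type'] === 'magicword' && $token['value'] === 'TOC' ) {
                $out .= $tocHtml;      // contents are known by render time
            } else {
                $out .= $token['value'];
            }
        }
        return $out;
    }
}
--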
On Wed, Nov 14, 2007 at 08:27:16PM +1100, Steve Bennett wrote:
On 11/14/07, David Gerard dgerard@gmail.com wrote:
We can iteratively approach it. "Here's 99%. What's the breakage level?" "This, this and this need to work." *fixfixfix*
Yep. Except that the 99% we've been bandying about is effectively the "non-breakage" level - the proportion of corpus pages that render correctly. As opposed to the portion of the imaginary grammar we think we've implemented.
Yes. Precisely. Or at least "that's also how I interpret Brion's disinclination towards the topic". ;-)
Well, I'm chugging along here on the grammar, feeling pretty optimistic. The current parser can certainly do crazy things, but if we accept that those crazy things are not for the most part useful, and that no one will mind if they go away, the task becomes easier.
You go, boy. :-)
It's interesting to note the language features that were actually difficult to implement in the pattern-transform parser.
For example, __TOC__ is not particularly straightforward:
1. search for the first occurrence
2. replace it with a special token
3. remove all other occurrences
...
4. find the special token and replace it with the actual contents.
Dealing with <nowiki> is even more tortured.
I'll bet.
However, in a consume-parse-render parser, it's simpler:
1. Find the token just like any other token.
2. If the "magicword-TOC-found" flag is already set, skip further processing.
3. Set the magicword-TOC-found flag.
4. Store the token as normal.
Well, that's how *my* head said to do it as you wrote the other out.
Then at render time:
1. When you hit the token, write out the contents. Presumably you already have enough information to do this.
So not all of the parser conversion is bad news.
Indeed.
Cheers, -- jra
On 11/10/07, Jay R. Ashworth jra@baylink.com wrote:
I believe that all of these arguments about what to do and how to do it would be *very* well served by a) defining a list of the corner cases/pinch points and b) surveying the WMF wikis to see on how many pages they *actually* appear.
So:
1) Carefully compile a list of stuff we don't want
2) Test the validity of that list
3) Throw the list out, since we don't need to implement it.
It would be better to:
1) Define a subset of the grammar that we think is useful
2) See how many pages correspond to that grammar
3) Implement a parser for that grammar.
Anyone want to contribute on either point?
I'm very willing to help with (my) step 1. And probably 3. Which, to
answer your question means "not me", iiuc.
Steve
On Sat, Nov 10, 2007 at 05:23:17PM +1100, Steve Bennett wrote:
On 11/10/07, Jay R. Ashworth jra@baylink.com wrote:
I believe that all of these arguments about what to do and how to do it would be *very* well served by a) defining a list of the corner cases/pinch points and b) surveying the WMF wikis to see on how many pages they *actually* appear.
So:
- Carefully compile a list of stuff we don't want
- Test the validity of that list
- Throw the list out, since we don't need to implement it.
You're being... *purposefully* obtuse? :-)
It would be better to:
- Define a subset of the grammar that we think is useful
- See how many pages correspond to that grammar
- Implement a parser for that grammar.
But that won't make the sale necessary to implement 4: deploy that parser on WMF.
Cheers, -- jra
On 11/10/07, Simetrical Simetrical+wikilist@gmail.com wrote:
I would view bold italics with adjacent apostrophes as a corner case. The behavior in that case makes very little sense and I doubt it's being widely used.
There's one obvious use: French. Eg: L''''arc de triomphe de l'Étoile''' appelé...
(though interestingly when I got to that page, the initial ' was actually a stylised apostrophe ')
But yes, clearer, unambiguous syntax that didn't rely on arbitrary processing rules would be good.
For example one might define a character like _ which represents a non-joining non-space. So the above could be written L'_'''arc de triomphe''' which would be clearer and unambiguous.
Steve
On Sat, Nov 10, 2007 at 01:04:37PM +1100, Steve Bennett wrote:
On 11/10/07, Simetrical Simetrical+wikilist@gmail.com wrote:
I would view bold italics with adjacent apostrophes as a corner case. The behavior in that case makes very little sense and I doubt it's being widely used.
There's one obvious use: French. Eg: L''''arc de triomphe de l'Étoile''' appelé...
(though interestingly when I got to that page, the initial ' was actually a stylised apostrophe ')
But yes, clearer, unambiguous syntax that didn't rely on arbitrary processing rules would be good.
For example one might define a character like _ which represents a non-joining non-space. So the above could be written L'_'''arc de triomphe''' which would be clearer and unambiguous.
Or, and here's an idea I don't see much, we could define **bold** and //italics// as *additional* ways to punctuate such things, and keep the old ones until they wither and die.
L'//arc de Triompe// would be *entirely* unambiguous.
(As I continue to note, any extended syntax in this specific area should track historical usage as closely as possible to comply with the Principle of Least Surprise.)
It shouldn't be all *that* hard to instrument the parser to flag such constructs and put them in extra columns or a parallel table.
Make analytical work a lot easier, too.
Cheers, -- jra
On 11/10/07, Jay R. Ashworth jra@baylink.com wrote:
Or, and here's an idea I don't see much, we could define **bold** and //italics// as *additional* ways to punctuate such things, and keep the old ones until they wither and die.
L'//arc de Triompe// would be *entirely* unambiguous.
And wrong :) You mean L'**arc de Triomphe**. It's an appealing idea (and certainly // and ** is better than '' and ''') but means the parser passes through an even more complex phase.
(As I continue to note, any extended syntax in this specific area
should track historical usage as closely as possible to comply with the Principle of Least Surprise.)
I don't think the principle of least surprise even comes into it. You can't change syntax overnight for far more pragmatic reasons, like you simply can't train everyone on the new syntax fast enough. Principle or no principle.
It shouldn't be all *that* hard to instrument the parser to flag such
constructs and put them in extra columns or a parallel table.
Make analytical work a lot easier, too.
I'm not sure which constructs you're talking about? Why do you need to
flag ** and //?
Steve
On Sat, Nov 10, 2007 at 05:08:11PM +1100, Steve Bennett wrote:
On 11/10/07, Jay R. Ashworth jra@baylink.com wrote:
Or, and here's an idea I don't see much, we could define **bold** and //italics// as *additional* ways to punctuate such things, and keep the old ones until they wither and die.
L'//arc de Triompe// would be *entirely* unambiguous.
And wrong :) You mean L'**arc de Triomphe**. It's an appealing idea (and certainly // and ** is better than '' and ''') but means the parser passes through an even more complex phase.
It's *bold*?
(As I continue to note, any extended syntax in this specific area should track historical usage as closely as possible to comply with the Principle of Least Surprise.)
I don't think the principle of least surprise even comes into it. You can't change syntax overnight for far more pragmatic reasons, like you simply can't train everyone on the new syntax fast enough. Principle or no principle.
Not my point. You're discussing speed of cut, I'm discussing *target* of cut.
It shouldn't be all *that* hard to instrument the parser to flag such
constructs and put them in extra columns or a parallel table.
Make analytical work a lot easier, too.
I'm not sure which constructs you're talking about? Why do you need to flag ** and //?
My assertion was that, for analytical purposes, if it was practical to run it, we could instrument the parser to log somewhere the count of constructs it parses on each page, which would save grinding the entire database to get the statistics of which I speak. The users would grind it for us.
Cheers, -- jra
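Counting constructs per page is cheap to prototype even without instrumenting the parser; a sketch over raw wikitext, with an illustrative (far from complete) pattern list:
--
<?php
// Sketch of counting syntax constructs in a page's wikitext. The patterns
// below are examples only; a real survey would need a much fuller set.
function countConstructs( $wikitext ) {
    $patterns = array(
        'bold-italic apostrophes' => "/'''''/",
        'bold apostrophes'        => "/'''/",
        'definition ;term:def'    => '/^;[^\n]*:/m',
        'mixed ;# list'           => '/^;#/m',
    );
    $counts = array();
    foreach ( $patterns as $name => $regex ) {
        $counts[$name] = preg_match_all( $regex, $wikitext, $unused );
    }
    return $counts;
}

print_r( countConstructs( "'''''both''''' and\n;term:definition\n;#odd:case" ) );
--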
On 11/10/07, Jay R. Ashworth jra@baylink.com wrote:
Not my point. You're discussing speed of cut, I'm discussing *target* of cut.
Eh? How can good syntax violate the principle of least surprise?
My assertion was that, for analytical purposes, if it was practical to run it, we could instrument the parser to log somewhere the count of constructs it parses on each page, which would save grinding the entire database to get the statistics of which I speak. The users would grind it for us.
Sounds good. Transclusion would alter your results, of course.
Steve
On Sun, Nov 11, 2007 at 09:44:46AM +1100, Steve Bennett wrote:
On 11/10/07, Jay R. Ashworth jra@baylink.com wrote:
Not my point. You're discussing speed of cut, I'm discussing *target* of cut.
Eh? How can good syntax violate the principle of least surprise?
There are many possible syntaxes which qualify as "good". Those which leverage long-held reflexes of people who type on the Internet a lot are "better". PLS is why.
My assertion was that, for analytical purposes, if it was practical to run it, we could instrument the parser to log somewhere the count of constructs it parses on each page, which would save grinding the entire database to get the statistics of which I speak. The users would grind it for us.
Sounds good. Transclusion would alter your results, of course.
I think we probably need to skip transclusion for the purposes of this statistic. But I see your point.
Cheers, -- jra
Jay R. Ashworth wrote:
On Sat, Nov 10, 2007 at 01:04:37PM +1100, Steve Bennett wrote:
On 11/10/07, Simetrical Simetrical+wikilist@gmail.com wrote:
I would view bold italics with adjacent apostrophes as a corner case. The behavior in that case makes very little sense and I doubt it's being widely used.
There's one obvious use: French. Eg: L''''arc de triomphe de l'Étoile''' appelé...
(though interestingly when I got to that page, the initial ' was actually a stylised apostrophe ')
But yes, clearer, unambiguous syntax that didn't rely on arbitrary processing rules would be good.
For example one might define a character like _ which represents a non-joining non-space. So the above could be written L'_'''arc de triomphe''' which would be clearer and unambiguous.
Or, and here's an idea I don't see much, we could define **bold** and //italics// as *additional* ways to punctuate such things, and keep the old ones until they wither and die.
L'//arc de Triompe// would be *entirely* unambiguous.
Then people could be using L'<i>arc de Triompe</i> (or L'<b>arc de Triompe</b> for the original wikitext).
That's something guaranteed not to change. You don't even need a parser! :D
On 11/11/07, Platonides Platonides@gmail.com wrote:
That's something guaranteed not to change. You don't even need a parser! :D
There are two obvious ways that <i> may change:
1) The parser will be adapted to allow rendering a format other than HTML, like PDF or something.
2) The grammar will be altered to deny all HTML.
Incidentally, on the issue of apostrophes, I couldn't help but notice that at [[' (disambiguation)]] when they want to actually display a single apostrophe in bold, ''''''' doesn't work: they use ''' ' '''.
Steve
Hoi, On the issue of apostrophes, do remember that languages like Neapolitan have double quotes in plain text. The system as it is, is broken for all languages anyway. Thanks. GerardM
On Nov 11, 2007 3:01 AM, Steve Bennett stevagewp@gmail.com wrote:
On 11/11/07, Platonides Platonides@gmail.com wrote:
That's something guaranteed not to change. You don't even need a parser! :D
There are two obvious ways that <i> may change:
1) The parser will be adapted to allow rendering a format other than HTML, like PDF or something.
2) The grammar will be altered to deny all HTML.
Incidentally, on the issue of apostrophes, I couldn't help but notice that at [[' (disambiguation)]] when they want to actually display a single apostrophe in bold, ''''''' doesn't work: they use ''' ' '''.
Steve
On Nov 11, 2007 2:01 AM, Steve Bennett stevagewp@gmail.com wrote:
- The grammar will be altered to deny all HTML.
Why would HTML ever be scrapped? Do you have a better way to do <div class="someclass"> with wikitext rather than HTML. It would be foolish to stop people using limited HTML as it is incredibly useful.
On 11/11/07, Minute Electron minuteelectron@googlemail.com wrote:
Why would HTML ever be scrapped? Do you have a better way to do <div class="someclass"> with wikitext rather than HTML. It would be foolish to stop people using limited HTML as it is incredibly useful.
All I meant is that that could happen. Which it could.
Steve
On Sun, Nov 11, 2007 at 10:51:06PM +1100, Steve Bennett wrote:
On 11/11/07, Minute Electron minuteelectron@googlemail.com wrote:
Why would HTML ever be scrapped? Do you have a better way to do <div class="someclass"> with wikitext rather than HTML. It would be foolish to stop people using limited HTML as it is incredibly useful.
All I meant is that that could happen. Which it could.
Steve
Steve: note that you're leaving the first graf of your replies quoted in one level more than you mean to.
Cheers, -- jra
On 11/12/07, Jay R. Ashworth jra@baylink.com wrote:
On Sun, Nov 11, 2007 at 10:51:06PM +1100, Steve Bennett wrote:
On 11/11/07, Minute Electron minuteelectron@googlemail.com wrote:
Why would HTML ever be scrapped? Do you have a better way to do <div class="someclass"> with wikitext rather than HTML. It would be foolish to stop people using limited HTML as it is incredibly useful.
All I meant is that that could happen. Which it could.
Steve
Steve: note that you're leaving the first graf of your replies quoted in one level more than you mean to.
Interesting. That's not how it appears to me (Gmail/Opera). I've left an extra newline this time, presumably it renders correctly?
Steve
On Mon, Nov 12, 2007 at 10:31:40AM +1100, Steve Bennett wrote:
On 11/12/07, Jay R. Ashworth jra@baylink.com wrote:
On Sun, Nov 11, 2007 at 10:51:06PM +1100, Steve Bennett wrote:
On 11/11/07, Minute Electron minuteelectron@googlemail.com wrote:
Why would HTML ever be scrapped? Do you have a better way to do <div class="someclass"> with wikitext rather than HTML. It would be foolish to stop people using limited HTML as it is incredibly useful.
All I meant is that that could happen. Which it could.
Steve
Steve: note that you're leaving the first graf of your replies quoted in one level more than you mean to.
Interesting. That's not how it appears to me (Gmail/Opera). I've left an extra newline this time, presumably it renders correctly?
Steve
This is what it came back as this time.
So yes, although no traditional blank line after the quote.
This is why I use mutt. :-)
Cheers, -- jra
On Sun, Nov 11, 2007 at 09:57:23AM +0000, Minute Electron wrote:
On Nov 11, 2007 2:01 AM, Steve Bennett stevagewp@gmail.com wrote:
- The grammar will be altered to deny all HTML.
Why would HTML ever be scrapped? Do you have a better way to do <div class="someclass"> with wikitext rather than HTML. It would be foolish to stop people using limited HTML as it is incredibly useful.
A (good) argument can be made that you've mistakenly typed "useful" when you meant to type "dangerous", and that such work should be sanitized by something like a parser function.
Cheers, -- jra
Jay R. Ashworth wrote:
A (good) argument can be made that you've mistakenly typed "useful" when you meant to type "dangerous", and that such work should be sanitized by something like a parser function.
Cheers, -- jra
Steve Bennett:
- The parser will be adapted to allow rendering a format other than
HTML, like PDF or something.
- The grammar will be altered to deny all HTML.
Those HTMLish tags are wikisyntax tags, and as such are bound to all restrictions of the rest of the parser.
Obviously, when I said it doesn't need to be parsed, I was thinking of an HTML parser. If you want to make a PDF parser before the HTML one, that's up to you. We could send users PDFs instead of HTML...
On 11/12/07, Platonides Platonides@gmail.com wrote:
Those HTMLish tags are wikisyntax tags, and as such are bound to all restrictions of the rest of the parser.
Ah. I thought <I> was just being ignored by the parser, rather than being interpreted and written out as <I>
Steve
Ah. I thought <I> was just being ignored by the parser, rather than being interpreted and written out as <I>
The only way to ignore <I> completely would be to ignore all HTML tags, and that would be a very bad plan. The parser needs to recognise it as an HTML tag and then compare against a list of acceptable tags.
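A minimal sketch of that whitelist approach; the allowed list here is a made-up subset, not MediaWiki's actual one, and attributes are passed through untouched:
--
<?php
// Keep a tag if its (lowercased) name is on the allowed list, escape it
// otherwise, so disallowed markup shows up as literal text.
function filterTags( $text ) {
    $allowed = array( 'i', 'b', 'em', 'strong', 'span', 'div', 'pre' );
    return preg_replace_callback(
        '!</?([A-Za-z][A-Za-z0-9]*)[^<>]*>!',
        function ( $m ) use ( $allowed ) {
            $name = strtolower( $m[1] );
            if ( !in_array( $name, $allowed ) ) {
                return htmlspecialchars( $m[0] );   // show the tag as text
            }
            $isClose = ( $m[0][1] === '/' );
            $rest = substr( $m[0], strlen( $m[1] ) + ( $isClose ? 2 : 1 ) );
            return '<' . ( $isClose ? '/' : '' ) . $name . $rest;  // <I> -> <i>
        },
        $text
    );
}

echo filterTags( "L'<I>arc de triomphe</I> and <script>x</script>" );
// -> L'<i>arc de triomphe</i> and &lt;script&gt;x&lt;/script&gt;
--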
On 11/11/07, Steve Bennett stevagewp@gmail.com wrote:
Ah. I thought <I> was just being ignored by the parser, rather than being interpreted and written out as <I>
It should be clear <I> must be interpreted and rewritten, since it's invalid XHTML. It gets output as <i>. I expect we have an explode on < and iterate through those, although I haven't checked.
On Sun, Nov 11, 2007 at 06:23:00PM -0500, Simetrical wrote:
On 11/11/07, Steve Bennett stevagewp@gmail.com wrote:
Ah. I thought <I> was just being ignored by the parser, rather than being interpreted and written out as <I>
It should be clear <I> must be interpreted and rewritten, since it's invalid XHTML. It gets output as <i>. I expect we have an explode on < and iterate through those, although I haven't checked.
XHTML is case sensitive on the contents of the tags?
What have those people been *smoking* over there?
And I thought it was bad enough that *quoted* ampersands in URLs have to be escaped to validate. If it's quoted, stay the hell *out* of it; that's why we quoted it.
Cheers -- jra
Jay R. Ashworth wrote:
On Sun, Nov 11, 2007 at 06:23:00PM -0500, Simetrical wrote:
On 11/11/07, Steve Bennett stevagewp@gmail.com wrote:
Ah. I thought <I> was just being ignored by the parser, rather than being interpreted and written out as <I>
It should be clear <I> must be interpreted and rewritten, since it's invalid XHTML. It gets output as <i>. I expect we have an explode on < and iterate through those, although I haven't checked.
XHTML is case sensitive on the contents of the tags?
No. It's case sensitive on the tag names.
What have those people been *smoking* over there?
And I thought it was bad enough that *quoted* ampersands in URLs have to be escaped to validate. If it's quoted, stay the hell *out* of it; that's why we quoted it.
There were people on Wikipedia changing <em>s to <i>s (or vice versa)...
On Thu, 08 Nov 2007 14:55:02 +1100, Steve Bennett wrote:
On 11/8/07, Simetrical Simetrical+wikilist@gmail.com wrote:
- Now that we have a grammar, a yacc parser is compiled, and
appropriate rendering bits are added to get it to render to HTML.
People have already done this, at least once, haven't they? Do we have a list of attempts?
- The stuff the BNF grammar doesn't cover is tacked on with some
other methods. In practice, it seems like a two-pass parser would be ideal: one recursive pass to deal with templates and other substitution-type things, then a second pass with the actual grammar of most of the language. The first pass is of necessity recursive, so there's probably no point in having it spend the time to repeatedly parse italics or whatever, when it's just going to have to do it again when it substitutes stuff in. Further rendering passes are going to be needed, e.g., to insert the table of contents. Further parsing passes may or may not be needed.
Ouch, now you're up to about 4 passes, which isn't far off the current version. Two passes would be good, like a C compiler: once for meta-markup (templates, parser functions), and once for content. Would it be possible to perhaps have an in-place pattern-based parser for the first phase, then a proper recursive descent for the content?
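A skeleton of that two-pass idea, with a recursive first pass for templates and a single content pass; fetchTemplate() and parseContent() are hypothetical placeholders, not real MediaWiki functions, and the template regex only covers the simplest {{Name}} case:
--
<?php
// Pass 1: recursively expand meta-markup; pass 2: one content pass.
function expandTemplates( $wikitext, $depth = 0 ) {
    if ( $depth > 40 ) {
        return $wikitext;                      // guard against template loops
    }
    return preg_replace_callback(
        '/\{\{([^{}|]+)\}\}/',                 // simplest case only: {{Name}}
        function ( $m ) use ( $depth ) {
            $body = fetchTemplate( trim( $m[1] ) );   // hypothetical lookup
            return expandTemplates( $body, $depth + 1 );
        },
        $wikitext
    );
}

function renderPage( $wikitext ) {
    $expanded = expandTemplates( $wikitext );  // pass 1: meta-markup
    return parseContent( $expanded );          // pass 2: the actual grammar
}
--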
If we're going to get anything done, we'd likely need incremental improvements (or real strong motivation). Yes, four passes is a little more complex than we'd want, and would make the grammar a bit unwieldy. But it's not complex or unwieldy enough to handle all of the corner cases in the current parser. So it would at least be a step in the right direction.
If we're going to get anything done, we'd likely need incremental improvements (or real strong motivation). Yes, four passes is a little more complex than we'd want, and would make the grammar a bit unwieldy. But it's not complex or unwieldy enough to handle all of the corner cases in the current parser. So it would at least be a step in the right direction.
If incremental improvements was going to work, we'd have done it by now. The parser is such a mess that we really need to bite the bullet and just do it - incremental improvements isn't going to cut it.
(And when I say "we", I mean those of us who know what EBNF stands for without looking it up, which doesn't include me. ;))
On Thu, Nov 08, 2007 at 10:45:18PM +0000, Thomas Dalton wrote:
(And when I say "we", I mean those of us who know what EBNF stands for without looking it up, which doesn't include me. ;))
Enhanced (or Extended) [[Backus-Naur Form]].
It's a language for describing languages (much like XML, oddly).
I'm pretty sure there's a BNF grammar for C in the back of K&R, for example. There are BNFs for email addresses in RFC 2821 and 2822.
Cheers, -- jra
On 08/11/2007, Simetrical Simetrical+wikilist@gmail.com wrote:
- Everything is rolled out live. Pages break left and right. Large
complaint threads are started on the Village Pump, people fix it, and everyone forgets about it. Developers get a warm fuzzy feeling for having finally succeeded at destroying Parser.php.
Before this, step 5.5) A bunch of bots run over all the wikis fixing corner-case wikitext that's fixable - changing stuff that works only in the old parser to stuff that works in old and new.
And step 6.5) A bunch of bots run over all the wikis fixing the stuff that was missed in 5.5.
And step 6.6) the converter script for the next version of MediaWiki includes a text fixer.
- d.
On 11/9/07, David Gerard dgerard@gmail.com wrote:
Before this, step 5.5) A bunch of bots run over all the wikis fixing corner-case wikitext that's fixable - changing stuff that works only in the old parser to stuff that works in old and new.
For some reason, I was expecting "7) hop on the back of a passing pig and fly away". But maybe you were actually being serious.
I thought my suggestion of this slow migration to a new parser was a bit fanciful. But is it a good idea? What do people think? Given that:
a) The parser is extremely complex, and difficult to make significant improvements to
b) Writing a new parser that can implement *all* of the current parser's idiosyncrasies while actually being an improvement is a massive challenge
c) Breaking everyone's wikitext overnight is not going to endear anybody to anybody.
is there any other way to end up with a new parser based on fully specifiable formal grammar?
I'd really like to hear from Brion on this.
Steve
My understanding (which could of course be wrong) is this:
* It is still a goal (at the moment the only complete wiki text specification is the code that implements the parser).
* Progress towards this goal is stalled.
* The reason it is stalled is because it may not be possible (I think we have a few 90% implementations, where people realised the last 10% could not be done using the approach used in the other 90%).
* Unless someone has a 100% implementation, or a 90% implementation with a clear actionable plan and every reason to believe they can get to 100%, then there's not really any point in expending more time on it and discussing it extensively (it's already been discussed several times previously on this mailing list, with no resolution).
* Unless someone can prove that it's impossible to implement, then we can't close the bug as WONTFIX.
* So, to summarise, it's in limbo, with no clear path out of limbo, either to a WONTFIX status, or to a FIXED status. Furthermore, it has a bit of a "time black hole" vibe about it - that you could potentially throw lots of time and money at it, and have nothing useful to show for it, kind of like A.I. research ;-) [with apologies to any AI researchers I just offended, but where's my robot that can learn how to do my housework and my shopping?] ... But the flipside is that if anyone can solve this problem, then they get bragging rights, and several rounds of beer from the rest of us (just as long as it's a 100% implementation).
At least, that's my general understanding. If anyone knows better, please correct me.
-- All the best, Nick.
On 11/8/07, Nick Jenkins nickpj@gmail.com wrote:
My understanding (which could of course be wrong) is this:
- It is still a goal (at the moment the only complete wiki text
specification is the code that implements the parser).
- Progress towards this goal is stalled.
- The reason it is stalled is because it may not be possible (I think we
have a few 90% implementations, where people realised the last 10% could not be done using the approach used in the other 90%).
What exactly is the "goal"? If it's just "formally defining whatever it is that the code currently does", is that a worthy goal? I ask because obviously the parser currently does some crappy things, and it does some of them in a crappy way. Is it really worth carefully defining those crappy things just so that it can do them in a better way? Could we not just define the 90% of what it currently does that we actually *want* it to do, then write a parser that does that, and accept the fact that some wikitext *will* be broken?
For example, I just opened a bug on the strange, unpredictable, and not very useful behaviour that arises from constructs such as:
;#foo:blaa
And the fact that this:
;foo:#blaa
is distinct from this:
;foo :#blaa
This is all weird, very complex behaviour that arises from the nature of the parser. But it's not useful. And it's not really worth spending endless hours attempting to understand. Could we not simply redefine the grammar, and chop out the ";term:definition" syntax altogether? Yes, it would break some code but it would be easy to identify those bits and replace them with ";term<newline>:definition".
If the goal is to improve the grammar and then write a good parser, why expend so much effort attempting to document to the nth degree the virtually incomprehensible behaviour currently exhibited by the parser?
Steve PS Lest I be misunderstood, I'm not criticising the parser or its authors: the parser code actually looks well written and extremely readable. It's just in the nature of multi-pass pattern matching/replacement that extraordinarily complex behaviour arises. And then you run the output through tidy...