This made me think: Would it make sense to make a formal BNF grammar for the Wikipedia text format, so a LALR(1) parser could be made for it? Would that make any sense at all with PHP, or just be too hard to code and inflexible?
I'd love to have a formal grammar of some kind (I think regexps would be fine), and I agree with Jan that a totally wiki-specific syntax would be far better than our current mish-mash of HTML and wiki markup. But I'm not sure if it's not already too late to revisit those decisions.
But if it isn't, I'll be happy to discuss what a syntax might look like.
On Thu, 25 Jul 2002 lcrocker@nupedia.com wrote:
This made me think: Would it make sense to make a formal BNF grammar for the Wikipedia text format, so a LALR(1) parser could be made for it? Would that make any sense at all with PHP, or just be too hard to code and inflexible?
I'd love to have a formal grammar of some kind (I think regexps would be fine), and I agree with Jan that a totally wiki-specific syntax would be far better than our current mish-mash of HTML and wiki markup. But I'm not sure if it's not already too late to revisit those decisions.
But if it isn't, I'll be happy to discuss what a syntax might look like.
Wiki is still a new concept. Think how HTML was based on SGML, then evolved into HTML 2, 3, 4, 5, and then XML came along, because people understood from the HTML experience that SGML was overly complex.
There is a big world of PhpWikis out there with [single bracket] link syntax. There are other wiki implementations with different ideas about syntax. But no wiki is as big as Wikipedia, so this is the most concentrated amount of experience. This is where a format standard should or at least could start to form.
On Thu, Jul 25, 2002 at 12:17:26PM +0200, Lars Aronsson wrote:
On Thu, 25 Jul 2002 lcrocker@nupedia.com wrote:
This made me think: Would it make sense to make a formal BNF grammar for the Wikipedia text format, so a LALR(1) parser could be made for it? Would that make any sense at all with PHP, or just be too hard to code and inflexible?
I'd love to have a formal grammar of some kind (I think regexps would be fine), and I agree with Jan that a totally wiki-specific syntax would be far better than our current mish-mash of HTML and wiki markup. But I'm not sure if it's not already too late to revisit those decisions.
But if it isn't, I'll be happy to discuss what a syntax might look like.
Wiki is still a new concept. Think how HTML was based on SGML, then evolved into HTML 2, 3, 4, 5, and then XML came along, because people understood from the HTML experience that SGML was overly complex.
There is a big world of PhpWikis out there with [single bracket] link syntax. There are other wiki implementations with different ideas about syntax. But no wiki is as big as Wikipedia, so this is the most concentrated amount of experience. This is where a format standard should or at least could start to form.
I tried to make a formal grammar of the Wikipedia markup (LALR, regexps, or whatever), and I can tell you that it's next to impossible if almost arbitrary HTML markup is allowed.
HTML table syntax is especially difficult to parse, so maybe we should make our own?
Without HTML tables I think that we could limit what kind of HTML is allowed and make some sane formal syntax.
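As a rough illustration of what a "sane formal syntax" could look like once arbitrary HTML is out of the picture, here is a minimal regexp-based tokenizer sketch in Python. The token set (''' for bold, '' for italics, [[...]] for links) and the names TOKEN_RE and tokenize are assumptions made for this example only; this is not the actual Wikipedia parser.

```python
import re

# Hypothetical token set, for illustration only.
# Order matters: ''' must be tried before '' so bold is not read as italics.
TOKEN_RE = re.compile(
    r"(?P<BOLD>''')"
    r"|(?P<ITALIC>'')"
    r"|(?P<LINK>\[\[[^\[\]|]+\]\])"
    r"|(?P<TEXT>[^'\[]+|.)",   # a run of ordinary text, or one stray ' or [
    re.DOTALL,
)

def tokenize(source):
    """Return (kind, value) pairs that together cover the whole input."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind = match.lastgroup
        value = match.group()
        if kind == "LINK":
            value = value[2:-2]  # drop the surrounding [[ and ]]
        tokens.append((kind, value))
    return tokens

if __name__ == "__main__":
    for token in tokenize("''a'' and '''b''', see [[Main Page]]"):
        print(token)
```

Because every character falls into exactly one token, a grammar can then be written over these tokens rather than over raw characters. The sketch deliberately does not settle how a run of five quotes should be read, which is the kind of ambiguity a formal grammar would have to resolve.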
It's not easy to design simple table markup that:
* allows multicolumn and multirow cells
* allows cell attributes
* can nest tables
* allows all constructs that HTML allows inside cells, i.e. multiple paragraphs, lists etc.
* is readable
* is easy to write
So I suggest that you check freetable (http://sf.net/projects/freetable). I made this a while ago to allow simpler HTML tables. It seems to be working and is used by WebMake and WebsiteMetaLanguage.
Syntax looks something like this:
<wwwtable border=1>
(1,1)
column 1, row 1
(+,)
the same column, next row
(*,2)
column 2 in any row
(*,3) align=center
columns 3 should be centered
(1,3)
Some centered text
(3,3)
Other centered text
</wwwtable>
Which is converted to:
<table border=1>
<tr>
<td>column 1, row 1</td>
<td>column 2 in any row</td>
<td align=center>columns 3 should be centered Some centered text</td>
</tr>
<tr>
<td>the same column, next row</td>
<td>column 2 in any row</td>
<td align=center>columns 3 should be centered</td>
</tr>
<tr>
<td> </td>
<td>column 2 in any row</td>
<td align=center>columns 3 should be centered Other centered text</td>
</tr>
</table>
I'd say it's much better than what wikipedia currently uses.
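To make the cell-addressing idea concrete, here is a small Python sketch of a converter for the subset of the syntax shown above. It is not the freetable implementation. It assumes, inferred only from the example and its output, that locations are written (row,column), that * matches every row or column, that + means "previous plus one", that an empty coordinate means "same as previous", that options follow the location on the same line, and that cell text starts on the next line. The names LOCATION_RE, resolve, convert and SAMPLE are made up for this example, and empty cells are filled with &nbsp; as a further assumption.

```python
import re

# "(row,col) options" with each coordinate a number, "*", "+" or empty.
LOCATION_RE = re.compile(r"^\((\d+|\*|\+|),(\d+|\*|\+|)\)\s*(.*)$")

def resolve(spec, previous):
    """Turn one coordinate into a number, or None for the '*' wildcard."""
    if spec == "*":
        return None
    if spec == "":
        return previous
    if spec == "+":
        return previous + 1
    return int(spec)

def convert(lines, table_attrs=""):
    """Convert the lines between <wwwtable> and </wwwtable> to an HTML table."""
    entries = []          # [row, col, options, text_lines], in source order
    row = col = 1         # last concrete position, used by "+" and ""
    for line in lines:
        line = line.strip()
        match = LOCATION_RE.match(line)
        if match:
            r = resolve(match.group(1), row)
            c = resolve(match.group(2), col)
            row = row if r is None else r   # wildcards do not move the cursor
            col = col if c is None else c
            entries.append([r, c, match.group(3), []])
        elif line and entries:
            entries[-1][3].append(line)     # text belongs to the last location

    n_rows = max([e[0] for e in entries if e[0] is not None], default=0)
    n_cols = max([e[1] for e in entries if e[1] is not None], default=0)
    out = ["<table%s>" % (" " + table_attrs if table_attrs else "")]
    for r in range(1, n_rows + 1):
        out.append("<tr>")
        for c in range(1, n_cols + 1):
            hits = [e for e in entries if e[0] in (None, r) and e[1] in (None, c)]
            opts = " ".join(e[2] for e in hits if e[2])
            text = " ".join(t for e in hits for t in e[3]) or "&nbsp;"
            out.append("<td%s>%s</td>" % (" " + opts if opts else "", text))
        out.append("</tr>")
    out.append("</table>")
    return "\n".join(out)

SAMPLE = """\
(1,1)
column 1, row 1
(+,)
the same column, next row
(*,2)
column 2 in any row
(*,3) align=center
columns 3 should be centered
(1,3)
Some centered text
(3,3)
Other centered text
"""

if __name__ == "__main__":
    print(convert(SAMPLE.splitlines(), "border=1"))
```

The sketch leaves out multicolumn and multirow cells, header cells and nested tables, so it only shows why addressing cells by position keeps the source readable.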
On Thu, Jul 25, 2002 at 02:25:18AM -0700, lcrocker@nupedia.com wrote:
I'd love to have a formal grammar of some kind (I think regexps would be fine),
Hmm, I seem to remember I promised that once. :-/ I'll see what I can do. If people want to help, just go to
http://www.wikipedia.com/wiki/User:Jan_Hidders/Wikipedia_syntax
(I probably should put this on the meta-wikipedia.)
Just to be clear: the syntax should not describe what we accept and do not accept (we actually accept everything, so that's a really simple grammar :-)) but should have enough "resolution" to allow us to specify the semantics of the mark-up. We should not concentrate on making it LALR(1) or anything at first, but just on making it unambiguous (in the parsing sense of the word) and complete.
and I agree with Jan that a totally wiki-specific syntax would be far better than our current mish-mash of HTML and wiki markup. But I'm not sure if it's not already too late to revisit those decisions.
Was it a conscious decision? I got the impression the early software didn't filter out HTML, so people used it and now we are stuck with it.
Apart from the big technical advantages I still feel that having a simple HTML-free mark-up language is necessary to keep Wikipedia accessible for newcomers. Having lots of complicated HTML that is not very WYSIWYG makes editing harder. This inevitably means that you cannot do a lot of fancy lay-out things, but I believe that is not a bug but a feature.
So, yes, it is probably impossible to come up with an HTML-free mark-up that has an equivalent for all the HTML that is currently used. However, we would probably be breaking only a very small percentage of pages and we could even automatically detect those pages and put them on a "to be simplified" list.
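The automatic detection could be as simple as scanning each page for tags that have no agreed wiki equivalent. The Python sketch below shows the idea. The set CONVERTIBLE is a made-up placeholder (which tags would really get an equivalent is exactly what is under discussion), and the function names and the title-to-text mapping are assumptions for the example.

```python
import re

# Hypothetical: tags assumed to have an HTML-free equivalent.
CONVERTIBLE = {"b", "i", "p", "br", "hr"}

# Matches opening and closing tags, with or without attributes.
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)\b[^>]*>")

def unconvertible_tags(wikitext):
    """Return the set of tag names in the text that have no equivalent."""
    return {name.lower() for name in TAG_RE.findall(wikitext)} - CONVERTIBLE

def to_be_simplified(pages):
    """pages maps title -> wikitext; return the titles needing manual work."""
    return sorted(title for title, text in pages.items()
                  if unconvertible_tags(text))

if __name__ == "__main__":
    pages = {
        "Plain": "Some '''bold''' text and <i>italics</i>.",
        "Fancy": "<table border=1><tr><td>cell</td></tr></table>",
    }
    print(to_be_simplified(pages))
```

Running such a scan over the whole database would also give a concrete number for the "very small percentage" of pages that would break.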
But if it isn't, I'll be happy to discuss what a syntax might look like.
I have once made a proposal on
http://www.wikipedia.com/wiki/User:Jan_Hidders/HTML-free_mark-up
but I have to admit that it was mainly to draw some discussion.
-- Jan Hidders
Jan.Hidders wrote:
Was it a conscious decision? I got the impression the early software didn't filter out HTML, so people used it and now we are stuck with it.
That's more or less right. There was a discussion, but really, the code is the law.
Apart from the big technical advantages I still feel that having a simple HTML-free mark-up language is necessary to keep Wikipedia accessible for newcomers. Having lots of complicated HTML that is not very WYSIWYG makes editing harder. This inevitably means that you cannot do a lot of fancy lay-out things, but I believe that is not a bug but a feature.
My grasp of the consensus (but maybe I am misremembering... perhaps this is just my grasp of my own opinion!) is that going out of our way to be HTML-free is not a good thing, but that not allowing the many html-nightmares is a good thing.
For example, many many many people, not just programmers, understand how to make html <b>bold</b> and <i>italics</i>. Those are intuitive and harmless. The original Ward Cunningham wiki solution of ' and '' and ''' for different things, well, that was never very intuitive and newcomers didn't know about it.
Supporting some html tags, familiar and harmless ones, seems like a good idea.
Of course, at this point, we do have an established userbase of writers, some of whom are known to us as regulars, but lots and lots of whom may only show up once every month to write a little bit. I think we have a duty not to change anything in a way that will astonish them.
--Jimbo
On Fri, Jul 26, 2002 at 07:00:28AM -0700, Jimmy Wales wrote:
Jan.Hidders wrote:
My grasp of the consensus (but maybe I am misremembering... perhaps this is just my grasp of my own opinion!) is that going out of our way to be HTML-free is not a good thing, but that not allowing the many html-nightmares is a good thing.
I think you are right that this was and is the consensus. Doesn't mean I can't try, does it? :-)
For example, many many many people, not just programmers, understand how to make html <b>bold</b> and <i>italics</i>. Those are intuitive and harmless. The original Ward Cunningham wiki solution of ' and '' and ''' for different things, well, that was never very intuitive and newcomers didn't know about it.
<rave> Oh, come on! How long does it take for newcomers to grasp what '' and ''' mean? I agree that in itself there is nothing wrong with <b> and <i>, although I personally think they are slightly less easy to read than the WikiWiki notation, and I think it is always better to simply have one notation for every mark-up.
However, they are at the root of the HTML problems, because once you start allowing such HTML-like mark-up, the software has to decide which tags it allows and which it doesn't, and whether the stuff is well-formed or not, and that's just hard to do. So in the beginning it simply wasn't done, and because, as you said, the code was and is the law, the life of the parser has now become unnecessarily difficult, and some pages are looking more and more like regular, hard-to-understand and hard-to-edit HTML pages.
It really would have been so much better if right from the beginning somebody had said: no HTML tags. It really would have. *sigh* </rave>
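To show what "deciding which tags are allowed and whether the stuff is well-formed" involves even in the simplest case, here is a Python sketch of a stack-based check. The allow-list ALLOWED and the function name check_markup are assumptions for the example. Attributes are ignored, and unescaped < characters, comments and entities are not handled, so a real filter would need considerably more than this.

```python
import re

ALLOWED = {"b", "i"}  # hypothetical allow-list of "harmless" tags

# Group 1 is "/" for a closing tag, group 2 is the tag name.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\b[^>]*>")

def check_markup(text):
    """Return a list of problems; an empty list means allowed and well-formed."""
    problems = []
    open_tags = []  # stack of currently open allowed tags
    for closing, name in TAG_RE.findall(text):
        name = name.lower()
        if name not in ALLOWED:
            problems.append("tag not allowed: " + name)
        elif not closing:
            open_tags.append(name)
        elif open_tags and open_tags[-1] == name:
            open_tags.pop()
        else:
            problems.append("unexpected closing tag: " + name)
    problems.extend("unclosed tag: " + name for name in open_tags)
    return problems

if __name__ == "__main__":
    print(check_markup("<b>bold</b> and <i>italics</i>"))
    print(check_markup("<b><i>crossed</b></i>"))
```

Even with only two tags, the check has to take a position on crossed and unclosed tags, which is the extra burden on the parser described above.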
Of course, at this point, we do have an established userbase of writers, some of whom are known to us as regulars, but lots and lots of whom may only show up once every month to write a little bit. I think we have a duty not to change anything in a way that will astonish them.
You are right, of course, but I would say that we also have a duty towards the hundreds of thousands of potential contributors who will come to visit us in the future. And seeing how Wikipedia is happily humming along at the moment and even getting a facial soon, I'm quite sure that they will come. :-)
-- Jan Hidders
wikitech-l@lists.wikimedia.org