First of all, I have to admit I have not read all 50 emails, but here are my two cents.
Most importantly, I think we should stop storing wikitext. Storing wikitext makes it hard to change the syntax, because any change would break pretty much every existing page. Wikitext is an ambiguous way of storing what is meant; XML is an unambiguous way of doing so. As the text is compressed anyway, storing wikitext or XML makes little difference in size.
However, XML makes parsing much easier. Yes, rendering will take two steps, but regenerating the page from the database becomes much simpler (no ugly regexps, just a simple SAX parser). Besides, as a pywikipedia developer, I'd like to have XML output ;)
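To give an idea of how simple the rendering side becomes, here is a minimal sketch in Python - the XML schema (<page>, <b>, <link>) is invented for the example, not a real proposal:

# Minimal sketch of XML -> HTML rendering with a SAX handler. The
# schema (<page>, <b>, <link>) is invented for this example.
import xml.sax

class HtmlRenderer(xml.sax.ContentHandler):
    TAGS = {'page': 'div', 'b': 'b', 'i': 'i'}  # XML element -> HTML tag

    def __init__(self):
        super().__init__()
        self.out = []

    def startElement(self, name, attrs):
        if name == 'link':
            self.out.append('<a href="/wiki/%s">' % attrs['target'])
        else:
            self.out.append('<%s>' % self.TAGS[name])

    def endElement(self, name):
        self.out.append('</a>' if name == 'link' else '</%s>' % self.TAGS[name])

    def characters(self, content):
        self.out.append(content)  # real code would HTML-escape here

handler = HtmlRenderer()
xml.sax.parseString(
    b'<page>Hello <b>world</b>, see <link target="Foo">foo</link>.</page>',
    handler)
print(''.join(handler.out))
# <div>Hello <b>world</b>, see <a href="/wiki/Foo">foo</a>.</div>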
To change the format to XML (and updating the wikitext format at the same time) means we need four important things: an 'old wikitext'->XML converter, an XML->'good wikitext' converter, a 'good wikitext'->XML converter and an XML->HTML parser. (s/converter/parser/, if you care about the exact words.) The 'good wikitext' and HTML parsers should be fairly easy; the first is just plain hard.
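In other words, the data flow would be something like this (all function names invented; illustrative stubs only):

# Illustrative stubs - none of these exist yet; the names just show
# where each of the four pieces sits in the data flow.
def old_wikitext_to_xml(text): ...   # one-off migration; the hard one
def xml_to_wikitext(xml): ...        # runs when a user opens the edit box
def wikitext_to_xml(text): ...       # runs when a user saves
def xml_to_html(xml): ...            # runs on every page view

def edit_cycle(stored_xml):
    editable = xml_to_wikitext(stored_xml)  # user edits this text...
    return wikitext_to_xml(editable)        # ...and this goes back to storage

def view(stored_xml):
    return xml_to_html(stored_xml)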
I tried to build a parser using standard parsing tools, but gave up and built a basic lexer + parser by hand instead. It is by no means complete, and I have not worked on it for some time, as Ping Yeh has a more complete implementation. He was busy refactoring it, but with little time available, there was little progress. My code is available at http://svn.wikimedia.org/viewvc/pywikipedia/trunk/pywikiparser/ - feel free to take a look at it :)
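For a taste of the hand-written approach, a much-simplified sketch - this is not the pywikiparser code, just bold/italic lexing:

# Hand-written lexer for a tiny wikitext subset (''italic'' and
# '''bold''' only). A simplified illustration, not pywikiparser.
def lex(text):
    tokens, i = [], 0
    while i < len(text):
        if text.startswith("'''", i):
            tokens.append(('BOLD', "'''")); i += 3
        elif text.startswith("''", i):
            tokens.append(('ITALIC', "''")); i += 2
        else:
            j = i
            while j < len(text) and text[j] != "'":
                j += 1
            if j == i:  # lone apostrophe: treat it as plain text
                j += 1
            tokens.append(('TEXT', text[i:j])); i = j
    return tokens

print(lex("a '''bold''' and ''italic'' word"))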
To summarize: We should switch to storing a much more descriptive format, so that changes in the wikitext format do not break anything: the wikitext can simply be generated from the XML, in whichever format you want. This means it should be possible to offer (cleaned-up) MediaWiki wikitext, WikiCreole or many other syntaxes - per user. (Although as far as I can see WikiCreole isn't available as a context-free grammar either...)
--valhallasw
P.S. Some people, when confronted with a problem, think `I know, I'll use regular expressions.' Now they have two problems. --jwz
On 11/11/07, Merlijn van Deen valhallasw@arctus.nl wrote:
Most importantly, I think we should stop storing wikitext. Storing wikitext makes it hard to change the syntax, because any change would break pretty much every existing page. Wikitext is an ambiguous way of storing what is meant; XML is an unambiguous way of doing so. As the text is compressed anyway, storing wikitext or XML makes little difference in size.
Interesting idea, and it does mean you can update the syntax whenever you want. However, if you make a big change, the amount of broken syntax that gets *written* will increase.
To change the format to XML (and updating the wikitext format at the same time) means we need four important things: an 'old wikitext'->XML converter, an XML->'good wikitext' converter, a 'good wikitext'->XML converter and an XML->HTML parser. (s/converter/parser/, if you care about the exact words.) The 'good wikitext' and HTML parsers should be fairly easy; the first is just plain hard.
I've only ever used one system that worked like that: LambdaMOO. When you write code in that system, it compiles it to bytecode, then decompiles it next time you want to edit it. It had some interesting quirks though:
- Whitespace was self-normalising (not a bad thing)
- Parentheses were self-normalising (sometimes a confusing thing)
- /* Comments */ were stripped out and not stored (a stupid thing)
- You couldn't save non-compiling code
I find your suggestion of replacing one parser with four parsers a bit scary, though. Admittedly one of those parsers (old wikitext -> XML) is not needed in the long term, and the XML->XHTML renderer would be pretty simple. But it does mean that every change to the grammar needs to be carefully implemented both in parsing and de-parsing.
I guess every test case would also involve a compulsory roundtrip. If it doesn't survive the roundtrip perfectly, it fails.
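Something like this, I suppose - parse() and deparse() standing in for the hypothetical wikitext->XML and XML->wikitext converters (identity placeholders here, so the sketch runs):

# Sketch of a compulsory round-trip test. parse() and deparse() are
# identity placeholders standing in for the hypothetical converters.
import unittest

parse = lambda wikitext: wikitext    # would be wikitext -> XML
deparse = lambda xml: xml            # would be XML -> wikitext

class RoundTripTest(unittest.TestCase):
    CASES = [
        "plain paragraph",
        "'''bold''' and ''italic''",
        "* a list\n* of items",
    ]

    def test_roundtrip(self):
        for wikitext in self.CASES:
            # must survive byte-for-byte, otherwise the case fails
            self.assertEqual(deparse(parse(wikitext)), wikitext)

if __name__ == '__main__':
    unittest.main()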
To summarize: We should switch to storing a much more descriptive format, so that changes in the wikitext format do not break anything: the wikitext can simply be generated from the XML, in whichever format you want. This means it should be possible to offer (cleaned-up) MediaWiki wikitext, WikiCreole or many other syntaxes - per user. (Although as far as I can see WikiCreole isn't available as a context-free grammar either...)
That's quite a big benefit - if we use/invent a "standard" XML format, we would be interoperable with any other wiki software that used it. Templates etc. notwithstanding.
Steve
To change the format to XML (and updating the wikitext format at the same time) means we need four important things: an 'old wikitext'->XML converter, an XML->'good wikitext' converter, a 'good wikitext'->XML converter and an XML->HTML parser. (s/converter/parser/, if you care about the exact words.) The 'good wikitext' and HTML parsers should be fairly easy; the first is just plain hard.
I've only ever used one system that worked like that: LambdaMOO. When you write code in that system, it compiles it to bytecode, then decompiles it next time you want to edit it. It had some interesting quirks though:
- Whitespace was self-normalising (not a bad thing)
- Parentheses were self-normalising (sometimes a confusing thing)
- /* Comments */ were stripped out and not stored (a stupid thing)
- You couldn't save non-compiling code
Those are fairly minor problems for a programming language. They are quite major problems for a language intended for laypeople to write articles in. Consider tables - at the moment, we use whitespace quite liberally and inconsistently to make tables easier to work with. Since the way you want it varies from page to page, it would be impossible for the "deparser" to get it how users want it.
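A concrete example - both of these are valid MediaWiki table syntax and render identically, differing only in whitespace:

# Two sources for the *same* table; only the whitespace differs. A
# deparser can emit just one canonical layout, so whichever style a
# page's editors preferred gets clobbered on the first round trip.
compact = "{|\n|a||b\n|-\n|c||d\n|}"

spaced = """{|
| a || b
|-
| c || d
|}"""

# hypothetically: parse(compact) == parse(spaced), so
# deparse(parse(spaced)) may well come back as the compact form.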
I think this is what would happen:
1) User creates new page with lots of wikitext.
2) User saves page.
3) User spots a mistake and clicks "Edit this page" to fix it.
4) User sees that everything has changed from when they saved it.
5) User runs away never to be seen again.
On 11/11/07, Thomas Dalton thomas.dalton@gmail.com wrote:
Those are fairly minor problems for a programming language. They are quite major problems for a language intended for laypeople to write articles in. Consider tables - at the moment, we use whitespace quite liberally and inconsistently to make tables easier to work with. Since the way you want it varies from page to page, it would be impossible for the "deparser" to get it how users want it.
Nothing is ever going to make editing tables in raw text a fun or productive thing to do. Even if we never get wikiwyg happening, a table editor would be pretty bloody useful.
But the broader question of how whitespace would be treated in a parser/deparser is worth considering.
I think this is what would happen:
1) User creates new page with lots of wikitext.
2) User saves page.
3) User spots a mistake and clicks "Edit this page" to fix it.
4) User sees that everything has changed from when they saved it.
5) User runs away never to be seen again.
I asked my gf tonight whether she had ever edited Wikipedia:
1) User clicks "edit"
2) User sees lots of wikitext.
3) Step 5 as above.
Steve
On Mon, Nov 12, 2007 at 12:29:00AM +1100, Steve Bennett wrote:
I asked my gf tonight whether she had ever edited Wikipedia:
1) User clicks "edit"
2) User sees lots of wikitext.
3) Step 5 as above.
Well, that leads us off to "are barriers to entry a feature or a bug", yet another topic that has engendered 500-posting monsters on this list, most recently in about Feb or Mar, I think. :-)
Cheers, -- jr 'feature, if not too tall' a
Not ready for prime time, I'm sure... but my feeble coding efforts have been going into this:
http://www.mediawiki.org/wiki/Extension:TableEdit
Lots of things on the todo, but we do have it running on EcoliWiki.
Jim
On Nov 11, 2007, at 7:29 AM, Steve Bennett wrote:
On 11/11/07, Thomas Dalton thomas.dalton@gmail.com wrote:
Those are fairly minor problems for a programming language. They are quite major problems for a language intended for laypeople to write articles in. Consider tables - at the moment, we use whitespace quite liberally and inconsistently to make tables easier to work with. Since the way you want it varies from page to page, it would be impossible for the "deparser" to get it how users want it.
Nothing is ever going to make editing tables in raw text a fun or productive thing to do. Even if we never get wikiwyg happening, a table editor would be pretty bloody useful.
But the broader question of how whitespace would be treated in a parser/deparser is worth considering.
I think this is what would happen:
1) User creates new page with lots of wikitext.
2) User saves page.
3) User spots a mistake and clicks "Edit this page" to fix it.
4) User sees that everything has changed from when they saved it.
5) User runs away never to be seen again.
I asked my gf tonight whether she had ever edited Wikipedia:
1) User clicks "edit"
2) User sees lots of wikitext.
3) Step 5 as above.
Steve
=====================================
Jim Hu
Associate Professor
Dept. of Biochemistry and Biophysics
2128 TAMU
Texas A&M Univ.
College Station, TX 77843-2128
979-862-4054
On 11/15/07, Jim Hu jimhu@tamu.edu wrote:
Not ready for prime time, I'm sure... but my feeble coding efforts have been going into this:
http://www.mediawiki.org/wiki/Extension:TableEdit
Lots of things on the todo, but we do have it running on EcoliWiki.
I don't suppose you could set up a sandbox somewhere that anyone could play with? I was thinking about making a JavaScript table editor - glad to see you've beaten me to it.
Editing table wikitext sucks. I can never do it without the manual page open, and it seems incredibly fragile - the slightest mistake and half the table implodes. So this extension is extremely welcome.
Steve
On 15/11/2007, Steve Bennett stevagewp@gmail.com wrote:
Editing table wikitext sucks. I can never do it without the manual page open, and it seems incredibly fragile - the slightest mistake and half the table implodes. So this extension is extremely welcome.
Yeah. Basically it's a shorthand for table HTML, and isn't really much of an improvement over it - it's all the pain of table HTML, just with different tokens.
So yes, a table constructor would be marvellous!
- d.
On Sat, Nov 10, 2007 at 11:26:42PM +0100, Merlijn van Deen wrote:
Most importantly, I think we should stop storing wikitext. Storing wikitext makes it hard to change the syntax, because any change would break pretty much every existing page. Wikitext is an ambiguous way of storing what is meant; XML is an unambiguous way of doing so. As the text is compressed anyway, storing wikitext or XML makes little difference in size.
We did this one about six months ago; check the archives.
However, XML makes parsing much easier. Yes, rendering will take two steps, but regenerating the page from the database becomes much simpler (no ugly regexps, just a simple SAX parser). Besides, as a pywikipedia developer, I'd like to have XML output ;)
Sure, but we *still* need to regularize the parser before we can do that.
To summarize: We should switch to storing a much more descriptive format, so that changes in the wikitext format do not break anything: the wikitext can simply be generated from the XML, in whichever format you want. This means it should be possible to offer (cleaned-up) MediaWiki wikitext, WikiCreole or many other syntaxes - per user. (Although as far as I can see WikiCreole isn't available as a context-free grammar either...)
I should note that it seems likely to become harder to calculate diffs if we store the parse tree instead of the wikitext... but on this point I'll be willing to admit I might be entirely off base.
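By way of illustration (invented schema, difflib standing in for the real diff engine):

# Two serializations of the *same* parse tree differ textually, so a
# naive line diff reports changes where none exist unless the stored
# XML is strictly canonicalized. The schema is invented.
import difflib

a = ['<page>', '  <p>Hello <b>world</b></p>', '</page>']
b = ['<page><p>Hello <b>world</b></p></page>']  # same tree, reflowed

print('\n'.join(difflib.unified_diff(a, b, lineterm='')))
# prints a diff even though nothing meaningful changed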
Cheers, -- jra
On 11/10/07, Merlijn van Deen valhallasw@arctus.nl wrote:
To change the format to XML (and updating the wikitext format at the same time) means we need four important things: an 'old wikitext'->XML converter, an XML->'good wikitext' converter, a 'good wikitext'->XML converter and an XML->HTML parser. (s/converter/parser/, if you care about the exact words.) The 'good wikitext' and HTML parsers should be fairly easy; the first is just plain hard.
No, the first isn't hard, for a simple reason: you just edit the current parser a bit. Instead of outputting HTML, it outputs the XML format of your choice. In other words, you change the handful of bits that actually create the output string, while leaving all the twisty logic untouched. Of course this would take some hours to do, but it's not a hard problem, just a bit of work.
The hard part, IMO, is probably creating a wikitext format that can roundtrip with XML. Has anyone tried this kind of thing? Has it worked? Or really, to start with, has anyone created parseable wikitext period?
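Schematically - in Python rather than the parser's actual PHP, and with invented names - the change looks like this: the twisty matching logic is shared, and only the bits that build the output string get swapped.

# Invented names, Python instead of the real PHP: the matching logic
# stays put, only the output-string construction is parameterized.
import re

def emit_bold_html(inner):
    return '<b>%s</b>' % inner

def emit_bold_xml(inner):
    return '<bold>%s</bold>' % inner

def parse_quotes(text, emit_bold):
    # stands in for the existing, unchanged apostrophe-handling logic
    return re.sub(r"'''(.*?)'''", lambda m: emit_bold(m.group(1)), text)

print(parse_quotes("a '''bold''' word", emit_bold_html))  # HTML output
print(parse_quotes("a '''bold''' word", emit_bold_xml))   # XML output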
(Although as far as I can see WikiCreole isn't available as a context-free grammar either...)
There was some discussion about this when WikiCreole inclusion was discussed. Apparently parseability was completely ignored when formulating the syntax (as with MW syntax), so it's probably easier to parse than MediaWiki syntax mainly by virtue of having fewer features.
On 11/10/07, Soo Reams soo@sooreams.com wrote:
- I would volunteer but I probably lack both skill and time.
I must say, perhaps a bit cynically, that this is the best summary of the discussion that anyone's given so far. There's nothing wrong with talk, but it's not going to amount to much if nobody much is being talked *to*.
On 11/11/07, Simetrical Simetrical+wikilist@gmail.com wrote:
I must say, perhaps a bit cynically, that this is the best summary of the discussion that anyone's given so far. There's nothing wrong with talk, but it's not going to amount to much if nobody much is being talked *to*.
I'm planning to help. I don't think the parser is particularly hard to write, if we have a well-defined grammar. We do need buy-in from the major players, who have not been heard from yet.
Steve
On 11/10/07, Steve Bennett stevagewp@gmail.com wrote:
I'm planning to help. I don't think the parser is particularly hard to write, if we have a well-defined grammar.
Not a good premise, since we don't *have* a well-defined grammar.
On 11/11/07, Simetrical Simetrical+wikilist@gmail.com wrote:
Not a good premise, since we don't *have* a well-defined grammar.
True.
The existing, stalled grammar project seemed to be predicated on the notion of writing a grammar that exactly matched the behaviour of the current parser. That turned out to be hard.
Why don't we focus our attention on writing the grammar for the new parser, in a way that is X% compatible with existing wikitext? We can then build both the grammar and the parser simultaneously, after a certain point.
One attractive feature of this method is that we can leave out the "really hard" bits of the grammar at first, then eventually decide whether we want them in the X% at all.
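As a toy example of what I mean (invented, deliberately covering only headings and bold; everything else falls through as plain text rather than breaking):

# Toy "X% grammar": headings and bold are in the grammar; anything
# it doesn't cover survives as literal text instead of erroring.
#
#   line     := heading | textline
#   heading  := '==' text '=='
#   textline := (bold | text)*
#   bold     := "'''" text "'''"

def parse_line(line):
    if line.startswith('==') and line.endswith('==') and len(line) > 4:
        return ('heading', line[2:-2].strip())
    chunks = []
    for n, part in enumerate(line.split("'''")):
        chunks.append(('bold' if n % 2 else 'text', part))
    return ('textline', chunks)

print(parse_line('== Introduction =='))
print(parse_line("plain with '''bold''' inside"))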
Steve
On Sun, Nov 11, 2007 at 01:15:30PM +1100, Steve Bennett wrote:
Why don't we focus our attention on writing the grammar for the new parser, in a way that is X% compatible with existing wikitext? We can then build both the grammar and the parser simultaneously, after a certain point.
Sure.
Go see my message from Thursday about how to calculate "X%". :-)
Cheers, -- jra
On Sun, Nov 11, 2007 at 12:51:50PM +1100, Steve Bennett wrote:
I'm planning to help. I don't think the parser is particularly hard to write, if we have a well-defined grammar. We do need buy-in from the major players, who have not been heard from yet.
It's a bit circular: they're not going to buy in unless there's a reasonable demonstration of possible success **as they see it**.
Cheers, -- jra