Say, while everybody's trying to figure out a formal grammar, have you had a look at Ward Cunningham's exploratory parsing kit? He gave me a demo at OSBridge, and it's a really handy tool. Basically, it's a web app with an asynchronous C backend. You paste a tentative PEG grammar into a textarea, and it runs through whatever corpus you want, showing you representative instances of how it does or does not match. He was running it against the full English Wikipedia on his laptop, and it took only half an hour or something—with results coming in as they were generated, of course.
Using that, they made a PEG-and-then-some implementation of MW syntax that parses darn near all of Wikipedia: https://github.com/AboutUs/kiwi/blob/master/src/syntax.leg. (I call it "PEG-and-then-some" because it does have a lot of callbacks which might interlock with and affect the rule matching—not sure.)
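For anyone who hasn't seen the tool, here is roughly the shape of that explore-a-rule-against-a-corpus workflow, sketched in Python rather than the web-app-plus-asynchronous-C setup Ward actually uses; the regex stands in for a tentative grammar rule, and explore(), sample_size, and the toy corpus are invented purely for illustration.

    # Illustrative only: the real kit compiles a PEG grammar to C and streams
    # results back to a web page; this just shows the sweep-and-sample loop.
    import re
    from collections import deque

    def explore(rule, documents, sample_size=5):
        """Run one tentative rule over a corpus, keeping a handful of
        representative examples of where it matches and where it doesn't."""
        pattern = re.compile(rule, re.MULTILINE)
        hits = deque(maxlen=sample_size)    # bounded: keep only a few samples
        misses = deque(maxlen=sample_size)
        n_hit = n_miss = 0
        for doc in documents:
            m = pattern.search(doc)
            if m:
                n_hit += 1
                hits.append(m.group(0)[:80])
            else:
                n_miss += 1
                misses.append(doc[:80])
        return {"matched": n_hit, "unmatched": n_miss,
                "match_samples": list(hits), "miss_samples": list(misses)}

    corpus = ["== Heading ==\nSome text.", "Just plain prose.", "* a list item"]
    print(explore(r"^==+[^=\n]+==+ *$", corpus))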
Cheers, Erik
On Mon, Jul 11, 2011 at 4:46 PM, Erik Rose erik@mozilla.com wrote:
> Using that, they made a PEG-and-then-some implementation of MW syntax that parses darn near all of Wikipedia: https://github.com/AboutUs/kiwi/blob/master/src/syntax.leg.
It is indeed dang impressive -- I expect to be stealing at least some of those grammar rules. :)
We are however producing a different sort of intermediate structure rather than going straight to HTML output, so things won't be an exact match (especially where we do template stuff).
-- brion
On Jul 11, 2011, at 5:17 PM, Brion Vibber wrote:
> We are however producing a different sort of intermediate structure rather than going straight to HTML output, so things won't be an exact match (especially where we do template stuff).
Nor are we going straight to HTML, which is one reason we didn't steal this stuff. :-)
Trevor & I talked with him extensively about this. BTW, around here, he's just Ward. :)
He too was disappointed that his team wrote rules to directly transform wikitext into HTML.
The parse-everything-in-Wikipedia thing isn't quite what it sounds like. If I recall correctly it works like this:
As part of his job at AboutUs, he was really looking for patterns of Wikitext that he could use to snag business information. One target was the Infobox on Wikipedia. So, the tool was a way of cataloging the various ways that people structure an Infobox template.
Because he wrote this in C, he added rules to the grammar to discard information in favor of keeping a data structure of constant size. That's mostly what the <<< >>> markers in the grammar mean. Anyway, this then serves as a sampling of the majority of the structures one is interested in. The more rules you write, the more "unknown" stuff falls into the fixed-size pool of unparsed structures. IIRC he agreed it might not be so useful if you were writing a grammar for PHP or JS (I assume the same is true for Python).
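To make that constant-size bookkeeping concrete, here is one way it could work, as a guess in Python rather than anything from Ward's C code; the FixedSizeSampler class and record() helper are invented, and the <<< >>> markers themselves are his and aren't reproduced here. Matched constructs just get counted, while text no rule claims is reservoir-sampled into a fixed-size buffer, so memory stays flat however large the corpus gets.

    # A guess at the idea, not Ward's implementation: counts for what the rules
    # recognize, plus a fixed-size random sample of everything they don't.
    import random

    class FixedSizeSampler:
        """Classic reservoir sampling: every snippet ever offered has an equal
        chance of ending up in the bounded sample."""
        def __init__(self, capacity=100):
            self.capacity = capacity
            self.samples = []
            self.seen = 0

        def offer(self, snippet):
            self.seen += 1
            if len(self.samples) < self.capacity:
                self.samples.append(snippet)
            else:
                j = random.randrange(self.seen)
                if j < self.capacity:
                    self.samples[j] = snippet

    rule_counts = {}
    unknown = FixedSizeSampler(capacity=100)

    def record(rule_name, text):
        """Count text a rule matched; sample text that nothing matched."""
        if rule_name is None:
            unknown.offer(text[:120])
        else:
            rule_counts[rule_name] = rule_counts.get(rule_name, 0) + 1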
I worked with Ward a bit on this exploratory stuff and am one of the authors of Kiwi. I met some of you at the Data Summit in Sebastopol back in February. I've been largely lurking here due to time constraints.
Ward's exploratory parser does work pretty much as Neil described. It's made for galloping through huge amounts of data while gathering samples of where a parser works and where it doesn't. I do think it would be useful for testing a grammar for correctness against the full Wikipedia. Keep in mind that it would need to also work inside the XML parser that Ward wrote if you want to parse the Wikipedia dumps (or use a different one wrapped around peg/leg).
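If it helps to picture that plumbing, a rough Python stand-in might look like the sketch below; it is not Ward's C wrapper, the <page>/<title>/<text> element names just follow the MediaWiki export format, the dump file name is only an example, and run_grammar_against() is a hypothetical hook for whatever grammar is under test.

    # Rough stand-in for the XML-dump plumbing, not Ward's C wrapper: stream
    # each page's wikitext out of the dump and hand it to the grammar under test.
    import bz2
    import xml.etree.ElementTree as ET

    def pages(dump_path):
        """Yield (title, wikitext) pairs from a bzip2-compressed XML dump
        without loading the whole file into memory."""
        with bz2.open(dump_path, "rb") as f:
            for _event, elem in ET.iterparse(f, events=("end",)):
                if elem.tag.rsplit("}", 1)[-1] == "page":   # ignore namespaces
                    fields = {child.tag.rsplit("}", 1)[-1]: (child.text or "")
                              for child in elem.iter()}
                    yield fields.get("title", ""), fields.get("text", "")
                    elem.clear()                            # keep memory flat

    # Example (file name is illustrative; run_grammar_against is hypothetical):
    # for title, text in pages("enwiki-latest-pages-articles.xml.bz2"):
    #     run_grammar_against(text)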
Regarding Kiwi, I'm happy to help however I can with the little amount of time I have at the moment. Our grammar for Kiwi is pretty good, I think, and there are only a few places where we have to manipulate it; it's done in the normal way supported by peg/leg: we have a C predicate. Generally this is just to control which expressions have to start at the beginning of a line. We never explicitly turn parts of the grammar on or off; rules either match or they don't. Because it was done as a separate parser, it does not support some things that require being built into the code (e.g. image resizing).
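To make the predicate idea concrete, here is a small Python illustration rather than the C predicates peg/leg actually provides; at_line_start() and match_heading() are invented names and the heading rule is deliberately simplified. The point is that the predicate consumes no input and simply lets the rule fail, which is why nothing in the grammar ever needs to be switched on or off.

    # Not Kiwi's code: a Python illustration of a zero-width predicate that
    # makes a rule apply only at the beginning of a line, by failing the match
    # rather than toggling any part of the grammar.
    def at_line_start(text, pos):
        """True at offset 0 or immediately after a newline; consumes nothing."""
        return pos == 0 or text[pos - 1] == "\n"

    def match_heading(text, pos):
        """Match a '== title ==' heading starting at pos, but only when the
        line-start predicate holds; return (end_offset, title) or None."""
        if not at_line_start(text, pos):
            return None
        if not text.startswith("==", pos):
            return None
        end = text.find("==", pos + 2)
        if end == -1:
            return None
        return end + 2, text[pos + 2:end].strip()

    sample = "intro text == not a heading ==\n== A real heading ==\nbody"
    print(match_heading(sample, sample.index("==")))       # None: mid-line
    print(match_heading(sample, sample.index("\n") + 1))   # (end_offset, 'A real heading')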
As for being disappointed that we built an HTML translator and not a parser that parses to an AST, this was never on the table. Thomas and I built this mostly in our free time with only a bit of time at work and we did it to solve a practical problem: how to render our Wikitext fast enough at AboutUs to remove our caching layer. I'm surprised, Neil, that you think Ward was disappointed with this as he was always supportive of our efforts and indeed introduced us to Peg and spent some time helping us get into writing grammars and understanding the pitfalls. I'm sorry it doesn't solve the problem you guys have off the shelf, but hopefully it helps open some doors, or at least serves as a model of how a grammar can be written.
If I can be of help, please just give me a shout.
Cheers, Karl
I'm probably misremembering that... I probably was the one disappointed in it being a translation to HTML. Still, I understand why you did it that way.
It's kind of amazing how we all have these projects we call parsers, and then they all do completely different things. :)
On 7/11/11 11:01 PM, Karl Matthias wrote:
> I'm surprised, Neil, that you think Ward was disappointed with this as he was always supportive of our efforts and indeed introduced us to Peg and spent some time helping us get into writing grammars and understanding the pitfalls.
----- Original Message -----
From: "Erik Rose" erik@mozilla.com
> He was running it against the full English Wikipedia on his laptop, and it took only half an hour or something—with results coming in as they were generated, of course.
If I correctly understood what you said just there...
*wow*.
Cheers, -- jra