[Wikitext-l] New parser: Kiwi

Wed Feb 2 23:08:30 UTC 2011

Karl Matthias wrote:
> I'm one of the authors of the Kiwi parser and will be presenting it at
> the Data Summit on Friday.  The parser is pretty complete but certainly
> we could use some community support and we encourage feedback and
> participation!  It is a highly functional tool already but it can use
> some polish.  It does actually handle most wikitext, though not
> absolutely everything.
> 
> From your post I can see that you are experiencing a couple of design
> decisions we made in writing this parser.  We did not set out to match
> the exact HTML output of MediaWiki, only to output something that will
> look the same in the browser.  This might not be the best approach, but
> right now this is the case.  Our site doesn't have the same needs as
> Wikipedia so when in doubt we leaned toward what suited our needs and
> not necessarily ultimate tolerance of poor syntax (though it is somewhat
> flexible). 
I felt bad about pointing out issues right after my first try. I understand
that you have much less content than Wikipedia, and can use just a
subset of the markup without worrying about corner cases.
I am approaching it as a tool that could serve as the bigger parser, though.
Currently, it looks like just another wiki syntax that happens to resemble
MediaWiki's.

> Another design decision is that everything that you put in
> comes out wrapped in paragraph tags.  Usually this wraps the whole
> document, so if your whole document was just a heading, then yes it is
> wrapped in paragraph tags.  This is probably not the best way to handle
> this but it's what it currently does.  Feel free to contribute a
> different solution.

That output doesn't seem to be legal HTML*, so I wouldn't justify it as just a
"design decision". The same could be argued for nested <p> tags.

* Opening the <hX> seems to implicitly close the previous <p>, leading
to an unmatched </p>.
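
To illustrate the footnote, here is a minimal Python sketch of that rule. It
is not a validator and has nothing to do with Kiwi's code: the CLOSES_P set is
a hand-picked subset of the elements whose start tag closes an open <p>, and
the class and function names are made up for this example.

```python
from html.parser import HTMLParser

# Subset of elements whose start tag implicitly closes an open <p>.
CLOSES_P = {"h1", "h2", "h3", "h4", "h5", "h6",
            "p", "div", "ul", "ol", "table", "pre"}


class ParagraphChecker(HTMLParser):
    """Counts </p> end tags that have no open <p> left to close."""

    def __init__(self):
        super().__init__()
        self.p_open = False
        self.stray_closes = 0

    def handle_starttag(self, tag, attrs):
        if tag in CLOSES_P:
            self.p_open = False      # implicit </p>
        if tag == "p":
            self.p_open = True

    def handle_endtag(self, tag):
        if tag == "p":
            if self.p_open:
                self.p_open = False
            else:
                self.stray_closes += 1   # nothing to close


def stray_p_closes(html):
    """Return how many </p> tags in html are unmatched under the rule above."""
    checker = ParagraphChecker()
    checker.feed(html)
    checker.close()
    return checker.stray_closes
```

Feeding it a heading wrapped in paragraph tags, stray_p_closes("<p><h1>Title</h1></p>"),
reports one unmatched </p>; a nested "<p>a<p>b</p></p>" does the same.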

> Templates, as you probably know, require full integration with an
> application to work in the way that MediaWiki handles them, because they
> require access to the data store, and possibly other configuration
> information.  We built a parser that works independently of the data
> store (indeed, even on the command line in a somewhat degenerate form). 
> In order to do that, we had to decouple template retrieval from the
> parse.  If you take a look in the Ruby FFI examples, you will see a more
> elegant handling of templates (though it needs work).  When a document is
> parsed, the parser library makes available a list of templates that were
> found, the arguments passed to the template, and the unique replacement
> tag in the document for inserting the template once rendered. Those
> underscored tags that come out are not a bug, they are those unique
> tags.

I supposed that it was something like that, but it was odd that it did
such a conversion instead of leaving them as literals in that case.
I used just the parser binary. I have been looking at the Ruby code and,
despite the unfamiliar language, I now understand a bit more of how it works.
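
As I understand the description, the calling application would do something
like the following Python sketch. Everything here is an assumption made for
illustration: the "__tpl_0__" tag format, the dictionary keys, and the
function names are hypothetical, not Kiwi's actual interface.

```python
def fill_templates(parsed_text, templates, render):
    """Replace each unique placeholder tag with its rendered template.

    templates: list of dicts with hypothetical keys "name", "args", "tag",
    standing in for the list the parser library makes available.
    render:    application callback that fetches and expands one template.
    """
    for tpl in templates:
        rendered = render(tpl["name"], tpl["args"])
        parsed_text = parsed_text.replace(tpl["tag"], rendered)
    return parsed_text


# Hypothetical parser output: HTML with placeholders, plus what was found.
parsed = "<p>Hello __tpl_0__ and __tpl_1__</p>"
found = [
    {"name": "greet", "args": ["world"], "tag": "__tpl_0__"},
    {"name": "greet", "args": ["Kiwi"], "tag": "__tpl_1__"},
]


def render(name, args):
    # Stand-in for a data-store lookup done by the application.
    store = {"greet": "<b>{0}</b>"}
    return store[name].format(*args)


result = fill_templates(parsed, found, render)
```

The point of the split is that the parse never touches the data store; only
the render callback does.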

> Like templates, images require some different solutions if the parser is
> to be decoupled.  Our parser does not re-size images, store them, etc. 
> It just works with image URLs.  If your application requires images to
> be regularized, you would need to implement resizing them at upload, or
> lazily at load time, or whatever works in your scenario. 

A parser shouldn't really need to handle images. At most it would
provide a callback so that the app could do something with the image URLs.
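
A rough Python sketch of the callback idea, under stated assumptions: the
regular expression is only a stand-in for a real grammar, the function names
are invented, and the example.org URL scheme is a placeholder.

```python
import re

# Simplified stand-in for a real grammar: [[Image:name|opt|opt]] or [[File:...]]
IMAGE_LINK = re.compile(r"\[\[(?:Image|File):([^\]|]+)(?:\|([^\]]*))?\]\]")


def convert_images(wikitext, on_image):
    """Hand every image link to on_image(name, options) and splice in
    whatever markup the application returns."""
    def replace(match):
        name = match.group(1).strip()
        options = match.group(2).split("|") if match.group(2) else []
        return on_image(name, options)
    return IMAGE_LINK.sub(replace, wikitext)


def app_image_handler(name, options):
    # Application-side policy: URL mapping, resizing, etc. would live here.
    url = "http://example.org/images/" + name.replace(" ", "_")
    return '<img src="%s" alt="%s">' % (url, name)


html = convert_images("See [[Image:Kiwi bird.jpg|thumb|A kiwi]] here.",
                      app_image_handler)
```

The parser only recognises the link; storage, resizing and URL layout stay
entirely in the application's handler.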

> More work is
> needed in this area, though if you check out http://kiwi.drasticcode.com
> you can see that most image support is working (no resizing).  You can
> also experiment with the parser there as needed.

The URL mapping used there makes some titles impossible to use, such as
an entry for [[Edit]] - http://en.wikipedia.org/wiki/Edit

> Hope that at least helps explain what we've done.  Again, feedback and
> particularly code contributions are appreciated!
> 
> Cheers,
> Karl

Just code lurking for now :)