[Wikitext-l] New parser: Kiwi

Thu Feb 3 04:47:29 UTC 2011

FYI: *we* are seeing your entire message, on-list
-- j

----- Original Message -----
> From: "Karl Matthias" <karl at matthias.org>
> To: wikitext-l at lists.wikimedia.org
> Sent: Tuesday, February 1, 2011 7:48:30 PM
> Subject: Re: [Wikitext-l] New parser: Kiwi
> Apologies... even the second attempt was truncated, it seems. Here's
> one final try.
> 
> Karl
> -----------
> Alan Post wrote:
> > Interesting. Is the PEG grammar available for this parser?
> >
> > -Alan
> 
> It's at https://github.com/AboutUs/kiwi/blob/master/src/syntax.leg
> 
> Get peg/leg from http://piumarta.com/software/peg/
> 
> 
> I just tried it and already found a bug on the first Hello World (it
> surrounds headers inside paragraphs).
> It strangely converts templates into underscored words. They may be
> expecting some other parser piece to restore it. I'm pretty sure there
> are corner cases in the preprocessor (e.g. just looking at the peg file
> they don't handle mixed case noincludes), but I don't think that should
> need to be handled by the parser itself.
> 
> The grammar looks elegant. I doubt it can really handle full wikitext.
> But it would be so nice if it did...
> 
> 
> I'm one of the authors of the Kiwi parser and will be presenting it at
> the Data Summit on Friday. The parser is pretty complete but
> certainly we could use some community support and we encourage
> feedback and participation! It is a highly functional tool already
> but it can use some polish. It does actually handle most wikitext,
> though not absolutely everything.
> 
> From your post I can see that you are experiencing a couple of design
> decisions we made in writing this parser. We did not set out to match
> the exact HTML output of MediaWiki, only to output something that will
> look the same in the browser. This might not be the best approach,
> but right now this is the case. Our site doesn't have the same needs
> as Wikipedia so when in doubt we leaned toward what suited our needs
> and not necessarily ultimate tolerance of poor syntax (though it is
> somewhat flexible). Another design decision is that everything that
> you put in comes out wrapped in paragraph tags. Usually this wraps
> the whole document, so if your whole document was just a heading, then
> yes it is wrapped in paragraph tags. This is probably not the best
> way to handle this but it's what it currently does. Feel free to
> contribute a different solution.
> 
> Templates, as you probably know, require full integration with an
> application to work in the way that MediaWiki handles them, because
> they require access to the data store, and possibly other
> configuration information. We built a parser that works independently
> of the data store (indeed, even on the command line in a somewhat
> degenerate form). In order to do that, we had to decouple template
> retrieval from the parse. If you take a look in the Ruby FFI
> examples, you will see a more elegant handling of templates (though it
> needs work). When a document is parsed, the parser library makes
> available a list of templates that were found, the arguments passed to
> the template, and the unique replacement tag in the document for
> inserting the template once rendered. Those underscored tags that come
> out are not a bug, they are those unique tags. There is a switch to
> disable templates, in which case it just swallows them instead. So
> the template handling workflow (simplistically) is:
> 
> 1. Parse the original document and generate the list of templates,
>    arguments, and replacement tags
> 2. Fetch the first template; if no recursion is needed, insert it
>    into the original document
> 3. Fetch the next template, and so on
> 
> We currently recurse 6 templates deep in the bindings we built for
> AboutUs.org (sysop-only at the moment). Template arguments don't work
> right now, but adding them is fairly trivial. We just haven't done it
> yet.
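
For readers who want to see the shape of that workflow, here is a minimal Python sketch. It is not Kiwi's API: `TemplateRef`, `toy_parse`, `render`, and the `__template_N__` tag format are all hypothetical names made up for illustration, and `toy_parse` is a regex stand-in for the real parser. It only mirrors what the message describes: the parse returns the templates found, their arguments, and a unique replacement tag for each; the caller fetches each template, recurses up to 6 levels, and inserts the result. As in the message, template arguments are collected but not substituted.

```python
import itertools
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple

MAX_DEPTH = 6  # recursion limit mentioned for the AboutUs.org bindings


@dataclass
class TemplateRef:
    """One template found during a parse (hypothetical structure)."""
    name: str
    args: List[str]
    tag: str  # unique replacement tag left in the rendered output


# Toy stand-in for the parser: matches {{name|arg|arg}} with no nesting.
_TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)((?:\|[^{}|]*)*)\}\}")
_counter = itertools.count()


def toy_parse(text: str) -> Tuple[str, List[TemplateRef]]:
    """Replace each template call with a unique tag and report what was found."""
    found: List[TemplateRef] = []

    def swap(match: "re.Match[str]") -> str:
        tag = f"__template_{next(_counter)}__"
        args = match.group(2).split("|")[1:]  # drop the empty piece before the first '|'
        found.append(TemplateRef(match.group(1).strip(), args, tag))
        return tag

    return _TEMPLATE_RE.sub(swap, text), found


def render(text: str,
           parse: Callable[[str], Tuple[str, List[TemplateRef]]],
           fetch: Callable[[str], str],
           depth: int = 0) -> str:
    """Parse, then fetch and insert each template, recursing up to MAX_DEPTH."""
    output, templates = parse(text)
    for ref in templates:
        if depth >= MAX_DEPTH:
            body = ""  # too deep: swallow the template instead of expanding it
        else:
            body = render(fetch(ref.name), parse, fetch, depth + 1)
        output = output.replace(ref.tag, body)
    return output


# The data store lives outside the parser; here it is just a dict.
store = {
    "greeting": "Hello, {{name}}!",
    "name": "world",
    "loop": "x{{loop}}",  # self-referencing template, stopped by MAX_DEPTH
}


def fetch_from_store(name: str) -> str:
    return store.get(name, "")  # unknown templates expand to nothing


print(render("{{greeting}} Bye.", toy_parse, fetch_from_store))
# prints "Hello, world! Bye."
```

The point of the split is that `render` never needs to know how templates are stored, and the parser never needs to know where they come from; swapping `fetch_from_store` for a database lookup would leave the rest unchanged.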
> 
> Like templates, images require some different solutions if the parser
> is to be decoupled. Our parser does not re-size images, store them,
> etc. It just works with image URLs. If your application requires
> images to be regularized, you would need to implement resizing them at
> upload, or lazily at load time, or whatever works in your scenario.
> More work is needed in this area, though if you check out
> http://kiwi.drasticcode.com you can see that most image support is
> working (no resizing). You can also experiment with the parser there
> as needed.
> 
> Hope that at least helps explain what we've done. Again, feedback and
> particularly code contributions are appreciated!
> 
> Cheers,
> Karl