[Wikitext-l] New parser: Kiwi
Karl Matthias
karl at matthias.org
Thu Feb 3 04:23:28 UTC 2011
Ah, and I see that people did receive the original: it's just the
archive that is broken. Thanks for that.
Cheers,
Karl
On Tue, Feb 1, 2011 at 11:19 PM, Andreas Jonsson
<andreas.jonsson at kreablo.se> wrote:
>
> 2011-02-02 01:48, Karl Matthias skrev:
>> Apologies... even the second attempt was truncated, it seems. Here's
>> one final try.
>
> You are hit by the same problem I was a few days ago on this list. You
> have a line that starts with "From your" in the text.
>
> /Andreas
>
>
>> Karl
>> -----------
>> Alan Post wrote:
>> > Interesting. Is the PEG grammar available for this parser?
>> >
>> > -Alan
>>
>> It's at https://github.com/AboutUs/kiwi/blob/master/src/syntax.leg
>>
>> Get peg/leg from http://piumarta.com/software/peg/
>>
>>
>> I just tried it and already found a bug on the first Hello World (it
>> surrounds headers inside paragraphs).
>> It strangely converts templates into underscored words. They may be
>> expecting some other parser piece to restore them. I'm pretty sure there
>> are corner cases in the preprocessor (e.g. just looking at the peg file,
>> they don't handle mixed-case noincludes), but I don't think that should
>> need to be handled by the parser itself.
>>
>> The grammar looks elegant. I doubt it can really handle full wikitext,
>> but it would be so nice if it did...
>>
>>
>> I'm one of the authors of the Kiwi parser and will be presenting it at
>> the Data Summit on Friday. The parser is fairly complete, but we could
>> certainly use some community support, and we encourage feedback and
>> participation! It is a highly functional tool already, though it could
>> use some polish. It does handle most wikitext, if not absolutely
>> everything.
>>
>> From your post I can see that you are running into a couple of design
>> decisions we made in writing this parser. We did not set out to match
>> the exact HTML output of MediaWiki, only to output something that will
>> look the same in the browser. This might not be the best approach, but
>> it is the current one. Our site doesn't have the same needs as
>> Wikipedia, so when in doubt we leaned toward what suited our needs
>> rather than maximal tolerance of poor syntax (though it is somewhat
>> flexible). Another design decision is that everything you put in comes
>> out wrapped in paragraph tags. Usually this wraps the whole document,
>> so if your whole document was just a heading, then yes, it is wrapped
>> in paragraph tags. This is probably not the best way to handle it, but
>> it's what the parser currently does. Feel free to contribute a
>> different solution.
>>
>> Templates, as you probably know, require full integration with an
>> application to work the way MediaWiki handles them, because they
>> require access to the data store and possibly other configuration
>> information. We built a parser that works independently of the data
>> store (indeed, it even runs on the command line, in a somewhat
>> degenerate form). To do that, we had to decouple template retrieval
>> from the parse. If you take a look at the Ruby FFI examples, you will
>> see more elegant handling of templates (though it needs work). When a
>> document is parsed, the parser library makes available a list of
>> templates that were found, the arguments passed to each template, and
>> the unique replacement tag in the document for inserting the template
>> once rendered. Those underscored tags that come out are not a bug;
>> they are those unique tags. There is a switch to disable templates, in
>> which case it just swallows them instead. So the template handling
>> work flow (simplistically) is:
>>
>> 1. Parse the original document and generate the list of templates,
>> arguments, and replacement tags
>> 2. Fetch the first template; if no recursion is needed, insert it
>> into the original document
>> 3. Fetch the next template, etc.
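The fetch-and-replace loop described above can be sketched roughly as follows. This is an illustration only: the names (`parse`, `expand`, `TEMPLATE_STORE`) and the regex-based "parse" are made up for the sketch and are not Kiwi's actual API, but the shape — parse once, get back a template list with unique replacement tags, then fetch and substitute recursively up to a depth limit — follows the workflow in the email:

```ruby
# Hypothetical sketch of the decoupled template workflow.
# The parse step is faked with a regex; the real parser is the Kiwi C library.

MAX_DEPTH = 6  # the recursion limit mentioned for the AboutUs.org bindings

# Stand-in parser: returns the document with each template replaced by a
# unique underscored tag, plus a list of {name:, args:, tag:} records.
def parse(wikitext)
  templates = wikitext.scan(/\{\{(\w+)\}\}/).flatten.map do |name|
    { name: name, args: [], tag: "__TPL_#{name}__" }
  end
  html = wikitext.gsub(/\{\{(\w+)\}\}/) { "__TPL_#{$1}__" }
  [html, templates]
end

# Stand-in for the data store the parser is decoupled from.
TEMPLATE_STORE = { "greeting" => "Hello, {{name}}!", "name" => "World" }

# Fetch each template body, expand it recursively, and substitute it in
# at its unique replacement tag.
def expand(wikitext, depth = 0)
  html, templates = parse(wikitext)
  return html if depth >= MAX_DEPTH
  templates.each do |t|
    body = TEMPLATE_STORE.fetch(t[:name], "")
    html = html.sub(t[:tag], expand(body, depth + 1))
  end
  html
end
```

With templates disabled (the switch mentioned above), the substitution loop would simply be skipped and the tags dropped instead.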
>>
>> We currently recurse 6 templates deep in the bindings we built for
>> AboutUs.org (sysop-only at the moment). Template arguments don't work
>> right now, but it's fairly trivial to do it. We just haven't done it
>> yet.
>>
>> Like templates, images require some different solutions if the parser
>> is to be decoupled. Our parser does not resize images, store them,
>> etc. It just works with image URLs. If your application requires
>> images to be regularized, you would need to resize them at upload, or
>> lazily at load time, or whatever works in your scenario. More work is
>> needed in this area, though if you check out
>> http://kiwi.drasticcode.com you can see that most image support is
>> working (no resizing). You can also experiment with the parser there
>> as needed.
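One way an application could handle the "lazily at load time" option above is to rewrite the image URLs the parser emits so they point at a resizing endpoint. This is a sketch under assumptions: the `/thumb` endpoint and the `thumb_url`/`rewrite_images` helpers are hypothetical, not part of Kiwi:

```ruby
require "uri"

# Hypothetical application endpoint that resizes on demand; the parser
# itself only ever sees and emits the original URLs.
def thumb_url(src, width)
  "/thumb?src=#{URI.encode_www_form_component(src)}&w=#{width}"
end

# Post-parse pass over the parser's HTML output, rewriting each <img>
# source to go through the resizing endpoint.
def rewrite_images(html, width: 220)
  html.gsub(/<img src="([^"]+)"/) do
    %(<img src="#{thumb_url($1, width)}")
  end
end
```

The point of the design is that this pass lives entirely in the application, so the parser stays decoupled from storage and resizing policy.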
>>
>> Hope that at least helps explain what we've done. Again, feedback and
>> particularly code contributions are appreciated!
>>
>> Cheers,
>> Karl
>>
>> _______________________________________________
>> Wikitext-l mailing list
>> Wikitext-l at lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikitext-l
>>
>
>