[Wikitext-l] New parser: Kiwi
Karl Matthias
karl at matthias.org
Thu Feb 3 04:23:28 UTC 2011
Ah, and I see that people did receive the original: it's just the
archive that is broken. Thanks for that.
Cheers,
Karl
On Tue, Feb 1, 2011 at 11:19 PM, Andreas Jonsson
<andreas.jonsson at kreablo.se> wrote:
>
> 2011-02-02 01:48, Karl Matthias skrev:
>> Apologies... even the second attempt was truncated, it seems. Here's
>> one final try.
>
> You are hit by the same problem I was a few days ago on this list. You
> have a line that starts with "From your" in the text.
>
> /Andreas
>
>
>> Karl
>> -----------
>> Alan Post wrote:
>> > Interesting. Is the PEG grammar available for this parser?
>> >
>> > -Alan
>>
>> It's at https://github.com/AboutUs/kiwi/blob/master/src/syntax.leg
>>
>> Get peg/leg from http://piumarta.com/software/peg/
>>
>>
>> I just tried it and already found a bug on the first Hello World (it
>> surrounds headers inside paragraphs).
>> It strangely converts templates into underscored words. They may be
>> expecting some other parser piece to restore them. I'm pretty sure there
>> are corner cases in the preprocessor (e.g. just looking at the peg file,
>> they don't handle mixed-case noincludes), but I don't think that should
>> need to be handled by the parser itself.
>>
>> The grammar looks elegant. I doubt it can really handle full wikitext,
>> but it would be so nice if it did...
>>
>>
>> I'm one of the authors of the Kiwi parser and will be presenting it at
>> the Data Summit on Friday. The parser is fairly complete, but we could
>> certainly use some community support, and we encourage feedback and
>> participation! It is a highly functional tool already, though it could
>> use some polish. It does handle most wikitext, if not absolutely
>> everything.
>>
>> From your post I can see that you are running into a couple of design
>> decisions we made in writing this parser. We did not set out to match
>> the exact HTML output of MediaWiki, only to output something that will
>> look the same in the browser. This might not be the best approach, but
>> it is the current one. Our site doesn't have the same needs as
>> Wikipedia, so when in doubt we leaned toward what suited our needs
>> rather than maximal tolerance of poor syntax (though it is somewhat
>> flexible). Another design decision is that everything you put in comes
>> out wrapped in paragraph tags. Usually this wraps the whole document,
>> so if your whole document was just a heading, then yes, it is wrapped
>> in paragraph tags. This is probably not the best way to handle it, but
>> it's what the parser currently does. Feel free to contribute a
>> different solution.
>>
>> Templates, as you probably know, require full integration with an
>> application to work the way MediaWiki handles them, because they
>> require access to the data store and possibly other configuration
>> information. We built a parser that works independently of the data
>> store (indeed, it even runs on the command line, in a somewhat
>> degenerate form). To do that, we had to decouple template retrieval
>> from the parse. If you take a look at the Ruby FFI examples, you will
>> see more elegant handling of templates (though it needs work). When a
>> document is parsed, the parser library makes available a list of
>> templates that were found, the arguments passed to each template, and
>> the unique replacement tag in the document for inserting the template
>> once rendered. Those underscored tags that come out are not a bug;
>> they are those unique tags. There is a switch to disable templates, in
>> which case it just swallows them instead. So the template handling
>> work flow (simplistically) is:
>>
>> 1. Parse the original document and generate the list of templates,
>> arguments, and replacement tags
>> 2. Fetch the first template; if no recursion is needed, insert it
>> into the original document
>> 3. Fetch the next template, etc.
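The fetch-and-replace loop described above can be sketched roughly as follows. This is an illustration only: the names (`parse`, `expand`, `TEMPLATE_STORE`) and the regex-based "parse" are made up for the sketch and are not Kiwi's actual API, but the shape — parse once, get back a template list with unique replacement tags, then fetch and substitute recursively up to a depth limit — follows the workflow in the email:

```ruby
# Hypothetical sketch of the decoupled template workflow.
# The parse step is faked with a regex; the real parser is the Kiwi C library.

MAX_DEPTH = 6  # the recursion limit mentioned for the AboutUs.org bindings

# Stand-in parser: returns the document with each template replaced by a
# unique underscored tag, plus a list of {name:, args:, tag:} records.
def parse(wikitext)
  templates = wikitext.scan(/\{\{(\w+)\}\}/).flatten.map do |name|
    { name: name, args: [], tag: "__TPL_#{name}__" }
  end
  html = wikitext.gsub(/\{\{(\w+)\}\}/) { "__TPL_#{$1}__" }
  [html, templates]
end

# Stand-in for the data store the parser is decoupled from.
TEMPLATE_STORE = { "greeting" => "Hello, {{name}}!", "name" => "World" }

# Fetch each template body, expand it recursively, and substitute it in
# at its unique replacement tag.
def expand(wikitext, depth = 0)
  html, templates = parse(wikitext)
  return html if depth >= MAX_DEPTH
  templates.each do |t|
    body = TEMPLATE_STORE.fetch(t[:name], "")
    html = html.sub(t[:tag], expand(body, depth + 1))
  end
  html
end
```

With templates disabled (the switch mentioned above), the substitution loop would simply be skipped and the tags dropped instead.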
>>
>> We currently recurse 6 templates deep in the bindings we built for
>> AboutUs.org (sysop-only at the moment). Template arguments don't work
>> right now, but it's fairly trivial to do it. We just haven't done it
>> yet.
>>
>> Like templates, images require some different solutions if the parser
>> is to be decoupled. Our parser does not resize images, store them,
>> etc. It just works with image URLs. If your application requires
>> images to be regularized, you would need to resize them at upload, or
>> lazily at load time, or whatever works in your scenario. More work is
>> needed in this area, though if you check out
>> http://kiwi.drasticcode.com you can see that most image support is
>> working (no resizing). You can also experiment with the parser there
>> as needed.
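One way an application could handle the "lazily at load time" option above is to rewrite the image URLs the parser emits so they point at a resizing endpoint. This is a sketch under assumptions: the `/thumb` endpoint and the `thumb_url`/`rewrite_images` helpers are hypothetical, not part of Kiwi:

```ruby
require "uri"

# Hypothetical application endpoint that resizes on demand; the parser
# itself only ever sees and emits the original URLs.
def thumb_url(src, width)
  "/thumb?src=#{URI.encode_www_form_component(src)}&w=#{width}"
end

# Post-parse pass over the parser's HTML output, rewriting each <img>
# source to go through the resizing endpoint.
def rewrite_images(html, width: 220)
  html.gsub(/<img src="([^"]+)"/) do
    %(<img src="#{thumb_url($1, width)}")
  end
end
```

The point of the design is that this pass lives entirely in the application, so the parser stays decoupled from storage and resizing policy.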
>>
>> Hope that at least helps explain what we've done. Again, feedback and
>> particularly code contributions are appreciated!
>>
>> Cheers,
>> Karl
>>
>> _______________________________________________
>> Wikitext-l mailing list
>> Wikitext-l at lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikitext-l
>>
>
>