Hi,
I have set up a site for testing my parser implementation:
http://libmwparser.kreablo.se/index.php/Libmwparsertest
Please go ahead and edit.
I have disabled most of the preprocessing, as it seems very hard to
separate the independent preprocessing from the parser preparation
code. But it should be easy to write a new preprocessor with only the
required functionality: parser functions, magic words, comment
removal, and transclusion.
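As an illustration of how small such a preprocessor could be, here is
a minimal sketch of just the comment-removal pass (my own code, not
part of libmwparser):

    #include <string.h>

    /* Sketch only: copy src to dst, dropping everything between
     * "<!--" and "-->". */
    static void
    stripComments(const char *src, char *dst)
    {
        while (*src != '\0') {
            if (strncmp(src, "<!--", 4) == 0) {
                const char *end = strstr(src + 4, "-->");
                if (end == NULL)
                    break;        /* unterminated comment: drop the rest */
                src = end + 3;    /* skip past "-->" */
            } else {
                *dst++ = *src++;
            }
        }
        *dst = '\0';
    }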
It would still take a lot of work to produce a version that could be
substituted for the current parser with support for all features. But
it's a solid proof of concept.
Best regards,
Andreas Jonsson
The syntax of image links with captions is seriously flawed, but I
think that I have found a reasonable way to handle them: parse them
as "inline blocks".
To make an inline block out of the image link with caption, we first
let it have its own block context in the lexer, in order to guarantee
nesting order of internal block elements. This means that the end
token cannot appear in the wrong block context:
[[File:example.jpg|<table><td> this ]] is not an end token
for the image link</table> but this ]] is
I have already discussed the image links in the context of speculative
execution in the lexer, to guarantee that any opened image link will
be followed by an image link closing token. The max nesting level for
links is limited to 2 to avoid pathological speculations.
In the parser, inline blocks may appear in inlined text lines.
However, they break the inlined text line for the purposes of
apostrophe parsing: since block elements may appear in the image
caption, the caption cannot be part of the lookahead that scans for
apostrophes. This means that in this example:
text '' italic [[File:example.jpg| text ]] foo '' bar
the text "text '' italic" and the text " foo '' bar" are processed
separately when it comes to apostrophe parsing and the result will be:
<p>text <i> italic</i><a ...><img ..></a>foo <i> bar </i></p>
Which is different from the current parser, where we have:
<p>text <i> italic<a ...><img ..></a>foo </i> bar</p>
However, the behavior will be the same regardless of newlines in the
caption:
text '' italic [[File:example.jpg| text
text ]] foo '' bar
still:
<p>text <i> italic</i><a ...><img ..></a>foo <i> bar </i></p>
The original parser, however, has problems:
<p>text <i> italic<a ...><img ..></a>foo bar </i></i></p>
(My guess is that it first renders the </i> inside the alt
attribute, which is cleaned up by the attribute sanitizing, and then
it discovers that there is a missing </i> and adds it in.)
In the original parser, wikitext list elements cannot appear in image
captions. It would, of course, be very easy to just disable the
wikitext list tokens in the lexer to provide the same behavior, but
this seems inconsistent, as any other block element may appear in the
caption. If we instead, in the parser, push/pop the current list
context to a stack when entering/leaving an "inlined block", we can
support lists inside the caption with the expected behavior in this
case (a sketch follows the example):
* list [[File:example.jpg|
* list item in image caption ]]
* continuing outer list
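A minimal sketch of the push/pop idea; the types and names here are
hypothetical, not the real parser context:

    /* Hypothetical types and names; the real MWPARSERCONTEXT differs. */
    #define MW_MAX_NESTING 16

    typedef struct {
        char markers[MW_MAX_NESTING]; /* open list markers, e.g. "*#" */
        int  depth;
    } MWLISTCONTEXT;

    typedef struct {
        MWLISTCONTEXT list;                  /* current list context */
        MWLISTCONTEXT saved[MW_MAX_NESTING]; /* stack of outer contexts */
        int           savedTop;
    } LISTSTATE;

    /* Close any lists still open in the current context (emitting the
     * corresponding end events would go here). */
    static void
    closeOpenLists(LISTSTATE *s)
    {
        s->list.depth = 0;
    }

    /* Entering an inline block (e.g. an image caption): save the outer
     * list context and give the caption a fresh, empty one. */
    static void
    beginInlineBlock(LISTSTATE *s)
    {
        s->saved[s->savedTop++] = s->list;
        s->list.depth = 0;
    }

    /* Leaving the inline block: close the caption's own lists, then
     * restore the outer context so the surrounding list continues as
     * if never interrupted. */
    static void
    endInlineBlock(LISTSTATE *s)
    {
        closeOpenLists(s);
        s->list = s->saved[--s->savedTop];
    }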
It is up to the listener to decide what to do with the link caption.
Since the caption is fully parsed, the listening application must be
prepared for this. In HTML output, the caption is rendered inside an
'alt' attribute, unless there is a 'frame' or 'thumb' option and no
explicit 'alt' option (in which case the caption is completely
ignored). So the listener should be able to toggle rendering of
markup on and off in order to render the caption inside the alt
attribute.
/Andreas
In my previous post I covered the lexer. Here I will describe the
parser, the parser context, and the listener interface. After the
lexer's extensive job at providing a reasonably well formed token
stream, the parser's job becomes completely straightforward.
== The parser ==
For inline elements, the parser just mindlessly reports these to the
context object:
inline_element:
    word | space | special | br | html_entity | link_element
    | format | nowiki | table_of_contents | html_inline_tag
    ;

space:
    token = SPACE_TAB  { IE(CX->onSpace(CX, $token->getText($token));) }
    ;

etc.
The lexer guarantees that a closing token will not appear before
a corresponding opening token, and the parser context takes care of
nesting formats and removing empty format tags.
For block elements, the only special thing the parser needs to pay
attention to is the fact that end tokens may be missing. Therefore,
end-of-file is always accepted in place of the closing token, for
instance:
html_div:
    token = HTML_DIV_OPEN
    {
        CX->beginHtmlDiv(CX, $token->custom);
    }
    block_element_contents
    (HTML_DIV_CLOSE | EOF)
    {
        CX->endHtmlDiv(CX);
    }
    ;
The rule 'block_element_contents' covers all parser productions. The
lexer restricts which tokens may appear. For instance,
'HTML_DIV_CLOSE' will never appear before a corresponding
'HTML_DIV_OPEN'. Also, list items and table cells will not appear
unless the current block context is correct. I have also introduced
a maximum nesting level in the lexer, so stack space is not an issue
either.
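The depth guard could be as simple as the following sketch (the names
are mine, and the real lexer's bookkeeping is more involved): open
tokens beyond the limit are demoted to plain text, so nesting depth,
and thus stack use, stays bounded.

    #define MW_MAX_BLOCK_DEPTH 32

    static int blockDepth = 0;

    /* Called before emitting an open token; returns 0 if the token
     * should instead be passed through as literal text. */
    static int
    mayOpenBlock(void)
    {
        if (blockDepth >= MW_MAX_BLOCK_DEPTH)
            return 0;
        blockDepth++;
        return 1;
    }

    static void
    closeBlock(void)
    {
        if (blockDepth > 0)
            blockDepth--;
    }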
== The parser context ==
The parser context relays the parser events to a listener, but it
will insert and remove events to produce well-formed output. For
instance:
text '' italic <b><strong /> bold-italic
bold </b> text
will result in an event stream to the listener that will look like this:
text <i> italic <b> bold-italic </b></i>
<b> bold </b> text
Two mechanisms are used to implement this:
* The call to the "begin" method is delayed until some actual inline
content is produced. The call is never made if an "end" event is
received before such content.
* The order of the formats is maintained, so that inner formats can be
closed and reopened when a non-matching end token is received.
So, most of the parser context's methods look like this:
static void
beginHtmlStrong(MWPARSERCONTEXT *context, pANTLR3_VECTOR attr)
{
    MW_DELAYED_CALL( context, beginHtmlStrong, endHtmlStrong,
                     attr, NULL);
    MW_BEGIN_ORDERED_FORMAT(context, beginHtmlStrong, endHtmlStrong,
                            attr, NULL, false);
    MWLISTENER *l = &context->listener;
    l->beginHtmlStrong(l, attr);
}

static void
endHtmlStrong(MWPARSERCONTEXT *context)
{
    MW_SKIP_IF_EMPTY( context, beginHtmlStrong, endHtmlStrong, NULL);
    MW_END_ORDERED_FORMAT(context, beginHtmlStrong, endHtmlStrong, NULL);
    MWLISTENER *l = &context->listener;
    l->endHtmlStrong(l);
}
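The macro names above are from the real code; what follows is only my
guess at the idea behind the ordered-format mechanism, in simplified
form (attributes and bounds checks omitted, names invented). Open
formats are kept on a stack; when an end token arrives for a format
that is not on top, the inner formats are closed, the matching one is
closed, and the inner ones are reopened, yielding well-nested output.

    typedef struct MWLISTENER MWLISTENER;

    typedef struct {
        void (*begin)(MWLISTENER *l);
        void (*end)(MWLISTENER *l);
    } FORMAT;

    static FORMAT formatStack[16];
    static int formatTop = 0;

    static void
    beginOrderedFormat(MWLISTENER *l,
                       void (*begin)(MWLISTENER *l),
                       void (*end)(MWLISTENER *l))
    {
        formatStack[formatTop].begin = begin;
        formatStack[formatTop].end = end;
        formatTop++;
        begin(l);
    }

    static void
    endOrderedFormat(MWLISTENER *l, void (*end)(MWLISTENER *l))
    {
        int i, match;

        /* find the innermost open format with this end method */
        for (match = formatTop - 1; match >= 0; match--)
            if (formatStack[match].end == end)
                break;
        if (match < 0)
            return;                    /* no such format is open */

        /* close the inner formats and the matching one, innermost first */
        for (i = formatTop - 1; i >= match; i--)
            formatStack[i].end(l);

        /* reopen the formats that were closed only to restore nesting */
        for (i = match; i < formatTop - 1; i++) {
            formatStack[i] = formatStack[i + 1];
            formatStack[i].begin(l);
        }
        formatTop--;
    }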
Block elements are already guaranteed by the lexer to be well nested,
so the context typically does not need to do anything special about
them. Only the wikitext list elements need to be resolved by the
context.
== The listener ==
The listening application needs to implement the MWLISTENER interface.
I haven't added support for all features yet, but at the moment, there
are 91 methods in this interface. They are trivial to implement,
though. The only thing to think about is that it is the listener's
responsibility to escape the contents of nowiki and special
characters, and also to filter the attribute lists.
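To give an idea of the shape of the interface, here is a sketch of a
small slice of it; the exact method names and struct layout are my
guesses based on the events mentioned in these posts, not the real
header:

    #include <stdio.h>

    /* Sketch only: the real MWLISTENER has ~91 methods. */
    typedef struct MWLISTENER MWLISTENER;
    struct MWLISTENER {
        void (*onWord)(MWLISTENER *l, const char *text);
        void (*beginHtmlStrong)(MWLISTENER *l, void *attr);
        void (*endHtmlStrong)(MWLISTENER *l);
        /* ... many more ... */
    };

    /* An HTML listener must escape special characters itself; the
     * parser passes text through unescaped. */
    static void
    htmlOnWord(MWLISTENER *l, const char *text)
    {
        (void)l;    /* listener state unused in this sketch */
        for (; *text != '\0'; text++) {
            switch (*text) {
            case '<': fputs("&lt;", stdout);  break;
            case '>': fputs("&gt;", stdout);  break;
            case '&': fputs("&amp;", stdout); break;
            default:  fputc(*text, stdout);   break;
            }
        }
    }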
/Andreas
I have just committed an initial version of a PHP wrapper library for
my parser.
http://svn.wikimedia.org/svnroot/mediawiki/trunk/parsers/libmwparser
An example of how it can be used:
include("mwp.php");
$istream = MWParserOpenString("input", "<strong id=hello>Hello
World!", MWPARSER_UTF8);
$parser = new_MWPARSER($istream);
$out = MWParseArticle($parser);
print implode($out). "\n";
MWParserCloseInputStream($istream);
$istream = MWParserOpenString("input", "{|\n|[[Hello|hello world!]]",
MWPARSER_UTF8);
MWParserReset($parser, $istream);
$out = MWParseArticle($parser);
print implode($out). "\n";
which gives the following output:
<p><strong id="hello">Hello World!</strong></p>
<table><tbody><tr><td><!-- BEGIN INTERNAL LINK [Hello] -->hello
world!<!-- END INTERNAL LINK --></td></tr></tbody></table>
As you can see, I haven't sorted out the internal link resolution yet.
But there is an efficient solution to this: make the database lookup
after the lexer has run, before the parser runs. This is possible
as all internal links are already known at that stage, and it would
enable the parser to generate the links directly without any
postprocessing.
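A sketch of that idea, with invented token and helper names: one pass
over the lexed tokens gathers every link target, and a single batch
lookup (one database round trip) then marks each token, so the parser
can emit finished links directly.

    /* Hypothetical token record; the real lexer's tokens differ. */
    typedef struct {
        int   type;            /* e.g. INTERNAL_LINK */
        char *target;          /* link target, if a link token */
        int   targetExists;    /* filled in by the batch lookup */
    } TOKEN;

    #define INTERNAL_LINK 1

    static void
    resolveLinks(TOKEN *tokens, int n,
                 void (*batchLookup)(TOKEN **links, int nlinks))
    {
        TOKEN *links[1024];
        int nlinks = 0, i;

        for (i = 0; i < n && nlinks < 1024; i++) {
            if (tokens[i].type == INTERNAL_LINK)
                links[nlinks++] = &tokens[i];
        }
        batchLookup(links, nlinks);   /* e.g. one SQL IN (...) query */
    }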
Since it doesn't completely replace the current parser, it will take a
bit of surgery to insert it into an instance of MediaWiki. I haven't
tried this yet.
There is a lot of tedious work left before everything is complete.
For instance, a large part of Sanitizer.php must be ported to C in
order to validate the HTML attributes.
Best regards,
/Andreas
Hello,
In commit 72458 I've added the InlineEditor extension. [1] This extension is a working implementation of the prototype(s) posted earlier on this list. It's not meant for use on live wikis, but rather as a proof of concept and a framework to experiment with. I will explain the extension in detail for those of you who might be interested.
== Design overview ==
The extension consists of several parts, structured in sub-directories like the UsabilityInitiative extension. The InlineEditor extension itself provides a framework for different edit modes to build on. It displays the edit modes, provides an interface to mark editable pieces of wikitext, provides a client-side inline editor which the edit modes *may* use, is configurable with several fallback options to the full/traditional editor, and handles previewing, publishing, undo, and redo.
Each of the other extensions provides an edit mode for the InlineEditor extension. They hook into InlineEditorMark and InlineEditorDefineEditors. The first is called whenever wikitext is passed through the extension, so that all edit modes can mark their editable pieces. Once this is done, a few algorithms combine these markings with information about previously edited pieces and generate both wikitext to run through the parser and JSON that is passed to the client, mapping the editable pieces to the original wikitext. The other hook includes CSS, JS, and messages in the page.
== Limitations ==
There are many things which are sub-optimal right now:
* The editor is slow. Whenever a small element is changed and previewed, the entire page is reparsed. This will be fixed by parsing only the element where possible (it is not always possible: references, for example, have side effects at the bottom of the page).
* For now it's only possible to use the editor as the primary editor, with a link to the full/traditional editor. There will be a configuration option for whether to do this, or to display a message at the top of the traditional edit page offering to switch to this editor.
* I've not tested things in older browsers (or IE at all, for that matter). I only know it runs fine in Firefox and Chrome, but it may have bugs in other browsers.
* The edit modes are really, really basic right now. They may or may not screw things up. Most of them have just one or a few regular expressions which do well in general, but may fail on many edge cases.
* The editor may not handle all the messages and edge cases of the traditional editor.
* The extension is written for MediaWiki 1.16 and may or may not work with other versions.
Also, I'm not at all sure whether the current set of edit modes is the way to go. Currently they are mutually exclusive, meaning that text marked by one edit mode is never included in text marked by another. Maybe it's better not to have edit modes like this, but rather different granularities of editing, e.g. sentence => paragraph => block. That way the user would become familiar with more wikitext instead of always seeing small portions. The framework currently doesn't allow for overlap in markings, but I will work to make this possible.
== Goals ==
The goal of this extension is to provide a framework to easily experiment with different modes of in-line editing. Feel free to write extensions that use this framework, or to help with the framework itself. Any usability or technical suggestions are also welcome!
I hope to get some documentation up on mediawiki.org soon, but note that the code is heavily documented inline. Feel free to ask any questions: I'm probably forgetting to mention some things that may not be clear to everyone. Also, there is no public wiki to test this extension on at the moment; I will work on that, but if someone else can enable it on a test wiki, that would be great too!
To install the extension(s), check the instructions in /trunk/extensions/InlineEditor/InlineEditor.php. Thanks for taking the time to read this!
Regards,
Jan Paul
[1] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/72458
I would like to use LIME -- http://sourceforge.net/projects/lime-php/
-- instead of a series of regular expression replacement statements to
convert GIFT -- http://microformats.org/wiki/gift -- to the Quiz
Extension format, for both maintainability and readability. However,
I am concerned about the code review situation and am not sure it is
reasonable to depend on what would be a much more difficult code
review.
On the other hand, LIME has been stable for years, has two good
reviews and the author seems reasonable:
http://c2.com/cgi/wiki?IanKjos
The included calculator example is easily accessible, all my
experiments with it so far have gone well, and I love that it
includes an option for native-code compilation of inner-loop code
(lemon.c), but I am interested in using it to populate larger data
structures and in how it behaves in production PHP. Does anyone know
anyone else who has used it?
My understanding is that some subsets of the wikitext parser could
easily be converted to a more formal grammar while others need to
remain in PHP (e.g., transclusion), and I am familiar with many of
wikitext's parsing ambiguity conflicts. I am not an expert in how to
resolve such conflicts in LALR(1) grammars -- although I can squeak
through the trial-and-error process. However, I am absolutely certain
that moving wikitext parsing to a formal grammar would provide
serious opportunities for engineering improvements, not least in
maintainability and readability.
Therefore, I am considering submitting LIME for code review, but I
want to try something different. I would like to ask for community
volunteers to review it first, with comments, before I submit it to be
reviewed by the official development team.
Are there any volunteers willing to provide a preliminary code review for LIME?
Best regards,
James Salsman
I have previously written about speculative execution in the lexer.
To exactly reproduce the behavior of image links, not one but two
speculations would be necessary. However, this is very complex, and
the use case is undocumented, so I would like to simplify it.
The original behavior is as follows: the option list is split on the
'|' character; the caption is the _last_ non-option in the list, if
any.
So, to reproduce this, a separate speculation has to be initiated for
the caption. If another caption (non-option) is seen in the list, the
speculation will fail.
Furthermore, media links may nest one level. If a MEDIA_LINK or
INTERNAL_LINK appears in the caption at the second level, the
production will fail completely.
I think that the following is a reasonable simplification: image links
may not nest (although internal links and external links may appear in
the caption of a media link). The _first_ non-option in the list is
the caption, and no options may appear after the caption. In this way,
only one speculation is required for media links, and the lexer can
handle the option list. This behavior seems consistent with the
documentation at http://www.mediawiki.org/wiki/Help:Images.
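A sketch of the simplified rule, with invented names and a toy option
set (the real lexer knows many more options and would not use strtok):

    #include <string.h>

    static int
    isImageOption(const char *s)
    {
        /* a few examples; the real option set is much larger */
        return strcmp(s, "thumb") == 0 || strcmp(s, "frame") == 0
            || strcmp(s, "left") == 0  || strcmp(s, "right") == 0
            || strncmp(s, "alt=", 4) == 0;
    }

    /* Returns 1 and sets *caption on success, 0 if the option list
     * violates the simplified rule.  Modifies optlist in place. */
    static int
    parseImageOptions(char *optlist, char **caption)
    {
        char *part;

        *caption = NULL;
        for (part = strtok(optlist, "|"); part != NULL;
             part = strtok(NULL, "|")) {
            if (*caption != NULL)
                return 0;          /* nothing may follow the caption */
            if (!isImageOption(part))
                *caption = part;   /* first non-option is the caption */
        }
        return 1;
    }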
Is there any known use for putting an image inside an image caption,
or is the restriction I propose here sufficient?
Best regards,
Andreas Jonsson