Hi,
I have set up a site for testing my parser implementation:
http://libmwparser.kreablo.se/index.php/Libmwparsertest
Please go ahead and edit.
I have disabled most of the preprocessing, as it seems very hard to
separate the independent preprocessing from the parser preparation
code. But it should be easy to write a new preprocessor with only the
required functionality: parser functions, magic words, comment
removal, and transclusion.
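As an illustration of how small such a preprocessor could be, here is
a minimal sketch of just the comment-removal pass (my own code, not
part of libmwparser):

    #include <string.h>

    /* Sketch only: copy src to dst, dropping everything between
     * "<!--" and "-->". */
    static void
    stripComments(const char *src, char *dst)
    {
        while (*src != '\0') {
            if (strncmp(src, "<!--", 4) == 0) {
                const char *end = strstr(src + 4, "-->");
                if (end == NULL)
                    break;        /* unterminated comment: drop the rest */
                src = end + 3;    /* skip past "-->" */
            } else {
                *dst++ = *src++;
            }
        }
        *dst = '\0';
    }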
It would still take a lot of work to produce a version that could be
substituted for the current parser with support for all features. But
it's a solid proof of concept.
Best regards,
Andreas Jonsson
The syntax of image links with captions is seriously flawed, but I
think that I have found a reasonable way to handle them: parse them
as "inline blocks".
To make an inline block out of the image link with caption, we first
let it have its own block context in the lexer, in order to guarantee
nesting order of internal block elements. This means that the end
token cannot appear in the wrong block context:
[[File:example.jpg|<table><td> this ]] is not an end token
for the image link</table> but this ]] is
I have already discussed the image links in the context of speculative
execution in the lexer, to guarantee that any opened image link will
be followed by an image link closing token. The max nesting level for
links is limited to 2 to avoid pathological speculations.
In the parser, inline blocks may appear in inlined text lines.
However, they break the inlined text line for the purposes of
apostrophe parsing: since block elements may appear in the image
caption, the caption cannot be part of the lookahead that scans for
apostrophes. This means that in this example:
text '' italic [[File:example.jpg| text ]] foo '' bar
the text "text '' italic" and the text " foo '' bar" are processed
separately when it comes to apostrophe parsing and the result will be:
<p>text <i> italic</i><a ...><img ..></a>foo <i> bar </i></p>
Which is different from the current parser, where we have:
<p>text <i> italic<a ...><img ..></a>foo </i> bar</p>
However, the behavior will be the same regardless of newlines in the
caption:
text '' italic [[File:example.jpg| text
text ]] foo '' bar
still:
<p>text <i> italic</i><a ...><img ..></a>foo <i> bar </i></p>
The original parser, however, has problems:
<p>text <i> italic<a ...><img ..></a>foo bar </i></i></p>
(My guess is that it first renders the </i> inside the alt
attribute, which is cleaned up by the attribute sanitizing, and then
it discovers that there is a missing </i> and adds it in.)
In the original parser, wikitext list elements cannot appear in image
captions. It would, of course, be very easy to just disable the
wikitext list tokens in the lexer to provide the same behavior, but
this seems inconsistent, as any other block element may appear in the
caption. If we instead, in the parser, push/pop the current list
context to a stack when entering/leaving an "inlined block", we can
support lists inside the caption with the expected behavior in this
case (a sketch follows the example):
* list [[File:example.jpg|
* list item in image caption ]]
* continuing outer list
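A minimal sketch of the push/pop idea; the types and names here are
hypothetical, not the real parser context:

    /* Hypothetical types and names; the real MWPARSERCONTEXT differs. */
    #define MW_MAX_NESTING 16

    typedef struct {
        char markers[MW_MAX_NESTING]; /* open list markers, e.g. "*#" */
        int  depth;
    } MWLISTCONTEXT;

    typedef struct {
        MWLISTCONTEXT list;                  /* current list context */
        MWLISTCONTEXT saved[MW_MAX_NESTING]; /* stack of outer contexts */
        int           savedTop;
    } LISTSTATE;

    /* Close any lists still open in the current context (emitting the
     * corresponding end events would go here). */
    static void
    closeOpenLists(LISTSTATE *s)
    {
        s->list.depth = 0;
    }

    /* Entering an inline block (e.g. an image caption): save the outer
     * list context and give the caption a fresh, empty one. */
    static void
    beginInlineBlock(LISTSTATE *s)
    {
        s->saved[s->savedTop++] = s->list;
        s->list.depth = 0;
    }

    /* Leaving the inline block: close the caption's own lists, then
     * restore the outer context so the surrounding list continues as
     * if never interrupted. */
    static void
    endInlineBlock(LISTSTATE *s)
    {
        closeOpenLists(s);
        s->list = s->saved[--s->savedTop];
    }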
It is up to the listener to decide what to do with the link caption.
Since the caption is fully parsed, the listening application must be
prepared for this. In HTML output, the caption is rendered inside an
'alt' attribute, unless there is a 'frame' or 'thumb' option and no
explicit 'alt' option (in which case the caption is completely
ignored). So the listener should be able to toggle rendering of
markup on and off in order to render the caption inside the alt
attribute.
/Andreas
In my previous post I covered the lexer. Here I will describe the
parser, the parser context, and the listener interface. After the
lexer's extensive job at providing a reasonably well formed token
stream, the parser's job becomes completely straightforward.
== The parser ==
For inline elements, the parser just mindlessly reports these to the
context object:
inline_element:
    word | space | special | br | html_entity | link_element
    | format | nowiki | table_of_contents | html_inline_tag
    ;

space:
    token = SPACE_TAB  { IE(CX->onSpace(CX, $token->getText($token));) }
    ;

etc.
The lexer guarantees that a closing token will not appear before
a corresponding opening token, and the parser context takes care of
nesting formats and removing empty format tags.
For block elements, the only special thing the parser needs to pay
attention to is the fact that end tokens may be missing. Therefore,
end-of-file is always accepted in place of the closing token, for
instance:
html_div:
    token = HTML_DIV_OPEN
    {
        CX->beginHtmlDiv(CX, $token->custom);
    }
    block_element_contents
    (HTML_DIV_CLOSE | EOF)
    {
        CX->endHtmlDiv(CX);
    }
    ;
The rule 'block_element_contents' covers all parser productions. The
lexer restricts which tokens may appear. For instance,
'HTML_DIV_CLOSE' will never appear before a corresponding
'HTML_DIV_OPEN'. Also, list items and table cells will not appear
unless the current block context is correct. I have also introduced
a maximum nesting level in the lexer, so stack space is not an issue
either.
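The depth guard could be as simple as the following sketch (the names
are mine, and the real lexer's bookkeeping is more involved): open
tokens beyond the limit are demoted to plain text, so nesting depth,
and thus stack use, stays bounded.

    #define MW_MAX_BLOCK_DEPTH 32

    static int blockDepth = 0;

    /* Called before emitting an open token; returns 0 if the token
     * should instead be passed through as literal text. */
    static int
    mayOpenBlock(void)
    {
        if (blockDepth >= MW_MAX_BLOCK_DEPTH)
            return 0;
        blockDepth++;
        return 1;
    }

    static void
    closeBlock(void)
    {
        if (blockDepth > 0)
            blockDepth--;
    }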
== The parser context ==
The parser context relays the parser events to a listener, but it
will insert and remove events to produce well-formed output. For
instance:
text '' italic <b><strong /> bold-italic
bold </b> text
will result in an event stream to the listener that will look like this:
text <i> italic <b> bold-italic </b></i>
<b> bold </b> text
Two mechanisms are used to implement this:
* The call to the "begin" method is delayed until some actual inline
content is produced. The call is never made if an "end" event is
received before such content.
* The order of the formats is maintained, so that inner formats can be
closed and reopened when a non-matching end token is received.
So, most of the parser context's methods look like this:
static void
beginHtmlStrong(MWPARSERCONTEXT *context, pANTLR3_VECTOR attr)
{
    MW_DELAYED_CALL( context, beginHtmlStrong, endHtmlStrong,
                     attr, NULL);
    MW_BEGIN_ORDERED_FORMAT(context, beginHtmlStrong, endHtmlStrong,
                            attr, NULL, false);
    MWLISTENER *l = &context->listener;
    l->beginHtmlStrong(l, attr);
}

static void
endHtmlStrong(MWPARSERCONTEXT *context)
{
    MW_SKIP_IF_EMPTY( context, beginHtmlStrong, endHtmlStrong, NULL);
    MW_END_ORDERED_FORMAT(context, beginHtmlStrong, endHtmlStrong, NULL);
    MWLISTENER *l = &context->listener;
    l->endHtmlStrong(l);
}
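The macro names above are from the real code; what follows is only my
guess at the idea behind the ordered-format mechanism, in simplified
form (attributes and bounds checks omitted, names invented). Open
formats are kept on a stack; when an end token arrives for a format
that is not on top, the inner formats are closed, the matching one is
closed, and the inner ones are reopened, yielding well-nested output.

    typedef struct MWLISTENER MWLISTENER;

    typedef struct {
        void (*begin)(MWLISTENER *l);
        void (*end)(MWLISTENER *l);
    } FORMAT;

    static FORMAT formatStack[16];
    static int formatTop = 0;

    static void
    beginOrderedFormat(MWLISTENER *l,
                       void (*begin)(MWLISTENER *l),
                       void (*end)(MWLISTENER *l))
    {
        formatStack[formatTop].begin = begin;
        formatStack[formatTop].end = end;
        formatTop++;
        begin(l);
    }

    static void
    endOrderedFormat(MWLISTENER *l, void (*end)(MWLISTENER *l))
    {
        int i, match;

        /* find the innermost open format with this end method */
        for (match = formatTop - 1; match >= 0; match--)
            if (formatStack[match].end == end)
                break;
        if (match < 0)
            return;                    /* no such format is open */

        /* close the inner formats and the matching one, innermost first */
        for (i = formatTop - 1; i >= match; i--)
            formatStack[i].end(l);

        /* reopen the formats that were closed only to restore nesting */
        for (i = match; i < formatTop - 1; i++) {
            formatStack[i] = formatStack[i + 1];
            formatStack[i].begin(l);
        }
        formatTop--;
    }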
Block elements are already guaranteed by the lexer to be well nested,
so the context typically does not need to do anything special about
them. Only the wikitext list elements need to be resolved by the
context.
== The listener ==
The listening application needs to implement the MWLISTENER interface.
I haven't added support for all features yet, but at the moment, there
are 91 methods in this interface. They are trivial to implement,
though. The only thing to think about is that it is the listener's
responsibility to escape the contents of nowiki and special
characters, and also to filter the attribute lists.
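To give an idea of the shape of the interface, here is a sketch of a
small slice of it; the exact method names and struct layout are my
guesses based on the events mentioned in these posts, not the real
header:

    #include <stdio.h>

    /* Sketch only: the real MWLISTENER has ~91 methods. */
    typedef struct MWLISTENER MWLISTENER;
    struct MWLISTENER {
        void (*onWord)(MWLISTENER *l, const char *text);
        void (*beginHtmlStrong)(MWLISTENER *l, void *attr);
        void (*endHtmlStrong)(MWLISTENER *l);
        /* ... many more ... */
    };

    /* An HTML listener must escape special characters itself; the
     * parser passes text through unescaped. */
    static void
    htmlOnWord(MWLISTENER *l, const char *text)
    {
        (void)l;    /* listener state unused in this sketch */
        for (; *text != '\0'; text++) {
            switch (*text) {
            case '<': fputs("&lt;", stdout);  break;
            case '>': fputs("&gt;", stdout);  break;
            case '&': fputs("&amp;", stdout); break;
            default:  fputc(*text, stdout);   break;
            }
        }
    }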
/Andreas
I have just committed an initial version of a PHP wrapper library for
my parser.
http://svn.wikimedia.org/svnroot/mediawiki/trunk/parsers/libmwparser
An example of how it can be used:
include("mwp.php");
$istream = MWParserOpenString("input", "<strong id=hello>Hello
World!", MWPARSER_UTF8);
$parser = new_MWPARSER($istream);
$out = MWParseArticle($parser);
print implode($out). "\n";
MWParserCloseInputStream($istream);
$istream = MWParserOpenString("input", "{|\n|[[Hello|hello world!]]",
MWPARSER_UTF8);
MWParserReset($parser, $istream);
$out = MWParseArticle($parser);
print implode($out). "\n";
which gives the following output:
<p><strong id="hello">Hello World!</strong></p>
<table><tbody><tr><td><!-- BEGIN INTERNAL LINK [Hello] -->hello
world!<!-- END INTERNAL LINK --></td></tr></tbody></table>
As you can see, I haven't sorted out the internal link resolution yet.
But there is an efficient solution to this: make the database lookup
after the lexer has run, before the parser runs. This is possible
as all internal links are already known at that stage, and it would
enable the parser to generate the links directly without any
postprocessing.
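A sketch of that idea, with invented token and helper names: one pass
over the lexed tokens gathers every link target, and a single batch
lookup (one database round trip) then marks each token, so the parser
can emit finished links directly.

    /* Hypothetical token record; the real lexer's tokens differ. */
    typedef struct {
        int   type;            /* e.g. INTERNAL_LINK */
        char *target;          /* link target, if a link token */
        int   targetExists;    /* filled in by the batch lookup */
    } TOKEN;

    #define INTERNAL_LINK 1

    static void
    resolveLinks(TOKEN *tokens, int n,
                 void (*batchLookup)(TOKEN **links, int nlinks))
    {
        TOKEN *links[1024];
        int nlinks = 0, i;

        for (i = 0; i < n && nlinks < 1024; i++) {
            if (tokens[i].type == INTERNAL_LINK)
                links[nlinks++] = &tokens[i];
        }
        batchLookup(links, nlinks);   /* e.g. one SQL IN (...) query */
    }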
Since it doesn't completely replace the current parser, it will take a
bit of surgery to insert it into an instance of MediaWiki. I haven't
tried this yet.
There is a lot of tedious work left before everything is complete.
For instance, a large part of Sanitizer.php must be ported to C in
order to validate the HTML attributes.
Best regards,
/Andreas
Hello,
In commit 72458 I've added the InlineEditor extension. [1] This extension is a working implementation of the prototype(s) posted earlier on this list. It's not meant for use on live wikis, but rather as a proof of concept and a framework to experiment with. I will explain the extension in detail for those of you who might be interested.
== Design overview ==
The extension consists of several parts, structured in sub-directories like the UsabilityInitiative extension. The InlineEditor extension itself provides a framework for different edit modes to build on. It displays the edit modes, provides an interface to mark editable pieces of wikitext, provides a client-side inline editor which the edit modes *may* use, is configurable with several fallback options to the full/traditional editor, and handles previewing, publishing, undo, and redo.
Each of the other extensions provides an edit mode for the InlineEditor extension. They hook into InlineEditorMark and InlineEditorDefineEditors. The first is called whenever wikitext is passed through the extension, so that all edit modes can mark their editable pieces. Once this is done, a few algorithms combine these markings with information about previously edited pieces and generate both wikitext to run through the parser and JSON that is passed to the client, mapping the editable pieces to the original wikitext. The other hook includes CSS, JS, and messages in the page.
== Limitations ==
There are many things which are sub-optimal right now:
* The editor is slow. Whenever a small element is changed and previewed, the entire page is reparsed. This will be fixed by parsing only the element where possible (it is not always possible: references, for example, have side effects at the bottom of the page).
* For now it's only possible to use the editor as the primary editor, with a link to the full/traditional editor. There will be a configuration option for whether to do this, or to display a message at the top of the traditional edit page offering to switch to this editor.
* I've not tested things in older browsers (or IE at all, for that matter). I only know it runs fine in Firefox and Chrome, but it may have bugs in other browsers.
* The edit modes are really, really basic right now. They may or may not screw things up. Most of them have just one or a few regular expressions which do well in general, but may fail on many edge cases.
* The editor may not handle all the messages and edge cases of the traditional editor.
* The extension is written for MediaWiki 1.16 and may or may not work with other versions.
Also, I'm not at all sure whether the current set of edit modes is the way to go. Currently they are mutually exclusive, meaning that text marked by one edit mode is never included in text marked by another. Maybe it's better not to have edit modes like this, but rather different granularities of editing, e.g. sentence => paragraph => block. That way the user would become familiar with more wikitext instead of always seeing small portions. The framework currently doesn't allow for overlap in markings, but I will work to make this possible.
== Goals ==
The goal of this extension is to provide a framework to easily experiment with different modes of in-line editing. Feel free to write extensions that use this framework, or to help with the framework itself. Any usability or technical suggestions are also welcome!
I hope to get some documentation up on mediawiki.org soon, but note that the code is heavily documented inline. Feel free to ask any questions: I'm probably forgetting to mention some things that may not be clear to everyone. Also, there is no public wiki to test this extension on at the moment; I will work on that, but if someone else can enable it on a test wiki, that would be great too!
To install the extension(s), check the instructions in /trunk/extensions/InlineEditor/InlineEditor.php. Thanks for taking the time to read this!
Regards,
Jan Paul
[1] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/72458
I would like to use LIME -- http://sourceforge.net/projects/lime-php/
-- instead of a series of regular expression replacement statements to
convert GIFT -- http://microformats.org/wiki/gift -- to the Quiz
Extension format, for both maintainability and readability. However,
I am concerned about the code review situation and am not sure it is
reasonable to depend on what would be a much more difficult code
review.
On the other hand, LIME has been stable for years, has two good
reviews and the author seems reasonable:
http://c2.com/cgi/wiki?IanKjos
The included calculator example is easily accessible, all my
experiments with it so far have gone well, and I love that it
includes an option for native-code compilation of inner-loop code
(lemon.c), but I am interested in using it to populate larger data
structures and in how it behaves in production PHP. Does anyone know
anyone else who has used it?
My understanding is that some subsets of the wikitext parser could
easily be converted to a more formal grammar while others need to
remain in PHP (e.g., transclusion), and I am familiar with many of
wikitext's parsing ambiguity conflicts. I am not an expert in how to
resolve such conflicts in LALR(1) grammars -- although I can squeak
through the trial-and-error process. However, I am absolutely certain
that moving wikitext parsing to a formal grammar would provide
serious opportunities for engineering improvements, not least in
maintainability and readability.
Therefore, I am considering submitting LIME for code review, but I
want to try something different. I would like to ask for community
volunteers to review it first, with comments, before I submit it to be
reviewed by the official development team.
Are there any volunteers willing to provide a preliminary code review for LIME?
Best regards,
James Salsman
I have previously written about speculative execution in the lexer.
To exactly reproduce the behavior of image links, not one but two
speculations would be necessary. However, this is very complex, and
the use case is undocumented, so I would like to simplify it.
The original behavior is as follows: the option list is split on the
'|' character; the caption is the _last_ non-option in the list, if
any.
So, to reproduce this, a separate speculation has to be initiated for
the caption. If another caption (non-option) is seen in the list, the
speculation will fail.
Furthermore, media links may nest one level. If a MEDIA_LINK or
INTERNAL_LINK appears in the caption at the second level, the
production will fail completely.
I think that the following is a reasonable simplification: image links
may not nest (although internal links and external links may appear in
the caption of a media link). The _first_ non-option in the list is
the caption, and no options may appear after the caption. In this way,
only one speculation is required for media links, and the lexer can
handle the option list. This behavior seems consistent with the
documentation at http://www.mediawiki.org/wiki/Help:Images.
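A sketch of the simplified rule, with invented names and a toy option
set (the real lexer knows many more options and would not use strtok):

    #include <string.h>

    static int
    isImageOption(const char *s)
    {
        /* a few examples; the real option set is much larger */
        return strcmp(s, "thumb") == 0 || strcmp(s, "frame") == 0
            || strcmp(s, "left") == 0  || strcmp(s, "right") == 0
            || strncmp(s, "alt=", 4) == 0;
    }

    /* Returns 1 and sets *caption on success, 0 if the option list
     * violates the simplified rule.  Modifies optlist in place. */
    static int
    parseImageOptions(char *optlist, char **caption)
    {
        char *part;

        *caption = NULL;
        for (part = strtok(optlist, "|"); part != NULL;
             part = strtok(NULL, "|")) {
            if (*caption != NULL)
                return 0;          /* nothing may follow the caption */
            if (!isImageOption(part))
                *caption = part;   /* first non-option is the caption */
        }
        return 1;
    }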
Is there any known use for putting an image inside an image caption,
or is the restriction I propose here sufficient?
Best regards,
Andreas Jonsson