Good evening,
this week I looked at different ways of cajoling overlapping, improperly
nested or otherwise horrible but real-life wiki content into the WikiDom
structure for consumption by the visual editor currently in development.
So far, MediaWiki delegates the sanitization of those horrors to HTML
Tidy, which employs (mostly) sound heuristics to make sense of its input.
The [HTML5] spec finally standardized parsing and error recovery for
HTML, which seems to overlap widely with what we need for the new parser
(how far?). Open-source reference implementations of the parser spec are
available: a Java version [VNU] that is also translated to C++ and,
through GWT, to JavaScript (http://livedom.validator.nu/), plus PHP and
Python ports at [HLib]. Modern browsers have similar implementations
built in.
The reference parsers all use a relatively simple tokenizer in
combination with a mostly switch-based parser / tree builder that
constructs a cleaned-up DOM from the token stream. Tags are balanced and
matched using a random-access stack of open elements, with a separate
list of active formatting elements (very similar to the annotations in
WikiDom). For
each parsing context and token combination an error recovery strategy
can be directly specified in a switch case.
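To make the shape of this concrete, here is a heavily simplified sketch
of such a switch-based tree builder (all names hypothetical, nothing
like the full insertion-mode machinery of the spec). The interesting
part is the end-tag case: a stray close tag with no matching open tag
is silently ignored, which is one of the error recovery strategies the
real parsers encode per case:

```javascript
// Minimal sketch of an HTML5-style tree builder: dispatch on token
// type with a switch, keep a random-access stack of open elements
// and a list of active formatting elements.
function treeBuilder(tokens) {
  var root = { name: '#root', children: [] };
  var openElements = [root];   // random-access stack
  var activeFormatting = [];   // e.g. 'b', 'i' awaiting reopening

  function current() { return openElements[openElements.length - 1]; }

  tokens.forEach(function (tok) {
    switch (tok.type) {
      case 'startTag':
        var el = { name: tok.name, children: [] };
        current().children.push(el);
        openElements.push(el);
        if (tok.name === 'b' || tok.name === 'i') {
          activeFormatting.push(tok.name);
        }
        break;
      case 'endTag':
        // Error recovery: pop down to the matching open tag if one
        // exists; otherwise ignore the stray end tag entirely.
        for (var i = openElements.length - 1; i > 0; i--) {
          if (openElements[i].name === tok.name) {
            openElements.length = i;
            break;
          }
        }
        break;
      case 'text':
        current().children.push(tok.data);
        break;
    }
  });
  return root;
}
```

Fed a stream containing a stray `</div>`, this just drops it and keeps
building, which is exactly the kind of per-case recovery decision the
reference parsers spell out in their switch statements.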
The strength of this strategy is clearly the ease of implementing error
recovery. The big disadvantage is the absence of a nicely declarative
grammar, except perhaps a shallow one for the tokenizer. (Is there
actually an example of a parser with serious HTML-like error recovery
and an elegant grammar?)
In our specific visual editor application, performing a full error
recovery / clean-up while constructing the WikiDom is at odds with the
desire to round-trip wiki source. Performing full sanitation only in the
HTML serializer while doing none in the Wikitext serializer seems to be
a better fit. The WikiDom design with its support for overlapping
annotations allows the omission of most early sanitation for inline
elements. Block-level constructs however still need to be fully parsed
so that implicit scopes of inline elements can be determined (e.g.,
limiting the range of annotations to table cells) and a DOM tree can be
built. This tree then allows the visual editor to present some sensible,
editable outline of the document.
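As a rough illustration of what "limiting the range of annotations to
table cells" could mean (the data shapes below are my assumptions, not
the actual WikiDom model): annotations are offset ranges over the text,
and block parsing gives us the cell boundaries to clamp them against.

```javascript
// Hypothetical sketch: clamp annotation ranges so they never cross
// the boundary of the block (table cell) they start in. Shapes of
// 'annotations' and 'cellRanges' are assumptions for illustration.
function clampAnnotations(annotations, cellRanges) {
  return annotations.map(function (ann) {
    // find the cell the annotation starts in
    var cell = cellRanges.filter(function (c) {
      return ann.start >= c.start && ann.start < c.end;
    })[0];
    if (!cell) { return ann; }
    return {
      type: ann.type,
      start: ann.start,
      end: Math.min(ann.end, cell.end)  // cut off at the cell boundary
    };
  });
}
```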
A possible implementation could use a simplified version of the current
PEG parser mostly as a combined wiki and HTML tokenizer that feeds a
token stream to a parser / tree builder modeled on the HTML5 parsers.
Separating the sanitation of inline and block-level elements to minimize
early sanitation seems to be quite doable.
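The tokenizer half of that split might look vaguely like this (a toy
regex-based sketch for illustration only; the real thing would be the
PEG parser, and the token shapes here are invented): it turns mixed
wikitext and HTML into a flat stream the tree builder above could
consume.

```javascript
// Toy combined wiki/HTML tokenizer: emits start/end tag tokens for
// HTML, 'quote' tokens for wikitext '' / ''' (italic/bold), and text
// tokens for everything else. Token shapes are assumptions.
function tokenize(src) {
  var tokens = [];
  var re = /<(\/?)([a-z]+)>|('{2,3})|([^<']+)/g;
  var m;
  while ((m = re.exec(src)) !== null) {
    if (m[2]) {
      tokens.push({ type: m[1] ? 'endTag' : 'startTag', name: m[2] });
    } else if (m[3]) {
      // '' toggles italics, ''' toggles bold in wikitext
      tokens.push({ type: 'quote', length: m[3].length });
    } else {
      tokens.push({ type: 'text', data: m[4] });
    }
  }
  return tokens;
}
```

The point of the exercise: inline constructs like quotes can pass
through as mere tokens (annotations later), while block-level tags
still reach the tree builder for full parsing.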
What do you think about this general direction of building on HTML
parsers? Where should a wiki parser differ in its error recovery
strategy? How important is having a full grammar?
Gabriel
[HTML5] Parsing spec: http://dev.w3.org/html5/spec/Overview.html#parsing
[VNU] Ref impl. (Java, C++, JS): http://about.validator.nu/htmlparser/
Live JS parser demo: http://livedom.validator.nu/
[HLib] PHP and Python parsers: http://code.google.com/p/html5lib/
https://bugzilla.wikimedia.org/show_bug.cgi?id=6569
Gabriel mentioned that he'd like the list's input on this patch,
regarding how we treat nested lists like
; bla : blub
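For readers less steeped in wikitext: `;` opens a definition term and
the first `:` on the line opens the definition. A one-line sketch of
the expected mapping (deliberately ignoring nesting, which is exactly
what the patch is about):

```javascript
// Simplified mapping of a single-line definition list to HTML;
// nested lists (the subject of the patch) are not handled here.
function definitionListLine(line) {
  var m = /^;\s*([^:]*?)\s*:\s*(.*)$/.exec(line);
  if (!m) { return null; }
  return '<dl><dt>' + m[1] + '</dt><dd>' + m[2] + '</dd></dl>';
}
```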
--
Sumana Harihareswara
Volunteer Development Coordinator
Wikimedia Foundation
Paul Graham is an investor in Stypi, another collaborative editor
similar to Etherpad.
At his behest they added a feature where you can not only replay the
edits but also see, highlighted in yellow, which edits are ultimately
deleted.
http://www.stypi.com/hacks/13sentences
Just an interesting idea, I thought.
--
Neil Kandalgaonkar ( ) <neilk(a)wikimedia.org>
Hi,
today I started to look into generating something closer to WikiDom from the
parser in the ParserPlayground extension. For further testing and parser
development, changes to the structure will need to be mirrored in the
current serializers and renderers, which likely won't be used very long once
the editor integration gets underway.
The serializers developed in wikidom/lib/es seem to be just what would be
needed, so I am wondering if it would make sense to put some effort into
plugging those into the parser at this early stage while converting the
parser output to WikiDom. The existing round-trip and parser testing
infrastructure can then already be run against these serializers.
The split codebases make this a bit harder than necessary, so maybe this
would also be a good time to draw up a rough plan for what the
integration should look like. Swapping the serializers will soon break
the existing ParserPlayground extension, so a move to another extension
or to the wikidom repository might make sense.
Looking forward to your thoughts,
Gabriel
Oliver Keyes asked whether the visual editor will include a citation
generator, and Trevor's response is useful enough I figured it should go
on this mailing list for reference.
-Sumana
-------- Original Message --------
Subject: Re: Question for the devs...
Date: Wed, 2 Nov 2011 09:46:37 -0700
From: Trevor Parscal <tparscal(a)wikimedia.org>
To: Oliver Keyes <okeyes(a)wikimedia.org>
CC: Sumana Harihareswara <sumanah(a)wikimedia.org>
Our plan for this kind of thing is pretty simple initially, we are just
going to have a way to create and edit <ref> tags. Because citations are
important, you might expect we would want to integrate a solution to
this problem into the editor. However, citation templates are templates,
which means
they exist on-wiki, not in the core software. Because of this, it's
important to make sure the software that helps users make use of these
templates and the templates themselves live in the same places and can be
changed by the same people.
This is our plan to support the kind of work you are talking about:
- Templates will be editable as forms, automatically generated by
inspecting the transclusion code
- Not all possible parameters will be shown
- The order of named parameters may vary from one transclusion to
another
- The labels will be crudely converted to title case (zip_code
becomes Zip Code)
- No localization is possible for labels
- All fields are treated as text boxes
- Templates will be able to be supplemented with additional information
to influence the way they are edited
- All parameters can be shown, categorized, ordered, etc.
- Labels can be defined independently of parameter names
- Labels can be localized
   - Fields can have basic types (such as text, distance, coordinates,
     etc.) so the form can provide appropriate input types and helpers
     (input boxes, conversion calculators, maps, etc.)
- Gadgets will be able to hook into the editor to provide additional
functionality
- Toolbars and dialogs can be customized
- New template editing input-types and helpers can be added
- Many other things will be able to be customized using an extensive
API
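The "crudely converted to title case" fallback mentioned above could
look roughly like this (a guess at the described behavior, not the
actual editor code):

```javascript
// Crude title-casing of a template parameter name, as described in
// the plan: underscores become spaces, each word is capitalized.
// A guess at the behavior, not the actual implementation.
function labelFromParameterName(name) {
  return name
    .split('_')
    .map(function (word) {
      return word.charAt(0).toUpperCase() + word.slice(1);
    })
    .join(' ');
}
```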
So, essentially, we are looking to keep the software support for on-wiki
things (like templates) also on-wiki. We also expect that over time, all
major templates will have some meta information, improving the overall
experience for users of the wiki, so not all templates will require a
complex gadget to make use of them.
I hope this answers your questions. If not, please feel free to continue
poking.
- Trevor
On Wed, Nov 2, 2011 at 8:22 AM, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
> Thanks! A citation generator - I assume they're referring to something
> like http://en.wikipedia.org/wiki/User:Apoc2400/refToolbar, which adds an
> extra button to allow for easy reference formatting. It gives you all the
> fields ("Author name" "book title" "ISBN" whatever), you fill them in, hit
> enter and it churns out a properly formatted reference in wikimarkup.
>
> --
> Oliver Keyes
> Community Liaison, Product Development
> Wikimedia Foundation
>
>
Hello,
I'm having problems with parser functions. I wrote the following code on
a wiki page:
{{{#if: 0 | yes | not }}}
{{{#if: | yes | not }}}
{{#if: 0 | yes | not }}
And I get
<pre>si
</pre>
<pre>si
</pre>
<p>{{#if: 0 | si | no }}
</p>
But I expected to get "yes" in the 3rd example, and I thought parser
functions were enclosed in only two braces.
I have tested it on Wikipedia and on another wiki that I installed
recently (using MediaWiki 1.16.5). Why is the function not computed? Do
I need to configure some parameter in the wiki?
Thanks,
Daniel