
3 Feb 2012


We, the Visual Editor team, have decided to move away from the custom
WikiDom format in favor of plain HTML DOM, which is already used
internally in the parser. The mapping of WikiText to the DOM has been very
pragmatic so far, but now needs to be cleaned up before it can serve as an
external interface. Here are a few ideas for this.

Wikitext can be divided into shorthand notation for HTML elements and
higher-level features like templates, media display or categories.

The shorthand portion of wikitext maps quite directly to an HTML DOM.
Details like the handling of unbalanced tags while building the DOM
tree, remembering extra whitespace or wiki vs. HTML syntax for
round-tripping need to be considered, but appear to be quite manageable.
This should be especially true if some normalization in edge cases can
be tolerated. We plan to localize normalization (and thus mostly avoid
dirty diffs) by serializing only modified DOM sections while using the
original source for unmodified DOM parts. Attributes are used to track
the original source offsets of DOM elements.

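As a rough illustration of the selective serialization idea, here is a
minimal Python sketch. It is not the parser's actual code: the Node class,
its fields and the serialize function are hypothetical names, and it
assumes non-overlapping nodes whose source offsets were recorded while
parsing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """A DOM node annotated with its original source range (hypothetical)."""
    start: int                          # offset of the first source character
    end: int                            # offset one past the last source character
    new_wikitext: Optional[str] = None  # set only if the node was edited


def serialize(source: str, nodes: List[Node]) -> str:
    """Rebuild the wikitext, re-serializing only the edited nodes."""
    out = []
    pos = 0
    for node in sorted(nodes, key=lambda n: n.start):
        out.append(source[pos:node.start])           # text between nodes, untouched
        if node.new_wikitext is None:
            out.append(source[node.start:node.end])  # unmodified: copy original source
        else:
            out.append(node.new_wikitext)            # modified: use fresh serialization
        pos = node.end
    out.append(source[pos:])                         # trailing text after the last node
    return "".join(out)


source = "'''Bold'''  text\n\n<b>html bold</b>\n"
html_bold = "<b>html bold</b>"
b_start = source.index(html_bold)

nodes = [
    Node(0, len("'''Bold'''")),                                  # left untouched
    Node(b_start, b_start + len(html_bold), "'''wiki bold'''"),  # edited
]
result = serialize(source, nodes)
print(result)
```

Only the edited node changes. The extra whitespace and the untouched first
node come straight from the original source, so the diff stays local.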
Higher-level features can be represented in the HTML DOM using different
extension mechanisms:

* Introduce custom elements with specific attributes:
  <template href="Template:Bla" args="..."/>
  For display or WYSIWYG editing these elements then need to be
  expanded with the template contents, thumbnail HTML and so on.
  Unbalanced templates (table start/row/end) are very difficult
  to expand.
* Expand higher-level features to their presentational DOM, but
  identify and annotate the result using custom attributes. This is the
  approach we have taken so far in the JS parser [1]. Template
  arguments and similar information are stored as JSON in data
  attributes, which made their conversion to the JSON-based WikiDom
  format quite easy.

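The second approach can be sketched roughly as follows. This is only an
illustration in Python, not the JS parser's code: the attribute name
data-template, the JSON layout and the helper names annotate and
DataAttrReader are all assumptions made for the example.

```python
import json
from html import escape
from html.parser import HTMLParser


def annotate(tag, inner_html, template, args):
    """Wrap expanded template output, storing the call as JSON in a data attribute."""
    payload = json.dumps({"template": template, "args": args})
    # escape() makes the JSON safe inside a double-quoted attribute value
    return '<{0} data-template="{1}">{2}</{0}>'.format(
        tag, escape(payload, quote=True), inner_html)


class DataAttrReader(HTMLParser):
    """Recover the JSON payloads from data-template attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-template":
                # HTMLParser has already unescaped the attribute value
                self.found.append(json.loads(value))


html_out = annotate("div", "<p>The rendered name</p>",
                    "Template:Sometemplate", {"name": "The wikitext name"})
reader = DataAttrReader()
reader.feed(html_out)
print(reader.found)
```

Because the payload is already JSON, turning it into another JSON-based
format is mostly a matter of restructuring dictionaries.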
Both are custom solutions for internal use. For an external interface, a
standardized solution would be preferable. HTML5 microdata [2] seems to
fit our needs quite well.

Assuming a template that expands to a div and some content, this would
be represented like this:

<div itemscope
    itemtype='http://en.wikipedia.org/wiki/Template:Sometemplate' >
    <h2>A static header from the template</h2>
    <!-- The template argument 'name', expanded in the template -->
    <p itemprop='name' content='The wikitext name'>The rendered name</p>
</div>

In this case, an expanded template argument within (for example) an
infobox is identified inside the template-provided HTML structure, which
could enable in-place editing.

Unused arguments (which are not found in the template expansion) or
unexpanded templates can be represented using non-displaying meta elements:

<div itemscope
    itemtype='http://en.wikipedia.org/wiki/Template:Sometemplate'
    id='uid-1' >
    <h2>A static header from the template</h2>
    <!-- The template argument 'name', expanded in the template -->
    <p itemprop='name' content='The wikitext name'>The rendered name</p>
    <meta itemprop='firstname' content='The wikitext firstname'>
</div>
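
To show how a consumer such as the editor could read these annotations
back, here is a minimal sketch using Python's standard html.parser. The
ItemCollector class is a hypothetical helper. It handles only non-nested
items, ignores itemref, and is not a full microdata implementation.

```python
from html.parser import HTMLParser

# Elements without an end tag; they must not change the nesting depth.
VOID = {"meta", "br", "img", "hr", "input", "link"}


class ItemCollector(HTMLParser):
    """Collect itemprop/content pairs for each top-level itemscope element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self.depth = 0  # 0 means we are outside any item

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if self.depth == 0:
            if "itemscope" in a:
                self.items.append(
                    {"type": a.get("itemtype"), "id": a.get("id"), "props": {}})
                self.depth = 1
            return
        if "itemprop" in a:
            # 'content' carries the original wikitext of the argument
            self.items[-1]["props"][a["itemprop"]] = a.get("content")
        if tag not in VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID:
            self.depth -= 1


example = """
<div itemscope
    itemtype='http://en.wikipedia.org/wiki/Template:Sometemplate'
    id='uid-1' >
    <h2>A static header from the template</h2>
    <p itemprop='name' content='The wikitext name'>The rendered name</p>
    <meta itemprop='firstname' content='The wikitext firstname'>
</div>
"""
collector = ItemCollector()
collector.feed(example)
print(collector.items)
```

Both the displayed argument (name) and the non-displaying one (firstname)
end up in the same props dictionary, keyed by argument name.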

The itemref mechanism can be used to tie together template data from a
single template that does not expand to a single subtree:

<div itemscope itemref='uid-1'>
  <!-- Some more template output from expansion of
http://en.wikipedia.org/wiki/Template:Sometemplate -->
</div>

The itemtype attributes in these examples all point to the template
location, which normally contains plain-text documentation of the
template parameters and their semantics. The most common application of
microdata, however, references standardized schemas, often from
http://schema.org as those are understood by Google [3], Microsoft, and
Yahoo!. A mapping of semi-structured template arguments to a standard
schema is possible, as demonstrated by http://dbpedia.org/. It appears to
be feasible to provide a similar mapping directly as microdata within
the template documentation, which could then potentially be used to add
standard schema information to regular HTML output when rendering a page.

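A very small sketch of what such a mapping could look like, in Python.
The table ARG_TO_SCHEMA and the function to_schema are made up for this
example. The choice of schema.org's Person type with its name and
givenName properties is likewise just an assumption about what
Sometemplate describes.

```python
# Hypothetical mapping for Template:Sometemplate, as it might be derived
# from microdata in the template documentation.
SCHEMA_TYPE = "http://schema.org/Person"
ARG_TO_SCHEMA = {
    "name": "name",
    "firstname": "givenName",
}


def to_schema(template_args):
    """Translate template arguments to schema properties; unmapped ones are dropped."""
    props = {
        ARG_TO_SCHEMA[arg]: value
        for arg, value in template_args.items()
        if arg in ARG_TO_SCHEMA
    }
    return {"itemtype": SCHEMA_TYPE, "props": props}


mapped = to_schema({
    "name": "The wikitext name",
    "firstname": "The wikitext firstname",
    "layout": "wide",  # presentation-only argument without a schema equivalent
})
print(mapped)
```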
The visual editor could also use schema information to customize the
editing experience for templates or images. Inline editing of fields in
infoboxes with schema-based help is one possibility, but in other cases
a popup widget might be more appropriate. Additional microdata in
template documentation sections could provide layout or other UI
information for these widgets.

There are still quite a few loose ends, but I think the general
direction of reusing standards as far as possible and hooking into the
thriving HTML5 ecosystem has many advantages. It allows us to reuse
quite a few libraries and infrastructure, and makes our own developments
(and data of course) more useful to others.

So, I hope you made it here without falling asleep!
What do you think about these ideas?

Gabriel

References:
[1]: http://www.mediawiki.org/wiki/Future/Parser_development
[2]: http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html
[3]: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=99170&am...

This text is on the wiki at
http://www.mediawiki.org/wiki/Future/HTML5_DOM_with_microdata

Mapping WikiText to HTML5 DOM with Microdata