Inspired by Brion's slides (couldn't make it to Haifa myself), some random questions and musings:
- Is there a definition / "complete" example of the JSON output of the new parser somewhere? I didn't see it on the parser pages...
- Will there be multiple "resolutions" of parsing? One would be the template name and key-value-pair parameters, another the template replaced with the corresponding wikitext, and another the template replaced with that wikitext parsed into JSON. Would these come as one all-in-one large JSON object, or "on demand"? The same applies to extension tag/attributes/contents, rendered extension output, WikiSource transclusions etc.
- One of the functions I have issues with in WYSIFTW is copy&paste. Besides making it work in the new editor, would it be worth adding special behaviour for (cut|copy)/paste between articles? Like, automatically adding the source article link to the edit description, so the source of the text can be traced, even if it's just manually?
- Toolserver access to full wikitext is a pain. Once the new parser is live (even if it's "only" running in parallel with the old one), could we get fast access to both raw wikitext and parser JSON output on the Toolserver? I mean in addition to API parser output, which I take as a given here :-)
- Will there be a JSON-in-XML dump besides the current wikitext-in-XML one?
- Will there be an interface to the parser for JavaScript tools /outside/ edit mode? I'm thinking "Add a reference", "insert image" etc. Just getting a char-based WikiText position from a mouse click would be very helpful indeed, so the user can click where he wants the reference in the rendered HTML, and JS can insert it at the corresponding WikiText position.
- A point discussed endlessly before: As a "side effect" of the new parser, will we store page-template-passed_value triplets in the database? Think {{Information}} on Commons.
- Will there be an import page or JS function for parser JSON objects? Think Word/OpenOffice export, or "paste HTML" (with JS HTML-to-JSON converter).
That should keep us busy for a while... ;-)
Magnus
Many of Magnus's questions fall right into my area of work, so maybe I can answer a few.
- Is there a definition / "complete" example of the JSON output of the
new parser somewhere? I didn't see it on the parser pages...
A working model of this is here: http://svn.wikimedia.org/viewvc/mediawiki/trunk/parsers/wikidom/
We have an example document which includes various kinds of content. There's also a bunch of unit tests against the HTML and Wikitext serializers. Finally (and most importantly) there's a visual editor which can manipulate some of that DOM (soon all of it) with a graphical user interface. All of this code is what I'm working on. Inez from Wikia is also working with us on this 4 days a week.
- Will there be multiple "resolutions" of parsing? One would be
the template name and key-value-pair parameters, another the template replaced with the corresponding wikitext, and another the template replaced with that wikitext parsed into JSON. Would these come as one all-in-one large JSON object, or "on demand"? The same applies to extension tag/attributes/contents, rendered extension output, WikiSource transclusions etc.
I can answer part of your question by explaining our plan for how the WikiDom will look when there are templates. A template call in WikiDom is just a template name with some parameters. The parameters are like documents: each is a series of blocks, just as a document is. The server could (and ideally will) render the templates into HTML and pass that along with the parameter information. This will allow previews in the editor to be true to the final output, but also let the editor get at the parameters, change them, send them to the server for re-rendering, and then update the HTML representation. In this way, there can be different resolutions within a WikiDom structure.
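To illustrate, a single template call inside a WikiDom document could look something like the sketch below. The field names and structure are just placeholders for the sake of the example, not the final WikiDom format:

    // Hypothetical shape of a WikiDom template call; names and structure
    // are illustrative only, not the actual WikiDom specification.
    var templateCall = {
        type: 'template',
        name: 'Information',
        params: {
            description: {
                // each parameter is itself a series of blocks, like a document
                blocks: [
                    { type: 'paragraph', content: { text: 'A view of Haifa' } }
                ]
            }
        },
        // server-rendered HTML shipped alongside, so the editor can show a
        // faithful preview and request a re-render when parameters change
        html: '<table class="information">...</table>'
    };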
- One of the functions I have issues with in WYSIFTW is copy&paste.
Besides making it work in the new editor, would it be worth adding special behaviour for (cut|copy)/paste between articles? Like, automatically adding the source article link to the edit description, so the source of the text can be traced, even if it's just manually?
So far copy-paste is working well, thanks to our approach to handling input. Underneath the EditSurface is a text input, which is technically visible but visually obscured. The text input is focused when the mouse interacts with the surface. When you type, we read the text from the input and insert it into the surface. When you select text, we fill the input with the plain-text version of what you selected and set the input's selection to all. When you copy, well, nothing special needs to happen at all. When you paste, we treat it like typing.

This works for plain text, but copy-pasting rich text will involve an extra couple of steps. When you copy, we will remember what the copied plain text looked like and keep a formatted version around in memory. When you paste, if the pasted plain text is identical to the copied plain text, we can just use the in-memory formatted version. With some trickery, we may even be able to support this between tabs/windows. The neat thing about handling it this way is that we have full control over what the plain-text version of the text is, which resolves lots of issues with browser and operating-system copy/paste inconsistencies.
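In code, the approach looks roughly like this. The element and helper names below ("surface", getPlainText(), insertIntoSurface()) are made up for the sketch; this is not the actual EditSurface implementation:

    // Sketch of the obscured-input approach described above.
    var input = document.createElement( 'textarea' );
    input.className = 'es-hiddenInput'; // positioned/obscured via CSS
    document.body.appendChild( input );

    // Keep keyboard focus on the proxy input whenever the surface is used,
    // so typing, copy and paste events all land there.
    surface.addEventListener( 'mousedown', function () {
        setTimeout( function () {
            input.focus();
        }, 0 );
    } );

    function onSelectionChange( selection ) {
        // Mirror the selection into the proxy as plain text and select all
        // of it, so a native "copy" grabs exactly that text.
        input.value = getPlainText( selection );
        input.select();
    }

    input.addEventListener( 'paste', function () {
        // Let the browser finish pasting into the proxy, then treat the
        // pasted plain text like typing.
        setTimeout( function () {
            insertIntoSurface( input.value );
            input.value = '';
        }, 0 );
    } );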
- Will there be an interface to the parser for JavaScript tools
/outside/ edit mode? I'm thinking "Add a reference", "insert image" etc. Just getting a char-based WikiText position from a mouse click would be very helpful indeed, so the user can click where he wants the reference in the rendered HTML, and JS can insert it at the corresponding WikiText position.
Once we have a fully-featured WikiDom representation that we can safely round-trip Wikitext through, the sky is the limit to what kinds of APIs could be wrapped around it.
Thanks for the questions! I have been working really hard on the visual editor code, and probably need to spend a bit more time talking to people about it and documenting my work. If anyone wants to get involved, please let me know - we mostly need more JavaScript experts.
- Trevor
I can answer part of your question by explaining our plan for how the WikiDom will look when there are templates. A template call in WikiDom is just a template name with some parameters. The parameters are like documents: each is a series of blocks, just as a document is. The server could (and ideally will) render the templates into HTML and pass that along with the parameter information. This will allow previews in the editor to be true to the final output, but also let the editor get at the parameters, change them, send them to the server for re-rendering, and then update the HTML representation.
Though, as we learned the hard way, one should not assume that rendering a template separately yields the same result as substituting its source text into the caller and then rendering the whole thing at once. For example, we wanted to allow templates on support.mozilla.com to contain just single bullet points...
* Hello there
...for inclusion into lists in the caller. MW supports this. The naive approach of rendering the template on its own leaves you with an extra list (and possibly invalid markup—I forget). Our new parser (https://github.com/erikrose/mediawiki-parser) subs the parametrized template into the caller and then renders once.
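To make the difference concrete, here is a rough illustration; I'll call the template {{Hello}} for the sake of the example, and the markup is simplified from memory rather than our actual output:

    Template {{Hello}}:
    * Hello there

    Caller wikitext:
    * First step
    {{Hello}}
    * Last step

    Substituting the template source first and rendering once gives a single list:
    <ul><li>First step</li><li>Hello there</li><li>Last step</li></ul>

    Rendering {{Hello}} on its own and splicing its HTML into the caller's output
    gives three separate lists:
    <ul><li>First step</li></ul>
    <ul><li>Hello there</li></ul>
    <ul><li>Last step</li></ul>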
I'm looking forward to seeing what you do with the WYSIWYG editor; it's quite ambitious!
Cheers, Erik Rose
On 10.08.2011 01:55, Erik Rose wrote:
Though, as we learned the hard way, one should not assume that rendering a template separately yields the same result as substituting its source text into the caller and then rendering the whole thing at once. For example, we wanted to allow templates on support.mozilla.com to contain just single bullet points...
* Hello there
...for inclusion into lists in the caller. MW supports this. The naive approach of rendering the template on its own leaves you with an extra list (and possibly invalid markup—I forget). Our new parser (https://github.com/erikrose/mediawiki-parser) subs the parametrized template into the caller and then renders once.
This is one of the most important and hotly debated questions about the new parser design: should it allow syntactically incomplete templates, or not?
If we allow this, we are basically stuck with representing text using one specific syntax internally, and we cannot parse a page without resolving all templates. All we can do up front is a preprocessor pass (as we already do).
If we do not allow this, we can convert wikitext to WikiDom independently of the content of the database; that is, there is a stable mapping from syntax to DOM. This would allow us to use the DOM as the internal representation and let people use whatever syntax they want, as long as we have a clean round trip for that syntax. It would also make it very easy to write generators for various kinds of output, like TeX or PDF.
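To put that concretely, the property I mean is roughly the following invariant (the function names here are hypothetical, just to express the idea):

    // Hypothetical: with only syntactically complete templates, parsing
    // depends on the page text alone (no database lookups), and each
    // supported syntax round-trips cleanly through the DOM.
    var dom = parseWikitext( pageText );   // no template expansion required
    console.assert( serializeWikitext( dom ) === pageText );
    // ...and the same DOM can feed other output generators:
    var tex  = serializeTex( dom );
    var html = serializeHtml( dom );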
However, requiring syntactically complete templates would mean a pretty heavy transition for existing content. Is it worth it? I think in the long run, yes.
-- daniel
This is one of the most important and hotly debated questions about the new parser design: should it allow syntactically incomplete templates, or not?
Yes, it's got all sorts of wonderful technical advantages: a DB-less translation from markup to DOM is nothing to be sneered at. Though I can throw out this cautionary tale: our old MW parser implementation didn't allow these "partial" templates, and our writers have spent maybe a dozen hours fruitlessly trying various crazy ways to factor up collections of common bullet points for use in lists of instructions.
Cheers, Erik
On Fri, Aug 12, 2011 at 12:14 AM, Erik Rose <erik@mozilla.com> wrote:
This is one of the most important and hotly debated questions about the new parser design: should it allow syntactically incomplete templates, or not?
Yes, it's got all sorts of wonderful technical advantages: a DB-less translation from markup to DOM is nothing to be sneered at. Though I can throw out this cautionary tale: our old MW parser implementation didn't allow these "partial" templates, and our writers have spent maybe a dozen hours fruitlessly trying various crazy ways to factor up collections of common bullet points for use in lists of instructions.
There is the tempting but unclean (unclean!!) option of disallowing partial templates, then cleaning up certain constructs afterwards. A series of <ul></ul> nodes can be merged into one in DOM or text/regexp etc.
That would allow for mostly hack-less templates and clean parsing, at the price of bespoke markup magick fixes in the end. Just sayin'...
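For what it's worth, a rough sketch of what such a post-render clean-up pass could look like in the DOM (purely illustrative, not proposing actual parser code):

    // Collapse runs of adjacent <ul> siblings into a single list.
    function mergeAdjacentLists( root ) {
        var lists = root.querySelectorAll( 'ul' );
        for ( var i = 0; i < lists.length; i++ ) {
            var ul = lists[ i ];
            if ( !ul.parentNode ) {
                continue; // already merged into an earlier list and removed
            }
            var next = nextMeaningfulSibling( ul );
            while ( next && next.nodeName === 'UL' ) {
                // Move the following list's items into this one, then drop it.
                while ( next.firstChild ) {
                    ul.appendChild( next.firstChild );
                }
                var emptied = next;
                next = nextMeaningfulSibling( emptied );
                emptied.parentNode.removeChild( emptied );
            }
        }
    }

    function nextMeaningfulSibling( node ) {
        // Skip whitespace-only text nodes between the lists.
        var next = node.nextSibling;
        while ( next && next.nodeType === 3 && !next.nodeValue.trim() ) {
            next = next.nextSibling;
        }
        return next;
    }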
Magnus
On Thu, Aug 11, 2011 at 11:39 PM, Magnus Manske <magnusmanske@googlemail.com> wrote:
There is the tempting but unclean (unclean!!) option of disallowing partial templates, then cleaning up certain constructs afterwards. A series of <ul></ul> nodes can be merged into one in DOM or text/regexp etc.
See: http://www.mediawiki.org/wiki/Visual_editor/software_design#Constraints
Reforming wikitext is something we should be considering. We did it before and ran a batch conversion. Back then it was over a much smaller number of articles, but it has been done.
- Trevor