We, the Visual Editor team, have decided to move away from the custom WikiDom format in favor of plain HTML DOM, which is already used internally in the parser. The mapping of WikiText to the DOM has been very pragmatic so far, but now needs to be cleaned up before being used as an external interface. Here are a few ideas for this.
Wikitext can be divided into shorthand notation for HTML elements and higher-level features like templates, media display or categories.
The shorthand portion of wikitext maps quite directly to an HTML DOM. Details like the handling of unbalanced tags while building the DOM tree, remembering extra whitespace or wiki vs. html syntax for round-tripping need to be considered, but appear to be quite manageable. This should be especially true if some normalization in edge cases can be tolerated. We plan to localize normalization (and thus mostly avoid dirty diffs) by serializing only modified DOM sections while using the original source for unmodified DOM parts. Attributes are used to track the original source offsets of DOM elements.
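The selective serialization idea above could be sketched roughly as follows. This is only an illustration, not the parser's actual code: the `Node` class, its fields and the `serialize` function are hypothetical names, and the sketch assumes that each tracked DOM section carries the start and end offsets of the source it was parsed from.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """A DOM section plus the source range it was parsed from (hypothetical)."""
    start: int                          # offset of the first source character
    end: int                            # offset one past the last source character
    modified: bool = False              # set by the editor when the section changes
    new_wikitext: Optional[str] = None  # freshly serialized text, if modified


def serialize(source: str, nodes: List[Node]) -> str:
    """Reuse the original source for untouched nodes, new text for modified ones."""
    out = []
    pos = 0
    for node in sorted(nodes, key=lambda n: n.start):
        out.append(source[pos:node.start])  # source between tracked nodes
        if node.modified and node.new_wikitext is not None:
            out.append(node.new_wikitext)
        else:
            out.append(source[node.start:node.end])
        pos = node.end
    out.append(source[pos:])                # source after the last tracked node
    return "".join(out)


# The odd double space in the first section survives because it is never re-serialized.
source = "'''Bold'''  text\n\n* item  one\n"
nodes = [Node(0, 16), Node(18, 29, modified=True, new_wikitext="* item two")]
result = serialize(source, nodes)
```

Only the edited list item is normalized here; everything else is copied byte for byte, which is what keeps the diff clean.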
Higher-level features can be represented in the HTML DOM using different extension mechanisms:
* Introduce custom elements with specific attributes: <template href="Template:Bla" args="..."/> For display or WYSIWYG editing, these elements then need to be expanded with the template contents, thumbnail HTML and so on. Unbalanced templates (table start/row/end) are very difficult to expand.
* Expand higher-level features to their presentational DOM, but identify and annotate the result using custom attributes. This is the approach we have taken so far in the JS parser [1]. Template arguments and similar information are stored as JSON in data attributes, which made their conversion to the JSON-based WikiDom format quite easy.
Both are custom solutions for internal use. For an external interface, a standardized solution would be preferable. HTML5 microdata [2] seems to fit our needs quite well.
Assuming a template that expands to a div and some content, this would be represented like this:
<div itemscope
     itemtype='http://en.wikipedia.org/wiki/Template:Sometemplate' >
    <h2>A static header from the template</h2>
    <!-- The template argument 'name', expanded in the template -->
    <p itemprop='name' content='The wikitext name'>The rendered name</p>
</div>
In this case, an expanded template argument within (for example) an infobox is identified inside the template-provided HTML structure, which could enable in-place editing.
Unused arguments (which are not found in the template expansion) or unexpanded templates can be represented using non-displaying meta elements:
<div itemscope
     itemtype='http://en.wikipedia.org/wiki/Template:Sometemplate'
     id='uid-1' >
    <h2>A static header from the template</h2>
    <!-- The template argument 'name', expanded in the template -->
    <p itemprop='name' content='The wikitext name'>The rendered name</p>
    <meta itemprop='firstname' content='The wikitext firstname'>
</div>
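To illustrate how an editor or serializer might read the template arguments back out of such markup, here is a minimal sketch using Python's standard html.parser module. The class name `TemplateArgs` is made up for this example, and the sketch assumes the convention proposed above, where the original wikitext of each argument is stored in the content attribute next to itemprop.

```python
from html.parser import HTMLParser


class TemplateArgs(HTMLParser):
    """Collect itemprop -> content pairs plus the first itemtype seen."""

    def __init__(self):
        super().__init__()
        self.itemtype = None
        self.args = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # valueless attributes such as itemscope map to None
        if "itemscope" in attrs and self.itemtype is None:
            self.itemtype = attrs.get("itemtype")
        if "itemprop" in attrs and "content" in attrs:
            self.args[attrs["itemprop"]] = attrs["content"]


html = """
<div itemscope itemtype='http://en.wikipedia.org/wiki/Template:Sometemplate' id='uid-1'>
  <h2>A static header from the template</h2>
  <p itemprop='name' content='The wikitext name'>The rendered name</p>
  <meta itemprop='firstname' content='The wikitext firstname'>
</div>
"""

parser = TemplateArgs()
parser.feed(html)
itemtype = parser.itemtype
args = parser.args
```

Both the expanded argument (on the p element) and the unused one (on the meta element) end up in `args`, so the original template call could be rebuilt from the DOM alone. Nested items and itemref are deliberately ignored in this sketch.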
The itemref mechanism can be used to tie together template data from a single template that does not expand to a single subtree:
<div itemscope itemref='uid-1'>
    <!-- Some more template output from expansion of
         http://en.wikipedia.org/wiki/Template:Sometemplate -->
</div>
The itemtype attributes in these examples all point to the template location, which normally contains plain-text documentation of the template parameters and their semantics. The most common application of microdata, however, references standardized schemas, often from http://schema.org, as those are understood by Google [3], Microsoft, and Yahoo!. A mapping of semi-structured template arguments to a standard schema is possible as demonstrated by http://dbpedia.org/. It appears to be feasible to provide a similar mapping directly as microdata within the template documentation, which could then potentially be used to add standard schema information to regular HTML output when rendering a page.
The visual editor could also use schema information to customize the editing experience for templates or images. Inline editing of fields in infoboxes with schema-based help is one possibility, but in other cases a popup widget might be more appropriate. Additional microdata in template documentation sections could provide layout or other UI information for these widgets.
There are still quite a few loose ends, but I think the general direction of reusing standards as far as possible and hooking into the thriving HTML5 ecosystem has many advantages. It allows us to reuse quite a few libraries and infrastructure, and makes our own developments (and data of course) more useful to others.
So, I hope you made it here without falling asleep!
What do you think about these ideas?
Gabriel
References: [1]: http://www.mediawiki.org/wiki/Future/Parser_development [2]: http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html [3]: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=99170&am...
This text is on the wiki at http://www.mediawiki.org/wiki/Future/HTML5_DOM_with_microdata
This looks great - thanks for putting together a run down of our new direction.
Any ideas about what happens if a parser hook, parser function or template resolves to just plain text, without any wrapping HTML? Where does the microdata get stored? If we wrap it, how do we decide what to wrap it in?
- Trevor
On Thu, Feb 2, 2012 at 9:38 AM, Gabriel Wicke wicke@wikidev.net wrote:
[...]
Wikitext-l mailing list Wikitext-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitext-l
On 02/02/2012 07:52 PM, Trevor Parscal wrote:
Any ideas about what happens if a parser hook, parser function or template resolves to just plain text, without any wrapping HTML? Where does the microdata get stored? If we wrap it, how do we decide what to wrap it in?
Normally we would wrap the text into a span, and mark this span as being a wrapper in an attribute. The span would then hold all metadata. The extra wrapper might break some CSS selectors unfortunately, but that is a problem with any wrapper and hard to avoid (without those handy range annotations..).
In the parser, wrapping into a span by inserting tokens should be safe in any case (ignoring the potential CSS issues), but there might be a case for propagating that information to a surrounding paragraph if the span is later wrapped into a paragraph at the DOM level.
Also, any trailing text tokens in a template expansion will need to be wrapped as well to mark the end of the expansion. At the token level, all non-text tokens from an expansion are marked as coming from a template, so that the full expansion can be identified in the resulting DOM. This catches any text in between, but not normally plain text before the first or after the last non-text token. Hence the need for wrapping this too.
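A rough sketch of this wrapping step, under stated assumptions: tokens are modeled as plain strings (text) or dicts (tags), and the attribute names data-wrapper and data-template are hypothetical placeholders, not the attributes the parser actually uses.

```python
from typing import List, Union

Token = Union[str, dict]  # str = text token, dict = tag token (hypothetical shape)


def wrapper(text: str, template: str) -> List[Token]:
    """Wrap plain text in a span that is flagged as a generated wrapper."""
    attrs = {"data-wrapper": "true", "data-template": template}
    return [{"tag": "span", "attrs": attrs}, text, {"tag": "/span", "attrs": {}}]


def mark_expansion(tokens: List[Token], template: str) -> List[Token]:
    """Mark tag tokens as template output and wrap leading/trailing text."""
    tag_positions = [i for i, t in enumerate(tokens) if isinstance(t, dict)]
    if not tag_positions:
        # Pure-text expansion: a single wrapper span carries all metadata.
        return wrapper("".join(tokens), template) if tokens else []
    first, last = tag_positions[0], tag_positions[-1]
    out: List[Token] = []
    if first > 0:
        out += wrapper("".join(tokens[:first]), template)
    for tok in tokens[first:last + 1]:
        if isinstance(tok, dict):
            tok = {**tok, "attrs": {**tok.get("attrs", {}), "data-template": template}}
        out.append(tok)  # text between marked tags needs no wrapper
    if last < len(tokens) - 1:
        out += wrapper("".join(tokens[last + 1:]), template)
    return out


plain = mark_expansion(["just text"], "Template:Foo")
mixed = mark_expansion(
    [{"tag": "b", "attrs": {}}, "bold", {"tag": "/b", "attrs": {}}, " tail"],
    "Template:Foo",
)
```

In `mixed`, the text between the marked b tags is left alone, while the trailing " tail" gets its own wrapper span so that the end of the expansion stays identifiable.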
Gabriel
On 12-02-02 9:38 AM, Gabriel Wicke wrote:
[...]
I'm tempted to say that rather than the template being the itemtype we should have a proper itemtype.
<div itemscope itemtype="http://www.mediawiki.org/microdata/wikitext/Transclusion">
    <meta itemprop="PageTitle" content="Template:Sometemplate">
    [...]
</div>
[...]
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
On 02/02/2012 08:36 PM, Daniel Friesen wrote:
I'm tempted to say that rather than the template being the itemtype we should have a proper itemtype.
<div itemscope itemtype="http://www.mediawiki.org/microdata/wikitext/Transclusion">
    <meta itemprop="PageTitle" content="Template:Sometemplate">
    [...]
</div>
Hmm, from the data modeling aspect, the template really determines the vocabulary used in the content. It would be hard to specify a common schema for http://www.mediawiki.org/microdata/wikitext/Transclusion, unless the vocabulary is reduced to a purely syntactical level.
Gabriel
On 12-02-02 12:16 PM, Gabriel Wicke wrote:
[...]
Do you have a more verbose example of what could go wrong, so I can follow up with an example of how it could be done? Or, more to the point: do you have something I could look over on how you intended to model this in a way that isn't based on syntax, so I can make a better example in context?
itemtype is really meant for a real type. I'd really hate to see Microdata abused to the point where we abuse itemtype as a reference URL and pretend that half of everything using itemtype comes from wiki syntax rather than a user entering microdata into WikiText.
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
Do you have a more verbose example of what could go wrong, so I can follow up with an example of how it could be done? Or, more to the point: do you have something I could look over on how you intended to model this in a way that isn't based on syntax, so I can make a better example in context?
Template arguments, and especially the named variant, tend to have some loosely defined semantics apart from their syntactic identifier. These semantics are often explained in the template's documentation section, and can be mapped to a more formal ontology as in dbpedia [1].
A general schema for all templates could only reference things like the argument position or the argument name in an abstract way, but could not generally provide semantics for them. Note that this is different for things like images, where we can either define our own global schema, or directly reuse one from schema.org (e.g. http://schema.org/MediaObject), possibly with an additional extension for disambiguation purposes (/MW) as described in http://schema.org/docs/extension.html.
itemtype is really meant for a real type. I'd really hate to see Microdata abused to the point where we abuse itemtype as a reference URL and pretend that half of everything using itemtype comes from wiki syntax rather than a user entering microdata into WikiText.
Microdata items can be nested, so I don't see a problem with users or templates providing a mapping to more specific schemas like those of schema.org. Clashes of user-provided itemtypes with those used for editing purposes need to be prevented in the parser, but that is doable. Consumers are free to ignore itemtypes they don't know about, which is what Google etc are doing afaik- and what also motivated them to set up schema.org in the first place.
There might be ways to use schema information from the template documentation to add an additional, more general itemtype to the rendered template, but that is both still under development at WhatWG and not our highest priority right now.
Gabriel
[1]: http://mappings.dbpedia.org/index.php/How_to_edit_the_DBpedia_Ontology
On Fri, 03 Feb 2012 00:20:14 -0800, Gabriel Wicke wicke@wikidev.net wrote:
[...]
Hmmm... wait, now I'm confused. Are we talking about a Microdata DOM output that the Parser generates from WikiText, or a completely tailored one where the template itself is authored in Microdata so that it can describe how a Visual Editor should edit it?
If you're talking about the latter, then I can almost understand itemtype being the template itself, and the transclusion describing the data according to that.
Though if you're talking about the former, and about replacing {{{name}}} in Template:Foo, transcluded by {{Foo|name=bar}}, with <span itemprop="name">bar</span>, then I'm saying that I don't like itemtype being abused as the template name and itemprop being abused as the template argument name. Instead of abusing the template name and parameter names as the schema of the template, I'd rather have a more verbose, proper set of Microdata to describe it:
# Template:Foo
{{{bar}}}
# Page content
{{Foo|bar=baz}}
# Result
<div itemscope itemtype="http://www.mediawiki.org/microdata/wikitext/Transclusion" id='uid-1'>
    <meta itemprop="PageTitle" content="Template:Foo">
    <div itemprop="Argument" itemscope itemtype="http://www.mediawiki.org/microdata/wikitext/Argument">
        <meta itemprop="Name" content="bar">
        <span itemprop="Content">baz</span>
    </div>
</div>
Maybe it would be easier to understand if our examples were a complete set of page content input WikiText, template content input WikiText, and the output DOM we expect. Perhaps also what the intended goal is. I'm not quite sure if we're trying to describe templates in a way that the VisualEditor can extract the parameters from, edit them inline (if possible), or describe the output of a template in a way that can be read by machines for some separate purpose.
On 02/13/2012 03:27 AM, Daniel Friesen wrote:
Microdata items can be nested, so I don't see a problem with users or templates providing a mapping to more specific schemas like those of schema.org. Clashes of user-provided itemtypes with those used for editing purposes need to be prevented in the parser, but that is doable. Consumers are free to ignore itemtypes they don't know about, which is what Google etc are doing afaik- and what also motivated them to set up schema.org in the first place.
Hmmm... wait, now I'm confused. Are we talking about a Microdata DOM output that the Parser generates from WikiText, or a completely tailored one where the template itself is authored in Microdata so that it can describe how a Visual Editor should edit it?
I considered the case where users manually add a microdata item in a template or page. The itemtype in that case can be anything, but would most likely be a standard type.
Then I'm saying that I don't like itemtype being abused as the template name and itemprop being abused as the template argument name. Instead of abusing the template name and parameter names as the schema of the template, I'd rather have a more verbose, proper set of Microdata to describe it:
Could you elaborate why you consider one use of itemtype an abuse, while the other would be fine?
I'm not quite sure if we're trying to describe templates in a way that the VisualEditor can extract the parameters from, edit them inline (if possible), or describe the output of a template in a way that can be read by machines for some separate purpose.
We are trying to address all three with the same mechanism. In particular, we are trying to aid the discovery of semantics associated with (many) template parameters for the benefit of search engines or projects like DBPedia and WikiData.
Gabriel
Dear Hashar, ^demon, Krinkle and the other list readers,
Last week I committed to SVN a small maintenance update of the MWDumper code and POM. I then placed it into Jenkins, our continuous integration system. To help our users, we are now publishing a built version of the application. I will update the extension page to use our CI as the source for the JAR (once tested).
To fully support third-party developers, we must provide versioned MWDumper libraries. These project dependencies are called artifacts and are stored in a Maven repository. Since search depends on MWDumper, this is also a requirement for a fully automated build.
I expect that within a month the search project will start releasing versioned data sets packaged as JARs to share with other projects. A Maven repository seems ideal for this purpose, since these will be large binaries which should not go into SVN.
I'm requesting:
1. To install the (open source) Artifactory repository http://www.jfrog.com/products.php on the Jenkins machine. (I installed it on Tomcat during a tech evaluation last month and it took about 5 minutes.) It is a WAR (a web application that runs under Tomcat, same as Jenkins).
2. The Jenkins Artifactory plug-in https://wiki.jenkins-ci.org/display/JENKINS/Artifactory+Plugin, which lets a build publish the artifacts (dependencies) to Artifactory.
Finally:
I am in the process of adding the capability to generate edits via a bot, to simulate user updates for testing that search is updating correctly. If this type of testing is interesting to the PHP team, we can collaborate on also making the bot(s) stress-test a MediaWiki installation.
Yours
Oren Bochman
Search Project Lead E-mail: orenbochman@gmail.com
On Mon, 13 Feb 2012 00:13:21 -0800, Gabriel Wicke wicke@wikidev.net wrote:
[...]
Could you elaborate why you consider one use of itemtype an abuse, while the other would be fine?
An itemtype is supposed to be a proper type of what the data is. Something expected, well-known, predefined. If possible there should be only one for some type of thing. And one should be able to query for it already knowing what that type is, like one would with an xmlns.
itemtype="http://en.wikipedia.org/wiki/Template:Cite" is not something pre-defined. It practically appears dynamically out of nowhere with no forethought. And if someone copies the template then that exact same set of data has a completely different itemtype despite being the same thing.
Another point in this example. Template:Cite is actually a good example here.
In a normal itemtype you generally stick to one name for something. You have a citation type, and you have a "firstname" prop, and you can have multiples of them, i.e.:
<span itemprop="firstname">Arnold</span>
<span itemprop="firstname">Harold</span>
(though in a really good type you'd likely have a separate itemtype to group all the info of a name into one itemprop="name" itemscope ...). However, in a template we get this:
|first=Arnold
|first2=Harold
resulting in what you'd say would be:
<span itemprop="first">Arnold</span>
<span itemprop="first2">Harold</span>
That's nothing close to a properly defined itemtype that actually allows 3rd parties to extract data in any sane way. Nor is it something a Visual Editor would make use of without a wildcard hack where it examines every itemtype and decides that any URL pointing back to the wiki is something it can edit. Anything that actually manages to extract data from that kind of thing is a hack at its very core.
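To make the objection concrete, here is a sketch of the kind of guesswork a consumer would need in order to turn numbered template parameters back into a repeated property. The function name `group_numbered` is hypothetical, and the trailing-digit rule is only a heuristic assumed for this example. Nothing in the markup itself says that first2 is a second first.

```python
import re
from collections import defaultdict


def group_numbered(props):
    """Group (name, value) pairs by name with any trailing digits stripped."""
    grouped = defaultdict(list)
    for name, value in props:
        base = re.sub(r"\d+$", "", name)  # "first2" -> "first" (a guess)
        grouped[base].append(value)
    return dict(grouped)


grouped = group_numbered([("first", "Arnold"), ("first2", "Harold")])
```

The heuristic breaks as soon as a parameter name legitimately ends in a digit, which is part of why this kind of extraction is fragile without a formally defined type.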
In contrast, when we use `itemtype="http://www.mediawiki.org/microdata/wikitext/Transclusion"` and `itemprop="Argument" itemscope itemtype="http://www.mediawiki.org/microdata/wikitext/Argument"`, we have a predefined type. We're formally describing a transclusion of a template into another page, and the arguments used. The format of this is defined beforehand. We can add in extra data that would have been a hack before, like the canonical pagename of the template. Perhaps even some metadata that is stored inside the template itself. For example, say SemanticForms implemented some embedded editor form code. A template could add extra metadata saying that the template's content should be edited using a defined Semantic Forms form. The Visual Editor would then use that information to embed a small area that allows Semantic Forms to be used to edit the template inline, allowing editing of things that could potentially be too complex for the Visual Editor to understand how to make editable. Though that's really just an example off the top of my head; there are probably other things that could use metadata from the template to improve the Visual Editor's ability to make templates editable as intuitively as possible.
I'm not quite sure if we're trying to describe templates in a way that the VisualEditor can extract the parameters from, edit them inline (if possible), or describe the output of a template in a way that can be read by machines for some separate purpose.
We are trying to address all three with the same mechanism. In particular, we are trying to aid the discovery of semantics associated with (many) template parameters for the benefit of search engines or projects like DBPedia and WikiData.
Gabriel
Projects like DBPedia already hack around this, trying to extract data from the parameters passed to a template and using tricks to associate some sort of meaning with template parameters without getting that information from the wiki itself. For them, using an itemtype="http://www.mediawiki.org/microdata/wikitext/Transclusion" is basically a formal way to extract the parameters of a template without having to do the unreliable work of attempting to parse the WikiText themselves. So it's still a usable improvement. For search engines and other 3rd parties, I don't believe any of them are going to want to go around to every wiki and start hardcoding into their code things like itemtype="http://mywiki.com/wiki/Template:Event" and itemtype="http://yourwiki.com/wiki/Template:OurEvent" both describing an event they would extract. I don't think we're going to get good metadata for general 3rd parties without actually embedding proper formal microdata into templates themselves.
On 02/13/2012 10:28 PM, Daniel Friesen wrote:
itemtype="http://www.mediawiki.org/microdata/wikitext/Transclusion" is basically a formal way to extract the parameters of a template without having to do the unreliable work of attempting to parse the WikiText themselves. So it's still a usable improvement.
The main issue I have with this style of a purely structural itemtype is the limited pragmatic value compared to its significantly increased cost. A relatively light-weight fragment like
<div itemtype="http://en.wikipedia.org/wiki/Template:Foo" itemscope>
    <span itemprop="firstname">The first name</span>
</div>
would be blown up to something like
<div itemtype="http://www.mediawiki.org/microdata/wikitext/Transclusion" itemscope>
    <meta itemprop="source" content="http://en.wikipedia.org/wiki/Template:Foo" />
    <span itemprop="Argument"
          itemtype="http://www.mediawiki.org/microdata/wikitext/Argument"
          itemscope>
        <meta itemprop="argname" content="firstname">
        <span itemprop="argvalue">The first name</span>
    </span>
</div>
This would increase the memory used for the DOM, slow down network transfers and processing and make it unlikely that we could leave this information in regular rendered pages.
For search engines and other 3rd parties, I don't believe any of them are going to want to go around to every wiki and start hardcoding into their code things like itemtype="http://mywiki.com/wiki/Template:Event" and itemtype="http://yourwiki.com/wiki/Template:OurEvent" both describing an event they would extract. I don't think we're going to get good metadata for general 3rd parties without actually embedding proper formal microdata into templates themselves.
Unfortunately, they would have to do the same hardcoding with a global Transclusion itemtype, as the only thing that allows an association of vocabulary semantics (the template source URL in the meta element) still contains the URL of the wiki. So the added complexity does not really simplify the extraction of semantically defined data.
To improve this, I am all in favor of adding schema and editor-specific information to templates. The most natural storage location for this extra information would be directly in the documentation section of the template it describes. This makes it easy to find and edit, and ensures that the schema is copied along with the template. Some of this extra information might even be usable to automatically add additional, globally defined (schema.org or similar) itemtypes to the rendered output, which can make the information directly available to search engines without any manual work on their part.
I also don't think that prefix matches on the itemtype instead of a full string match are quite as hard or hacky as you make it out to be. Search engines already routinely perform this in their crawlers to support schema extensions: http://schema.org/docs/extension.html.
A global itemtype hierarchy for templates could still be introduced along with a central repository of generally useful and semantically annotated templates. Something like http://mediawiki.org/md/Transclusion/Cite maybe, with the option to subclass as http://mediawiki.org/md/Transclusion/Cite/en.wikipedia.org if a local extension is needed.
For the editor project, we mainly need an efficient representation of the needed information with minimal changes to the rendered output. Any solution that requires us to add many additional elements will simply not work for us. The exact itemtype URL used on the other hand is easily adjusted if a useful global hierarchy emerges.
Gabriel
On Tue, 14 Feb 2012 03:21:56 -0800, Gabriel Wicke wicke@wikidev.net wrote:
On 02/13/2012 10:28 PM, Daniel Friesen wrote:
itemtype="http://www.mediawiki.org/microdata/wikitext/Transclusion" is basically a formal way to extract the parameters of a template without having to do the unreliable work of attempting to parse the WikiText themselves. So it's still a usable improvement.
The main issue I have with this style of a purely structural itemtype is the limited pragmatic value compared to its significantly increased cost. A relatively light-weight fragment like
<div itemtype="http://en.wikipedia.org/wiki/Template:Foo" itemscope> <span itemprop="firstname">The first name</span> </div>
would be blown up to something like
<div itemtype="http://www.mediawiki.org/microdata/wikitext/Transclusion" itemscope> <meta itemprop="source" data="http://en.wikipedia.org/wiki/Template:Foo" /> <span itemprop="Argument" itemtype="http://www.mediawiki.org/microdata/wikitext/Argument" itemscope> <meta itemprop="argname" content="firstname"> <span itemprop="argvalue">The first name</span> </span> </div>
This would increase the memory used for the DOM, slow down network transfers and processing and make it unlikely that we could leave this information in regular rendered pages.
I don't think we can include this stuff in general page data anyways. Adding any level of additional implicit markup to something as absolutely basic as {{{1|}}} could completely destroy things. The css targeting changes, js targeting changes, and if the template author happens to have gone to the effort of nicely adding Microdata of their own, we destroy it.
# Template:Movie
<div itemscope itemtype="http://schema.org/Movie">
'''Title:''' <span itemprop="name">{{{title}}}</span>
</div>

# Page content
{{Movie|title=Avatar}}

# Result
<div itemscope itemtype="http://en.wikipedia.org/wiki/Template:Movie">
<div itemscope itemtype="http://schema.org/Movie">
'''Title:''' <span itemprop="name"><span itemprop="title">Avatar</span></span>
</div>
</div>
The result is absolute nonsense, and the only real action that can be taken to retain the Visual Editor's ability to edit the template is to hide the schema.org metadata in another layer of metadata describing it, which makes the metadata the author wrote useless. Hence, given that 3rd parties that are aware of templates and explicitly want to extract data from their parameters can use an alternate method of querying for the mixed DOM, that generic 3rd parties are unlikely to want to hardcode anything to do with the unstable and nonsensical meanings of transclusion parameters, and that we can easily destroy good, valid metadata and user styles, I don't think including this extra DOM in general page views is a good idea anyway.
For search engines and other 3rd parties, I don't believe any of them are going to want to go around to every wiki and start hardcoding into their code things like itemtype="http://mywiki.com/wiki/Template:Event" and itemtype="http://yourwiki.com/wiki/Template:OurEvent" both describing an event they would extract. I don't think we're going to get good metadata for general 3rd parties without actually embedding proper formal microdata into templates themselves.
Unfortunately, they would have to do the same hardcoding with a global Transclusion itemtype, as the only thing that allows an association of vocabulary semantics (the template source URL in the meta element) still contains the URL of the wiki. So the added complexity does not really simplify the extraction of semantically defined data.
They have to do the hardcoding either way. I'm saying that generic 3rd parties aren't going to do any hardcoding of domain-specific-schemas at all whatever the syntax we use, and hence generic 3rd parties are a complete moot point for discussing whether we use template-url as itemtype or a formally defined itemtype.
And the goal of metadata formats like Microdata is not simple extraction, it's having formally defined metadata which can be extracted reliably with an intuitive and consistent format. That's not what itemprop="last2" is. If we just wanted simply extracted data, we wouldn't be using Microdata at all, we'd just shove everything into something simple like: <div data-wt-transclusion="/wiki/Template:Movie"> '''Title:''' <span data-wt-param="title">Avatar</span> </div>
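To make the "simple extraction" alternative concrete, here is a minimal Python sketch that reads the data-wt-* form shown above. The attribute names data-wt-transclusion and data-wt-param come from the example in this message; the class name and everything else is an assumption for illustration, not part of any existing parser.

```python
from html.parser import HTMLParser

# Elements that never have an end tag; they must not affect nesting depth.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class TransclusionParamParser(HTMLParser):
    """Hypothetical reader for the data-wt-* sketch (illustration only)."""

    def __init__(self):
        super().__init__()
        self.transclusion = None  # value of data-wt-transclusion, if seen
        self.params = {}          # param name -> rendered text
        self._current = None      # param whose text is being collected
        self._depth = 0           # open elements inside the current param

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "data-wt-transclusion" in attrs:
            self.transclusion = attrs["data-wt-transclusion"]
        if tag in VOID:
            return
        if self._current is not None:
            self._depth += 1
        elif "data-wt-param" in attrs:
            self._current = attrs["data-wt-param"]
            self._depth = 1
            self.params[self._current] = ""

    def handle_endtag(self, tag):
        if tag in VOID or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.params[self._current] += data


parser = TransclusionParamParser()
parser.feed(
    '<div data-wt-transclusion="/wiki/Template:Movie"> '
    "'''Title:''' <span data-wt-param=\"title\">Avatar</span> </div>"
)
print(parser.transclusion, parser.params)
```

The sketch only recovers names and rendered text; it says nothing about what a parameter means, which is the distinction being drawn here between plain extraction and formally defined metadata.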
To improve this, I am all in favor of adding schema and editor-specific information to templates. The most natural storage location for this extra information would be directly in the documentation section of the template it describes. This makes it easy to find and edit, and ensures that the schema is copied along with the template. Some of this extra information might even be usable to automatically add additional, globally defined (schema.org or similar) itemtypes to the rendered output, which can make the information directly available to search engines without any manual work on their part.
I also don't think that prefix matches on the itemtype instead of a full string match are quite as hard or hacky as you make it out to be. Search engines already routinely perform this in their crawlers to support schema extensions: http://schema.org/docs/extension.html.
Those are completely different levels of wildcarding.
With schema.org they're simply saying that every http://schema.org/Person/Subtype matched by http://schema.org/Person/* is treated as a http://schema.org/Person type. And itemprop="email/work" is treated as itemprop="email" is. There's still a perfectly good formal schema there.
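The schema.org-style matching described here can be sketched in a few lines of Python. The function names are hypothetical, and the rule is only the path-prefix convention as summarized in this message, not a statement of how any particular crawler implements it.

```python
def matches_type(itemtype: str, base: str) -> bool:
    """True if itemtype is base itself or a path extension of it (base/...)."""
    base = base.rstrip("/")
    return itemtype == base or itemtype.startswith(base + "/")


def base_property(itemprop: str) -> str:
    """Reduce an extended property such as 'email/work' to its base 'email'."""
    return itemprop.split("/", 1)[0]


print(matches_type("http://schema.org/Person/Subtype", "http://schema.org/Person"))
print(base_property("email/work"))
```

Note that the prefix test includes the trailing slash, so a type such as http://schema.org/PersonX would not be treated as a Person.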
What we're saying with itemtype="{templateurl}" is that every itemtype="http://en.wikipedia.org/wiki/*" is an itemtype="" of, well, we don't even have a formal definition of what it is. We're just saying that if it matches that wildcard it's a template transclusion. And there's nothing to define what that is. We're also saying that every itemprop="*" inside of it is a template parameter. Absolutely no formal definition saying what kind of data goes there, how it should be treated, etc. And we're also saying that you'll get things like itemprop="first", itemprop="last", itemprop="first2", itemprop="last2". And you're supposed to take "first" and "last" and combine them conceptually as one "name", and likewise you also have to explicitly take "first2" and "last2" and combine these conceptually, but they aren't of type "name2", they are also of type "name". This is not Microdata, this is a mess. The only relation it has to Microdata is the fact that Microdata's syntax is being abused as a container for it.
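To illustrate the burden this places on a consumer, here is a small Python sketch of the template-specific guesswork needed to turn first/last/first2/last2 back into names. The function name and the numbering convention it assumes (no suffix means 1) are hypothetical; nothing in the markup itself declares this rule.

```python
import re
from collections import defaultdict

# Template-specific knowledge the consumer has to hardcode: which parameter
# names are name parts, and that a numeric suffix groups them.
_NAME_PART = re.compile(r"^(first|last)(\d*)$")


def group_names(props):
    """Group first/last/first2/last2-style parameters into per-person dicts."""
    grouped = defaultdict(dict)
    for key, value in props.items():
        match = _NAME_PART.match(key)
        if match is None:
            continue  # not a name part as far as this heuristic can tell
        part = match.group(1)
        index = int(match.group(2) or 1)  # assumed: a missing suffix means 1
        grouped[index][part] = value
    return [grouped[i] for i in sorted(grouped)]


print(group_names({"first": "John", "last": "Doe",
                   "first2": "Jane", "last2": "Roe", "title": "A paper"}))
```

Every template with a different parameter naming scheme would need its own version of this regular expression, which is exactly the kind of hardcoding generic 3rd parties are unlikely to do.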
It's like encoding a video with H.264, the audio with AAC, putting it into a Matroska container, changing the file extension to .webm. And then saying it's .webm because the file extension says .webm and the container format looks like .webm's container format.
A global itemtype hierarchy for templates could still be introduced along with a central repository of generally useful and semantically annotated templates. Something like http://mediawiki.org/md/Transclusion/Cite maybe, with the option to subclass as http://mediawiki.org/md/Transclusion/Cite/en.wikipedia.org if a local extension is needed.
For the editor project, we mainly need an efficient representation of the needed information with minimal changes to the rendered output. Any solution that requires us to add many additional elements will simply not work for us. The exact itemtype URL used on the other hand is easily adjusted if a useful global hierarchy emerges.
Changing:
Foo
To:
<span itemprop="foo">Foo</span>
is already an absolutely destructive change for anything that it would affect. Using:
<span itemprop="Argument" itemtype="http://www.mediawiki.org/microdata/wikitext/Argument" itemscope> <meta itemprop="name" content="foo"> <span itemprop="value">Foo</span> </span>
will not destroy things in a way any worse than the other change will.
And it's the only way you'll be able to convey something like: <div itemtype="http://www.mediawiki.org/microdata/wikitext/Transclusion" itemscope> <meta itemprop="PageName" content="Template:Foo"> <meta itemprop="RawText" content="Foo#This is some discarded data"> <b>Bar:</b> <span itemprop="argument" itemtype="http://www.mediawiki.org/microdata/wikitext/Argument" itemscope><!-- --><meta itemprop="name" content="bar"><!-- --><meta itemprop="default" content="Baz"><!-- --><span itemprop="value">Foo</span><!-- --></span> </div>
Which will allow the Visual Editor to restore the original WikiText and also have intuitive cues in the editor, so that it visually restores the default of "Baz" in place of the param text "Foo" when the user does something that indicates they would probably want the Visual Editor to drop the param and show the default, had they known about things at the source level. And like I said before, there are probably more things that would require extra metadata beyond what itemtype and itemprop hacks can provide, which I can't even think up right now.
Gabriel
One thing I still don't get. In WikiText a <h2>Foo</h2> (normal extra markup omitted) can be expressed by both == Foo == and ==Foo==. I thought one of the key goals of the Visual Editor was that it would not get in the way of source level editors by mucking up content, changing a ==Foo== to a == Foo == or a == Foo == to a ==Foo== when the Visual Editor user hasn't even touched that section, like just about every previous WYSIWYG editor has done. How is the Visual Editor supposed to do that when the DOM we're talking about is lossy and doesn't contain any extra metadata giving that information?
Conflicts with user-defined microdata should be avoidable using multiple (URL-prefixed) names per itemprop, as described in the HTML spec. Multiple itemtypes might also be possible, but the details there are still a bit murky.
And it's the only way you'll be able to convey something like:
<div itemtype="http://www.mediawiki.org/microdata/wikitext/Transclusion" itemscope> <meta itemprop="PageName" content="Template:Foo"> <meta itemprop="RawText" content="Foo#This is some discarded data"> <b>Bar:</b> <span itemprop="argument" itemtype="http://www.mediawiki.org/microdata/wikitext/Argument" itemscope><!-- --><meta itemprop="name" content="bar"><!-- --><meta itemprop="default" content="Baz"><!-- --><span itemprop="value">Foo</span><!-- --></span> </div>
Oh- we do add round-trip / meta-information in data attributes. The use of meta elements to represent otherwise absent or difficult parameters was also discussed earlier in this thread. The idea is to mark up properties inline where it makes sense (and that will be many cases), but revert to meta elements for anything deemed too difficult. I also don't see a need to represent default values as separate itemprops. The fact that some template parameter value came from the default value of an undefined template argument in the page does not seem to be very relevant for its semantics, and can be noted in an attribute as well.
How is the Visual Editor supposed to do that when the DOM we're talking about is lossy and doesn't contain any extra metadata giving that information?
We have round-trip information for variable whitespace etc, but that still does not cover changes introduced by the need to transform tag soup into a tree. To minimize the effect of these changes in diffs, we currently plan to only re-serialize parts of the DOM that were actually marked as modified by the editor. Round-trip info contains original source offset ranges for elements, which makes it possible to splice in the original source for untouched DOM parts. The result should be a minimization and localization of any remaining normalization artifacts to avoid 'dirty diffs'- normalization changes in unmodified parts of the document.
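The splicing idea can be sketched as follows in Python. This is only an illustration under simplifying assumptions: a flat list of top-level nodes, and a hypothetical Node record with normalized, src_range and modified fields. None of these names come from the actual parser code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical top-level DOM node carrying round-trip information."""
    normalized: str                       # what a full serializer would emit
    src_range: Optional[Tuple[int, int]]  # [start, end) in the original source
    modified: bool = False                # set by the editor when touched


def serialize(nodes: List[Node], source: str) -> str:
    """Re-serialize only modified nodes; splice original source for the rest."""
    parts = []
    for node in nodes:
        if node.modified or node.src_range is None:
            parts.append(node.normalized)
        else:
            start, end = node.src_range
            parts.append(source[start:end])
    return "".join(parts)


source = "==Foo==\nSome text.\n== Bar ==\nMore text.\n"
nodes = [
    Node("== Foo ==\n", (0, 8)),                        # untouched heading
    Node("Some new text.\n", (8, 19), modified=True),   # edited paragraph
    Node("== Bar ==\n", (19, 29)),
    Node("More text.\n", (29, 40)),
]
print(serialize(nodes, source))
```

In this example the untouched ==Foo== heading keeps its original spacing even though the serializer would normalize it to == Foo ==, so the resulting diff is limited to the edited paragraph.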
Gabriel
On 02/02/12 18:38, Gabriel Wicke wrote:
Higher-level features can be represented in the HTML DOM using different extension mechanisms:
- Introduce custom elements with specific attributes: <template href="Template:Bla' args=".../>
You mean <mw:template href="Template:Bla' args=".../> :-)
- Expand higher-level features to their presentational DOM, but identify and annotate the result using custom attributes. (...)
Unused arguments (which are not found in the template expansion) or unexpanded templates can be represented using non-displaying meta elements:
<div itemscope itemtype='http://en.wikipedia.org/wiki/Template:Sometemplate' id='uid-1' > <h2>A static header from the template</h2> <!-- The template argument 'name', expanded in the template --> <p itemprop='name' content='The wikitext name'>The rendered name</p> <meta itemprop='firstname' content='The wikitext firstname'> </div>
The <p> item looks wrong, in microdata it wouldn't have the content attribute. Its value instead would be 'The rendered name'. Extracting the wikitext from there instead of a copy of the inner wikitext shouldn't be a problem, though. As long as it doesn't contain unbalanced syntax... (but that's a point where it seems safe to break compatibility).
I think it would look more like: <div itemscope itemtype='http://en.wikipedia.org/wiki/Template:Sometemplate'> <h1 itemprop="name">John Doe</h1> <p>Name: <span itemprop="name">John Doe</span></p> <p>Age: <span itemprop="age">21</span></p> </div>
And indeed it would look very cool to in-place edit the age, or name. But remember that it should be changing both locations!
However, it doesn't seem so easy for other items: [[Image:Photo of {{{name}}}.jpg|120px]]
{{#ifexist: {{{name}}} family| [[{{{name}}} family]] }}
{{#if: {{{wife|}}} | Married with [[{{{wife}}}]] | Single }}
As soon as it hits unrepresentable syntax, I think it should disable visual modification of that template. It's better to force editing of parameters than to have the user who added a middle name make the photo disappear after saving (even though it still showed when he clicked save).
<div itemscope itemtype='http://en.wikipedia.org/wiki/Template:Sometemplate' id='uid-1' > <h2>A static header from the template</h2> <!-- The template argument 'name', expanded in the template --> <p itemprop='name' content='The wikitext name'>The rendered name</p> <meta itemprop='firstname' content='The wikitext firstname'> </div>
The <p> item looks wrong, in microdata it wouldn't have the content attribute.
Yeah, you are right. The content attribute is specific to the meta element.
Its value instead would be 'The rendered name'. Extracting the wikitext from there instead of a copy of the inner wikitext shouldn't be a problem, though. As long as it doesn't contain unbalanced syntax... (but that's a point where it seems safe to break compatibility).
..or additional template expansions, as we don't intend to provide information about all recursively expanded templates.
I think it would look more like:
<div itemscope itemtype='http://en.wikipedia.org/wiki/Template:Sometemplate'> <h1 itemprop="name">John Doe</h1> <p>Name: <span itemprop="name">John Doe</span></p> <p>Age: <span itemprop="age">21</span></p> </div>
Another possibility would be to use meta tags, or something like this example from the microdata spec: <data itemprop="product-id" value="9678AOU879">The Instigator 2000</data>
And indeed it would look very cool to in-place edit the age, or name. But remember that it should be changing both locations!
The template arguments are fortunately in the containing page, so we don't have to edit the template. But I am not sure if that is what you had in mind.
However, it doesn't seem so easy for other items: [[Image:Photo of {{{name}}}.jpg|120px]]
{{#ifexist: {{{name}}} family| [[{{{name}}} family]] }}
{{#if: {{{wife|}}} | Married with [[{{{wife}}}]] | Single }}
As soon as it hits unrepresentable syntax, I think it should disable visual modification of that template. It's better to force editing of parameters than to have the user who added a middle name make the photo disappear after saving (even though it still showed when he clicked save).
So far we have only considered editing of arguments actually stored in the including page, but not templates. Even arguments would initially only be inline-editable if they are expanded directly rather than passed on to other templates. For now, passed-on arguments would be stored (as wikitext or tokens) in meta elements, and would be editable only using a widget. It should be possible to improve on that later.
In general, arguments (or template / image names etc) containing advanced things like additional templates are likely needed in an unexpanded form in addition to the expansion, which is what I was trying to demonstrate with the content attribute.
We also don't plan to support drilling down more than a single expansion layer for now. The rendered content of the template will be fully expanded to provide the visual layout, but will not contain information about lower templates or their arguments for editing purposes.
Gabriel
Gabriel Wicke wrote:
I think it would look more like:
<div itemscope itemtype='http://en.wikipedia.org/wiki/Template:Sometemplate'> <h1 itemprop="name">John Doe</h1> <p>Name: <span itemprop="name">John Doe</span></p> <p>Age: <span itemprop="age">21</span></p> </div>
Another possibility would be to use meta tags, or something like this example from the microdata spec: <data itemprop="product-id" value="9678AOU879">The Instigator 2000</data>
And indeed it would look very cool to in-place edit the age, or name. But remember that it should be changing both locations!
The template arguments are fortunately in the containing page, so we don't have to edit the template. But I am not sure if that is what you had in mind.
I was noting that the editing would be changing both lines.
So far we have only considered editing of arguments actually stored in the including page, but not templates. Even arguments would initially only be inline-editable if they are expanded directly rather than passed on to other templates. For now, passed-on arguments would be stored (as wikitext or tokens) in meta elements, and would be editable only using a widget. It should be possible to improve on that later.
Yes, but an argument could be present in both an editable form and an uneditable one. If the parser can't show the change in the uneditable one, it shouldn't allow editing it.
<h1 itemprop="name">John Doe</h1> <p>Name: <span itemprop="name">John Doe</span></p>
I was noting that the editing would be changing both lines.
Ah- I missed the identical itemprop ;)
So far we have only considered editing of arguments actually stored in the including page, but not templates. Even arguments would initially only be inline-editable if they are expanded directly rather than passed on to other templates. For now, passed-on arguments would be stored (as wikitext or tokens) in meta elements, and would be editable only using a widget. It should be possible to improve on that later.
Yes, but an argument could be present in both an editable form and an uneditable one. If the parser can't show the change in the uneditable one, it shouldn't allow editing it.
That is a good point- the editor view would get inconsistent if it can't identify the second use of the argument. We could perhaps revert to the sledgehammer method of re-expanding the template top-down using a call to the parser.
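As a rough sketch of the easy half of this problem, the following Python collects every inline element that carries a given itemprop, so an editor could update all identified uses of an argument together. The class name is hypothetical, and the sketch deliberately cannot see uses that were not annotated (for example an argument consumed inside an image name), which is the case where re-expansion would still be needed.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Void elements have no end tag; meta values live in the content attribute,
# so they are skipped here rather than treated as inline-editable text.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class ItempropCollector(HTMLParser):
    """Collect the rendered text of every element with an itemprop attribute."""

    def __init__(self):
        super().__init__()
        self.occurrences = defaultdict(list)  # itemprop -> list of texts
        self._stack = []                      # itemprop (or None) per open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        prop = dict(attrs).get("itemprop")
        self._stack.append(prop)
        if prop is not None:
            self.occurrences[prop].append("")

    def handle_endtag(self, tag):
        if tag not in VOID and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Attribute text to the innermost enclosing itemprop element, if any.
        for prop in reversed(self._stack):
            if prop is not None:
                self.occurrences[prop][-1] += data
                break


collector = ItempropCollector()
collector.feed('<h1 itemprop="name">John Doe</h1> '
               '<p>Name: <span itemprop="name">John Doe</span></p>')
print(dict(collector.occurrences))
```

An editor could compare the number of occurrences found this way with the number of uses it expects, and fall back to widget editing or a full re-expansion when the two do not agree.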
There are also questions around editing of 'complex' arguments (which include templates or the like). If we simply revert to WikiText editing of those, then a modification that results in unbalanced WikiText would have non-local changes when re-parsed globally. We could perhaps try to enforce well-formedness in a validation (or parsing) step.
Gabriel
wikitext-l@lists.wikimedia.org