On Tue, 14 Feb 2012 03:21:56 -0800, Gabriel Wicke <wicke@wikidev.net> wrote:
On 02/13/2012 10:28 PM, Daniel Friesen wrote:
itemtype="http://www.mediawiki.org/microdata/wikitext/Transclusion" is basically a formal way to extract the parameters of a template without having to do the unreliable work of attempting to parse the WikiText themselves. So it's still a usable improvement.
The main issue I have with this style of a purely structural itemtype is the limited pragmatic value compared to its significantly increased cost. A relatively light-weight fragment like
<div itemtype="http://en.wikipedia.org/wiki/Template:Foo" itemscope> <span itemprop="firstname">The first name</span> </div>
would be blown up to something like
<div itemtype="http://www.mediawiki.org/microdata/wikitext/Transclusion" itemscope> <meta itemprop="source" content="http://en.wikipedia.org/wiki/Template:Foo" /> <span itemprop="Argument" itemtype="http://www.mediawiki.org/microdata/wikitext/Argument" itemscope> <meta itemprop="argname" content="firstname"> <span itemprop="argvalue">The first name</span> </span> </div>
This would increase the memory used for the DOM, slow down network transfers and processing and make it unlikely that we could leave this information in regular rendered pages.
I don't think we can include this stuff in general page data anyway. Adding any level of additional implicit markup to something as absolutely basic as {{{1|}}} could completely destroy things. The CSS targeting changes, the JS targeting changes, and if the template author happens to have gone to the effort of nicely adding Microdata of their own, we destroy it.
# Template:Movie <div itemscope itemtype="http://schema.org/Movie"> '''Title:''' <span itemprop="name">{{{title}}}</span> </div>
# Page content {{Movie|title=Avatar}}
# Result <div itemscope itemtype="http://en.wikipedia.org/wiki/Template:Movie"> <div itemscope itemtype="http://schema.org/Movie"> '''Title:''' <span itemprop="name"><span itemprop="title">Avatar</span></span> </div> </div>
The result is absolute nonsense. The only real action that can be taken to keep the template editable in the Visual Editor is to hide the schema.org metadata inside another layer of metadata describing it, which makes the metadata the author wrote useless. So, given that 3rd parties that are aware of templates and explicitly want to extract data from their parameters can use an alternate method of querying for the mixed DOM, that generic 3rd parties are unlikely to want to hardcode anything to do with the unstable and nonsensical meanings of transclusion parameters, and that we can easily destroy good valid metadata and user styles, I don't think including this extra DOM in general page views is a good idea anyway.
For search engines and other 3rd parties, I don't believe any of them are going to want to go around to every wiki and start hardcoding into their code things like itemtype="http://mywiki.com/wiki/Template:Event" and itemtype="http://yourwiki.com/wiki/Template:OurEvent" both describing an event they would extract. I don't think we're going to get good metadata for general 3rd parties without actually embedding proper formal microdata into templates themselves.
Unfortunately, they would have to do the same hardcoding with a global Transclusion itemtype, as the only thing that allows an association of vocabulary semantics (the template source URL in the meta element) still contains the URL of the wiki. So the added complexity does not really simplify the extraction of semantically defined data.
They have to do the hardcoding either way. I'm saying that generic 3rd parties aren't going to do any hardcoding of domain-specific-schemas at all whatever the syntax we use, and hence generic 3rd parties are a complete moot point for discussing whether we use template-url as itemtype or a formally defined itemtype.
And the goal of metadata formats like Microdata is not simple extraction, it's having formally defined metadata which can be extracted reliably with an intuitive and consistent format. That's not what itemprop="last2" is. If we just wanted simply extracted data, we wouldn't be using Microdata at all, we'd just shove everything into something simple like: <div data-wt-transclusion="/wiki/Template:Movie"> '''Title:''' <span data-wt-param="title">Avatar</span> </div>
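To show how little is needed for that kind of plain extraction, here is a minimal sketch in Python using only the standard library's html.parser. The data-wt-transclusion and data-wt-param attributes are the ad-hoc ones from the example above; the class name and the output shape are made up for illustration, and the sketch assumes transclusions are not nested.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TransclusionExtractor(HTMLParser):
    """Collect parameters marked with the hypothetical data-wt-* attributes."""

    def __init__(self):
        super().__init__()
        self.transclusions = []  # [{"template": str, "params": {name: text}}]
        self._stack = []         # per open element: its param name, or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if "data-wt-transclusion" in attrs:
            self.transclusions.append(
                {"template": attrs["data-wt-transclusion"], "params": {}})
        param = attrs.get("data-wt-param")
        if param is not None and self.transclusions:
            self.transclusions[-1]["params"][param] = ""
        self._stack.append(param)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Text belongs to the innermost enclosing data-wt-param element, if any.
        for param in reversed(self._stack):
            if param is not None and self.transclusions:
                self.transclusions[-1]["params"][param] += data
                break


extractor = TransclusionExtractor()
extractor.feed('<div data-wt-transclusion="/wiki/Template:Movie"> '
               "'''Title:''' "
               '<span data-wt-param="title">Avatar</span> </div>')
print(extractor.transclusions)
```

The point is only that extraction on its own is trivial; nothing in this output tells a consumer what "title" means.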
To improve this, I am all in favor of adding schema and editor-specific information to templates. The most natural storage location for this extra information would be directly in the documentation section of the template it describes. This makes it easy to find and edit, and ensures that the schema is copied along with the template. Some of this extra information might even be usable to automatically add additional, globally defined (schema.org or similar) itemtypes to the rendered output, which can make the information directly available to search engines without any manual work on their part.
I also don't think that prefix matches on the itemtype instead of a full string match are quite as hard or hacky as you make it out to be. Search engines already routinely perform this in their crawlers to support schema extensions: http://schema.org/docs/extension.html.
Those are completely different levels of wildcarding.
With schema.org they're simply saying that every http://schema.org/Person/Subtype matched by http://schema.org/Person/* is treated as a http://schema.org/Person type. And itemprop="email/work" is treated as itemprop="email" is. There's still a perfectly good formal schema there.
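As a rough sketch of that schema.org-style matching in Python (the function names are hypothetical, and this is only my reading of the extension convention, not any crawler's actual code):

```python
def matches_type(itemtype: str, base: str) -> bool:
    """True if itemtype is base itself or a slash-delimited extension of it."""
    return itemtype == base or itemtype.startswith(base.rstrip("/") + "/")


def base_property(itemprop: str) -> str:
    """Reduce an extended property such as "email/work" to its base "email"."""
    return itemprop.split("/", 1)[0]


print(matches_type("http://schema.org/Person/Subtype", "http://schema.org/Person"))
print(base_property("email/work"))
```

Here a match falls back to a type that already has a formal definition. The same prefix test against http://en.wikipedia.org/wiki/ would also succeed for any template URL, but there is no defined type for it to fall back to.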
What we're saying with itemtype="{templateurl}" is that every itemtype="http://en.wikipedia.org/wiki/*" is an itemtype="" of, well, we don't even have a formal definition of what it is. We're just saying that if it matches that wildcard it's a template transclusion. And there's nothing to define what that is. We're also saying that every itemprop="*" inside of it is a template parameter, with absolutely no formal definition saying what kind of data goes there, how it should be treated, etc. And we're also saying that you'll get things like itemprop="first" itemprop="last" itemprop="first2" itemprop="last2". And you're supposed to take "first" and "last" and combine them conceptually as one "name", and likewise you also have to explicitly take "first2" and "last2" and combine these conceptually, but they aren't of type "name2", they are also of type "name". This is not Microdata, this is a mess. The only relation it has to Microdata is the fact that Microdata's syntax is being abused as a container for it.
It's like encoding a video with H.264, the audio with AAC, putting it into a Matroska container, changing the file extension to .webm. And then saying it's .webm because the file extension says .webm and the container format looks like .webm's container format.
A global itemtype hierarchy for templates could still be introduced along with a central repository of generally useful and semantically annotated templates. Something like http://mediawiki.org/md/Transclusion/Cite maybe, with the option to subclass as http://mediawiki.org/md/Transclusion/Cite/en.wikipedia.org if a local extension is needed.
For the editor project, we mainly need an efficient representation of the needed information with minimal changes to the rendered output. Any solution that requires us to add many additional elements will simply not work for us. The exact itemtype URL used on the other hand is easily adjusted if a useful global hierarchy emerges.
Changing: Foo To: <span itemprop="foo">Foo</span> is already a destructive change for anything that it would affect. Using: <span itemprop="Argument" itemtype="http://www.mediawiki.org/microdata/wikitext/Argument" itemscope> <meta itemprop="name" content="foo"> <span itemprop="value">Foo</span> </span> will not destroy things in a way any worse than the other change will.
And it's the only way you'll be able to convey something like: <div itemtype="http://www.mediawiki.org/microdata/wikitext/Transclusion" itemscope> <meta itemprop="PageName" content="Template:Foo"> <meta itemprop="RawText" content="Foo#This is some discarded data"> <b>Bar:</b> <span itemprop="argument" itemtype="http://www.mediawiki.org/microdata/wikitext/Argument" itemscope><!-- --><meta itemprop="name" content="bar"><!-- --><meta itemprop="default" content="Baz"><!-- --><span itemprop="value">Foo</span><!-- --></span> </div>
Which will allow the Visual Editor to restore the original WikiText and also have intuitive cues in the editor: it can visually restore the default of "Baz" in place of the param text "Foo" when the user does something that indicates they would probably want the Visual Editor to drop the param and show the default, had they known about things at the source level. And like I said before, there are probably more things that would require extra metadata beyond what itemtype and itemprop hacks can provide, which I can't even think up right now.
Gabriel
One thing I still don't get. In WikiText a <h2>Foo</h2> (normal extra markup omitted) can be expressed by both == Foo == and ==Foo==. I thought one of the key goals of the Visual Editor was that it would not get in the way of source level editors by mucking up content, changing a ==Foo== to a == Foo == or a == Foo == to a ==Foo== when the Visual Editor user hasn't even touched that section, like just about every previous WYSIWYG editor has done. How is the Visual Editor supposed to do that when the DOM we're talking about is lossy and doesn't contain any extra metadata giving that information?
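To make the concern concrete, here is a minimal sketch in Python. The regex and both function names are hypothetical and have nothing to do with how the MediaWiki parser or the Visual Editor actually work; the sketch only shows that once both spellings collapse to the same element, a serializer has to pick one of them.

```python
import re

# Hypothetical, simplified heading syntax: matching runs of "=" around the text.
HEADING = re.compile(r"^(={1,6})(.*?)\1\s*$")


def heading_to_html(line: str) -> str:
    """Render a wikitext heading line, discarding the padding around the text."""
    match = HEADING.match(line)
    if match is None:
        raise ValueError("not a heading: %r" % line)
    level = len(match.group(1))
    return "<h%d>%s</h%d>" % (level, match.group(2).strip(), level)


def html_to_heading(level: int, text: str) -> str:
    """Serialize back to wikitext. One spelling has to be chosen: padded here."""
    marker = "=" * level
    return "%s %s %s" % (marker, text, marker)


# Both source spellings produce the same DOM fragment...
print(heading_to_html("==Foo=="))
print(heading_to_html("== Foo =="))
# ...so the untouched "==Foo==" comes back changed after a round trip.
print(html_to_heading(2, "Foo"))
```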