Forwarding this to the wikitext-l list just in case.
-------- Original Message -------- Subject: [Wikitech-l] Cutting MediaWiki loose from wikitext Date: Mon, 26 Mar 2012 16:45:51 +0200 From: Daniel Kinzler daniel@brightbyte.de Reply-To: Wikimedia developers wikitech-l@lists.wikimedia.org Organization: Wikimedia Deutschland e.V. To: Wikimedia developers wikitech-l@lists.wikimedia.org, mediawiki-l@lists.wikimedia.org CC: Lydia Pintscher lydia@pintscher.de, Abraham Taherivand abraham.taherivand@wikimedia.de
Hi all. I have a bold proposal (read: evil plan).
To put it briefly: I want to remove the assumption that MediaWiki pages always contain wikitext. Instead, I propose a pluggable handler system for different types of content, similar to what we have for file uploads. So, I propose to associate a "content model" identifier with each page, and to have handlers for each model that provide serialization, rendering, an editor, etc.
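To make this a bit more concrete, here is a very rough sketch of what such a handler might look like (the names below are made up for illustration; they are not actual code from the prototype):

    <?php
    // Rough illustrative sketch only - class and method names are invented.
    abstract class ContentHandler {

        // Identifier of the content model this handler supports,
        // e.g. "wikitext" or "wikidata-item".
        abstract public function getModelName();

        // Turn the native content object into a string for storage
        // (e.g. wikitext markup, or JSON for structured data).
        abstract public function serialize( $content, $format );

        // Turn the stored string back into the native content object.
        abstract public function unserialize( $blob, $format );

        // Render the content as HTML for viewing.
        abstract public function getHtml( $content );

        // Return an editor suitable for this kind of content.
        abstract public function createEditor( $title );
    }

    // Each page carries a content model identifier, which selects the handler:
    // $handler = ContentHandler::getForModel( $page->getContentModel() );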
The background is that the Wikidata project needs a way to store structured data (JSON) on wiki pages instead of wikitext. Having a pluggable system would solve that problem along with several others: doing away with the special cases for JS/CSS, maintaining categories etc. separately from the body text, managing Gadgets sanely on a wiki page, and more (see the link below).
I have described my plans in more detail on meta:
http://meta.wikimedia.org/wiki/Wikidata/Notes/ContentHandler
A very rough prototype is in a dev branch here:
http://svn.wikimedia.org/svnroot/mediawiki/branches/Wikidata/phase3/
Please let me know what you think (here on the list, preferably, not on the talk page there, at least for now).
Note that we *definitely* need this ability for Wikidata. We could do it differently, but I think this would be the cleanest solution, and it would have a lot of mid- and long-term benefits, even if it causes some short-term pain. I'm presenting my plan here to find out whether I'm on the right track, and whether it is feasible to put this on the road map for 1.20. It would be my (and the Wikidata team's) priority to implement this and see it through before Wikimania. I'm convinced we have the manpower to get it done.
Cheers, Daniel
Given we have pages that contain CSS, JS and JSON in the MediaWiki namespace, this seems more like a good idea for a nice refactoring project than a new idea altogether. Is the way we support these non-wikitext pages indeed a dirty hack atm?
- Trevor
I strongly disagree with removing wikitext. Several thousand bots depend on it, especially ClueBot NG, which is a highly sophisticated anti-vandalism bot. I recommend going our current route, where the visual editor writes the wiki markup in real time and vice versa.
Sent from Maximilian's iPhone.
On Mon, Mar 26, 2012 at 12:50, Maximilian Doerr cybernet678@yahoo.com wrote:
I strongly disagree with removing wikitext. Several thousand bots depend on it, especially ClueBot NG, which is a highly sophisticated anti-vandalism bot. I recommend going our current route, where the visual editor writes the wiki markup in real time and vice versa.
I suspect maybe you didn't understand Daniel's message? (or didn't read it?) I imagine most pages will still use the same markup we have today after his idea is implemented.
For the pages that are different, the bots can and should adapt.
-Jeremy
Well then I must've misunderstood. Please enlighten me.
Sent from Maximilian's iPhone.
On 26.03.2012 20:40, Maximilian Doerr wrote:
Well then I must've misunderstood. Please enlighten me.
The idea is not to get rid of wikitext, but to support other content types too. At the moment, MediaWiki supports *only* wikitext. I'm proposing a way that would allow us to say "ok, this namespace contains tabular data we can make pretty graphs from" or "hey, this page is a VCARD, we can show a neat form for that", etc.
-- daniel
On Mon, Mar 26, 2012 at 9:50 AM, Maximilian Doerr cybernet678@yahoo.com wrote:
I strongly disagree with removing wikitext. Several thousand bots depend on it, especially ClueBot NG, which is a highly sophisticated anti-vandalism bot. I recommend going our current route, where the visual editor writes the wiki markup in real time and vice versa.
I think any problems can be solved with something similar (identical?) to HTTP content negotiation at the API layer.
For backwards-compatibility purposes, we could conceivably wrap the contents of non-Wikitext with parser tags. However, that doesn't *really* solve the problem for bots, but rather just exposes it as a problem that bots already have to deal with.
Rob
On Mon, 2012-03-26 at 12:36 -0700, Rob Lanphier wrote:
I think any problems can be solved with something similar (identical?) to HTTP content negotiation at the API layer.
For backwards-compatibility purposes, we could conceivably wrap the contents of non-Wikitext with parser tags. However, that doesn't *really* solve the problem for bots, but rather just exposes it as a problem that bots already have to deal with.
Rob
Are we talking about mime types for articles?
Amgine
On 26.03.2012 22:28, Amgine wrote:
Are we talking about mime types for articles?
Yes, pretty much. Though technically, the mime type would only describe the serialization format (e.g. "application/json"), not the data model (e.g. "wikidata entity record") - both bits of information are needed. But essentially, yes: pages will have types, and types have handlers for displaying, editing, etc.
On the other hand, this means that HTTP content negotiation is not going to work - or rather, it will only work to pick the preferred serialization format for a given piece of content. If the client doesn't know the content model, the data is going to be gibberish to it, no matter how it is serialized.
Which is why I propose that, by default, any code or client unaware of the possibility of non-wikitext content will receive the equivalent of an empty page when requesting the content of a page that isn't wikitext. No harm done that way.
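As a rough sketch (the field and function names below are invented for illustration):

    <?php
    // Both pieces of information would be tracked per page (hypothetical field names):
    $pageInfo = array(
        'content_model'  => 'wikidata-item',     // what the data *means*
        'content_format' => 'application/json',  // how it is serialized
    );

    // Code or clients that only understand wikitext get the equivalent of an empty page:
    function getTextForLegacyCaller( array $pageInfo, $blob ) {
        if ( $pageInfo['content_model'] !== 'wikitext' ) {
            return ''; // pretend the page is empty - no harm done
        }
        return $blob; // plain old wikitext, return it as before
    }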
-- daniel
On Mon, 26 Mar 2012 13:56:59 -0700, Daniel Kinzler daniel@brightbyte.de wrote:
On 26.03.2012 22:28, Amgine wrote:
Are we talking about mime types for articles?
Yes, pretty much. Though technically, the mime type would only describe the serialization format (e.g. "application/json"), not the data model (e.g. "wikidata entity record") - both bits of information are needed. But essentially, yes: pages will have types, and types have handlers for displaying, editing, etc.
On the other hand, this means that HTTP content negotiation is not going to work - or rather, it will only work to pick the preferred serialization format for a given piece of content. If the client doesn't know the content model, the data is going to be gibberish to it, no matter how it is serialized.
Which is why I propose that, by default, any code or client unaware of the possibility of non-wikitext content will receive the equivalent of an empty page when requesting the content of a page that isn't wikitext. No harm done that way.
-- daniel
Non-wikitext data is supposed to give extensions the ability to do things beyond WikiText. The data is always going to be in an opaque form controlled by the extension. I don't think that low-level serialized data should be visible to clients at all, even if they know it's there. Just like database schemas change, I expect extensions to also want to alter the format of data as they add new features.
Also, I've thought about something like this for quite a while. One of the things I'd really like us to do is start using real metadata even within normal WikiText pages. We should really replace in-page [[Category:]] with a real string of category metadata, which we can then use to provide good, intuitive category interfaces. ([[Category:]] would be left in for templates, compatibility, etc...).
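Purely as an illustration (not a concrete format proposal), something along these lines:

    <?php
    // Illustrative sketch: keep the wikitext body and real category metadata as
    // separate, structured parts instead of in-page [[Category:]] links.
    $page = array(
        'text' => array(
            'model' => 'wikitext',
            'data'  => 'Some ordinary article text...',
        ),
        'categories' => array(
            'model' => 'category-list',
            'data'  => array( 'English writers', 'Living people' ),
        ),
    );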
This case especially tells me that "raw" should not output the raw serialized data, but should instead be implemented by whatever provides the normal handling for that serialized data.
On 27.03.2012 02:19, Daniel Friesen wrote:
Non-wikitext data is supposed to give extensions the ability to do things beyond WikiText. The data is always going to be in an opaque form controlled by the extension. I don't think that low-level serialized data should be visible to clients at all, even if they know it's there.
The serialized form of the data needs to be visible at least in the XML dump format. How else could we transfer non-wikitext content between wikis?
Using the serialized form may also make sense for editing via the web API, though I'm not sure yet what the best way is here:
a) keep using the current general, text based interface with the serialized form of the content
or b) require a specialized editing API for each content type.
Going with a) has the advantage that it will simply work with current API client code. However, if the client modifies the content and writes it back without being aware of the format, it may corrupt the data. So perhaps we should return an error when a client tries to edit a non-wikitext page "the old way".
The b) option is a bit annoying because it means that we have to define a potentially quite complex mapping between the content model and the API's result model (nested PHP arrays). This is easy enough for Wikidata, which uses a JSON-based internal model. But for, say, SVG... well, I guess the specialized mapping could still be "escaped XML as a string".
Note that if we allow a), we can still allow b) at the same time - for Wikidata, we will definitely implement a special purpose editing interface that supports stuff like "add value for language x to property y", etc.
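To make the two options a bit more concrete, here is roughly what the requests might look like (all parameter and module names below are invented for illustration):

    <?php
    $editToken = 'dummy-token'; // placeholder; a real token would come from the API

    // Option a): the existing, text-based edit interface, with the serialized
    // form of the content passed as "text".
    $editParamsA = array(
        'action' => 'edit',
        'title'  => 'Q42',
        'text'   => '{"labels":{"en":"Douglas Adams"}}', // raw serialized content
        'token'  => $editToken,
    );

    // Option b): a specialized, structure-aware editing module
    // (the module and parameter names here are hypothetical).
    $editParamsB = array(
        'action'   => 'wbsetlabel',
        'id'       => 'Q42',
        'language' => 'en',
        'value'    => 'Douglas Adams',
        'token'    => $editToken,
    );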
Just like database schemas change, I expect extensions to also want to alter the format of data as they add new features.
Indeed. This is why in addition to a data model identifier, the serialization format is explicitly tracked in the database and will be present in dumps and via the web API.
Also, I've thought about something like this for quite a while. One of the things I'd really like us to do is start using real metadata even within normal WikiText pages. We should really replace in-page [[Category:]] with a real string of category metadata, which we can then use to provide good, intuitive category interfaces. ([[Category:]] would be left in for templates, compatibility, etc...).
That could be implemented using a "multipart" content type. But I don't want to get into this too deeply - multipart has a lot of cool uses, but it's beyond what we will do for Wikidata.
This case especially tells me that "raw" should not output the raw serialized data, but should instead be implemented by whatever provides the normal handling for that serialized data.
You mean action=raw? Yes, I agree. action=raw should not return the actual serialized format. It should probably return nothing, or an error, for non-text content. For multipart pages it would just return the "main part", without the "extensions".
But the entire "multipart" stuff needs more thought. It has a lot of great applications, but it's beyond the scope of Wikidata, and it has some additional implications (e.g. can the old editing interface be used to edit "just the text" while keeping the attachments?).
-- daniel
Hi
Couple of things
1. JSON - that's not a very reader-friendly format, and not an ideal format for the search engine to consume either, due to the lack of support for metadata and a data schema. XML is universally supported, more human-friendly, and supports a schema, which can be useful well beyond this initial use case.
2. Be bold, but also be smart and give respect where it is due. Bots, and everyone else who has written tools for and about MediaWiki based on a basic assumption about the page structure, would be broken. Many will not adapt so readily.
3. A project like Wikidata, in its infancy, should make every effort to be backwards compatible. It would be far wiser to place the Wikidata content into a page with wiki source, using a custom <xml/> tag or even a <cdata/> xhtml tag.
Oren Bochman
On 27.03.2012 09:33, Oren Bochman wrote:
1. JSON - that's not a very reader-friendly format, and not an ideal format for the search engine to consume either, due to the lack of support for metadata and a data schema. XML is universally supported, more human-friendly, and supports a schema, which can be useful well beyond this initial use case.
JSON is the internal serialization format. It will not be shown to the user or used to communicate with clients - unless, of course, they use JSON for interaction with the web API, as most do.
The full-text search engine will be fed a completely artificial view of the data. I agree that JSON wouldn't be good for that, though XML would be far worse still.
As to which format and data model to use to represent Wikidata records internally: that's a different discussion, independent of the idea of introducing ContentHandlers to MediaWiki. Please post to wikidata-l about that.
2. Be bold, but also be smart and give respect where it is due. Bots, and everyone else who has written tools for and about MediaWiki based on a basic assumption about the page structure, would be broken. Many will not adapt so readily.
I agree that backwards compatibility is very important, which is why I took care not to break any code or client using the "old" interface on pages that contain wikitext (i.e. the standard/legacy case). The current interface (both the web API and the methods in MediaWiki core) will function exactly as before for all pages that contain wikitext.
For pages that do not contain wikitext, such code cannot readily function. There are two options here (currently controlled by a global setting): pretend the page is empty (the default), or throw an error (probably better in the case of the web API, but too strict for other uses).
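Roughly like this (the setting name below is made up here; the actual name may differ):

    <?php
    // Hypothetical global controlling how "old-style" text access behaves for
    // pages whose content is not wikitext.
    $wgContentHandlerTextFallback = 'ignore'; // default: pretend the page is empty
    // $wgContentHandlerTextFallback = 'fail'; // stricter: throw an error (better for the web API)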
3. A project like Wikidata, in its infancy, should make every effort to be backwards compatible. It would be far wiser to place the Wikidata content into a page with wiki source, using a custom <xml/> tag or even a <cdata/> xhtml tag.
I strongly disagree with that; it introduces more problems than it solves. Denny and I decided against this option specifically in light of the experience he collected with embedding structured data in wikitext in Semantic MediaWiki and Shortipedia.
But again: that's a different discussion, please post your concerns to wikidata-l.
Regards, Daniel
JSON is the internal serialization format.
You're suggesting to use MediaWiki as a model :) What's stopping you from implementing it as a _file_ handler, not an _article_ handler? I mean, _articles_ contain text (now wikitext). All non-human-readable/editable/diffable data is stored as "files". Now they are all in the File namespace, but maybe it's much simpler to allow storing them in other namespaces and to write file handlers for displaying/editing them, than to break the idea of an "article"?
On 27.03.2012 13:07, vitalif@yourcmc.ru wrote:
JSON is the internal serialization format.
You're suggesting to use MediaWiki as a model :) What's stopping you from implementing it as a _file_ handler, not an _article_ handler?
Because of the actions I want to be able to perform on them - most importantly editing, but also diff views for the history, automatic merging to avoid edit conflicts, etc.
These types of interaction are supported by MediaWiki for "articles", but not for "files".
In contrast, files are rendered/thumbnailed (we don't need that), get included in articles with a box and caption (we don't want that), and can be accessed/downloaded directly as a file via HTTP (we definitely don't want that).
So, what we want to do with the structured data fits much better with MediaWiki's concept of a "page" than with the concept of a "file".
I mean, _articles_ contain text (now wikitext). All non-human-readable/editable/diffable data is stored as "files".
But that data WILL be readable/editable/diffable! That's the point! Just not as text, but as something else, using special viewers, editors, and differs. That's precisely the idea of the ContentHandler.
Now they are all in the File namespace, but maybe it's much simpler to allow storing them in other namespaces and to write file handlers for displaying/editing them, than to break the idea of an "article"?
How does what I propose break the idea of an article? It just means that articles do not *necessarily* contain text. And it makes sure that whatever it is that is contained in the article can still be viewed, edited, and compared in a meaningful way.
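For instance, here is a toy sketch (not the planned implementation) of what a structured diff for JSON-based content could do:

    <?php
    // Toy sketch: compare the decoded data instead of the serialized text, so
    // e.g. a mere reordering of keys in the serialization is not reported as a change.
    function structuredDiff( $oldBlob, $newBlob ) {
        $old = (array) json_decode( $oldBlob, true );
        $new = (array) json_decode( $newBlob, true );

        $changes = array();
        foreach ( $new as $key => $value ) {
            if ( !array_key_exists( $key, $old ) ) {
                $changes[] = "added $key";
            } elseif ( $old[$key] !== $value ) {
                $changes[] = "changed $key";
            }
        }
        foreach ( array_diff_key( $old, $new ) as $key => $value ) {
            $changes[] = "removed $key";
        }
        return $changes;
    }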
-- daniel
On 03/26/2012 10:56 PM, Daniel Kinzler wrote:
On 26.03.2012 22:28, Amgine wrote:
Are we talking about mime types for articles?
Yes, pretty much. Though technically, the mime type would only describe the serialization format (e.g. "application/json"), not the data model (e.g. "wikidata entity record") - both bits of information are needed. But essentially, yes: pages will have types, and types have handlers for displaying, editing, etc.
+1 for making serialization / data model information explicit. Parsoid is also structured around per-input mime types, although currently only a generic 'text/wiki' placeholder type is implemented. Each input type has a specific parser pipeline associated with it, which eventually produces tokens (at this stage mostly synonymous with HTML tags) independent of input type for the last, shared token transformation phase.
There is no distinction between processing for displaying vs. editing currently, as we try to preserve all relevant information for editing using structured data (leaning towards RDFa) and attribute annotations in a displayable DOM. We try to support schema-like information for template editing. There might be a way to accommodate schema-like information for Wikidata information using the same setup.
Mime types similar to those described for XML in RFC 3023 could be used for JSON-serialized data. Maybe something like application/wikidata+json? The syntax looks a bit backwards to me, but at least there is the XML precedent, and anything with a +json suffix can be handled as generic JSON when the specific data model is not known.
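A tiny sketch of that fallback (illustrative only; the function and table names are made up):

    <?php
    // Fall back to generic JSON handling when the specific data model behind an
    // "application/<something>+json" type is not known to the client.
    function pickHandler( $mimeType, array $knownHandlers ) {
        if ( isset( $knownHandlers[$mimeType] ) ) {
            return $knownHandlers[$mimeType];          // e.g. "application/wikidata+json"
        }
        if ( preg_match( '/\+json$/', $mimeType ) && isset( $knownHandlers['application/json'] ) ) {
            return $knownHandlers['application/json']; // treat it as generic JSON
        }
        return null; // no handler known for this type
    }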
Gabriel