On 27.03.2012 00:37, MZMcBride wrote:
> It's an ancient assumption that's built in to many parts of MediaWiki (and
> many outside tools and scripts). Is there any kind of assessment about the
> level of impact this would have?
Not formally, just my own poking at the code base. There are a lot of places in
the code that access revision text and do something with it; not all of them can
easily be found or changed (this is especially true for extensions).
My proposal covers a compatibility layer that will cause legacy code to just see
an empty page when trying to access the contents of a non-wikitext page. Only
code aware of content models will see any non-wikitext content. This should
avoid most problems, and should ensure that things will work as before at least
for everything that is wikitext.
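The compatibility layer could look roughly like the following sketch. This is
illustrative Python, not MediaWiki's actual (PHP) code; the class and method
names are invented for the example:

```python
WIKITEXT = "wikitext"

class Revision:
    """Toy stand-in for a revision object with a content model."""

    def __init__(self, content, model=WIKITEXT):
        self.content = content
        self.model = model

    def get_text(self):
        # Legacy accessor: non-wikitext content is hidden, so code
        # that is unaware of content models just sees an empty page.
        if self.model != WIKITEXT:
            return ""
        return self.content

    def get_content(self):
        # Model-aware accessor: new code sees the real content.
        return self.content

json_rev = Revision('{"id": 42}', model="json")
assert json_rev.get_text() == ""            # legacy view: empty page
assert json_rev.get_content() == '{"id": 42}'

wiki_rev = Revision("''Hello'' world")
assert wiki_rev.get_text() == "''Hello'' world"  # wikitext unchanged
```

The point is that old call sites keep working unmodified for all wikitext
pages, and degrade gracefully (empty page) for everything else.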
> For example, would the diff engine need to be rewritten so that people can
> monitor these pages for vandalism?
A diff engine needs to be implemented for each content model. The existing
engine(s) do not need to be rewritten; they will be used for all wikitext pages.
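The per-model dispatch might be sketched like this (Python with invented names;
the stdlib differ stands in for MediaWiki's real wikitext diff engine):

```python
import difflib

def wikitext_diff(old, new):
    # Stand-in for the existing line-based wikitext diff engine.
    return "\n".join(difflib.unified_diff(
        old.splitlines(), new.splitlines(), lineterm=""))

# Registry mapping content model name -> diff engine for that model.
DIFF_ENGINES = {"wikitext": wikitext_diff}

def diff(model, old, new):
    try:
        engine = DIFF_ENGINES[model]
    except KeyError:
        raise NotImplementedError(f"no diff engine registered for {model!r}")
    return engine(old, new)
```

A new content model (say, JSON) would register its own specialized engine in
the same registry instead of changing the wikitext one.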
> Will these pages be editable in the same way as current wikitext pages?
No. The entire point of this proposal is to be able to neatly supply specialized
display, editing and diffing of different kinds of content.
> If not, will there be special editors for the various data types?
Indeed.
> What other parts of the MediaWiki codebase will be affected and to what
> extent?
A few classes (like Revision or WikiPage) need some major additions or changes,
see the proposal on meta. Lots of places should eventually be changed to become
aware of content models, but don't need to be adapted immediately (see above).
> Will text still go in the text table or will separate tables and
> infrastructure be used?
Uh, did you read the proposal?...
All content is serialized just before storing it. It is stored into the text
table using the same code as before. The content model and serialization format
are recorded in the revision table.
Secondary data (index data, analogous to the link tables) may be extracted from
the content and stored in separate database tables, or in some other service, as
needed.
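A toy sketch of that storage flow, with dicts standing in for the text and
revision tables (the field names here are illustrative, not the actual schema):

```python
import json

text_table = {}      # rev_id -> serialized blob (the text table)
revision_table = {}  # rev_id -> metadata row (the revision table)

def save_revision(rev_id, content, model, fmt):
    # Serialize just before storing; wikitext is stored as-is.
    if fmt == "application/json":
        blob = json.dumps(content)
    else:
        blob = content
    text_table[rev_id] = blob
    # Record model and format alongside the revision metadata.
    revision_table[rev_id] = {"content_model": model,
                              "content_format": fmt}

save_revision(1, {"label": "Berlin"}, "wikidata-item", "application/json")
save_revision(2, "''Hello'' world", "wikitext", "text/x-wiki")
```

Loading a revision would then consult the recorded model and format to pick
the right deserializer, while the blob itself lives in the text table as
before.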
> I'm reminded a little of LiquidThreads for some reason. This idea sounds
> good, but I'm worried about the implementation details, particularly as the
> assumption you seek to upend is so old and ingrained.
It's more like the transition to using MediaHandlers instead of assuming
uploaded files to be images: existing concepts and actions are generalized to
apply to more types of content.
LiquidThreads introduces new concepts (threads, conversations) and interactions
(re-arranging, summarizing, etc) and tries to integrate them with the concepts
used for wiki pages. This seems far more complicated to me.
>> The background is that the Wikidata project needs a way to store structured
>> data (JSON) on wiki pages instead of wikitext. Having a pluggable system
>> would solve that problem along with several others, like doing away with
>> the special cases for JS/CSS, the ability to maintain categories etc
>> separate from body text, manage Gadgets sanely on a wiki page, or several
>> other things (see the link below).
> How would this affect categories being stored in wikitext (alongside the
> rest of the page content text)? That part doesn't make any sense to me.
Imagine a data model that works like mime/multipart email: you have a wrapper
that contains the "main" text as well as "attachments". The whole
shebang gets
serialized and stored in the text table, as usual. For displaying, editing and
visualizing, you have code that is aware of the multipart nature of the content,
and puts the parts together nicely.
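The multipart idea might be sketched like so (hypothetical Python, just to
illustrate the shape of the thing; nothing like this is planned for
implementation as part of the Wikidata work):

```python
import json

class MultipartContent:
    """Wrapper holding a main text plus named 'attachments',
    serialized as a single blob for the text table."""

    def __init__(self, main, **parts):
        self.main = main
        self.parts = parts  # e.g. categories, interlanguage links

    def serialize(self):
        return json.dumps({"main": self.main, "parts": self.parts})

    @classmethod
    def deserialize(cls, blob):
        data = json.loads(blob)
        return cls(data["main"], **data["parts"])

page = MultipartContent("Some article text.",
                        categories=["Cities"],
                        interwiki=["de:Berlin"])
blob = page.serialize()                    # one blob, stored as usual
restored = MultipartContent.deserialize(blob)
assert restored.parts["categories"] == ["Cities"]
```

Display and edit code that knows about the multipart model can then offer the
parts (body text, categories, language links) as separately editable pieces.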
However, the category stuff is a use case I'm just mentioning because it has
been requested so often in the past (namely, editing categories, interlanguage
links, etc separately from the wiki text); this mechanism is not essential to
the concept of ContentHandlers, and not something I plan to implement for the
Wikidata project. It's just something that will become much easier once we have
ContentHandlers.
-- daniel