Hello,
As you may have noticed, Roan, Krinkle and I have started to more tightly integrate image licensing within MediaWiki. Our aim is to create a system where it is easy to obtain the basic copyright information of an image in a machine-readable format, as well as to query images with a certain copyright state (all images copyrighted by User:XY, all images licensed CC-BY-SA, etc.).
At this moment we only intend to store author and license information, but nothing stops us from expanding this in the future.
We have put some information, in a not-so-structured way, at mw.org [1]. There are some issues open on the talk page [2]. Input is of course welcome, either here or, preferably, on the talk page.
Bryan
[1] http://www.mediawiki.org/wiki/Files_and_licenses_concept [2] http://www.mediawiki.org/wiki/Talk:Files_and_licenses_concept
I would have probably gone by the page_props route, passing the metadata from the wikitext to the tables via a parser function.
Conceptually, revision table shouldn't link to file_props. file_props should be linked with image instead.
I like the idea of an author manager, especially if it's done as a pseudo-namespace.
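To make the page_props route concrete, a minimal sketch (ParserFirstCallInit, setFunctionHook and ParserOutput::setProperty are real MediaWiki APIs; the "fileauthor" function and everything else here is hypothetical, and magic word registration is omitted):

  $wgHooks['ParserFirstCallInit'][] = 'efFilePropsInit';

  function efFilePropsInit( $parser ) {
      // Register the hypothetical {{#fileauthor:...}} parser function.
      $parser->setFunctionHook( 'fileauthor', 'efFileAuthor' );
      return true;
  }

  function efFileAuthor( $parser, $author = '' ) {
      // The value ends up in the page_props table (pp_propname =
      // 'fileauthor'), so "all files by author X" becomes an indexed
      // DB query instead of a wikitext scrape.
      $parser->getOutput()->setProperty( 'fileauthor', $author );
      return ''; // renders nothing in the page itself
  }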
2011/1/21 Platonides Platonides@gmail.com:
Conceptually, revision table shouldn't link to file_props. file_props should be linked with image instead.
Maybe, but the current image/oldimage schema resembling cur/old is horrible. For instance, there is no way to uniquely identify an oldimage row. We talked about this for an hour and decided that we have some ideas for restructuring that, but that it's a huge operation that shouldn't block the license integration project.
Roan Kattouw (Catrope)
Roan Kattouw wrote:
2011/1/21 Platonides Platonides@gmail.com:
Conceptually, revision table shouldn't link to file_props. file_props should be linked with image instead.
Maybe, but the current image/oldimage schema resembling cur/old is horrible. For instance, there is no way to uniquely identify an oldimage row.
I agree. It should also be fixed.
We talked about this for an hour and decided that we have some ideas for restructuring that, but that it's a huge operation that shouldn't block the license integration project.
If we wanted to map it to a page/revision format, it seems quite straightforward. I'm missing something, right?
2011/1/21 Platonides Platonides@gmail.com:
If we wanted to map it to a page/revision format, it seems quite straightforward. I'm missing something, right?
You're missing that migrating a live site (esp. Commons, with 8 million image rows and ~750k oldimage rows) from the old to the new schema would be a nightmare, and would probably involve setting stuff to read-only for a few hours.
Roan Kattouw (Catrope)
On Fri, Jan 21, 2011 at 10:43 AM, Roan Kattouw roan.kattouw@gmail.com wrote:
You're missing that migrating a live site (esp. Commons, with 8 million image rows and ~750k oldimage rows) from the old to the new schema would be a nightmare, and would probably involve setting stuff to read-only for a few hours.
If one's clever about it, this could probably actually be done on-the-fly in a reasonably non-evil fashion.
Image version data isn't used as widely as revisions; e.g. things like Special:Contributions always needed direct access to old revs looked up by author, whereas I think image old versions are pretty much only pulled up by title, via the image record. There are also relatively few revisions per file -- old images usually only have a few revisions, and cases of thousands of versions are I suspect very rare -- which would make the actual conversion work relatively lightweight for each file record.
Further optimizing by delaying on-demand migration of a record until write time could also keep it from being a sudden database & i/o sink. If indirect lookups won't be needed, we can just keep reading the existing image/oldimage records until they need to be updated on modification (or get updated by a background task at leisure).
-- brion
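A rough sketch of what migrate-on-write could look like; everything below other than the image/oldimage tables and the DatabaseBase calls is made up for illustration:

  function upgradeFileOnWrite( $db, $name ) {
      // Old rows keep living in image/oldimage until the file is next
      // modified; only then do we pay the conversion cost.
      $db->begin();
      $img = $db->selectRow( 'image', '*', array( 'img_name' => $name ), __METHOD__ );
      if ( $img && !$img->img_migrated ) { // img_migrated: hypothetical flag column
          $res = $db->select( 'oldimage', '*', array( 'oi_name' => $name ), __METHOD__ );
          foreach ( $res as $old ) {
              // convertOldImageRow()/convertImageRow() and the
              // filerevision table are hypothetical.
              $db->insert( 'filerevision', convertOldImageRow( $old ), __METHOD__ );
          }
          $db->insert( 'filerevision', convertImageRow( $img ), __METHOD__ );
          $db->update( 'image', array( 'img_migrated' => 1 ),
              array( 'img_name' => $name ), __METHOD__ );
      }
      $db->commit();
  }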
Roan Kattouw wrote:
You're missing that migrating a live site (esp. Commons, with 8 million image rows and ~750k oldimage rows) from the old to the new schema would be a nightmare, and would probably involve setting stuff to read-only for a few hours.
Do we agree on the target db schema? That's the important point.
Migrating a large site like Commons is 'just' an operations issue. Making it read-only for a bit wouldn't be a big issue, but we could also, for instance, move to an intermediate point where uploads are stored in both formats, read-only in the old one, while a script migrates records. Finally, flip the switch and drop the old tables.
2011/1/21 Platonides Platonides@gmail.com:
Do we agree on the target db schema? That's the important point.
We haven't thought about it in detail. But it would be a fairly large change and require changes throughout the software, as well as possibly elsewhere in the schema.
Migrating a large site like Commons is 'just' an operations issue. Making it read-only for a bit wouldn't be a big issue, but we could also, for instance, move to an intermediate point where uploads are stored in both formats, read-only in the old one, while a script migrates records. Finally, flip the switch and drop the old tables.
Sure, it can be dealt with. It's just that it'd be an epic upgrade :)
Roan Kattouw (Catrope)
We already have 1.17 branched, so... who dares to create a branch and begin with it? :)
On 01/20/2011 05:00 PM, Platonides wrote:
I would have probably gone by the page_props route, passing the metadata from the wikitext to the tables via a parser function.
I would also say it's probably best to pass metadata from the wikitext to the tables via a parser function, similar to categories and all other "user edited" metadata. This has the disadvantage that it's not 'as easy' to edit via a structured API entry point, but has the advantage of working well with all the existing tools, templates and versioning.
--michael
Yes. I have been thinking about the Author case, as it seemed an easy start, and storing them inside the wikitext blob (hidden from users) looks to be the best way. Moving, versioning, diffing... are already handled for you. You just need to transform it before saving/rendering, and update the license table when the last version changes (I'm not convinced page_props wouldn't be good enough). The uglier bit is that we don't have run-once tags, which means a greater deviation from normal rendering.
The interest of the Wikisource project in a formal and standardized set of book metadata (I presume from Dublin Core) in a database table is obvious. Some preliminary tests on it.source suggest that templates and the Labeled Section Transclusion extension could have a role as "existing wikitext containers for semantized variables"; the latter is perhaps more interesting than the former, since its content can be accessed directly from any page.
I'd like book metadata to be considered from the beginning of this interesting project.
Alex
This quickly dovetails into the Semantic MediaWiki discussion, for which there are other threads on this list to reference. There is a wiki data summit / meeting coming up where these issues will likely be discussed. Maybe we could start eliciting the requirements and needs of projects like what you describe for Wikisource, and others that have been listed elsewhere, on a pre-meeting project page; this way we can be sure to hit on all these items during the meeting.
--michael
On Fri, Jan 21, 2011 at 3:36 AM, Michael Dale mdale@wikimedia.org wrote:
On 01/20/2011 05:00 PM, Platonides wrote:
I would have probably gone by the page_props route, passing the metadata from the wikitext to the tables via a parser function.
I would also say it's probably best to pass metadata from the wikitext to the tables via a parser function, similar to categories and all other "user edited" metadata. This has the disadvantage that it's not 'as easy' to edit via a structured API entry point, but has the advantage of working well with all the existing tools, templates and versioning.
This is actually the biggest decision that has been made; the rest is mostly implementation details. (Please note that I'm not presenting you with a fait accompli; it is of course still possible to change this.)
Handling metadata separately from wikitext provides two main advantages: it is much more user friendly, and it allows us to properly validate and parse data.
Having a clear separate input text field "Author: ____" is much more user-friendly than {{#fileauthor:}}, which is, so to say, a type of obscure MediaWiki jargon. I know that we could probably hide it behind a template, but that is still not as friendly as a separate field. I keep hearing that, especially for newbies, a big blob of wikitext is plain scary. We regulars may be able to quickly parse the structure in {{Information}}, but for newbies this is certainly not so clear. We actually see that there is demand from the community for separating the metadata from the wikitext -- this is after all why they implemented the uselang= hacked upload form with a separate text box for every meta field.
Also, a separate field allows MediaWiki to understand what a certain input really means. {{#fileauthor:[[User:Bryan]]}} means nothing to MediaWiki or re-users, but "Author: Bryan___ [checkbox] This is a Commons username" can be parsed by MediaWiki to mean something. It also allows us to mass-change, for example, the author: if I want to change my attribution from "Bryan" to "Bryan Tong Minh", I would need to edit the wikitext of every single upload, whereas in the new system I go to Special:AuthorManager and change the attribution.
Similar to categories, and all other "user edited" metadata.
Categories is a good example of why metadata does not belong in the wikitext. If you have ever tried renaming a category... you need to edit every page in the category and rename it in the wikitext. Commons is running multiple bots to handle category rename requests.
All these advantages outweigh the pain of migration (which could presumably be handled by bots), in my opinion.
Best regards, Bryan
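To make the mass-change argument concrete, a sketch with made-up table and column names (no schema has been settled): if authors live in their own table and files only reference the author id, re-attribution becomes a single row update:

  -- authors, a_id and a_attribution are hypothetical names
  UPDATE authors
  SET a_attribution = 'Bryan Tong Minh'
  WHERE a_id = 4; -- every file pointing at author 4 follows automatically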
An internally handled parser function doesn't conflict with showing it as a textbox.
We could for instance store it as a hidden page prefix.
Data stored in the text blob:

"Author: [[Author:Bryan]]
License: GPL
---
{{Information| This is a nice picture I took }}
{{Deletion request|Copyvio from http://www.example.org}}"
Data shown when clicking edit:
Author: <input type="text" value="Bryan" /> License: <select>GPL</select>
<textarea name="textbox1"> {{Information| This is a nice picture I took }} {{Deletion request|Copyvio from http://www.example.org}} </textarea>
Why do I like such an approach?
* You don't need to create a new way for storing the history of such metadata.
* Old versions are equally viewable.
* Things like edit conflicts are already handled.
* Diffing could be done directly with the blobs.
* Import/export automatically works.
* Extendable for more metadata.
* Readable for tools/wikis unaware of the new format.
On the other hand:
* It breaks the concept of "everything is in the source".
* Parsing is different based on the namespace. A naive rendering as "License: GPL", instead of showing an image and a GPL excerpt, would be acceptable, but if incomplete markup is stored there, the renderings would be completely different. This could be avoided by placing the metadata inside a tag, but what happens if the tag is inserted elsewhere in the page? MediaWiki doesn't have run-once tags.
PS: The field author would be just a pointer to the author page, so you wouldn't need to edit everything in any case.
So PHP would extract {{#author:4}} and {{#license:12}} from the text blob when showing the edit page, show the remaining wikitext in the <textarea> and the author/license as separate form elements, and upon saving, generate "{{#author:4}} {{#license:12}}\n" again and prepend it to the text blob.
Double instances of these would be ignored (i.e. stripped automatically, since they're not re-inserted into the text blob upon saving). One small downside would be that if someone edited the textarea manually to do stuff with author and license, the next edit would re-arrange them, since they're extracted and re-inserted, thus showing messy diffs. (Not a major point as long as it's done independently of JavaScript, which it can be if done from core / PHP.)
If that's what you meant, I think it is an interesting concept that should not be ignored; however, personally I am not yet convinced this is the way to go. But when looking at the complete picture of upsides and downsides, this could be something to consider.
-- Krinkle
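A minimal sketch of that round trip in plain PHP; the function names and the exact storage format are assumptions, only the preg_* calls are real:

  function extractFileProps( $text ) {
      $props = array();
      if ( preg_match_all( '/\{\{#(author|license):(\d+)\}\}/', $text, $m, PREG_SET_ORDER ) ) {
          foreach ( $m as $match ) {
              $props[$match[1]] = (int)$match[2];
          }
          // Strip all instances; duplicates die here because only the
          // extracted values get re-inserted on save.
          $text = preg_replace( '/\{\{#(author|license):\d+\}\}\s*/', '', $text );
      }
      return array( $props, $text ); // form field values + remaining wikitext
  }

  function buildFilePropsPrefix( $authorId, $licenseId ) {
      // Regenerated from the form fields on save and prepended,
      // which is what causes the re-arranging mentioned above.
      return '{{#author:' . (int)$authorId . '}} {{#license:' . (int)$licenseId . "}}\n";
  }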
That's an alternative approach. I was thinking of accepting them only at the beginning of the page, but extracting them from everywhere is also an alternative.
On Sat, Jan 22, 2011 at 10:09 PM, Platonides Platonides@gmail.com wrote:
That's an alternative approach. I was thinking of accepting them only at the beginning of the page, but extracting them from everywhere is also an alternative.
OK, my 2 cents:
I would be in favour of extracting data from the {{Information}} template via the parser, but we talked about this over a year ago at the Paris meeting, and it was deemed too complicated (black caching magick etc.), and no one has stepped forward to do anything along those lines, so I guess it's dead and buried.
Things like {{#author:4}} seem to be a nice hack to Get Things Done (TM). As was mentioned before, the temptation is great to expand it into a generic triplet storage a la Semantic MediaWiki, but that would probably complicate things to an extent where nothing gets done, again.
But one thing comes to mind: If someone implements an abstraction layer ("4" to a specific author) anyway, it should be dead simple to use it for tags as well. Just allow multiple {{#tag}}s per page (as opposed to {{#author}}), done. The same code that will allow for editing author and license information centrally should make it possible to edit tag information, i18n for example, so the tag display could be in the current user language (with "en" fallback). Search for tags i18n-style could be possible as well, if the translation information is encoded machine-readable as well, e.g. as language links ([[de:Pferd]] on the [[Tag:Horse]] page).
It might be too much to try to activate all of that in the first round, but IMHO the code should keep the use as tags in mind; it would be dreadful to waste such an opportunity.
Cheers, Magnus
* Magnus Manske magnusmanske@googlemail.com [Sun, 23 Jan 2011 00:38:53 +0000]:
But one thing comes to mind: If someone implements an abstraction layer ("4" to a specific author) anyway, it should be dead simple to use it for tags as well. Just allow multiple {{#tag}}s per page (as opposed to {{#author}}), done. The same code that will allow for editing author and license information centrally should make it possible to edit tag information, i18n for example, so the tag display could be in the current user language (with "en" fallback). Search for tags i18n-style could be possible as well, if the translation information is encoded machine-readable as well, e.g. as language links ([[de:Pferd]] on the [[Tag:Horse]] page).
You are correct - triple definitions are always meant to be as generic as possible, something like categorizing or tagging. It is better to define them separately from templates and to include the references to their values in the template. That way it would not complicate the parsing too much. Although one might want fancy visual forms to edit these, and that probably brings caching issues?
Dmitriy
"Platonides" Platonides@gmail.com wrote in message news:ihfd31$buv$1@dough.gmane.org...
We could for instance store it as a hidden page prefix.
Eeewwwwww....
What's different between this and a {{#author: }} parser function, apart from the inability to access it from the wikitext? As noted, it's perfectly possible for the data to be in a separate field on the upload form, either by default or by per-wiki hackery. This is likely to result in as many "why can't I edit the bits of wikitext which diff, history, transclusion (let's not forget the enormous can of worms mucking around with the wikitext will open up there), etc. assure me is there??" questions as it solves "what does this brace structure do?" ones.
--HM
Good point about transclusion. That question wouldn't be asked since they would be editable above, just in a different input box than the main content.
On 01/22/2011 08:15 PM, Bryan Tong Minh wrote:
Having a clear separate input text field "Author: ____" is much more user-friendly than {{#fileauthor:}}, which is, so to say, a type of obscure MediaWiki jargon.
I disagree. In real life, there are always more complicated cases, where an author is not an author, but two authors, or a sculptor, or one painter and one photographer. These things never fit in a single "author" field, and the same goes for any other separated fields. But the free-form Wikipedia can handle all real-world cases in plain human language.
Various "expert systems" based on "artificial intelligence" existed since the 1980s, but none of them produced a universal encyclopedia. Only the text-based Wikipedia did. After this humiliating fact, the same AI people (now dressed as "semantic web" scholars) come and claim that they too could have built Wikipedia, if it only were more structured. They are wrong, of course. Lack of structure is precisely what built Wikipedia.
Please see ccREL: http://labs.creativecommons.org/2011/ccrel-guide/

Chinese Wikipedia: http://zh.wikipedia.org/ My blog: http://shizhao.org twitter: https://twitter.com/shizhao
[[zh:User:Shizhao]]
* Lars Aronsson lars@aronsson.se [Mon, 24 Jan 2011 07:06:02 +0100]:
I disagree. In real life, there are always more complicated cases, where an author is not an author, but two authors, or a sculptor, or one painter and one photographer. These things never fit in a single "author" field, and the same goes for any other separated fields. But the free-form Wikipedia can handle all real-world cases in plain human language.
Various "expert systems" based on "artificial intelligence" existed since the 1980s, but none of them produced a universal encyclopedia. Only the text-based Wikipedia did. After this humiliating fact, the same AI people (now dressed as "semantic web" scholars) come and claim that they too could have built Wikipedia, if it only were more structured. They are wrong, of course. Lack of structure is precisely what built Wikipedia.
One may have not just a single triple for that, but a list / set of triples for the same person in different roles (different kinds of author). There are two extremes: not having any structure, and being overly structured. If there are a few extra fields for an image description, why not generalize it for all kinds of measured data - geographical, historical, population statistics, financial and economic data and so on? Why are only the images allowed to have structured and measurable data? However, I don't think that Wikipedia should have AI, because it requires huge computing power, and the problem is that AI algorithms are not efficient enough. Having the data structured is not a bad thing. It probably should not even try to do SPARQL, but offer these things to external sites. Don't make complex queries; leave them for offline tools / bots or the toolserver. Semantic bots are a good idea - they might mine the data, finding the cross-sets. It should be even lighter than SMW. However, I might be wrong. Dmitriy
On 01/22/2011 01:15 PM, Bryan Tong Minh wrote:
Handling metadata separately from wikitext provides two main advantages: it is much more user friendly, and it allows us to properly validate and parse data.
This assumes wikitext is simply a formatting language; really it's a data storage, structure and presentation language. You can already see this in the evolution of templates into both data and presentation containers. It seems like a bad idea to move away from leveraging flexible data properties used in presentation.
On Commons we have Template:Information, which links out into numerous data triples for asset presentation (i.e. Template:Artwork, Template:Creator, Template:Book, with sub-data relationships like Artwork.Location referencing the Institution template). If tied to an SMW backend you could say "give me artwork in room "Pavillion de Beauvais" at the "louvre" that is missing a "created on" date".
We should focus on APIs for template editing. Extension:Page_Object_Model seemed like a step in the right direction, but didn't get all the way there. Something that let you edit structured data across nested template objects, where we could stack validation on top of that, would let us leverage everything that has been done and keep things wide open for what's done in the future.
Most importantly we need clean high-level APIs that we can build GUIs on, so that the "flexibility" of the system does not hurt usability and functionality.
Having a clear separate input text field "Author: ____" is much more user-friendly than {{#fileauthor:}}, which is, so to say, a type of obscure MediaWiki jargon. I know that we could probably hide it behind a template, but that is still not as friendly as a separate field. I keep hearing that, especially for newbies, a big blob of wikitext is plain scary. We regulars may be able to quickly parse the structure in {{Information}}, but for newbies this is certainly not so clear. We actually see that there is demand from the community for separating the metadata from the wikitext -- this is after all why they implemented the uselang= hacked upload form with a separate text box for every meta field.
I don't know... see all the templates mentioned above... To be sure, I think we need better interfaces for interacting with templates.
Also, a separate field allows MediaWiki to understand what a certain input really means. {{#fileauthor:[[User:Bryan]]}} means nothing to MediaWiki or re-users, but "Author: Bryan___ [checkbox] This is a Commons username" can be parsed by MediaWiki to mean something. It also allows us to mass-change, for example, the author: if I want to change my attribution from "Bryan" to "Bryan Tong Minh", I would need to edit the wikitext of every single upload, whereas in the new system I go to Special:AuthorManager and change the attribution.
A Semantic MediaWiki-like system retains this "meaning" for MediaWiki to interact with at any stage of data [re]presentation, and of course supports flexible "meaning" types.
Similar to categories, and all other "user edited" metadata.
Categories is a good example of why metadata does not belong in the wikitext. If you have ever tried renaming a category... you need to edit every page in the category and rename it in the wikitext. Commons is running multiple bots to handle category rename requests.
All these advantages outweigh the pain of migration (which could presumably be handled by bots), in my opinion.
Unless your category was template-driven, in which case you just update the template ;) If your category was instead magically associated with the page outside of template-built wiki page text, how do you procedurally build data associations?
--michael
Before I respond to the recent new ideas, concepts and suggestions, I'd like to explain a few things about the backend (at least the way it's currently planned to be).
The mw_authors table contains unique authors, identified by either a name or a user id. Optionally a custom attribution can be given (falling back to the author name, or the user's real_name or user_name), and optionally a URL (falling back to nothing, or the user page).
The mw_license table contains the different licenses a wiki allows to be used: their canonical name (e.g. "GFDL", "CC-BY-SA-3.0" etc.), a URL to the legal code, and a usage count[1].
mw_file_props is a table that keeps previous versions of file props as well, and is linked to mw_revision by fp_id in rev_fileprops_id (like mw_text is linked in rev_text_id).
Both authors and licenses are uniquely identified by their id. This makes it easy to change stuff later on in an AuthorManager (e.g. a different URL, a username change etc.). The texts and complete titles of the licenses are stored in interface messages (for internationalization); MediaWiki:License-<uniq>-text could for example contain {{Cc-by-sa-3.0|attribution=$2}} on Wikimedia Commons.
-
If we store the links in the wikitext (like {{#fileauthor:}} and {{#filelicense:}}), the advantages are basically two things: 1) it has all the features of editing and revisioning (better history, edit conflicts, diff view, etc.); 2) no need for a revisioned mw_file_props, as we can store the current values in mw_page_props.
A possible downside is that a diff like

- {{#fileauthor:2}} {{#filelicense:12}}
+ {{#fileauthor:10}} {{#fileauthor:12}} {{#filelicense:

doesn't mean very much. Imho the solution is not to store the actual names in wikitext so that the diffs are better, but to either not store it in wikitext at all, or customize the behaviour everywhere:
* edit form: extract the parser function calls from the wikitext before anything else, and put them in separate form elements
* diff view: get the names of those authors and licenses and somehow include them in the diff view; this could be done a bit like AbuseFilter's diff between filter versions (i.e. before "Line 1" there would be "Author" and "License")
* saving form: convert back to {{#parserfunction:}} calls and prepend them to the wikitext
* action=raw: ?
* action=render: ?
* api-parse: ?
Right now I think storing it in wikitext and customizing it everywhere like shown above is not worth the trouble and would likely bring its own troubles. Keeping it separate from the wikitext is more work once, but I think it pays off. But again, nothing is final yet. Everything is possible.
-- Krinkle
[1]: The usage count (mw_license.lic_count) is a bit like edit count (increased/decreased when saving files)
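For illustration only, the tables described above might look roughly like this; everything beyond the names already mentioned (mw_authors, mw_license, mw_file_props, lic_count, fp_id) is a guess, not a settled schema:

  CREATE TABLE /*_*/authors (
    a_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    a_name VARCHAR(255) BINARY,        -- free-form name, NULL for local users
    a_user INT UNSIGNED,               -- user id, NULL for external authors
    a_attribution VARCHAR(255) BINARY, -- optional custom attribution
    a_url VARBINARY(255)               -- optional URL, falls back to user page
  ) /*$wgDBTableOptions*/;

  CREATE TABLE /*_*/license (
    lic_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    lic_name VARBINARY(64) NOT NULL,   -- canonical name, e.g. 'CC-BY-SA-3.0'
    lic_url VARBINARY(255),            -- link to the legal code
    lic_count INT UNSIGNED NOT NULL DEFAULT 0 -- usage count, see [1]
  ) /*$wgDBTableOptions*/;

  CREATE TABLE /*_*/file_props (
    fp_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, -- rev_fileprops_id points here
    fp_author INT UNSIGNED,            -- authors.a_id
    fp_license INT UNSIGNED            -- license.lic_id
  ) /*$wgDBTableOptions*/;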
Krinkle wrote:
The solution is not to store the actual names in wikitext so that the diffs are better, but to either not store it in wikitext at all, or customize the behaviour everywhere.
Why? Storing the property "filelicense: GPL" directly in the wikitext is not bad. It's also a relief when we want to delete licenses later. Same with author: take that as a key into an NS_AUTHOR namespace. Going to Special:LicenseManager/5 in order to change GPL license data is just added complexity over using the short name "GPL".
* Michael Dale mdale@wikimedia.org [Mon, 24 Jan 2011 13:18:00 -0600]:
We should focus on APIs for template editing. Extension:Page_Object_Model seemed like a step in the right direction, but didn't get all the way there. Something that let you edit structured data across nested template objects, where we could stack validation on top of that, would let us leverage everything that has been done and keep things wide open for what's done in the future.

Most importantly we need clean high-level APIs that we can build GUIs on, so that the "flexibility" of the system does not hurt usability and functionality.
Michael is correct - an API module to extract data from already existing nested templates, and to replace the data when needed, is probably the only thing that is required to make Wikipedia more structured and semantic. Then the whole collecting and analyzing of triples can be off-loaded to external bots and tools. Great idea, imho. Dmitriy
Hi,
There have been a lot of mails since I last had the time to reply, so I'll reply to some points in a single mail.
On Sat, Jan 22, 2011 at 9:04 PM, Platonides Platonides@gmail.com wrote:
An internally handled parser function doesn't conflict with showing it as a textbox.
We could for instance store it as a hidden page prefix.
No. I strongly feel that using the wikitext to store hidden metadata is a bad idea. See HM's reply earlier in the thread.
PS: The field author would be just a pointer to the author page, so you wouldn't need to edit everything in any case.
A good point: {{#fileauthor:}} could indeed just point to a page in the Author: namespace.
Now that I think of it, if we go this way, there is no reason to restrict this licensing information to Files.
On Sun, Jan 23, 2011 at 1:38 AM, Magnus Manske magnusmanske@googlemail.com wrote:
Things like {{#author:4}} seem to be a nice hack to Get Things Done (TM). As was mentioned before, the temptation is great to expand it into a generic triplet storage a la Semantic MediaWiki, but that would probably complicate things to an extent where nothing gets done, again.
SMW may perhaps be the ultimate solution, but I do not believe that activation of SMW is going to happen in the near- or mid-term future, and indeed waiting for SMW will probably mean that nothing is going to happen.
I think the consensus is that we want to store the copyright metadata in the wikitext and not separately.
The biggest problem is how to define "second-level" properties. For example, a file has a license, say GFDL-1.2, and the license in turn has a legal URL such as http://fsf.org/gfdl-1.2 or something. This could be solved by {{#filelicense:GFDL-1.2}} pointing to a license defined in Special:LicenseManager, with all its properties there. Another solution would be to define a new namespace such as License: and have the properties defined in there somehow. The same problem applies to authors as well, of course.
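For instance, the namespace variant might look something like this; the syntax and page layout are purely hypothetical:

  On [[File:Example.jpg]]:
    {{#filelicense:GFDL-1.2}}

  On [[License:GFDL-1.2]], something like:
    url: http://fsf.org/gfdl-1.2
    text: {{GFDL-1.2|attribution=$2}}

The license page would play the same role as Special:LicenseManager, but stay editable and versioned like any other page, much like the MediaWiki:License-<uniq>-text message Krinkle described.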
Regards, Bryan