We are currently attempting to refactor some specific modifications to the standard MW code we use (1.13.2) into an extension so we can upgrade to a more recent maintained version. One modification we have keeps a flag in the revisions table specifying that article text was imported from WP. This flag generates an attribution statement at the bottom of the article that acknowledges the import.
I don't want to start a discussion about the various legal issues surrounding text licensing. However, assuming we must acknowledge use of licensed text, a legitimate technical issue is how to associate state with an article in a way that records the import of licensed text. I bring this up here because I assume we are not the only site that faces this issue.
Some of our users want to encode the attribution information in a template. The problem with this approach is anyone can come along and remove it. That would mean the organization legally responsible for the site would entrust the integrity of site content to any arbitrary author. We may go this route, but for the sake of this discussion I assume such a strategy is not viable. So, the remainder of this post assumes we need to keep such licensing state in the db.
After asking around, one suggestion was to keep the licensing state in the page_props table. This seems very reasonable and I would be interested in comments by this community on the idea. Of course, there has to be a way to get this state set, but it seems likely that could be achieved using an extension triggered when an article is edited.
Since this post is already getting long, let me close by asking whether support for associating licensing information with articles might be useful to a large number of sites. If so, then perhaps it belongs in the core.
Dan Nessett wrote: (...)
After asking around, one suggestion was to keep the licensing state in the page_props table. This seems very reasonable and I would be interested in comments by this community on the idea. Of course, there has to be a way to get this state set, but it seems likely that could be achieved using an extension triggered when an article is edited.
Seems a good approach.
Since this post is already getting long, let me close by asking whether support for associating licensing information with articles might be useful to a large number of sites. If so, then perhaps it belongs in the core.
Many sites could benefit, but I'd place it into an extension for now. Preferably on our svn. Note that not everything that many people use belongs to core (eg. ParserFunctions).
Support for licenses in the database would be a huge boon to Wikimedia Commons, for all the reasons you state. Commons' licensing is not uniform and making it easy to search and sort would be better for everyone.
Currently we display licenses in templates, which has many drawbacks.
I'd like it to be more concrete than just a page_prop -- for instance, you also want to associate properties with the licenses themselves, such as "requires attribution". So that would mean another table.
On 9/10/10 4:11 PM, Dan Nessett wrote:
We are currently attempting to refactor some specific modifications to the standard MW code we use (1.13.2) into an extension so we can upgrade to a more recent maintained version. One modification we have keeps a flag in the revisions table specifying that article text was imported from WP. This flag generates an attribution statement at the bottom of the article that acknowledges the import.
I don't want to start a discussion about the various legal issues surrounding text licensing. However, assuming we must acknowledge use of licensed text, a legitimate technical issue is how to associate state with an article in a way that records the import of licensed text. I bring this up here because I assume we are not the only site that faces this issue.
Some of our users want to encode the attribution information in a template. The problem with this approach is anyone can come along and remove it. That would mean the organization legally responsible for the site would entrust the integrity of site content to any arbitrary author. We may go this route, but for the sake of this discussion I assume such a strategy is not viable. So, the remainder of this post assumes we need to keep such licensing state in the db.
After asking around, one suggestion was to keep the licensing state in the page_props table. This seems very reasonable and I would be interested in comments by this community on the idea. Of course, there has to be a way to get this state set, but it seems likely that could be achieved using an extension triggered when an article is edited.
Since this post is already getting long, let me close by asking whether support for associating licensing information with articles might be useful to a large number of sites. If so, then perhaps it belongs in the core.
Let me just say that getting licensing information into a database would definitely have advantages for Commons, Wikisource, and a variety of third parties. It would also enable the development of many potentially useful extensions to MediaWiki, such as creating automatically updated attribution statements, as you mentioned.
However, in practice, there are some considerations that will limit its ability to improve upon the template system in many use cases.
Content importing is usually done by ordinary users, and those users doing the importing would generally need to be able to set those flags. Further, unless one wants to bathe in license tag errors and vandalism, the same class of users also need the ability to unset or change those flags when problems occur.
One could restrict the set of people allowed to modify license tags to just admins, or just some other intermediate user class (i.e. something between admins and newbies), but that limitation may or may not work well in practice. For example, it would be impractical to impose many restrictions on a site like Commons where a significant amount of content comes from infrequent contributors with little or no established presence in the community.
In addition, if people are editing and changing attribution flags then there is a natural need to have version histories for license flags. In the existing system, this is accomplished by the revision histories of the pages showing changes in templates. This isn't ideal (for example the history of license changes isn't easily searchable), but it does fill a critical need. One could create a new log of attribution changes, but its effectiveness would be limited unless one can see and revert to the attribution as it existed in the past (which is not a feature generally enabled by logs).
This is not to say that having attribution info in the database isn't useful. Personally, I think it would be very useful. However, any scheme to augment or replace other means of conveying attribution information will need to carefully consider the variety of ways that this information is used and managed.
-Robert Rohde
On 11 September 2010 08:40, Robert Rohde rarohde@gmail.com wrote:
In addition, if people are editing and changing attribution flags then there is a natural need to have version histories for license flags. In the existing system, this is accomplished by the revision histories of the pages showing changes in templates. This isn't ideal (for example the history of license changes isn't easily searchable), but it does fill a critical need. One could create a new log of attribution changes, but its effectiveness would be limited unless one can see and revert to the attribution as it existed in the past (which is not a feature generally enabled by logs).
Perhaps something like how the interface for file revisions works - that is, a licensing-history tab (or somesuch ?) with state links, change data, diff links, and "revert to" buttons? But yes, this would be quite distinct from normal logs. :-)
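As a rough illustration of the licensing-history idea, here is a minimal sketch of an append-only history in which a revert is just another history entry. It assumes SQLite through Python's sqlite3 module; the table name license_history, its columns, and the helper functions are hypothetical and not part of MediaWiki.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """CREATE TABLE license_history (
           lh_id      INTEGER PRIMARY KEY AUTOINCREMENT,
           page_id    INTEGER NOT NULL,
           license    TEXT NOT NULL,
           changed_by TEXT NOT NULL
       )"""
)


def set_license(page_id, license_name, user):
    # Append-only: every change is a new row, nothing is overwritten.
    db.execute(
        "INSERT INTO license_history (page_id, license, changed_by) "
        "VALUES (?, ?, ?)",
        (page_id, license_name, user),
    )


def current_license(page_id):
    # The newest row for the page is the license currently in force.
    row = db.execute(
        "SELECT license FROM license_history WHERE page_id = ? "
        "ORDER BY lh_id DESC LIMIT 1",
        (page_id,),
    ).fetchone()
    return row[0] if row else None


def revert_to(page_id, lh_id, user):
    # A revert re-applies an old state as a new history entry.
    row = db.execute(
        "SELECT license FROM license_history WHERE lh_id = ? AND page_id = ?",
        (lh_id, page_id),
    ).fetchone()
    if row is None:
        raise ValueError("no such history entry for this page")
    set_license(page_id, row[0], user)


set_license(7, "CC-BY-SA-3.0", "Alice")
set_license(7, "PD", "Vandal")
revert_to(7, 1, "Bob")
```

Because nothing is deleted, the bad change and the revert both stay visible, which is what distinguishes this from an ordinary log.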
The opportunities such a system would potentially give would be hugely cool e.g. in-line automatic RDFa hooks on the licensing of images on-page (and in-text, though I think community members for WMF wikis might complain), or "only let me see the actually free bits of this wiki", excluding the non-free components.
James
On Sat, Sep 11, 2010 at 1:11 AM, Dan Nessett dnessett@yahoo.com wrote:
After asking around, one suggestion was to keep the licensing state in the page_props table. This seems very reasonable and I would be interested in comments by this community on the idea. Of course, there has to be a way to get this state set, but it seems likely that could be achieved using an extension triggered when an article is edited.
Note that the page_props table is parser-owned. This means that entries for a specific page are cleared and reinserted when a page is reparsed. You should take that into account when using the page_props table.
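To make the consequence concrete, here is a toy model of a parser-owned property store. It is not MediaWiki code; the class ToyPageProps and its methods are hypothetical and only mimic the "clear and reinsert on reparse" behaviour described above.

```python
class ToyPageProps:
    """Toy stand-in for a parser-owned property table (not MediaWiki's API)."""

    def __init__(self):
        self.rows = {}  # (page_id, prop_name) -> value

    def set_prop(self, page_id, name, value):
        # What an extension might do outside of a parse.
        self.rows[(page_id, name)] = value

    def reparse(self, page_id, parser_props):
        # On reparse, every row for the page is dropped, then only the
        # properties produced by this parse are written back.
        self.rows = {k: v for k, v in self.rows.items() if k[0] != page_id}
        for name, value in parser_props.items():
            self.rows[(page_id, name)] = value


props = ToyPageProps()
props.set_prop(1, "license", "CC-BY-SA-3.0")
props.reparse(1, {"hiddencat": "1"})
license_survived = (1, "license") in props.rows
```

In this model the license flag is gone after the reparse unless the parse itself emits it, so an extension using such a table would have to supply the value on every parse.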
I'm not sure if page_props is the correct way to go. Copyright is associated with a specific text or image revision. Therefore, it seems more obvious to put the licensing in the revision and image tables and their respective archive tables.
Bryan
There's a long outstanding bug [1] to ensure that accurate attribution is maintained when templates are substituted. I don't think this is the same as maintaining attribution of external imports, but it may be that whatever solution is implemented for one can be generalised to allow the other.
Conrad
[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=6785
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Fri, 10 Sep 2010 23:11:27 +0000, Dan Nessett wrote:
We are currently attempting to refactor some specific modifications to the standard MW code we use (1.13.2) into an extension so we can upgrade to a more recent maintained version. One modification we have keeps a flag in the revisions table specifying that article text was imported from WP. This flag generates an attribution statement at the bottom of the article that acknowledges the import.
I don't want to start a discussion about the various legal issues surrounding text licensing. However, assuming we must acknowledge use of licensed text, a legitimate technical issue is how to associate state with an article in a way that records the import of licensed text. I bring this up here because I assume we are not the only site that faces this issue.
Some of our users want to encode the attribution information in a template. The problem with this approach is anyone can come along and remove it. That would mean the organization legally responsible for the site would entrust the integrity of site content to any arbitrary author. We may go this route, but for the sake of this discussion I assume such a strategy is not viable. So, the remainder of this post assumes we need to keep such licensing state in the db.
After asking around, one suggestion was to keep the licensing state in the page_props table. This seems very reasonable and I would be interested in comments by this community on the idea. Of course, there has to be a way to get this state set, but it seems likely that could be achieved using an extension triggered when an article is edited.
Since this post is already getting long, let me close by asking whether support for associating licensing information with articles might be useful to a large number of sites. If so, then perhaps it belongs in the core.
The discussion about whether to support license data in the database has settled down. There seems to be some support. So, I think the next step is to determine the best technical approach. Below I provide a strawman proposal. Note that this is only to foster discussion on technical requirements and approaches. I have nothing invested in the strawman.
Implementation location: In an extension
Permissions: include two new permissions - 1) addlicensedata, and 2) modifylicensedata. These are pretty self-explanatory. Sites that wish to give all users the ability to provide and modify licensing data would assign these permissions to everyone. Sites that wish to allow all users to add licensing data, but restrict those who are allowed to modify it, would give the first permission to everyone and the second to a limited group.
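A minimal sketch of the second configuration (everyone may add, a limited group may modify). The right names come from the proposal; the group name licensereviewer, the GROUP_RIGHTS mapping, and user_can are hypothetical, and this is a plain Python model rather than MediaWiki configuration.

```python
# Hypothetical group-to-rights mapping; group names are illustrative only.
GROUP_RIGHTS = {
    "*": {"addlicensedata"},
    "licensereviewer": {"addlicensedata", "modifylicensedata"},
}


def user_can(user_groups, right):
    # Every user, logged in or not, implicitly belongs to the "*" group.
    groups = set(user_groups) | {"*"}
    return any(right in GROUP_RIGHTS.get(group, set()) for group in groups)
```

A site that wants everyone to be able to modify licensing data would simply add modifylicensedata to the "*" entry.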
Database schema: Add a "licensing" table to the db with the following columns - 1) revision_or_image, 2) revision_id, 3) image_id, 4) content_source, 5) license_id, 6) user_id.
The first three columns identify the revision or image to which the licensing data is associated. I am not particularly adept with SQL, so there may be a better way to do this. The content_source column is a string that is a URL or other reference that specifies the source of the content under license. The license_id identifies the specific license for the content. The user_id identifies the user that added the licensing information. The user_id may be useful if a site wishes to allow someone who added the licensing information to delete or modify it. However, there are complications with this. Since IP addresses are easily spoofed, it would mean this entry should only be valid for logged in users.
Add a "license" table with the following columns - 1) license_id, 2) license_text, 3) license name and 4) license_version. The license_id in the licensing table references rows in this table.
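For discussion purposes, the two strawman tables could be sketched as follows. The column names are those listed above; the column types, the CHECK constraint, and the sample rows are assumptions, and SQLite (via Python's sqlite3) stands in for whatever database the wiki actually uses.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE license (
        license_id      INTEGER PRIMARY KEY,
        license_text    TEXT NOT NULL,
        license_name    TEXT NOT NULL,
        license_version TEXT NOT NULL
    );
    CREATE TABLE licensing (
        revision_or_image TEXT NOT NULL
            CHECK (revision_or_image IN ('revision', 'image')),
        revision_id       INTEGER,
        image_id          INTEGER,
        content_source    TEXT NOT NULL,
        license_id        INTEGER NOT NULL REFERENCES license (license_id),
        user_id           INTEGER
    );
    """
)

# Sample data, invented for illustration.
db.execute("INSERT INTO license VALUES (1, 'full text here', 'CC-BY-SA', '3.0')")
db.execute(
    "INSERT INTO licensing VALUES ('revision', 1234, NULL, "
    "'https://en.wikipedia.org/wiki/Example', 1, 42)"
)

# Look up the license and source recorded for revision 1234.
row = db.execute(
    "SELECT l.license_name, l.license_version, g.content_source "
    "FROM licensing g JOIN license l ON l.license_id = g.license_id "
    "WHERE g.revision_id = 1234"
).fetchone()
```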
One complication is when a page or image is reverted, the licensing table must be modified to reflect the current state.
Data manipulation: The extension would use suitable hooks to insert, modify and render licensing data. Insertion and modification would probably use a relevant Edit Page or Article Management hook. Rendering would probably use a Page Rendering Hook.
Page rendering: You probably don't want to dump licensing data directly onto a page. Instead, it is preferable to output a short licensing statement like:
"Content on this page uses licensed content. For details, see licensing data."
The phrase "licensing data" would be a link to a special page that accesses the licensing table and displays the license data associated with the page.
Dan Nessett wrote:
The discussion about whether to support license data in the database has settled down. There seems to be some support. So, I think the next step is to determine the best technical approach. Below I provide a strawman proposal. Note that this is only to foster discussion on technical requirements and approaches. I have nothing invested in the strawman.
Implementation location: In an extension
Permissions: include two new permissions - 1) addlicensedata, and 2) modifylicensedata. These are pretty self-explanatory. Sites that wish to give all users the ability to provide and modify licensing data would assign these permissions to everyone. Sites that wish to allow all users to add licensing data, but restrict those who are allowed to modify it, would give the first permission to everyone and the second to a limited group.
Database schema: Add a "licensing" table to the db with the following columns - 1) revision_or_image, 2) revision_id, 3) image_id, 4) content_source, 5) license_id, 6) user_id.
The first three columns identify the revision or image to which the licensing data is associated.
That's ugly. I would prefer having one licensing table for revisions and another for images (btw, there's no such thing as image_id; images are identified by name, or by the id of their description page, plus a timestamp if you also want to address old versions).
The content_source column is a string that is a URL or other reference that specifies the source of the content under license. The license_id identifies the specific license for the content. The user_id identifies the user that added the licensing information. The user_id may be useful if a site wishes to allow someone who added the licensing information to delete or modify it. However, there are complications with this. Since IP addresses are easily spoofed, it would mean this entry should only be valid for logged in users.
The user id could be stored at the logging table. You may want to add a licensing_id to identify rows on this table.
Add a "license" table with the following columns - 1) license_id, 2) license_text, 3) license name and 4) license_version. The license_id in the licensing table references rows in this table.
You could begin by hardcoding the available licenses in the extension, and then add support for the license table. There are a number of issues there: When can you remove a license (maybe never, once it is used)? Which licenses are shown as available? Do you have licenses which will "change" (e.g. when you may want to change the default license from "CC-BY-SA 3.0 or later" to "CC-BY-SA 4.0 or later")? Note that the license_version could also be part of the license_name. To make it useful you probably need a boolean to mark that it is an "or later" license.
One complication is when a page or image is reverted, the licensing table must be modified to reflect the current state.
If you are associating licenses with revisions (instead of pages), you don't need to change the state in the licensing table on further edits (just copy the license of the previous revision).
Data manipulation: The extension would use suitable hooks to insert, modify and render licensing data. Insertion and modification would probably use a relevant Edit Page or Article Management hook. Rendering would probably use a Page Rendering Hook.
Page rendering: You probably don't want to dump licensing data directly onto a page. Instead, it is preferable to output a short licensing statement like:
"Content on this page uses licensed content. For details, see licensing data."
The phrase "licensing data" would be a link to a special page that accesses the licensing table and displays the license data associated with the page.
That's fine. You could even use "Content on this page uses licensed content from XXXX under [[Special:Licenses/YYY|YYY license]]"
Do you want to support multilicensing? You could have revisions with data coming from several sources. That means you must allow duplicated revision_id in the licensing table.
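A small sketch of both points: licenses attached to revisions, with a new revision inheriting its parent's rows, and several rows allowed per revision. It assumes SQLite through Python's sqlite3; the table revision_licensing, its columns, and the helper functions are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """CREATE TABLE revision_licensing (
           licensing_id   INTEGER PRIMARY KEY,
           revision_id    INTEGER NOT NULL,   -- deliberately not UNIQUE
           content_source TEXT NOT NULL,
           license_name   TEXT NOT NULL
       )"""
)


def add_license(revision_id, source, license_name):
    db.execute(
        "INSERT INTO revision_licensing "
        "(revision_id, content_source, license_name) VALUES (?, ?, ?)",
        (revision_id, source, license_name),
    )


def inherit_licenses(parent_rev, new_rev):
    # A new edit starts with every license row of its parent revision.
    db.execute(
        "INSERT INTO revision_licensing "
        "(revision_id, content_source, license_name) "
        "SELECT ?, content_source, license_name FROM revision_licensing "
        "WHERE revision_id = ? ORDER BY licensing_id",
        (new_rev, parent_rev),
    )


def licenses_of(revision_id):
    return db.execute(
        "SELECT content_source, license_name FROM revision_licensing "
        "WHERE revision_id = ? ORDER BY licensing_id",
        (revision_id,),
    ).fetchall()


# Revision 100 mixes content from two sources; revision 101 edits it.
add_license(100, "https://example.org/a", "CC-BY-SA 3.0")
add_license(100, "https://example.org/b", "GFDL 1.3")
inherit_licenses(100, 101)
```

With this layout a revert needs no special handling: the reverted-to text becomes a new revision that inherits or re-records its own rows.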
On Tuesday 14 September 2010 21:01:40 Dan Nessett wrote:
Database schema: Add a "licensing" table to the db with the following columns - 1) revision_or_image, 2) revision_id, 3) image_id, 4) content_source, 5) license_id, 6) user_id.
The first three columns identify the revision or image to which the licensing data is associated. I am not particularly adept with SQL, so there may be a better way to do this. The content_source column is a string that is a URL or other reference that specifies the source of the content under license. The license_id identifies the specific license for the content. The user_id identifies the user that added the licensing information. The user_id may be useful if a site wishes to allow someone who added the licensing information to delete or modify it. However, there are complications with this. Since IP addresses are easily spoofed, it would mean this entry should only be valid for logged in users.
Add a "license" table with the following columns - 1) license_id, 2) license_text, 3) license name and 4) license_version. The license_id in the licensing table references rows in this table.
How about a more generalised, more wiki solution?
Instead of "licensing" table, use "revisionlinks" table that would track what revision of a page was linking to what revision of another page.
- rl_from: revision that is linking
- rl_from_page: page in which the revision was included
- rl_to: revision that is being linked to
- rl_to_page: page that is being linked to
- rl_type: template, category, article...
You could then use this to find what revision of a template was linked by what revision of a page. If used for a license template, this would effectively track licenses.
If this would be too database intensive, it could be used only for some pages (for example, only those with a specific magic word).
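A sketch of how such a table might be queried, assuming SQLite through Python's sqlite3. The column names are the ones proposed above; the column types, the sample ids, and the helper template_revisions_used_by are assumptions made for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """CREATE TABLE revisionlinks (
           rl_from      INTEGER NOT NULL,
           rl_from_page INTEGER NOT NULL,
           rl_to        INTEGER NOT NULL,
           rl_to_page   INTEGER NOT NULL,
           rl_type      TEXT NOT NULL
       )"""
)

# Page 10 (revisions 500 and 501) transcludes license template page 20,
# whose own revision changed from 900 to 905 in between.
db.executemany(
    "INSERT INTO revisionlinks VALUES (?, ?, ?, ?, ?)",
    [
        (500, 10, 900, 20, "template"),
        (501, 10, 905, 20, "template"),
        (501, 10, 300, 30, "category"),
    ],
)


def template_revisions_used_by(revision_id):
    # Which revision of which template did this page revision link to?
    return db.execute(
        "SELECT rl_to_page, rl_to FROM revisionlinks "
        "WHERE rl_from = ? AND rl_type = 'template' ORDER BY rl_to_page",
        (revision_id,),
    ).fetchall()
```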
On 9/15/10 10:11 AM, Nikola Smolenski wrote:
How about a more generalised, more wiki solution?
Instead of "licensing" table, use "revisionlinks" table that would track what revision of a page was linking to what revision of another page.
I can't immediately see why this would be a bad idea, although it seems like a pretty radical idea to solve just this licensing problem. Maintaining a table of every link in every version of every page seems pretty expensive to me, even if it's limited to just some kinds of pages. It does open the door to a new way of thinking about wikis though.
That said, such a system is not quite the same as a table in a relational database.
- We would need to build separate search, metadata, and indexing systems if we wanted to do anything useful with link information.
- It is harder to enforce constraints.
That said, I've been thinking about metadata systems for wikis and this is an interesting idea.
Nikola Smolenski wrote:
How about a more generalised, more wiki solution?
Instead of "licensing" table, use "revisionlinks" table that would track what revision of a page was linking to what revision of another page.
- rl_from: revision that is linking
- rl_from_page: page in which the revision was included
- rl_to: revision that is being linked to
- rl_to_page: page that is being linked to
- rl_type: template, category, article...
You could then use this to find what revision of a template was linked by what revision of a page. If used for a license template, this would effectively track licenses.
If this would be too database intensive, it could be used only for some pages (for example, only those with a specific magic word).
This is a complete overkill. And you can't even be sure that it is consistent. Page A includes {{GFDL}}, to which [[Image:Goatse]] is added (and reverted 5 seconds later). That would mean Page B (and thousands more) should have an entry in your table for Goatse. In fact, that won't appear. I see the benefit for marking some pages as "record anything that ever links here" but then you start getting requests for listing revisions older than the date on which the revision was tagged as such.
I'm not even sure why we would want to keep the licensing per revision in such case*. If it's added via templates/links, then the history can be seen in the page and also the licensing. You would only want to track which templates are licenses (that's what some toolserver projects already do). That would certainly be a much easier goal than what Dan proposed.
Even with licensing information in a separate table, we could keep it per page, and add dummy revisions where needed.
I think this is a great start and I am willing to start drawing up plans about this on-wiki somewhere. I am rather slammed with pressing deadlines right now but I just wanted to contribute a little to the discussion.
On Fri, 10 Sep 2010 23:11:27 +0000, Dan Nessett wrote:
Implementation location: In an extension
Permissions: include two new permissions - 1) addlicensedata, and 2) modifylicensedata.
Sounds good to me.
Database schema: Add a "licensing" table to the db with the following columns - 1) revision_or_image, 2) revision_id, 3) image_id, 4) content_source, 5) license_id, 6) user_id.
The first three columns identify the revision or image to which the licensing data is associated.
revision_or_image is a wart, as others pointed out. Each image has its own dedicated wiki page so it is probably more useful to use that.
Otherwise your schema is already similar to work I've been doing with UploadWizard, building on typical licensing workflows and templates.
There are "Deeds" which are composed of:
- Source -- some information that tells us where this came from. Currently we use a variety of wiki templates here. It could be a URL, a bibliographic record, anything.
- Author, which again is rather free-form. It can be a particular user on the wiki. But it is also common to use a Creator template for a famous artist, or to simply write in the name in plaintext.
- License, which is just a template license, but in this new world should be something like license_id.
If we want to get more structured, it would be nice to also record the Uploader, since that is not the same thing as the Author or Source. Things may get complex when image replacement happens as you noted.
Right now our major use case is when the uploader is the author. But it will not always be so. In the case where the uploader asserts that someone else has okayed their work to be distributed under a free license, we want the author in question, or OTRS, to be able to check off that this was verified. OTRS has a workflow like this already, and in the Multimedia project we had plans to simplify this but I'm afraid it's unlikely I will get to that soon.
Add a "license" table with the following columns - 1) license_id, 2) license_text, 3) license name and 4) license_version. The license_id in the licensing table references rows in this table.
Sounds good to me although I would also add boolean columns that are useful to describe the salient features of the license in machine readable terms, like "attribution_required", "share_alike", etc. That will help with searching and with a uniform licensing display (see below). Another column which gives us the wiki-hosted image of a small icon for the license may also be helpful.
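As a sketch of how such flags help with searching, assuming SQLite through Python's sqlite3: the columns attribution_required and share_alike are the ones suggested above, while the column types and the three sample rows are illustrative.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """CREATE TABLE license (
           license_id           INTEGER PRIMARY KEY,
           license_name         TEXT NOT NULL,
           attribution_required INTEGER NOT NULL,  -- 0 or 1
           share_alike          INTEGER NOT NULL   -- 0 or 1
       )"""
)
db.executemany(
    "INSERT INTO license (license_name, attribution_required, share_alike) "
    "VALUES (?, ?, ?)",
    [("CC-BY-SA 3.0", 1, 1), ("CC-BY 3.0", 1, 0), ("CC0 1.0", 0, 0)],
)

# Find every license that does not impose a share-alike condition.
no_share_alike = [
    name
    for (name,) in db.execute(
        "SELECT license_name FROM license "
        "WHERE share_alike = 0 ORDER BY license_name"
    )
]
```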
Licensing gets very complicated when it comes to country-by-country laws, so it may be useful to record the legal regime under which the deed falls, which could be something like a country code.
Page rendering: You probably don't want to dump licensing data directly onto a page. Instead, it is preferable to output a short licensing statement like:
No, you almost certainly do want to describe the terms of the license (broadly) right on the main page for the content.
The description should be functional, from a potential re-user's point of view, in very plain language. As in:
WANT TO USE THIS IMAGE?
You are free to use this image for any purpose, even in works that you sell. If you use this image, you must credit the author, Joe Blow joeblow@sample.com http://joeblow.sample.com/ . If you use this image, you must allow others to share the image in the same way.
Read more about the licensing terms here: <link to our template for cc-by-sa 3.0>
That's why machine-readable license properties will help. Even for images which don't allow re-use at all, it should say so quite clearly. (We host Wikimedia trademarked images, for instance.)
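A rough sketch of how machine-readable properties could drive such a plain-language statement. The Deed class, its fields, and reuse_statement are hypothetical; the wording is adapted from the example above, and the model ignores other restrictions (such as non-commercial terms) for brevity.

```python
from dataclasses import dataclass


@dataclass
class Deed:
    author: str
    license_name: str
    reuse_allowed: bool
    attribution_required: bool
    share_alike: bool


def reuse_statement(deed):
    # Content that cannot be reused at all should say so plainly.
    if not deed.reuse_allowed:
        return "You may not reuse this image."
    parts = [
        "You are free to use this image for any purpose, "
        "even in works that you sell."
    ]
    if deed.attribution_required:
        parts.append(
            f"If you use this image, you must credit the author, {deed.author}."
        )
    if deed.share_alike:
        parts.append(
            "If you use this image, you must allow others to share "
            "the image in the same way."
        )
    parts.append(f"Read more about the licensing terms: {deed.license_name}.")
    return " ".join(parts)


statement = reuse_statement(
    Deed("Joe Blow", "CC-BY-SA 3.0", True, True, True)
)
```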
PS: I apologize if the threading is screwed up here -- WMF mail was down for a few hours so I missed this message.
Neil Kandalgaonkar wrote:
I think this is a great start and I am willing to start drawing up plans about this on-wiki somewhere. I am rather slammed with pressing deadlines right now but I just wanted to contribute a little to the discussion.
(...)
Database schema: Add a "licensing" table to the db with the following columns - 1) revision_or_image, 2) revision_id, 3) image_id, 4) content_source, 5) license_id, 6) user_id.
The first three columns identify the revision or image to which the licensing data is associated.
revision_or_image is a wart, as others pointed out. Each image has its own dedicated wiki page so it is probably more useful to use that.
But then you encounter a GFDL / CC-BY-SA description of a CC-BY-any reupload of a PD image.
Otherwise your schema is already similar to work I've been doing with UploadWizard, building on typical licensing workflows and templates.
There are "Deeds" which are composed of:
- Source -- some information that tells us where this came from. Currently we use a variety of wiki templates here. It could be a URL, a bibliographic record, anything.
- Author, which again is rather free-form. It can be a particular user on the wiki. But it is also common to use a Creator template for a famous artist, or to simply write in the name in plaintext.
- License, which is just a template license, but in this new world should be something like license_id.
If we want to get more structured, it would be nice to also record the Uploader, since that is not the same thing as the Author or Source. Things may get complex when image replacement happens as you noted.
That would be already stored in the upload log.
Right now our major use case is when the uploader is the author. But it will not always be so. In the case where the uploader asserts that someone else has okayed their work to be distributed under a free license, we want the author in question, or OTRS, to be able to check off that this was verified. OTRS has a workflow like this already, and in the Multimedia project we had plans to simplify this but I'm afraid it's unlikely I will get to that soon.
Note that another common case is that the author was the uploader on a different wiki.
PS: I apologize if the threading is screwed up here -- WMF mail was down for a few hours so I missed this message.
Seems well-threaded here :)
On Fri, 10 Sep 2010 23:11:27 +0000, Dan Nessett wrote:
We are currently attempting to refactor some specific modifications to the standard MW code we use (1.13.2) into an extension so we can upgrade to a more recent maintained version. One modification we have keeps a flag in the revisions table specifying that article text was imported from WP. This flag generates an attribution statement at the bottom of the article that acknowledges the import.
I don't want to start a discussion about the various legal issues surrounding text licensing. However, assuming we must acknowledge use of licensed text, a legitimate technical issue is how to associate state with an article in a way that records the import of licensed text. I bring this up here because I assume we are not the only site that faces this issue.
Some of our users want to encode the attribution information in a template. The problem with this approach is anyone can come along and remove it. That would mean the organization legally responsible for the site would entrust the integrity of site content to any arbitrary author. We may go this route, but for the sake of this discussion I assume such a strategy is not viable. So, the remainder of this post assumes we need to keep such licensing state in the db.
After asking around, one suggestion was to keep the licensing state in the page_props table. This seems very reasonable and I would be interested in comments by this community on the idea. Of course, there has to be a way to get this state set, but it seems likely that could be achieved using an extension triggered when an article is edited.
Since this post is already getting long, let me close by asking whether support for associating licensing information with articles might be useful to a large number of sites. If so, then perhaps it belongs in the core.
One thing I haven't seen so far (probably because it doesn't belong on Wikitech) is a discussion of the policy requirements. In open source software development, you have to carry forward licenses even if you substantially change the code content. The only way around this is a "clean room" implementation (e.g., how BSD Unix got around AT&T's original licensing for Unix).
Is this also true for textual content? If so, then once you import such content into an article you are obliged to carry forward any licensing conditions on that import for all subsequent revisions.
Where is the proper place to discuss these kinds of questions?
wikitech-l@lists.wikimedia.org