Hi,
I reuse a lot of media from Commons, and I cannot simply (automatically) get the author and license "attached" to each document I copy. As far as I know this is not possible (for example through the API, or by working directly with the database contents that come with the dumps).
I'm sorry if this topic was already well discussed in the past. If this is the case, please share the right pointer with me and simply ignore the rest.
So, I fail to respect the license/copyright in my derivative works, and I'm not comfortable with that situation. This is a problem for me... but because our goal, as a movement, is to provide reusable content, I consider it a global problem as well. This information is mandatory to respect the law; we should provide a way to retrieve it easily.
I do not see any solution other than saving both license and author in the database for each document... and afterwards building code to deal with these new properties. Do we have a project in that direction? Maybe decisions have already been taken on this topic?
Regards Emmanuel
2012/10/11 Emmanuel Engelhart emmanuel@engelhart.org:
Hi,
I reuse a lot of media from Commons, and I cannot simply (automatically) get the author and license "attached" to each document I copy. As far as I know this is not possible (for example through the API, or by working directly with the database contents that come with the dumps).
It should be possible to extract the license templates from the dumps, and even from the API. It is more difficult to do this with the author, especially with pictures moved from Wikipedias, but still, there is a lot of code out there written specifically to extract this kind of information.
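As a rough illustration of the API route, here is a minimal Python sketch. It assumes the standard MediaWiki API `prop=templates` query against the Commons endpoint, and it assumes license templates can be recognised by a name prefix; the prefix list and both function names are hypothetical and certainly incomplete.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"

# Assumption: a hypothetical, incomplete list of license template prefixes.
LICENSE_PREFIXES = ("cc-", "gfdl", "pd-")


def build_templates_query(file_title):
    """Build the URL that asks the API which templates a file page uses."""
    params = {
        "action": "query",
        "prop": "templates",
        "titles": file_title,
        "tllimit": "max",
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def pick_license_templates(template_titles):
    """Keep only the titles whose template name looks like a license."""
    licenses = []
    for title in template_titles:
        # Drop the "Template:" namespace prefix before comparing.
        name = title.split(":", 1)[-1].strip().lower()
        if name.startswith(LICENSE_PREFIXES):
            licenses.append(title)
    return licenses
```

The sketch only builds the request and filters a list of titles; fetching the URL and walking the JSON response is left out, and a prefix match will miss any license template with an unusual name.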
I did something last year for exporting the files from WLMRO to Europeana: http://code.google.com/p/wikiro/source/browse/trunk/robots/python/pywikipedi... It was done very quickly and it probably has some bugs. Platonides also has something made for these statistics: http://toolserver.org/~platonides/wlm/users.php , although I couldn't tell you where to get the source from.
Having this information (along with other meta-data like coordinates etc.) in the database and API would be useful, but it shouldn't stop you from respecting the licenses' requirements.
Strainu
On 11/10/2012 17:46, Strainu wrote:
Having this information (along with other meta-data like coordinates etc.) in the database and API would be useful
I obviously agree, but I want to insist on one point: author and license are not metadata like the others. Although it's *legal* to reuse/spread/copy content without any of its other metadata, it's *illegal* to do so without the author/license (in most cases). That's why we could maybe have a differentiated approach.
Emmanuel
I think this is supposed to be the page summarizing the issue and the way forward: https://www.mediawiki.org/wiki/Files_and_licenses_concept
Nemo
On 11/10/12 17:46, Strainu wrote:
I did something last year for exporting the files from WLMRO to Europeana: http://code.google.com/p/wikiro/source/browse/trunk/robots/python/pywikipedi... It was done very quickly and it probably has some bugs. Platonides also has something made for these statistics: http://toolserver.org/~platonides/wlm/users.php , although I couldn't tell you where to get the source from.
It's basically a regex to extract the user from the author field... plus 120 special cases for people who don't put a link to their user page or use a template, plus 4 special cases for users who use custom templates instead of {{Information}}, plus another for Talmoryair, who uses {{Artwork}} instead of {{Information}}. OTOH it has parsed (hopefully quite correctly) 363k images from 15000 authors.
We can make it a general library if you want. I think the most common misuse came from people who wanted to change the attribution from their username to their real name, so they changed [[User:Foo|Foo]] to [[User:John Doe|John Doe]]... which is completely wrong, especially when there was an account named «John Doe». In some cases, it was clear that when user JDoe changed the author field to «John Doe», he referred to himself. But if Guy85 put «John Doe», is it his real name, a friend, or some random guy? (I cared not just about how they wanted to be credited but also about who was being credited.)
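To give an idea of the general shape (this is not the actual toolserver code, just a minimal Python sketch), assume the author sits on a single `|author=` line of the template and is a plain [[User:...]] link; the pattern names and function names below are hypothetical:

```python
import re

# Assumption: the author parameter fits on one line of the template call.
AUTHOR_FIELD = re.compile(
    r"^[ \t]*\|[ \t]*author[ \t]*=[ \t]*(.*)$",
    re.IGNORECASE | re.MULTILINE,
)

# Matches [[User:Foo]] and [[User:Foo|label]], capturing "Foo".
USER_LINK = re.compile(
    r"\[\[\s*user\s*:\s*([^\]|#]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def extract_author_field(wikitext):
    """Return the raw value of the author parameter, or None if absent."""
    match = AUTHOR_FIELD.search(wikitext)
    return match.group(1).strip() if match else None


def extract_user(author_field):
    """Return the linked user name, or None when there is no user link."""
    match = USER_LINK.search(author_field)
    return match.group(1).strip() if match else None
```

A field such as `John Doe` with no link returns None here, which is exactly where the special cases start piling up.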
On 16/10/2012 23:04, Platonides wrote:
On 11/10/12 17:46, Strainu wrote:
I did something last year for exporting the files from WLMRO to Europeana: http://code.google.com/p/wikiro/source/browse/trunk/robots/python/pywikipedi... It was done very quickly and it probably has some bugs. Platonides also has something made for these statistics: http://toolserver.org/~platonides/wlm/users.php , although I couldn't tell you where to get the source from.
It's basically a regex to extract the user from the author field... plus 120 special cases for people who don't put a link to their user page or use a template, plus 4 special cases for users who use custom templates instead of {{Information}}, plus another for Talmoryair, who uses {{Artwork}} instead of {{Information}}. OTOH it has parsed (hopefully quite correctly) 363k images from 15000 authors.
Yes, this is what I meant: this is not what I would call a handy solution, either in principle or in implementation. What would be the disadvantages of having these two pieces of information in the DB?
Emmanuel
PS: Thank you for offering your help with the scripting, but this is not really my purpose with this email. I really hope to find a clean solution instead of having to code/use some painful scripts on millions of pictures.
On 17/10/12 09:54, Emmanuel Engelhart wrote:
Yes, this is what I meant: this is not what I would call a handy solution, either in principle or in implementation. What would be the disadvantages of having these two pieces of information in the DB?
Emmanuel
PS: Thank you for offering your help with the scripting, but this is not really my purpose with this email. I really hope to find a clean solution instead of having to code/use some painful scripts on millions of pictures.
The clean solution would be placing that code in MediaWiki itself (i.e. an extension) and storing the result in page_props.
Note however that there are some pictures with multiple authors (derivative works, collages...) and those are harder to determine and store (a simple field for the author is not enough).
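To make the multiple-author problem concrete, here is a minimal Python sketch of one possible shape for the stored value: a list of credits serialised into a single string, as one might put in a single property value. The class names, the field names, and the idea of using JSON are all assumptions for illustration, not an existing MediaWiki format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Credit:
    name: str
    role: str = "author"  # hypothetical roles, e.g. "author", "source author"


@dataclass
class Attribution:
    license: str
    credits: List[Credit] = field(default_factory=list)

    def to_prop_value(self):
        """Serialise to one string, as a single property value would need."""
        return json.dumps(asdict(self), ensure_ascii=False, sort_keys=True)

    def credit_line(self):
        """Flatten all credits into a single human-readable line."""
        names = ", ".join(credit.name for credit in self.credits)
        return f"{names} ({self.license})"
```

A collage would simply carry several `Credit` entries; a single free-text author field is then just the output of `credit_line()`.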
On 17 October 2012 09:02, Platonides Platonides@gmail.com wrote:
Note however that there are some pictures with multiple authors (derivative works, collages...) and those are harder to determine and store (a simple field for the author is not enough).
And some images would need a credit *trail*. And specified credit for some CC images gets wacky too.
Run this past commons-l and commons VP so people can come up with troublesome edge cases?
- d.
On 17/10/2012 10:07, David Gerard wrote:
On 17 October 2012 09:02, Platonides Platonides@gmail.com wrote:
Note however that there are some pictures with multiple authors (derivative works, collages...) and those are harder to determine and store (a simple field for the author is not enough).
And some images would need a credit *trail*. And specified credit for some CC images gets wacky too.
I do not think we need to cite all the authors (which could indeed be a long list) to respect the terms of the licence or the law. A simple text field should be enough. Any more detailed information (IMO only needed in edge cases, <1%) belongs to the metadata.
Emmanuel
On 17 October 2012 09:15, Emmanuel Engelhart emmanuel@engelhart.org wrote:
On 17/10/2012 10:07, David Gerard wrote:
And some images would need a credit *trail*. And specified credit for some CC images gets wacky too.
I do not think we need to cite all the authors (which could indeed be a long list) to respect the terms of the licence or the law. A simple text field should be enough. Any more detailed information (IMO only needed in edge cases, <1%) belongs to the metadata.
I think it would be a *very* good idea to map out the edge cases before blithely assuming they won't matter.
(How many images is 1% of Commons?)
- d.
On 17/10/2012 10:07, David Gerard wrote:
On 17 October 2012 09:02, Platonides Platonides@gmail.com wrote:
Note however that there are some pictures with multiple authors (derivative works, collages...) and those are harder to determine and store (a simple field for the author is not enough).
And some images would need a credit *trail*. And specified credit for some CC images gets wacky too.
I do not think we need to cite all the authors (which could indeed be a long list) to respect the terms of the licence or the law. A simple text field must be enough for this purpose. Any additional detailed information (IMO only needed in edge cases, <1%) belongs to the metadata.
Emmanuel
Hi,
we should provide a way to retrieve [Commons license & author] easily.
This is tracked at [[bugzilla:17503]] [1].
Though it is not as good as what you are looking for, the license is well exposed through HTML elements. The author is as well (but extraction fails in many cases).
Some external tools already rely on this machine-readable data. More information at [2].
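For illustration, a minimal Python sketch of reading that markup from a rendered file page. It assumes the license short name sits in an element with class `licensetpl_short` and that the author is in the table cell following the one with id `fileinfotpl_aut`; please check both names against [2] before relying on them. The parser class and function names are hypothetical.

```python
from html.parser import HTMLParser


class MachineReadableParser(HTMLParser):
    """Collect the license short name and the author cell text."""

    def __init__(self):
        super().__init__()
        self.license_short = None
        self.author = None
        self._capture = None  # (field name, tag) while collecting text
        self._author_next_td = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._capture is not None:
            return  # already collecting; nested tags only contribute text
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "licensetpl_short" in classes and self.license_short is None:
            self._capture = ("license_short", tag)
            self._buffer = []
        elif tag == "td" and attrs.get("id") == "fileinfotpl_aut":
            # This is the label cell; the value is in the next <td>.
            self._author_next_td = True
        elif tag == "td" and self._author_next_td:
            self._author_next_td = False
            self._capture = ("author", tag)
            self._buffer = []

    def handle_data(self, data):
        if self._capture is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Simplification: a nested tag of the same name ends capture early.
        if self._capture is not None and tag == self._capture[1]:
            setattr(self, self._capture[0], "".join(self._buffer).strip())
            self._capture = None


def parse_file_page(html):
    """Return (license short name, author text); either may be None."""
    parser = MachineReadableParser()
    parser.feed(html)
    return parser.license_short, parser.author
```

The author comes back as plain text only, so the "fails in many cases" caveat still applies whenever the cell holds something other than a simple name.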
-- Jean-Frédéric
[1] https://bugzilla.wikimedia.org/show_bug.cgi?id=17503 [2] https://commons.wikimedia.org/wiki/Commons:Machine-readable_data
2012/10/16 Jean-Frédéric jeanfrederic.wiki@gmail.com:
[2] https://commons.wikimedia.org/wiki/Commons:Machine-readable_data
Awesome, Jean-Frédéric, thanks!
Strainu
wikitech-l@lists.wikimedia.org