Commons API

Brianna Laugher

30 Mar 2008

9:10 a.m.

Hi,

There is an interesting Firefox extension called Zemanta that works with some blogging platforms to suggest images that match the blog post you are typing. One of the sources they use is Commons. See this post (comments) for a description of how it works and what it's lacking: http://brianna.modernthings.org/article/97/zemanta-wikimedia-commons-for-bloggers

In particular, "If you have an idea how to correctly capture wikipedia images attribution (something that would assure at least 50% correct coverage from 2.8M images), please help us! ;)"

Really, we can't blame people too much for not providing attribution, when we don't give that information in a standard way, or give a standard way of accessing it.

Now is as good a time as any to formally write an API to recommend for other people to use. Aside from the MediaWiki API, there are three main things I can think of that often need to be automated:

* identify any "problem tags" (files with deletion markers shouldn't be used or indexed by third parties)
* extract license name(s) and URL for a given file
* extract author attribution string for a given file

So I propose we put our heads together and figure out the most robust algorithm for each of these, and provide some sample code for each.
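As a starting point for the first task, a minimal sketch in Python might look like the following. It assumes the caller already has the list of template names used on a file page; the tag names in `PROBLEM_TAGS` are illustrative guesses rather than an agreed list, and both function names are hypothetical.

```python
# Illustrative, incomplete set of deletion/problem markers. This is an
# assumption for the sketch, not an authoritative list from Commons.
PROBLEM_TAGS = {"delete", "copyvio", "no license", "no source"}


def normalize(name):
    """Lower-case a template name and treat underscores as spaces."""
    return name.replace("_", " ").strip().lower()


def find_problem_tags(templates):
    """Return the sorted problem tags found among a page's template names."""
    return sorted({normalize(t) for t in templates} & PROBLEM_TAGS)


def is_safe_to_reuse(templates):
    """True when none of the page's templates is a known problem tag."""
    return not find_problem_tags(templates)
```

A third party could call `is_safe_to_reuse` before indexing a file and skip anything that returns `False`.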

I made a start here:

http://commons.wikimedia.org/wiki/Commons:API

Contributions and feedback welcome...

cheers, Brianna

-- They've just been waiting in a mountain for the right moment: http://modernthings.org/

Bryan Tong Minh

30 Mar

9:26 a.m.

I already started something some time ago on http://commons.wikimedia.org/wiki/Commons:Machine_readability. It allows you to extract all information provided by the {{Information}} template and some other templates. It's not yet finished; I'm still thinking about what the easiest way to fetch license information is.
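To make the idea of extracting {{Information}} fields concrete, here is a deliberately naive sketch that reads the fields straight from wikitext. This is not necessarily how the Commons:Machine_readability page works; the function name is hypothetical, and the parser assumes field values contain no nested templates or pipe characters, which real description pages often violate.

```python
import re


def parse_information(wikitext):
    """Extract |key=value fields from the first {{Information}} block.

    Naive on purpose: the match stops at the first "}}", so nested
    templates or pipes inside a value will produce wrong results.
    """
    match = re.search(r"\{\{\s*Information\s*(.*?)\}\}", wikitext, re.S | re.I)
    if match is None:
        return {}
    fields = {}
    # The text before the first "|" is empty, so skip it.
    for part in match.group(1).split("|")[1:]:
        key, sep, value = part.partition("=")
        if sep:
            fields[key.strip().lower()] = value.strip()
    return fields
```

The limitations noted in the docstring are the main reason text parsing of this kind is fragile.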

Bryan

Brianna Laugher

9:37 a.m.

On 31/03/2008, Bryan Tong Minh bryan.tongminh@gmail.com wrote:

...

I already started something some time ago on http://commons.wikimedia.org/wiki/Commons:Machine_readability. It allows you to extract all information provided by the {{Information}} template and some other templates.

Nice. I think they can be merged together?

It's not yet finished; I'm still thinking about what the easiest way to fetch license information is.

I still think it is typically fewer steps to do it by categories (maybe even better now with hidden cats, although that's probably not universal for license categories yet). But, we may discuss it at http://commons.wikimedia.org/w/index.php?title=Commons_talk:API/license
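A category-based lookup could be sketched roughly as below. The naming pattern is an assumption made for illustration (categories starting with `CC-`, `GFDL`, or `PD`); the actual license category names on Commons would need to be checked, and `licenses_from_categories` is a hypothetical helper.

```python
import re

# Assumed naming pattern for license categories; not verified against Commons.
LICENSE_CATEGORY = re.compile(r"^(CC-|GFDL|PD\b)", re.IGNORECASE)


def licenses_from_categories(categories):
    """Return the category names that look like license categories."""
    return [c for c in categories if LICENSE_CATEGORY.match(c)]
```

The appeal of this approach is that it needs only the category list of a file, which is a single lookup, rather than any parsing of the page text.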

Brianna


Bryan Tong Minh

9:50 a.m.

On Sun, Mar 30, 2008 at 4:37 PM, Brianna Laugher brianna.laugher@gmail.com wrote:

...

On 31/03/2008, Bryan Tong Minh bryan.tongminh@gmail.com wrote:

...
I already started something some time ago on http://commons.wikimedia.org/wiki/Commons:Machine_readability. It allows you to extract all information provided by the {{Information}} template and some other templates.

Nice. I think they can be merged together?

Nope. Your API and mine work from entirely different viewpoints. My API uses the existing infrastructure of Commons and does not depend on changes in the software. Your API requires updates to the software, which is the ideal model for the future, but impossible to achieve in a short timespan. Therefore, both should exist separately. My API is something that works right now and provides a somewhat straightforward way to fetch information now, but is not sustainable for the future.

To get to a real working API, the first thing we need is to store metadata such as author, license, etc. in the database, rather than putting it all together in one text field. You don't want an API that parses text.

Bryan

Daniel Kinzler

5:43 p.m.

Bryan Tong Minh wrote:

...

To get to a real working API, the first thing we need is to store metadata such as author, license, etc. in the database, rather than putting it all together in one text field. You don't want an API that parses text.

Here is something I have been thinking about for a while, which could make this kind of storage feasible: http://brightbyte.de/page/WikiData_light. It's only an idea at the moment, but I believe it would be doable without too much trouble, and could be made to scale. It's less powerful than full-fledged WikiData or Semantic MediaWiki, but it's far less complex and much easier to integrate with Wikipedia operations - and I believe it would be flexible and powerful enough to be useful.

Oh, and just for the record, let me mention http://commons.wikimedia.org/wiki/Commons:Tag_categories here. From what I see people don't follow the "all templates must be in those categories directly" bit, and are using subcategories - so getting the right info takes a bit more processing, but that, too, would be doable by evaluating the tag category hierarchy every week or so. That would most probably be a toolserver-based solution.
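Evaluating the tag category hierarchy amounts to a graph traversal. The sketch below assumes the category structure has already been fetched into two plain dictionaries (subcategories per category, templates per category); both inputs and the function name are hypothetical.

```python
from collections import deque


def collect_tags(root, subcategories, members):
    """Collect every template filed under root or any of its subcategories."""
    seen = {root}
    queue = deque([root])
    tags = set()
    while queue:
        category = queue.popleft()
        tags.update(members.get(category, ()))
        for child in subcategories.get(category, ()):
            if child not in seen:  # guard against category cycles
                seen.add(child)
                queue.append(child)
    return tags
```

Running this periodically and caching the result would give a flat list of tags even when templates are filed in subcategories rather than directly in the tag category.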

-- Daniel

Bryan Tong Minh

31 Mar

4:15 a.m.

On Mon, Mar 31, 2008 at 12:43 AM, Daniel Kinzler daniel@brightbyte.de wrote:

...

Bryan Tong Minh wrote:

...
To get to a real working API, the first thing we need is to store metadata such as author, license, etc. in the database, rather than putting it all together in one text field. You don't want an API that parses text.

Here is something I have been thinking about for a while, which could make this kind of storage feasible: http://brightbyte.de/page/WikiData_light. It's only an idea at the moment, but I believe it would be doable without too much trouble, and could be made to scale. It's less powerful than full-fledged WikiData or Semantic MediaWiki, but it's far less complex and much easier to integrate with Wikipedia operations - and I believe it would be flexible and powerful enough to be useful.

This looks like quite a good idea to me, and not really hard to implement. Especially since we now have the page_props table, which would be ideal for this.
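As a rough illustration of per-page key/value storage, the sketch below uses an in-memory SQLite table. The column names follow MediaWiki's page_props table (`pp_page`, `pp_propname`, `pp_value`), but the property names `author` and `license`, the helper functions, and the use of SQLite are assumptions made for the example only.

```python
import sqlite3


def make_store():
    """Create an in-memory table shaped like a simplified page_props."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page_props ("
        " pp_page INTEGER NOT NULL,"
        " pp_propname TEXT NOT NULL,"
        " pp_value TEXT NOT NULL,"
        " PRIMARY KEY (pp_page, pp_propname))"
    )
    return conn


def set_prop(conn, page_id, name, value):
    """Insert or overwrite one property of a page."""
    conn.execute(
        "INSERT OR REPLACE INTO page_props VALUES (?, ?, ?)",
        (page_id, name, value),
    )


def get_props(conn, page_id):
    """Return all properties of a page as a dict; empty if none are stored."""
    rows = conn.execute(
        "SELECT pp_propname, pp_value FROM page_props WHERE pp_page = ?",
        (page_id,),
    )
    return dict(rows.fetchall())
```

With metadata stored this way, an API can answer "who is the author of page 1" with a lookup instead of parsing text.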

...

Oh, and just for the record, let me mention http://commons.wikimedia.org/wiki/Commons:Tag_categories here. From what I see people don't follow the "all templates must be in those categories directly" bit, and are using subcategories - so getting the right info takes a bit more processing, but that, too, would be doable by evaluating the tag category hierarchy every week or so. That would most probably be a toolserver-based solution.

The problem is that people don't actually read that page. And when they categorize, they think, quite obviously, that the tag is overcategorized when it appears both in a subcategory and in a parent category. Some people suggested to me to use [[Category:All license tags]], which is itself a subcat of [[Category:License tags]], and to have [[Category:License tags]] only be used as a category for categories. Unfortunately this will break many tools that depend on [[Commons:Tag categories]].

Bryan

Magnus Manske

8:31 a.m.

Ignoring all that careful planning ;-) I hacked a simple API: http://tools.wikimedia.de/~magnus/commonsapi.php http://tools.wikimedia.de/~magnus/commonsapi.php?image=Sa-warthog.jpg

For an image, it returns an XML text with

* URL of page and file
* quality image/featured image status
* the components of the {{Infobox}} parts
* a list of descriptions in all available languages
* a list of categories
* a list of licenses (which for now are categories that fall in a certain pattern)
* a simplified upload log

It's probably full of bugs, and not very elegant; I'm screenscraping the page even for things like categories, which could be taken much better from a separate MediaWiki API call. It's also missing information (like file type, size, etc.) that can be retrieved through our normal API.
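For reference, the two kinds of request mentioned here could be built as in the sketch below. The `image` parameter and tool URL come from the links above; the MediaWiki API parameters (`action=query`, `prop=categories`) are standard, while the `File:` prefix is an assumption (the namespace was `Image:` at the time of this thread). The code only builds URLs and makes no network calls, since the tool address may no longer resolve.

```python
from urllib.parse import urlencode

COMMONSAPI = "http://tools.wikimedia.de/~magnus/commonsapi.php"
MEDIAWIKI_API = "https://commons.wikimedia.org/w/api.php"


def commonsapi_url(image):
    """URL for the commonsapi.php tool, using its `image` parameter."""
    return COMMONSAPI + "?" + urlencode({"image": image})


def categories_query_url(image):
    """URL asking the MediaWiki API for the categories of a file page."""
    params = {
        "action": "query",
        "prop": "categories",
        "titles": "File:" + image,  # assumed prefix; "Image:" in 2008
        "format": "json",
    }
    return MEDIAWIKI_API + "?" + urlencode(params)
```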

Future enhancements could also include attributes per license (link to original license text and logo, need to print the license, mention the author, use the same license again etc.).

Cheers, Magnus

Bryan Tong Minh

8:57 a.m.

You'll need an xml escape function, not an url escape one ;)
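The difference between the two kinds of escaping can be shown with Python's standard library. This is an illustration of the general point, not the tool's actual code (which is not shown in this thread); the `element` helper and the sample value are made up.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape


def element(tag, text):
    """Wrap text in an XML element, escaping &, < and > in the content."""
    return "<{0}>{1}</{0}>".format(tag, escape(text))


value = "Tom & Jerry <1940>"
xml_safe = escape(value)  # entity references, readable by any XML parser
url_safe = quote(value)   # percent-encoding, meant for URLs only
```

URL-escaped text does parse as XML, but a consumer receives the percent-encoded string rather than the original value, whereas XML-escaped text round-trips through a parser unchanged.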

Magnus Manske

9:05 a.m.

On Mon, Mar 31, 2008 at 2:57 PM, Bryan Tong Minh bryan.tongminh@gmail.com wrote:

You'll need an xml escape function, not an url escape one ;)

Yes, but I got entity errors, so this was easier ;-)

Magnus

Brianna Laugher

9:07 a.m.

On 01/04/2008, Magnus Manske magnusmanske@googlemail.com wrote:

...

Ignoring all that careful planning ;-) I hacked a simple API: http://tools.wikimedia.de/~magnus/commonsapi.php http://tools.wikimedia.de/~magnus/commonsapi.php?image=Sa-warthog.jpg

Wow... So obvious, it didn't even occur to me... of course, put Magnus on the task! He solves it before you've even finished formulating the question...

...

Future enhancements could also include attributes per license (link to original license text and logo, need to print the license, mention the author, use the same license again etc.).

Can we please set this up as a serious priority project? It would need version control, a multi-contributor environment, (a bug tracker?), and test sets. Should it live in Wikimedia SVN, on the toolserver, or on the stable toolserver?

man, forget the waffle... just code, eh :)

cheers Brianna


Magnus Manske

10:12 a.m.

On Mon, Mar 31, 2008 at 3:07 PM, Brianna Laugher brianna.laugher@gmail.com wrote:

...

On 01/04/2008, Magnus Manske magnusmanske@googlemail.com wrote:

...
Ignoring all that careful planning ;-) I hacked a simple API: http://tools.wikimedia.de/~magnus/commonsapi.php http://tools.wikimedia.de/~magnus/commonsapi.php?image=Sa-warthog.jpg

Wow... So obvious, it didn't even occur to me... of course, put Magnus on the task! He solves it before you've even finished formulating the question...

And all thanks to our friend the almighty caffeine ;-)

...

...
Future enhancements could also include attributes per license (link to original license text and logo, need to print the license, mention the author, use the same license again etc.).

Can we please set this up as a serious priority project? It would need version control, a multi-contributor environment, (a bug tracker?), and test sets. Should it live in Wikimedia SVN, on the toolserver, or on the stable toolserver?

I'll make it prettier (code and output) later today. I can also add it to the MediaWiki SVN. Not sure if the toolserver personal SVN would cut it.

For now, I think I've fixed the urlencode issue, and I've also added "location awareness", that is, {{location}} et al are recognized and added to the output; see the end of http://tools.wikimedia.de/~magnus/commonsapi.php?image=ChathamHDY0016.JPG

...

man, forget the waffle... just code, eh :)

Of course, that strategy runs the risk of actually producing results ;-)

Cheers, Magnus

Magnus Manske

3:30 p.m.

Update:

* Now using MediaWiki API for additional information
* revised sectioning of output:
** file (including image dimensions, URL, and stuff from {{Information}})
** meta (EXIF data etc.)
** description (in multiple languages, if available)
** licenses (some licenses now carry lots of additional information, like link to license text/description, need to mention author, keep file under same/similar license, need to distribute full text of license, license logo etc), and "self-made" attribute
** versions (all versions that were uploaded, with date, size, dimensions, uploader)
* nice XML error message if the requested file doesn't exist
* code now in MediaWiki SVN, under trunk/tools/commonsapi

This needs to be adapted to other templates (see [1] for an example), but it does degrade gracefully (omits information if it can't find it); license information should always be present.

License "finetuning" will be a problem, especially with language variants (e.g., CC-BY-2.5-IT); this might have to be solved programmatically to cover all cases (sigh).
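One possible way to handle such variants programmatically is to split the license name into parts instead of listing every combination. The sketch below is an assumption about the naming scheme (terms, optional version, optional two-letter variant code), based only on names like CC-BY-2.5-IT; it is not taken from the tool's code, and `parse_cc_license` is a hypothetical name.

```python
import re

# Assumed shape: CC-<terms>[-<version>][-<two-letter variant>]
LICENSE_NAME = re.compile(
    r"^CC-(?P<terms>BY(?:-SA)?)"
    r"(?:-(?P<version>\d+\.\d+))?"
    r"(?:-(?P<variant>[A-Z]{2}))?$",
    re.IGNORECASE,
)


def parse_cc_license(name):
    """Split a CC license name into terms, version and variant, or None."""
    match = LICENSE_NAME.match(name.strip())
    if match is None:
        return None
    parts = match.groupdict()
    variant = parts["variant"]
    return {
        "terms": parts["terms"].upper(),
        "version": parts["version"],
        "variant": variant.upper() if variant else None,
    }
```

With the parts separated, attributes such as "must mention the author" can be attached to the terms once, and the variant code only needs to select the matching license text.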

If someone wants to co-maintain it on the stable toolserver, welcome! :-) Otherwise, please help extending the software and fixing bugs, or just try to break it and report the crime scene to me ;-)

Cheers, Magnus

[1] http://commons.wikimedia.org/wiki/Image:Gesammelte_Werke_(Thoma)_1_307.jpg

Brianna Laugher

10:16 p.m.

On 01/04/2008, Magnus Manske magnusmanske@googlemail.com wrote:

...

code now in MediaWiki SVN, under trunk/tools/commonsapi

[...]

...

If someone wants to co-maintain it on the stable toolserver, welcome! :-) Otherwise, please help extending the software and fixing bugs, or just try to break it and report the crime scene to me ;-)

Great. So... I'm kind of confused about how it's in Wikimedia SVN and yet runs at your toolserver account. Are you keeping distinct copies?

Should we discuss it at wikitech, here, mediawiki-api, or a new mailing list? (I'm inclined to a new list, to avoid lots of traffic for uninterested people)

Also I'll be happy to put my hand up for comaintenance on the stable ts - maybe when it gets a bit more stable? :P

Brianna


Platonides

30 Mar

5:41 p.m.

Bryan Tong Minh wrote:

...

I already started something some time ago on http://commons.wikimedia.org/wiki/Commons:Machine_readability. It allows you to extract all information provided by the {{Information}} template and some other templates. It's not yet finished; I'm still thinking about what the easiest way to fetch license information is.

Bryan

The most reliable method used to be checking the categories for license tags. Maybe create a list of license tags from Category:License_tags, then look for them on the article page?

participants (5)

Brianna Laugher
Bryan Tong Minh
Daniel Kinzler
Magnus Manske
Platonides