Hi all,
I've been working on an API module/extension to extract metadata from Commons image description pages and display it in the API. I know this is an area that various people have thought about from time to time, so I thought it would be of interest to this list.
The specific goals I have:
* Should be usable for a lightbox-type feature ("MediaViewer") that needs to display information like author and license. [1] (This is the primary use case.)
* Should be generic where possible, so that better metadata access can be had by all wikis, even if they don't follow Commons conventions. For example, it should generically support Exif data from files where possible/appropriate, overriding the Exif data when more reliable sources of information are available.
* Should be compatible with a future Wikidata-on-Commons thing. [2]
** In particular, I want to read existing description page formatting rather than force people to use new parser functions or formatting conventions, since those may become outdated in the near future when Wikidata comes.
** Hopefully Wikidata would be able to hook into my system (while at the same time providing its own native interface).
* Since descriptions on Commons are formatted data (wikilinks are especially common), it needs to be able to output formatted data. I think HTML is the easiest format to use, much easier than, say, wikitext (though this is perhaps debatable).
What I've come up with is a new API metadata property (currently pending review in Gerrit) called extmetadata, which provides a hook that extensions can use. [3] [4] [5] Additionally, I developed an extension for reading information from Commons description pages. [6]
It combines information from both the file's metadata and from any extensions. For example, if the Exif data has an author specified ("Artist" in Exif speak), and the Commons description page also has one specified, the description page takes precedence, under the assumption that it's more reliable. The module outputs HTML, since that's the type of data stored in the image description page (except that it uses full URLs instead of local ones).
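To make the precedence rule concrete, here is a minimal sketch in Python (not the actual PHP implementation). The function name merge_metadata, the source labels, and the value/source dictionary shape are all hypothetical, chosen only for illustration. The single idea it shows is that sources are applied in order of increasing trust, so a later source overrides an earlier one for the same field.

```python
def merge_metadata(sources):
    """Merge metadata from several sources.

    `sources` is a list of (source_name, fields) pairs ordered from least
    to most trusted; a later source overrides an earlier one per field.
    The names and the output shape are illustrative assumptions.
    """
    merged = {}
    for source_name, fields in sources:
        for key, value in fields.items():
            # Record where each value came from so consumers can judge it.
            merged[key] = {"value": value, "source": source_name}
    return merged


# Hypothetical sample data: Exif carries a bare name, while the
# description page carries formatted HTML with a full URL.
exif = {
    "Artist": "Camera Owner",
    "DateTimeOriginal": "2013:08:01 12:00:00",
}
description_page = {
    "Artist": '<a href="https://commons.wikimedia.org/wiki/User:Example">Example</a>',
}

result = merge_metadata([
    ("file-metadata", exif),                  # least trusted, applied first
    ("commons-desc-page", description_page),  # most trusted, applied last
])
```

Fields that only one source provides (DateTimeOriginal here) pass through untouched, so the description page only wins where it actually has something to say.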
The downside is that, in order to effectively get metadata out of Commons given current practices, one essentially has to screen-scrape and do slightly ugly things. (Look ahead for a brighter tomorrow with Wikidata!)
As an example, given a query like
api.php?action=query&prop=imageinfo&iiprop=extmetadata&titles=File:Schwedenfeuer_Detail_04.JPG&format=xmlfm&iiextmetadatalanguage=en
it would produce something like [7]
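For anyone who wants to build that query programmatically, here is a small Python sketch using only the standard library. The endpoint URL is an assumption (the query above only names api.php), and build_extmetadata_url is a hypothetical helper, not part of any existing client library. The parameter names are the ones from the example query, which are still pending review and could change.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: the usual Commons API entry point.
API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"


def build_extmetadata_url(title, language="en", fmt="xmlfm",
                          endpoint=API_ENDPOINT):
    """Return a query URL asking imageinfo for the extmetadata property."""
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "extmetadata",
        "titles": title,
        "format": fmt,
        "iiextmetadatalanguage": language,
    }
    # urlencode percent-escapes the title (e.g. ":" becomes "%3A").
    return endpoint + "?" + urlencode(params)


url = build_extmetadata_url("File:Schwedenfeuer_Detail_04.JPG")
# Round-trip the query string to confirm the parameters survive encoding.
query = parse_qs(urlsplit(url).query)
```

The xmlfm format is the pretty-printed XML meant for reading in a browser; a script would more likely pass fmt="json" and parse the response.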
So thoughts? /me eagerly awaits mail tearing my plans apart :)
[1] https://www.mediawiki.org/wiki/Multimedia/Media_Viewer
[2] https://commons.wikimedia.org/wiki/Commons:Wikidata_for_media_info
[3] https://gerrit.wikimedia.org/r/#/c/81598/
[4] https://gerrit.wikimedia.org/r/#/c/78162/
[5] https://gerrit.wikimedia.org/r/#/c/78926/
[6] https://gerrit.wikimedia.org/r/#/c/80403/
[7] http://pastebin.com/yh5286iR
-- Bawolff