Hi all,
I've been working on an API module/extension to extract metadata from Commons image description pages and expose it through the API. I know this is an area that various people have thought about from time to time, so I thought it would be of interest to this list.
The specific goals I have:

* Should be usable for a lightbox-type feature ("MediaViewer") that needs to display information like author and license. [1] (This is the primary use case.)
* Should be generic where possible, so that better metadata access can be had by all wikis, even if they don't follow Commons conventions. For example, it should generically support EXIF data from files where possible/appropriate, overriding the EXIF data when more reliable sources of information are available.
* Should be compatible with a future "Wikidata on Commons". [2]
** In particular, I want to read existing description page formatting, not try to force people to use new parser functions or formatting conventions, since those may become outdated in the near future when Wikidata comes.
** Hopefully Wikidata would be able to hook into my system (while at the same time providing its own native interface).
* Since descriptions on Commons are formatted data (wikilinks are especially common), it needs to be able to output formatted data. I think HTML is the easiest format to use, much easier than, say, wikitext (though this is perhaps debatable).
What I've come up with is a new API metadata property (currently pending review in Gerrit) called extmetadata, which has a hook that extensions can plug into. [3] [4] [5] Additionally, I developed an extension for reading information from Commons description pages. [6]
It combines information from both the file's embedded metadata and from any extensions. For example, if the EXIF data has an author specified ("Artist" in EXIF-speak) and the Commons description page also has one specified, the description page takes precedence, under the assumption that it's more reliable. The module outputs HTML, since that's the type of data stored in the image description page (except that it uses full URLs instead of local ones).
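Roughly, you can think of it as layering simple key-value arrays, with the more reliable source winning. A minimal sketch of the idea (the property names follow EXIF conventions, the values are made up for illustration):

<?php
// Minimal sketch of the precedence idea; values are made up for illustration.
$fileMetadata = array(
    'Artist'   => 'NAME FROM CAMERA EXIF',
    'DateTime' => '2013:08:30 10:12:00',
);
$descPageMetadata = array(
    'Artist'     => 'Jane Example',  // scraped from the description page's author field
    'LicenseUrl' => 'http://creativecommons.org/licenses/by-sa/3.0/',
);

// Later arrays win, so the description page overrides EXIF where both exist,
// while EXIF-only properties (DateTime here) are kept.
$combined = array_merge( $fileMetadata, $descPageMetadata );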
The downside to this is that, in order to effectively get metadata out of Commons given current practices, one essentially has to screen-scrape and do slightly ugly things (look ahead to a brighter tomorrow with Wikidata!).
As an example, a query like api.php?action=query&prop=imageinfo&iiprop=extmetadata&titles=File:Schwedenfeuer_Detail_04.JPG&format=xmlfm&iiextmetadatalanguage=en would produce something like [7]
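For anyone who wants to poke at it from a client, consuming it would look roughly like this. (A sketch only; the response layout below is what I'm aiming for, but it may still change while the patches are in review.)

<?php
// Rough sketch of a client fetching extmetadata as JSON; layout may still change.
$url = 'https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo'
    . '&iiprop=extmetadata&iiextmetadatalanguage=en&format=json'
    . '&titles=' . rawurlencode( 'File:Schwedenfeuer Detail 04.JPG' );

$data = json_decode( file_get_contents( $url ), true );
foreach ( $data['query']['pages'] as $page ) {
    foreach ( $page['imageinfo'][0]['extmetadata'] as $name => $info ) {
        // Each property is expected to carry its value plus where it came from,
        // e.g. "commons-desc-page" or "file-metadata".
        echo "$name: {$info['value']} (from {$info['source']})\n";
    }
}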
So thoughts? /me eagerly awaits mail tearing my plans apart :)
[1] https://www.mediawiki.org/wiki/Multimedia/Media_Viewer
[2] https://commons.wikimedia.org/wiki/Commons:Wikidata_for_media_info
[3] https://gerrit.wikimedia.org/r/#/c/81598/
[4] https://gerrit.wikimedia.org/r/#/c/78162/
[5] https://gerrit.wikimedia.org/r/#/c/78926/
[6] https://gerrit.wikimedia.org/r/#/c/80403/
[7] http://pastebin.com/yh5286iR
-- Bawolff
On 31 August 2013 03:10, Brian Wolff bawolff@gmail.com wrote:
Hi all,
I've been working on an api module/extension to extract metadata from commons image description pages, and display it in the API. I know this is an area that various people have thought about from time to time, so I thought it would be of interest to this list.
This looks rather fun. For VisualEditor, we'd quite like to be able to pull in the description of a media file in the page's language when it's inserted into the page, to use as the default caption for images. I was assuming we'd have to wait for the port of this data to Wikidata, but this would be hugely helpful ahead of that. :-)
However, how much more work would it be to insert it directly into Wikidata right now? I worry about doing the work twice if Wikidata could take it now - presumably the hard work is the reliable screen-scraping, and building the tool-chain to extract from this just to port it over to Wikidata in a few months' time would be a pity.
J.
James Forrester wrote:
However, how much more work would it be to insert it directly into Wikidata right now?
I think a parallel question might be: is Wikidata, as a social or technical project, able and ready to accept such data? I haven't been following Wikidata's progress too much, but I thought the focus was currently infoboxes, following interwiki links. And even those two projects (interwikis + infoboxes) were focused on only Wikipedias.
MZMcBride
On Sun, Sep 1, 2013 at 9:02 AM, MZMcBride z@mzmcbride.com wrote:
I think a parallel question might be: is Wikidata, as a social or technical project, able and ready to accept such data? I haven't been following Wikidata's progress too much, but I thought the focus was currently infoboxes, following interwiki links. And even those two projects (interwikis + infoboxes) were focused on only Wikipedias.
I think https://commons.wikimedia.org/wiki/Commons:Wikidata_for_media_info is the proposed, future part of Wikidata that James/Brian are talking about.
Hoi,
Wikidata is able to support a subset of properties needed for infoboxes. The technology is however implemented on several Wikipedias. Recently it became available for use on Wikivoyage.
The support for interwiki links is well established on both Wikivoyage and Wikipedia.
Probably much of the data that needs to be supported cannot be imported yet, because the required property types are not supported yet. Thanks, GerardM
Gerard Meijssen wrote:
Wikidata is able to support a subset of properties needed for infoboxes. The technology is however implemented on several Wikipedias. Recently it became available for use on Wikivoyage.
The support for interwiki links is well established on both Wikivoyage and Wikipedia.
Thank you for this. I didn't know about Wikivoyage at all.
I see https://meta.wikimedia.org/wiki/Wikidata/Development and https://meta.wikimedia.org/wiki/Wikidata/Requirements, but I can't find a high-level public timeline or roadmap anywhere. Do you happen to know if there is one?
MZMcBride
On 8/31/13, James Forrester jforrester@wikimedia.org wrote:
However, how much more work would it be to insert it directly into Wikidata right now? I worry about doing the work twice if Wikidata could take it now - presumably the hard work is the reliable screen-scraping, and building the tool-chain to extract from this just to port it over to Wikidata in a few months' time would be a pity.
Part of this is meant as a holdover until Wikidata solves the problem in a more flexible way. However, part of it is meant to still work with Wikidata. The idea I have is that this API could be used by any wiki (the base part is in core), and then various extensions can extend it. That way we can make extensions (or even core features) relying on this metadata that work even on wikis without Wikidata or the Commons metadata extension I started. The basic features of the API would be available to anyone who needed metadata, and it would return the best information available, even if that means only the EXIF data. It would also mean that getting the metadata would be independent of the backend used to extract it. (I would of course still expect Wikidata to introduce its own, more flexible APIs.)
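To make the extension point concrete: the core module builds the array from the file's embedded metadata and then runs a hook that extensions can add entries to or override entries in. A rough sketch of what a handler might look like (the hook name and exact signature here are illustrative, not the final interface):

<?php
// Sketch only: the hook name/signature are illustrative, not the final interface.
$wgHooks['GetExtendedMetadata'][] = function ( array &$combinedMeta, File $file, IContextSource $context ) {
    // Core has already filled $combinedMeta from the file's EXIF/XMP.
    // An extension (e.g. a Commons description-page scraper) can add to or
    // override it; later, a Wikidata-backed handler could do the same.
    $combinedMeta['LicenseUrl'] = array(
        'value'  => 'http://creativecommons.org/licenses/by-sa/3.0/',
        'source' => 'commons-desc-page',
    );
    return true;
};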
This looks rather fun. For VisualEditor, we'd quite like to be able to pull in the description of a media file in the page's language when it's inserted into the page, to use as the default caption for images. I was assuming we'd have to wait for the port of this data to Wikidata, but this would be hugely helpful ahead of that. :-)
Interesting.
[tangent] One idea that sometimes comes up related to this is a way of specifying default thumbnail parameters on the image description page. For example, on PDFs, people sometimes want to specify a default page number. It's often proposed to allow specifying a default alt text (although some argue that would be bad for accessibility, since alt text should be context-dependent). Another use: people sometimes propose a sharpen/no-sharpen parameter to control whether thumbnails get sharpened (photos should be sharpened, line art should not; currently we decide based on file type).
It could be interesting to have a magic word like {{#imageParameters:page=3|Description|alt=Alt text}} on the image description page to specify defaults. (Although I imagine the VisualEditor folks won't like the idea of adding more in-page metadata.) [end of not-entirely-thought-out tangent]
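Appendix to the tangent: the parser-function side of that would look roughly like this (purely hypothetical; the magic word registration and the thumbnail-side code that would actually apply the defaults are omitted):

<?php
// Purely hypothetical sketch of an {{#imageParameters:}} parser function.
$wgHooks['ParserFirstCallInit'][] = function ( Parser $parser ) {
    $parser->setFunctionHook( 'imageParameters', function ( Parser $parser ) {
        $args = array_slice( func_get_args(), 1 ); // e.g. "page=3", "Description", "alt=Alt text"
        // Stash the defaults as a page property so thumbnailing code could look them up.
        $parser->getOutput()->setProperty( 'defaultImageParameters', implode( '|', $args ) );
        return '';
    } );
    return true;
};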
--bawolff
Hi Brian!
I like the idea of a metadata API very much. Being able to just replace the scraping backend with Wikidata (as proposed) later seems a good idea. I see no downside as long as no extra work needs to be done on the templates and wikitext, and the API could even be used later to port information from templates to wikidata.
The only thing I'm slightly worried about is the data model and representation of the metadata. Swapping one backend for another will only work if they are conceptually compatible.
Can you give a brief overview of how you imagine the output of the API would be structured, and what information it would contain?
Also, your original proposal said something about outputting HTML. That confuses me - an API module would return structured data, why would you use HTML to represent the metadata? That makes it a lot harder to process...
-- daniel
I'm just throwing some ideas out there, in hope of inspiring you:
Things you might want to consider (at least in the design) of this API/extension: multi-licensing; derivative and/or 'companion' linking (subtitle files, crops, the pictured object, etc.) and their copyrights; keeping track of the date and country/location of first publication vs. the date/country of authorship; and external repository record numbers/URIs (NASA image IDs, OCLC and other identifiers).
These have all proven to be critical elements in information handling (specifically [re-]publishing) of our media, yet they are notoriously difficult/challenging to retrieve at the moment.
Please don't implement all of it, but I think it would be good to consider whether the design choices made would exclude such things, or would enable future enhancements to include them.
Also, remember that attribution != author (especially with CC).
DJ
I can offer this demo (quickly ported from toolserver, which now refuses to run it): http://tools.wmflabs.org/magnustools/commonsapi.php
Far from perfect, but to show what could be done now.
If anyone's interested in helping me develop it, I'll make it a "real" tool on Labs.
Cheers, Magnus
This looks great. I know a few sites that are already screenscraping for our license info, so this will be a huge help for them. I noticed, however, that the API currently doesn't support the attribution parameter of the licensing templates (where it specifies the attribution string). I'm sure this is just because it's an experimental demo, but thought I would mention it anyway, just in case :)
Ryan Kaldari
On 9/6/13, Daniel Kinzler daniel@brightbyte.de wrote:
The only thing I'm slightly worried about is the data model and representation of the metadata. Swapping one backend for another will only work if they are conceptually compatible.
The data model I was using is simple key-value pairs. Specifically, it uses the various properties defined by EXIF (and other metadata MediaWiki extracts from files) as the key names. I imagine Wikidata would allow for much more complex types of metadata. I was thinking this API module would serve to gather the "basic" information, and Wikidata would have its own querying endpoints for the complex view of its metadata.
Can you give a brief overview of how you imagine the output of the API would be structured, and what information it would contain?
As an example, for the URL of the license:

<LicenseUrl source="commons-desc-page" translatedName="URL for copyright license" hidden="" xml:space="preserve">http://creativecommons.org/licenses/by-sa/3.0/at/deed.en</LicenseUrl>

This contains the key name ("LicenseUrl"), where the data was retrieved from ("commons-desc-page", as opposed to "file-metadata" if it came from the CC:LicenseUrl property of XMP data embedded in the file), the translated name of the key ("URL for copyright license", coming from the MediaWiki:Exif-licenseurl message), whether or not this property is hidden when displayed on the image description page (true in the example), and the value of the property (http://creativecommons.org/licenses/by-sa/3.0/at/deed.en).
Also, your original proposal said something about outputting HTML. That confuses me - an API module would return structured data, why would you use HTML to represent the metadata? That makes it a lot harder to process...
It does. Part of the reason is that I wanted something that could instantly be displayed to the user, hence more user-friendly than machine-friendly (for example, human-readable timestamps instead of ISO timestamps, or a human-readable flash-firing description instead of the raw constant). The second reason is the source of the data. If we look at the description field on a Commons image page, we have things like:
"Front and western side of the house located at 912 E. First Street in {{w|Bloomington, Indiana|Bloomington}}, {{w|Indiana}}, {{w|United States}}. Built in 1925, it is part of the locally-designated Elm Heights Historic District."
That has links in it. There are a couple of options for what we can do with it. We can give it out as-is, or we can expand templates and return:
"Front and western side of the house located at 912 E. First Street in [[:w:Bloomington, Indiana|Bloomington]], [[:w:Indiana|Indiana]], [[:w:United States|United States]]. Built in 1925, it is part of the locally-designated Elm Heights Historic District."
Or we could return html: Front and western side of the house located at 912 E. First Street in <a href="//en.wikipedia.org/wiki/Bloomington,_Indiana" class="extiw" title="w:Bloomington, Indiana">Bloomington</a>, <a href="//en.wikipedia.org/wiki/Indiana" class="extiw" title="w:Indiana">Indiana</a>, <a href="//en.wikipedia.org/wiki/United_States" class="extiw" title="w:United States">United States</a>. Built in 1925, it is part of the locally-designated Elm Heights Historic District.
Or we could ditch the html entirely:
Front and western side of the house located at 912 E. First Street in Bloomington, Indiana, United States. Built in 1925, it is part of the locally-designated Elm Heights Historic District.
I think returning the HTML is the option that is most faithful to the original data while still being easy to process. Sometimes the formatting in the description field is more complex than just simple links.
Given that the use cases of showing data to a user and having metadata that is easy for computers to process are slightly different, perhaps it makes sense to have two different modules: one that returns HTML (and human-formatted timestamps, etc.), and another that returns more machine-oriented data (including, perhaps, a version of the description with all the HTML stripped out).
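For the machine-oriented module, the stripped-down variant could be as simple as something like this (a sketch; a real implementation would presumably go through MediaWiki's sanitizer rather than bare PHP):

<?php
// Sketch: flattening the HTML description to plain text for machine consumption.
$html = 'Front and western side of the house located at 912 E. First Street in '
    . '<a href="//en.wikipedia.org/wiki/Bloomington,_Indiana" class="extiw" '
    . 'title="w:Bloomington, Indiana">Bloomington</a>, '
    . '<a href="//en.wikipedia.org/wiki/Indiana" class="extiw" title="w:Indiana">Indiana</a>.';

$plainText = trim( html_entity_decode( strip_tags( $html ), ENT_QUOTES, 'UTF-8' ) );
// "Front and western side of the house located at 912 E. First Street in Bloomington, Indiana."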
--bawolff
Hi Brian,
I've been working on an api module/extension to extract metadata from
commons image description pages, and display it in the API.
Awesome!
The downside to this is in order to effectively get metadata out of commons given the current practises, one essentially has to screen scrape and do slightly ugly things
This [1] looks quite acrobatic indeed. Can’t we make better use of the machine-readable markings provided by templates? https://commons.wikimedia.org/wiki/Commons:Machine-readable_data
[1] https://gerrit.wikimedia.org/r/#/c/80403/4/CommonsMetadata_body.php
On 9/1/13, Jean-Frédéric jeanfrederic.wiki@gmail.com wrote: [..]
The downside to this is in order to effectively get metadata out of commons given the current practises, one essentially has to screen scrape and do slightly ugly things
This [1] looks quite acrobatic indeed. Can’t we make better use of the machine-readable markings provided by templates? https://commons.wikimedia.org/wiki/Commons:Machine-readable_data
[1] https://gerrit.wikimedia.org/r/#/c/80403/4/CommonsMetadata_body.php
It is using the machine-readable data from that page. (Although it's debatable how machine-readable "look for a <td> with this id, then look at the contents of the next sibling <td> you encounter" really is.)
I'm somewhat of a newb, though, when it comes to extracting microformat-style metadata, so it's quite possible there is a better way, or some higher-level parsing library I could use (something like XPath maybe, although it's not really XML I'm looking at).
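In case it helps the discussion, the XPath flavour of the same scrape would look roughly like this. (A sketch only; DOMDocument's HTML parser copes with non-XML markup, and fileinfotpl_desc is the sort of machine-readable id the Commons templates emit, but treat the exact ids here as illustrative.)

<?php
// Sketch: pulling the description out of rendered page HTML with XPath.
$html = file_get_contents(
    'https://commons.wikimedia.org/w/index.php?title=File:Schwedenfeuer_Detail_04.JPG&action=render'
);

$doc = new DOMDocument();
@$doc->loadHTML( $html ); // suppress warnings from the (deliberately lenient) HTML parser

$xpath = new DOMXPath( $doc );
// The label cell carries the id; the value lives in the next sibling <td>.
$nodes = $xpath->query( '//td[@id="fileinfotpl_desc"]/following-sibling::td[1]' );
$description = $nodes->length ? trim( $nodes->item( 0 )->textContent ) : null;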
On 09/04/2013 09:59 AM, Brian Wolff wrote:
This [1] looks quite acrobatic indeed. Can’t we make better use of the machine-readable markings provided by templates? https://commons.wikimedia.org/wiki/Commons:Machine-readable_data
[1] https://gerrit.wikimedia.org/r/#/c/80403/4/CommonsMetadata_body.php
It is using the machine-readable data from that page. (Although it's debatable how machine-readable "look for a <td> with this id, then look at the contents of the next sibling <td> you encounter" really is.)
I'm somewhat of a newb, though, when it comes to extracting microformat-style metadata, so it's quite possible there is a better way, or some higher-level parsing library I could use (something like XPath maybe, although it's not really XML I'm looking at).
Parsoid might be able to help you with access to template parameters along with the fully expanded HTML that was produced from them. See [1].
We are going to work on page metadata storage as well, see [2] and [3]. Maybe our storage work could eventually also provide a backend for you.
Gabriel
[1]: https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Template_content [2]: https://bugzilla.wikimedia.org/show_bug.cgi?id=53508 [3]: https://bugzilla.wikimedia.org/show_bug.cgi?id=49143
On 4 sep. 2013, at 18:59, Brian Wolff bawolff@gmail.com wrote:
On 9/1/13, Jean-Frédéric jeanfrederic.wiki@gmail.com wrote: [..]
The downside to this is in order to effectively get metadata out of commons given the current practises, one essentially has to screen scrape and do slightly ugly things
This [1] looks quite acrobatic indeed. Can’t we make better use of the machine-readable markings provided by templates? https://commons.wikimedia.org/wiki/Commons:Machine-readable_data
[1] https://gerrit.wikimedia.org/r/#/c/80403/4/CommonsMetadata_body.php
It is using the machine-readable data from that page. (Although it's debatable how machine-readable "look for a <td> with this id, then look at the contents of the next sibling <td> you encounter" really is.)
Almost all of that is templated, so of course we can choose to actually fix some of those templates if we really wanted to. Especially for the licenses, my intent was EXACTLY to feed a system like you are building right now, while at the same time making Magnus' StockPhoto gadget possible for the immediate future, so I love what you are doing here.
I have not had time to read your patches, unfortunately, but can I suggest creating a separate table of licenses? Licenses are very well suited to being 'managed' data units, I think, and that would give you a lot of flexibility. You could have something like:
id, abbreviation, short name, long name, license version, long description page, default template, scrapeid, canonical license URL, canonical RFDa, PD/CC, BY, NC, SA, other properties of the license requirements
Then use the 'scrapeid' to link the licenses to the file metadata. Licenses are very well suited to this, I think, and it would make it a lot easier to search through the database and to dynamically produce suitable representations of a license in different forms (very short linked, long linked, full text, full linked) and in different languages.
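To illustrate the kind of record I mean (a rough sketch only; the field names follow the list above, and the rendering helper is hypothetical, not a final schema):

<?php
// Rough sketch of one 'managed' license record plus a helper that renders it
// in different forms; nothing here is a final schema.
$ccBySa30 = array(
    'id'           => 1,
    'abbreviation' => 'CC BY-SA 3.0',
    'shortName'    => 'Creative Commons Attribution-ShareAlike 3.0 Unported',
    'version'      => '3.0',
    'licenseUrl'   => 'http://creativecommons.org/licenses/by-sa/3.0/',
    'by'           => true,
    'sa'           => true,
    'nc'           => false,
);

function renderLicense( array $license, $form ) {
    if ( $form === 'short-linked' ) {
        return '<a href="' . htmlspecialchars( $license['licenseUrl'] ) . '">'
            . htmlspecialchars( $license['abbreviation'] ) . '</a>';
    }
    return $license['shortName']; // long form, full text, etc. would go here
}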
For the other metadata it would also be very nice to take a much more structured, even Wikidata-like, approach, but I think a licenses table is much simpler than most other metadata; it would give us a lot of flexibility and advantages, and it would be easy to import into Wikidata once we think we are up to that. Something to consider.
DJ
I'm somewhat of a newb though with extracting microformat style metadata, so its quite possible there is a better way, or some higher level parsing library I could use (Something like xpath maybe, although its not really xml I'm looking at).
I am not really proficient with that either; but yes, I have used XPath in two projects before (one prototype WordPress extension [1], one draft Zotero plugin [2]) to retrieve Commons metadata. It seems to me it's less shaky. For a good example, see the (better) one made by the Zotero folks: https://github.com/zotero/translators/blob/master/Wikimedia%20Commons.js
And yes, as Derk-Jan says, do remember we can re-markup everything if needed :)
-- Jean-Fred
[1] Just overviewed that one: https://github.com/CommonsOnCMS/CommonsOnCMS/blob/master/wp-wikimedia/wp-wik...
[2] https://github.com/JeanFred/translators/blob/master/Wikimedia%20Commons.js%3...