Hi everyone,
We’re trying to get a clearer picture of what material we have on Wikimedia Commons, so that our next batch upload doesn’t duplicate material that is already there. The category 'Media from Open Beelden' contains all the files, and we would like to retrieve all the Commons metadata for that category (specifically the 'source' URL) to match against our new content upload.
Does anyone know how to gather this using the Commons API? Basically, a call to the API with the “File:title” field that would return a JSON object with all the metadata is exactly what we need. Help would be much appreciated!
Best,
Jesse
We don't have great support for this. The best I know of is the "Credit" field in https://commons.wikimedia.org/w/api.php?generator=categorymembers&gcmtitle=%20category:%20Media%20from%20Open%20Beelden%20%20&prop=imageinfo&gcmtype=file&iiprop=extmetadata%7csha1&action=query&formatversion=2&gcmlimit=max&format=json, but you might need to parse some HTML (unless the data is included in EXIF/XMP, in which case there is a different API query you can use). Also keep in mind you have to use the continue parameter to get the next page of results.
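A minimal sketch of that call in Python 3 (with the requests library), including the continue handling; the parameter names are taken from the URL above, and "Credit" is the extmetadata key for the credit/source line:

import requests

API = "https://commons.wikimedia.org/w/api.php"

def open_beelden_files():
    """Yield every file page in the category, following API continuation."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "generator": "categorymembers",
        "gcmtitle": "Category:Media from Open Beelden",
        "gcmtype": "file",
        "gcmlimit": "max",
        "prop": "imageinfo",
        "iiprop": "extmetadata|sha1",
    }
    while True:
        data = requests.get(API, params=params).json()
        yield from data.get("query", {}).get("pages", [])
        if "continue" not in data:
            break
        params.update(data["continue"])  # carry the continuation tokens forward

for page in open_beelden_files():
    info = page["imageinfo"][0]
    credit = info.get("extmetadata", {}).get("Credit", {}).get("value", "")
    print(page["title"], info["sha1"], credit)  # the Credit value is HTML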
Hope that helps, --bawolff
The files have "<!-- <metadata_mapped_json>" followed by the JSON data in their file description, and there the field "gwtoolset-url-to-the-media-file" seems to be exactly what you need. So if you can get the full text page corresponding to each file, it should be easy to parse out this part and store it as JSON. How to get the wikitext I do not know exactly (but that seems quite trivial).

Kind regards,
Bas
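A rough sketch of that in Python, assuming the JSON block is closed by a matching </metadata_mapped_json> tag (only the opening marker is quoted above, so treat the regex as an assumption to verify against a real file page):

import json
import re
import requests

API = "https://commons.wikimedia.org/w/api.php"

def wikitext(title):
    """Fetch the current wikitext of a page via prop=revisions."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
    }
    data = requests.get(API, params=params).json()
    return data["query"]["pages"][0]["revisions"][0]["slots"]["main"]["content"]

def mapped_metadata(title):
    """Extract the GWToolset JSON comment from a file description page."""
    m = re.search(r"<metadata_mapped_json>(.*?)</metadata_mapped_json>",
                  wikitext(title), re.DOTALL)
    return json.loads(m.group(1)) if m else None

meta = mapped_metadata("File:Example Open Beelden clip.ogv")  # hypothetical title
if meta:
    print(meta["gwtoolset-url-to-the-media-file"])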
Vera was talking to me on Wednesday about developing a short PHP course for Commonists for this specific issue, namely getting and changing information in Commons templates based on user-defined queries. On the other hand, if all you want to do here is get the list of files that you uploaded with that text and compare it to the list of files you want to upload in the next batch, then the CatScan tool that Magnus created will probably do the trick (a sketch of the comparison step follows the link):
https://tools.wmflabs.org/catscan2/catscan2.php?language=commons&project...
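For the comparison itself, the sha1 values from the imageinfo query earlier in this thread are handy: hash the files in the new batch locally and skip anything whose checksum Commons already has. A sketch in Python (the local batch directory and file pattern are made up):

import hashlib
import pathlib

def sha1_of(path):
    """SHA1 of a local file; Commons reports the same checksum per upload."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Fill this from the categorymembers/imageinfo query sketched above, e.g.
# existing = {page["imageinfo"][0]["sha1"] for page in open_beelden_files()}
existing = set()

planned = pathlib.Path("new_batch").glob("*.ogv")  # hypothetical local batch
to_upload = [p for p in planned if sha1_of(p) not in existing]
print(len(to_upload), "files are not yet on Commons")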
Hi Jesse,
You *can* get to that data via DBpedia — I came up with a quick SPARQL query (https://gist.github.com/gaurav/c9704c9b714e1e927140), which you can run at http://commons.dbpedia.org/sparql — here’s what the output looks like: http://commons.dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fcommons.dbp...
Unfortunately, this is based on a dump of Commons from January 10, 2015, so it might be out of date for your purposes! If you’d like more frequent updates, I’d ask on the DBpedia mailing lists; I helped write the Commons extractors, but I don’t really know anything about their infrastructure. You can also run the DBpedia Extraction Framework on a local dump of the entire Commons, or on a subset of pages (by using https://commons.wikimedia.org/wiki/Special:Export to export all the pages from the category of interest, say), but I’d definitely check with the DBpedia developers first to see if they have something in the works that might be helpful for you!
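For anyone who hasn't used a SPARQL endpoint from code before, a minimal round trip against it might look like this in Python; the throwaway query below just lists ten arbitrary triples, the real query for source URLs being the one in the gist above (the 'format' parameter is the usual convention for Virtuoso endpoints like this one, so treat it as an assumption):

import requests

ENDPOINT = "http://commons.dbpedia.org/sparql"
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"  # placeholder query

resp = requests.get(ENDPOINT, params={
    "query": QUERY,
    "format": "application/sparql-results+json",
})
for row in resp.json()["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])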
Hope this helps!
cheers, Gaurav