Hi Jesse,
On 26 Nov 2015, at 8:47 AM, Jesse de Vos
<jdvos(a)beeldengeluid.nl> wrote:
Hi everyone,
We’re trying to get a clearer picture of what material we have on Wikimedia Commons so
that our next batch upload doesn’t duplicate material that is already on Commons. The
category Media from Open Beelden contains all the files, and we would like to have all the
metadata on Commons for that category (specifically the 'source' URL) to match
against our new content upload.
Does anyone know how to gather this using the Commons API? Basically, a call to the API
that takes a “File:title” and returns a JSON object with all the metadata is
exactly what we need. Help would be much appreciated!
Best,
Jesse
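For what it’s worth, the live MediaWiki API on Commons can return per-file metadata
directly: `prop=imageinfo` with `iiprop=extmetadata` includes fields such as the licence,
description, and credit/source. A minimal sketch (the file title below is just a
placeholder):

```python
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"

def metadata_url(file_title):
    """Build a Commons API query URL that returns the extended
    metadata (licence, description, source, etc.) for one file as JSON."""
    params = {
        "action": "query",
        "titles": file_title,   # e.g. "File:Example.ogv" (placeholder title)
        "prop": "imageinfo",
        "iiprop": "extmetadata",
        "format": "json",
    }
    return API + "?" + urlencode(params)

print(metadata_url("File:Example.ogv"))
```

Fetching that URL returns the `extmetadata` block per file; for a batch check you can pass
several titles at once separated by `|` in the `titles` parameter.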
You *can* get to that data via DBpedia — I came up with a quick SPARQL query
(https://gist.github.com/gaurav/c9704c9b714e1e927140), which you can run at
http://commons.dbpedia.org/sparql — here’s what the output looks like:
http://commons.dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fcommons.db…
Unfortunately, this is based on a dump of Commons as of January 10, 2015, so it might
be out of date for your purposes! If you’d like more frequent updates, I’d ask on the
DBpedia mailing lists — I helped write the Commons extractors, but I don’t really know
anything about their infrastructure. You can also run the DBpedia Extraction Framework on
a local dump of the entire Commons or on a subset of pages (by using
https://commons.wikimedia.org/wiki/Special:Export to export all the pages from the
category of interest, say), but I’d definitely check with the DBpedia developers first to
see if they have something in the works that might be helpful for you!
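If you do go the Special:Export route, you first need the page titles in the category; the
Commons API’s `list=categorymembers` can enumerate them. A sketch (results are capped per
request, so follow the `continue`/`cmcontinue` token in the JSON response to page through):

```python
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"

def category_members_url(category, cmcontinue=None):
    """Build a query URL listing the files in a category, up to 500 titles
    per request; pass the cmcontinue token from the previous response to
    fetch the next page."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmtype": "file",       # only File: pages, not subcategories
        "cmlimit": "500",
        "format": "json",
    }
    if cmcontinue:
        params["cmcontinue"] = cmcontinue
    return API + "?" + urlencode(params)

print(category_members_url("Media from Open Beelden"))
```

The returned titles can then be fed to Special:Export, or straight back into the API for
metadata.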
Hope this helps!
cheers,
Gaurav