I apologise in advance for asking this in what seems a Technical List, but @pigsonthewing tells me I can. I want to extract all National Trust names, co-ordinates and the first part of the description from wiki. Having spent a day working through the API, I still cannot see how to do this aside from scraping. I am sure I am missing something very simple, so if anyone would kindly email me, I would be grateful.
Mike
Hi Mike,
it's not that simple ;-)
I can offer you ~10K Grade I listed buildings and their coordinates from Wikidata: https://tools.wmflabs.org/wikidata-todo/tabernacle.html?wdq=claim%5B1435%3A1... (there's a download button)
We also have ~23K Grade II* buildings: https://tools.wmflabs.org/wikidata-todo/tabernacle.html?wdq=claim%5B1435%3A1...
Many of those won't have an article on English Wikipedia. For those that do, you can use the Wikidata items (Qxxx) from above to get the English Wikipedia articles via the Wikidata API: https://www.wikidata.org/w/api.php?action=help&modules=wbgetentities
Once you have those, you can try to get the initial "blurb": https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bextracts
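Putting those two steps together, here is a rough sketch using only the Python standard library. The helper names and the one-title-at-a-time extracts call are my own choices; the API modules and parameters are the ones linked above:

```python
import json
import urllib.parse
import urllib.request

def wbgetentities_url(qids):
    """Wikidata API: fetch only the enwiki sitelinks for a batch of Q-ids."""
    params = {
        "action": "wbgetentities",
        "ids": "|".join(qids),
        "props": "sitelinks",
        "sitefilter": "enwiki",
        "format": "json",
    }
    return "https://www.wikidata.org/w/api.php?" + urllib.parse.urlencode(params)

def extracts_url(title):
    """English Wikipedia API: fetch the plain-text intro ("blurb") of one article."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,
        "explaintext": 1,
        "titles": title,
        "format": "json",
    }
    return "https://en.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)

def fetch(url):
    """GET the URL and decode the JSON response."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)
```

From the first response you'd read `entities[Qxxx]["sitelinks"]["enwiki"]["title"]`, then feed each title into the second call.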
Cheers, Magnus
On Wed, Jun 3, 2015 at 9:11 PM Mike Cummins b13@b13.co.uk wrote:
> I apologise in advance for asking this in what seems a Technical List, but @pigsonthewing tells me I can.
> I want to extract all National Trust names, co-ordinates and the first part of the description from wiki.
> Having spent a day working through the API, I still cannot see how to do this aside from scraping.
> I am sure I am missing something very simple, so if anyone would kindly email me, I would be grateful.
> Mike
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Dear Mike,
On 03.06.2015 22:11, Mike Cummins wrote:
> I apologise in advance for asking this in what seems a Technical List, but @pigsonthewing tells me I can.
You are welcome.
> I want to extract all National Trust names, co-ordinates and the first part of the description from wiki.
I am afraid that the data you are looking for may not be complete. For example, have a look at:
https://www.wikidata.org/wiki/Q4723912
This is a National Trust site, but Wikidata does not contain any statement that says so. So whatever technical means you use, they will not return this item (yet).
> Having spent a day working through the API, I still cannot see how to do this aside from scraping. I am sure I am missing something very simple, so if anyone would kindly email me, I would be grateful.
There are many ways. One is to use one of the SPARQL endpoints. Here I am using the WMF's new experimental one:
This query shows all things owned by National Trust, with English label, description, and coordinates (where available):
Click "execute" to run it. There are ways to get this result in different formats (not embedded in HTML), but I can't find them right now.
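Since the query itself doesn't seem to have survived in the archive, here is a rough reconstruction of what it could look like, with a small wrapper to URL-encode it for the endpoint. The `wdt:`/`wd:` prefixes follow the current WDQS vocabulary and may differ from what this experimental endpoint expected at the time; P127 "owned by" and Q333515 "National Trust" are the pair mentioned below, and P625 is the coordinate property:

```python
import urllib.parse

# Experimental WMF endpoint mentioned in this thread; the URL may have moved since.
ENDPOINT = "https://wdqs-beta.wmflabs.org/bigdata/namespace/wdq/sparql"

# Things owned by (P127) the National Trust (Q333515), with English label,
# description and coordinates (P625) where available.
QUERY = """
SELECT ?item ?itemLabel ?itemDescription ?coord WHERE {
  ?item wdt:P127 wd:Q333515 .
  OPTIONAL { ?item wdt:P625 ?coord . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

def query_url(query, endpoint=ENDPOINT):
    """URL-encode a SPARQL query for a GET request against the endpoint."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})
```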
The above query may not be the right one (just 200 results). Here is another one for all things with a number in the National Heritage List for England:
This time it's from another SPARQL endpoint, as you can see. As opposed to the previous query, this one has tens of thousands of results. No SPARQL endpoint I tried manages to return all of them before the timeout. But the one I linked above manages significantly more than the other one (30k still worked for me, while the WMF experimental endpoint currently times out even at 10k -- the service is running on a virtual machine that is not very powerful right now; this will change soon). The downside is that this endpoint is not updated every minute but only every month or so, which means that you won't see the current data.
Actually, a slightly simpler query does work for me on the WMF endpoint, see http://tinyurl.com/p2lk7c7. However, be warned that the 50k+ results displayed in fancy JavaScript may slow down your browser too ;-)
The two query services are based on slightly different versions of the RDF data, hence the slightly different queries. This will all be unified as the RDF work continues. Anyway, I hope you get a first idea from these queries and maybe can play with them a bit to find other things (they look technical at first, but in the end you can just copy patterns and change Qids or Pids as you need).
The key problem in your case might be that the data you need is not really in Wikidata yet. You could look at Magnus's autolist2 tool as an option to complete the data based on Wikipedia categories etc. (if the information you need is there). It is a very efficient tool for adding large numbers of statements in little time. For something as simple as adding "owned by National Trust" (P127 Q333515) this might work well. If you need more elaborate data, such as some kind of official IDs for each site, then someone who has a bot may be able to help you, if you know of a source where this data can be found.
Regards,
Markus
Hi!
> This query shows all things owned by National Trust, with English label, description, and coordinates (where available):
> Click "execute" to run it. There are ways to get this result in different formats (not embedded in HTML), but I can't find them right now.
If you just want the results and not the GUI, you can ask the endpoint directly: http://tinyurl.com/ph3wn4m
(this is https://wdqs-beta.wmflabs.org/bigdata/namespace/wdq/sparql?query= and then SPARQL, URL-encoded).
If you want the results in JSON, you'll need to add the header Accept: application/sparql-results+json (not easy to do from a browser, unfortunately, unless you use a tool like Postman in Chrome); otherwise you'll get the default XML. That's the endpoint the GUI is using, which then parses the results and presents them in a more human-friendly form.
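Outside the browser this is easy; a minimal sketch with Python's standard library, using the endpoint URL given above:

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://wdqs-beta.wmflabs.org/bigdata/namespace/wdq/sparql"

def json_request(query):
    """Build the same GET request, but ask for SPARQL JSON results
    instead of the default XML via the Accept header."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )

def run(query):
    """Send the query and decode the JSON result set."""
    with urllib.request.urlopen(json_request(query)) as resp:
        return json.load(resp)
```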
> other one (30k still worked for me, while the WMF experimental endpoint currently times out even at 10k -- the service is running on a virtual machine that is not very powerful right now; this will change soon). The
Yes, this service currently has a 30-second cap, so if the query takes longer, sorry :) The cap will of course be raised significantly (and performance will be better too) once we get it to production.