Dear Mike,
On 03.06.2015 22:11, Mike Cummins wrote:
I apologise in advance for asking this in what seems to be a technical list, but @pigsonthewing tells me I can.
You are welcome.
I want to extract all National Trust names, co-ordinates and the first part of the description from wiki.
I am afraid that the data you are looking for may not be complete. For example, have a look at:
https://www.wikidata.org/wiki/Q4723912
This is a National Trust site, but Wikidata does not contain any statement that says so. So whatever technical means you use, it will not return this item (yet).
Having spent a day working through the API, I still cannot see how to do this aside from scraping. I am sure I am missing something very simple, so if anyone would kindly email me, I would be grateful.
There are many ways. One is to use one of the SPARQL endpoints. Here I am using the WMF's new experimental one:
This query shows all things owned by National Trust, with English label, description, and coordinates (where available):
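The query itself was linked in the original mail. For illustration, a query of this shape might look as follows in the current query.wikidata.org syntax (the 2015 experimental endpoint used slightly different prefixes, so treat this as a sketch rather than the exact query from the mail):

```sparql
# Sketch: items owned by (P127) the National Trust (Q333515),
# with English label, description, and coordinates (P625) where available.
SELECT ?item ?itemLabel ?itemDescription ?coord WHERE {
  ?item wdt:P127 wd:Q333515 .
  OPTIONAL { ?item wdt:P625 ?coord . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```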
Click "execute" to run it. There are ways to get this result in different formats (not embedded in HTML), but I can't find the link right now.
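As one illustration of getting results outside the HTML view: most SPARQL endpoints accept an HTTP GET request with the query in a `query` parameter, plus a parameter asking for JSON. The endpoint URL and the `format=json` parameter below are assumptions based on the current Wikidata Query Service, not something stated in the mail:

```python
from urllib.parse import urlencode

# Assumed endpoint (today's Wikidata Query Service; the 2015
# experimental endpoint lived at a different URL).
ENDPOINT = "https://query.wikidata.org/sparql"

# A tiny example query: a few items owned by (P127) the National Trust (Q333515).
query = "SELECT ?item WHERE { ?item wdt:P127 wd:Q333515 } LIMIT 10"

# format=json asks for machine-readable results instead of the HTML page.
url = ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
print(url)
```

Fetching this URL with any HTTP client then returns the result set as JSON rather than an HTML page.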
The above query may not be the one you need (it returns just 200 results). Here is another one, for all things with a number in the National Heritage List for England:
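Again the concrete query was linked rather than quoted; a sketch in current query.wikidata.org syntax, assuming P1216 is the National Heritage List for England number property, might look like:

```sparql
# Sketch: items with a National Heritage List for England number (P1216).
SELECT ?item ?itemLabel ?nhle WHERE {
  ?item wdt:P1216 ?nhle .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 30000
```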
This time it's from another SPARQL endpoint, as you can see. Unlike the previous query, this one has tens of thousands of results. No SPARQL endpoint I tried manages to return all of them before the timeout, but the one I linked above manages significantly more than the other: 30k still worked for me, while the WMF experimental endpoint currently times out even at 10k (the service is running on a virtual machine that is not very powerful right now; this will change soon). The downside is that this endpoint is updated not every minute but only every month or so, which means that you won't see the current data.
Actually, a slightly simpler query does work for me on the WMF endpoint; see http://tinyurl.com/p2lk7c7. However, be warned that the 50k+ results displayed in fancy JavaScript may slow down your browser ;-)
The two query services are based on slightly different versions of the RDF data, hence the slightly different queries. This will all be unified as the RDF work continues. Anyway, I hope you get a first idea from these queries and maybe can play with them a bit to find other things (they look technical at first, but in the end you can just copy patterns and change Qids or Pids as you need).
The key problem in your case might be that the data you need is not really in Wikidata yet. You could look at the autolist2 tool from Magnus as an option to complete the data based on Wikipedia categories etc. (if the information you need is there). It is a very efficient tool for adding large numbers of statements in little time. For something as simple as adding "owned by National Trust" ("P127 Q333515") this might work well. If you need more elaborate data, such as some kind of official ID for each site, then someone who has a bot may be able to help you if you know of a source where this data can be found.
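As a side note beyond the tools named in the mail: if you already have a list of item IDs, Magnus's QuickStatements tool is another common way to batch-add such statements. It takes tab-separated lines of item, property, value; for example, marking the site linked earlier as owned by the National Trust would (hypothetically) be the single line:

```
Q4723912	P127	Q333515
```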
Regards,
Markus
Mike
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata