On 03.06.2015 22:11, Mike Cummins wrote:
> I apologise in advance for asking this in what seems a Technical
> List, but @pigsonthewing tells me I can.
You are welcome.
> I want to extract all National Trust names, co-ordinates and the
> first part of the description from wiki.
I am afraid that the data you are looking for may not be complete. For
example, have a look at:
This is a National Trust site, but Wikidata does not contain any
statement that says so. So whatever technical means you use, it
will not return this item (yet).
> Having spent a day working through the API, I still cannot see how
> to do this aside from scraping.
> I am sure I am missing something very simple, so if anyone would
> kindly email me, I would be grateful.
There are many ways. One is to use one of the SPARQL endpoints. Here I
am using the new experimental one run by the WMF:
This query shows all things owned by National Trust, with English label,
description, and coordinates (where available):
Click "execute" to run it. There are also ways to get this result in
other formats (not embedded in HTML), but I can't find the details right now.
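I don't have the original query link, but a sketch of what such a query might look like is below, assuming "owned by" (P127), "National Trust" (Q333515) and "coordinate location" (P625); the exact prefixes depend on the endpoint's RDF scheme:

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <http://schema.org/>

SELECT ?item ?label ?description ?coord
WHERE {
  ?item wdt:P127 wd:Q333515 .           # owned by: National Trust
  OPTIONAL { ?item rdfs:label ?label .
             FILTER(LANG(?label) = "en") }
  OPTIONAL { ?item schema:description ?description .
             FILTER(LANG(?description) = "en") }
  OPTIONAL { ?item wdt:P625 ?coord }    # coordinate location, where available
}
```

The OPTIONAL blocks make sure items without an English label, description, or coordinates still show up in the results.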
The above query may not be the right one (it has just 200 results). Here
is another one, for all things with a number in the National Heritage List:
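Again the original query link is missing; a hedged sketch, assuming the property meant is P1216 (National Heritage List for England number), might be:

```sparql
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?item ?id
WHERE {
  # P1216: National Heritage List for England number (my assumption)
  ?item wdt:P1216 ?id .
}
```

Since every matched item has the ID by definition, no OPTIONAL is needed here, which also makes the query cheaper to run.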
This time it's from another SPARQL endpoint, as you can see. As opposed
to the previous query, this one has tens of thousands of results. No
SPARQL endpoint I tried manages to return all of them before the
timeout. But the one I linked above manages significantly more than the
other one (30k still worked for me, while the WMF experimental endpoint
currently times out even at 10k -- the service is running on a virtual
machine that is not very powerful right now; this will change soon). The
downside is that this endpoint is not updated every minute but only
every month or so, which means that you won't see the current data.
Actually, a slightly simpler query does work for me on the WMF endpoint.
However, be warned that it returns more than 50k results.
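For getting results in a machine-readable format rather than the HTML view, most SPARQL endpoints can be queried over plain HTTP. A minimal Python sketch that builds such a request URL; the endpoint address and the `format` parameter are assumptions about the service, not taken from this mail:

```python
from urllib.parse import urlencode

# Assumed endpoint URL; substitute whichever SPARQL endpoint you are using.
ENDPOINT = "https://query.wikidata.org/sparql"

def build_request_url(query: str) -> str:
    """Build a GET URL asking the endpoint for JSON instead of HTML.

    Many SPARQL endpoints accept a 'format' parameter (or an Accept
    header such as application/sparql-results+json) to select the
    output format.
    """
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})

query = "SELECT ?item WHERE { ?item wdt:P127 wd:Q333515 } LIMIT 10"
print(build_request_url(query))
```

The returned JSON can then be fetched with any HTTP client and parsed into the name/coordinate/description fields you need.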
The two query services are based on slightly different versions of the
RDF data, hence the slightly different queries. This will all be unified
as the RDF work continues. Anyway, I hope these queries give you a
first idea, and that you can play with them a bit to find other things
(they look technical at first, but in the end you can just copy the
patterns and change Q-ids or P-ids as you need).
The key problem in your case might be that the data you need is not
really in Wikidata yet. You could look at the autolist2 tool from Magnus
as an option to complete the data based on Wikipedia categories etc. (if
the information you need is there). It is a very efficient tool for
adding large numbers of statements in little time. For something as
simple as adding "owned by National Trust" ("P127 Q333515") this might
work well. If you need more elaborate data, such as some kind of
official IDs for each site, then someone who has a bot may be able to
help you if you know of a source where this data can be found.
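For a bot route, statements are created through the Wikidata web API (action `wbcreateclaim`). A small sketch of building the value payload such a call expects for an item-valued property; this reflects my understanding of the public API, not anything stated in this mail:

```python
import json

def make_item_snak_value(qid: str) -> str:
    """Build the JSON 'value' payload that wbcreateclaim expects for a
    property whose value is another item (wikibase-item datatype)."""
    # The API wants the numeric part of the Q-id, not the full string.
    return json.dumps({"entity-type": "item", "numeric-id": int(qid.lstrip("Q"))})

# "owned by" (P127) -> "National Trust" (Q333515), as suggested above
print(make_item_snak_value("Q333515"))
```

The actual call would send this as the `value` parameter together with `entity` (the item being edited), `property=P127`, `snaktype=value`, and an edit token; a framework like pywikibot wraps all of that for you.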
Wikidata mailing list