Some comments about the previous suggestions. I worked on extracting
coordinates and merging/picking best coordinates from the different
In addition to a wrong sign or confused easting/northing, as Paul pointed out,
frequent errors come from coordinates copied and pasted from other articles.
Detecting outliers helps, but using a fixed radius often doesn't work well. A
solution is to ignore coordinates whose summed distance to all the others is
significantly higher than the average of those sums.
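
To make the sum-of-distances idea concrete, here is a minimal Python sketch. The function names, the haversine distance, and the threshold factor of 1.5 are my assumptions for illustration only; "significantly higher" would need tuning on real data, and with only three candidates a single outlier sits right at that threshold.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; good enough for outlier detection


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def drop_outliers(points, factor=1.5):
    """Keep points whose summed distance to all others is not much above average.

    `factor` is a hypothetical tuning knob: a point is dropped when its sum
    of distances exceeds `factor` times the mean of all such sums.
    """
    if len(points) < 3:
        return list(points)  # too few candidates to tell who is wrong
    sums = [sum(haversine_km(p, q) for q in points) for p in points]
    mean_sum = sum(sums) / len(sums)
    if mean_sum == 0:
        return list(points)  # all candidates are identical
    return [p for p, s in zip(points, sums) if s <= factor * mean_sum]


# Illustrative values: three wikis roughly agree, one has the longitude sign flipped.
candidates = [(48.8110, 14.3150), (48.8127, 14.3175),
              (48.8105, 14.3146), (48.8110, -14.3150)]
kept = drop_outliers(candidates)
```

Unlike a fixed radius, this adapts to the spread of the candidates themselves, so it behaves the same for a village and for a large lake.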
A word of caution about Sven's idea to prefer the local-language Wikipedia. It works
well in your example (Netherlands / nlwiki) because that is clearly the
dominant language in this country, but in general it is not easy to
determine "the local language":
- An article about a large entity may span many language regions
- The entity may be located at the border of such regions (take the
Matterhorn peak, at the border of Italy and Switzerland)
- Countries or regions may have many official languages (e.g. Belgium, or
the Brussels region)
- Or they may have recognized regional languages (e.g. Spain: Spanish and
Catalan).
Determining a single local language is arbitrary in many cases.
Also, I found that for languages where the wiki is rather small (in
terms of number of articles or unique authors), coordinate quality is
usually no better than in large wikis (e.g. English). In a large wiki, the
probability that somebody notices and fixes wrong coordinates is higher.
Another challenge when merging coordinates from all languages to Wikidata
is that a single article can have several coordinates. E.g.:
- Main coordinates about the actual topic (usually with display=title, and
shown top-right), alongside other coordinates that are not about the article
topic itself but are mentioned inline when referring to something else (the
latter can be present alone too, without main coordinates). These inline
coordinates should not be considered when picking coordinates for Wikidata.
- Some templates require several coordinates, typically for linear features
such as rivers, which have source and mouth coordinates.
Dealing with these cases is required for useful results.
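
As a rough illustration of separating main from inline coordinates, here is a small Python sketch. It assumes the English Wikipedia {{coord}} template with a spelled-out display= parameter; other wikis use different templates and parameter names, so the regular expression and the function name title_coords are hypothetical. It does not handle nested templates or the multi-coordinate (source/mouth) templates mentioned above.

```python
import re

# Matches a simple, non-nested {{coord|...}} call (assumption: enwiki syntax).
COORD_RE = re.compile(r"\{\{\s*coord\s*\|([^{}]*)\}\}", re.IGNORECASE)


def title_coords(wikitext):
    """Return the positional parameters of each {{coord}} marked display=title."""
    results = []
    for match in COORD_RE.finditer(wikitext):
        params = [p.strip() for p in match.group(1).split("|")]
        # Named parameters look like "display=title"; the rest are positional.
        named = {}
        for p in params:
            if "=" in p:
                key, value = p.split("=", 1)
                named[key.strip().lower()] = value.strip().lower()
        # Assumption: a missing display= means the coordinate is inline only.
        flags = {d.strip() for d in named.get("display", "inline").split(",")}
        if "title" in flags:
            results.append([p for p in params if "=" not in p])
    return results


# Illustrative wikitext: one main coordinate and two inline mentions.
sample = ("{{Coord|45|58|35|N|7|39|31|E|type:mountain|display=inline,title}} "
          "lies near {{coord|46.02|7.75|display=inline}} and {{coord|46.0|7.7}}.")
main = title_coords(sample)
```

Only the first template in the sample is returned; the two inline mentions are ignored, which is the behaviour wanted when picking coordinates for Wikidata.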
On Thu, Jun 13, 2013 at 4:13 PM, Paul A. Houle <paul(a)ontology2.com> wrote:
You’ll do better dealing with bad coordinates if your system can
recognize how bad particular cases are.
The worst error I see in Wikipedia is that sometimes people get east
and west confused, so there is this mirror image of Europe reflected
across the U.K. You find cute little Czech towns out in the Atlantic with
the seamounts and the shipwrecks.
If you computed the bounding circle for the points, the radius would
be an indicator of the degree of confusion that would be highly effective
for smoking out the craziest “rogue points”.
There are other cases where reasonable parties could disagree about
the exact coordinates for things and it is not worth sweating it. Where
exactly is the state of Ohio or Lake Superior? These things are shapes,
not points. You couldn’t argue about an uncertainty radius of 10
kilometers for a point like that, in fact, the consumer system should
know that it can move those labels around a little bit to improve other
Wikidata-l mailing list