You’ll do better dealing with bad coordinates if your system can recognize how bad particular cases are.
The worst error I see in Wikipedia is that sometimes people get east and west confused, so there is this mirror image of Europe reflected across the U.K. You find cute little Czech towns out in the Atlantic with the seamounts and the shipwrecks.
If you computed the bounding circle for the points, its radius would be an indicator of the degree of confusion, and it would be highly effective for smoking out the craziest “rogue points”.
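A rough sketch of that spread measure, in Python. It does not compute the true minimum bounding circle; it uses half the largest pairwise great-circle distance, which is a simpler stand-in and a lower bound on the bounding-circle radius. The function names, the mean Earth radius constant, and the sample coordinates (roughly a Czech town plus its east/west mirror image) are assumptions made for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, good enough for a sanity check


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def spread_radius_km(points):
    """Half the largest pairwise distance (a lower bound on the bounding-circle radius)."""
    if len(points) < 2:
        return 0.0
    return max(
        haversine_km(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    ) / 2


# Hypothetical candidates for one entity, as collected from several language editions.
agreeing = [(49.95, 15.27), (49.948, 15.268)]
with_mirror = agreeing + [(49.95, -15.27)]  # longitude sign flipped

print(spread_radius_km(agreeing))
print(spread_radius_km(with_mirror))
```

A spread well under a kilometre is ordinary disagreement; a spread of a thousand kilometres or so means at least one candidate is badly wrong.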
There are other cases where reasonable parties could disagree about the exact coordinates for things and it is not worth sweating it. Where exactly is the state of Ohio or Lake Superior? These things are shapes, not points. You couldn’t argue about an uncertainty radius of 10 kilometers for a point like that. In fact, the consumer system should know that it can move those labels around a little bit to improve other layout metrics.
Some comments about the previous suggestions. I worked on extracting coordinates and merging/picking best coordinates from the different languages.
In addition to wrong signs or confused easting/northing, as pointed out by Paul, frequent errors come from copy/paste from other articles. Detecting outliers helps, but using a fixed radius often doesn't work well. A solution is to ignore any coordinate whose summed distance to all the others is significantly higher than the average of that sum over all candidates.
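Here is a minimal sketch of that sum-of-distances filter, in Python. The threshold `factor=1.5` and the function names are assumptions, not values from my actual extraction code, and "significantly higher" needs tuning: with a single outlier among n candidates the outlier's sum is only about n/2 times the average, so the test cannot separate anything when there are just two or three candidates.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def drop_outliers(points, factor=1.5):
    """Keep the points whose summed distance to all others is at most factor * average sum."""
    if len(points) < 3:
        return list(points)  # too few candidates to say which one is wrong
    totals = [sum(haversine_km(p, q) for q in points) for p in points]
    mean_total = sum(totals) / len(totals)
    return [p for p, total in zip(points, totals) if total <= factor * mean_total]


# Hypothetical candidates: four editions roughly agree, one has the longitude sign flipped.
candidates = [
    (49.95, 15.27),
    (49.948, 15.268),
    (49.951, 15.271),
    (49.949, 15.269),
    (49.95, -15.27),
]
kept = drop_outliers(candidates)
print(kept)
```

Because the threshold is relative to the candidates themselves, the same code works for a village and for a mountain range, which is where a fixed radius breaks down.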
A warning about Sven's idea to prefer the local-language Wikipedia. It works well in your example (Netherlands / nlwiki) because Dutch is clearly the dominant language in that country, but in general it is not easy to determine "the local language":
- An article about a large entity may span many language regions.
- The entity may be located at the border of such regions (take the Matterhorn, on the border of Italy and Switzerland).
- Countries or regions may have several official languages (e.g. Belgium, or the Brussels region).
- Or they may have recognized regional languages (e.g. Spain: Spanish and Catalan).
Determining a single local language is arbitrary in many cases.
Also, I found that for languages where the wiki is rather small (in terms of number of articles or unique authors), coordinate quality is usually no better than in large wikis (e.g. English). In a large wiki, the probability that somebody notices and fixes wrong coordinates is higher.
Another challenge when merging coordinates from all languages into Wikidata is that a single article can have several coordinates. E.g.:
- Main coordinates for the actual topic (usually with display=title, and shown top-right), alongside other coordinates that are not about the article topic itself but are mentioned inline when referring to something else (the latter can also be present alone, without main coordinates). These inline coordinates should not be considered when trying to pick coordinates for Wikidata.
- Some templates require several coordinates, typically for linear features such as rivers, which have source and mouth coordinates.
Dealing with these cases is required for useful results.
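To illustrate the first case, here is a rough Python sketch that separates title coordinates from inline ones in raw wikitext. It is only an illustration under several assumptions: it looks for plain `{{coord|...}}` calls with a regular expression, it treats a missing `display` parameter as inline, and it returns the positional parameters untouched instead of parsing them. It does not handle nested templates, infobox-specific parameters, or the source/mouth case for rivers. The function name and the sample text are made up.

```python
import re

# Matches simple {{coord|...}} calls that contain no nested templates.
COORD_RE = re.compile(r"\{\{\s*coord\s*\|([^{}]*)\}\}", re.IGNORECASE)


def split_coord_templates(wikitext):
    """Return (title_coords, inline_coords), each a list of positional-parameter lists."""
    title, inline = [], []
    for match in COORD_RE.finditer(wikitext):
        params = [p.strip() for p in match.group(1).split("|")]
        named = {}
        positional = []
        for p in params:
            if "=" in p:
                key, value = p.split("=", 1)
                named[key.strip().lower()] = value.strip().lower()
            else:
                positional.append(p)
        # Assumption: no display parameter means the coordinates are inline only.
        display = named.get("display", "inline")
        if "title" in display:
            title.append(positional)
        else:
            inline.append(positional)
    return title, inline


sample = (
    "{{Coord|49.95|15.27|display=title}} "
    "The town lies east of Prague ({{coord|50.08|14.43}})."
)
title_coords, inline_coords = split_coord_templates(sample)
print(title_coords, inline_coords)
```

Only the entries in `title_coords` would be candidates for Wikidata; an article that yields nothing but inline coordinates should contribute nothing.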
On Thu, Jun 13, 2013 at 4:13 PM, Paul A. Houle paul@ontology2.com wrote: