Hi Sebastian,

>Is there a list of geodata issues, somewhere? Can you give some example? 

My main "pain" points:

- the cebuano geo duplicates:
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2017/10#Cebuano
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2018/06#A_proposed_course_of_action_for_dealing_with_cebwiki/svwiki_geographic_duplicates

- detecting "anonymous" edits of the Wikidata labels from the Wikidata JSON dumps. As far as I know, this is currently impossible; there is no such information in the JSON dump, so I can't create a score.
  This is a similar problem to the original post (~ quality score), but I would like to use the original editing history and implement/tune my own scoring algorithm.

  When somebody (trolls) renames city names, my matching algorithm does not find them,
  and in these cases I could use the previous "better" state of Wikidata.
  This is also important for merging OpenStreetMap place names with Wikidata labels for end users.

 

> Do you have a reference dataset as well, or would that be NaturalEarth itself?

Sorry, I don't have a reference dataset, and NaturalEarth is only a subset of "reality"; it does not contain all cities, rivers, ...
But maybe you can use OpenStreetMap as the best resource.
Sometimes I am also adding Wikidata concordances to the https://www.whosonfirst.org/ (WOF) gazetteer, but that data originated mostly from similar sources (GeoNames, ...), so it can't be used as a quality indicator.

If you need an easy example, "airports" are probably a good start for checking Wikidata completeness.
(p238_iata_airport_code ; p239_icao_airport_code ; p240_faa_airport_code ; p931_place_served ;  p131_located_in )
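Such a completeness check could be sketched roughly like this (a toy pure-Python version; the column names follow the list above, but the data and the helper are illustrative assumptions, not my actual scripts):

```python
# Sketch: per-property completeness of airport identifiers in a local
# extract of Wikidata airport items (each item is a dict of properties).
AIRPORT_PROPS = [
    "p238_iata_airport_code",
    "p239_icao_airport_code",
    "p240_faa_airport_code",
    "p931_place_served",
    "p131_located_in",
]

def completeness(items):
    """Return, for each property, the fraction of items where it is filled."""
    total = len(items)
    return {p: sum(1 for it in items if it.get(p)) / total
            for p in AIRPORT_PROPS}

# Toy data: a big airport with full codes, a small one without an IATA code.
airports = [
    {"p238_iata_airport_code": "BUD", "p239_icao_airport_code": "LHBP"},
    {"p239_icao_airport_code": "LHSM"},
]
scores = completeness(airports)
```

Properties with a low fill ratio are then candidates for closer manual checking against an external airport list.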

> What would help you to measure completeness for adding concordances to NaturalEarth?

I have created my own tools/scripts, because waiting for the community to fix the cebwiki data problems takes a lot of time.

I am importing the Wikidata JSON dumps into PostGIS (SPARQL is not flexible/scalable enough for geo matching),
- adding some scoring based on cebwiki/srwiki/...,
- creating some sheets for manual checking,
but this process is like a ~ "fuzzy left join", with lots of hacky code and manual tuning.
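The "fuzzy left join" idea could be sketched like this (a pure-Python toy version of the PostGIS matching; the thresholds and field names are made-up assumptions, only the structure "name similarity + coordinate distance, keep the best match or None" reflects the process above):

```python
import difflib
import math

def name_sim(a, b):
    """Normalized string similarity in [0, 1]."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def dist_km(lat1, lon1, lat2, lon2):
    """Rough equirectangular distance in km; fine for short ranges."""
    kx = 111.32 * math.cos(math.radians((lat1 + lat2) / 2))
    return math.hypot((lat1 - lat2) * 110.57, (lon1 - lon2) * kx)

def fuzzy_left_join(left, right, max_km=25.0, min_sim=0.8):
    """For every left record keep the best-scoring right match, or None.

    A None match means: this record needs manual debugging.
    """
    out = []
    for l in left:
        best, best_score = None, 0.0
        for r in right:
            if dist_km(l["lat"], l["lon"], r["lat"], r["lon"]) > max_km:
                continue  # too far away to be the same place
            s = name_sim(l["name"], r["name"])
            if s >= min_sim and s > best_score:
                best, best_score = r, s
        out.append((l, best))
    return out
```

In practice the distance filter would be a PostGIS spatial index query instead of the inner loop, but the scoring/tuning problem is the same.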

If I don't find a NaturalEarth/WOF object in Wikidata, then I have to debug manually.
The most common problems are:
- different transliterations / spellings / English vs. local names ...
- some trolling by anonymous users (mostly from mobile phones),
- problems with GPS coordinates,
- changes in the real data (cities joining/splitting), so a lot of background research is needed.

best,
Imre

Sebastian Hellmann <hellmann@informatik.uni-leipzig.de> wrote (on Wed, 28 Aug 2019, 11:11):

Hi Imre,

we can encode these rules using the JSON MongoDB database we created in the GlobalFactSync project (https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE), which serves as the basis for the GFS Data Browser. The database has open read access.

Is there a list of geodata issues, somewhere? Can you give some example? GFS focuses on both: overall quality measures and very domain specific adaptations. We will also try to flag these issues for Wikipedians.

So I see that there is some notion of what is good and what is not, by source. Do you have a reference dataset as well, or would that be NaturalEarth itself? What would help you to measure completeness for adding concordances to NaturalEarth?

-- Sebastian

On 24.08.19 21:26, Imre Samu wrote:
For geodata (human settlements/rivers/mountains/...) (with GPS coordinates) my simple rules are:
- if it has a "local wikipedia page" or any big-language ["EN/FR/PT/ES/RU/.."] wikipedia page, then it is OK
- if it is only in "cebuano" AND outside of the "cebuano BBOX" -> then ... this is lower quality
- only {shwiki+srwiki} AND outside of the "sh"&"sr" BBOX -> this is lower quality
- only {huwiki} AND outside of the CentralEuropeBBOX -> this is lower quality
- geodata without a GPS coordinate -> ...
- ...
So my rules are based on wikipedia pages and language areas, and I prefer wikidata items with local wikipedia pages.
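The rule structure above might be sketched in code like this (the bounding boxes and labels are rough placeholders, not the real values; only the decision order mirrors the list):

```python
# Toy bounding boxes as (min_lon, min_lat, max_lon, max_lat); placeholder values!
BBOX = {
    "ceb": (116.0, 4.0, 127.0, 21.0),  # very rough Philippines box
    "hu":  (5.0, 44.0, 30.0, 55.0),    # very rough Central Europe box
}
BIG_LANGS = {"en", "fr", "pt", "es", "ru"}

def in_bbox(lon, lat, box):
    min_lon, min_lat, max_lon, max_lat = box
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

def quality(item):
    """Coarse quality label for a geo item.

    item: {"sitelinks": set of wiki language codes, "lon": float|None,
           "lat": float|None}
    """
    links = item["sitelinks"]
    lon, lat = item.get("lon"), item.get("lat")
    if lon is None or lat is None:
        return "no-coordinates"
    if links & BIG_LANGS:
        return "ok"  # has a big-language wikipedia page
    if links == {"ceb"} and not in_bbox(lon, lat, BBOX["ceb"]):
        return "lower-quality"
    if links == {"hu"} and not in_bbox(lon, lat, BBOX["hu"]):
        return "lower-quality"
    return "unscored"
```

The sh/sr rule and the "local wikipedia page" preference would be added the same way, one rule per line.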

This is based on my experience - adding Wikidata ID concordances to NaturalEarth ( https://www.naturalearthdata.com/blog/ )
--
All the best,
Sebastian Hellmann

Director of Knowledge Integration and Linked Data Technologies (KILT) Competence Center
at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org, http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org