Hi Sebastian,
Is there a list of geodata issues, somewhere? Can you give some example?
My main "pain" points:
- the Cebuano geo duplicates: https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2017/10#Cebuano https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2018/06#A_propos...
- detecting "anonymous" edits to Wikidata labels from the Wikidata JSON dumps. As far as I know this is impossible right now: there is no such information in the JSON dump, so I can't create a score. This is a similar problem to the original post (~ quality score), but I would like to use the original editing history and implement/tune my own scoring algorithm.
When somebody (trolls) renames city names, my matching algorithm doesn't find them, and in those cases I could use the previous, "better" state of Wikidata. It is also important for merging OpenStreetMap place names with Wikidata labels for end users.
Do you have a reference dataset as well, or would that be NaturalEarth itself?
Sorry, I don't have a reference dataset, and NaturalEarth is only a subset of "reality": it does not contain all cities, rivers, ... But maybe you can use OpenStreetMap as the best resource. Sometimes I match and add Wikidata concordances to the https://www.whosonfirst.org/ (WOF) gazetteer, but that data mostly originates from similar sources (GeoNames, ...), so it can't be used as a quality indicator.
If you need an easy example, "airports" are probably a good start for checking Wikidata completeness (p238_iata_airport_code; p239_icao_airport_code; p240_faa_airport_code; p931_place_served; p131_located_in).
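To illustrate the kind of completeness check I mean: a minimal sketch, assuming the airport items are already parsed from a Wikidata JSON dump into simplified dicts mapping property ID to a list of values (this is not the full dump format, and the sample records below are made up).

```python
# Per-property completeness for airport items.
# Simplified shape: {property ID -> list of claim values}.
AIRPORT_PROPS = {
    "P238": "iata_airport_code",
    "P239": "icao_airport_code",
    "P240": "faa_airport_code",
    "P931": "place_served",
    "P131": "located_in",
}

def completeness(entities):
    """Return {property name: fraction of entities with at least one value}."""
    total = len(entities)
    if total == 0:
        return {name: 0.0 for name in AIRPORT_PROPS.values()}
    report = {}
    for pid, name in AIRPORT_PROPS.items():
        have = sum(1 for e in entities if e.get(pid))
        report[name] = have / total
    return report

# Toy input: two airports, the first missing its FAA code.
airports = [
    {"P238": ["BUD"], "P239": ["LHBP"], "P931": ["Q1781"], "P131": ["Q189"]},
    {"P238": ["JFK"], "P239": ["KJFK"], "P240": ["JFK"],
     "P931": ["Q60"], "P131": ["Q11299"]},
]
print(completeness(airports))
```

The same aggregation is of course easy to express as a GROUP BY once the dump is in a database.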
What would help you to measure completeness for adding concordances to NaturalEarth?
I have created my own tools/scripts, because waiting for the community to fix the cebwiki data problems takes a lot of time.
I am importing the Wikidata JSON dumps into PostGIS (SPARQL is not flexible/scalable enough for geo matching), adding some scoring based on cebwiki/srwiki/..., and creating some sheets for manual checking. But this process is like a ~"fuzzy left join", with lots of hacky code and manual tuning.
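In spirit, the "fuzzy left join" looks like the sketch below: every NaturalEarth place keeps exactly one output row, matched to the best Wikidata candidate by name similarity among candidates within a coordinate tolerance. The records, thresholds, and QIDs here are illustrative only; in practice this runs as SQL inside PostGIS with proper geodesic distance.

```python
from difflib import SequenceMatcher

def close_enough(a, b, max_deg=0.5):
    """Crude degree-box test on (lat, lon); a stand-in for real PostGIS distance."""
    return abs(a[0] - b[0]) <= max_deg and abs(a[1] - b[1]) <= max_deg

def name_score(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_left_join(ne_places, wd_places, min_score=0.8):
    """Left join: each NE place appears once, with a Wikidata ID or None."""
    out = []
    for ne in ne_places:
        best, best_score = None, 0.0
        for wd in wd_places:
            if not close_enough(ne["coord"], wd["coord"]):
                continue
            s = name_score(ne["name"], wd["label"])
            if s > best_score:
                best, best_score = wd["qid"], s
        out.append((ne["name"], best if best_score >= min_score else None))
    return out

# Illustrative records (the QID is a placeholder, not verified):
ne = [{"name": "Szeged", "coord": (46.25, 20.14)},
      {"name": "Atlantis", "coord": (0.0, 0.0)}]
wd = [{"qid": "Q81581", "label": "Szeged", "coord": (46.253, 20.141)}]
print(fuzzy_left_join(ne, wd))
```

The unmatched rows (the `None`s) are exactly the ones that end up on the manual-checking sheets.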
If I don't find some NaturalEarth/WOF object in Wikidata, then I have to debug it manually. The most common problems are:
- different transliterations / spellings / English vs. local names ...
- some trolling by anonymous users (mostly from mobile phones)
- problems with GPS coordinates
- changes in the real data (cities joining/splitting)
so it needs a lot of background research.
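For the spelling/diacritic part of the problem, a minimal normalisation sketch, assuming plain Unicode labels: strip accents and case before comparing, so variants like "Gyor" vs. "Győr" still match. Transliteration between scripts (Cyrillic vs. Latin, etc.) needs an extra step not shown here.

```python
import unicodedata

def normalize(name):
    # NFKD separates base letters from combining accents;
    # drop the combining marks, then casefold and trim.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold().strip()

print(normalize("Győr") == normalize("Gyor"))  # True: matches despite the accent
```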
best, Imre
Sebastian Hellmann hellmann@informatik.uni-leipzig.de wrote (on Wed, 28 Aug 2019, 11:11):
Hi Imre,
we can encode these rules using the JSON MongoDB database we created in the GlobalFactSync project (https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE), as the basis for the GFS Data Browser. The database has open read access.
Is there a list of geodata issues somewhere? Can you give some examples? GFS focuses on both: overall quality measures and very domain-specific adaptations. We will also try to flag these issues for Wikipedians.
So I see that there is some notion of what is good and what is not, by source. Do you have a reference dataset as well, or would that be NaturalEarth itself? What would help you to measure completeness for adding concordances to NaturalEarth?
-- Sebastian

On 24.08.19 21:26, Imre Samu wrote:
For geodata (human settlements/rivers/mountains/...) (with GPS coordinates), my simple rules:
- if it has a "local Wikipedia page" or any big-language ["EN/FR/PT/ES/RU/.."] Wikipedia page, then it is OK.
- if it is only in "cebuano" AND outside of the "cebuano BBOX" -> then ... this is lower quality
- only {shwiki+srwiki} AND outside of the "sh"&"sr" BBOX -> this is lower quality
- only {huwiki} AND outside of the CentralEuropeBBOX -> this is lower quality
- geodata without a GPS coordinate -> ...
- ....
So my rules are based on Wikipedia pages and language areas ..., and I prefer Wikidata items with local Wikipedia pages.
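The rules above can be sketched roughly as follows. The bounding boxes and score labels are made-up placeholders that would need tuning, and the "local Wikipedia page" check is simplified to the small-wiki/home-area test:

```python
# Sitelink + BBOX scoring sketch; boxes are rough illustrative rectangles only.
BIG_WIKIS = {"enwiki", "frwiki", "ptwiki", "eswiki", "ruwiki"}

# (min_lat, min_lon, max_lat, max_lon)
HOME_BBOX = {
    "cebwiki": (4.0, 116.0, 21.0, 127.0),   # Philippines-ish
    "shwiki":  (41.0, 13.0, 47.0, 24.0),    # Balkans-ish
    "srwiki":  (41.0, 13.0, 47.0, 24.0),
    "huwiki":  (44.0, 9.0, 51.0, 27.0),     # Central-Europe-ish
}

def in_bbox(coord, bbox):
    lat, lon = coord
    return bbox[0] <= lat <= bbox[2] and bbox[1] <= lon <= bbox[3]

def score(item):
    """item = {"sitelinks": set of wiki IDs, "coord": (lat, lon) or None}."""
    links, coord = item["sitelinks"], item["coord"]
    if coord is None:
        return "no-coordinates"   # geodata without a GPS coordinate -> flag it
    if links & BIG_WIKIS:
        return "ok"               # any big-language Wikipedia page
    # only small wikis, and every one is outside its home area -> suspicious
    if links and all(w in HOME_BBOX and not in_bbox(coord, HOME_BBOX[w])
                     for w in links):
        return "lower-quality"
    return "ok"

# A cebwiki-only item located in Scandinavia is flagged:
print(score({"sitelinks": {"cebwiki"}, "coord": (59.3, 18.1)}))
```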
This is based on my experience of adding Wikidata ID concordances to NaturalEarth (https://www.naturalearthdata.com/blog/).
-- All the best, Sebastian Hellmann
Director of Knowledge Integration and Linked Data Technologies (KILT) Competence Center at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org, http://linguistics.okfn.org, http://www.w3.org/community/ld4lt
Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org