Hi Sebastian,
> Is there a list of geodata issues, somewhere? Can you
> give some example?
My main "pain" points:
- the Cebuano geo duplicates:
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2017/10#Cebuano
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2018/06#A_propo…
- detecting anonymous edits of Wikidata labels from the Wikidata JSON
dumps. As far as I know this is currently impossible: there is no such
information in the JSON dump, so I can't create a score.
This is a similar problem to the original post (~ quality score),
but I would like to use the full editing history and
implement/tune my own scoring algorithm.
When somebody (a troll) renames a city, my matching
algorithm can't find it,
and in those cases I could fall back to the previous, "better" state of Wikidata.
This is also important for merging OpenStreetMap place names with Wikidata
labels for end users.
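Such a fallback could be sketched roughly like this (my assumptions: revision dicts shaped like the MediaWiki API output with `rvprop=ids|user`, newest first, where anonymous (IP) edits carry an "anon" key; the function name is made up):

```python
# Sketch: pick the newest revision of an item that was NOT made anonymously,
# i.e. the previous "better" state to fall back to when a label looks trolled.
# Assumes revisions as returned by the MediaWiki API
# (action=query&prop=revisions&rvprop=ids|user), ordered newest first,
# where anonymous (IP) edits carry an "anon" key.
def last_non_anonymous(revisions):
    for rev in revisions:
        if "anon" not in rev:
            return rev  # newest revision by a logged-in user
    return None  # every known revision was anonymous
```

This only works against the live revision history (or the full-history dumps), not against the plain JSON entity dumps, which is exactly the gap described above.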
> Do you have a reference dataset as well, or would that
> be NaturalEarth itself?
Sorry, I don't have a reference dataset, and NaturalEarth is only a
subset of "reality": it doesn't contain all cities, rivers, ...
But maybe you can use OpenStreetMap as the best resource.
Sometimes I add Wikidata concordances to the
https://www.whosonfirst.org/ (WOF) gazetteer, but that data originates
mostly from similar sources (GeoNames, ...), so it can't be used as a
quality indicator.
If you need an easy example, "airports" are probably a good start for
checking Wikidata completeness.
(p238_iata_airport_code ; p239_icao_airport_code ; p240_faa_airport_code
; p931_place_served ; p131_located_in)
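A rough sketch of such a completeness check (assuming entities parsed from the Wikidata JSON dump, where each entity has a "claims" dict keyed by property ID; the property list and function name are my own):

```python
# Sketch: fraction of the airport-related properties present on one entity
# parsed from the Wikidata JSON dump, where "claims" maps property IDs
# (P238 IATA, P239 ICAO, P240 FAA, P931 place served, P131 located in)
# to lists of statements.
AIRPORT_PROPS = ["P238", "P239", "P240", "P931", "P131"]

def airport_completeness(entity: dict) -> float:
    claims = entity.get("claims", {})
    # Require at least one statement per property (empty lists don't count).
    present = sum(1 for p in AIRPORT_PROPS if claims.get(p))
    return present / len(AIRPORT_PROPS)
```

Aggregated over all airport items, this gives a simple per-property completeness score to compare against an external airport list.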
> What would help you to measure completeness for adding
> concordances to NaturalEarth?
I have created my own tools/scripts, because waiting for the community
to fix the cebwiki data problems takes a lot of time.
I import the Wikidata JSON dumps into PostGIS (SPARQL is not
flexible/scalable enough for geo matching),
- adding some scoring based on cebwiki/srwiki ...
- creating some sheets for manual checking.
But this process is like a "fuzzy left join", with a lot of hacky
code and manual tuning.
If I don't find a NaturalEarth/WOF object in Wikidata, then I have
to debug it manually.
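A toy version of that "fuzzy left join" (the field names, the 25 km threshold, and the equirectangular distance shortcut are all my simplifications, not the real pipeline):

```python
import math
import unicodedata

def norm(name: str) -> str:
    # Case-fold and strip accents so e.g. "Kraków" matches "Krakow".
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def close_enough(a, b, max_km=25.0):
    # Rough equirectangular distance in km; good enough at city scale.
    lat = math.radians((a[0] + b[0]) / 2)
    dy = (a[0] - b[0]) * 111.32
    dx = (a[1] - b[1]) * 111.32 * math.cos(lat)
    return math.hypot(dx, dy) <= max_km

def fuzzy_left_join(ne_places, wd_places):
    # "Left join": every NaturalEarth place appears exactly once; the
    # unmatched ones (None) are the rows that need manual debugging.
    out = []
    for p in ne_places:
        match = next((w for w in wd_places
                      if norm(w["label"]) == norm(p["name"])
                      and close_enough((p["lat"], p["lon"]),
                                       (w["lat"], w["lon"]))),
                     None)
        out.append((p["name"], match["qid"] if match else None))
    return out
```

In practice the name comparison also has to handle transliterations and local vs. English names, which is where most of the manual tuning goes.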
The biggest problems are:
- different transliterations / spellings / English vs. local names ...
- trolling by anonymous users (mostly from mobile phones),
- problems with GPS coordinates,
- changes in the real world (cities merging/splitting), which need a lot of
background research.
best,
Imre
Sebastian Hellmann <hellmann(a)informatik.uni-leipzig.de> wrote (on
Wed, 28 Aug 2019, 11:11):
> Hi Imre,
>
> we can encode these rules using the JSON MongoDB database we created in
> the GlobalFactSync project
> (https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE),
> as the basis for the GFS Data Browser. The database has open read access.
>
> Is there a list of geodata issues, somewhere? Can you give some example?
> GFS focuses on both: overall quality measures and very domain specific
> adaptations. We will also try to flag these issues for Wikipedians.
>
> So I see that there is some notion of what is good and what is not, by source.
> Do you have a reference dataset as well, or would that be NaturalEarth
> itself? What would help you to measure completeness for adding concordances
> to NaturalEarth?
>
> -- Sebastian
> On 24.08.19 21:26, Imre Samu wrote:
>
> For geodata (human settlements/rivers/mountains/...) (with GPS
> coordinates) my simple rules are:
> - if it has a local Wikipedia page or any big-language
> [EN/FR/PT/ES/RU/...] Wikipedia page, then it is OK.
> - if it is only in "cebuano" AND outside of the "cebuano BBOX" -> then
> this is lower quality
> - only {shwiki+srwiki} AND outside of the "sh" & "sr" BBOX -> this is
> lower quality
> - only {huwiki} AND outside of the CentralEuropeBBOX -> this is lower quality
> - geodata without GPS coordinates -> ...
> - ....
> So my rules are based on Wikipedia pages and language areas ... and I prefer
> Wikidata items with local Wikipedia pages.
>
> This is based on my experience adding Wikidata ID concordances to
> NaturalEarth (https://www.naturalearthdata.com/blog/).
>
> --
> All the best,
> Sebastian Hellmann
>
> Director of Knowledge Integration and Linked Data Technologies (KILT)
> Competence Center
> at the Institute for Applied Informatics (InfAI) at Leipzig University
> Executive Director of the DBpedia Association
> Projects: http://dbpedia.org, http://nlp2rdf.org,
> http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
> Homepage: http://aksw.org/SebastianHellmann
> Research Group: http://aksw.org
>