sitelinks / I want to use the data to help rank possible text entity links to Wikidata items
Side note:
I am helping the https://www.naturalearthdata.com/ project by adding
Wikidata concordances. It is a public-domain geo-database with
mountains, rivers, populated places, ...
I am using the Wikidata JSON dumps and importing them into a PostGIS
database. And I am ranking the matches by:
- distance (lower is better),
- text similarity (I am checking the "labels" and the "aliases"),
- and sitelinks!
And I am lowering the ranks of the mostly bot-imported sitelinks
("cebwiki", ...).
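A minimal sketch of this kind of ranking, assuming some illustrative
weights and function names (this is not the actual concordance code;
`IMPORTED_WIKIS`, the 0.1 down-weight, and the combination formula are
my assumptions):

```python
import math

# Wikis whose geo articles were largely bot-imported, so their
# sitelinks add little independent evidence ("cebwiki" is from the
# text above; extend the set as needed).
IMPORTED_WIKIS = {"cebwiki"}

def sitelink_score(sitelinks):
    """Count sitelinks, down-weighting mostly bot-imported wikis."""
    score = 0.0
    for wiki in sitelinks:
        score += 0.1 if wiki in IMPORTED_WIKIS else 1.0
    return score

def rank(distance_km, name_similarity, sitelinks):
    """Combine the three signals; higher is better.

    distance_km: lower is better, so it is inverted.
    name_similarity: 0..1 from comparing labels and aliases.
    sitelinks: list of wiki codes the item is linked from.
    """
    return (1.0 / (1.0 + distance_km)
            + name_similarity
            + math.log1p(sitelink_score(sitelinks)))

# A nearby, well-matched item with only an imported sitelink ranks
# below the same item with a regular sitelink.
print(rank(0.5, 0.9, ["cebwiki"]) < rank(0.5, 0.9, ["dewiki"]))
```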
Why?
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2017/08#Nonsens…
Because a lot of geodata was re-imported, and the "distance" and
"text/labels" signals come out the same. So be careful with imported
Wikipedia pages (sitelinks)!
As I see it, the geodata quality is much better mostly where an active
Wikidata community is cleaning it up. This is just one example of why
the simple "sitelinks" number is not enough :-)
On the other hand, the P625 coordinate location is probably also
important:
https://www.wikidata.org/wiki/Property:P625
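For illustration, pulling P625 out of one entity from the Wikidata JSON
dump might look like this (the key paths follow the documented dump
format; the sample entity is made up):

```python
def get_p625(entity):
    """Return (lat, lon) from the entity's first P625 claim, or None.

    Walks the standard Wikidata JSON dump structure:
    claims -> P625 -> mainsnak -> datavalue -> value.
    """
    for claim in entity.get("claims", {}).get("P625", []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") == "value":
            v = snak["datavalue"]["value"]
            return (v["latitude"], v["longitude"])
    return None

# Made-up minimal entity with one coordinate claim.
entity = {"claims": {"P625": [{"mainsnak": {
    "snaktype": "value",
    "datavalue": {"value": {"latitude": 47.5, "longitude": 19.05}}}}]}}
print(get_p625(entity))  # -> (47.5, 19.05)
```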
In Germany, "dewiki" ranks higher; in Hungary, "huwiki" is preferred.
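One way to sketch that regional preference, assuming a hypothetical
country-to-wiki table and bonus value (both are my assumptions, not the
actual code):

```python
# Which local-language wiki to favor per country code (illustrative).
LOCAL_WIKI = {"DE": "dewiki", "HU": "huwiki"}

def local_bonus(country_code, sitelinks, bonus=2.0):
    """Extra score if the item has a sitelink on the country's local wiki."""
    wiki = LOCAL_WIKI.get(country_code)
    return bonus if wiki and wiki in sitelinks else 0.0

print(local_bonus("HU", ["huwiki", "enwiki"]))  # -> 2.0
print(local_bonus("HU", ["enwiki"]))            # -> 0.0
```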
Kind Regards,
Imre
<finin(a)umbc.edu> wrote (on Tue, 22 Mar 2022, 22:25):
Is there a simple way to get the sitelinks count data for all Wikidata
items? I want to use the data to help rank possible text entity links to
Wikidata items.
I'm really only interested in counts for items that have at least one
sitelink (i.e., a wikibase:sitelinks value that's > 0). According to
statistics I've seen, only about 1/3 of Wikidata items have at least one
sitelink.
I'm not sure if wikibase:sitelinks is included in the standard Wikidata
dump. I could try a SPARQL query with an OFFSET and LIMIT, but I doubt
that approach would work to completion.
_______________________________________________
Wikidata mailing list -- wikidata(a)lists.wikimedia.org
To unsubscribe send an email to wikidata-leave(a)lists.wikimedia.org