sitelinks / I want to use the data to help rank possible text entity
links to Wikidata items
Side note: I am helping the Natural Earth project (https://www.naturalearthdata.com/) by adding Wikidata concordances. It is a public domain geo-database with mountains, rivers, populated places, and so on. I am using the Wikidata JSON dumps and importing them into a PostGIS database.

I rank the matches by:
- distance (lower is better)
- text similarity (I check the "labels" and the "aliases")
- and sitelinks!
I also lower the rank of the "mostly imported" sitelinks ("cebwiki", ...). Why? See: https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2017/08#Nonsense...

Because a lot of geodata was re-imported, the "distance" and "text/labels" are the same for those items, so they do not help to tell candidates apart. So be careful with the imported Wikipedia pages (sitelinks)! Now, as far as I can see, the geodata quality is much better, mostly where the active Wikidata community is cleaning it up.
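
A minimal sketch of how such a discounted sitelink count could be computed from one line of the JSON dump. It assumes the usual dump layout (one entity per line inside a JSON array, each line ending with a comma, sitelinks stored under the "sitelinks" key). The function names and the 0.1 weight are hypothetical and only illustrate the idea; they are not the values I actually use.

```python
import json

# Hypothetical discount weights; "cebwiki" is the only wiki named in this thread.
DISCOUNT = {"cebwiki": 0.1}

def parse_dump_line(line):
    """Parse one line of the JSON dump; return None for the array brackets."""
    line = line.strip().rstrip(",")
    if line in ("[", "]", ""):
        return None
    return json.loads(line)

def weighted_sitelinks(entity, discount=DISCOUNT):
    """Count sitelinks, with discounted wikis contributing less than 1 each."""
    return sum(discount.get(site, 1.0) for site in entity.get("sitelinks", {}))
```

Items without any sitelink simply get 0, so they can be skipped when only items with at least one sitelink are of interest.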
It is just an example of why the simple "sitelinks" number is not enough :-)

On the other hand, the P625 coordinate location is probably also important: https://www.wikidata.org/wiki/Property:P625
In Germany, "dewiki" ranks higher; in Hungary, "huwiki" is preferred.
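
To make the ranking idea concrete, here is a rough sketch that combines distance, text similarity, a sitelink count (possibly discounted for mostly imported wikis), and a bonus for the local-language wiki. Everything in it is an assumption for illustration: the field names, the weights 0.1 and 0.5, and the use of difflib for text similarity are hypothetical, and my real implementation runs inside PostGIS.

```python
import math
from difflib import SequenceMatcher

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def name_similarity(name, candidate_names):
    """Best similarity ratio (0..1) between name and any label or alias."""
    return max(
        (SequenceMatcher(None, name.lower(), c.lower()).ratio() for c in candidate_names),
        default=0.0,
    )

def score(place, cand, local_wiki=None):
    """Higher is better. All weights are illustrative guesses."""
    dist = haversine_km(place["lat"], place["lon"], cand["lat"], cand["lon"])
    dist_score = 1.0 / (1.0 + dist)  # lower distance gives a higher score
    text_score = name_similarity(place["name"], cand["labels"] + cand["aliases"])
    link_score = math.log1p(cand["weighted_sitelinks"])
    bonus = 1.0 if local_wiki and local_wiki in cand["sites"] else 0.0
    return dist_score + text_score + 0.1 * link_score + 0.5 * bonus
```

For a place in Hungary one would call score(place, cand, local_wiki="huwiki"), so that among candidates with the same coordinates and labels the one with a "huwiki" page wins.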
Kind Regards, Imre
finin@umbc.edu wrote (on Tue, 22 Mar 2022, 22:25):
Is there a simple way to get the sitelinks count data for all Wikidata items? I want to use the data to help rank possible text entity links to Wikidata items
I'm really only interested in counts for items that have at least one sitelink (i.e., a wikibase:sitelinks value that's >0). According to statistics I've seen, only about 1/3 of Wikidata items have at least one sitelink.
I'm not sure if wikibase:sitelinks is included in the standard Wikidata dump. I could try a SPARQL query with an OFFSET and LIMIT, but I doubt that the approach would run to completion.
_______________________________________________
Wikidata mailing list -- wikidata@lists.wikimedia.org
To unsubscribe send an email to wikidata-leave@lists.wikimedia.org