>  sitelinks  /  I want to use the data to help rank possible text entity links to Wikidata items

side note:
I am helping the https://www.naturalearthdata.com/ project by adding Wikidata concordances.
It is a public-domain geo-database with [ mountains, rivers, populated places, .. ]
I am using the Wikidata JSON dumps and importing them into a PostGIS database.
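For context, the sitelink count is right there in each dump entity's "sitelinks" object, so it can be collected while streaming the dump. A minimal sketch of that part (the dump file name and helper names are just illustrative, not my actual loader):

```python
import gzip
import json

def iter_entities(dump_path):
    """Stream entities from a Wikidata JSON dump (e.g. latest-all.json.gz).
    The dump is one huge JSON array, but each entity sits on its own
    line, so we can parse line by line instead of loading it whole."""
    with gzip.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            yield json.loads(line)

def sitelink_count(entity):
    """Number of sitelinks on one entity (0 if the key is missing)."""
    return len(entity.get("sitelinks", {}))
```

With something like this, items with at least one sitelink can be filtered while loading into PostGIS, without any SPARQL paging.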
And I am ranking the matches by:
- distance ( lower is better ),
- text similarity ( I am checking the "labels" and the "aliases" ),
- and sitelinks!

And I am lowering the rank of the "mostly imported sitelinks" ( "cebwiki", ... )
why? :  https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2017/08#Nonsense_imported_from_Geonames
Because a lot of geodata was re-imported, so the "distance" and the "text/labels" are the same.
So be careful with the imported Wikipedia pages! ( sitelinks )
As I see it, the geodata quality is much better where an active Wikidata community is cleaning it up.

This is just an example of why the raw "sitelinks" number is not enough :-)

On the other hand, the P625 coordinate location is probably also important:  https://www.wikidata.org/wiki/Property:P625
In Germany, "dewiki" ranks higher;
in Hungary, "huwiki" is preferred.
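To make the down-weighting concrete, here is a sketch of the kind of weighted sitelink score I mean. The weight values and the extra parameters are illustrative assumptions; only "cebwiki" as a bot-heavy wiki comes from the discussion above:

```python
def sitelink_score(sitelinks, local_wiki=None,
                   bot_heavy=frozenset({"cebwiki"}),
                   bot_weight=0.1, local_bonus=2.0):
    """Weighted sitelink count: mostly-bot-imported wikis are
    discounted, and the 'local' wiki of the region being matched
    (e.g. 'dewiki' in Germany, 'huwiki' in Hungary) gets a bonus.
    All weights here are illustrative, not tuned values."""
    score = 0.0
    for wiki in sitelinks:
        if wiki in bot_heavy:
            score += bot_weight      # imported page -> weak evidence
        elif wiki == local_wiki:
            score += local_bonus     # locally curated -> strong evidence
        else:
            score += 1.0
    return score
```

The same idea extends to more bot-heavy wikis by growing the `bot_heavy` set, and the result can then be combined with the distance and text-similarity scores.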
Kind Regards,


<finin@umbc.edu> wrote (on Tue, 22 Mar 2022, 22:25):
Is there a simple way to get the sitelinks count data for all Wikidata items?  I want to use the data to help rank possible text entity links to Wikidata items

I'm really only interested in counts for items that have at least one sitelink (i.e., a wikibase:sitelinks value that's > 0).  According to statistics I've seen, only about 1/3 of Wikidata items have at least one sitelink.

I'm not sure if wikibase:sitelinks is included in the standard Wikidata dump.  I could try a SPARQL query with an OFFSET and LIMIT, but I doubt that approach would run to completion.
Wikidata mailing list -- wikidata@lists.wikimedia.org
To unsubscribe send an email to wikidata-leave@lists.wikimedia.org