Is there a simple way to get the sitelinks count data for all Wikidata items? I want to use the data to help rank possible text entity links to Wikidata items.
I'm really only interested in counts for items that have at least one sitelink (i.e., a wikibase:sitelinks value that's >0). According to statistics I've seen, only about 1/3 of Wikidata items have at least one sitelink.
I'm not sure if wikibase:sitelinks is included in the standard Wikidata dump. I could try a SPARQL query with an OFFSET and LIMIT, but I doubt that the approach would work to completion.
Sorry, I don't have an answer for you; hopefully others will respond. When you get the answer, it would be great if you could add a new section called "Statistics" to this page: Help:Sitelinks - Wikidata https://www.wikidata.org/wiki/Help:Sitelinks
Thad https://www.linkedin.com/in/thadguidry/ https://calendly.com/thadguidry/
On Tue, Mar 22, 2022 at 4:25 PM finin@umbc.edu wrote:
Is there a simple way to get the sitelinks count data for all Wikidata items? I want to use the data to help rank possible text entity links to Wikidata items
I'm really only interested in counts for items that have at least one (e.g., wikibase:sitelinks value that's >0). According to statistics I've seen, only about 1/3 of Wikidata items have at least one sitelink.
I'm not sure if wikibase:sitelinks is included in the standard Wikidata dump. I could try a SPARQL query with an OFFSET and LIMIT, but I doubt that the approach would work to completion.
_______________________________________________
Wikidata mailing list -- wikidata@lists.wikimedia.org
To unsubscribe send an email to wikidata-leave@lists.wikimedia.org
Here is a dashboard with the number of items that do have a sitelink: https://grafana.wikimedia.org/goto/9qEhxrP7k?orgId=1
Jan Ainali
"I'm not sure if wikibase:sitelinks is included in the standard Wikidata dump."
As far as I can see, it is in the JSON dump: https://www.wikidata.org/wiki/Wikidata:Database_download#JSON_dumps_(recomme...)
https://doc.wikimedia.org/Wikibase/master/php/md_docs_topics_json.html#json_...
Example:
{
  "sitelinks": {
    "afwiki": { "site": "afwiki", "title": "New York Stad", "badges": [] },
    "frwiki": { "site": "frwiki", "title": "New York City", "badges": [] },
    "nlwiki": { "site": "nlwiki", "title": "New York City", "badges": [ "Q17437796" ] },
    "enwiki": { "site": "enwiki", "title": "New York City", "badges": [] },
    "dewiki": { "site": "dewiki", "title": "New York City", "badges": [ "Q17437798" ] }
  }
}
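Since the JSON dump carries the full "sitelinks" map for each entity, the count can be derived by taking the size of that map. The following is only a rough sketch, assuming the dump layout of one entity per line inside a single JSON array; the function names and the file path passed to them are made up for illustration.

```python
import bz2
import json
from typing import Iterable, Iterator, Tuple


def sitelink_counts(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (entity id, number of sitelinks) for entities with at least one sitelink."""
    for line in lines:
        # Assumed layout: "[" on the first line, "]" on the last, and one
        # entity per line in between, each followed by a trailing comma.
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        entity = json.loads(line)
        # Entities without sitelinks (e.g. properties) count as zero.
        count = len(entity.get("sitelinks", {}))
        if count > 0:
            yield entity["id"], count


def sitelink_counts_from_dump(path: str) -> Iterator[Tuple[str, int]]:
    """Stream a bz2-compressed JSON dump line by line (hypothetical helper)."""
    with bz2.open(path, mode="rt", encoding="utf-8") as dump:
        yield from sitelink_counts(dump)
```

Streaming line by line avoids loading the whole dump into memory; something like `for qid, n in sitelink_counts_from_dump("latest-all.json.bz2"): ...` would then give the counts one item at a time.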
Kind Regards, Imre
In the queryable Wikidata model, there is a property wikibase:sitelinks whose value is an integer that is the number of Wikipedia sites that the item appears on if it is on at least one site. This is what I'm after. I'm not sure that this value is in the RDF dumps and in the smaller truthy dumps, in particular.
"In the queryable Wikidata model, there is a property wikibase:sitelinks whose value is an integer that is the number of Wikipedia sites that the item appears on if it is on at least one site. This is what I'm after. I'm not sure that this value is in the RDF dumps and in the smaller truthy dumps, in particular."
As far as I can see, "latest-all.nt.bz2" contains the sitelinks info (downloaded from https://dumps.wikimedia.org/wikidatawiki/entities/):
$ bzcat latest-all.nt.bz2 | grep sitelink | head
<https://www.wikidata.org/wiki/Special:EntityData/Q31> <http://wikiba.se/ontology#sitelinks> "345"^^<http://www.w3.org/2001/XMLSchema#integer> .
<https://www.wikidata.org/wiki/Special:EntityData/Q8> <http://wikiba.se/ontology#sitelinks> "149"^^<http://www.w3.org/2001/XMLSchema#integer> .
<https://www.wikidata.org/wiki/Special:EntityData/Q23> <http://wikiba.se/ontology#sitelinks> "235"^^<http://www.w3.org/2001/XMLSchema#integer> .
<https://www.wikidata.org/wiki/Special:EntityData/Q24> <http://wikiba.se/ontology#sitelinks> "26"^^<http://www.w3.org/2001/XMLSchema#integer> .
<https://www.wikidata.org/wiki/Special:EntityData/Q42> <http://wikiba.se/ontology#sitelinks> "116"^^<http://www.w3.org/2001/XMLSchema#integer> .
<https://www.wikidata.org/wiki/Special:EntityData/Q1868> <http://wikiba.se/ontology#sitelinks> "29"^^<http://www.w3.org/2001/XMLSchema#integer> .
<https://www.wikidata.org/wiki/Special:EntityData/Q2013> <http://wikiba.se/ontology#sitelinks> "119"^^<http://www.w3.org/2001/XMLSchema#integer> .
<https://www.wikidata.org/wiki/Special:EntityData/Q45> <http://wikiba.se/ontology#sitelinks> "338"^^<http://www.w3.org/2001/XMLSchema#integer> .
<https://www.wikidata.org/wiki/Special:EntityData/Q51> <http://wikiba.se/ontology#sitelinks> "292"^^<http://www.w3.org/2001/XMLSchema#integer> .
<https://www.wikidata.org/wiki/Special:EntityData/Q58> <http://wikiba.se/ontology#sitelinks> "138"^^<http://www.w3.org/2001/XMLSchema#integer> .
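The grep output can be turned into a QID-to-count table with a small parser. This is a minimal sketch that assumes each triple has exactly the N-Triples shape of the Q31 line above (subject under Special:EntityData, the wikiba.se sitelinks predicate, an xsd:integer literal); the regular expression and the function name are illustrative, not part of any Wikidata tooling.

```python
import re
from typing import Dict, Iterable

# Assumed shape of one sitelinks triple in the N-Triples dump.
SITELINKS_TRIPLE = re.compile(
    r'^<https://www\.wikidata\.org/wiki/Special:EntityData/(Q\d+)> '
    r'<http://wikiba\.se/ontology#sitelinks> '
    r'"(\d+)"\^\^<http://www\.w3\.org/2001/XMLSchema#integer> \.$'
)


def parse_sitelink_counts(lines: Iterable[str]) -> Dict[str, int]:
    """Map QID -> sitelinks count, keeping only items with a count above zero."""
    counts: Dict[str, int] = {}
    for line in lines:
        match = SITELINKS_TRIPLE.match(line.strip())
        if match is None:
            continue  # some other triple that merely mentions "sitelink"
        count = int(match.group(2))
        if count > 0:
            counts[match.group(1)] = count
    return counts
```

Fed from standard input (for example `parse_sitelink_counts(sys.stdin)` behind the `bzcat ... | grep` pipeline), this keeps only one small dictionary in memory rather than the dump itself.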
"the number of Wikipedia sites"
For example, take the first line of my output. Q31 = Belgium (country in western Europe): https://www.wikidata.org/wiki/Q31
<https://www.wikidata.org/wiki/Special:EntityData/Q31> <http://wikiba.se/ontology#sitelinks> "345"^^<http://www.w3.org/2001/XMLSchema#integer> .
Q31 sitelinks = 345 = Wikipedia (278 entries) + Wikibooks (3 entries) + Wikinews (30 entries) + Wikiquote (12 entries) + Wikivoyage (21 entries) + Multilingual sites (1 entry)
So it is not entirely clear to me whether you need the 278 or the 345 as the result.
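If only the Wikipedia part of the total is wanted (the 278 rather than the 345 for Q31), it could be computed from the site IDs in the JSON dump's "sitelinks" map. The sketch below rests on an assumption: that Wikipedia site IDs end in "wiki" while sister projects end in names such as "wikivoyage" or "wikiquote", and that a handful of multilingual sites also ending in "wiki" have to be excluded by hand. The exclusion set is illustrative and probably not exhaustive.

```python
from typing import Iterable

# Assumption: site IDs that end in "wiki" but are not language editions of
# Wikipedia. Illustrative only; check against the real list of sites.
NON_WIKIPEDIA_WIKIS = frozenset({
    "commonswiki", "specieswiki", "wikidatawiki", "metawiki",
    "mediawikiwiki", "sourceswiki",
})


def is_wikipedia(site_id: str) -> bool:
    """Heuristic test for a Wikipedia language edition such as "enwiki"."""
    return site_id.endswith("wiki") and site_id not in NON_WIKIPEDIA_WIKIS


def wikipedia_sitelink_count(site_ids: Iterable[str]) -> int:
    """Count only the sitelinks that point at Wikipedia editions."""
    return sum(1 for site_id in site_ids if is_wikipedia(site_id))
```

With an entity loaded from the JSON dump, `wikipedia_sitelink_count(entity["sitelinks"])` would give the Wikipedia-only figure, while `len(entity["sitelinks"])` gives the overall one.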
Kind regards, Imre
"sitelinks / I want to use the data to help rank possible text entity links to Wikidata items"
Side note: I am helping the Natural Earth project (https://www.naturalearthdata.com/) by adding Wikidata concordances. It is a public-domain geo-database with mountains, rivers, populated places, and so on. I am using the Wikidata JSON dumps and importing them into a PostGIS database. I rank the matches by:
- distance (lower is better)
- text similarity (I check the "labels" and the "aliases")
- and sitelinks!
I also lower the rank of the "mostly imported" sitelinks ("cebwiki", ...). Why? See https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2017/08#Nonsense...
A lot of geodata was re-imported, so for those items the "distance" and the "text/labels" are the same. So be careful with imported Wikipedia pages (sitelinks)! As far as I can see, the geodata quality is much better now, mostly where the active Wikidata community is cleaning up.
This is just an example of why the simple "sitelinks" number is not enough :-)
On the other hand, the P625 coordinate location is probably also important: https://www.wikidata.org/wiki/Property:P625 In Germany, "dewiki" is ranked higher; in Hungary, "huwiki" is preferred.
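To make the ranking idea concrete, here is a rough sketch of how distance, text similarity and down-weighted sitelinks could be combined into one score. It is not the scoring used for Natural Earth: the weight for "cebwiki", the formula and all function names are hypothetical, and difflib's SequenceMatcher stands in for whatever text similarity measure is really used.

```python
import math
from difflib import SequenceMatcher
from typing import Iterable

# Hypothetical weights: sitelinks from wikis with many imported pages count less.
SITE_WEIGHTS = {"cebwiki": 0.1}


def weighted_sitelinks(site_ids: Iterable[str]) -> float:
    """Sum of per-site weights; sites without an entry count as 1.0."""
    return sum(SITE_WEIGHTS.get(site_id, 1.0) for site_id in site_ids)


def text_similarity(name: str, labels: Iterable[str]) -> float:
    """Best similarity (0.0 to 1.0) between name and any label or alias."""
    return max(
        (SequenceMatcher(None, name.lower(), label.lower()).ratio() for label in labels),
        default=0.0,
    )


def match_score(name: str, distance_km: float,
                labels: Iterable[str], site_ids: Iterable[str]) -> float:
    """Higher is better: close, similarly named, well-linked items win."""
    closeness = 1.0 / (1.0 + distance_km)                 # lower distance is better
    popularity = math.log1p(weighted_sitelinks(site_ids))  # dampen very large counts
    return text_similarity(name, labels) * closeness * (1.0 + popularity)
```

A country-specific preference (for example a higher weight for "dewiki" in Germany or "huwiki" in Hungary) could be expressed by passing a different weight table per region; that extension is left out here.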
Kind Regards, Imre
If you're willing to settle for all Wikidata items with at least *two* sitelinks (roughly 11.5 million items), it can be done with five simple WDQS queries (these only return the QIDs though -- no labels):
SELECT ?i { VALUES ?s { 2 } ?i wikibase:sitelinks ?s }
SELECT ?i { VALUES ?s { 3 } ?i wikibase:sitelinks ?s }
(The sitelink counts are implicit for the above two queries and are omitted from the results to help avoid a timeout or error message.)
SELECT * { VALUES ?s { 4 7 } ?i wikibase:sitelinks ?s }
SELECT * { VALUES ?s { 5 6 } ?i wikibase:sitelinks ?s }
SELECT * { VALUES ?s { 8 9 10 [...] 398 399 400 } ?i wikibase:sitelinks ?s }
(There are a few dozen Wikimedia page-type items that have more than 400 sitelinks; these can be found here: https://www.wikidata.org/wiki/Wikidata:Database_reports/Most_sitelinked_item... .)
Each of these queries ran successfully for me in about 20-30 seconds and I was able to download the full results as both a TSV and JSON file without any problems. I had no luck with my attempts to query for the 18.4 million items with only one sitelink, even when using LIMIT and OFFSET.
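For anyone who would rather script these queries than paste them into the WDQS web form, a minimal sketch follows. It assumes the public endpoint at https://query.wikidata.org/sparql, which predefines the wikibase: prefix and returns TSV when asked via the Accept header; the helper names are made up, `SELECT ?i ?s` is used in place of `SELECT *` (equivalent for these queries), and the fetch function has not been tried against the large result sets described above.

```python
import urllib.parse
import urllib.request
from typing import Iterable

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(counts: Iterable[int], include_count: bool = True) -> str:
    """Build one VALUES query for the given sitelink counts."""
    values = " ".join(str(count) for count in counts)
    # Leaving ?s out of the projection keeps single-count results smaller.
    projection = "?i ?s" if include_count else "?i"
    return f"SELECT {projection} {{ VALUES ?s {{ {values} }} ?i wikibase:sitelinks ?s }}"


def fetch_tsv(query: str, user_agent: str) -> str:
    """Run a query and return the raw TSV text (may time out on huge results)."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url,
        headers={"Accept": "text/tab-separated-values", "User-Agent": user_agent},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

For instance, `build_query(range(8, 401))` spells out the full 8 to 400 list that is abbreviated with "[...]" above, and `build_query([2], include_count=False)` reproduces the first query.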
Hope that helps,
Tyler
Thanks, I'll try this! Knowing which items have at least two sitelinks might be good enough. I was unfamiliar with the VALUES option in SPARQL 1.1.