After the successful effort to solve the problem of human settlements without a country, we have to move on to the elephant in the room: items without statements.
After some discussion in Telegram, let me share some numbers here.
There are currently 829k of them. https://w.wiki/DEYs
There are another 493k items with only one identifier and no other statement. https://qlever.cs.uni-freiburg.de/wikidata/Z8OkZi?exec=true Often that single identifier is just the Google Knowledge Graph ID (P2671).
So basically we have more than a million items that are empty or almost empty.
(And this is not even the list of all items without any instance of (P31) / subclass of (P279), nor a list of items missing other basic statements, like country, coordinates, sport, etc.)
All these items are essentially useless, as they contain little to no data.
Any thoughts on how we can reduce this problem?
Romaine
On Fri, 28 Feb 2025 at 16:06, Romaine Wiki romaine.wiki@gmail.com wrote:
There are another 493k items with only one identifier and no other statement. https://qlever.cs.uni-freiburg.de/wikidata/Z8OkZi?exec=true Often that single identifier is just the Google Knowledge Graph ID (P2671).
The first half-dozen or so I checked all also have a Wikipedia link in one or more languages.
Maybe it would be worth making a query for each of, say, the top twenty languages and posting it on the relevant Village Pump?
Or having an article or talk page template added by a bot, to each affected article?
I tried some queries, and they all timed out :(
I'm not very good at SPARQL.
But I agree with Andy: dividing hundreds of thousands of items into small groups that can be processed by people who are likely to know something relevant about those items is probably a better way to handle it than just looking at one huge pile of items.
Some ways to divide them that I can think of immediately:
1. Having a sitelink to particular languages.
2. Having a label or a description in a particular language.
3. Having certain characteristics in the label, like length, or the presence of certain characters (even a mostly arbitrary characteristic, like "label starts with the letters 'Mi'" or "has digits in the label", is better than nothing).
If someone can make a bunch of queries that do something like this and actually work (and don't time out), this can be a nice beginning.
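As a rough sketch of the first kind of division (untested, and it may well still time out on the Wikidata Query Service for the bigger wikis), a query for no-statement items with a sitelink to one particular Wikipedia could look something like this:

```sparql
# Items with zero statements that have a sitelink to a given wiki
# (here the English Wikipedia, as an example); change the host to
# slice the backlog by language.
SELECT ?item ?article WHERE {
  ?item wikibase:statements 0 .          # statement count stored per entity
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> .
}
LIMIT 1000
```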
Hi y'all,
Good ideas.
Queries over such a big number of items often time out. Here is a working QLever query for items with more than 10 sitelinks: https://qlever.cs.uni-freiburg.de/wikidata/VdiLsm. There is only one result; you can decrease the value for more results. Reminder: QLever results are not updated in real time; they are based on dumps (which lag behind, so right now the results are from 29 January 2025).
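A query along those lines (a sketch, not necessarily the exact query behind the link) can use the per-entity sitelink count:

```sparql
# Items with no statements but more than 10 sitelinks,
# most-linked first.
SELECT ?item ?sitelinks WHERE {
  ?item wikibase:statements 0 ;
        wikibase:sitelinks ?sitelinks .
  FILTER(?sitelinks > 10)
}
ORDER BY DESC(?sitelinks)
```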
Cheers, Nicolas
Perhaps I count as a SPARQL expert now, but I do see one easy way to see all the Wikidata items that have no statements (and are not lexemes):
https://query.wikidata.org/#SELECT%20%3Fitem%20%3Fwiki%0AWHERE%20%7B%0A%20%2...
Also, here is a query to get the true "duds": no statements, not lexemes, and no Wikipedia/Wikimedia articles. It looks like there are about 8,000 of these, so thankfully not really an "elephant":
https://query.wikidata.org/#SELECT%20DISTINCT%20%3Fitem%20WHERE%20%7B%0A%20%...
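Decoded from the URL (with the comment on the last pattern adjusted to match what the MINUS actually does), that second query reads:

```sparql
SELECT DISTINCT ?item WHERE {
  ?item wikibase:statements 0 .
  MINUS { ?item dct:language [] } .   # exclude lexemes
  MINUS { [] schema:about ?item ;
             schema:isPartOf [] }     # exclude items that do have a Wikipedia etc. article
}
```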
Finally, on a minor note, it looks like there are still about 2,000 human settlements in Wikidata without a country:
https://wikidatawalkabout.org/?c=Q486972&lang=en&f.P17=novalue
This is not meant to sound like a criticism - Romaine, you have obviously made an enormous improvement there! And perhaps the remaining ones are difficult to categorize.
-Yaron
I think we have this:
https://www.wikidata.org/wiki/Wikidata:Database_reports/Popular_items_withou...
Not sure how up to date it is.
Yaroslav
Further to the below, this query https://w.wiki/DFDJ, using a random sample, finds that:
- about 9.7% of items without statements have no Wikipedia links. That's about 80,000 items, which is more than Yaron Koren found -- probably because I'm only counting sitelinks to actual Wikipedias, not Wikisource, Wikivoyage, or Wikimedia Commons;
- about 86% have one Wikipedia link (713,000 items);
- 3.5% have two Wikipedia links (29,000 items), and 0.3% (about 2,500 items) have three Wikipedia links.
So fixing these items will require analysing what information can be extracted from the wiki articles.
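A sketch of that per-item link count (untested; the actual query at https://w.wiki/DFDJ works on a random sample, presumably because the full aggregation over 800k+ items would time out):

```sparql
# For each item with zero statements, count sitelinks to actual
# Wikipedias only (hosts containing ".wikipedia.org"), then tally
# how many items have each link count.
SELECT ?nLinks (COUNT(?item) AS ?nItems) WHERE {
  {
    SELECT ?item (COUNT(?article) AS ?nLinks) WHERE {
      ?item wikibase:statements 0 .
      OPTIONAL {
        ?article schema:about ?item ;
                 schema:isPartOf ?wiki .
        FILTER(CONTAINS(STR(?wiki), ".wikipedia.org"))
      }
    }
    GROUP BY ?item
  }
}
GROUP BY ?nLinks
ORDER BY ?nLinks
```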
-- James.
Breakdown by wiki of the number of items with sitelinks but no statements: https://w.wiki/DFDh
Led by:
- English Wikipedia (68,000 articles)
- Kazakh Wikipedia (42,000 articles)
- Polish Wikipedia (30,000 articles)
- Nepalese Newari Wikipedia (25,000 articles)
- Chinese Wikipedia (21,500 articles)
- Spanish Wikipedia (21,500 articles)
Note that some of these "articles" are in fact redirects.
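A sketch of such a per-wiki breakdown (untested, and not necessarily the exact query behind the link above):

```sparql
# Number of zero-statement items per wiki, largest backlog first.
SELECT ?wiki (COUNT(DISTINCT ?item) AS ?count) WHERE {
  ?item wikibase:statements 0 .
  ?article schema:about ?item ;
           schema:isPartOf ?wiki .
}
GROUP BY ?wiki
ORDER BY DESC(?count)
```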
-- James.
I think there are two routes we can follow: partly an automated one, and partly one involving communities to help out. With many thousands of items, I hope we can first try to do the bulk in an automated way, because over a million items is far too labour-intensive to handle by hand. Also, such high numbers (20,000+) can be demotivating for communities to start on. Then a second round with community input? I suspect that most items will get a P31; those that need a subclass of (P279) are fewer and more complex, so there community input is welcome.
Dividing it into parts that somehow form a group is certainly a good approach. This is also what I did when adding countries: I often work per identifier, or per sitelink to one Wikipedia, working a group down before I move on to another. With countries, many Wikipedias have infoboxes containing a row like "Country | Foo Bar". I hope infoboxes can be used for P31 too. Does anyone know how to extract data from infoboxes and add it to Wikidata?
Romaine
PS: Yes, the elephant in the china shop also exists, but that is another expression with a different meaning. For the current stage of working to get this issue solved, I prefer the idea of Schrödinger's data (https://en.wikipedia.org/wiki/Schr%C3%B6dinger%27s_cat): until we add the data to an item, we do not know whether a source we can extract the data from exists or not.
On Sat, 1 Mar 2025 at 02:49, Romaine Wiki romaine.wiki@gmail.com wrote:
Does anyone know how to extract data from infoboxes and adding it on Wikidata?
HarvestTemplates:
https://pltools.toolforge.org/harvesttemplates/
Great!
I hope we can work on this issue further together.
In the past week I have brought the 814 items without P31/P279 that have a sitelink to the Limburgish Wikipedia down to 0. This was mostly done by hand, as that wiki does not have many infoboxes. I hope with other Wikipedias it will be easier by extracting data from infoboxes.
Would it be helpful to create a project page for this subject?
Romaine
With your first query, items with no statements, I get 827,715 results. Combined with the items that have only one identifier statement (493k items), we are over a million items with too few statements. And then there are still many more that lack both P31 and P279. So I still see an elephant: way too many items with this problem. (But I am not so interested in whether we have an elephant in the room or not; the point is that many items are empty or almost empty, and the question on the table is how we can reduce this issue.)
You exclude items with Wikipedia/Wikimedia sitelinks, but I see no reason for that: they still lack the basic statements needed to run a simple query, and for most humans it is still impossible to tell what such an item is about.
----
Not so much related to the current discussion: the items with P31 = human settlement without a country. I only included P31 = human settlement, and not subclasses of human settlement, because even with the simple query (which I shared in my other e-mail a week ago) I got too many server timeouts. I had already started on that project earlier, when we had 10,000+ items without a country, and it has been brought back to only 93 (of which 84 relate to Armenia). For the moment I have left it there: after getting it down from 10,000+ to 93 (together with the help of others), I got a bit tired of the subject. The remaining items you list are those 93, plus items that are human settlements only via a subclass in P31. Sure, those need attention too, but that was not the point of what I wrote.
Romaine
I think the point of filtering the items with existing sitelinks is not to exclude them, but to use those sitelinks so we can have more information about the item and add statements accordingly; we can start with those as a way to reduce the backlog.
Regards, FlyingAce
El vie, 28 de feb de 2025, 8:11 p. m., Romaine Wiki romaine.wiki@gmail.com escribió:
With your 1st query, items with no statements, I get 827715 results. That combined with items with only one identifier statement (493k items), we are over a million items with too limited statements. And then there are still many more without both P31 and P279. So I still see an elephant: still way too many items with this problem. (But I am not so much interested in when we have an elephant in the room or not, the point is that many items are empty or almost empty and the question on the table is how we can reduce this issue.)
That you exclude items with Wikipedia/Wikimedia sitelinks but I see no reason for that, as still they miss the basic statements to be able to run a simple query and for most humans it is still impossible to tell what the item is about.
Not so much related to the current discussion, The items with P31 human settlements without a country: I only included P31 = human settlement, and not a subclass of human settlements. This I did as I already got with the simple query (I shared in my other e-mail from a week ago) too many server timeouts. I already started with that project already earlier (than the e-mail) when we had 10 000+ items without country, and this has been brought back to only 93 (of which 84 relate to Armenia). And for the moment I left it there as after getting it down from 10 000+ to 93 (together with the help of others), I got a bit tired of the subject. The remaining ones you list are those 93 left and items that have via a subclass of P31 as human settlement. Sure those need attention too, but that was not the point of what I wrote.
Romaine
Op vr 28 feb 2025 om 20:28 schreef Yaron Koren yaron57@gmail.com:
Perhaps I count as a SPARQL expert now, but I do see one easy way to see all the Wikidata items with no statements (and are not lexemes):
https://query.wikidata.org/#SELECT%20%3Fitem%20%3Fwiki%0AWHERE%20%7B%0A%20%2...
Also, here is a query to get the true "duds" - no statements, no lexemes, and no Wikipedia/Wikimedia articles - it looks like there are about 8,000 of these, so thankfully not really an "elephant":
https://query.wikidata.org/#SELECT%20DISTINCT%20%3Fitem%20WHERE%20%7B%0A%20%...
Finally - on a minor no, it looks there are still about 2,000 human settlements in Wikidata without a country:
https://wikidatawalkabout.org/?c=Q486972&lang=en&f.P17=novalue
This is not meant to sound like a criticism - Romaine, you have obviously made an enormous improvement there! And perhaps the remaining ones are difficult to categorize.
-Yaron
On Fri, Feb 28, 2025 at 9:57 AM Nicolas VIGNERON < vigneron.nicolas@gmail.com> wrote:
Hi y'all,
Good ideas.
Queries for such a big number of items are often timing out. Here is a working QLever query for items with more than 10 sitelinks : https://qlever.cs.uni-freiburg.de/wikidata/VdiLsm. There is only one result, you can decrease the value for more results. Reminder, QLever results are not updated in real time, it's based on dumps (who are late because right now, results are from 29.01.2025).
Cheers, Nicolas
Le ven. 28 févr. 2025 à 18:05, Amir E. Aharoni < amir.aharoni@mail.huji.ac.il> a écrit :
I tried some queries, and they all timed out :(
I'm not very good at SPARQL.
But I agree with Andy: Dividing hundreds of thousands of items into small groups that can be processed by people who are likely to know something relevant about those items, is probably a better way to try to handle it than just looking at a huge pile of items.
Some ways to divide them that I can think of immediately:
- Having a sitelink to particular languages.
- Having a label or a description in a particular languages.
- Having certain characteristics in the label, like length, or
presence of certain characters (even a mostly arbitrary characteristic, like "label starts with the letters 'Mi' " or "has digits in label", is better than nothing).
If someone can make a bunch of queries that do something like this and actually work (and don't time out), this can be a nice beginning.
בתאריך יום ו׳, 28 בפבר׳ 2025, 11:42, מאת Andy Mabbett < andy@pigsonthewing.org.uk>:
On Fri, 28 Feb 2025 at 16:06, Romaine Wiki romaine.wiki@gmail.com wrote:
There are another 493k items with only one identifier and no other
statement.
https://qlever.cs.uni-freiburg.de/wikidata/Z8OkZi?exec=true Often that single identifier is just the Google Knowledge Graph ID
(P2671).
The first half-dozen or so I checked all also have a Wikipedia link in one or more languages.
Maybe it would be worth making a query for each of the top, say twenty languages and posting on the relevant Village Pump?
Or having an article or talk page template added by a bot, to each affected article?
-- Andy Mabbett https://pigsonthewing.org.uk _______________________________________________ Wikidata mailing list -- wikidata@lists.wikimedia.org Public archives at https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/mes... To unsubscribe send an email to wikidata-leave@lists.wikimedia.org
-- WikiWorks · MediaWiki Consulting · http://wikiworks.com
Hi Romaine,
On Fri, Feb 28, 2025 at 6:11 PM Romaine Wiki romaine.wiki@gmail.com wrote:
With your first query, items with no statements, I get 827,715 results. Combined with the items that have only one identifier statement (493k), we are at over a million items with too few statements. And then there are still many more without either P31 or P279. So I still see an elephant: far too many items with this problem. (But I am not so much interested in whether we have an elephant in the room or not; the point is that many items are empty or almost empty, and the question on the table is how we can reduce this issue.)
You exclude items with Wikipedia/Wikimedia sitelinks, but I see no reason for that: they still lack the basic statements needed to run a simple query, and for most humans it is still impossible to tell what the item is about.
Okay, I (and perhaps others) had misunderstood what you meant by "the elephant in the room". I thought you meant that these were truly useless items, like someone typing in "hello world", hitting "Save" and then moving on.
(Looking at these items, it looks like a lot of them, while not quite that bad, should be turned into redirects. I randomly clicked on this one, for example: https://www.wikidata.org/wiki/Q12672715 , and it looks like it should be merged into https://www.wikidata.org/wiki/Q466973 .)
With that said, for items that do already correspond to a Wikipedia/Wikimedia page and are thus presumably valid, I don't think having zero properties is in itself a major issue - that is, it's not much worse than having only one property. That's a matter of opinion, of course. You could argue that lacking both "instance of" and "subclass of" makes an item unknowable, but in that case, the set of items that lack both of these properties is the real issue, regardless of what other data they hold.
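The set described here, items lacking both "instance of" and "subclass of", could be sketched as a query like the one below. It assumes the wdt: truthy-statement prefix from the Wikidata RDF export and is untested; as discussed earlier in the thread, a query over the whole set would very likely time out without some partitioning filter added:

```sparql
# Sketch: items with neither P31 (instance of) nor P279 (subclass of).
# Untested, and almost certainly needs an extra partitioning FILTER
# (label prefix, sitelink, etc.) to avoid timeouts on the full dataset.
SELECT ?item WHERE {
  FILTER NOT EXISTS { ?item wdt:P31 [] }    # no "instance of"
  FILTER NOT EXISTS { ?item wdt:P279 [] }   # no "subclass of"
  ?item wikibase:sitelinks ?n .             # restrict to real items
}
LIMIT 1000
```

This matches the point above: that set, not the zero-statement set alone, may be the more meaningful target.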
Not so much related to the current discussion: for the items with P31 = human settlement without a country, I only included P31 = human settlement itself, not its subclasses. I did this because even the simple query (which I shared in my other e-mail from a week ago) already produced too many server timeouts. I had started that project earlier, when we had 10 000+ items without a country, and together with the help of others this has been brought down to only 93 (of which 84 relate to Armenia). For the moment I left it there; after getting it down from 10 000+ to 93, I got a bit tired of the subject. The remaining ones you list are those 93, plus items that are human settlements only via a subclass of P31. Sure, those need attention too, but that was not the point of what I wrote.
Sorry, I had somehow forgotten that Wikidata Walkabout queries the immediate subclasses as well! Given that you were only going for "human settlement", you did an amazing job.
-Yaron
I successfully googled "elephant in the room". We may say that was good for me: I got to learn something new. During preparations for the Soyuz-Apollo spaceflight, clever people decided that the Soviets would speak English all the time and the Americans Russian, in order to reduce vocabulary and avoid lesser-known idioms. We in Hungary have a similar idiom ("elephant in the china shop") which has a completely different meaning, and this was confusing at first sight. Yes, my English is not as perfect as it could and should be. I may not be the only one, as Lennon said. So please, friends... :) Thank you!