Yesterday it was 10 years since Wikidata was founded, and two weeks ago Wikidata reached 100 million items. This is a good moment to see what we have (and don't have), to look back a bit, and also to express some hopes for the future.
The idea to write this up came in September, and since then I have done various analyses to get a picture. This will, however, not be a complete overview, as there are too many factors involved; it is just a general picture of what I came across.
(Spoiler: This e-mail gets more structure further below. :-p)
== Structured? ==
It is said that Wikidata contains structured data. I think we need to be more precise about that: it is how the data is stored that is structured. And this structure is *only* present on an individual item. If we zoom out a little and view multiple items of a series, the data among those items is often missing, fragmented, differently organised, and sometimes even problematic. On the multi-item level (series level) it depends heavily on whether a user has done all the work to synchronise the various items with each other.
*Example:* I came across a series of items about a certain sports tournament with an edition organised each year for 50 years in a row. For P31 (instance of), 5 items called it an event, 25 items a sporting event, 13 items a tournament, some others a competition, and a few had no P31 at all. To be clear, each edition had the same setup, was for the same sport, everything the same. The articles on Wikipedia are better structured!
This is just a simple series of items. Zooming out another level, the differences between series are huge, which makes the quality low.
How is a new item added? In the past ten years many items have been added with bots/tools based on the articles on Wikipedia. (Yes, I ignore other additions here.) In the future many items will still be created when an article on Wikipedia has been created. In the worst case, the user adds the sitelink and the item stays empty (practically useless!). A little better: the user adds P31/P279 (instance of/subclass of) (not useful yet, but it helps). Better still: other statements are added too (the item becomes useful). Better when a user checks one or two other items in the series. Much better when a user checks all items in the series. And fantastic when a user checks all items in the series and in other series.
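The kind of inconsistency in the tournament example can be made visible with a small tally. The following is only a minimal sketch under assumptions: the sample data is invented, plain labels stand in for Q-ids, and the names `SERIES`, `tally_instance_of` and `outliers` are hypothetical. Real data would have to be fetched from Wikidata first.

```python
from collections import Counter

# Hypothetical sample data: item -> list of P31 (instance of) values.
# Labels are used instead of Q-ids to keep the sketch readable.
SERIES = {
    "edition 1971": ["sporting event"],
    "edition 1972": ["sporting event"],
    "edition 1973": ["event"],
    "edition 1974": ["sporting event"],
    "edition 1975": [],  # no P31 at all
}


def tally_instance_of(items):
    """Count how often each P31 value occurs; items without P31 count as None."""
    counts = Counter()
    for values in items.values():
        if not values:
            counts[None] += 1
        for value in values:
            counts[value] += 1
    return counts


def outliers(items):
    """Return the items that lack the most common P31 value of the series."""
    counts = tally_instance_of(items)
    if not counts:
        return []
    most_common_value, _ = counts.most_common(1)[0]
    return sorted(name for name, values in items.items()
                  if most_common_value not in values)


print(tally_instance_of(SERIES))
print(outliers(SERIES))
```

The outliers are only candidates for a human to look at; the most common value of a series is not necessarily the correct one.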
Realistic for most new items? No, this is way too much effort. At the same time, to get quality data, it is needed.
*Example:* About a month ago there were 13 000 items with a sitelink to the Dutch Wikipedia without the basic statements P31/P279. This is just one language version; we have hundreds of wikis!
Some time after new articles have been written, users run a bot/tool to mass-import them from Wikipedia to Wikidata with zero or few statements. We should be happy that they do this work, but these items are largely empty and do not contain useful/needed data. Many duplicates are also created this way. We need to go to the source and find a solution there by rethinking the workflow; otherwise we keep mopping with the tap open.
*Needed for the future:* a "new article to Wikidata wizard". I imagine that when a user has finished writing an article, they click on Publish page. As soon as the page is saved, the user gets a pop-up dialogue. The dialogue first asks the user to search Wikidata to see whether an item about this subject already exists. For a completely new subject or an empty item, the second step is that the dialogue suggests (based on the published article) a few statements that the user can click to confirm. Most new articles are about subjects that are part of some sort of series, or about a subject with a default set of properties that we expect to be always present (like a building: country, located in the administrative territorial entity, and coordinates).
I think we can be more precise about what Wikidata contains: it contains chaotic data stored in a structured way, which is often neither added nor maintained systematically.
To get more quality, not only must the data be structured on items and among items, but the way we think about working with the data also needs more structure. We currently work with individual items, without an integral perspective on the data: we have no overview.
== Wikidata gives no overview ==
I have sometimes heard users say that Wikidata can provide an overview. That is, however, not true. Wikidata does not give an overview! Wikidata itself cannot give an overview, but a tool can create one using data from Wikidata.
To get more quality on Wikidata, more overview is needed. An overview of what is missing but should have been added on every item of a series. An overview of what unexpected use of properties can be found in a series of items. The tools that currently exist are especially good at detecting what data has been added, but not at detecting what data is missing or is weird for this type of item.
*Needed for the future:* a tool "get me more like this item", but I prefer to call it a "smart tool". When looking at an item, I often find myself wondering what statements the other items of this series have. If a series contains 50(+) items, I have to open every single item to see whether anything weird is going on or anything is missing. I wish I could press a button "get me more like this". The tool then shows a full series of items with the same label (where, for example, only the year changes), taking the labels in multiple languages into account at the same time, and/or with the same description, and/or with the same or similar properties. The tool suggests what to include, but it is also possible to tell the tool to ignore certain things. In this way I can easily find a certain sports tournament with 50 editions over the past 50 years. The result also includes those editions of the tournament that have no article (on WP) in my native language (and thus no label in my language), but do have one in, for example, the Italian WP. The tool shows all the properties added, without my having to indicate which properties should be shown, and can show the labels and descriptions in multiple languages.
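One ingredient of such a "get me more like this" tool, finding labels in which only the year changes, can be sketched as follows. This is an illustration under assumptions, not an existing tool: the names `series_key` and `group_series` and the sample labels are hypothetical, and a real tool would also compare descriptions, properties and labels in other languages.

```python
import re
from collections import defaultdict

# Matches a four-digit year between 1000 and 2099 standing on its own.
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def series_key(label):
    """Replace every four-digit year in a label with a placeholder."""
    return YEAR.sub("<year>", label)


def group_series(labels_by_item):
    """Group items whose labels are equal once the year is ignored."""
    groups = defaultdict(list)
    for item, label in labels_by_item.items():
        groups[series_key(label)].append(item)
    # Only keep keys shared by more than one item: those form a series.
    return {key: sorted(items) for key, items in groups.items() if len(items) > 1}


# Hypothetical sample labels.
LABELS = {
    "item A": "1971 Example Open",
    "item B": "1972 Example Open",
    "item C": "Example Museum",
}
print(group_series(LABELS))
```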
== Labels, descriptions and aliases ==
If I had to describe one of the main things I do on Wikidata, it would be fixing language. The number one thing to fix is capitals -> lower case. I click edit, change the capital in the label, change the capital in the description (if there is only one), often change the capitals in all the aliases, and click save. This does not sound like much work, but over 100 items it becomes a lot of work. And this is just one language; often I fix it for English and French too. Can't this be made easier?
*Needed for the future:* a tool with which I can fix capitals in one click. In 99.9% of the cases they are capitals that need to be changed to lower case. Ligatures in particular take a lot of work. If someone works on this, take into account the ligatures https://en.wikipedia.org/wiki/Ligature_(writing) and for Dutch also IJ -> ij.
*Needed for the future:* a Wikidata game that can easily find items where capitals are used where lower case should be used.
For many subjects the labels and descriptions are either all right or all wrong when it comes to capitals. One group of subjects is more challenging, both in number and in the combination of lower case and capitals: taxa. Many labels got imported from Wikipedia. In Dutch, for example, the local names should be lower case and the scientific names should start with a capital. This is currently a big mess on thousands of items.
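The core of such a one-click fix could look like the sketch below. It is a minimal illustration under assumptions: `lower_first` and `fix_item_terms` are hypothetical names, the item is represented as a plain dictionary, and a human still decides whether the capital is wrong (proper nouns must keep theirs). Python's `str.lower()` already handles single-character ligatures such as Æ and Œ; the Dutch IJ written as two letters needs its own rule.

```python
def lower_first(text, language="en"):
    """Lower-case the first letter; for Dutch, treat a leading 'IJ' as one letter."""
    if not text:
        return text
    if language == "nl" and text[:2] in ("IJ", "Ij"):
        return "ij" + text[2:]
    return text[0].lower() + text[1:]


def fix_item_terms(terms, language="en"):
    """Apply lower_first to the label, the description and all aliases at once."""
    return {
        "label": lower_first(terms.get("label", ""), language),
        "description": lower_first(terms.get("description", ""), language),
        "aliases": [lower_first(alias, language) for alias in terms.get("aliases", [])],
    }


# Hypothetical Dutch terms of one item.
print(fix_item_terms(
    {"label": "IJsbaan", "description": "Sportaccommodatie", "aliases": ["Schaatsbaan"]},
    language="nl",
))
```

Without the Dutch rule the label would become "iJsbaan", which is exactly the kind of detail that makes the manual work slow.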
Another thing I frequently have to fix is dots in descriptions. Apparently some users like to use a dot in there, while they shouldn't. Finding the places where this has happened is very hard...
*Needed for the future:* being able to run a query on the labels, descriptions and aliases. Many errors and issues can be found in them and need to be solved, but finding them is not easy. I recently came across a series of items with a spelling error.
Did you know that there are more than 20 places in the world that are called Amsterdam? How useful, then, is a description "building in Amsterdam"? Yes, a large number of users find it too much work to add the country in which an item is located.
*Needed for the future:* a tool/query with which I can quickly get an overview of all the descriptions that do not contain a country.
*Needed for the future:* a Wikidata game that gives me descriptions that lack a country but should have one.
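What such a query on labels, descriptions and aliases would do can be sketched locally with a regular expression. This is only an illustration under assumptions: `search_terms` is a hypothetical helper, the items are invented and held in a plain dictionary, and the pattern for a trailing dot will also flag legitimate abbreviations.

```python
import re


def search_terms(items, pattern, flags=0):
    """Return (item, field, text) for every label, description or alias matching pattern."""
    regex = re.compile(pattern, flags)
    hits = []
    for item, terms in items.items():
        candidates = [("label", terms.get("label", "")),
                      ("description", terms.get("description", ""))]
        candidates += [("alias", alias) for alias in terms.get("aliases", [])]
        for field, text in candidates:
            if text and regex.search(text):
                hits.append((item, field, text))
    return hits


# Hypothetical sample items, including a trailing dot and a spelling error.
ITEMS = {
    "item A": {"label": "example bridge",
               "description": "bridge in Maastricht, Netherlands.",
               "aliases": []},
    "item B": {"label": "example chruch",
               "description": "church in Example Town",
               "aliases": ["Example Chruch"]},
}

# A single dot at the very end (an ellipsis "..." is not matched).
print(search_terms(ITEMS, r"[^.]\.\s*$"))
# A known spelling error, regardless of capitals.
print(search_terms(ITEMS, "chruch", flags=re.IGNORECASE))
```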
This brings us to useful labels and descriptions. A lot of work needs to be done in that field. Many subjects do not have a unique label, as there are other subjects with the same name. To select the right item, a description is needed to clarify the context of that item.
*Needed for the future:* a Wikidata game that can generate descriptions. For many items the description can simply be <subject =P31> in <location/administrative territorial entity =P131/P276>, <country>. This can be added in your local language as well as in English at the same time, so everyone knows what the topic is about. (Bonus: there are still items that do not contain a country, maybe something to be fixed right away?)
A Wikidata game can help to find items with missing labels/languages, but it should be possible to simply query these.
On the other hand, I have also come across items with many wrong descriptions, especially "Wikimedia category" and "Wikimedia disambiguation page". Sometimes this can't simply be reverted, causing a lot of manual labour. On a recent occasion it took 50 minutes https://www.wikidata.org/w/index.php?title=Q89509298&action=history to get the page saved!
*Needed for the future:* a tool that instantly removes from one item all the labels of disambiguation pages, Wikimedia category or Wikimedia list article.
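The description template above is simple enough to sketch directly. The function name `make_description` and its behaviour for missing values are assumptions made for this illustration; in particular, returning nothing when the country is unknown is a choice, so that such items can be fixed first instead of getting an ambiguous description.

```python
def make_description(kind, place=None, country=None):
    """Build '<kind> in <place>, <country>'; return None when no country is known."""
    if not kind or not country:
        return None
    if place:
        return f"{kind} in {place}, {country}"
    return f"{kind} in {country}"


# kind would come from the label of P31, place from P131/P276, country from P17.
print(make_description("church", "Maastricht", "Netherlands"))
print(make_description("church", "Amsterdam"))  # country unknown: nothing generated
```

The same function works for any language as long as the labels and the word for "in" are taken from that language; the English template is only the example used here.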
Having at least a label in English is very welcome, otherwise there is no clue what Q1234567 is about. There are bots that add missing labels, including by copying the page names from Commons. The sitelinks to Commons are often Commons categories that are connected to items about the individual subject. The bots adding the missing labels sadly also copy the prefix Category: when entering the labels, which is often wrong. Simple solution: only add the Category: prefix if P31 has Wikimedia category as its statement.
I personally think that the biggest weakness of Wikidata is the missing labels, and in particular the missing labels in English. If an item has no label at all, it is basically useless. If an item only has a label in a local language (and not English), it can only be used in that local language, whose speakers are a minority of the world. At the same time, while in many countries most people also speak English next to their local language, in many other countries this is not the case and people don't understand English. This is a matter of accessibility and therefore it has priority.
*Needed for the future:* a programme to make sure that (almost) all items have a label in English, plus translations of this label in many local languages.
*Needed for the future:* a minimum requirement that batch uploads contain at least a label in English.
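The simple solution for the Category: prefix amounts to one condition. A minimal sketch, with assumptions stated loudly: `label_from_commons_title` is a hypothetical helper, not code from any existing bot, and the Q-id used for "Wikimedia category" should be verified before anyone relies on it.

```python
WIKIMEDIA_CATEGORY = "Q4167836"  # assumed Q-id of "Wikimedia category"; verify before use
PREFIX = "Category:"


def label_from_commons_title(title, instance_of):
    """Derive a label from a Commons page title.

    The 'Category:' prefix is only kept when the item itself is a
    Wikimedia category according to its P31 values.
    """
    if title.startswith(PREFIX) and WIKIMEDIA_CATEGORY not in instance_of:
        return title[len(PREFIX):].strip()
    return title


# "Q-building-placeholder" is a made-up stand-in for any non-category P31 value.
print(label_from_commons_title("Category:Example Church", ["Q-building-placeholder"]))
print(label_from_commons_title("Category:Example Church", [WIKIMEDIA_CATEGORY]))
```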
*Needed for the future:* a tool that helps with translations. There are many items with the same name. Currently we have to add a translation to every single item. It would be handy to have a tool that finds all items with the same name in English (like: Saint Servatius Church), so that a translation has to be entered only once and the tool adds it to all those items.
*Needed for the future:* a tool that can do transliteration. Missing transliteration is a huge barrier for the usability of the data, as many labels are only added in one script, while the user's native script is another one. This especially involves names.
Smaller language communities in particular have a hard time on Wikidata. A small language community means that only a very limited part of the (essential) items on Wikidata gets translated into the local language. At the same time, if you work on adding statements to Wikidata (while being a non-English speaker), you depend heavily on translations being available in your own language. If something has no label in your local language -> it will not be found when searching with the local word -> you can't add a statement or you likely add a wrong statement.
Without any statements an item is practically useless, so various users search for items without any statements in order to add some. While doing that, I recently came across a few items with only a sitelink to one Wikipedia language version. This language is not available in Google Translate or any other translation tool I could find, with the result that nothing could be done with these items.
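The "translate once, apply to all" idea can be sketched as below. The helper `add_translation`, the item names and the sample labels are hypothetical and chosen for illustration; a real tool would need the user to confirm each item, because the same English name does not always deserve the same translation.

```python
def add_translation(items, english_label, language, translation):
    """Add a label in `language` to every item with the given English label.

    Items that already have a label in that language are left alone.
    Returns the list of items that were changed.
    """
    changed = []
    for item, labels in items.items():
        if labels.get("en") == english_label and language not in labels:
            labels[language] = translation
            changed.append(item)
    return changed


# Hypothetical sample: item -> labels per language code.
LABELS = {
    "item A": {"en": "Saint Servatius Church"},
    "item B": {"en": "Saint Servatius Church", "nl": "Sint-Servatiuskerk"},
    "item C": {"en": "Saint Martin Church"},
}
print(add_translation(LABELS, "Saint Servatius Church", "nl", "Sint-Servatiuskerk"))
```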
== Statements ==
Every item on Wikidata should have at least an instance of or subclass of statement (P31/P279) (or both), because these two properties define what the item is about. Without these properties, we are practically blind. It is great to see that some of you are working on getting these properties onto all the items. (I recently completed that for all items with a sitelink to nlwiki: 23 373x -> 0x.) More help is needed for the many other items!
*Needed for the future:* a Wikidata game that brings up items without P31/P279 and gives suggestion(s) to add.
While doing the project of adding P31/P279, I noticed that various users still do not understand the difference between these two properties. This means that on various items users have added P31/P279 wrongly. We need to think about how we can find the items where that is the case and fix them. There is also a grey zone: series of items for which it is not precisely clear which of the two it should be. Perhaps a project that can take care of those cases?
In addition to P31/P279, each theme of items has a fixed set of basic statements that should always be added. For example, with taxon as P31, also needed are scientific name (P225), taxon rank (P105), and parent taxon (P171). With building as P31, also needed are country (P17), administrative territorial entity (P131), and coordinates (P625). This seems pretty obvious, but a recent large data import still forgot some basic statements, which still hasn't been fixed. The goal of adding data is that the data can be used. With some basic statements missing, the quality becomes too low. I think such data imports should not be allowed.
*Needed for the future:* a programme/project to get the basic statements onto (almost) all items.
We then probably should also pay attention to the quality. For example, the administrative territorial entity (P131) should not be too generic. A village or a building should get as P131 the smallest possible administrative territorial entity (in many countries the municipality).
With the various fixed sets of basic statements I estimate that it is possible to cover at least 90% of all the items (taxa, geographic features, people, structures, astronomical objects, publications, etc.). The remaining ones are harder and often more specialised. Those often have "term" or "concept" for P31. To make those items more usable, they need more statements that provide a better context. Properties like "aspect of" (P1269) and "characterized by" (P1552) are then needed.
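The fixed sets of basic statements lend themselves to a simple check. The sketch below uses only the two sets named above; the names `EXPECTED` and `missing_basic_statements` are hypothetical, and plain labels stand in for the Q-ids of the P31 values.

```python
# Expected basic properties per type, taken from the two examples in the text.
EXPECTED = {
    "taxon": ["P225", "P105", "P171"],     # scientific name, taxon rank, parent taxon
    "building": ["P17", "P131", "P625"],   # country, administrative entity, coordinates
}


def missing_basic_statements(instance_of, present_properties):
    """List the expected properties that an item of the given types does not have yet."""
    missing = []
    for kind in instance_of:
        for prop in EXPECTED.get(kind, []):
            if prop not in present_properties and prop not in missing:
                missing.append(prop)
    return missing


# A building that only has P31 and a country so far.
print(missing_basic_statements(["building"], {"P31", "P17"}))
```

A check like this could run before a batch import is accepted, or drive suggestions as soon as P31 is added to an item.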
== Identifiers ==
There is not much to say about identifiers: they do what they have to do. Even within Wikidata they help a lot, as a symbol is shown when the same identifier has been used on another item, which helps to resolve duplicates. Most identifiers seem to be related to sports, popular culture (music, movies, etc.) and monuments. In many other fields no properties for identifiers have been created yet.
For a generic user, most of the work regarding identifiers is in finding out which website holds the specific identifier and how to find it there, so it can be added to the item. I think we should pay more attention to that, so we can make it easier for users to add them. Another thing is that, for the theme I am working on, it is not easy to see where identifiers are missing but likely do exist.
*Needed for the future:* a tool that lists all potential items (in general or of a set of items/query) where an identifier likely is missing.
If identifiers are added to items, an icon next to them often indicates which statements are missing on that same item. For example, if I add a monument identifier, it also indicates that, for example, a country (P17) has not been added yet. A great help to get items more complete. At the same time, identifiers are often forgotten. The other way round would be welcome too: when an item gets, for example, building as P31, it should also suggest adding a country (P17), an administrative territorial entity (P131), coordinates (P625), and perhaps more.
== Other ==
Besides the ones already mentioned there are some tools/software/issues that would make the work easier or need to be solved.
*Needed for the future:* a tool that looks up all items with coordinates near given coordinates. Like Special:Nearby, but for any given location.
*Needed for the future:* better suggestions when adding statements. For example, when I add bridges (those things to cross a river), I get suggestions for properties related to astronomy. When an item has a Wikipedia article as sitelink, it would be great if a statement suggester would use the Wikipedia article to give suggestions. For example, why do I have to indicate manually the country (again) if this already has been indicated twice in the Wikipedia article?
Ten years ago Wikidata started. Those years passed by quickly. Together we have put so much work into it, with a great result as the outcome. But we are not done yet. For the next ten years I expect our main focus to be improving the quality.
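The lookup itself is a distance filter. A minimal sketch, assuming the coordinates (P625) of the candidate items are already at hand: `distance_km` uses the haversine formula on a spherical Earth, `nearby` is a hypothetical helper name, and the sample coordinates are approximate.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; a spherical Earth is an approximation


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def nearby(items, lat, lon, radius_km):
    """Return the names of all items within radius_km, nearest first."""
    hits = [(distance_km(lat, lon, ilat, ilon), name)
            for name, (ilat, ilon) in items.items()]
    return [name for dist, name in sorted(hits) if dist <= radius_km]


# Approximate coordinates (latitude, longitude) of three Dutch cities.
PLACES = {
    "Amsterdam": (52.37, 4.89),
    "Utrecht": (52.09, 5.12),
    "Maastricht": (50.85, 5.69),
}
print(nearby(PLACES, 52.37, 4.89, 50))
```

Comparing against every item does not scale to all of Wikidata; a real tool would need a spatial index or a geospatial query service, which is outside this sketch.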
Reading through all this carefully and taking notes along the way, it appeared to me that ShEx (and better, easier tooling for it) could help with about 50% of your future wants/needs.
Great thoughts and thanks for sharing!
On Tue, Nov 1, 2022 at 6:41 AM Romaine Wiki romaine.wiki@gmail.com wrote:
Yesterday it was 10 years ago when Wikidata was founded and two weeks ago Wikidata reached the amount of 100 million items. This is a good moment to see what we have (and don't have), to look a bit back, and also some hope for the future.
The idea to describe this already started in September and since then I have done various analysis to get a picture. This, however, will not be a complete overview as there are too many factors involved, just a general picture of what I came across.
(Spoiler: This e-mail gets more structure further below. :-p)
== Structured? ==
Wikidata, it is said it contains structured data. I think we need to be more precise with it: it is how the data is stored that is structured. And this structured data is *only* present on an individual item. If we zoom out a little bit, and view multiple items of a serie, among items the data is often missing, fragmented, differently organised, and sometimes even problematic. On a multi-item-level (serie-level) it highly depends if a user has done all the work to synchronise the various items all together or not.
*Example:* I came across a serie of items about a certain sports tournament with an edition organised each year for 50 years on a row. For P31 (instance of), on 5 items it was called an event, on 25 items it was called a sporting event, on on 13 items a tournament, on some others a competition, and a few without P31. To be clear, each edition had the same setup, was for the same sport, everything the same. The articles on Wikipedia are better structured!
This is just a simple serie of items. Zooming out another level, the differences between series are huge, which makes the quality low.
How is a new item added? In the past ten years many items have been added with bots/tools based on the articles on Wikipedia. (Yes, for I ignore here other additions.) In future still many items will be created when an article on Wikipedia has been created. In the worst case, the user adds the sitelink and the items stays empty (practically useless!). A little bit better, the user adds P31/P279 (instance of/subclass of) (not useful, but it helps). A bit more better, also other statements are added (an item becomes useful). Better when a user checks one/two other items in a series. Much better when a user checks all items of the row of subjects. And fantastic when a user checks all items in a series and in other series.
Realistic for most new items? No, this is way too much effort. At the same time, to get quality data, it is needed.
*Example:* About a month ago there were 13 000 items with a sitelink to the Dutch Wikipedia without the basic statements P31/P279. This is just one language version, we have hundreds of wikis!
After some time after a new article has been written, users use a bot/tool to mass import new articles from Wikipedia to Wikidata with zero/little statements. We should be happy that they do this work, but these items are largely empty and do not contain useful/needed data. Also many duplicates are created this way. We need to go to the source and find a solution there, re-thinking the workflow, otherwise we keep mopping with the tap open.
*Needed for the future:* a "new article to Wikidata wizard". I imagine that when a user is ready with writing an article, he clicks on Publish page. As soon as the page is saved the user gets a pop-up dialogue. The user is first asked (in the dialogue) to search in Wikidata to see if already an item exists about this subject. With a completely new subject or empty item, the second step is that the dialogue suggests (based on the published article) a few statements the user can click and confirm. Most new articles are about subjects that are part of some sort of series or about a subject with a default set of properties we expect to be always present (like a building: country, located in the administrative territorial entity and coordinates).
I think we can be more precise about what Wikidata contains: it contains chaotic data in a structured way, which is often not structurally added nor maintained.
To get more quality, we not only must have the data structured on items and among items, but also the way how we think about working with the data needs more structure. We currently work with individual items, and without an integral perspective on the data: we have no overview.
== Wikidata gives no overview ==
I sometimes heard users say that Wikidata can provide an overview. That is however not true. Wikidata does not give an overview! Wikidata can't give an overview itself, but a tool can create an overview with the use of data from Wikidata.
To get more quality on Wikidata, more overview is needed. Overview over what is missing but should have been added on every item of a series. Overview over what unexpected use of properties can be found in a series of items. Tools that currently exist are especially good in detecting what data has been added, but not what data is missing or is weird for this type of item.
*Needed for the future:* a tool "get me more like this item", but I prefer to call it a "smart tool". When looking at an item, I often find myself wondering about what other items of this series has as statements.If a series contains 50(+) items, I have to open every single item to see if anything weird is going on or anything is missing in these items. I wish I could press a button "get me more like this". The tool then shows a full series of items with the same label (like only the year changes) (but also takes into account the labels in multiple languages at the same time) and or with the same description and or with the same/similar properties. The tool gives suggestions what to include, but it is also possible to indicate that the tool should ignore certain things. In this way I can easily find a certain sports tournament with 50 editions the past 50 years. And then includes also those editions of that tournament that have no article (on WP) in my native language (and thus no label in my language), but have an edition in for example in the Italian WP. The tools shows all the properties added, without having to indicate myself which properties should be shown, and can show the labels and descriptions in multiple languages.
== Labels, descriptions and aliases ==
If I have to describe one of the main things I do on Wikidata it would be fixing language. The number one thing to fix are capitals -> lower case. I click edit, change the capital of the label, change the capital of the description (if it is only one), and often changing the capitals from all the aliases, click save. This sounds not much work, but with visiting 100 items, it becomes a lot of work. And this was just one language, often I fix it for English and French too. Can't this be made easier?
*Needed for the future:* a tool with what I can fix capitals in one click. In 99,9% of the cases they are capitals that need to be fixed to lower case. Especially ligatures take a lot of work. If someone works on this, take into account the ligatures https://en.wikipedia.org/wiki/Ligature_(writing) and for Dutch also IJ -> ij.
*Needed for the future:* a Wikidata game that can easily find items where capitals are used while it should be lower case.
With many subjects the labels and descriptions are all right or all wrong if it comes to capitals. One group of subjects is more challenging, but in number as in the combination lower case/capitals: taxons. Many labels got imported from Wikipedia. In Dutch for example, the local names should be lower case and the scientific names with a capital. This is currently a big mess on thousands of items.
Another thing I have to fix frequently are dots in descriptions. Apparently some users like to use a dot in there, while they shouldn't. Finding the places where this took place is very hard...
*Needed for the future:* being able to run a query on the labels, descriptions and aliases. Many errors and issues can be find in their and need to get solved, but finding them is not easy. I recently came across a series of items with a spelling error.
Did you know that there are more than 20 places in the world that are called Amsterdam? How useful is then a description "building in Amsterdam"? Yes, a large number of users find it too much work to add the country of where a certain item is located.
*Needed for the future:* a tool/query with what I can quickly get an overview of all the descriptions that doesn't contain a country.
*Needed for the future:* a Wikidata game that gives me descriptions without country while they should have one.
We have arrived at useful labels and descriptions. A lot of work needs to be done in that field. Many subjects do not have a unique label as there are other subjects with the same name. To select the right item, a description is needed to clarify the context of that item.
*Needed for the future:* a Wikidata game that can generate descriptions. For many items the description can simply be <subject =P31> in <location/administrative territorial entity =P131/P276>, <country> At the same time this can be added in your local language as in English, so everyone knows what the topic is about. (Bonus: there are still items that do not contain a country, maybe something to be fixed right away?)
A Wikidata game can help to find items with missing labels/languages, but it should be possible to simple query these.
On the other hand, I also have came across items with many wrong descriptions, especially "Wikimedia category" and "Wikimedia disambiguation page". Sometimes this can't be simply reverted causing a lot of manual labour. On a recent occasion it took 50 minutes https://www.wikidata.org/w/index.php?title=Q89509298&action=history to get the page saved!
*Needed for the future:* a tool instantly removes in one item all the labels of disambiguation pages, Wikimedia category or Wikimedia list article.
Having at least a label in English is very welcome, otherwise there is no clue what Q1234567 is about. There are bots who add missing labels, including copying the page names from Commons. The sitelinks on Commons are often Commons categories that are connected to items about that individual subject. The bots adding the missing labels sadly also copy the prefix Category:when entering the labels, which is often wrong. Simple solution: Only add the Category: prefix if P31 has Wikimedia category as statement.
I personally think that the biggest weakness of Wikidata are the missing labels, and then in particular the missing labels in English. If an item has no label at all, it is basically useless. If an item only has a label on a local language (and not English), it only can be used in that local language which is a minority of the world. At the same time, while in many countries most people speak also English next to their local language, in many other countries this is not the case and people don't understand English. This is a matter of accessibility and therefore it has priority.
*Needed for the future:* a program to get for (almost) all items a label available in English + translations of this label in many local languages.
*Needed for the future:* the minimal requirement for batch uploads that they contain at least a label in English.
*Needed for the future:* a tool that helps with translations. There are many items with the same name. Currently we have to add a translation to each single item. A tool would be handy to find all items with the same name in English (like: Saint Servatius Church), and then being able to add a translation only once which the tool add to all the items.
*Needed for the future:* a tool that can do transliteration. Transliteration is a huge barrier for the usability of the data, as many labels are only added in one script, while the user uses natively another script. This especially involves names.
Especially smaller language communities have a hard time on Wikidata. A small language community means that only a very limited part of the (essential) items on Wikidata gets translated into their local language. At the same time, if you work on adding statements to Wikidata (while being a non-English speaker), you highly depend on translations being available in your own language. If something has no label in your local language -> it will not be found when searching with the local word -> you can't add a statement or you likely add a wrong statement.
Without any statements an item is practically useless, so various users are searching for items without any statements to add them. While doing that, I recently came across a few items with only a sitelink to a Wikipedia language version. This language is not available in Google Translate or any other translation tool I could find, resulting in that nothing could be done with these items.
== Statements ==
Every item on Wikidata should have have at least a statement instance of or subclass of (P31/P279) (or both), because these two properties define what the item is about. Without these properties, we are practically blind. It is great to see some of you are working on getting all the items to have these properties on them. (I recently completed that for all items with a sitelink to nlwiki: 23 373x -> 0x.) More help is needed for the many other items!
*Needed for the future:* a Wikidata game that brings up items without P31/P279 and gives suggestion(s) to add.
While doing the project of adding P31/P279, I noticed that various users still do not understand the difference between these two properties. This means that on various items users have added the P31/P279 wrong. We need to think on how we can find the items where that is the case and fix those. There is also a grey zone: a series of items for what it is not precisely clear whether it should be. Perhaps a project who can take care of those cases?
In addition to P31/P279, each theme of items has a fixed set of basic statements that should always be added. For example, with taxon as P31, also needed are scientific name (P225), taxon rank (P105), and parent taxon (P171). With building as P31, also needed are country (P17), administrative territorial entity (P131), and coordinates (P625). This seems pretty obvious, but a recent large data import still left out some basic statements, which still hasn't been fixed. The goal of adding data is that the data can be used. With basic statements missing, the quality becomes too low. I think such data imports should not be allowed.
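The idea of a fixed set per theme can be sketched as a simple lookup. This is only an illustration: the two sets are the examples given above, the type names are plain strings rather than Q-ids, and the names BASIC_STATEMENTS and missing_basic_statements are hypothetical.

```python
# Basic statements per type, taken from the two examples in the text.
# Keys are plain type names (not Q-ids) to keep the sketch simple.
BASIC_STATEMENTS = {
    "taxon": ["P225", "P105", "P171"],    # scientific name, taxon rank, parent taxon
    "building": ["P17", "P131", "P625"],  # country, admin. entity, coordinates
}


def missing_basic_statements(item_type, present_properties):
    """Return the basic properties expected for `item_type` that are absent.

    Unknown types return an empty list, because no basic set is defined for them.
    """
    required = BASIC_STATEMENTS.get(item_type, [])
    present = set(present_properties)
    return [prop for prop in required if prop not in present]


if __name__ == "__main__":
    # A building that only has P31 and a country so far.
    print(missing_basic_statements("building", ["P31", "P17"]))
    # -> ['P131', 'P625']
```

A check like this could run before a batch import is accepted, so that incomplete items are reported instead of created.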
*Needed for the future:* a program/project to get the basic statements present on (almost) all items.
We should then probably also pay attention to quality. For example, the administrative territorial entity (P131) should not be too generic. A village or a building should get as P131 the smallest administrative territorial entity possible (in many countries the municipality).
With the various fixed sets of basic statements I estimate that it is possible to cover at least 90% of all the items (taxa, geographic features, people, structures, astronomical objects, publications, etc.). The remaining ones are harder and often more specialised. Those often have "term" or "concept" as P31. To make those items more usable, they need more statements that provide better context. For that, properties like "aspect of" (P1269) and "characterized by" (P1552) are needed.
== Identifiers ==
Not much needs to be said about identifiers: they do what they have to do. Even within Wikidata they help a lot, as a symbol is shown when the same identifier has been used on another item, which helps to resolve duplicates. Most identifiers seem to relate to sports, popular culture (music, movies, etc.) and monuments. In many other fields no identifier properties have been created yet.
For a generic user, most of the work regarding identifiers is in finding out which website holds the specific identifier and how to find it there, so it can be added to the item. I think we should pay more attention to this so we can make it easier for users to add them. Another thing is that, for the theme I am working on, it is not easy to see where identifiers are missing but likely do exist.
*Needed for the future:* a tool that lists all potential items (in general or of a set of items/query) where an identifier likely is missing.
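One possible heuristic for "likely missing" is to look at what most other items in the same set already have. The sketch below assumes the set of items has already been fetched into a dictionary; the function name likely_missing, the threshold value, and the item and property ids in the example are all made up for illustration.

```python
from collections import Counter


def likely_missing(series, threshold=0.75):
    """Report properties that most items in a series have but some lack.

    `series` maps an item id to the collection of property ids present on it.
    A property counts as expected when at least `threshold` (a fraction
    between 0 and 1) of the items carry it.
    """
    if not series:
        return {}
    counts = Counter(prop for props in series.values() for prop in set(props))
    expected = {prop for prop, n in counts.items() if n / len(series) >= threshold}
    report = {}
    for item, props in series.items():
        gaps = sorted(expected - set(props))
        if gaps:
            report[item] = gaps
    return report


if __name__ == "__main__":
    # Hypothetical series: four monuments, one without the identifier.
    monuments = {
        "item-1": {"P31", "P17", "monument-id"},
        "item-2": {"P31", "P17", "monument-id"},
        "item-3": {"P31", "P17", "monument-id"},
        "item-4": {"P31", "P17"},
    }
    print(likely_missing(monuments))
    # -> {'item-4': ['monument-id']}
```

The same comparison works for any property, not only identifiers, so it could also flag missing basic statements within a series.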
When identifiers are added to items, an icon next to the identifier often indicates which statements are missing on that same item. For example, if I add a monument identifier, it also indicates that, for example, a country (P17) has not been added yet. A great help to get items more complete. At the same time identifiers are often forgotten. The other way round would be welcome too: when an item gets, for example, building as P31, it should also suggest adding a country (P17), an administrative territorial entity (P131), coordinates (P625), and perhaps more.
== Other ==
Besides the ones already mentioned, there are some other tools/software/issues that would make the work easier or that need to be solved.
*Needed for the future:* a tool that looks up all items with coordinates near given coordinates. Like Special:Nearby, but for any given location.
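The core of such a lookup is a distance filter around an arbitrary point. The sketch below uses the haversine formula on a list of coordinates that is assumed to be available locally; the function names and the item ids in the example are hypothetical, and the coordinates are approximate.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres (haversine)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against rounding errors near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def nearby(center, items, radius_km):
    """Return (item_id, distance) pairs within `radius_km` of `center`.

    `center` is a (lat, lon) pair; `items` is an iterable of
    (item_id, lat, lon) tuples. The result is sorted by distance.
    """
    lat0, lon0 = center
    hits = [(item_id, distance_km(lat0, lon0, lat, lon))
            for item_id, lat, lon in items]
    return sorted((h for h in hits if h[1] <= radius_km), key=lambda h: h[1])


if __name__ == "__main__":
    # Approximate coordinates, with made-up item ids.
    places = [
        ("place-amsterdam", 52.3676, 4.9041),
        ("place-utrecht", 52.0907, 5.1214),
        ("place-paris", 48.8566, 2.3522),
    ]
    for item_id, dist in nearby((52.37, 4.90), places, radius_km=50):
        print(item_id, round(dist, 1))
```

For a real tool the candidate list would have to come from an index or a query, since comparing against every item with coordinates does not scale.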
*Needed for the future:* better suggestions when adding statements. For example, when I add bridges (those things to cross a river), I get suggestions for properties related to astronomy. When an item has a Wikipedia article as sitelink, it would be great if the statement suggester used that article to give suggestions. For example, why do I have to indicate the country manually (again) if it has already been indicated twice in the Wikipedia article?
Ten years ago Wikidata started. Those years passed by quickly. Together we have put so much work into it, with a great result as outcome. But we are not done yet. For the next ten years I expect our main focus to be improving the quality.
Wikidata mailing list -- wikidata@lists.wikimedia.org Public archives at https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/mes... To unsubscribe send an email to wikidata-leave@lists.wikimedia.org
I agree with all these criticisms of the information in Wikidata. There are quite a few important classes in Wikidata where there are missing, questionable, or incorrect structural data. Look at colors (instances of Q1075), where some colors are both instances and subclasses of color; or ships (instances of Q11446), where some ships are subclasses of ship; or the superclasses of geographic region (Q82794), which include set; or the instances of woman (Q467), of which there are only 28.
I believe that these structural problems in Wikidata are a major, probably the major, reason that Wikidata does not have considerably more uptake than it currently does. Certainly every time I think of using Wikidata I have to think hard about what I need to do to ensure that the structural problems in Wikidata will not pose too much of a problem for my use. (In most cases I come to the reluctant conclusion that they will.)
It's not so much that there are examples of bad structural data, it is that examples are so easy to find. And it's not so much that the problems arise from bad policies, it is that there are no enforced policies. And it's not even that these are unknown problems, as most of them have been previously reported.
It is for the above reasons that I believe that lack of tool support is not the major driver of the problems, and certainly tools that can only point out problems are not going to be a significant help in solving them. Instead I believe that what is driving the structural problems with Wikidata is that the Wikidata community puts insufficient effort into identifying and implementing fixes for them. Tool support is important, I agree, but unless people in the Wikidata community put a higher priority on fixing data in Wikidata than on adding more data to it, the structural problems will continue.
I also feel that it does very little good to ask people who are adding new data to Wikidata to only create data with good structure when there are so many existing problems. Instead the existing problems first need to be fixed. This will both show that the Wikidata community cares about good structure and show people who are adding new data how it should be added, instead of the current situation, which in too many cases provides examples of how not to structure data. Consider a tool that retrieves items that are similar to an item being added. If such a comparison item has bad structuring nearby, it is very likely that the new item will either be given similar structuring or be linked to the existing bad structuring.
As far as labels, descriptions, and aliases go I agree that the current situation is poor. But what I believe is missing most is enough description that the intent of an item, particularly a class, can be correctly determined. I often end up with only a poor idea of which items should be instances of a class, particularly when considering several classes at once. The various geographic classes are a prime example here for me. In my view much of the natural language information associated with Wikidata items should be tagged with the English Wikipedia multiple issues template.
Queries that show the above problems:
SELECT ?item ?itemLabel WHERE { ?item wdt:P31 wd:Q1075. ?item wdt:P279* wd:Q1075. SERVICE wikibase:label { bd:serviceParam wikibase:language "en". } }
SELECT ?item ?itemLabel WHERE { ?item wdt:P31 wd:Q11446. ?item wdt:P279* wd:Q11446. SERVICE wikibase:label { bd:serviceParam wikibase:language "en". } }
SELECT ?item ?itemLabel WHERE { wd:Q82794 wdt:P279* ?item . SERVICE wikibase:label { bd:serviceParam wikibase:language "en". } }
SELECT ?item ?itemLabel WHERE { ?item wdt:P31/wdt:P279* wd:Q467. SERVICE wikibase:label { bd:serviceParam wikibase:language "en". } }
Peter F. Patel-Schneider
Dear all,
Thanks, Romaine, for this detailed and careful analysis of the situation. I think much of this is spot-on. I think one of the main insights here is that we need more uniformity. Wikidata in many places is still used like some exotic "structured" format for entering plain texts, which make sense to human readers but prevent or confuse automated usage. The key is to "see" collections of items rather than single pages.
It seems Wikidata would need more stakeholder communities for specific areas (say sports events) to oversee and guide the modeling of the items of this kind. We need more WikiProjects.
Regarding the question of whether solutions need to be technical or social, I'd say both must go together. I also have often been disheartened by the sheer effort that it would require to add even the most obvious statements to a larger set of items. Geography is a good example: there are so many nearby places that share the same geo-administrative history (take a look at the country, P17, of Dresden, Q1731), yet it is practically impossible to add this to any significant number of the thousands of German cities ... Here, like in many of the cases Romaine has described, the technical limitations may smother necessary community activity. (The specific case might also be an example of something where an approach of "data sharing" is needed, i.e. a modeling paradigm that simply allows us to say "this place has the same history of P17 statements as this other place"; but that's not the main topic of this post).
New tools may also enable and encourage communities to grow that have not formed in the past decade. One aspect here might be that it is difficult for communities to appreciate the result of their efforts. For example, it is very difficult to create a uniform appearance for a group of pages, not least because the order of statements (in a group of the same property) is so hard to change, and also because the pages are already very long. Even if one can achieve complete semantic uniformity, one will not currently have much opportunity to "see" this success. There are unsolved challenges here that cannot be compared with the relatively simple and small data that one can find in a typical Wikipedia Infobox. External developers and maybe even researchers could contribute here, but they would also benefit from the input and concrete ideas from WikiProjects (Romaine's email already had quite a number of directly implementable ideas in it ... this kind of constructive input is already half of the solution).
Cheers,
Markus
On 31/10/2022 23:40, Romaine Wiki wrote: