Hoi, The perspective you seek is in the statistics linked from the blog post [1]. On 2020-08-03 there were 60,114,943 items with no links to any Wikipedia. Only 1,130,492 items had 10 or more links to a Wikipedia, any Wikipedia. We do not have numbers for 20 or more links and so on, but we do know that we cover some 300 languages. From my perspective, I do not want to judge what is notable and what is not. What I do know is that we only represent the information that is in Wikidata. My question would then be: how much data do we have for any one item? That question is answered somewhat in these same statistics: 53.38% (47,096,074 items) of Wikidata's items have more than 10 statements. Another aspect is the number of labels: 7,397,701 items have more than 10 labels.
For me, the items with the highest priority for automated descriptions are red links. Next are friends and false friends, where a description enables disambiguation. Finally there are the items that people seek and cannot find. I think it is simplest not to make a choice and to have them for any and all items in any and all languages. Once we gain many more labels in a language, the impact of each label will be in proportion to the number of times the item or property is used.
Automated descriptions have been with us for many years. The easiest way to demonstrate this is Reasonator looking for "John Smith"; I give you German [2], and you can replace the language code de with en for English, fr for French, or ru for Russian. Automated descriptions will be really useful when combined with Special:MediaSearch [3]; together they enable searching for pictures in any language.
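To make the substitution concrete, here is a minimal sketch (Python, standard library only) that builds the Reasonator URL from [2] for each of the language codes mentioned above; the search term and codes are the ones from [2], nothing more.

    # Build Reasonator lookup URLs by swapping the language code.
    # The base URL and the language codes come from [2] above.
    from urllib.parse import urlencode

    BASE = "https://reasonator.toolforge.org/"
    for lang in ("de", "en", "fr", "ru"):
        print(BASE + "?" + urlencode({"find": "John Smith", "lang": lang}))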
Lsjbot [4] has had a huge impact on Wikipedia, so much so that it has its own article. We can iterate on its code and use the data in Wikidata instead. Once we are able to re-create it for one of the languages used by Lsjbot, we can use it as a template for other languages; a sketch of the template idea follows. This is when it becomes relevant to experiment with true language technology.
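To illustrate the template idea only (this is not Lsjbot's actual code): fetch an item from the Wikidata API and fill a per-language sentence template. The wbgetentities API module is real; the sentence templates and the choice of P31 (instance of) as the property to verbalise are my assumptions.

    # A minimal sketch of template filling from Wikidata data. The
    # per-language templates and the use of P31 (instance of) are
    # illustrative assumptions, not Lsjbot's actual code.
    import requests

    API = "https://www.wikidata.org/w/api.php"

    TEMPLATES = {  # hypothetical per-language sentence templates
        "en": "{label} is a {class_label}.",
        "sv": "{label} är en {class_label}.",
    }

    def fetch_entity(qid, props=None, lang=None):
        params = {"action": "wbgetentities", "ids": qid, "format": "json"}
        if props:
            params["props"] = props
        if lang:
            params["languages"] = lang
        return requests.get(API, params=params).json()["entities"][qid]

    def describe(qid, lang):
        entity = fetch_entity(qid)
        label = entity["labels"][lang]["value"]
        # Take the first "instance of" (P31) value and look up its label.
        class_qid = entity["claims"]["P31"][0]["mainsnak"]["datavalue"]["value"]["id"]
        class_entity = fetch_entity(class_qid, props="labels", lang=lang)
        class_label = class_entity["labels"][lang]["value"]
        return TEMPLATES[lang].format(label=label, class_label=class_label)

    print(describe("Q42", "en"))  # "Douglas Adams is a human."

A real generator would need to pick the right template per class, handle missing labels, and fall back across languages, but the shape of the problem is the same.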
Thanks, GerardM
[1] https://wikidata-todo.toolforge.org/stats.php?reverse
[2] https://reasonator.toolforge.org/?find=John+Smith&lang=de
[3] https://commons.wikimedia.org/wiki/Special:MediaSearch?type=bitmap&q=%D0...
[4] https://en.wikipedia.org/wiki/Lsjbot
On Mon, 10 Aug 2020 at 16:42, Grounder UK grounderuk@gmail.com wrote:
Thanks, Gerard.
I've tried to get some idea of what these items are, but my SPARQL just times out. It's not clear to me, you see, how many of these items are not already, in practice, "knowable in any language". If they represent things with names or titles, and they are not notable enough for a Wikipedia article in any language, perhaps their name or title is all anyone needs. Or perhaps a transliteration of it into the reader's own script. Some of the items must actually represent translations or derivative works of some other item.
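(For the record, here is a minimal sketch of the kind of query I mean, counting the items with no sitelinks at all. wikibase:sitelinks is a real predicate in the query service; whether a count at this scale completes before the service's timeout is exactly the problem.)

    # A minimal sketch of the kind of count that times out for me:
    # how many items have no sitelink to any wiki at all?
    import requests

    QUERY = """
    SELECT (COUNT(?item) AS ?count) WHERE {
      ?item wikibase:sitelinks 0 .
    }
    """

    r = requests.get("https://query.wikidata.org/sparql",
                     params={"query": QUERY, "format": "json"})
    print(r.json()["results"]["bindings"][0]["count"]["value"])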
Anyway, to keep it simple, I'd agree that automatically generating a language-neutral description for a Wikidata item could be a high priority. But the higher priority, it seems to me, is identifying which items would most benefit from such a description. Some of those could be fixed right now! In my mind, though, I'm already ignoring people and their works, given at least one external identifier. Then I'm thinking places just need a decent geotag... Maybe what we really need is a high-quality search string to be derived, so that anyone can make use of their search engine of choice.
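(To show what I mean by a derived search string, a purely illustrative sketch, assuming we already hold a label, a class label, and a geotag; the function and the format are mine, not anything that exists today.)

    # A purely illustrative sketch of deriving a search string from data
    # we already hold: a label, an optional class label, and an optional
    # geotag (latitude, longitude). Nothing here is an existing API.
    def search_string(label, class_label=None, coords=None):
        parts = ['"%s"' % label]
        if class_label:
            parts.append(class_label)
        if coords:
            parts.append("%.4f, %.4f" % coords)
        return " ".join(parts)

    # Coordinates are approximate.
    print(search_string("Rijksmuseum", "museum", (52.3600, 4.8852)))
    # "Rijksmuseum" museum 52.3600, 4.8852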
For Wikipedia items, again, I agree that exploring all the bi-directional links should be fruitful. As a start, we could extract the structural context of each link: that it is a link from an identifiable section of the Wikipedia page and, where applicable, that it links to an identifiable section of a Wikipedia page. We could enhance this data with the textual context of the link (the sentence it is in and the adjacent sentences, with their links). (At this stage, I'd be slightly concerned about copyright implications, so we might leave the textual context in its original Wikipedia, where it could already enhance the backlinks ("What links here") with meaningful context.) Interpreting contextual links into a language-neutral form would then allow equivalent links to be identified in other Wikipedias. If we establish that several Wikipedias seem to have semantically equivalent links, I would think it reasonable to quote the textual context from just one of them as a reference in support of the generalized claim in the copyright-free domain (where only the idea is expressed and no person is the author of its language-neutral expression).
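(The structural part, at least, is easy to sketch. mwparserfromhell is a real wikitext parser; grouping links by section below is my own illustration of "an identifiable section".)

    # A minimal sketch of structural context: for each section of a
    # page's wikitext, list the outgoing wikilinks. A link target like
    # "Page#Section" records a link into an identifiable section.
    import requests
    import mwparserfromhell

    def links_by_section(title, lang="en"):
        r = requests.get(
            f"https://{lang}.wikipedia.org/w/api.php",
            params={"action": "parse", "page": title,
                    "prop": "wikitext", "format": "json"})
        wikitext = r.json()["parse"]["wikitext"]["*"]
        code = mwparserfromhell.parse(wikitext)
        for section in code.get_sections(include_lead=True, flat=True):
            headings = section.filter_headings()
            name = str(headings[0].title).strip() if headings else "(lead)"
            links = [str(link.title) for link in section.filter_wikilinks()]
            yield name, links

    for name, links in links_by_section("Lsjbot"):
        print(name, len(links))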
I don't know anything about LSJBOT so, as always, a link would be helpful.
Best regards, Al.
On Sunday, 9 August 2020, Gerard Meijssen gerard.meijssen@gmail.com wrote:
Hoi, I am amazed by all the competing ideas and notions I have read on the mailing list so far. It is bewildering and does not give me a sense of what is to be done.
I have thought about it and for me it is simple. For every Wikipedia article there are two Wikidata items that have no Wikipedia article. It follows that the first order of business is to make these knowable in any language. The best way to do this is by providing automated descriptions that aid in disambiguation.
When a Wikipedia article exists, it links to many articles. All of them have their own Wikidata item and all can be described either in a Wikidata triple or in structured text.
When sufficient data is available, a text can be generated. This has been demonstrated by LSJBOT, and it is why the Cebuano Wikipedia has so many articles. A template as used by LSJBOT can be adapted for every language.
My point is that all the research in the world makes no difference when we do not apply what we know. Thanks, GerardM
https://ultimategerardm.blogspot.com/2020/08/keeping-it-simple-for-abstract-...