Hoi,
The perspective you seek is in one of the links in the blogpost [1].
On 2020-08-03 there were 60,114,943 items with no links to any Wikipedia.
Only 1,130,492 items had 10 or more links to a Wikipedia, any Wikipedia. We
do not have numbers for 20 or more, etc., but we do know that we represent
some 300 languages. From my perspective, I do not want to judge what is notable and
what is not. What I do know is that we only represent the information that
is in Wikidata. My question would then be: how much data do we have for any
one item? That question is answered somewhat in these same statistics:
53.38% (47,096,074 items) of Wikidata's items have more than 10 statements.
Another aspect is the number of labels: 7,397,701 items have more than 10
labels.
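As a quick sanity check on those figures (the total item count below is my own inference, not a number from the post), the 53.38% figure implies a total of roughly 88 million items at the time:

```python
# Figures quoted in the post
items_over_10_statements = 47_096_074
fraction = 0.5338  # 53.38%

# Implied total number of Wikidata items (an inference, not a quoted figure)
implied_total = items_over_10_statements / fraction
print(f"{implied_total / 1e6:.1f} million items")  # ≈ 88.2 million
```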
For me, the items with the highest priority for automated descriptions are
red links. Next are the friends and false friends, where descriptions enable
disambiguation. Finally, the items that people seek and cannot find. I think
it is simplest not to make a choice and to have them for any and all
items in any and all languages. Once we gain many more labels in a language,
the impact of each label will be proportional to the number of times the
item or property is used.
Automated descriptions have been with us for many years. The easiest way to
demonstrate this is Reasonator looking up "John Smith"; I give you German
[2], and you can replace the language code de with en for English, fr for
French, or ru for Russian. Automated descriptions will be really useful when
used with "Special:MediaSearch" [3]: it enables searching for pictures in
any language.
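Swapping the language code can even be scripted; the sketch below builds such links for several languages. The base URL and parameter names are my assumption about how Reasonator is typically invoked, not details taken from the post:

```python
# Hypothetical sketch: build Reasonator links for one item in several languages.
# The base URL and the q/lang parameter names are assumptions, not confirmed here.
BASE = "https://reasonator.toolforge.org/"

def reasonator_url(qid: str, lang: str) -> str:
    """Return a Reasonator URL for the given item and language code."""
    return f"{BASE}?q={qid}&lang={lang}"

for lang in ("de", "en", "fr", "ru"):
    print(reasonator_url("Q42", lang))
```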
Lsjbot [3] has had a huge impact on Wikipedia, so much so that it has its
own article. We can iterate on its code and use the data of Wikidata
instead. Once we are able to re-create it for one of the languages used by
Lsjbot, we can use it as a template for other languages. This is when it
becomes relevant to experiment with true language technology.
Thanks,
GerardM
[1]
On Mon, 10 Aug 2020 at 16:42, Grounder UK <grounderuk(a)gmail.com> wrote:
Thanks, Gerard.
I've tried to get some idea of what these items are, but my SPARQL just
times out. It's not clear to me, you see, how many of these items are not
already, in practice, "knowable in any language". If they represent things
with names or titles and they are not notable enough for a Wikipedia
article in any language, perhaps their name or title is all anyone needs.
Or, perhaps, transliteration into their language's script. Some of the
items must actually represent translations or derivative works of some
other item.
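A query in that direction might look like the sketch below; counting across all of Wikidata this way does tend to time out on the public endpoint, as noted above, so this is only illustrative. The endpoint URL is the standard Wikidata Query Service one, and WDQS does expose a per-item sitelink count via the wikibase:sitelinks predicate:

```python
# Sketch: count items with zero sitelinks via the Wikidata Query Service.
# WDQS exposes each item's sitelink count as the wikibase:sitelinks predicate.
# A full count over ~60M items will likely time out, hence "just times out" above.
import urllib.parse

QUERY = """
SELECT (COUNT(?item) AS ?count) WHERE {
  ?item wikibase:sitelinks 0 .
}
"""

url = ("https://query.wikidata.org/sparql?format=json&query="
       + urllib.parse.quote(QUERY))
print(url)  # paste into a browser, or fetch it with urllib / requests
```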
Anyway, to keep it simple, I'd agree that automatically generating a
language-neutral description for a Wikidata Item could be a high priority.
But the higher priority, it seems to me, is identifying which items would
most benefit from such a description. Some of those could be fixed right
now! In my mind, though, I'm already ignoring people and their works, given
at least one external identifier. Then I'm thinking places just need a
decent geotag... Maybe what we really need is a high-quality search string
to be derived, so that anyone can make use of their search engine of choice.
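Deriving such a search string could be as simple as concatenating whatever labels and identifying statements an item has; a hypothetical sketch, where the input dict and its field names are invented for illustration rather than any real Wikidata API shape:

```python
# Hypothetical: derive a search string from an item's label, type and geotag.
# The input dict mimics a heavily simplified Wikidata item; invented for illustration.
def search_string(item: dict) -> str:
    """Join the most identifying bits of an item into one search query."""
    parts = [item.get("label", "")]
    if "instance_of" in item:
        parts.append(item["instance_of"])
    if "coordinates" in item:
        lat, lon = item["coordinates"]
        parts.append(f"{lat:.3f} {lon:.3f}")
    return " ".join(p for p in parts if p)

example = {"label": "Castle Hotel", "instance_of": "hotel",
           "coordinates": (53.191, -2.890)}
print(search_string(example))  # "Castle Hotel hotel 53.191 -2.890"
```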
For Wikipedia items, again, I agree that exploring all the bi-directional
links should be fruitful. As a start, we could extract the structural
context of the link (that it is a link from an identifiable section of the
Wikipedia page and, if it is the case, that it links to an identifiable
section of a Wikipedia page). We could enhance this data with the textual
context of the link (the sentence it is in and the adjacent sentences, with
their links). (At this stage, I'd be slightly concerned about copyright
implications, so we might leave the textual context in its original
Wikipedia. Here it could already enhance the backlinks ("What links here")
with meaningful context.) Interpretation of contextual links into a
language-neutral form would then allow equivalent links to be identified in
other Wikipedias. If we establish that several Wikipedias seem to have
semantically equivalent links, I would think it reasonable to quote the
textual context from just one of them as a reference in support of the
generalized claim in the copyright-free domain (where only the idea is
expressed and no person is author of its language-neutral expression).
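The structural and textual context described above could be captured as one small record per link; a sketch, where the field names are my own rather than any established schema:

```python
# Sketch of a per-link context record, as described above: where the link sits
# structurally (page and section on both ends) and textually (the surrounding
# sentences). Field names are invented for illustration.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LinkContext:
    source_page: str
    source_section: str                 # section of the page containing the link
    target_page: str
    target_section: Optional[str]       # anchor on the target page, if any
    sentence: str                       # the sentence the link occurs in
    adjacent_sentences: list = field(default_factory=list)

ctx = LinkContext(
    source_page="Douglas Adams",
    source_section="Career",
    target_page="The Hitchhiker's Guide to the Galaxy",
    target_section=None,
    sentence="Adams is best known as the author of The Hitchhiker's Guide.",
)
print(ctx.source_section, "->", ctx.target_page)
```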
I don't know anything about LSJBOT so, as always, a link would be helpful.
Best regards,
Al.
On Sunday, 9 August 2020, Gerard Meijssen <gerard.meijssen(a)gmail.com>
wrote:
Hoi,
I am amazed by all the competing ideas and notions I have read on the
mailing list so far. It is bewildering and does not give me a sense of
what is to be done.
I have thought about it and for me it is simple. For every Wikipedia
article there are two Wikidata items that have no Wikipedia article. It
follows that the first order of business is to make these knowable in any
language. The best way to do this is by providing automated descriptions
that aid in disambiguation.
When a Wikipedia article exists, it links to many articles. All of them
have their own Wikidata item and all can be described either in a Wikidata
triple or in structured text.
When sufficient data is available, a text can be generated. This has been
demonstrated by LSJBOT and it is why the Cebuano Wikipedia has so many
articles. A template as used by LSJBOT can be adapted for every language.
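The Lsjbot-style approach amounts to filling per-language templates from structured data; a minimal sketch in that spirit, where the templates, field names, and example item are all invented for illustration (a real adaptation would also need the data values themselves translated or labelled per language):

```python
# Minimal sketch of template-based article generation from Wikidata-like data,
# in the spirit of Lsjbot. Templates, field names and the example are invented.
TEMPLATES = {
    "en": "{label} is a {instance_of} in {country}.",
    "sv": "{label} är en {instance_of} i {country}.",  # values would need translating too
}

def generate(lang: str, data: dict) -> str:
    """Fill the template for one language with an item's data."""
    return TEMPLATES[lang].format(**data)

item = {"label": "Kinabalu", "instance_of": "mountain", "country": "Malaysia"}
print(generate("en", item))  # "Kinabalu is a mountain in Malaysia."
```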
My point is that all the research in the world makes no difference when
we do not apply what we know.
Thanks,
GerardM
https://ultimategerardm.blogspot.com/2020/08/keeping-it-simple-for-abstract…
_______________________________________________
Abstract-Wikipedia mailing list
Abstract-Wikipedia(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/abstract-wikipedia