First you say that the heuristic isn't perfect, then you say that "As long
as we don't have notability criteria in a machine readable format we can
only work with heuristics." and then "And I really don't believe machine
readable notability criteria is something we should strive for." If the
heuristic isn't perfect then alternatives should be investigated. There are
already machine readable notability criteria in there; the only thing
missing is exposing them, probably by using the existing relations.
On Tue, Apr 5, 2016 at 11:32 AM, Lydia Pintscher <
Lydia.Pintscher(a)wikimedia.de> wrote:
On Sun, Apr 3, 2016 at 4:28 PM John Erling Blad
<jeblad(a)gmail.com> wrote:
Just read through the doc, and found some important points. I post each
one in a separate mail.
Since it is hard to decide which content is actually notable, the items
appearing in the search should be limited to the ones having at least one
statement and two sitelinks to the same project (like Wikipedia or
Wikivoyage).
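A minimal sketch of the baseline filter quoted above, assuming an item is described only by its statement count and a list of site ids such as "enwiki" or "nowikivoyage". The helper names and the suffix table are hypothetical and not taken from the doc; the suffix rule is deliberately crude (for example, it would also count "commonswiki" as a Wikipedia).

```python
from collections import Counter

# Hypothetical suffix table: longer suffixes first, plain "wiki" last.
PROJECT_SUFFIXES = ("wikivoyage", "wikiquote", "wikisource", "wikibooks", "wiki")


def project_of(site_id):
    """Map a site id such as 'enwiki' to its project suffix, or None."""
    for suffix in PROJECT_SUFFIXES:
        if site_id.endswith(suffix):
            return suffix
    return None


def passes_baseline(statement_count, site_ids):
    """At least one statement and two sitelinks within one project."""
    if statement_count < 1:
        return False
    per_project = Counter(project_of(s) for s in site_ids)
    per_project.pop(None, None)  # ignore site ids we could not classify
    return any(count >= 2 for count in per_project.values())
```

With this sketch, an item with one statement and sitelinks to "enwiki" and "nowiki" passes, while one with sitelinks to "enwiki" and "enwikivoyage" does not, because the two links belong to different projects.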
This is a good baseline, but figuring out what is notable locally is a
bit more involved. A language is used in a local area, and within that area
some items are more important just because they reside within the area.
This is quite noticeable in the differences between nnwiki and nowiki, which
both basically cover "Norway". Also, items that somehow relate to the
local area or language are more noticeable than those outside those areas.
By traversing upwards in the claims using the "part of" property it is
possible to build a priority on the area involved. It is possible to
traverse "nationality" and a few other properties.
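The upward traversal could look roughly like the sketch below. It assumes the "part of" claims have already been loaded into a plain dictionary from an item to its parents; the function name, the toy labels and the idea of using the hop count as a priority are illustrative assumptions, not an existing Wikidata API.

```python
from collections import deque


def distance_to_area(item, local_areas, part_of):
    """Breadth-first walk up the "part of" claims.

    Returns the number of hops from `item` to the nearest entry in
    `local_areas`, or None if no local area is reachable.
    """
    seen = {item}
    queue = deque([(item, 0)])
    while queue:
        current, hops = queue.popleft()
        if current in local_areas:
            return hops
        for parent in part_of.get(current, ()):
            if parent not in seen:  # guard against cycles in the claims
                seen.add(parent)
                queue.append((parent, hops + 1))
    return None


# Toy data with plain labels instead of real item ids.
part_of = {
    "Bergen": ["Hordaland"],
    "Hordaland": ["Norway"],
    "Aarhus": ["Denmark"],
}
```

A smaller hop count could then be read as a higher local priority. Other properties such as "nationality" could be merged into the same dictionary before the walk.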
Things directly noticeable, like an area enclosed in an area using the
language, are somewhat easy to identify, but things that are noticeable by
association with another noticeable thing are not. Take a Danish slave ship
operated by a Norwegian firm: the ship is thus noticeable in nowiki. I
would say that all things linked as an item from other noticeable things
should be included. Some would perhaps say that "items with second order
relevance should be included".
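As a rough illustration of "second order relevance", the sketch below adds every item that is linked directly from an already noticeable item, and stops after that single hop. The dictionary of links and all names are hypothetical.

```python
def with_second_order(notable, links):
    """Return the notable items plus everything they link to directly."""
    included = set(notable)
    for item in notable:
        included.update(links.get(item, ()))
    return included


# Toy data: the firm is locally noticeable, the ship is linked from it.
links = {
    "Norwegian firm": ["Danish slave ship"],
    "Danish slave ship": ["Some port"],
}
```

Here the ship is included because the firm links to it, while "Some port" is not, since it is two hops away from anything noticeable.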
Yes, the heuristic we're using isn't perfect. However, I believe it is good
enough for 99% of the cases while being really simple. This is what we need
at the beginning. As we go along we can learn and see if other things make
more sense.
We have taken the exact same approach to ranking for item suggestions on
Wikidata. At first all we took into account was the number of sitelinks on
the items. This definitely wasn't a perfect measure for how relevant an
item is but it was absolutely good enough while introducing very little
complexity. As we learned more and as Wikidata grew, it was no longer
good enough, so we switched the algorithm to also take into account the
number of labels. This is still relatively low complexity while producing
good results.
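To make the two stages concrete, here is a small sketch. The mail only says that the ranking first used the number of sitelinks and later also the number of labels; how the two are combined is not stated, so the plain sum below is purely an assumption, as are the function names and the candidate format.

```python
def score_v1(candidate):
    """First version: number of sitelinks only."""
    return candidate["sitelinks"]


def score_v2(candidate):
    """Later version: also counts labels (the plain sum is an assumption)."""
    return candidate["sitelinks"] + candidate["labels"]


def rank(candidates, score):
    """Order suggestion candidates from most to least relevant."""
    return sorted(candidates, key=score, reverse=True)


# Hypothetical candidates for one search term.
candidates = [
    {"id": "A", "sitelinks": 3, "labels": 40},
    {"id": "B", "sitelinks": 5, "labels": 6},
]
```

With these toy numbers the first version puts B ahead of A, and the second puts A ahead of B, which is the kind of shift adding the label count can cause.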
For the particular case of notability: As long as we don't have notability
criteria in a machine readable format we can only work with heuristics. And
I really don't believe machine readable notability criteria is something we
should strive for.
Cheers
Lydia
--
Lydia Pintscher -
http://about.me/lydia.pintscher
Product Manager for Wikidata
Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de
Wikimedia Deutschland - Society for the Promotion of Free Knowledge e. V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 Nz. Recognized as a non-profit by
the Finanzamt für Körperschaften I Berlin, tax number 27/029/42207.