On Sun, Apr 3, 2016 at 4:28 PM John Erling Blad jeblad@gmail.com wrote:
Just read through the doc, and found some important points. I'll post each one in a separate mail.
Since it is hard to decide which content is actually notable, the items appearing in the search should be limited to the ones having at least one statement and two sitelinks to the same project (like Wikipedia or Wikivoyage).
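As a rough sketch of the baseline rule quoted above: an item qualifies if it has at least one statement and at least two sitelinks within the same project. The function names and the suffix table are hypothetical, and mapping a site id to its project by suffix is a simplifying assumption (ids such as "commonswiki" would need special handling).

```python
from collections import Counter

# Assumed suffix table; longer suffixes come first so that
# "enwikivoyage" is not mistaken for a Wikipedia site id.
PROJECT_SUFFIXES = (
    "wikivoyage", "wikiquote", "wikisource", "wikibooks",
    "wikinews", "wikiversity", "wiktionary", "wiki",
)


def project_of(site_id):
    """Map a site id such as 'nowiki' to a project key (hypothetical helper)."""
    for suffix in PROJECT_SUFFIXES:
        if site_id.endswith(suffix):
            return suffix
    return site_id  # unknown id: treat it as its own project


def passes_baseline(statement_count, site_ids):
    """At least one statement and two sitelinks to the same project."""
    per_project = Counter(project_of(s) for s in site_ids)
    return statement_count >= 1 and any(n >= 2 for n in per_project.values())
```

For instance, an item with one statement and sitelinks on nowiki and nnwiki passes, while one with sitelinks on nowiki and enwikivoyage does not.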
This is a good baseline, but figuring out what is notable locally is a bit more involved. A language is used in a local area, and within that area some items are more important simply because they reside there. This is quite noticeable in the differences between nnwiki and nowiki, which both basically cover "Norway". Items that somehow relate to the local area or language are also more notable than those outside it. By traversing upwards through the claims using the "part of" property, it is possible to build a priority based on the area involved. It is also possible to traverse "nationality" and a few other properties.
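The upward traversal could look something like the sketch below. It assumes the "part of" claims have already been extracted into a plain mapping from an item to its parents; the function name, the `max_hops` cutoff, and the toy identifiers are all hypothetical. Fewer hops to the language's area would mean a higher priority.

```python
from collections import deque


def hops_to_area(item, area, part_of, max_hops=5):
    """Breadth-first walk up 'part of' links; return the hop count or None."""
    seen = {item}
    queue = deque([(item, 0)])
    while queue:
        current, hops = queue.popleft()
        if current == area:
            return hops
        if hops == max_hops:
            continue  # do not climb further from this node
        for parent in part_of.get(current, ()):
            if parent not in seen:  # guards against cycles in the data
                seen.add(parent)
                queue.append((parent, hops + 1))
    return None


# Toy data, not real Wikidata ids.
part_of = {
    "city": ["county"],
    "county": ["country"],
    "foreign_city": ["foreign_country"],
}
print(hops_to_area("city", "country", part_of))
print(hops_to_area("foreign_city", "country", part_of))
```

Other properties such as "nationality" could be folded in by merging their targets into the same mapping, though how to weight them is left open here.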
Things that are directly notable, like an area enclosed in an area where the language is used, are fairly easy to identify, but things that are notable by association with another notable thing are not. Take a Danish slave ship operated by a Norwegian firm: the ship is thus notable in nowiki. I would say that all things linked as an item from other notable things should be included. Some would perhaps say that "items with second-order relevance should be included".
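One way to read "second-order relevance" is a single-hop expansion: start from the items already judged notable and add everything they link to as an item. The sketch below assumes the links are available as a simple mapping; the function name and the toy identifiers are hypothetical.

```python
def with_second_order(notable, links):
    """Return the notable set plus every item linked directly from it."""
    expanded = set(notable)
    for item in notable:
        expanded.update(links.get(item, ()))
    return expanded


# Toy data: the firm is locally notable and links to the ship it operated.
links = {"norwegian_firm": ["slave_ship"], "slave_ship": ["home_port"]}
print(with_second_order({"norwegian_firm"}, links))
```

The expansion deliberately stops after one hop, so "home_port" is not pulled in; following links repeatedly would eventually include most of the graph.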
Yes, the heuristic we're using isn't perfect. However, I believe it is good enough for 99% of the cases while being really simple. This is what we need at the beginning. As we go along we can learn and see if other things make more sense.
We have taken the exact same approach to ranking for item suggestions on Wikidata. At first, all we took into account was the number of sitelinks on the items. This definitely wasn't a perfect measure of how relevant an item is, but it was absolutely good enough while introducing very little complexity. As we learned more and Wikidata grew, it was no longer good enough, so we switched the algorithm to also take into account the number of labels. This is still relatively low complexity while producing good results.
For the particular case of notability: as long as we don't have notability criteria in a machine-readable format, we can only work with heuristics. And I really don't believe machine-readable notability criteria are something we should strive for.
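To illustrate the ranking history described above, here is a minimal sketch of the two scoring stages. The mail does not say how sitelinks and labels are combined, so the plain sum in `score_v2` is an assumption, as are the function names and the toy counts.

```python
def score_v1(item):
    """First stage: number of sitelinks only."""
    return item["sitelinks"]


def score_v2(item):
    """Later stage: also count labels (a plain sum is assumed here)."""
    return item["sitelinks"] + item["labels"]


def rank(items, score):
    """Order items from highest to lowest score."""
    return sorted(items, key=score, reverse=True)


# Toy counts, not real Wikidata figures.
items = [
    {"id": "a", "sitelinks": 3, "labels": 1},
    {"id": "b", "sitelinks": 2, "labels": 10},
]
print([i["id"] for i in rank(items, score_v1)])
print([i["id"] for i in rank(items, score_v2)])
```

Swapping the score function while leaving `rank` untouched mirrors the point made in the mail: the signal can be refined later without adding much complexity.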
Cheers, Lydia