First you say that the heuristic isn't perfect, then you say that "As long as we don't have notability criteria in a machine readable format we can only work with heuristics." and then "And I really don't believe machine readable notability criteria is something we should strive for." If the heuristic isn't perfect, then alternatives should be investigated. There are already machine readable notability criteria in there; the only thing missing is exposing them, probably by using the existing relations.

On Tue, Apr 5, 2016 at 11:32 AM, Lydia Pintscher <Lydia.Pintscher@wikimedia.de> wrote:
On Sun, Apr 3, 2016 at 4:28 PM John Erling Blad <jeblad@gmail.com> wrote:
Just read through the doc, and found some important points. I post each one in a separate mail.

> Since it is hard to decide which content is actually notable, the items
> appearing in the search should be limited to the ones having at least one
> statement and two sitelinks to the same project (like Wikipedia or Wikivoyage).
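To make the quoted rule concrete, here is a minimal sketch of such a filter. It assumes items are plain dicts with "claims" and "sitelinks" keys, loosely modelled on Wikidata's JSON, and that the project can be guessed from the site id suffix. The suffix table, the exclusion set and the function names are hypothetical, not taken from the doc.

```python
from collections import Counter

# Hypothetical suffix table for guessing the project family from a site id.
# "wiki" comes last because it is the most generic suffix.
FAMILY_SUFFIXES = (
    "wikivoyage", "wikisource", "wikiquote", "wikinews",
    "wikibooks", "wikiversity", "wiktionary", "wiki",
)
# Site ids that end in "wiki" but are not Wikipedias (assumed, incomplete).
NON_WIKIPEDIA = {"commonswiki", "specieswiki", "metawiki", "wikidatawiki"}


def project_family(site_id):
    """Guess the project family of a site id such as 'nowiki'."""
    if site_id in NON_WIKIPEDIA:
        return site_id
    for suffix in FAMILY_SUFFIXES:
        if site_id.endswith(suffix):
            return "wikipedia" if suffix == "wiki" else suffix
    return site_id


def passes_baseline(item):
    """At least one statement and two sitelinks to the same project."""
    statement_count = sum(len(v) for v in item.get("claims", {}).values())
    families = Counter(project_family(s) for s in item.get("sitelinks", {}))
    return statement_count >= 1 and any(n >= 2 for n in families.values())
```

An item with one statement and sitelinks on nowiki and nnwiki passes; one with sitelinks on nowiki and nowikivoyage does not, because those belong to different projects.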

This is a good baseline, but figuring out what is notable locally is a bit more involved. A language is used in a local area, and within that area some items are more important simply because they reside there. This is quite noticeable in the differences between nnwiki and nowiki, which both basically cover "Norway". Items that somehow relate to the local area or language are also more notable than those outside it. By traversing upwards in the claims using the "part of" property it is possible to build a priority based on the area involved. It is also possible to traverse "nationality" and a few other properties.
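The upward traversal could look roughly like the sketch below. It is only an illustration under assumptions: claims are given as a dict from item id to property label to a list of target ids, the ids and the scoring formula 1 / (1 + distance) are made up, and only "part of" and "nationality" are followed.

```python
from collections import deque

# Properties followed "upwards"; more could be added (assumption).
UPWARD_PROPERTIES = ("part of", "nationality")


def distance_to_local_area(item_id, claims, local_areas, max_depth=5):
    """Breadth-first search upwards; return hops to a local area or None."""
    seen = {item_id}
    queue = deque([(item_id, 0)])
    while queue:
        current, depth = queue.popleft()
        if current in local_areas:
            return depth
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for prop in UPWARD_PROPERTIES:
            for parent in claims.get(current, {}).get(prop, []):
                if parent not in seen:
                    seen.add(parent)
                    queue.append((parent, depth + 1))
    return None


def local_priority(item_id, claims, local_areas):
    """Hypothetical score: closer to a local area means higher priority."""
    distance = distance_to_local_area(item_id, claims, local_areas)
    return 0.0 if distance is None else 1.0 / (1 + distance)


# Toy data with made-up ids, not real Wikidata items.
claims = {
    "bryggen": {"part of": ["bergen"]},
    "bergen": {"part of": ["norway"]},
    "some_author": {"nationality": ["norway"]},
    "eiffel_tower": {"part of": ["paris"]},
    "paris": {"part of": ["france"]},
}
print(local_priority("bryggen", claims, {"norway"}))
print(local_priority("eiffel_tower", claims, {"norway"}))
```

The depth limit and the visited set keep the search bounded even if the claims contain cycles.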

Things that are directly notable, like an area enclosed in an area using the language, are somewhat easy to identify, but things that are notable by association with another notable thing are not. Take a Danish slave ship operated by a Norwegian firm: the ship is thus notable in nowiki. I would say that all things linked as an item from other notable things should be included. Some would perhaps say that "items with second order relevance should be included".
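As a rough sketch of the "second order" idea: take the set of items already considered notable and add everything they link to as an item. The link table and ids below are invented for illustration, and the expansion is deliberately one hop only, so it does not chain on to third-order items.

```python
def with_second_order(notable, links):
    """Return the notable items plus every item they link to directly."""
    expanded = set(notable)
    for item in notable:
        expanded.update(links.get(item, ()))
    return expanded


# Made-up example: the firm is notable locally, the ship only by association.
links = {
    "norwegian_firm": ["danish_slave_ship"],
    "danish_slave_ship": ["some_harbour"],
}
print(sorted(with_second_order({"norwegian_firm"}, links)))
```

Here the ship is included because the firm links to it, while the harbour is not, since it is only reachable through the ship.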

Yes, the heuristic we're using isn't perfect. However, I believe it is good enough for 99% of the cases while being really simple. This is what we need at the beginning. As we go along we can learn and see if other things make more sense.
We have taken the exact same approach to ranking for item suggestions on Wikidata. At first all we took into account was the number of sitelinks on the items. This definitely wasn't a perfect measure for how relevant an item is, but it was absolutely good enough while introducing very little complexity. As we learned more and as Wikidata grew, it was no longer good enough, so we switched the algorithm to also take into account the number of labels. This is still relatively low complexity while producing good results.
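A ranking along these lines might be sketched as follows. How the two counts are actually combined is not stated here, so the plain sum and the `label_weight` parameter are assumptions, as are the dict layout and the ids.

```python
def suggestion_score(item, label_weight=1.0):
    """Hypothetical score: sitelink count plus weighted label count."""
    sitelinks = len(item.get("sitelinks", {}))
    labels = len(item.get("labels", {}))
    return sitelinks + label_weight * labels


def rank(items, label_weight=1.0):
    """Sort items from most to least relevant under the score above."""
    return sorted(
        items,
        key=lambda item: suggestion_score(item, label_weight),
        reverse=True,
    )
```

Setting `label_weight=0` reproduces the earlier sitelinks-only ranking, which makes it easy to compare the two orderings on the same items.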
For the particular case of notability: As long as we don't have notability criteria in a machine readable format we can only work with heuristics. And I really don't believe machine readable notability criteria is something we should strive for.

Cheers
Lydia
--
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata

Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de
