Always agreed, it's a classification problem.
So which claims/statements do I rule out? Or which claims/statements
should I rule in when I want to return only "real" entities? Can
someone help with the negative claims/statements that I am looking for?
So far, I have only got:
filtering out any entry with P31:Q13406463 should omit most
of them from your results.
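
A minimal sketch of that exclusion filter, assuming the entities are already loaded as dicts shaped like the Wikidata JSON export (claims keyed by property ID, each statement carrying a "mainsnak"). P31 is "instance of" and Q13406463 is "Wikimedia list article"; the function names, the make_entity helper, and the "example-list" ID are hypothetical and exist only for illustration.

```python
LIST_ARTICLE = "Q13406463"  # Wikimedia list article


def instance_of_ids(entity):
    """Collect the item IDs this entity is an instance of (P31)."""
    ids = set()
    for statement in entity.get("claims", {}).get("P31", []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # "novalue" / "somevalue" snaks carry no item ID
        value = snak.get("datavalue", {}).get("value", {})
        if "id" in value:
            ids.add(value["id"])
        elif "numeric-id" in value:
            # Assumption: older dumps expose only the numeric part.
            ids.add("Q%d" % value["numeric-id"])
    return ids


def drop_list_articles(entities):
    """Keep every entity that is not marked as a Wikimedia list article."""
    return [e for e in entities if LIST_ARTICLE not in instance_of_ids(e)]


def make_entity(qid, classes):
    """Hypothetical helper: build a toy entity with the given P31 values."""
    statements = [
        {"mainsnak": {"snaktype": "value", "property": "P31",
                      "datavalue": {"type": "wikibase-entityid",
                                    "value": {"entity-type": "item", "id": c}}}}
        for c in classes
    ]
    return {"id": qid, "claims": {"P31": statements}}


sample = [
    make_entity("Q42", ["Q5"]),                  # Douglas Adams, a human
    make_entity("example-list", [LIST_ARTICLE]),  # placeholder ID, not a real item
]
kept = drop_list_articles(sample)
print([e["id"] for e in kept])
```

This only removes items explicitly tagged as list articles; anything else that is not a "real" entity for your application still passes through.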
Freebase decided not to keep Wikipedia topic pages that merely held
"lists of entities"; instead, it generated those same "lists of entities"
with queries. There was no need to have hand-coded lists in Freebase. It
was a graph database and could generate all kinds of lists
programmatically for a user, and keep those lists as views (stored user
queries) against our user profile for easy tweaking or reuse when we
wanted to.
On Mon, Jun 15, 2015 at 2:56 PM, Stas Malyshev <smalyshev(a)wikimedia.org> wrote:
In Freebase, we had bot scripts that went through and removed "Lists of
Things" topic entities since they are lists of entities and not useful
clumped together and normalized in a graph database.
Why delete them? Wikidata has a number of things which are not your
standard "entity" - lists, sources, news, quotes, service entries,
narrative articles (e.g.
- it's not exactly "entity" like "human" or "fire"), etc. So I don't
think the
approach that singles out and excludes lists would help much - if you
have an application that needs "individual entities" like "Douglas
Adams" or "London" and excludes other types, it will have to exclude
much more than just lists - but I think the approach of asking for
exactly what you need and ignoring the rest may prove more efficient.
I'm not sure there are really well-defined criteria to specify what an
"individual entity" actually is - I'm sure you have one that matches
your application, but some other application may have a completely
different one.
Generally, this can be solved by better classification I think, but so
far I'm not sure what to base this classification on.
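
The "ask for exactly what you need" approach above could look roughly like the sketch below. It assumes the P31 ("instance of") values have already been extracted into a plain mapping from item ID to a set of class IDs; the mapping, the function name, and the "example-..." IDs are hypothetical. Q5 is "human"; the set of wanted classes is whatever your application defines.

```python
WANTED_CLASSES = {"Q5"}  # Q5 = human; extend this set per application


def select_wanted(instance_of, wanted=WANTED_CLASSES):
    """Return the IDs whose P31 values overlap the wanted classes.

    instance_of: dict mapping an item ID to the set of its P31 class IDs.
    Items with no matching class (lists, unclassified items, and so on)
    are ignored rather than excluded one type at a time.
    """
    return sorted(qid for qid, classes in instance_of.items()
                  if classes & wanted)


toy = {
    "Q42": {"Q5"},                    # Douglas Adams
    "example-list": {"Q13406463"},    # placeholder for a list article
    "example-unclassified": set(),    # placeholder with no P31 at all
}
selected = select_wanted(toy)
print(selected)
```

Note that this matches P31 values exactly and does not follow subclass (P279) relations, so an item typed with a more specific class than the ones in the wanted set would be missed.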
Wikidata mailing list