I'd agree with Stas - it depends immensely on what you mean by "real"
entities, and it would be best to define your desired subjects
explicitly if possible rather than relying on removing unwanted ones
(eg if you want to know about people, put in a filter for P31:Q5).
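As a rough sketch of that rule-in approach, assuming you are working
from entity records in the JSON layout used by the Wikidata dumps (where
an item-valued statement carries the target under mainsnak, datavalue,
value, "id"): the helper name has_claim and the trimmed sample records
below are made up for illustration and are not part of any library.

```python
def has_claim(entity, prop, item_id):
    """Return True if `entity` has a `prop` statement whose value is `item_id`."""
    for statement in entity.get("claims", {}).get(prop, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # "somevalue"/"novalue" snaks carry no datavalue
        value = snak.get("datavalue", {}).get("value", {})
        if isinstance(value, dict) and value.get("id") == item_id:
            return True
    return False


def item_statement(item_id):
    """Build a heavily trimmed item-valued statement (illustration only)."""
    return {"mainsnak": {"snaktype": "value",
                         "datavalue": {"value": {"id": item_id}}}}


# Hypothetical sample records: one person, one list article.
entities = [
    {"id": "Q42", "claims": {"P31": [item_statement("Q5")]}},
    {"id": "Q6642364", "claims": {"P31": [item_statement("Q13406463")]}},
]

# Rule in: keep only what is explicitly an instance of human (P31:Q5).
people = [e["id"] for e in entities if has_claim(e, "P31", "Q5")]
```

The point of ruling in is that anything unclassified or oddly classified
simply drops out, without you having to enumerate every kind of
"non-entity" in advance.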
To remove things like disambiguation pages, categories, or lists, you
would want (to use Magnus's WDQ syntax) something like
claim[31:(tree)] - anything that is an instance of a
subclass of Q17379835, "Wikimedia page outside the main knowledge
tree". This will remove (probably) all of our internal admin content.
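To illustrate the "instance of a subclass of" idea outside WDQ, here is
a minimal sketch over toy in-memory data. The function names are
hypothetical, and the P279 edges and the placeholder ids (the ones
starting with "Q_") are assumed for the example, not checked against
live Wikidata. Note that it also treats direct instances of the root
class as excluded.

```python
from collections import deque


def subclass_tree(root, subclass_of):
    """Return `root` plus every class that reaches it via P279 (subclass of)."""
    # Invert child -> parents into parent -> children so we can walk downwards.
    children = {}
    for child, parents in subclass_of.items():
        for parent in parents:
            children.setdefault(parent, set()).add(child)
    seen = {root}
    queue = deque([root])
    while queue:
        for child in children.get(queue.popleft(), ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def drop_instances_of(instance_of, excluded_classes):
    """Keep only items with no P31 value inside `excluded_classes`."""
    return {qid for qid, classes in instance_of.items()
            if not (set(classes) & excluded_classes)}


# Toy P279 edges (assumed, not verified): child class -> parent classes.
subclass_of = {
    "Q13406463": ["Q17379835"],      # list article class
    "Q4167410": ["Q17379835"],       # disambiguation page class
    "Q_sublist_class": ["Q13406463"],  # placeholder for a deeper subclass
}

# Toy P31 values: item -> classes it is an instance of.
instance_of = {
    "Q42": ["Q5"],
    "Q6642364": ["Q13406463"],
    "Q_some_disambig": ["Q4167410"],
    "Q_deep_list": ["Q_sublist_class"],
}

excluded = subclass_tree("Q17379835", subclass_of)
kept = drop_instances_of(instance_of, excluded)
```

Walking the whole subclass tree rather than listing classes by hand
means newly created subclasses get excluded without changing the filter.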
However, defining what constitutes a "list" is challenging; many
(most?) WP list articles contain a non-trivial amount of content in
addition to the list per se. As a result, some are labelled "lists"
(like Q6642364, a list of buildings in a city), while others are
notionally articles on the class which also include a list of members
(eg Q8344876, a list of recipients of an award). This complexity is
not likely to go away any time soon, especially given cases where
(say) the English article thinks it's a list of the set X, and the
Spanish one thinks it's an article about the set X.
On 15 June 2015 at 22:22, Thad Guidry <thadguidry(a)gmail.com> wrote:
Always agreed, it's a classification problem.
So what claims/statements do I rule out? Or what should I only rule in
(claims/statements) when wanting to return only "real" entities? Can
someone help with those negative claims/statements that I am looking for?
So far, I have only got:
filtering out any entry with P31:Q13406463 should omit most
of them from your results.
Freebase simply decided to not keep Wikipedia topic pages that simply held
"lists of entities", but instead Freebase liked to easily generate those
same "lists of entities" by using queries. There was no need to have hand
coded lists in Freebase. It was a graph database and could generate all
kinds of lists programmatically for a user, and keep those lists as views
against our user profile for easy tweaking or re-use when we wanted to.
(stored user queries)
On Mon, Jun 15, 2015 at 2:56 PM, Stas Malyshev <smalyshev(a)wikimedia.org> wrote:
In Freebase, we had bot scripts that went through and removed "Lists of
Things" topic entities since they are lists of entities and not useful
clumped together and normalized in a graph database.
Why delete them? Wikidata has a number of things which are not your
standard "entity" - lists, sources, news, quotes, service entries,
narrative articles (e.g.
- it's not exactly an "entity" like "human" or "fire"), etc. So I don't
think the approach that singles out and excludes lists would help much.
If you have an application that needs "individual entities" like
"Douglas Adams" or "London" and excludes other types, it will have to
exclude much more than just lists, so I think the approach of asking
for exactly what you need and ignoring the rest may prove more
efficient. I'm not sure there are really well-defined criteria for what
an "individual entity" actually is - I'm sure you have one that matches
your application, but some other application may have a completely
different one.
Generally, this can be solved by better classification I think, but so
far I'm not sure what to base this classification on.
Wikidata mailing list
- Andrew Gray