I'd agree with Stas - it depends immensely on what you mean by "real" entities, and it would be best to define your desired subjects explicitly if possible rather than relying on removing what you don't want (eg if you want to know about people, put in a filter for P31:Q5).
To remove things like disambiguation pages, categories, or lists, you would want to exclude (to use Magnus's WDQ syntax) anything matching something like claim[31:(tree[17379835][][279])] - that is, anything that is an instance of a subclass of Q17379835, "Wikimedia page outside the main knowledge tree". This will remove (probably) all of our internal admin content.
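As a rough illustration of the two approaches (ruling in with P31:Q5 versus ruling out the Q17379835 tree), here is a minimal Python sketch over a toy in-memory graph. The SUBCLASS_OF and INSTANCE_OF tables, the function names, and the placeholder item Q_EXAMPLE_DISAMBIG are all made up for the example, and the P279 edges shown are assumptions rather than a snapshot of Wikidata; a real application would get this data from a dump or a query service.

```python
from collections import deque

# Toy data, assumed for illustration only (not a snapshot of Wikidata).
# SUBCLASS_OF maps a class to its direct superclasses (P279).
SUBCLASS_OF = {
    "Q13406463": {"Q17379835"},  # Wikimedia list article (edge assumed)
    "Q4167410": {"Q17379835"},   # disambiguation page class (edge assumed)
}

# INSTANCE_OF maps an item to its P31 values.
INSTANCE_OF = {
    "Q42": {"Q5"},                       # Douglas Adams, a human
    "Q84": {"Q515"},                     # London, a city
    "Q6642364": {"Q13406463"},           # a list item (P31 value assumed)
    "Q_EXAMPLE_DISAMBIG": {"Q4167410"},  # hypothetical placeholder item
}


def subclass_tree(root, subclass_of):
    """Return root plus every class reachable from it by following P279 downward."""
    # Invert the child -> parents table into parent -> children.
    children = {}
    for child, parents in subclass_of.items():
        for parent in parents:
            children.setdefault(parent, set()).add(child)

    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def exclude_instances_of(items, instance_of, excluded_classes):
    """Rule out: drop items with any P31 value in excluded_classes."""
    return [i for i in items if not (instance_of.get(i, set()) & excluded_classes)]


def only_instances_of(items, instance_of, wanted_class):
    """Rule in: keep only items with wanted_class among their P31 values."""
    return [i for i in items if wanted_class in instance_of.get(i, set())]


if __name__ == "__main__":
    all_items = sorted(INSTANCE_OF)
    internal = subclass_tree("Q17379835", SUBCLASS_OF)
    print(exclude_instances_of(all_items, INSTANCE_OF, internal))
    print(only_instances_of(all_items, INSTANCE_OF, "Q5"))
```

Note that the rule-out version still lets through anything that merely lacks a matching P31 value, which is one reason the rule-in filter tends to be the safer default.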
However, defining what constitutes a "list" is challenging; many (most?) WP list articles contain a non-trivial amount of content in addition to the list per se. As a result, some are labelled "lists" (like Q6642364, a list of buildings in a city), while others are notionally articles on the class which also include a list of members (eg Q8344876, a list of recipients of an award). This complexity is not likely to go away any time soon, especially given cases where (say) the English article thinks it's a list of the set X, and the Spanish one thinks it's an article about the set X.
Andrew.
On 15 June 2015 at 22:22, Thad Guidry thadguidry@gmail.com wrote:
Stas,
Always agreed, it's a classification problem.
So what claims/statements do I rule out? Or which claims/statements should I rule in when I want to return only "real" entities? Can someone help with the negative claims/statements that I am looking for? So far, I have only got:
filtering out any entry with P31:Q13406463 should omit most of them from your results.
All,
Freebase simply decided not to keep Wikipedia topic pages that only held "lists of entities"; instead, Freebase generated those same "lists of entities" with queries. There was no need to have hand-coded lists in Freebase. It was a graph database and could generate all kinds of lists programmatically for a user, and keep those lists as views against our user profile (stored user queries) for easy tweaking or re-use when we wanted to.
Thad +ThadGuidry
On Mon, Jun 15, 2015 at 2:56 PM, Stas Malyshev smalyshev@wikimedia.org wrote:
Hi!
In Freebase, we had bot scripts that went through and removed "Lists of Things" topic entities since they are lists of entities and not useful clumped together and normalized in a graph database.
Why delete them? Wikidata has a number of things which are not your standard "entity" - lists, sources, news, quotes, service entries, narrative articles (e.g. https://en.wikipedia.org/wiki/Control_of_fire_by_early_humans - it's not exactly an "entity" like "human" or "fire"), etc. So I don't think the approach that singles out and excludes lists would help much - if you have an application that needs "individual entities" like "Douglas Adams" or "London" and excludes other types, it will have to exclude much more than just lists - but I think the approach of asking for exactly what you need and ignoring the rest may prove more efficient. I'm not sure there are really well-defined criteria to specify what an "individual entity" actually is - I'm sure you have one that matches your application, but some other application may have a completely different one. Generally, this can be solved by better classification I think, but so far I'm not sure what to base this classification on.
--
Stas Malyshev smalyshev@wikimedia.org
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata