Re: [Wikitech-l] category intersection conversations

9 May 2013

On 2013-05-08 11:48 PM, "James Forrester" <jforrester(a)wikimedia.org>
wrote:
...

 On 8 May 2013 18:26, Sumana Harihareswara <sumanah(a)wikimedia.org> wrote:

 Recently a lot of people have been talking about what's possible and
 what's necessary regarding MediaWiki, CatScan-like tools, and real
 category intersection; this mail has some pointers.

 The long-term solution is a sparkly query for, e.g., people with aspects
 novelist + Singaporean, and it would be great if Wikidata could be the
 data-source.  Generally people don't really want to search using
 hierarchical categories; they want tags and they want AND. But
 MediaWiki's current power users do use hierarchical labels, so any
 change would have to deal with current users' expectations.  Also my
 head hurts just thinking of the "but my intuitively obvious ontology is
 better than yours" arguments.

 To put a nice clear stake in the ground, a magic-world-of-loveliness
 sparkly proposal for 2015* might be: 
Just to clarify, you mean sparkles in the way that a unicorn sparkles as
it's hopping over a rainbow, not sparkle as in SPARQL (semantic triple store
based)?

...

 * Categories are implemented in Wikidata
 * -> They're in whatever language the user wants (so fr:Chat and en:Cat and
 nl:kat and zh-han-t:貓 …)
Issue (probably can be dealt with somehow, or maybe rare enough not to
care): conflicts. What if the name of one category in French is the same as
the name of a different category in Spanish? May be a non-issue if done
using Wikidata numeric ids.

...
  * -> They're properly queryable 
Various groups have various definitions of this.

...
 * -> They're shared between wikis (pooled expertise)
Between Wikipedias, or all Wikimedia wikis? Category structure has varied
meaning between projects. Category:North_America has different types of
pages in enwikinews compared to enwikipedia.
...

 * Pages are implicitly in the parent categories of their explicit categories
...
 * -> Pages in <Politicians from the Netherlands> are in <People from the
 Netherlands by profession> (its first parent) and <People from the
 Netherlands> (its first parent's parent) and <Politicians> (its second
 parent) and <People> (its second parent's parent) and …
 * -> Yes, this poses issues given the sometimes cyclic nature of
 categories' hierarchies, but this is relatively trivial to code around 
In the current structure, it doesn't make sense for Bob to be in a list of
people by profession. It makes less sense the further you traverse the
category graph. OTOH, better querying capabilities may turn the category
system into more of a flat namespace, making that less of an issue.
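
A minimal sketch of the implicit-parent idea, assuming the hierarchy is
available as a plain mapping from each category to its direct parents. The
PARENTS data, the function name and the max_depth knob are all hypothetical
and not anything MediaWiki or Wikidata provides. A visited set is enough to
survive cycles, and a depth limit is one way to stop before the implied
categories stop making sense:

```python
from collections import deque

# Hypothetical parent map: category -> its direct parent categories.
PARENTS = {
    "Politicians from the Netherlands": [
        "People from the Netherlands by profession",
        "Politicians",
    ],
    "People from the Netherlands by profession": ["People from the Netherlands"],
    "People from the Netherlands": ["People"],
    "Politicians": ["People"],
    "People": ["Politicians"],  # deliberate cycle, to show the walk terminates
}


def implicit_categories(explicit, parents, max_depth=None):
    """Return the explicit categories plus every ancestor reachable from them.

    The `seen` set guarantees each category is expanded at most once, so
    cycles in the hierarchy cannot cause an infinite loop. If `max_depth` is
    given, ancestors more than that many steps up are not included.
    """
    seen = set(explicit)
    queue = deque((cat, 0) for cat in explicit)
    while queue:
        cat, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue
        for parent in parents.get(cat, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return seen


everything = implicit_categories(["Politicians from the Netherlands"], PARENTS)
nearby = implicit_categories(["Politicians from the Netherlands"], PARENTS, max_depth=1)
```

Here `everything` contains all five categories despite the cycle, while
`nearby` contains only the explicit category and its two direct parents.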

...

 * Readers can search, querying across categories regardless of whether
 they're implicit or explicit
 * -> A search for the intersection of <People from the Netherlands> with
 <Politicians> will effectively return results for <Politicians from the
 Netherlands> (and the user doesn't need to know or care that this is an
 extant or non-extant category) 
We would need some system to turn fake cats into real queries. I suppose
users could make redirects. The alternative of magic NLP sounds difficult.

...
 * -> Searches might be more than just intersections, e.g. "<Painters from
 the United Kingdom> AND <Living people> NOT <Members of the Royal Academy>"
 or whatever.
 * -> Such queries might be cached (and, indeed, the intersections that
 people search for might be used to suggest new categorisation schemata that
 wikis had previously not considered - e.g. <British politicians> & <People
 with pet cats> & <People who died in hot-ballooning accidents>)
Dealing with cache invalidation (unless it is quite coarse grained) may be
difficult.
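
For reference, the AND/NOT query quoted above is just set arithmetic once
category membership is available. This is a sketch only: the MEMBERS data
and the query() helper are made up for illustration, and it ignores caching
and scale entirely.

```python
# Hypothetical membership index: category -> set of page titles.
MEMBERS = {
    "Painters from the United Kingdom": {"Alice", "Bob", "Carol"},
    "Living people": {"Alice", "Bob", "Dave"},
    "Members of the Royal Academy": {"Bob"},
}


def query(members, all_of, none_of=()):
    """Pages that are in every `all_of` category and in no `none_of` category."""
    if not all_of:
        return set()
    # set.intersection returns a new set, so the index itself is never mutated.
    result = set.intersection(*(members.get(cat, set()) for cat in all_of))
    for cat in none_of:
        result -= members.get(cat, set())
    return result


living_uk_painters_outside_ra = query(
    MEMBERS,
    all_of=["Painters from the United Kingdom", "Living people"],
    none_of=["Members of the Royal Academy"],
)
```

With the sample data only "Alice" survives. A cached result like this would
go stale whenever any of the three categories gains or loses a page, which is
where the invalidation problem comes from.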
...

 * Editors can tag articles with leaf or branch categories, potentially
 over-lapping and the system will rationalise the categories on save to the
 minimally-spanning subset (or whatever is most useful for users, the
 database, and/or both) 
That's quite an interesting idea, and one I haven't heard before from
previous times this has been brought up.

One concern I'd have is how to figure out which categories to list at the
bottom of the page (all that could fit, or only the base categories, and
how to determine what that is).
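
One possible reading of "minimally-spanning subset", sketched under the
assumption that the hierarchy is acyclic and given as a category -> direct
parents mapping: drop every tagged category that is already implied by a
more specific tagged one. The data and function names are hypothetical.

```python
# Hypothetical, acyclic parent map: category -> its direct parent categories.
PARENTS = {
    "Politicians from the Netherlands": [
        "People from the Netherlands by profession",
        "Politicians",
    ],
    "People from the Netherlands by profession": ["People from the Netherlands"],
    "People from the Netherlands": ["People"],
    "Politicians": ["People"],
}


def ancestors(cat, parents):
    """All categories reachable by following parent links upward from `cat`."""
    seen = set()
    stack = [cat]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def minimal_categories(tagged, parents):
    """Keep only tagged categories that are not ancestors of another tagged one.

    Assumes an acyclic hierarchy: two tagged categories sitting on the same
    cycle would each count as the other's ancestor and both be dropped.
    """
    tagged = set(tagged)
    implied = set()
    for cat in tagged:
        implied |= ancestors(cat, parents) - {cat}
    return tagged - implied


saved = minimal_categories(
    ["Politicians from the Netherlands", "Politicians", "People", "Living people"],
    PARENTS,
)
```

With this data `saved` keeps "Politicians from the Netherlands" and "Living
people". The result is what would be stored; which categories to display at
the bottom of the page is still a separate decision.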

...
 * -> Editors don't need to know the hierarchy of categories *a priori* when
 adding pages to them (yay, less difficulty)
 * -> Power editors don't need to type in loads of different categories if
 they have a very specific one in mind (yay, still flexible)
 * -> Categories shown to readers aren't necessarily the categories saved in
 the database, at editorial judgement (otherwise, would a page not be in
 just a single category, namely the intersection of all its tagged
 categories?)

 Apart from the time and resources needed to make this happen and
 operational, does this sound like something we'd want to do? It feels like
 this, or something like it, would serve our editors and readers the best
 from their perspective, if not our sysadmins. :-)

 [Snip]

 > I think the best place to pursue this topic is probably in
 > https://meta.wikimedia.org/wiki/Talk:Beyond_categories .  It's unlikely
 > Wikimedia Foundation will be able to make engineers available to work on
 > this anytime soon, but I would not be surprised if the Wikidata
 > developer community or volunteers found this interesting enough to work on.
...

 I guess I should post this there too, maybe once someone's told me if it's
 mad-cap. ;-)

I think you have captured what a lot of people want in a somewhat dreamy
sense. However, there is still a lot to do to make that vision concrete. In
particular, I think there would be non-trivial UI challenges in making this
understandable to the user.

----

...
From what I hear, Wikidata phase 3 is going to basically be support for
inline queries. Details are vague, but if they support the typical types of
queries you associate with semantic networks, there is category
intersection right there.
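
To illustrate why semantic-network style queries would cover category
intersection, here is a toy in-memory sketch. It is not how Wikidata stores
or queries anything; the triples, the property names "occupation" and
"nationality", and the helper are all invented for the example.

```python
# Hypothetical statements as (subject, predicate, object) triples.
TRIPLES = {
    ("Alice", "occupation", "novelist"),
    ("Alice", "nationality", "Singaporean"),
    ("Bob", "occupation", "novelist"),
    ("Bob", "nationality", "Dutch"),
    ("Carol", "occupation", "painter"),
    ("Carol", "nationality", "Singaporean"),
}


def subjects_matching(triples, conditions):
    """Subjects that have a matching triple for every (predicate, object) pair."""
    result = None
    for predicate, obj in conditions:
        matches = {s for (s, p, o) in triples if p == predicate and o == obj}
        result = matches if result is None else result & matches
    return result or set()


singaporean_novelists = subjects_matching(
    TRIPLES,
    [("occupation", "novelist"), ("nationality", "Singaporean")],
)
```

Each condition plays the role of a tag, and requiring all of them to hold is
the AND. With the sample data only "Alice" matches. The linear scan per
condition is obviously not how a real store would do it, which is the
performance question below.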

If any of the Wikidata folk could comment on what sort of queries are
planned for phase 3, performance/scaling considerations, and technologies
being considered (triple store?), I'd be very interested in hearing. (I
recognize that future plans may not exist yet.)

More generally, it would be interesting to know the performance
characteristics of SPARQL-type query systems, since people seem to be
talking about them. Are they a non-starter, or could they be feasible?
Semantic and efficient are not words I associate with each other, but that
is due to rumour, not actual data. (Although my brief googling doesn't
exactly look promising.)

-bawolff
