On 2013-05-08 11:48 PM, "James Forrester" <jforrester(a)wikimedia.org>
wrote:
On 8 May 2013 18:26, Sumana Harihareswara <sumanah(a)wikimedia.org> wrote:
Recently a lot of people have been talking about what's possible and
what's necessary regarding MediaWiki, CatScan-like tools, and real
category intersection; this mail has some pointers.
The long-term solution is a sparkly query for, e.g., people with aspects
novelist + Singaporean, and it would be great if Wikidata could be the
data-source. Generally people don't really want to search using
hierarchical categories; they want tags and they want AND. But
MediaWiki's current power users do use hierarchical labels, so any
change would have to deal with current users' expectations. Also my
head hurts just thinking of the "but my intuitively obvious ontology is
better than yours" arguments.
To put a nice clear stake in the ground, a magic-world-of-loveliness
sparkly proposal for 2015* might be:
Just to clarify, you mean sparkles in the way that a unicorn sparkles as
it's hopping over a rainbow, not sparkle as in SPARQL (semantic triple store
based)?
* Categories are implemented in Wikidata
* -> They're in whatever language the user wants (so fr:Chat and en:Cat and
nl:kat and zh-han-t:貓 …)
Issue (can probably be dealt with somehow, or may be rare enough not to
care about): conflicts. What if the name of one category in French is the
same as that of a different category in Spanish? This may be a non-issue if
done using Wikidata numeric IDs.
* -> They're properly queryable
Various groups have various definitions of this.
* -> They're shared between wikis (pooled expertise)
Between Wikipedias, or all Wikimedia wikis? Category structure has varied
meaning between projects. Category:North_America has different types of
pages in enwikinews compared to enwikipedia.
* Pages are implicitly in the parent categories of their explicit
categories
* -> Pages in <Politicians from the Netherlands> are in <People from the
Netherlands by profession> (its first parent) and <People from the
Netherlands> (its first parent's parent) and <Politicians> (its second
parent) and <People> (its second parent's parent) and …
* -> Yes, this poses issues given the sometimes cyclic nature of
categories' hierarchies, but this is relatively trivial to code around
In the current structure, it doesn't make sense for Bob to be in a list of
people by profession. It makes less sense the further you traverse the
category graph. OTOH, better querying capabilities may turn the category
system into more of a flat namespace, making that less of an issue.
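
A minimal sketch of the implicit-membership idea, to make the cycle and
"how far up do you go" points concrete. The parent map is copied from the
example in the proposal, with one deliberate cycle added; the function name
and the max_depth knob are my own assumptions, not anything MediaWiki or
Wikidata provides.

```python
from collections import deque

# Hypothetical parent map: category -> direct parent categories.
PARENTS = {
    "Politicians from the Netherlands": [
        "People from the Netherlands by profession",
        "Politicians",
    ],
    "People from the Netherlands by profession": ["People from the Netherlands"],
    "People from the Netherlands": ["People"],
    "Politicians": ["People"],
    # Deliberate cycle, to show that the traversal still terminates.
    "People": ["Politicians"],
}


def implicit_categories(explicit, parents=PARENTS, max_depth=None):
    """Return the explicit categories plus every ancestor reachable from them.

    max_depth (an assumption of this sketch) limits how many levels up
    the graph we climb; None means no limit.
    """
    seen = set(explicit)
    queue = deque((cat, 0) for cat in explicit)
    while queue:
        cat, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue
        for parent in parents.get(cat, []):
            if parent not in seen:  # the visited set is what breaks cycles
                seen.add(parent)
                queue.append((parent, depth + 1))
    return seen
```

Calling implicit_categories(["Politicians from the Netherlands"]) walks the
whole graph once; passing max_depth=1 stops at the direct parents, which is
one crude way to avoid the "Bob is in people by profession" problem.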
* Readers can search, querying across categories regardless of whether
they're implicit or explicit
* -> A search for the intersection of <People from the Netherlands> with
<Politicians> will effectively return results for <Politicians from the
Netherlands> (and the user doesn't need to know or care that this is an
extant or non-extant category)
We would need some system to turn fake cats into real queries. I suppose
users could make redirects. The alternative of magic NLP sounds difficult.
* -> Searches might be more than just intersections, e.g. "<Painters from
the United Kingdom> AND <Living people> NOT <Members of the Royal
Academy>" or whatever.
* -> Such queries might be cached (and, indeed, the intersections that
people search for might be used to suggest new categorisation schemata that
wikis had previously not considered - e.g. <British politicians> & <People
with pet cats> & <People who died in hot-ballooning accidents>)
Dealing with cache invalidation (unless it is quite coarse grained) may be
difficult.
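
A rough sketch of what an AND/NOT search with the coarsest possible cache
invalidation might look like. The page names and membership sets are made
up for illustration, and query, cached_query and add_page are hypothetical
helpers, not existing APIs.

```python
# Hypothetical membership data: category -> set of page titles.
MEMBERS = {
    "Painters from the United Kingdom": {"Alice", "Bob", "Carol"},
    "Living people": {"Alice", "Bob", "Dave"},
    "Members of the Royal Academy": {"Bob"},
}

_cache = {}


def query(include, exclude=(), members=MEMBERS):
    """Pages in every `include` category and in none of the `exclude` ones."""
    if not include:
        return set()
    result = set.intersection(*(members.get(c, set()) for c in include))
    for cat in exclude:
        result -= members.get(cat, set())
    return result


def cached_query(include, exclude=()):
    """Same as query(), but remembers results keyed by the category sets."""
    key = (frozenset(include), frozenset(exclude))
    if key not in _cache:
        _cache[key] = query(include, exclude)
    return _cache[key]


def add_page(category, page):
    """Record a categorisation edit and invalidate the cache."""
    MEMBERS.setdefault(category, set()).add(page)
    _cache.clear()  # coarse-grained: any edit drops every cached result
```

Clearing everything on every edit is trivially correct but throws away most
of the benefit on a busy wiki; anything finer-grained has to know which
cached queries a given edit can affect, which is the hard part.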
* Editors can tag articles with leaf or branch categories, potentially
over-lapping and the system will rationalise the categories on save to the
minimally-spanning subset (or whatever is most useful for users, the
database, and/or both)
That's quite an interesting idea, and one I haven't heard before from
previous times this has been brought up.
One concern I'd have is how to figure out which categories to list at the
bottom of the page (all that could fit, or only the base categories, and
how to determine what that is).
* -> Editors don't need to know the hierarchy of categories *a priori* when
adding pages to them (yay, less difficulty)
* -> Power editors don't need to type in loads of different categories if
they have a very specific one in mind (yay, still flexible)
* -> Categories shown to readers aren't necessarily the categories saved in
the database, at editorial judgement (otherwise, would a page not be in
just a single category, namely the intersection of all its tagged
categories?)
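
One way to read "rationalise on save to the minimally-spanning subset" is:
drop any tag that is already implied by another tag. A sketch under that
assumption follows. The parent map is hypothetical and assumed to be
acyclic here; with cycles, two tags that imply each other would both be
dropped, so a real version would need a tie-break.

```python
# Hypothetical, acyclic parent map: category -> direct parent categories.
PARENTS = {
    "Politicians from the Netherlands": [
        "People from the Netherlands by profession",
        "Politicians",
    ],
    "People from the Netherlands by profession": ["People from the Netherlands"],
    "People from the Netherlands": ["People"],
    "Politicians": ["People"],
}


def ancestors(cat, parents=PARENTS):
    """All categories strictly above `cat` in the hierarchy."""
    seen, stack = set(), [cat]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def rationalise(tagged, parents=PARENTS):
    """Keep only the tags that are not an ancestor of some other tag."""
    tagged = set(tagged)
    implied = set()
    for cat in tagged:
        implied |= ancestors(cat, parents) - {cat}
    return tagged - implied
```

For example, tagging a page with <Politicians from the Netherlands>,
<Politicians>, <People> and <Living people> would store only the first and
the last; which of the implied ones to show at the bottom of the page is
then a separate display decision.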
Apart from the time and resources needed to make this happen and
operational, does this sound like something we'd want to do? It feels like
this, or something like it, would serve our editors and readers the best
from their perspective, if not our sysadmins. :-)
[Snip]
> I think the best place to pursue this topic is probably in
> https://meta.wikimedia.org/wiki/Talk:Beyond_categories . It's unlikely
> Wikimedia Foundation will be able to make engineers available to work on
> this anytime soon, but I would not be surprised if the Wikidata
> developer community or volunteers found this interesting enough to work
> on.
I guess I should post this there too, maybe once someone's told me if it's
mad-cap. ;-)
I think you have captured what a lot of people want in a somewhat dreamy
sense. However, there is still a lot to do to make that vision concrete. In
particular, I think there would be non-trivial UI challenges in making this
understandable to the user.
----
From what I hear, Wikidata phase 3 is basically going to be support for
inline queries. Details are vague, but if they support the typical types of
queries you associate with semantic networks, there is category
intersection right there.
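
To illustrate why a semantic-network style query gives you intersection for
free, here is a toy sketch. The triples, property names and the
subjects_matching helper are all invented for illustration; this is not how
Wikidata stores or queries data, and the linear scan says nothing about how
a real store would perform.

```python
# Hypothetical (subject, property, value) triples; the names are
# illustrative and are not real Wikidata identifiers.
TRIPLES = {
    ("Author A", "occupation", "novelist"),
    ("Author A", "nationality", "Singaporean"),
    ("Author B", "occupation", "novelist"),
    ("Author B", "nationality", "British"),
    ("Person C", "occupation", "politician"),
    ("Person C", "nationality", "Singaporean"),
}


def subjects_matching(patterns, triples=TRIPLES):
    """Subjects that have every (property, value) pair in `patterns`."""
    result = None
    for prop, value in patterns:
        matches = {s for s, p, v in triples if p == prop and v == value}
        # Each extra pattern narrows the result: this is the intersection.
        result = matches if result is None else result & matches
    return result or set()
```

Asking for ("occupation", "novelist") together with ("nationality",
"Singaporean") is the "novelist + Singaporean" query from the start of this
thread, without any <Singaporean novelists> category having to exist.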
If any of the Wikidata folk could comment on what sort of queries are
planned for phase 3, performance/scaling considerations, and technologies
being considered (triple store?), I'd be very interested in hearing. (I
recognize that future plans may not exist yet.)
More generally, it would be interesting to know the performance
characteristics of SPARQL-type query systems, since people seem to be
talking about them. Are they a non-starter, or could they be feasible?
Semantic and efficient are not words I associate with each other, but that
is due to rumour, not actual data. (Although my brief googling doesn't
exactly look promising.)
-bawolff