On 2013-05-08 11:48 PM, "James Forrester" jforrester@wikimedia.org wrote:
On 8 May 2013 18:26, Sumana Harihareswara sumanah@wikimedia.org wrote:
Recently a lot of people have been talking about what's possible and what's necessary regarding MediaWiki, CatScan-like tools, and real category intersection; this mail has some pointers.
The long-term solution is a sparkly query for, e.g., people with aspects novelist + Singaporean, and it would be great if Wikidata could be the data-source. Generally people don't really want to search using hierarchical categories; they want tags and they want AND. But MediaWiki's current power users do use hierarchical labels, so any change would have to deal with current users' expectations. Also my head hurts just thinking of the "but my intuitively obvious ontology is better than yours" arguments.
To put a nice clear stake in the ground, a magic-world-of-loveliness sparkly proposal for 2015* might be:
Just to clarify, you mean sparkles in the way that a unicorn sparkles as it's hopping over a rainbow, not sparkle as in SPARQL (semantic triple store based)?
- Categories are implemented in Wikidata
- -> They're in whatever language the user wants (so fr:Chat and en:Cat and nl:kat and zh-han-t:貓 …)
Issue (can probably be dealt with somehow, or may be rare enough not to care about): conflicts. What if the name of one category in French is the same as the name of a different category in Spanish? May be a non-issue if done using Wikidata numeric IDs.
- -> They're properly queryable
Various groups have various definitions of this.
- -> They're shared between wikis (pooled expertise)
Between Wikipedias, or all Wikimedia wikis? Category structure has varied meaning between projects. Category:North_America has different types of pages in enwikinews compared to enwikipedia.
- Pages are implicitly in the parent categories of their explicit categories
- -> Pages in <Politicians from the Netherlands> are in <People from the Netherlands by profession> (its first parent) and <People from the Netherlands> (its first parent's parent) and <Politicians> (its second parent) and <People> (its second parent's parent) and …
- -> Yes, this poses issues given the sometimes cyclic nature of categories' hierarchies, but this is relatively trivial to code around
In the current structure, it doesn't make sense for Bob to be in a list of people by profession. It makes less sense the further you traverse the category graph. OTOH, better querying capabilities may turn the category system into more of a flat namespace, making that less of an issue.
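To make the "implicit parents" point concrete, here is a minimal sketch of the expansion, with a visited set so cycles are harmless and an optional depth cap for the "makes less sense the further you traverse" concern. The `PARENTS` map, the `Humans` category, and the function name are made up for illustration; nothing here reflects how MediaWiki or Wikidata actually stores categories.

```python
from collections import deque

# Hypothetical parent map: category -> direct parent categories.
# "People" and "Humans" deliberately form a cycle.
PARENTS = {
    "Politicians from the Netherlands": [
        "People from the Netherlands by profession",
        "Politicians",
    ],
    "People from the Netherlands by profession": ["People from the Netherlands"],
    "Politicians": ["People"],
    "People": ["Humans"],
    "Humans": ["People"],
}


def implicit_categories(explicit, parents, max_depth=None):
    """Return the explicit categories plus every ancestor reachable from them.

    max_depth (an illustrative knob) limits how many parent links are followed;
    None means follow all of them.
    """
    seen = set(explicit)
    queue = deque((category, 0) for category in explicit)
    while queue:
        current, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not climb any further from this category
        for parent in parents.get(current, []):
            if parent not in seen:  # the visited set is what defuses cycles
                seen.add(parent)
                queue.append((parent, depth + 1))
    return seen


all_levels = implicit_categories(["Politicians from the Netherlands"], PARENTS)
direct_only = implicit_categories(
    ["Politicians from the Netherlands"], PARENTS, max_depth=1
)
```

With the depth cap set to 1, a page only picks up its direct parents, which is one possible way to keep Bob out of very distant categories.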
- Readers can search, querying across categories regardless of whether they're implicit or explicit
- -> A search for the intersection of <People from the Netherlands> with <Politicians> will effectively return results for <Politicians from the Netherlands> (and the user doesn't need to know or care that this is an extant or non-extant category)
We would need some system to turn fake categories into real queries. I suppose users could make redirects. The alternative of magic NLP sounds difficult.
- -> Searches might be more than just intersections, e.g. "<Painters from the United Kingdom> AND <Living people> NOT <Members of the Royal Academy>" or whatever.
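As a rough illustration of the AND/NOT example, such a query reduces to set intersection and difference once each category's (implicit plus explicit) membership is known. The `MEMBERS` data, the page names, and the `query` function are invented for this sketch; a real system would run this against an index rather than in-memory sets.

```python
# Hypothetical data: category -> set of page titles, after implicit expansion.
MEMBERS = {
    "Painters from the United Kingdom": {"Alice", "Bob", "Carol"},
    "Living people": {"Alice", "Bob", "Dave"},
    "Members of the Royal Academy": {"Bob"},
}


def query(members, all_of=(), none_of=()):
    """Pages that are in every category of all_of and in no category of none_of."""
    if not all_of:
        return set()  # a pure NOT query is left undefined in this sketch
    result = set(members.get(all_of[0], set()))
    for category in all_of[1:]:
        result &= members.get(category, set())  # AND
    for category in none_of:
        result -= members.get(category, set())  # NOT
    return result


hits = query(
    MEMBERS,
    all_of=["Painters from the United Kingdom", "Living people"],
    none_of=["Members of the Royal Academy"],
)
```

Here `hits` contains only "Alice": Bob is excluded by the NOT clause and Carol is not in <Living people>.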
- -> Such queries might be cached (and, indeed, the intersections that people search for might be used to suggest new categorisation schemata that wikis had previously not considered - e.g. <British politicians> & <People with pet cats> & <People who died in hot-ballooning accidents>)
Dealing with cache invalidation (unless it is quite coarse-grained) may be difficult.
- Editors can tag articles with leaf or branch categories, potentially overlapping, and the system will rationalise the categories on save to the minimally-spanning subset (or whatever is most useful for users, the database, and/or both)
That's quite an interesting idea, and one I haven't heard before from previous times this has been brought up.
One concern I'd have is how to figure out which categories to list at the bottom of the page (all that could fit, or only the base categories, and how to determine what that is).
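One possible reading of "rationalise on save to the minimally-spanning subset" is: drop any tag that is already implied by a more specific tag on the same page. The sketch below assumes that reading; the `PARENTS` map and both function names are hypothetical, and it says nothing about which categories should then be displayed to readers.

```python
# Hypothetical parent map: category -> direct parent categories.
PARENTS = {
    "Politicians from the Netherlands": [
        "People from the Netherlands by profession",
        "Politicians",
    ],
    "People from the Netherlands by profession": ["People from the Netherlands"],
    "Politicians": ["People"],
    "People": ["Humans"],
    "Humans": ["People"],  # deliberate cycle
}


def ancestors(category, parents):
    """Every category reachable from `category` by following parent links."""
    seen = set()
    stack = [category]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def rationalise(tags, parents):
    """Keep only tags that are not implied by a strictly more specific tag."""
    tags = set(tags)
    above = {tag: ancestors(tag, parents) for tag in tags}
    kept = set()
    for tag in tags:
        # `other` is strictly below `tag` if tag is one of its ancestors but not
        # the reverse; the second check keeps both members of a cycle.
        implied = any(
            tag in above[other] and other not in above[tag]
            for other in tags
            if other != tag
        )
        if not implied:
            kept.add(tag)
    return kept


saved = rationalise(
    {"Politicians from the Netherlands", "Politicians", "People", "Living people"},
    PARENTS,
)
```

In this example `saved` keeps <Politicians from the Netherlands> and <Living people>, since <Politicians> and <People> are both implied by the more specific tag.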
- -> Editors don't need to know the hierarchy of categories *a priori* when adding pages to them (yay, less difficulty)
- -> Power editors don't need to type in loads of different categories if they have a very specific one in mind (yay, still flexible)
- -> Categories shown to readers aren't necessarily the categories saved in the database, at editorial judgement (otherwise, would a page not be in just a single category, namely the intersection of all its tagged categories?)
Apart from the time and resources needed to make this happen and operational, does this sound like something we'd want to do? It feels like this, or something like it, would serve our editors and readers the best from their perspective, if not our sysadmins. :-)
[Snip]
I think the best place to pursue this topic is probably in https://meta.wikimedia.org/wiki/Talk:Beyond_categories . It's unlikely Wikimedia Foundation will be able to make engineers available to work on this anytime soon, but I would not be surprised if the Wikidata developer community or volunteers found this interesting enough to work on.
I guess I should post this there too, maybe once someone's told me if it's mad-cap. ;-)
I think you have captured what a lot of people want in a somewhat dreamy sense. However, there is still a lot to do to make that vision concrete. In particular, I think there would be non-trivial UI challenges in making this understandable to the user.
----
From what I hear, Wikidata phase 3 is basically going to be support for inline queries. Details are vague, but if they support the typical types of queries you associate with semantic networks, there is category intersection right there.
If any of the Wikidata folk could comment on what sort of queries are planned for phase 3, performance/scaling considerations, and technologies being considered (triple store?), I'd be very interested in hearing. (I recognize that future plans may not exist yet.)
More generally, it would be interesting to know the performance characteristics of SPARQL-type query systems, since people seem to be talking about them. Are they a non-starter, or could they be feasible? Semantic and efficient are not words I associate with each other, but that is due to rumour, not actual data. (Although my brief googling doesn't exactly look promising.)
-bawolff