category intersection conversations


Sumana Harihareswara

8 May 2013

6:26 p.m.

Recently a lot of people have been talking about what's possible and what's necessary regarding MediaWiki, CatScan-like tools, and real category intersection; this mail has some pointers.

The long-term solution is a sparkly query for, e.g., people with aspects novelist + Singaporean, and it would be great if Wikidata could be the data-source. Generally people don't really want to search using hierarchical categories; they want tags and they want AND. But MediaWiki's current power users do use hierarchical labels, so any change would have to deal with current users' expectations. Also my head hurts just thinking of the "but my intuitively obvious ontology is better than yours" arguments.

Conversations have been a bit scattered:

https://meta.wikimedia.org/wiki/Beyond_categories

http://lists.wikimedia.org/pipermail/wikidata-l/2013-May/thread.html#2202 ("Question about wikipedia categories.")

https://en.wikipedia.org/wiki/Wikipedia_talk:Category_intersection#A_working...

CatScan, which can find articles in category intersections: https://en.wikipedia.org/wiki/Wikipedia:CatScan

http://lists.wikimedia.org/pipermail/gendergap/2013-April/003552.html

https://bugzilla.wikimedia.org/show_bug.cgi?id=5244 "Allow searching in intersections, etc. of categories"

I think the best place to pursue this topic is probably in https://meta.wikimedia.org/wiki/Talk:Beyond_categories . It's unlikely Wikimedia Foundation will be able to make engineers available to work on this anytime soon, but I would not be surprised if the Wikidata developer community or volunteers found this interesting enough to work on.

-- Sumana Harihareswara Engineering Community Manager Wikimedia Foundation


James Forrester

8 May 2013

7:47 p.m.

On 8 May 2013 18:26, Sumana Harihareswara sumanah@wikimedia.org wrote:

...

Recently a lot of people have been talking about what's possible and what's necessary regarding MediaWiki, CatScan-like tools, and real category intersection; this mail has some pointers.

The long-term solution is a sparkly query for, e.g., people with aspects novelist + Singaporean, and it would be great if Wikidata could be the data-source. Generally people don't really want to search using hierarchical categories; they want tags and they want AND. But MediaWiki's current power users do use hierarchical labels, so any change would have to deal with current users' expectations. Also my head hurts just thinking of the "but my intuitively obvious ontology is better than yours" arguments.

To put a nice clear stake in the ground, a magic-world-of-loveliness sparkly proposal for 2015* might be:

* Categories are implemented in Wikidata
  -> They're in whatever language the user wants (so fr:Chat and en:Cat and nl:kat and zh-han-t:貓 …)
  -> They're properly queryable
  -> They're shared between wikis (pooled expertise)

* Pages are implicitly in the parent categories of their explicit categories
  -> Pages in <Politicians from the Netherlands> are in <People from the Netherlands by profession> (its first parent) and <People from the Netherlands> (its first parent's parent) and <Politicians> (its second parent) and <People> (its second parent's parent) and …
  -> Yes, this poses issues given the sometimes cyclic nature of categories' hierarchies, but this is relatively trivial to code around

* Readers can search, querying across categories regardless of whether they're implicit or explicit
  -> A search for the intersection of <People from the Netherlands> with <Politicians> will effectively return results for <Politicians from the Netherlands> (and the user doesn't need to know or care that this is an extant or non-extant category)
  -> Searches might be more than just intersections, e.g. "<Painters from the United Kingdom> AND <Living people> NOT <Members of the Royal Academy>" or whatever.
  -> Such queries might be cached (and, indeed, the intersections that people search for might be used to suggest new categorisation schemata that wikis had previously not considered - e.g. <British politicians> & <People with pet cats> & <People who died in hot-ballooning accidents>)

* Editors can tag articles with leaf or branch categories, potentially over-lapping, and the system will rationalise the categories on save to the minimally-spanning subset (or whatever is most useful for users, the database, and/or both)
  -> Editors don't need to know the hierarchy of categories *a priori* when adding pages to them (yay, less difficulty)
  -> Power editors don't need to type in loads of different categories if they have a very specific one in mind (yay, still flexible)
  -> Categories shown to readers aren't necessarily the categories saved in the database, at editorial judgement (otherwise, would a page not be in just a single category, namely the intersection of all its tagged categories?)
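
As a rough illustration of the second and third points in the proposal above, here is a minimal Python sketch of implicit membership with a guard against cycles, plus an AND/NOT search over it. Everything in it is an assumption made for illustration: the `PARENTS` and `PAGES` tables, the function names, and the invented People/Humans loop are hypothetical and say nothing about how MediaWiki or Wikidata would store or query categories.

```python
from collections import deque

# Hypothetical parent links, loosely following the example in the proposal.
# The People <-> Humans loop is an invented cycle to show that the guard works.
PARENTS = {
    "Politicians from the Netherlands": [
        "People from the Netherlands by profession",
        "Politicians",
    ],
    "People from the Netherlands by profession": ["People from the Netherlands"],
    "People from the Netherlands": ["People"],
    "Politicians": ["People"],
    "People": ["Humans"],
    "Humans": ["People"],
}

# Hypothetical pages and their explicit categories.
PAGES = {
    "Page A": ["Politicians from the Netherlands"],
    "Page B": ["People from the Netherlands"],
    "Page C": ["Politicians"],
}


def implicit_categories(explicit):
    """Return the explicit categories plus every ancestor, visiting each once."""
    seen = set(explicit)
    queue = deque(explicit)
    while queue:
        for parent in PARENTS.get(queue.popleft(), []):
            if parent not in seen:  # the visited set is what breaks cycles
                seen.add(parent)
                queue.append(parent)
    return seen


def search(pages, all_of=(), none_of=()):
    """Titles whose implicit categories include all of all_of and none of none_of."""
    hits = []
    for title, explicit in pages.items():
        cats = implicit_categories(explicit)
        if set(all_of) <= cats and not (set(none_of) & cats):
            hits.append(title)
    return sorted(hits)
```

With these tables, `search(PAGES, all_of=["People from the Netherlands", "Politicians"])` finds only the page that was explicitly placed in the combined category. The sketch recomputes ancestors on every search, which would not scale; it is only meant to show that the cycle problem reduces to keeping a visited set.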

Apart from the time and resources needed to make this happen and operational, does this sound like something we'd want to do? It feels like this, or something like it, would serve our editors and readers the best from their perspective, if not our sysadmins. :-)

[Snip]

...

I think the best place to pursue this topic is probably in https://meta.wikimedia.org/wiki/Talk:Beyond_categories . It's unlikely Wikimedia Foundation will be able to make engineers available to work on this anytime soon, but I would not be surprised if the Wikidata developer community or volunteers found this interesting enough to work on.

I guess I should post this there too, maybe once someone's told me if it's mad-cap. ;-)

-- James D. Forrester Product Manager, VisualEditor Wikimedia Foundation, Inc. jforrester@wikimedia.org | @jdforrester

Brian Wolff

9:13 p.m.

On 2013-05-08 11:48 PM, "James Forrester" jforrester@wikimedia.org wrote:

...

On 8 May 2013 18:26, Sumana Harihareswara sumanah@wikimedia.org wrote:

...
Recently a lot of people have been talking about what's possible and what's necessary regarding MediaWiki, CatScan-like tools, and real category intersection; this mail has some pointers.

The long-term solution is a sparkly query for, e.g., people with aspects novelist + Singaporean, and it would be great if Wikidata could be the data-source. Generally people don't really want to search using hierarchical categories; they want tags and they want AND. But MediaWiki's current power users do use hierarchical labels, so any change would have to deal with current users' expectations. Also my head hurts just thinking of the "but my intuitively obvious ontology is better than yours" arguments.

To put a nice clear stake in the ground, a magic-world-of-loveliness sparkly proposal for 2015* might be:

Just to clarify, you mean sparkles in the way that a unicorn sparkles as it's hopping over a rainbow, not sparkle as in SPARQL (semantic triple store based)?

...

Categories are implemented in Wikidata

-> They're in whatever language the user wants (so fr:Chat and en:Cat and nl:kat and zh-han-t:貓 …)

Issue (probably can be dealt with somehow, or maybe rare enough not to care): conflicts. What if the name of one category in French is the same as that of a different category in Spanish? This may be a non-issue if done using Wikidata numeric IDs.

...

-> They're properly queryable

Various groups have various definitions of this.

...

-> They're shared between wikis (pooled expertise)

Between Wikipedias, or all Wikimedia wikis? Category structure has varied meaning between projects. Category:North_America has different types of pages in enwikinews compared to enwikipedia.

...

Pages are implicitly in the parent categories of their explicit categories

-> Pages in <Politicians from the Netherlands> are in <People from the Netherlands by profession> (its first parent) and <People from the Netherlands> (its first parent's parent) and <Politicians> (its second parent) and <People> (its second parent's parent) and …

-> Yes, this poses issues given the sometimes cyclic nature of categories' hierarchies, but this is relatively trivial to code around

In the current structure, it doesn't make sense for Bob to be in a list of people by profession. It makes less sense the further you traverse the category graph. OTOH, better querying capabilities may turn the category system into more of a flat namespace, making that less of an issue.

...

Readers can search, querying across categories regardless of whether they're implicit or explicit

-> A search for the intersection of <People from the Netherlands> with <Politicians> will effectively return results for <Politicians from the Netherlands> (and the user doesn't need to know or care that this is an extant or non-extant category)

We would need some system to turn fake cats into real queries. I suppose users could make redirects. The alternative of magic NLP sounds difficult.

...

-> Searches might be more than just intersections, e.g. "<Painters from the United Kingdom> AND <Living people> NOT <Members of the Royal Academy>" or whatever.

-> Such queries might be cached (and, indeed, the intersections that people search for might be used to suggest new categorisation schemata that wikis had previously not considered - e.g. <British politicians> & <People with pet cats> & <People who died in hot-ballooning accidents>)

Dealing with cache invalidation (unless it is quite coarse grained) may be difficult.
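
To make "coarse grained" concrete, here is one possible shape for it, sketched in Python under loud assumptions: the `QueryCache` class, its method names, and the idea of keying a query by its set of category names are all hypothetical. The sketch also ignores implicit membership, where a change to a child category would have to invalidate queries on its ancestors too.

```python
class QueryCache:
    """Caches query results and invalidates them per category (coarse-grained)."""

    def __init__(self, run_query):
        self._run_query = run_query  # callable: frozenset of category names -> result
        self._results = {}           # query -> cached result
        self._by_category = {}       # category -> set of queries that mention it

    def get(self, categories):
        """Return the cached result for this set of categories, computing it if needed."""
        query = frozenset(categories)
        if query not in self._results:
            self._results[query] = self._run_query(query)
            for category in query:
                self._by_category.setdefault(category, set()).add(query)
        return self._results[query]

    def category_changed(self, category):
        """Drop every cached query that mentions the category, stale or not."""
        for query in self._by_category.pop(category, set()):
            # The query may still be listed under its other categories;
            # pop with a default makes a later second removal harmless.
            self._results.pop(query, None)
```

The trade-off is the one hinted at above: any edit to a popular category throws away every cached query that mentions it, whether or not the result actually changed.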

...

Editors can tag articles with leaf or branch categories, potentially over-lapping and the system will rationalise the categories on save to the minimally-spanning subset (or whatever is most useful for users, the database, and/or both)

That's quite an interesting idea, and one I haven't heard before from previous times this has been brought up.

One concern I'd have is how to figure out which categories to list at the bottom of the page (all that could fit, or only the base categories, and how to determine what that is).
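
One possible reading of "minimally-spanning subset" is: drop every tagged category that is already implied by another tagged category. The Python sketch below takes that reading as an assumption; the `PARENTS` table and the function names are hypothetical. It does not settle which categories to display, and two tagged categories that sit on the same cycle would eliminate each other here.

```python
# Hypothetical parent links; a real hierarchy would come from the wiki.
PARENTS = {
    "Politicians from the Netherlands": ["People from the Netherlands", "Politicians"],
    "People from the Netherlands": ["People"],
    "Politicians": ["People"],
}


def ancestors(category):
    """All strict ancestors of a category, using a visited set to survive cycles."""
    seen, stack = set(), [category]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    seen.discard(category)  # on a cycle a category can reach itself
    return seen


def rationalise(tagged):
    """Keep only the tags that are not implied by some other tag."""
    tagged = set(tagged)
    implied = set()
    for category in tagged:
        implied |= ancestors(category)
    return sorted(tagged - implied)
```

For example, tagging a page with People, Politicians and Politicians from the Netherlands would save only the last of these, while tagging it with Politicians and People from the Netherlands would save both, since neither implies the other.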

...

-> Editors don't need to know the hierarchy of categories *a priori* when adding pages to them (yay, less difficulty)

-> Power editors don't need to type in loads of different categories if they have a very specific one in mind (yay, still flexible)

-> Categories shown to readers aren't necessarily the categories saved in the database, at editorial judgement (otherwise, would a page not be in just a single category, namely the intersection of all its tagged categories?)

Apart from the time and resources needed to make this happen and operational, does this sound like something we'd want to do? It feels like this, or something like it, would serve our editors and readers the best from their perspective, if not our sysadmins. :-)

[Snip]

...
I think the best place to pursue this topic is probably in https://meta.wikimedia.org/wiki/Talk:Beyond_categories . It's unlikely Wikimedia Foundation will be able to make engineers available to work on this anytime soon, but I would not be surprised if the Wikidata developer community or volunteers found this interesting enough to work on.

...

I guess I should post this there too, maybe once someone's told me if it's mad-cap. ;-)

I think you have captured what a lot of people want in a somewhat dreamy sense. However, there is still a lot to do to make that vision concrete. In particular, I think there would be non-trivial UI challenges to make this understandable to the user.

----

...

From what I hear, Wikidata phase 3 is basically going to be support for inline queries. Details are vague, but if they support the typical types of queries you associate with semantic networks, there is category intersection right there.

If any of the Wikidata folk could comment on what sort of queries are planned for phase 3, performance/scaling considerations, and technologies being considered (triple store?), I'd be very interested in hearing. (I recognize that future plans may not exist yet.)

More generally, it would be interesting to know the performance characteristics of SPARQL-type query systems, since people seem to be talking about them. Are they a non-starter or could they be feasible? Semantic and efficient are not words I associate with each other, but that is due to rumour, not actual data. (Although my brief googling doesn't exactly look promising.)

-bawolff

Matthew Flaschen

9:23 p.m.

On 05/09/2013 12:13 AM, Brian Wolff wrote:

...

Between Wikipedias, or all Wikimedia wikis? Category structure has varied meaning between projects. Category:North_America has different types of pages in enwikinews compared to enwikipedia.

Right. It would have to be grouped by top-level project (all Wikipedias, all Wikinewses, etc.). But even then you could run into inter-project conflicts. However, I think many of the *intra*-project category disagreements are actually about when intersections are appropriate, so it might not be that bad (since the trend would be towards intersections as just queries).

Matt Flaschen

Denny Vrandečić

14 May 2013

8:15 a.m.

2013/5/9 Brian Wolff bawolff@gmail.com

...

From what I hear, Wikidata phase 3 is basically going to be support for inline queries. Details are vague, but if they support the typical types of queries you associate with semantic networks, there is category intersection right there.

Right.

...

If any of the Wikidata folk could comment on what sort of queries are planned for phase 3, performance/scaling considerations, and technologies being considered (triple store?), I'd be very interested in hearing. (I recognize that future plans may not exist yet.)

For now we aim at "property value" or "property value-restriction" type queries, and their intersections, i.e.

"lives in -> California"
"born -> before 1980"
"lives in -> California AND born -> before 1980"

We are considering several different technologies, and we had a number of discussions already with a number of people.

My current gut feeling, based on the limited tests we have done so far, is that we start with our normal SQL setup for the first two types of queries, and that we extend to Solr for the third type of query (i.e. intersections).
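
The three query types can be pictured as predicates over an item's claims, with an intersection being the conjunction of simpler queries. The Python sketch below is only an illustration: the `ITEMS` table, the item ids, the property names and the helper functions are invented, and the linear scan says nothing about how an SQL or Solr backend would actually answer such queries.

```python
# Hypothetical items; ids, property names and values are made up for illustration.
ITEMS = {
    "Q_a": {"lives in": "California", "born": 1975},
    "Q_b": {"lives in": "California", "born": 1990},
    "Q_c": {"lives in": "Berlin", "born": 1960},
}


def has_value(prop, value):
    """'property value' query, e.g. lives in -> California."""
    return lambda claims: claims.get(prop) == value


def before(prop, limit):
    """'property value-restriction' query, e.g. born -> before 1980."""
    return lambda claims: prop in claims and claims[prop] < limit


def select(*conditions):
    """Ids of items satisfying every condition; several conditions form an intersection."""
    return sorted(
        item_id
        for item_id, claims in ITEMS.items()
        if all(condition(claims) for condition in conditions)
    )
```

Here `select(has_value("lives in", "California"), before("born", 1980))` corresponds to the third example query; passing a single condition gives either of the first two.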

...

More generally, it would be interesting to know the performance characteristics of SPARQL-type query systems, since people seem to be talking about them. Are they a non-starter or could they be feasible?

We are currently not aiming to provide an unrestricted SPARQL endpoint. But, I know that a few other organizations are very interested in setting that up. If it shows that it would be feasible for us, I'd be very happy if we did it too.

...

Semantic and efficient are not words I associate with each other, but that is due to rumour, not actual data. (Although my brief googling doesn't exactly look promising.)

That's indeed a rumor, but leads to another discussion :)

Cheers, Denny

...

-bawolff _______________________________________________ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l

-- Project director Wikidata Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin Tel. +49-30-219 158 26-0 | http://wikimedia.de Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V. Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.

Brad Jorsch

9 May 2013

9:28 a.m.

On Wed, May 8, 2013 at 10:47 PM, James Forrester jforrester@wikimedia.org wrote:

...

Pages are implicitly in the parent categories of their explicit categories

-> Pages in <Politicians from the Netherlands> are in <People from the Netherlands by profession> (its first parent) and <People from the Netherlands> (its first parent's parent) and <Politicians> (its second parent) and <People> (its second parent's parent) and …

-> Yes, this poses issues given the sometimes cyclic nature of categories' hierarchies, but this is relatively trivial to code around

Category cycles are the least of it. The fact that the existing category hierarchy isn't based on any sensible-for-inference ontology is a bigger problem.

Let's consider what would happen to one of my favorite examples on enwiki:

* The article for Romania is in <Black Sea countries>. Ok.
* And that category is in <Black Sea>, so Romania is in that too. Which is a little strange, but not too bad.
* And <Black Sea> is in <Seas of Russia> and <Landforms of Ukraine>. Huh? Romania doesn't belong in either of those, despite that being equivalent to your example where pages in <Politicians from the Netherlands> also end up in <People> via <Politicians>.

And it gets worse the further up you go. You would have Romania in <Liquids> a few more levels up.

For this to work, each wiki would have to redo its category hierarchy as a real ontology based on is-a relationships, rather than the current is-somehow-related-to. Or we would have to introduce some magic word or something to tell MediaWiki that <Politicians> is-a <People> is a valid inference while <Black Sea countries> is-a <Black Sea> isn't.

In other words, code-wise adding "tags" to an article is the same as categories with inference and querying. But trying to use the existing category setup as it exists on something like enwiki as "tags" for inference (or querying, to a lesser extent) seems like GIGO.
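
The "magic word" option could be pictured as a label on each parent link, with inference following only the links marked is-a. The Python sketch below assumes such labels exist; the `LINKS` table, the `"is-a"` and `"related-to"` labels, and the choice of which link gets which label are all hypothetical and would be an editorial decision on each wiki.

```python
# Hypothetical typed parent links; the labels are invented for illustration.
LINKS = {
    "Politicians from the Netherlands": [("Politicians", "is-a")],
    "Politicians": [("People", "is-a")],
    "Black Sea countries": [("Black Sea", "related-to")],
    "Black Sea": [
        ("Seas of Russia", "related-to"),
        ("Landforms of Ukraine", "related-to"),
    ],
}


def inferred(explicit):
    """Explicit categories plus the ancestors reachable through is-a links only."""
    seen = set(explicit)
    stack = list(explicit)
    while stack:
        for parent, kind in LINKS.get(stack.pop(), []):
            if kind == "is-a" and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

Under these labels a page in <Politicians from the Netherlands> is inferred to be in <Politicians> and <People>, while a page in <Black Sea countries> stays where it is. The code is the easy part; deciding the label for every existing link is the work described above.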

...

Readers can search, querying across categories regardless of whether they're implicit or explicit

-> A search for the intersection of <People from the Netherlands> with <Politicians> will effectively return results for <Politicians from the Netherlands> (and the user doesn't need to know or care that this is an extant or non-extant category)

A person who is originally from the Netherlands but moved to Germany and became a politician there would be in <People from the Netherlands> and <Politicians>, but maybe should not be in <Politicians from the Netherlands> depending on how exactly you define that category.

-- Brad Jorsch Software Engineer Wikimedia Foundation

James Forrester

9:38 a.m.

[I worry we're talking about operational details, which should be a wider discussion, rather than a technology/feasibility conversation to which this list is more suited. Perhaps moving this on-wiki would be best?]

On 9 May 2013 09:28, Brad Jorsch bjorsch@wikimedia.org wrote:

...

On Wed, May 8, 2013 at 10:47 PM, James Forrester jforrester@wikimedia.org wrote:

...

Pages are implicitly in the parent categories of their explicit categories

-> Pages in <Politicians from the Netherlands> are in <People from the Netherlands by profession> (its first parent) and <People from the Netherlands> (its first parent's parent) and <Politicians> (its second parent) and <People> (its second parent's parent) and …

-> Yes, this poses issues given the sometimes cyclic nature of categories' hierarchies, but this is relatively trivial to code around

Category cycles are the least of it. The fact that the existing category hierarchy isn't based on any sensible-for-inference ontology is a bigger problem.

Let's consider what would happen to one of my favorite examples on enwiki:

The article for Romania is in <Black Sea countries>. Ok.

And that category is in <Black Sea>, so Romania is in that too.

Which is a little strange, but not too bad.

And <Black Sea> is in <Seas of Russia> and <Landforms of Ukraine>.

Huh? Romania doesn't belong in either of those, despite that being equivalent to your example where pages in <Politicians from the Netherlands> also end up in <People> via <Politicians>.

And it gets worse the further up you go. You would have Romania in <Liquids> a few more levels up.

For this to work, each wiki would have to redo its category hierarchy as a real ontology based on is-a relationships, rather than the current is-somehow-related-to. Or we would have to introduce some magic word or something to tell MediaWiki that <Politicians> is-a <People> is a valid inference while <Black Sea countries> is-a <Black Sea> isn't.

In other words, code-wise adding "tags" to an article is the same as categories with inference and querying. But trying to use the existing category setup as it exists on something like enwiki as "tags" for inference (or querying, to a lesser extent) seems like GIGO.

Quite - the bit of my proposal where the categories would get created on Wikidata from scratch as a synthesis of the needs of the editing community. :-)

Implicitly, these would have clear semantics about the correctitude of their usage governed by something analogous to how Wikidata's community are managing the roll-out of statements on the system. In terms of tools to prevent this becoming an issue, Wikidata's nature means we could easily make sure that the domain of a category would be limited (e.g. "Fluids" maps to "substances", not "instances of substances").

...

...

Readers can search, querying across categories regardless of whether they're implicit or explicit

-> A search for the intersection of <People from the Netherlands> with <Politicians> will effectively return results for <Politicians from the Netherlands> (and the user doesn't need to know or care that this is an extant or non-extant category)

A person who is originally from the Netherlands but moved to Germany and became a politician there would be in <People from the Netherlands> and <Politicians>, but maybe should not be in <Politicians from the Netherlands> depending on how exactly you define that category.

Indeed; I deliberately chose to use <Politicians from the Netherlands> rather than <Politicians of the Netherlands> or <Politicians in the Netherlands> which are distinct categories with entirely different semantics, but you're right that semantics would need to be clear.

-- James D. Forrester Product Manager, VisualEditor Wikimedia Foundation, Inc. jforrester@wikimedia.org | @jdforrester

Luke Welling WMF

11:21 a.m.

Without deliberately making it an even longer term plan, as I think it is a great idea, another long-term solution to the same problem would be (as Flow gets Wikipedians into the idea of tagging) that categories get largely replaced by tags. That way they lose much of their absoluteness and therefore some of their controversy.

Categories are hard for Wikipedia because compromise is not possible. Consensus can be reached on a subtly different compromise version of the wording of a sentence or paragraph, but there is no compromise on categories. A category either exists or does not. A page either goes in or does not.

With tags, a biography could relatively uncontroversially be tagged as "Novelist, Woman, Best Selling, American, Blonde Haired, Enjoys Spicy Food" even if nearly everybody agrees that half the tags while true are entirely unimportant and not relevant to the subject's area of notability. Whether some tags like race and appearance should exist at all may still generate debate, but if they are only ever available modifiers and not hard categories their offense would be softened.

For some subjects, entirely uncontroversial tags could be extracted from Wikidata.

It would be a content shakeup and therefore perhaps politically difficult, but it would take a lot of the technical challenge out of joins, even permitting joins (automatically or manually) with tags translated into equivalent versions in other languages.

All possible combinations of tag derived categories would then "exist", and it would just be a matter of debate as to whether there is a justification to add a link from a page to "Biography+Novelist+Enjoys Spicy Food" or if that is a meaningless category. If reverted, the one person interested in that exact category could still always visit it, it's just that other users would not be directed to it unless they probe talk page debates.

Luke Welling

On Thu, May 9, 2013 at 12:38 PM, James Forrester jforrester@wikimedia.org wrote:

...

[I worry we're talking about operational details, which should be a wider discussion, rather than a technology/feasibility conversation to which this list is more suited. Perhaps moving this on-wiki would be best?]

On 9 May 2013 09:28, Brad Jorsch bjorsch@wikimedia.org wrote:

...
On Wed, May 8, 2013 at 10:47 PM, James Forrester jforrester@wikimedia.org wrote:

...

Pages are implicitly in the parent categories of their explicit categories

-> Pages in <Politicians from the Netherlands> are in <People from the Netherlands by profession> (its first parent) and <People from the Netherlands> (its first parent's parent) and <Politicians> (its second parent) and <People> (its second parent's parent) and …

-> Yes, this poses issues given the sometimes cyclic nature of categories' hierarchies, but this is relatively trivial to code around

Category cycles are the least of it. The fact that the existing category hierarchy isn't based on any sensible-for-inference ontology is a bigger problem.

Let's consider what would happen to one of my favorite examples on enwiki:

...

The article for Romania is in <Black Sea countries>. Ok.

And that category is in <Black Sea>, so Romania is in that too.

Which is a little strange, but not too bad.

And <Black Sea> is in <Seas of Russia> and <Landforms of Ukraine>.

Huh? Romania doesn't belong in either of those, despite that being equivalent to your example where pages in <Politicians from the Netherlands> also end up in <People> via <Politicians>.

And it gets worse the further up you go. You would have Romania in <Liquids> a few more levels up.

For this to work, each wiki would have to redo its category hierarchy as a real ontology based on is-a relationships, rather than the current is-somehow-related-to. Or we would have to introduce some magic word or something to tell MediaWiki that <Politicians> is-a <People> is a valid inference while <Black Sea countries> is-a <Black Sea> isn't.

In other words, code-wise adding "tags" to an article is the same as categories with inference and querying. But trying to use the existing category setup as it exists on something like enwiki as "tags" for inference (or querying, to a lesser extent) seems like GIGO.

Quite - the bit of my proposal where the categories would get created on Wikidata from scratch as a synthesis of the needs of the editing community. :-)

Implicitly, these would have clear semantics about the correctitude of their usage governed by something analogous to how Wikidata's community are managing the roll-out of statements on the system. In terms of tools to prevent this becoming an issue, Wikidata's nature means we could easily make sure that the domain of a category would be limited (e.g. "Fluids" maps to "substances", not "instances of substances").

...
...

Readers can search, querying across categories regardless of whether they're implicit or explicit

-> A search for the intersection of <People from the Netherlands> with <Politicians> will effectively return results for <Politicians from the Netherlands> (and the user doesn't need to know or care that this is an extant or non-extant category)

A person who is originally from the Netherlands but moved to Germany and became a politician there would be in <People from the Netherlands> and <Politicians>, but maybe should not be in <Politicians from the Netherlands> depending on how exactly you define that category.

Indeed; I deliberately chose to use <Politicians from the Netherlands> rather than <Politicians of the Netherlands> or <Politicians in the Netherlands> which are distinct categories with entirely different semantics, but you're right that semantics would need to be clear.

J.

James D. Forrester Product Manager, VisualEditor Wikimedia Foundation, Inc.

jforrester@wikimedia.org | @jdforrester

Brian Wolff

12:07 p.m.

On 2013-05-09 3:21 PM, "Luke Welling WMF" lwelling@wikimedia.org wrote:

...

Without deliberately making it an even longer term plan, as I think it is a great idea, another long-term solution to the same problem would be (as Flow gets Wikipedians into the idea of tagging) that categories get largely replaced by tags. That way they lose much of their absoluteness and therefore some of their controversy.

Categories are hard for Wikipedia because compromise is not possible. Consensus can be reached on a subtly different compromise version of the wording of a sentence or paragraph, but there is no compromise on categories. A category either exists or does not. A page either goes in or does not.

With tags, a biography could relatively uncontroversially be tagged as "Novelist, Woman, Best Selling, American, Blonde Haired, Enjoys Spicy Food" even if nearly everybody agrees that half the tags while true are entirely unimportant and not relevant to the subject's area of notability. Whether some tags like race and appearance should exist at all may still generate debate, but if they are only ever available modifiers and not hard categories their offense would be softened.

For some subjects, entirely uncontroversial tags could be extracted from Wikidata.

It would be a content shakeup and therefore perhaps politically difficult, but it would take a lot of the technical challenge out of joins, even permitting joins (automatically or manually) with tags translated into equivalent versions in other languages.

All possible combinations of tag derived categories would then "exist", and it would just be a matter of debate as to whether there is a justification to add a link from a page to "Biography+Novelist+Enjoys Spicy Food" or if that is a meaningless category. If reverted, the one person interested in that exact category could still always visit it, it's just that other users would not be directed to it unless they probe talk page debates.

Luke Welling

Nobody has ever been able to explain to me the technical difference between tag and category. (Other than being able to query intersections, which is wanted for cats anyhow)

Just change mediawiki:pagecategories to "tags" and change some social conventions - boom you have tags.

-bawolff

James Forrester

12:12 p.m.

On 9 May 2013 12:07, Brian Wolff bawolff@gmail.com wrote:

...

Nobody has ever been able to explain to me the technical difference between tag and category. (Other than being able to query intersections, which is wanted for cats anyhow)

The theory is that tags are non-hierarchical, casually-applied and well-supported in software (from intersections to more). You can see how people feel what we have is somewhat different from this world vision. :-)

-- James D. Forrester Product Manager, VisualEditor Wikimedia Foundation, Inc. jforrester@wikimedia.org | @jdforrester

Brian Wolff

12:35 p.m.

On 2013-05-09 4:13 PM, "James Forrester" jforrester@wikimedia.org wrote:

...

On 9 May 2013 12:07, Brian Wolff bawolff@gmail.com wrote:

...
Nobody has ever been able to explain to me the technical difference between tag and category. (Other than being able to query intersections, which is wanted for cats anyhow)

The theory is that tags are non-hierarchical, casually-applied and well-supported in software (from intersections to more). You can see how people feel what we have is somewhat different from this world vision. :-)

J.

James D. Forrester Product Manager, VisualEditor Wikimedia Foundation, Inc.

jforrester@wikimedia.org | @jdforrester

Categories can be flat if people want them to be. Categories can be casual if people want them to be (I guess the red link discourages that, but that's trivial to change).

People seem to want lots of things. When it comes to the tag camp, other than non-crappy category intersection, we seem to have the things people want, which confuses me why people are asking for them.

-bawolff

Sumana Harihareswara

12:58 p.m.

On 05/09/2013 03:07 PM, Brian Wolff wrote:

...

Just change mediawiki:pagecategories to "tags" and change some social conventions - boom you have tags.

Just a reminder - customs and assumptions are sometimes harder to change than digital or physical environments.

-- Sumana Harihareswara Engineering Community Manager Wikimedia Foundation

David Gerard

12:40 p.m.

On 9 May 2013 19:21, Luke Welling WMF lwelling@wikimedia.org wrote:

...

With tags, a biography could relatively uncontroversially be tagged as "Novelist, Woman, Best Selling, American, Blonde Haired, Enjoys Spicy Food" even if nearly everybody agrees that half the tags while true are entirely unimportant and not relevant to the subject's area of notability. Whether some tags like race and appearance should exist at all may still generate debate, but if they are only ever available modifiers and not hard categories their offense would be softened.

It appears sir is less than entirely familiar with Wikipedia edit wars. There is NO dispute that will not lead to six months of wikidrama.

So yeah, whatever we use to add tags needs to be able to add citations as well, right there next to the tag.

- d.

James Forrester

12:49 p.m.

On 9 May 2013 12:40, David Gerard dgerard@gmail.com wrote:

...

So yeah, whatever we use to add tags needs to be able to add citations as well, right there next to the tag.

Well, to me that's an even more compelling reason to create such a system (Categories as-is don't have them): "WikiTags let you cite the source of why the subject is tagged that way".

-- James D. Forrester Product Manager, VisualEditor Wikimedia Foundation, Inc. jforrester@wikimedia.org | @jdforrester

Matthew Flaschen

11:46 p.m.

On 05/09/2013 03:49 PM, James Forrester wrote:

...

On 9 May 2013 12:40, David Gerard dgerard@gmail.com wrote:

...
So yeah, whatever we use to add tags needs to be able to add citations as well, right there next to the tag.

Well, to me that's an even more compelling reason to create such a system (Categories as-is don't have them): "WikiTags let you cite the source of why the subject is tagged that way".

Or a reason to have all the real data in Wikidata, and have NewCategories be a way to query Wikidata. Wikidata already has (basic but improving) support for adding a citation to every statement.

But this (like switching to tags) would be a big change.

Matt Flaschen

Dmitriy Sintsov

10:43 p.m.

On 09.05.2013 20:28, Brad Jorsch wrote:

...

On Wed, May 8, 2013 at 10:47 PM, James Forrester jforrester@wikimedia.org wrote:

...

Pages are implicitly in the parent categories of their explicit categories

-> Pages in <Politicians from the Netherlands> are in <People from the Netherlands by profession> (its first parent) and <People from the Netherlands> (its first parent's parent) and <Politicians> (its second parent) and <People> (its second parent's parent) and …

-> Yes, this poses issues given the sometimes cyclic nature of categories' hierarchies, but this is relatively trivial to code around

Category cycles are the least of it. The fact that the existing category hierarchy isn't based on any sensible-for-inference ontology is a bigger problem.

Let's consider what would happen to one of my favorite examples on enwiki:

The article for Romania is in <Black Sea countries>. Ok.

And that category is in <Black Sea>, so Romania is in that too.

Which is a little strange, but not too bad.

And <Black Sea> is in <Seas of Russia> and <Landforms of Ukraine>.

Huh? Romania doesn't belong in either of those, despite that being equivalent to your example where pages in <Politicians from the Netherlands> also end up in <People> via <Politicians>.

There is probably nothing contradictory in your Black Sea category example, because "Seas of <country>" implies that <country> has *multiple* seas, while Romania has only *one* sea border (no offence; there are a lot of small countries, and a large country does not always mean a happy life). <Landforms of Ukraine> is a little weirder, but could be explained by the long and complex area of the Crimean peninsula. So the categories are actually not so wrong.

Dmitriy

Bartosz Dziewoński

10:55 p.m.

Dmitriy, yes, and that's the point - in practice Wikipedia categories are not transitive.

We've been working on this slowly at pl.wp lately, trying to split the categories into two kinds marked with appropriate templates: "topic" categories, including all related entries, e.g. Category:Water; and "object" ones, which would imply transitiveness, like the Politicians example provided above.

Topic cats could contain object ones, but not vice versa. This seems like a good compromise to me, but requires a good deal of work, and we're slowly progressing.
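A minimal sketch of how the topic/object split could restrict inference. It assumes that implicit membership only propagates through parents marked "object" and stops at any "topic" parent; that rule, the function name `implied_categories`, and the simplified category data are illustrative assumptions, not the actual pl.wp templates or any MediaWiki API.

```python
def implied_categories(start, kinds, parents):
    """Return the object categories implied by membership in `start`.

    Only parent links that lead to an "object" category are followed;
    "topic" parents are treated as related-but-not-is-a and end the walk.
    """
    seen = set()
    stack = [start]
    while stack:
        category = stack.pop()
        for parent in parents.get(category, ()):
            if kinds.get(parent) == "object" and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical, simplified data based on the examples in this thread.
KINDS = {
    "Politicians from the Netherlands": "object",
    "People from the Netherlands": "object",
    "Politicians": "object",
    "People": "object",
    "Black Sea countries": "object",
    "Black Sea": "topic",
    "Seas of Russia": "object",
}
PARENTS = {
    "Politicians from the Netherlands": ["People from the Netherlands", "Politicians"],
    "People from the Netherlands": ["People"],
    "Politicians": ["People"],
    "Black Sea countries": ["Black Sea"],
    "Black Sea": ["Seas of Russia"],
}

dutch = implied_categories("Politicians from the Netherlands", KINDS, PARENTS)
black_sea = implied_categories("Black Sea countries", KINDS, PARENTS)
```

Under these assumptions the Politicians example still propagates up to <People>, while pages in <Black Sea countries> are not pushed into <Seas of Russia>, because the walk stops at the topic category <Black Sea>.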

-- -- Matma Rex

Brad Jorsch

10 May

6:57 a.m.

On Fri, May 10, 2013 at 1:43 AM, Dmitriy Sintsov questpc@rambler.ru wrote:

...

On 09.05.2013 20:28, Brad Jorsch wrote:

...
On Wed, May 8, 2013 at 10:47 PM, James Forrester jforrester@wikimedia.org wrote:

...

Pages are implicitly in the parent categories of their explicit categories

-> Pages in <Politicians from the Netherlands> are in <People from the Netherlands by profession> (its first parent) and <People from the Netherlands> (its first parent's parent) and <Politicians> (its second parent) and <People> (its second parent's parent) and …

-> Yes, this poses issues given the sometimes cyclic nature of categories' hierarchies, but this is relatively trivial to code around

Category cycles are the least of it. The fact that the existing category hierarchy isn't based on any sensible-for-inference ontology is a bigger problem.

Let's consider what would happen to one of my favorite examples on enwiki:

The article for Romania is in <Black Sea countries>. Ok.

And that category is in <Black Sea>, so Romania is in that too.

Which is a little strange, but not too bad.

And <Black Sea> is in <Seas of Russia> and <Landforms of Ukraine>.

Huh? Romania doesn't belong in either of those, despite that being equivalent to your example where pages in <Politicians from the Netherlands> also end up in <People> via <Politicians>.

There is probably nothing contradictory in your Black Sea category example, because "Seas of <country>" implies that <country> has *multiple* seas, while Romania has only *one* sea border (no offence; there are a lot of small countries, and a large country does not always mean a happy life). <Landforms of Ukraine> is a little weirder, but could be explained by the long and complex area of the Crimean peninsula. So the categories are actually not so wrong.

I think you misunderstood. The point was that the article on *Romania* would end up in <Seas of Russia> and <Landforms of Ukraine>.

OTOH, I missed the part of James's original proposal about creating the whole ontology using this inference system from scratch on Wikidata based on strict is-a relationships. So <Black Sea countries> wouldn't be in <Black Sea> in the Wikidata ontology, because countries aren't the sea.
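To make the Romania case concrete, here is a rough sketch of naive "implicit parent" inference over the existing hierarchy. The function name `all_ancestors`, the dictionary layout, and the cycle edge from <Seas of Russia> back to <Black Sea> are hypothetical, added only to show that a visited set is enough to cope with cycles; the other parent links are the ones described in this thread.

```python
def all_ancestors(category, parents):
    """Collect every category reachable by following parent links upward."""
    seen = set()
    stack = [category]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:  # the visited set keeps cycles from looping
                seen.add(parent)
                stack.append(parent)
    return seen


PARENTS = {
    "Black Sea countries": ["Black Sea"],
    "Black Sea": ["Seas of Russia", "Landforms of Ukraine"],
    # Hypothetical cycle, included only to exercise the guard above.
    "Seas of Russia": ["Black Sea"],
}

romania_explicit = {"Black Sea countries"}
romania_implicit = set().union(
    *(all_ancestors(category, PARENTS) for category in romania_explicit)
)
```

With this data `romania_implicit` contains <Black Sea>, <Seas of Russia> and <Landforms of Ukraine>. The cycle is handled without trouble, but the result is still wrong for the article, which is the point: the hard part is the ontology, not the traversal.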

-- Brad Jorsch Software Engineer Wikimedia Foundation

Denny Vrandečić

14 May

8:08 a.m.

I agree with most of the use cases, and I think they will be possible with Wikidata.

My suggestion would be to wait for this year, and then see which of the use cases are still open: I think that by the end of the year we should have made all of them possible (except "Searches might be more than just intersections"; I am not sure about that one).

2013/5/9 James Forrester jforrester@wikimedia.org

...

On 8 May 2013 18:26, Sumana Harihareswara sumanah@wikimedia.org wrote:

...
Recently a lot of people have been talking about what's possible and what's necessary regarding MediaWiki, CatScan-like tools, and real category intersection; this mail has some pointers.

The long-term solution is a sparkly query for, e.g., people with aspects novelist + Singaporean, and it would be great if Wikidata could be the data-source. Generally people don't really want to search using hierarchical categories; they want tags and they want AND. But MediaWiki's current power users do use hierarchical labels, so any change would have to deal with current users' expectations. Also my head hurts just thinking of the "but my intuitively obvious ontology is better than yours" arguments.

To put a nice clear stake in the ground, a magic-world-of-loveliness sparkly proposal for 2015* might be:

Categories are implemented in Wikidata

-> They're in whatever language the user wants (so fr:Chat and en:Cat and nl:kat and zh-han-t:貓 …)

-> They're properly queryable

-> They're shared between wikis (pooled expertise)

Pages are implicitly in the parent categories of their explicit categories

-> Pages in <Politicians from the Netherlands> are in <People from the Netherlands by profession> (its first parent) and <People from the Netherlands> (its first parent's parent) and <Politicians> (its second parent) and <People> (its second parent's parent) and …

-> Yes, this poses issues given the sometimes cyclic nature of categories' hierarchies, but this is relatively trivial to code around

Readers can search, querying across categories regardless of whether they're implicit or explicit

-> A search for the intersection of <People from the Netherlands> with <Politicians> will effectively return results for <Politicians from the Netherlands> (and the user doesn't need to know or care that this is an extant or non-extant category)

-> Searches might be more than just intersections, e.g. "<Painters from the United Kingdom> AND <Living people> NOT <Members of the Royal Academy>" or whatever.

-> Such queries might be cached (and, indeed, the intersections that people search for might be used to suggest new categorisation schemata that wikis had previously not considered - e.g. <British politicians> & <People with pet cats> & <People who died in hot-ballooning accidents>)
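The search behaviour described above amounts to set algebra over category membership. A minimal sketch, assuming each category's (explicit plus implicit) members are already available as a set of page titles; the `query` helper, the `MEMBERS` layout, and the page names are all made up for illustration.

```python
def query(members, all_of, none_of=()):
    """Pages in every category of `all_of` and in no category of `none_of`."""
    if not all_of:
        return set()
    # set.intersection returns a new set, so the stored sets are not modified.
    result = set.intersection(*(members.get(name, set()) for name in all_of))
    for name in none_of:
        result -= members.get(name, set())
    return result


# Hypothetical membership data; the page titles are placeholders.
MEMBERS = {
    "Painters from the United Kingdom": {"Painter A", "Painter B", "Painter C"},
    "Living people": {"Painter A", "Painter B", "Novelist D"},
    "Members of the Royal Academy": {"Painter B"},
}

living_non_ra_painters = query(
    MEMBERS,
    all_of=["Painters from the United Kingdom", "Living people"],
    none_of=["Members of the Royal Academy"],
)
```

Caching, as suggested in the last point, could then be keyed on the pair of `all_of` and `none_of` names, though that is a design guess rather than anything specified in the proposal.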

Editors can tag articles with leaf or branch categories, potentially over-lapping and the system will rationalise the categories on save to the minimally-spanning subset (or whatever is most useful for users, the database, and/or both)

-> Editors don't need to know the hierarchy of categories *a priori* when adding pages to them (yay, less difficulty)

-> Power editors don't need to type in loads of different categories if they have a very specific one in mind (yay, still flexible)

-> Categories shown to readers aren't necessarily the categories saved in the database, at editorial judgement (otherwise, would a page not be in just a single category, namely the intersection of all its tagged categories?)
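One way to read "rationalise the categories on save to the minimally-spanning subset" is: drop every tagged category that is already implied by another tagged category. The sketch below takes that reading as an assumption; `minimal_categories`, `ancestors`, and the sample hierarchy are illustrative names and data, not an existing implementation.

```python
def ancestors(category, parents):
    """Every category reachable from `category` via parent links."""
    seen = set()
    stack = [category]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def minimal_categories(tagged, parents):
    """Keep only the tagged categories not implied by another tagged one."""
    tagged = set(tagged)
    implied = set()
    for category in tagged:
        # Exclude the category itself so a cycle does not make it imply itself.
        implied |= ancestors(category, parents) - {category}
    return tagged - implied


# Hierarchy taken from the Politicians example in the proposal.
PARENTS = {
    "Politicians from the Netherlands": [
        "People from the Netherlands by profession",
        "Politicians",
    ],
    "People from the Netherlands by profession": ["People from the Netherlands"],
    "People from the Netherlands": ["People"],
    "Politicians": ["People"],
}

saved = minimal_categories(
    {"Politicians from the Netherlands", "Politicians", "People", "Living people"},
    PARENTS,
)
```

Here `saved` keeps <Politicians from the Netherlands> and <Living people> and drops the two implied branch categories. One known limitation of this sketch: if two tagged categories sit on the same cycle, each implies the other and both are dropped, so real code would need a tie-breaking rule.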

Apart from the time and resources needed to make this happen and operational, does this sound like something we'd want to do? It feels like this, or something like it, would serve our editors and readers the best from their perspective, if not our sysadmins. :-)

[Snip]

...
I think the best place to pursue this topic is probably in https://meta.wikimedia.org/wiki/Talk:Beyond_categories . It's unlikely Wikimedia Foundation will be able to make engineers available to work on this anytime soon, but I would not be surprised if the Wikidata developer community or volunteers found this interesting enough to work on.

I guess I should post this there too, maybe once someone's told me if it's mad-cap. ;-)

J.

James D. Forrester Product Manager, VisualEditor Wikimedia Foundation, Inc.

jforrester@wikimedia.org | @jdforrester

participants (10)

Bartosz Dziewoński
Brad Jorsch
Brian Wolff
David Gerard
Denny Vrandečić
Dmitriy Sintsov
James Forrester
Luke Welling WMF
Matthew Flaschen
Sumana Harihareswara