Recently a lot of people have been talking about what's possible and what's necessary regarding MediaWiki, CatScan-like tools, and real category intersection; this mail has some pointers.
The long-term solution is a sparkly query for, e.g., people with aspects novelist + Singaporean, and it would be great if Wikidata could be the data-source. Generally people don't really want to search using hierarchical categories; they want tags and they want AND. But MediaWiki's current power users do use hierarchical labels, so any change would have to deal with current users' expectations. Also my head hurts just thinking of the "but my intuitively obvious ontology is better than yours" arguments.
Conversations have been a bit scattered:
https://meta.wikimedia.org/wiki/Beyond_categories
http://lists.wikimedia.org/pipermail/wikidata-l/2013-May/thread.html#2202 ("Question about wikipedia categories.")
https://en.wikipedia.org/wiki/Wikipedia_talk:Category_intersection#A_working...
CatScan, which can find articles in category intersections: https://en.wikipedia.org/wiki/Wikipedia:CatScan
http://lists.wikimedia.org/pipermail/gendergap/2013-April/003552.html
https://bugzilla.wikimedia.org/show_bug.cgi?id=5244 "Allow searching in intersections, etc. of categories"
I think the best place to pursue this topic is probably in https://meta.wikimedia.org/wiki/Talk:Beyond_categories . It's unlikely Wikimedia Foundation will be able to make engineers available to work on this anytime soon, but I would not be surprised if the Wikidata developer community or volunteers found this interesting enough to work on.
On 8 May 2013 18:26, Sumana Harihareswara sumanah@wikimedia.org wrote:
Recently a lot of people have been talking about what's possible and what's necessary regarding MediaWiki, CatScan-like tools, and real category intersection; this mail has some pointers.
The long-term solution is a sparkly query for, e.g., people with aspects novelist + Singaporean, and it would be great if Wikidata could be the data-source. Generally people don't really want to search using hierarchical categories; they want tags and they want AND. But MediaWiki's current power users do use hierarchical labels, so any change would have to deal with current users' expectations. Also my head hurts just thinking of the "but my intuitively obvious ontology is better than yours" arguments.
To put a nice clear stake in the ground, a magic-world-of-loveliness sparkly proposal for 2015* might be:
* Categories are implemented in Wikidata
  * -> They're in whatever language the user wants (so fr:Chat and en:Cat and nl:kat and zh-han-t:貓 …)
  * -> They're properly queryable
  * -> They're shared between wikis (pooled expertise)
* Pages are implicitly in the parent categories of their explicit categories
  * -> Pages in <Politicians from the Netherlands> are in <People from the Netherlands by profession> (its first parent) and <People from the Netherlands> (its first parent's parent) and <Politicians> (its second parent) and <People> (its second parent's parent) and …
  * -> Yes, this poses issues given the sometimes cyclic nature of categories' hierarchies, but this is relatively trivial to code around
* Readers can search, querying across categories regardless of whether they're implicit or explicit
  * -> A search for the intersection of <People from the Netherlands> with <Politicians> will effectively return results for <Politicians from the Netherlands> (and the user doesn't need to know or care that this is an extant or non-extant category)
  * -> Searches might be more than just intersections, e.g. "<Painters from the United Kingdom> AND <Living people> NOT <Members of the Royal Academy>" or whatever.
  * -> Such queries might be cached (and, indeed, the intersections that people search for might be used to suggest new categorisation schemata that wikis had previously not considered - e.g. <British politicians> & <People with pet cats> & <People who died in hot-ballooning accidents>)
* Editors can tag articles with leaf or branch categories, potentially over-lapping and the system will rationalise the categories on save to the minimally-spanning subset (or whatever is most useful for users, the database, and/or both)
  * -> Editors don't need to know the hierarchy of categories *a priori* when adding pages to them (yay, less difficulty)
  * -> Power editors don't need to type in loads of different categories if they have a very specific one in mind (yay, still flexible)
  * -> Categories shown to readers aren't necessarily the categories saved in the database, at editorial judgement (otherwise, would a page not be in just a single category, namely the intersection of all its tagged categories?)
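
A minimal sketch of the implicit-membership and search points above, assuming categories are held as a simple in-memory mapping from each category to its direct parents. The names `PARENTS`, `PAGES`, `implicit_categories` and `search`, the page titles, and the <Humans> category are hypothetical and only for illustration; this is not how MediaWiki or Wikidata store categories. The visited set is what lets the traversal "code around" cycles.

```python
from collections import deque

# Hypothetical toy data: category -> its direct parent categories.
PARENTS = {
    "Politicians from the Netherlands": [
        "People from the Netherlands by profession",
        "Politicians",
    ],
    "People from the Netherlands by profession": ["People from the Netherlands"],
    "People from the Netherlands": ["People"],
    "Politicians": ["People"],
    # Deliberate cycle (People <-> Humans) to show that traversal terminates.
    "People": ["Humans"],
    "Humans": ["People"],
}

# Hypothetical pages with their explicitly assigned categories.
PAGES = {
    "Page A": ["Politicians from the Netherlands"],
    "Page B": ["People from the Netherlands"],
}


def implicit_categories(explicit, parents=PARENTS):
    """Return the explicit categories plus every ancestor reachable from them."""
    seen = set(explicit)
    queue = deque(explicit)
    while queue:
        cat = queue.popleft()
        for parent in parents.get(cat, []):
            if parent not in seen:  # skip already-visited nodes, so cycles are harmless
                seen.add(parent)
                queue.append(parent)
    return seen


def search(pages, all_of=(), none_of=()):
    """Titles of pages in every category of all_of and in no category of none_of."""
    hits = []
    for title, explicit in pages.items():
        cats = implicit_categories(explicit)
        if set(all_of) <= cats and not (set(none_of) & cats):
            hits.append(title)
    return sorted(hits)
```

With this toy data, `search(PAGES, all_of=["People from the Netherlands", "Politicians"])` finds "Page A" even though the page was only ever tagged with the more specific category.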
Apart from the time and resources needed to make this happen and operational, does this sound like something we'd want to do? It feels like this, or something like it, would serve our editors and readers the best from their perspective, if not our sysadmins. :-)
[Snip]
I think the best place to pursue this topic is probably in https://meta.wikimedia.org/wiki/Talk:Beyond_categories . It's unlikely Wikimedia Foundation will be able to make engineers available to work on this anytime soon, but I would not be surprised if the Wikidata developer community or volunteers found this interesting enough to work on.
I guess I should post this there too, maybe once someone's told me if it's mad-cap. ;-)
J.
On 2013-05-08 11:48 PM, "James Forrester" jforrester@wikimedia.org wrote:
On 8 May 2013 18:26, Sumana Harihareswara sumanah@wikimedia.org wrote:
Recently a lot of people have been talking about what's possible and what's necessary regarding MediaWiki, CatScan-like tools, and real category intersection; this mail has some pointers.
The long-term solution is a sparkly query for, e.g., people with aspects novelist + Singaporean, and it would be great if Wikidata could be the data-source. Generally people don't really want to search using hierarchical categories; they want tags and they want AND. But MediaWiki's current power users do use hierarchical labels, so any change would have to deal with current users' expectations. Also my head hurts just thinking of the "but my intuitively obvious ontology is better than yours" arguments.
To put a nice clear stake in the ground, a magic-world-of-loveliness sparkly proposal for 2015* might be:
Just to clarify, you mean sparkles in the way that a unicorn sparkles as it's hopping over a rainbow, not sparkle as in SPARQL (semantic triple store based)?
- Categories are implemented in Wikidata
- -> They're in whatever language the user wants (so fr:Chat and en:Cat and nl:kat and zh-han-t:貓 …)
Issue (can probably be dealt with somehow, or may be rare enough not to care about): conflicts. What if the name of one category in French is the same as that of a different category in Spanish? This may be a non-issue if done using Wikidata numeric IDs.
- -> They're properly queryable
Various groups have various definitions of this.
- -> They're shared between wikis (pooled expertise)
Between Wikipedias, or all Wikimedia wikis? Category structure has varied meaning between projects. Category:North_America has different types of pages in enwikinews compared to enwikipedia.
- Pages are implicitly in the parent categories of their explicit categories
- -> Pages in <Politicians from the Netherlands> are in <People from the Netherlands by profession> (its first parent) and <People from the Netherlands> (its first parent's parent) and <Politicians> (its second parent) and <People> (its second parent's parent) and …
- -> Yes, this poses issues given the sometimes cyclic nature of categories' hierarchies, but this is relatively trivial to code around
In the current structure, it doesn't make sense for Bob to be in a list of people by profession. It makes less sense the further you traverse the category graph. OTOH, better querying capabilities may turn the category system into more of a flat namespace, making that less of an issue.
- Readers can search, querying across categories regardless of whether they're implicit or explicit
- -> A search for the intersection of <People from the Netherlands> with <Politicians> will effectively return results for <Politicians from the Netherlands> (and the user doesn't need to know or care that this is an extant or non-extant category)
We would need some system to turn fake cats into real queries. I suppose users could make redirects. The alternative of magic NLP sounds difficult.
- -> Searches might be more than just intersections, e.g. "<Painters from the United Kingdom> AND <Living people> NOT <Members of the Royal Academy>" or whatever.
- -> Such queries might be cached (and, indeed, the intersections that people search for might be used to suggest new categorisation schemata that wikis had previously not considered - e.g. <British politicians> & <People with pet cats> & <People who died in hot-ballooning accidents>)
Dealing with cache invalidation (unless it is quite coarse grained) may be difficult.
- Editors can tag articles with leaf or branch categories, potentially over-lapping and the system will rationalise the categories on save to the minimally-spanning subset (or whatever is most useful for users, the database, and/or both)
That's quite an interesting idea, and one I haven't heard before from previous times this has been brought up.
One concern I'd have is how to figure out which categories to list at the bottom of the page (all that could fit, or only the base categories, and how to determine what those are).
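
One possible reading of "rationalise to the minimally-spanning subset" is sketched below: drop every tagged category that is already implied as an ancestor of another tagged category. This is only a sketch under that assumption; `PARENTS`, `ancestors` and `rationalise` are hypothetical names, and the sketch assumes an acyclic hierarchy (two tagged categories that sit in a cycle with each other would both be dropped, so a real implementation would need a tie-break rule).

```python
# Hypothetical toy hierarchy: category -> its direct parent categories.
PARENTS = {
    "Politicians from the Netherlands": [
        "People from the Netherlands by profession",
        "Politicians",
    ],
    "People from the Netherlands by profession": ["People from the Netherlands"],
    "People from the Netherlands": ["People"],
    "Politicians": ["People"],
}


def ancestors(cat, parents):
    """All categories reachable from cat by following parent links."""
    seen = set()
    stack = [cat]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def rationalise(tagged, parents):
    """Keep only the tagged categories that no other tagged category implies."""
    tagged = set(tagged)
    implied = set()
    for cat in tagged:
        # Exclude cat itself so a self-loop cannot remove a category.
        implied |= ancestors(cat, parents) - {cat}
    return tagged - implied
```

Under this reading, the stored set would be the output of `rationalise`, while the set shown at the bottom of the page could be anything between that and the full ancestor closure, which is exactly the display question raised above.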
- -> Editors don't need to know the hierarchy of categories *a priori* when adding pages to them (yay, less difficulty)
- -> Power editors don't need to type in loads of different categories if they have a very specific one in mind (yay, still flexible)
- -> Categories shown to readers aren't necessarily the categories saved in the database, at editorial judgement (otherwise, would a page not be in just a single category, namely the intersection of all its tagged categories?)
Apart from the time and resources needed to make this happen and operational, does this sound like something we'd want to do? It feels like this, or something like it, would serve our editors and readers the best from their perspective, if not our sysadmins. :-)
[Snip]
I think the best place to pursue this topic is probably in https://meta.wikimedia.org/wiki/Talk:Beyond_categories . It's unlikely Wikimedia Foundation will be able to make engineers available to work on this anytime soon, but I would not be surprised if the Wikidata developer community or volunteers found this interesting enough to work on.
I guess I should post this there too, maybe once someone's told me if it's mad-cap. ;-)
I think you have captured what a lot of people want in a somewhat dreamy sense. However, there is still a lot to do to make that vision concrete. In particular, I think there would be non-trivial UI challenges in making this understandable to the user.
----
From what I hear, Wikidata phase 3 is basically going to be support for inline queries. Details are vague, but if they support the typical types of queries you associate with semantic networks, there is category intersection right there.
If any of the Wikidata folk could comment on what sort of queries are planned for phase 3, performance/scaling considerations, and technologies being considered (triple store?), I'd be very interested in hearing. (I recognize that future plans may not exist yet.)
More generally, it would be interesting to know the performance characteristics of SPARQL-type query systems, since people seem to be talking about them. Are they a non-starter, or could they be feasible? Semantic and efficient are not words I associate with each other, but that is due to rumour, not actual data. (Although my brief googling doesn't exactly look promising.)
-bawolff
On 05/09/2013 12:13 AM, Brian Wolff wrote:
Between Wikipedias, or all Wikimedia wikis? Category structure has varied meaning between projects. Category:North_America has different types of pages in enwikinews compared to enwikipedia.
Right. It would have to be grouped by top-level project (all Wikipedias, all Wikinewses, etc.). But even then you could run into inter-project conflicts. However, I think many of the *intra*-project category disagreements are actually about when intersections are appropriate, so it might not be that bad (since the trend would be towards intersections as just queries).
Matt Flaschen
2013/5/9 Brian Wolff bawolff@gmail.com
From what I hear, Wikidata phase 3 is basically going to be support for inline queries. Details are vague, but if they support the typical types of queries you associate with semantic networks, there is category intersection right there.
Right.
If any of the Wikidata folk could comment on what sort of queries are planned for phase 3, performance/scaling considerations, and technologies being considered (triple store?), I'd be very interested in hearing. (I recognize that future plans may not exist yet.)
For now we aim at "property value" or "property value-restriction" type queries, and their intersections, i.e.
"lives in -> California" "born -> before 1980" "lives in -> California AND born -> before 1980"
We are considering several different technologies, and we had a number of discussions already with a number of people.
My current gut feeling, based on the limited tests we have done so far, is that we start with our normal SQL setup for the first two types of queries, and that we extend to Solr for the third type of queries (i.e. intersections).
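
To make the three query types concrete, here is a small in-memory sketch. It does not use SQL or Solr and says nothing about how Wikidata would implement them; the item identifiers, the `ITEMS` data, and the helpers `has_value`, `value_before` and `query` are all hypothetical.

```python
# Hypothetical items with their property values (not real Wikidata IDs).
ITEMS = {
    "item-1": {"lives in": "California", "born": 1975},
    "item-2": {"lives in": "California", "born": 1985},
    "item-3": {"lives in": "Oregon", "born": 1970},
}


def has_value(prop, value):
    """'property value' query: the item's property equals the given value."""
    return lambda claims: claims.get(prop) == value


def value_before(prop, limit):
    """'property value-restriction' query: the property exists and is below limit."""
    return lambda claims: prop in claims and claims[prop] < limit


def query(items, *predicates):
    """Intersection: ids of the items that satisfy every predicate, sorted."""
    return sorted(
        item_id
        for item_id, claims in items.items()
        if all(predicate(claims) for predicate in predicates)
    )
```

Passing a single predicate gives the first two query types; passing both, as in `query(ITEMS, has_value("lives in", "California"), value_before("born", 1980))`, gives the intersection.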
more generally it would be interesting to know the performance characteristics of SPARQL type query systems, since people seem to be talking about them. Are they a non starter or could they be feasible?
We are currently not aiming to provide an unrestricted SPARQL endpoint. But I know that a few other organizations are very interested in setting that up. If it turns out to be feasible for us, I'd be very happy if we did it too.
Semantic and efficient are not words I associate with each other, but that is due to rumour, not actual data. (Although my brief googling doesn't exactly look promising.)
That's indeed a rumor, but leads to another discussion :)
Cheers, Denny
-bawolff _______________________________________________ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Wed, May 8, 2013 at 10:47 PM, James Forrester jforrester@wikimedia.org wrote:
- Pages are implicitly in the parent categories of their explicit categories
- -> Pages in <Politicians from the Netherlands> are in <People from the Netherlands by profession> (its first parent) and <People from the Netherlands> (its first parent's parent) and <Politicians> (its second parent) and <People> (its second parent's parent) and …
- -> Yes, this poses issues given the sometimes cyclic nature of categories' hierarchies, but this is relatively trivial to code around
Category cycles are the least of it. The fact that the existing category hierarchy isn't based on any sensible-for-inference ontology is a bigger problem.
Let's consider what would happen to one of my favorite examples on enwiki:
* The article for Romania is in <Black Sea countries>. Ok.
* And that category is in <Black Sea>, so Romania is in that too. Which is a little strange, but not too bad.
* And <Black Sea> is in <Seas of Russia> and <Landforms of Ukraine>. Huh? Romania doesn't belong in either of those, despite that being equivalent to your example where pages in <Politicians from the Netherlands> also end up in <People> via <Politicians>.
And it gets worse the further up you go. You would have Romania in <Liquids> a few more levels up.
For this to work, each wiki would have to redo its category hierarchy as a real ontology based on is-a relationships, rather than the current is-somehow-related-to. Or we would have to introduce some magic word or something to tell MediaWiki that <Politicians> is-a <People> is a valid inference while <Black Sea countries> is-a <Black Sea> isn't.
In other words, code-wise adding "tags" to an article is the same as categories with inference and querying. But trying to use the existing category setup as it exists on something like enwiki as "tags" for inference (or querying, to a lesser extent) seems like GIGO.
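
A minimal sketch of the "magic word" alternative, assuming each parent link could carry a label saying whether it is a real is-a relationship. The `EDGES` data, the labels "is-a" and "related-to", and the function `inferred_categories` are hypothetical; nothing like this exists in MediaWiki's category tables, and which label each existing link should get is an editorial judgement made up here for illustration.

```python
# Hypothetical labelled hierarchy: category -> list of (parent, relationship kind).
EDGES = {
    "Politicians from the Netherlands": [("Politicians", "is-a")],
    "Politicians": [("People", "is-a")],
    "Black Sea countries": [("Black Sea", "related-to")],
    "Black Sea": [("Seas of Russia", "is-a"), ("Landforms of Ukraine", "is-a")],
}


def inferred_categories(start, edges=EDGES):
    """Follow only is-a links upward from start; related-to links stop inference."""
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for parent, kind in edges.get(current, []):
            if kind == "is-a" and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

With these labels, a page in <Politicians from the Netherlands> is still inferred to be in <People>, while a page in <Black Sea countries> is not pushed into <Black Sea>, <Seas of Russia> or anything above them.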
- Readers can search, querying across categories regardless of whether they're implicit or explicit
- -> A search for the intersection of <People from the Netherlands> with <Politicians> will effectively return results for <Politicians from the Netherlands> (and the user doesn't need to know or care that this is an extant or non-extant category)
A person who is originally from the Netherlands but moved to Germany and became a politician there would be in <People from the Netherlands> and <Politicians>, but maybe should not be in <Politicians from the Netherlands> depending on how exactly you define that category.
[I worry we're talking about operational details, which should be a wider discussion, rather than a technology/feasibility conversation to which this list is more suited. Perhaps moving this on-wiki would be best?]
On 9 May 2013 09:28, Brad Jorsch bjorsch@wikimedia.org wrote:
On Wed, May 8, 2013 at 10:47 PM, James Forrester jforrester@wikimedia.org wrote:
- Pages are implicitly in the parent categories of their explicit categories
- -> Pages in <Politicians from the Netherlands> are in <People from the Netherlands by profession> (its first parent) and <People from the Netherlands> (its first parent's parent) and <Politicians> (its second parent) and <People> (its second parent's parent) and …
- -> Yes, this poses issues given the sometimes cyclic nature of categories' hierarchies, but this is relatively trivial to code around
Category cycles are the least of it. The fact that the existing category hierarchy isn't based on any sensible-for-inference ontology is a bigger problem.
Let's consider what would happen to one of my favorite examples on enwiki:
- The article for Romania is in <Black Sea countries>. Ok.
- And that category is in <Black Sea>, so Romania is in that too.
Which is a little strange, but not too bad.
- And <Black Sea> is in <Seas of Russia> and <Landforms of Ukraine>.
Huh? Romania doesn't belong in either of those, despite that being equivalent to your example where pages in <Politicians from the Netherlands> also end up in <People> via <Politicians>.
And it gets worse the further up you go. You would have Romania in <Liquids> a few more levels up.
For this to work, each wiki would have to redo its category hierarchy as a real ontology based on is-a relationships, rather than the current is-somehow-related-to. Or we would have to introduce some magic word or something to tell MediaWiki that <Politicians> is-a <People> is a valid inference while <Black Sea countries> is-a <Black Sea> isn't.
In other words, code-wise adding "tags" to an article is the same as categories with inference and querying. But trying to use the existing category setup as it exists on something like enwiki as "tags" for inference (or querying, to a lesser extent) seems like GIGO.
Quite - the bit of my proposal where the categories would get created on Wikidata from scratch as a synthesis of the needs of the editing community. :-)
Implicitly, these would have clear semantics about the correctitude of their usage governed by something analogous to how Wikidata's community are managing the roll-out of statements on the system. In terms of tools to prevent this becoming an issue, Wikidata's nature means we could easily make sure that the domain of a category would be limited (e.g. "Fluids" maps to "substances", not "instances of substances").
- Readers can search, querying across categories regardless of whether they're implicit or explicit
- -> A search for the intersection of <People from the Netherlands> with <Politicians> will effectively return results for <Politicians from the Netherlands> (and the user doesn't need to know or care that this is an extant or non-extant category)
A person who is originally from the Netherlands but moved to Germany and became a politician there would be in <People from the Netherlands> and <Politicians>, but maybe should not be in <Politicians from the Netherlands> depending on how exactly you define that category.
Indeed; I deliberately chose to use <Politicians from the Netherlands> rather than <Politicians of the Netherlands> or <Politicians in the Netherlands> which are distinct categories with entirely different semantics, but you're right that semantics would need to be clear.
J.
Without deliberately making it an even longer term plan, as I think it is a great idea, another long goal solution to the same problem would be (as Flow gets Wikipedians into the idea of tagging) that categories get largely replaced by tags. That way they lose much of their absoluteness and therefore some of their controversy.
Categories are hard for Wikipedia because compromise is not possible. Consensus can be reached on a subtly different compromise version of the wording of a sentence or paragraph, but there is no compromise on categories. A category either exists or does not. A page either goes in or does not.
With tags, a biography could relatively uncontroversially be tagged as "Novelist, Woman, Best Selling, American, Blonde Haired, Enjoys Spicy Food" even if nearly everybody agrees that half the tags while true are entirely unimportant and not relevant to the subject's area of notability. Whether some tags like race and appearance should exist at all may still generate debate, but if they are only ever available modifiers and not hard categories their offense would be softened.
For some subjects, entirely uncontroversial tags could be extracted from Wikidata.
It would be a content shakeup and therefore perhaps politically difficult, but it would take a lot of the technical challenge out of joins, even permitting joins (automatically or manually) with tags translated into equivalent versions in other languages.
All possible combinations of tag derived categories would then "exist", and it would just be a matter of debate as to whether there is a justification to add a link from a page to "Biography+Novelist+Enjoys Spicy Food" or if that is a meaningless category. If reverted, the one person interested in that exact category could still always visit it, it's just that other users would not be directed to it unless they probe talk page debates.
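
A minimal sketch of how every tag combination could "exist" without being created: keep an index from each tag to the pages carrying it, and compute a combination such as "Biography+Novelist+Enjoys Spicy Food" on demand as a set intersection. The page titles, the `+` separator as a query syntax, and the helpers `build_index` and `pages_for` are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical pages and their tags.
PAGE_TAGS = {
    "Author X": ["Biography", "Novelist", "Enjoys Spicy Food"],
    "Author Y": ["Biography", "Novelist"],
    "Chef Z": ["Biography", "Enjoys Spicy Food"],
}


def build_index(page_tags):
    """Invert page -> tags into tag -> set of pages."""
    index = defaultdict(set)
    for page, tags in page_tags.items():
        for tag in tags:
            index[tag].add(page)
    return index


def pages_for(index, combo):
    """Pages carrying every tag in a '+'-separated combination."""
    tags = [tag.strip() for tag in combo.split("+")]
    # .get() avoids creating empty entries for unknown tags in the defaultdict.
    page_sets = [index.get(tag, set()) for tag in tags]
    return set.intersection(*page_sets)
```

Nothing has to be stored for a combination; whether a page links to it is then purely the editorial question described above.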
Luke Welling
On Thu, May 9, 2013 at 12:38 PM, James Forrester jforrester@wikimedia.org wrote:
[I worry we're talking about operational details, which should be a wider discussion, rather than a technology/feasibility conversation to which this list is more suited. Perhaps moving this on-wiki would be best?]
On 9 May 2013 09:28, Brad Jorsch bjorsch@wikimedia.org wrote:
On Wed, May 8, 2013 at 10:47 PM, James Forrester jforrester@wikimedia.org wrote:
- Pages are implicitly in the parent categories of their explicit categories
- -> Pages in <Politicians from the Netherlands> are in <People from the Netherlands by profession> (its first parent) and <People from the Netherlands> (its first parent's parent) and <Politicians> (its second parent) and <People> (its second parent's parent) and …
- -> Yes, this poses issues given the sometimes cyclic nature of categories' hierarchies, but this is relatively trivial to code around
Category cycles are the least of it. The fact that the existing category hierarchy isn't based on any sensible-for-inference ontology is a bigger problem.
Let's consider what would happen to one of my favorite examples on enwiki:
- The article for Romania is in <Black Sea countries>. Ok.
- And that category is in <Black Sea>, so Romania is in that too.
Which is a little strange, but not too bad.
- And <Black Sea> is in <Seas of Russia> and <Landforms of Ukraine>.
Huh? Romania doesn't belong in either of those, despite that being equivalent to your example where pages in <Politicians from the Netherlands> also end up in <People> via <Politicians>.
And it gets worse the further up you go. You would have Romania in <Liquids> a few more levels up.
For this to work, each wiki would have to redo its category hierarchy as a real ontology based on is-a relationships, rather than the current is-somehow-related-to. Or we would have to introduce some magic word or something to tell MediaWiki that <Politicians> is-a <People> is a valid inference while <Black Sea countries> is-a <Black Sea> isn't.
In other words, code-wise adding "tags" to an article is the same as categories with inference and querying. But trying to use the existing category setup as it exists on something like enwiki as "tags" for inference (or querying, to a lesser extent) seems like GIGO.
Quite - the bit of my proposal where the categories would get created on Wikidata from scratch as a synthesis of the needs of the editing community. :-)
Implicitly, these would have clear semantics about the correctitude of their usage governed by something analogous to how Wikidata's community are managing the roll-out of statements on the system. In terms of tools to prevent this becoming an issue, Wikidata's nature means we could easily make sure that the domain of a category would be limited (e.g. "Fluids" maps to "substances", not "instances of substances").
- Readers can search, querying across categories regardless of whether they're implicit or explicit
- -> A search for the intersection of <People from the Netherlands> with <Politicians> will effectively return results for <Politicians from the Netherlands> (and the user doesn't need to know or care that this is an extant or non-extant category)
A person who is originally from the Netherlands but moved to Germany and became a politician there would be in <People from the Netherlands> and <Politicians>, but maybe should not be in <Politicians from the Netherlands> depending on how exactly you define that category.
Indeed; I deliberately chose to use <Politicians from the Netherlands> rather than <Politicians of the Netherlands> or <Politicians in the Netherlands> which are distinct categories with entirely different semantics, but you're right that semantics would need to be clear.
J.
James D. Forrester Product Manager, VisualEditor Wikimedia Foundation, Inc.
jforrester@wikimedia.org | @jdforrester
On 2013-05-09 3:21 PM, "Luke Welling WMF" lwelling@wikimedia.org wrote:
Without deliberately making it an even longer term plan, as I think it is a great idea, another long goal solution to the same problem would be (as Flow gets Wikipedians into the idea of tagging) that categories get largely replaced by tags. That way they lose much of their absoluteness and therefore some of their controversy.
Categories are hard for Wikipedia because compromise is not possible. Consensus can be reached on a subtly different compromise version of the wording of a sentence or paragraph, but there is no compromise on categories. A category either exists or does not. A page either goes in or does not.
With tags, a biography could relatively uncontroversially be tagged as "Novelist, Woman, Best Selling, American, Blonde Haired, Enjoys Spicy Food" even if nearly everybody agrees that half the tags while true are entirely unimportant and not relevant to the subject's area of notability. Whether some tags like race and appearance should exist at all may still generate debate, but if they are only ever available modifiers and not hard categories their offense would be softened.
For some subjects, entirely uncontroversial tags could be extracted from Wikidata.
It would be a content shakeup and therefore perhaps politically difficult, but it would take a lot of the technical challenge out of joins, even permitting joins (automatically or manually) with tags translated into equivalent versions in other languages.
All possible combinations of tag derived categories would then "exist", and it would just be a matter of debate as to whether there is a justification to add a link from a page to "Biography+Novelist+Enjoys Spicy Food" or if that is a meaningless category. If reverted, the one person interested in that exact category could still always visit it, it's just that other users would not be directed to it unless they probe talk page debates.
Luke Welling
Nobody has ever been able to explain to me the technical difference between tag and category. (Other than being able to query intersections, which is wanted for cats anyhow)
Just change mediawiki:pagecategories to "tags" and change some social conventions - boom you have tags.
-bawolff
On 9 May 2013 12:07, Brian Wolff bawolff@gmail.com wrote:
Nobody has ever been able to explain to me the technical difference between tag and category. (Other than being able to query intersections, which is wanted for cats anyhow)
The theory is that tags are non-hierarchical, casually-applied and well-supported in software (from intersections to more). You can see how people feel what we have is somewhat different from this world vision. :-)
J.
On 2013-05-09 4:13 PM, "James Forrester" jforrester@wikimedia.org wrote:
On 9 May 2013 12:07, Brian Wolff bawolff@gmail.com wrote:
Nobody has ever been able to explain to me the technical difference between tag and category. (Other than being able to query intersections, which is wanted for cats anyhow)
The theory is that tags are non-hierarchical, casually-applied and well-supported in software (from intersections to more). You can see how people feel what we have is somewhat different from this world vision. :-)
J.
James D. Forrester Product Manager, VisualEditor Wikimedia Foundation, Inc.
jforrester@wikimedia.org | @jdforrester
Categories can be flat if people want them to be. Categories can be casual if people want them to be (I guess the red link discourages that, but that's trivial to change).
People seem to want lots of things. When it comes to the tag camp, other than non-crappy category intersection, we seem to already have the things people want, so I'm confused about why people are asking for them.
-bawolff
On 05/09/2013 03:07 PM, Brian Wolff wrote:
Just change mediawiki:pagecategories to "tags" and change some social conventions - boom you have tags.
Just a reminder - customs and assumptions are sometimes harder to change than digital or physical environments.
On 9 May 2013 19:21, Luke Welling WMF lwelling@wikimedia.org wrote:
With tags, a biography could relatively uncontroversially be tagged as "Novelist, Woman, Best Selling, American, Blonde Haired, Enjoys Spicy Food" even if nearly everybody agrees that half the tags while true are entirely unimportant and not relevant to the subject's area of notability. Whether some tags like race and appearance should exist at all may still generate debate, but if they are only ever available modifiers and not hard categories their offense would be softened.
It appears sir is less than entirely familiar with Wikipedia edit wars. There is NO dispute that will not lead to six months of wikidrama.
So yeah, whatever we use to add tags needs to be able to add citations as well, right there next to the tag.
- d.
On 9 May 2013 12:40, David Gerard dgerard@gmail.com wrote:
So yeah, whatever we use to add tags needs to be able to add citations as well, right there next to the tag.
Well, to me that's an even more compelling reason to create such a system (Categories as-is don't have them): "WikiTags let you cite the source of why the subject is tagged that way".
J.
On 05/09/2013 03:49 PM, James Forrester wrote:
On 9 May 2013 12:40, David Gerard dgerard@gmail.com wrote:
So yeah, whatever we use to add tags needs to be able to add citations as well, right there next to the tag.
Well, to me that's an even more compelling reason to create such a system (Categories as-is don't have them): "WikiTags let you cite the source of why the subject is tagged that way".
Or a reason to have all the real data in Wikidata, and have NewCategories be a way to query Wikidata. Wikidata already has (basic but improving) support for adding a citation to every statement.
But this (like switching to tags) would be a big change.
Matt Flaschen
On 09.05.2013 20:28, Brad Jorsch wrote:
On Wed, May 8, 2013 at 10:47 PM, James Forrester jforrester@wikimedia.org wrote:
- Pages are implicitly in the parent categories of their explicit categories
-> Pages in <Politicians from the Netherlands> are in <People from the Netherlands by profession> (its first parent) and <People from the Netherlands> (its first parent's parent) and <Politicians> (its second parent) and <People> (its second parent's parent) and …
-> Yes, this poses issues given the sometimes cyclic nature of categories' hierarchies, but this is relatively trivial to code around
Category cycles are the least of it. The fact that the existing category hierarchy isn't based on any sensible-for-inference ontology is a bigger problem.
Let's consider what would happen to one of my favorite examples on enwiki:
- The article for Romania is in <Black Sea countries>. Ok.
- And that category is in <Black Sea>, so Romania is in that too. Which is a little strange, but not too bad.
- And <Black Sea> is in <Seas of Russia> and <Landforms of Ukraine>.
Huh? Romania doesn't belong in either of those, despite that being equivalent to your example where pages in <Politicians from the Netherlands> also end up in <People> via <Politicians>.
There is probably nothing contradictory in your Black Sea category relation example, because "Seas of <country>" implies that <country> has *multiple* seas, while Romania has only *one* sea border (no offence, there are a lot of small countries, and a large country does not always mean a happy life). <Landforms of Ukraine> is a little weirder, but could be explained by the long and complex area of the Crimean peninsula. So, the categories actually are not so wrong.
Dmitriy
Dmitriy, yes, and that's the point - in practice Wikipedia categories are not transitive.
We've been working on this slowly at pl.wp lately, trying to split the categories into two kinds marked with appropriate templates: "topic" categories, including all related entries, e.g. Category:Water; and "object" ones, which would imply transitiveness, like the Politicians example provided above.
Topic cats could contain object ones, but not vice versa. This seems like a good compromise to me, but requires a good deal of work, and we're slowly progressing.
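The topic/object split can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how pl.wp or MediaWiki implements anything: the `PARENTS` map is toy data built from the examples in this thread, `implied_categories` is a made-up helper name, and the rule "a page reaches a topic category but infers nothing beyond it" is one possible reading of the scheme described above.

```python
from collections import deque

# Toy data (assumption): child category -> its parent categories,
# taken from the examples discussed in this thread.
PARENTS = {
    "Black Sea countries": ["Black Sea"],
    "Black Sea": ["Seas of Russia", "Landforms of Ukraine"],
    "Politicians from the Netherlands": [
        "People from the Netherlands by profession",
        "Politicians",
    ],
    "People from the Netherlands by profession": ["People from the Netherlands"],
    "People from the Netherlands": ["People"],
    "Politicians": ["People"],
}


def implied_categories(explicit, parents, topic=frozenset()):
    """Return the explicit categories plus every category inferred from them.

    Inference walks up the parent links. A category listed in `topic`
    is still reached, but nothing is inferred beyond it (hypothetical rule).
    The `seen` set also makes the walk safe on cyclic hierarchies.
    """
    seen = set(explicit)
    queue = deque(explicit)
    while queue:
        cat = queue.popleft()
        if cat in topic:
            continue  # topic categories are not transitive
        for parent in parents.get(cat, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Naive transitive inference: Romania leaks into <Seas of Russia>.
naive = implied_categories(["Black Sea countries"], PARENTS)

# With <Black Sea> marked as a topic category, the walk stops there.
guarded = implied_categories(
    ["Black Sea countries"], PARENTS, topic={"Black Sea"}
)
```

With the toy data, `naive` contains <Seas of Russia> and <Landforms of Ukraine>, while `guarded` stops at <Black Sea>. The Politicians example is unaffected because none of its ancestors are marked as topic categories.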
On Fri, May 10, 2013 at 1:43 AM, Dmitriy Sintsov questpc@rambler.ru wrote:
On 09.05.2013 20:28, Brad Jorsch wrote:
On Wed, May 8, 2013 at 10:47 PM, James Forrester jforrester@wikimedia.org wrote:
- Pages are implicitly in the parent categories of their explicit categories
-> Pages in <Politicians from the Netherlands> are in <People from the Netherlands by profession> (its first parent) and <People from the Netherlands> (its first parent's parent) and <Politicians> (its second parent) and <People> (its second parent's parent) and …
-> Yes, this poses issues given the sometimes cyclic nature of categories' hierarchies, but this is relatively trivial to code around
Category cycles are the least of it. The fact that the existing category hierarchy isn't based on any sensible-for-inference ontology is a bigger problem.
Let's consider what would happen to one of my favorite examples on enwiki:
- The article for Romania is in <Black Sea countries>. Ok.
- And that category is in <Black Sea>, so Romania is in that too. Which is a little strange, but not too bad.
- And <Black Sea> is in <Seas of Russia> and <Landforms of Ukraine>.
Huh? Romania doesn't belong in either of those, despite that being equivalent to your example where pages in <Politicians from the Netherlands> also end up in <People> via <Politicians>.
There is probably nothing contradictory in your Black Sea category relation example, because "Seas of <country>" implies that <country> has *multiple* seas, while Romania has only *one* sea border (no offence, there are a lot of small countries, and a large country does not always mean a happy life). <Landforms of Ukraine> is a little weirder, but could be explained by the long and complex area of the Crimean peninsula. So, the categories actually are not so wrong.
I think you misunderstood. The point was that the article on *Romania* would end up in <Seas of Russia> and <Landforms of Ukraine>.
OTOH, I missed the part of James's original proposal about creating the whole ontology using this inference system from scratch on Wikidata based on strict is-a relationships. So <Black Sea countries> wouldn't be in <Black Sea> in the Wikidata ontology, because countries aren't the sea.
I agree with most of the use cases, and I think they will be possible with Wikidata.
My suggestion would be to wait until the end of this year, and then see which of the use cases are still open. I think that by then we should have made all of them possible (apart from "Searches might be more than just intersections"; I am not sure about that one).
2013/5/9 James Forrester jforrester@wikimedia.org
On 8 May 2013 18:26, Sumana Harihareswara sumanah@wikimedia.org wrote:
Recently a lot of people have been talking about what's possible and what's necessary regarding MediaWiki, CatScan-like tools, and real category intersection; this mail has some pointers.
The long-term solution is a sparkly query for, e.g., people with aspects novelist + Singaporean, and it would be great if Wikidata could be the data-source. Generally people don't really want to search using hierarchical categories; they want tags and they want AND. But MediaWiki's current power users do use hierarchical labels, so any change would have to deal with current users' expectations. Also my head hurts just thinking of the "but my intuitively obvious ontology is better than yours" arguments.
To put a nice clear stake in the ground, a magic-world-of-loveliness sparkly proposal for 2015* might be:
- Categories are implemented in Wikidata
-> They're in whatever language the user wants (so fr:Chat and en:Cat and nl:kat and zh-han-t:貓 …)
-> They're properly queryable
-> They're shared between wikis (pooled expertise)
- Pages are implicitly in the parent categories of their explicit categories
-> Pages in <Politicians from the Netherlands> are in <People from the Netherlands by profession> (its first parent) and <People from the Netherlands> (its first parent's parent) and <Politicians> (its second parent) and <People> (its second parent's parent) and …
-> Yes, this poses issues given the sometimes cyclic nature of categories' hierarchies, but this is relatively trivial to code around
- Readers can search, querying across categories regardless of whether they're implicit or explicit
-> A search for the intersection of <People from the Netherlands> with <Politicians> will effectively return results for <Politicians from the Netherlands> (and the user doesn't need to know or care that this is an extant or non-extant category)
-> Searches might be more than just intersections, e.g. "<Painters from the United Kingdom> AND <Living people> NOT <Members of the Royal Academy>" or whatever.
-> Such queries might be cached (and, indeed, the intersections that people search for might be used to suggest new categorisation schemata that wikis had previously not considered - e.g. <British politicians> & <People with pet cats> & <People who died in hot-ballooning accidents>)
- Editors can tag articles with leaf or branch categories, potentially over-lapping, and the system will rationalise the categories on save to the minimally-spanning subset (or whatever is most useful for users, the database, and/or both)
-> Editors don't need to know the hierarchy of categories *a priori* when adding pages to them (yay, less difficulty)
-> Power editors don't need to type in loads of different categories if they have a very specific one in mind (yay, still flexible)
-> Categories shown to readers aren't necessarily the categories saved in the database, at editorial judgement (otherwise, would a page not be in just a single category, namely the intersection of all its tagged categories?)
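To make the proposal above a little more concrete, here is a rough Python sketch of three of its pieces: cycle-safe implicit membership, AND/NOT searches over it, and rationalising a page's tags on save. Everything here is an assumption for illustration: the `PARENTS` and `PAGES` toy data, the placeholder page titles, and the function names `closure`, `search` and `rationalise` are invented, and nothing in it reflects an actual MediaWiki or Wikidata API.

```python
# Toy data (assumption): child category -> parent categories.
PARENTS = {
    "Politicians from the Netherlands": [
        "People from the Netherlands by profession",
        "Politicians",
    ],
    "People from the Netherlands by profession": ["People from the Netherlands"],
    "People from the Netherlands": ["People"],
    "Politicians": ["People"],
}

# Toy data (assumption): placeholder page title -> explicitly tagged categories.
PAGES = {
    "Page A": ["Politicians from the Netherlands"],
    "Page B": ["People from the Netherlands"],
    "Page C": ["Politicians"],
}


def closure(explicit, parents):
    """All categories a page is in, explicit or implicit.

    The `seen` set keeps the walk finite even if the hierarchy has cycles.
    """
    seen, stack = set(explicit), list(explicit)
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def search(pages, parents, all_of=(), none_of=()):
    """Titles of pages in every `all_of` category and in no `none_of` category."""
    wanted, unwanted = set(all_of), set(none_of)
    hits = []
    for title, explicit in pages.items():
        cats = closure(explicit, parents)
        if wanted <= cats and not (unwanted & cats):
            hits.append(title)
    return sorted(hits)


def rationalise(explicit, parents):
    """Drop tags that are strict ancestors of another tag on the same page.

    Categories that imply each other (a cycle) are both kept.
    """
    explicit = set(explicit)
    implied_by = {cat: closure([cat], parents) for cat in explicit}
    return {
        cat
        for cat in explicit
        if not any(
            cat in implied_by[other] and other not in implied_by[cat]
            for other in explicit
            if other != cat
        )
    }


# Intersection of two broad categories finds the page tagged with the leaf.
dutch_politicians = search(
    PAGES, PARENTS, all_of=["People from the Netherlands", "Politicians"]
)

# AND combined with NOT.
non_politicians = search(PAGES, PARENTS, all_of=["People"], none_of=["Politicians"])

# Redundant branch tags collapse to the most specific one.
saved = rationalise(
    ["Politicians from the Netherlands", "Politicians", "People"], PARENTS
)
```

On the toy data, `dutch_politicians` is `["Page A"]`, `non_politicians` is `["Page B"]`, and `saved` keeps only <Politicians from the Netherlands>. Computing the closure per page on every search would obviously not scale; the caching idea in the list above is where a real design would have to do the work.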
Apart from the time and resources needed to make this happen and operational, does this sound like something we'd want to do? It feels like this, or something like it, would serve our editors and readers the best from their perspective, if not our sysadmins. :-)
[Snip]
I think the best place to pursue this topic is probably in https://meta.wikimedia.org/wiki/Talk:Beyond_categories . It's unlikely Wikimedia Foundation will be able to make engineers available to work on this anytime soon, but I would not be surprised if the Wikidata developer community or volunteers found this interesting enough to work on.
I guess I should post this there too, maybe once someone's told me if it's mad-cap. ;-)
J.
James D. Forrester Product Manager, VisualEditor Wikimedia Foundation, Inc.
jforrester@wikimedia.org | @jdforrester
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l