I'm probably not the only one who envisages all the wonderful things that could be done with this massive collection of information that is Wikipedia, *if only* we could do something clever with the categories. And then you realise that you can't really do anything clever because "category" has all sorts of different meanings to different people.
So far I have identified four rough types of categories. I'll invent the notion a(X) to mean that article X is in category a. a(b(X)) means that a is a subcategory of b, and X is in b.
Taxonomies: Tend to end in "s" and satisfy the rule that "If a(X) then X is an a" is a logical sentence. Tend to form strict hierarchies, where if a(X) and b(a), then it's perfectly natural and normal that b(a(X)). Eg, Bridges in France is a subcat of Bridges, and every entry in "Bridges in France" is definitely a Bridge. It's rare for an article to be in more than two taxonomic categories at once.
Themes: Tend not to be plurals, and tend not to form strict hierarchies. Often it is the case that b looks like it belongs in a, but then a(b(X)) is nonsense for certain X. Eg, Paris might be in European cities, and the film Amelie might be in Paris, but it's silly to say that Amelie is in European cities. (or many worse examples)
Attributes: The category exists to denote some very specific small detail of a subject, such that it would be conceivable to have dozens or more such categories on an article. Examples: 1943 deaths, Living persons, Winners of Nobel Peace Prize, etc. These tend to form hierarchies that start strict then end up fuzzy. Eg, 1943 deaths is only in "1943" and "1940s deaths", and these have parent categories of "1940s", "Years" and so forth, eventually ending up in "History", whereupon things become chaos.
Meta-attributes: These are categories about *articles* rather than article subjects. The most common examples are stubs ("France geography stubs"), sources ("1911 Encyclopaedia Britannica") and disputes of various kinds ("Articles lacking sources").
To me, these types of categories are all fairly incompatible, and really get in the way of using categories to do anything useful. It's pointless trying to draw tree structures when you have attributes and meta-attributes involved, for example.
So my questions are these:
* Can anyone think of other types of categories I might have missed?
* How could Wikipedia be better if this general problem was addressed?
* How could this problem be addressed?
Steve
On 6/3/06, Steve Bennett stevagewp@gmail.com wrote:
I'm probably not the only one who envisages all the wonderful things that could be done with this massive collection of information that is Wikipedia, *if only* we could do something clever with the categories. And then you realise that you can't really do anything clever because "category" has all sorts of different meanings to different people.
........
So my questions are these: *Can anyone think of other types of categories I might have missed? *How could Wikipedia be better if this general problem was addressed? *How could this problem be addressed?
Steve
I'd say you can't, and possibly shouldn't try to address this. Any more rigorous and useful system would entail vast, vast amounts of work which no one wants to do (compare with the remarkable, even blockbuster success of en's Persondata - and Persondata is much easier than devising and implementing a rigorous category system, since such systems are domain-specific and simply *hard* to do well), especially when there are much more pressing issues to worry about, like simple accuracy or decent writing.
~maru
I've thought a lot about categories and differing (and incompatible) uses from about a week after the categories were first introduced. I agree that categories could be very useful if used carefully.
To my mind there are only two main types of categories: taxonomies/attributes, and themes. I don't think the strict taxonomy category really exists (for instance, using your "bridges" example, the Golden Gate Bridge is in both [[Category:Bridges in California]] and [[Category:Bridges completed in 1937]]). The third type of category, which you label meta-attributes, is another type, I suppose, but I kind of ignore these types of categories.
I think the way to make these two main types of categories compatible is to enforce two rules: 1) taxonomies/attributes are always plural (and themes are never plural) 2) themes are never subcategories of taxonomies/attributes (but the reverse is allowed).
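Rule 2 is mechanical enough to check by script, provided someone has already labelled each category as a theme or a taxonomy/attribute. A rough sketch of such a check follows. The labels, the parent links and the "Films set in Paris" category are all made up for illustration; nothing here reads real Wikipedia data.

```python
# Hypothetical hand-made labels: every category is either an
# "attribute" (taxonomy/attribute) or a "theme".
KIND = {
    "Bridges": "attribute",
    "Bridges in France": "attribute",
    "European cities": "attribute",
    "Paris": "theme",
    "Films set in Paris": "attribute",
}

# Hypothetical links: child category -> list of parent categories.
PARENTS = {
    "Bridges in France": ["Bridges"],
    "Paris": ["European cities"],      # theme under attribute: breaks rule 2
    "Films set in Paris": ["Paris"],   # attribute under theme: allowed
}


def rule2_violations(kind, parents):
    """Return (child, parent) pairs where a theme sits under an attribute."""
    return sorted(
        (child, parent)
        for child, parent_list in parents.items()
        for parent in parent_list
        if kind[child] == "theme" and kind[parent] == "attribute"
    )


violations = rule2_violations(KIND, PARENTS)
print(violations)
```

The labelling is the expensive part; the check itself is a single pass over the parent links.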
Also, there is one exception to the taxonomy rule (mainly just because it's so commonplace): the article about the taxonomy itself can be within the taxonomy, but it must be specially tagged so that it is listed at the very beginning of the category. (for instance, [[Woman]] must be tagged [[Category:Women| ]] or [[Category:Women|*]]).
Also, I think all the taxonomy/attribute categories should be under a single parent category. So if you started at the top and only went down, you'd get all the attributes categories, and none of the themes categories. If I remember correctly, this even used to be the case.
Meta-attributes should eventually be replaced by a better meta-tagging system (or moved to the talk page). Someone was working on a sort of to-do management system a while ago. I'm not sure what became of it. In the mean time, keeping the meta tags under some [[Category:Meta-Wiki]] should be sufficient to keep them separate.
Anthony
_______________________________________________
WikiEN-l mailing list
WikiEN-l@Wikipedia.org
To unsubscribe from this mailing list, visit: http://mail.wikipedia.org/mailman/listinfo/wikien-l
On 6/3/06, Anthony DiPierro wikilegal@inbox.org wrote:
Also, there is one exception to the taxonomy rule (mainly just because it's so commonplace): the article about the taxonomy itself can be within the taxonomy, but it must be specially tagged so that it is listed at the very beginning of the category.
This personally really, really bugs me. I remove such cases when editing articles.
The article about the category should link to the category (often within 'See also'). The category description should link to the article.
-Matt
On 6/4/06, Matt Brown morven@gmail.com wrote:
This personally really, really bugs me. I remove such cases when editing articles.
The article about the category should link to the category (often within 'See also'). The category description should link to the article.
What you're saying makes sense, but does not currently have consensus. Implementing it therefore sounds like a bad idea bound to cause upset.
Steve
On 6/3/06, Steve Bennett stevagewp@gmail.com wrote:
What you're saying makes sense, but does not currently have consensus. Implementing it therefore sounds like a bad idea bound to cause upset.
I suspect that simply no real discussion of this has taken place yet.
-Matt
Matt Brown wrote:
On 6/3/06, Steve Bennett stevagewp@gmail.com wrote:
What you're saying makes sense, but does not currently have consensus. Implementing it therefore sounds like a bad idea bound to cause upset.
I suspect that simply no real discussion of this has taken place yet.
There's not much to discuss. Either you like recursive inclusions or you don't. I tend to favour them, but am not too diligent about remembering to add them.
Ec
On Sat, 03 Jun 2006 19:54:27 +0200, Steve Bennett wrote:
I'm probably not the only one who envisages all the wonderful things that could be done with this massive collection of information that is Wikipedia, *if only* we could do something clever with the categories. And then you realise that you can't really do anything clever because "category" has all sorts of different meanings to different people.
Agreed. Still: can you give some specific examples of wonderful things that could be done but are not possible now? That would tell us what problem you are trying to solve.
So far I have identified four rough types of categories. I'll invent the notion a(X) to mean that article X is in category a. a(b(X)) means that a is a subcategory of b, and X is in b.
ITYM "b is a subcategory of a".
Taxonomies: Tend to end in "s" and satisfy the rule that "If a(X) then X is an a") is a logical sentence. Tend to form strict hierarchies, where if a(X) and b(a), then it's perfectly natural and normal that b(a(X)). Eg, Bridges in France is a subcat of Bridges, and every entry in "bridges in France" is definitely a Bridge. It's rare for an article to be in more than two taxonomic categories at once.
"Bridges in France" may not be the best example. "Bridges in France" is just an intersection of two attributes ("in France", "Bridges"), and their relative position in a hierarchy is undefined. Hence more than one hierarchy: You can drill down "France" ... "Buildings and structures in France" or "Bridges" ... "Bridges by country".
Compare with taxons in the classification of species: an actual hierarchy, and only one path from the top down to any species -- there you are dividing into subsets (and intersections make no sense).
Categories based on such intersections of attributes are conceptually bad. Look at the categories for an article like [[Marie Curie]]: She's French three times, female four times, Polish four times (not counting "Natives of Warsaw"), etc. Why not create [[Category:Polish women who were born in 1867 and died in 1934 and won a Nobel Prize in Chemistry and in Physics]]?
If we don't have a term for (or an article about) it, there probably shouldn't be a category for it, either (I'm sure a determined mind could come up with an exception).
Themes: Tend not to be plurals, and tend not to form strict hierarchies. Often it is the case that b looks like it belongs in a, but then a(b(X)) is nonsense for certain X. Eg, Paris might be in European cities, and the film Amelie might be in Paris, but it's silly to say that Amelie is in European cities. (or many worse examples)
Well yes, Amelie _is_ related to European cities. It is relevant for a list of movies that are set in European cities. The real problem is that the initial relation is entirely unqualified: Amelie is neither a part nor a member of Paris.
You could conceivably create a category "set in Paris" for the film and have that be a subcategory of "set in European cities". Problem is, you need to propagate that modifier backwards all the way to the top or you will have the same situation you described.
The best solution I've seen is qualifying relations (something like the [[Semantic MediaWiki]]). For instance: Amelie is set in [[set in::Paris]].
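To make the idea of qualified relations concrete, here is a small sketch using plain (subject, relation, object) triples. It does not reproduce Semantic MediaWiki's actual storage or query syntax; the relation names "set in" and "is a" and the sample facts are assumptions picked to match the Amelie example.

```python
from collections import defaultdict

# Hypothetical facts as (subject, relation, object) triples.
TRIPLES = [
    ("Amelie", "set in", "Paris"),
    ("Amelie", "is a", "Film"),
    ("Paris", "is a", "European city"),
    ("Pont du Gard", "is a", "Bridge"),
]


def build_index(triples):
    """Map (relation, object) to the set of subjects that have it."""
    index = defaultdict(set)
    for subj, rel, obj in triples:
        index[(rel, obj)].add(subj)
    return index


def subjects(index, rel, obj):
    """Look up subjects without creating new keys in the index."""
    return index.get((rel, obj), set())


index = build_index(TRIPLES)

# "is a" answers membership only: Amelie is not a European city.
european_cities = subjects(index, "is a", "European city")

# Films set in any European city: follow "set in" explicitly.
set_in_european_cities = set()
for city in european_cities:
    set_in_european_cities |= subjects(index, "set in", city)

print(sorted(european_cities))
print(sorted(set_in_european_cities))
```

Because each link carries its own name, the film can be found through Paris without ever being mistaken for a city.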
Attributes: The category exists to denote some very specific small detail of a subject, such that it would be conceivable to have dozens or more such categories on an article. Examples: 1943 deaths, Living persons, Winners of Nobel Peace Prize, etc. These tend to hierarchies that start strict then end up fuzzy. Eg, 1943 deaths is only in 1943 and "1940s deaths", and these have parent categories of "1940s","Years" and so forth, eventually ending up in "History", whereupon things become chaos.
There is no way to make hierarchies not suck, especially if you have to maintain them manually (as we do now). Don't try to impose hierarchies unless they emerge quite naturally from the subject.
Meta-attributes: These are categories about *articles* rather than article subjects. The most common examples are stubs ("France geography stubs"), sources ("1911 Encyclopaedia Britannica") and disputes of various kinds ("Articles lacking sources").
Actually, "France geography stubs" contains two attributes (France, geography). Only the "stub" part is not about the subject. But yeah, it's a problem.
Another one that you didn't mention is articles that merge several concepts into one: This happens for instance if a biography is merged with the thing that made the person notable. You get articles that are in people and object categories at the same time (e.g. programmers, software).
To me, these types of categories are all fairly incompatible, and really get in the way of using categories to do anything useful. It's pointless trying to draw tree structures when you have attributes and meta-attributes involved, for example.
So the problem you are trying to solve is drawing tree structures? I'm afraid your problem may not be shortcomings in WP, but the real world.
So my questions are these: *Can anyone think of other types of categories I might have missed?
Basically, you have identified:
1) is an intersection of [Bridges in France / in France & Bridges]
2) is a subset of [Bridges in France / Bridges]
3) is a member of [Paris / European Cities] and all your attribute examples
4) is related to (or more specifically: is set in) [Amelie (movie) / Paris]
5) information about the article
1) can be computed and shouldn't exist as categories. I'm not sure whether we care about the difference between 2) and 3). 5) you can quite easily deal with using namespaces (depending on the problem, of course). The meat is in 4): You can add any number of named relations there, and most of the current ugliness is there.
*How could Wikipedia be better if this general problem was addressed?
What was the problem again?
Anyhow, I guess my main point is that hierarchies are overrated. They are most useful when you don't have a computer to sort things out for you.
Roger
On 6/3/06, Roger Luethi collector@hellgate.ch wrote:
On Sat, 03 Jun 2006 19:54:27 +0200, Steve Bennett wrote:
I'm probably not the only one who envisages all the wonderful things that could be done with this massive collection of information that is Wikipedia, *if only* we could do something clever with the categories. And then you realise that you can't really do anything clever because "category" has all sorts of different meanings to different people.
Agreed. Still: can you give some specific examples of wonderful things that could be done but are not possible now? That would tell us what problem you are trying to solve.
I've personally run into this when trying to automatically create, for example, a list of all Wikipedia articles on people. You can't just start at [[Category:People]] and work your way down, because you wind up going to [[Category:Women]] (fine, all women are people) then [[Category:Feminine hygiene]] (bad).
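Roughly what that looks like in code, as a sketch only: the tiny category graph below is invented (the "Women scientists" category and the article names are placeholders), but the traversal is the naive "start at the top and work down" approach, and it shows the leak.

```python
# Hypothetical miniature graph: category -> subcategories.
SUBCATS = {
    "People": ["Women"],
    "Women": ["Feminine hygiene", "Women scientists"],
    "Feminine hygiene": [],
    "Women scientists": [],
}

# Hypothetical direct members: category -> articles.
ARTICLES = {
    "People": [],
    "Women": [],
    "Feminine hygiene": ["Tampon"],
    "Women scientists": ["Marie Curie"],
}


def descend(root, subcats, articles):
    """Collect every article reachable from root, guarding against cycles."""
    seen, found, stack = set(), set(), [root]
    while stack:
        cat = stack.pop()
        if cat in seen:
            continue
        seen.add(cat)
        found.update(articles.get(cat, []))
        stack.extend(subcats.get(cat, []))
    return found


people = descend("People", SUBCATS, ARTICLES)
print(sorted(people))
```

The result contains a biography and a product article side by side, because the walk cannot tell an "is a" link from a "related to" link.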
Categories based on such intersections of attributes are conceptually bad. Look at the categories for an article like [[Marie Curie]]: She's French three times, female four times, Polish four times (not counting "Natives of Warsaw"), etc. Why not create [[Category:Polish women who were born in 1867 and died in 1934 and won a Nobel Prize in Chemistry and in Physics]]?
Because there would only be one person in that category.
If we don't have a term for (or an article about) it, there probably shouldn't be a category for it, either (I'm sure a determined mind could come up with an exception).
If the category system could effectively build these intersection categories on the fly, I'd agree. But the category system can't currently do that. (And it's been around a reasonably long time, with that as an obvious flaw, and no one has fixed it.)
Attributes: The category exists to denote some very specific small detail of a subject, such that it would be conceivable to have dozens or more such categories on an article. Examples: 1943 deaths, Living persons, Winners of Nobel Peace Prize, etc. These tend to hierarchies that start strict then end up fuzzy. Eg, 1943 deaths is only in 1943 and "1940s deaths", and these have parent categories of "1940s","Years" and so forth, eventually ending up in "History", whereupon things become chaos.
There is no way to make hierarchies not suck, especially if you have to maintain them manually (as we do now). Don't try to impose hierarchies unless they emerge quite naturally from the subject.
I made a proposal. All subcategories of attributes must be a subset of the parent attribute. Seems like a perfectly reasonable way to make hierarchies not suck.
Anthony
On Sat, 03 Jun 2006 17:27:59 -0400, Anthony DiPierro wrote:
Agreed. Still: can you give some specific examples of wonderful things that could be done but are not possible now? That would tell us what problem you are trying to solve.
I've personally run into this when trying to automatically create, for example, a list of all Wikipedia articles on people. You can't just start at [[Category:People]] and work your way down, because you wind up going to [[Category:Women]] (fine, all women are people) then [[Category:Feminine hygiene]] (bad).
Okay, that's equivalent to Steve's Amelie-Paris relation. I agree that's a problem.
Categories based on such intersections of attributes are conceptually bad. Look at the categories for an article like [[Marie Curie]]: She's French three times, female four times, Polish four times (not counting "Natives of Warsaw"), etc. Why not create [[Category:Polish women who were born in 1867 and died in 1934 and won a Nobel Prize in Chemistry and in Physics]]?
Because there would only be one person in that category.
That's why nobody made it, but not why it shouldn't be done.
It would be nigh impossible to do well because once we start combining attributes to create new categories, we are looking at maintaining links between articles and an exploding number of subcategories.
But even if we maintained a complete and up-to-date system of subcats, we'd still make it hard for people to find articles using categories. For some fairly sensible reasons, the rule is to include articles only in the subcategory, not in the parent. There is no way to list articles based on a subset of criteria (the articles in subcategories are effectively hidden on separate pages, which is only helpful if you know which one to pick).
If the category system could effectively build these intersection categories on the fly, I'd agree. But the category system can't currently do that. (And it's been around a reasonably long time, with that as an obvious flaw, and no one has fixed it.)
You are right, we can't effectively build these intersection categories on the fly at the moment, but we _could_ automatically create or update such intersection categories if the categories weren't the mess that Steve and you describe. Kind of like the search index.
Attributes: The category exists to denote some very specific small detail of a subject, such that it would be conceivable to have dozens or more such categories on an article. Examples: 1943 deaths, Living persons, Winners of Nobel Peace Prize, etc. These tend to hierarchies that start strict then end up fuzzy. Eg, 1943 deaths is only in 1943 and "1940s deaths", and these have parent categories of "1940s","Years" and so forth, eventually ending up in "History", whereupon things become chaos.
There is no way to make hierarchies not suck, especially if you have to maintain them manually (as we do now). Don't try to impose hierarchies unless they emerge quite naturally from the subject.
I made a proposal. All subcategories of attributes must be a subset of the parent attribute. Seems like a perfectly reasonable way to make hierarchies not suck.
The devil is in the details.
For instance, how do you connect the districts of Paris to the category Paris? What is a subset of the parent attribute "Paris": "Districts of Paris", or "Quartier Latin", or neither? Does it bother you if the article on a French district is now in a subcategory of "Capitals in Europe"?
Or going back to [[Category:Women]]: You could declare that only articles on instances of women (i.e. biographies) can ever be under that category, and that only sets of such articles can ever be subcategories of the category women. -- You could even create a separate [[Category:Woman]], subcategories like "female reproductive organs" containing articles like uterus. -- But how would you express the undisputed relationship between female human beings and your example [[Category:Feminine hygiene]]? How about [[Category:Women's rights]]? Add an umbrella cat "Somehow related to women" maybe?
Roger
On 6/4/06, Roger Luethi collector@hellgate.ch wrote:
On Sat, 03 Jun 2006 17:27:59 -0400, Anthony DiPierro wrote:
Categories based on such intersections of attributes are conceptually bad. Look at the categories for an article like [[Marie Curie]]: She's French three times, female four times, Polish four times (not counting "Natives of Warsaw"), etc. Why not create [[Category:Polish women who were born in 1867 and died in 1934 and won a Nobel Prize in Chemistry and in Physics]]?
Because there would only be one person in that category.
That's why nobody made it, but not why it shouldn't be done.
I'd say it's both. There shouldn't be categories with only one article in them. IMO that's just common sense.
In the current system categories should have a fair number of articles in them. If there are too many, they should be broken up. If there are too few, they should be combined. There isn't a crystal clear line what constitutes too many and what constitutes too few, but a category with only one article in it clearly has too few.
The problem of categories having too many articles in them wouldn't really be a problem if the software allowed you to automatically compute category intersections. But the software doesn't do this, so people make do with what they've got.
It would be nigh impossible to do well because once we start combining attributes to create new categories, we are looking at maintaining links between articles and an exploding number of subcategories.
But even if we maintained a complete and up-to-date system of subcats, we'd still make it hard for people to find articles using categories. For some fairly sensible reasons, the rule is to include articles only to the subcategory, but not to the parent. There is no way to list articles based on a subset of criteria (the articles in subcategories are effectively hidden on separate pages which is only helpful if you know which one to pick).
If the category system could effectively build these intersection categories on the fly, I'd agree. But the category system can't currently do that. (And it's been around a reasonably long time, with that as an obvious flaw, and no one has fixed it.)
You are right, we can't effectively build these intersection categories on the fly at the moment, but we _could_ automatically create or update such intersection categories if the categories weren't the mess that Steve and you describe. Kind of like the search index.
You're right. And that's what my simple rule that "All subcategories of attributes must be a subset of the parent attribute" is meant to address. If that were the case, it would be possible to automatically recursively descend a parent category to find *all* the articles to which it applies. And then computing the intersection of any two parent categories would be possible. I actually had software which did this, but it doesn't work right because the subcategory rule isn't being followed.
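For what it's worth, the idea fits in a few lines. This is a sketch under the assumption that the subset rule holds for every link in the graph; it is not the software I mentioned, and the data is a made-up fragment.

```python
# Hypothetical fragment: category -> subcategories (subset rule assumed to hold).
SUBCATS = {
    "Bridges": ["Bridges in France"],
    "Buildings and structures in France": ["Bridges in France"],
    "Bridges in France": [],
}

# Hypothetical direct members: category -> articles.
ARTICLES = {
    "Bridges": ["Golden Gate Bridge"],
    "Buildings and structures in France": ["Eiffel Tower"],
    "Bridges in France": ["Pont du Gard"],
}


def all_articles(cat, subcats, articles, _seen=None):
    """Recursively gather the articles in cat and in all its subcategories."""
    seen = set() if _seen is None else _seen
    if cat in seen:          # already visited on this walk (cycle or diamond)
        return set()
    seen.add(cat)
    result = set(articles.get(cat, []))
    for sub in subcats.get(cat, []):
        result |= all_articles(sub, subcats, articles, seen)
    return result


def intersection(cat_a, cat_b, subcats, articles):
    """Articles that fall under both parent categories."""
    return all_articles(cat_a, subcats, articles) & all_articles(cat_b, subcats, articles)


french_bridges = intersection(
    "Bridges", "Buildings and structures in France", SUBCATS, ARTICLES
)
print(sorted(french_bridges))
```

As soon as one subcategory is a theme rather than a true subset, `all_articles` over-collects and every intersection built on it is wrong, which is exactly the failure described above.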
Once the software is written to compute intersections of categories within the Mediawiki software, it would be relatively simple to recategorize the articles into their parent categories, such that no information was lost. The way this would be done is that all articles in a subcategory which had multiple parent attribute categories would be automatically moved into the parent categories. This would be repeated until no such situations continued to exist. The ad-hoc structure could still be kept, but it could be calculated on the fly (along with new types of intersections which could be easily added).
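A sketch of that recategorization step, under two assumptions that would need checking against the real data: the graph of attribute categories has no cycles, and every category has already been marked as attribute or not. The function name and the data are invented for the example.

```python
def flatten(parents, members):
    """Move articles out of multi-parent attribute categories into the parents.

    parents: category -> set of parent attribute categories
    members: category -> set of articles directly in that category
    Returns a new members mapping; assumes the parent graph is acyclic.
    """
    members = {cat: set(arts) for cat, arts in members.items()}
    changed = True
    while changed:                      # repeat until nothing is left to move
        changed = False
        for cat, parent_set in parents.items():
            if len(parent_set) > 1 and members.get(cat):
                for parent in parent_set:
                    members.setdefault(parent, set()).update(members[cat])
                members[cat] = set()
                changed = True
    return members


# Hypothetical fragment of the category graph.
PARENTS = {
    "Bridges in France": {"Bridges", "Buildings and structures in France"},
}
MEMBERS = {"Bridges in France": {"Pont du Gard"}}

flat = flatten(PARENTS, MEMBERS)
print(flat)
```

After the pass, "Bridges in France" is empty and can be recomputed on demand as the intersection of its two parents.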
(Now that I do this on an example, I see that this algorithm would probably have to be tweaked to deal with subcategories of [[Category:Categories by topic]], but that's not too bad.)
Attributes: The category exists to denote some very specific small detail of a subject, such that it would be conceivable to have dozens or more such categories on an article. Examples: 1943 deaths, Living persons, Winners of Nobel Peace Prize, etc. These tend to hierarchies that start strict then end up fuzzy. Eg, 1943 deaths is only in 1943 and "1940s deaths", and these have parent categories of "1940s","Years" and so forth, eventually ending up in "History", whereupon things become chaos.
There is no way to make hierarchies not suck, especially if you have to maintain them manually (as we do now). Don't try to impose hierarchies unless they emerge quite naturally from the subject.
I made a proposal. All subcategories of attributes must be a subset of the parent attribute. Seems like a perfectly reasonable way to make hierarchies not suck.
The devil is in the details.
For instance, how do you connect the districts of Paris to the category Paris? What is a subset of the parent attribute "Paris": "Districts of Paris", or "Quartier Latin", or neither? Does it bother you if the article on a French district is now in a subcategory of "Capitals in Europe"?
[[Category:Paris]] is a theme, not an attribute, so [[Category:Paris]] should not be a subcategory of [[Category:Capitals in Europe]].
Or going back to [[Category:Women]]: You could declare that only articles on instances of women (i.e. biographies) can ever be under that category, and that only sets of such articles can ever be subcategories of the category women. -- You could even create a separate [[Category:Woman]], subcategories like "female reproductive organs" containing articles like uterus. -- But how would you express the undisputed relationship between female human beings and your example [[Category:Feminine hygiene]]? How about [[Category:Women's rights]]? Add an umbrella cat "Somehow related to women" maybe?
Roger
[[Category:Women]] could be a subcategory of [[Category:Woman]]. Making an attribute a subcategory of a theme is allowed, it is the reverse that is not allowed.
In any event, things wouldn't be perfect. Ultimately the best solution would involve fixing the category system itself, a process which should be approached carefully so as to avoid making the same mistakes all over again. The advantage of my proposal to not allow themes as subcategories of attributes is that it can be implemented today, without much disruption, and without modifying any code. Plus, it allows for a relatively straightforward upgrade path when the category system is fixed. The proposal itself is not the fix, it's a temporary workaround.
As an alternative, it would probably be possible to do all of this even without enforcing the subcategory rule. But all purely attribute categories would have to be identified as such. I'll have to think about that.
Anthony
On Sun, 04 Jun 2006 11:31:02 -0400, Anthony DiPierro wrote:
That's why nobody made it, but not why it shouldn't be done.
I'd say it's both. There shouldn't be categories with only one article in them. IMO that's just common sense.
Yes, for all (current) practical purposes. My point was that there are other reasons that should stop people from creating such categories whether or not there is more than one article in it.
compute category intersections. But the software doesn't do this, so people make do with what they've got.
Right. When making suggestions, we need to be careful to indicate whether we propose a work-around for broken software or if we are talking about an ideal solution, or something in between. Plus the problem we are trying to solve. I know I failed already in this short thread :-).
Once the software is written to compute intersections of categories within the Mediawiki software, it would be relatively simple to recategorize the articles into their parent categories, such that no information was lost. The way this would be done is that all articles in a subcategory which had multiple parent attribute categories would be automatically moved into the parent categories. This would be repeated until no such situations continued to exist. The ad-hoc structure could still be kept, but it could be calculated on the fly (along with new types of intersections which could be easily added).
So instead of in the category "Bridges in France", the "Pont du Gard" would now be in the categories "Bridges" and "France"!?
(Now that I do this on an example, I see that this algorithm would probably have to be tweaked to deal with subcategories of [[Category:Categories by topic]], but that's not too bad.)
I suspect there are more corner cases than we imagine.
For instance, how do you connect the districts of Paris to the category Paris? What is a subset of the parent attribute "Paris": "Districts of Paris", or "Quartier Latin", or neither? Does it bother you if the article on a French district is now in a subcategory of "Capitals in Europe"?
[[Category:Paris]] is a theme, not an attribute, so [[Category:Paris]] should not be a subcategory of [[Category:Capitals in Europe]].
One problem I'm seeing here is that your proposal focuses on one single type of hierarchies. Paris, France, Europe are now all themes. So if I ask for an intersection of categories "Bridges" and "France", the results include bridges that for any reason are related to the "France" theme. There is no way to ask for Bridges in France, or Paris.
One solution would be to create "is part of" or "is in" categories in addition to the "is a" categories. Then you can have a hierarchy with "Pont du Gard" -> "France" -> "Europe", or "cylinder" -> "engine" -> "car".
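A sketch of how two separate hierarchies could answer the "Bridges in France" question: each relation is walked on its own, and only the results are combined. The relation tables below are a small invented sample, and the names `IS_A`, `IS_IN` and `query` are just labels for this example.

```python
# Hypothetical "is a" links: item -> broader classes.
IS_A = {
    "Pont du Gard": ["Bridges"],
    "Golden Gate Bridge": ["Bridges"],
    "Louvre": ["Museums"],
}

# Hypothetical "is in" links: item -> containing place.
IS_IN = {
    "Pont du Gard": ["France"],
    "Louvre": ["Paris"],
    "Paris": ["France"],
    "France": ["Europe"],
    "Golden Gate Bridge": ["California"],
}


def ancestors(item, relation):
    """Everything reachable from item by repeatedly following one relation."""
    found, stack = set(), list(relation.get(item, []))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(relation.get(node, []))
    return found


def query(kind, place):
    """Items that are a `kind` and are located, directly or not, in `place`."""
    items = set(IS_A) | set(IS_IN)
    return sorted(
        item for item in items
        if kind in ancestors(item, IS_A) and place in ancestors(item, IS_IN)
    )


bridges_in_france = query("Bridges", "France")
museums_in_europe = query("Museums", "Europe")
print(bridges_in_france, museums_in_europe)
```

Since "is in" is only ever followed from place to place, a museum in Paris is found under Europe without Paris having to be a subcategory of anything in the "is a" hierarchy.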
We'd have two independent types of categories, plus a third, the catch-all "theme".
female human beings and your example [[Category:Feminine hygiene]]? How about [[Category:Women's rights]]? Add an umbrella cat "Somehow related to women" maybe?
[[Category:Women]] could be a subcategory of [[Category:Woman]].
Heh. Looking at [[Category:Women]], it's now a subcat of both [[Category:People]] and [[Category:Humans]].
mistakes all over again. The advantage of my proposal to not allow themes as subcategories of attributes is that it can be implemented today, without much disruption, and without modifying any code.
What I missed in your proposal is how you retain existing theme information. Duplicate categories in a theme namespace?
Roger
On 6/4/06, Roger Luethi collector@hellgate.ch wrote:
On Sun, 04 Jun 2006 11:31:02 -0400, Anthony DiPierro wrote:
Once the software is written to compute intersections of categories within the Mediawiki software, it would be relatively simple to recategorize the articles into their parent categories, such that no information was lost. The way this would be done is that all articles in a subcategory which had multiple parent attribute categories would be automatically moved into the parent categories. This would be repeated until no such situations continued to exist. The ad-hoc structure could still be kept, but it could be calculated on the fly (along with new types of intersections which could be easily added).
So instead of in the category "Bridges in France", the "Pont du Gard" would now be in the categories "Bridges" and "France"!?
No, it'd be in "Bridges" and "Buildings and structures in France".
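A minimal sketch of the recategorization step described above, in Python. The category names come from this thread; the data layout, the function names `flatten` and `intersection`, and the two extra articles are assumptions made for illustration. It does not handle the [[Category:Categories by topic]] tweak, and the intersection only checks direct membership rather than descending into subcategories.

```python
from typing import Dict, List, Set

# Hypothetical toy data: each category maps to its parent *attribute* categories.
ATTRIBUTE_PARENTS: Dict[str, List[str]] = {
    "Bridges in France": ["Bridges", "Buildings and structures in France"],
}

# Hypothetical toy data: each article maps to the categories it is filed under.
ARTICLE_CATS: Dict[str, Set[str]] = {
    "Pont du Gard": {"Bridges in France"},
    "Golden Gate Bridge": {"Bridges"},
    "Eiffel Tower": {"Buildings and structures in France"},
}


def flatten(article_cats: Dict[str, Set[str]],
            attribute_parents: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Move articles out of any category that has two or more attribute
    parents and into those parents, repeating until none are left."""
    flat: Dict[str, Set[str]] = {}
    for article, cats in article_cats.items():
        todo = list(cats)
        seen: Set[str] = set()   # guards against cycles in the category graph
        kept: Set[str] = set()
        while todo:
            cat = todo.pop()
            if cat in seen:
                continue
            seen.add(cat)
            parents = attribute_parents.get(cat, [])
            if len(parents) >= 2:
                todo.extend(parents)  # intersection category: move up
            else:
                kept.add(cat)
        flat[article] = kept
    return flat


def intersection(flat: Dict[str, Set[str]], *cats: str) -> List[str]:
    """Compute an intersection on the fly: articles filed in all given categories."""
    wanted = set(cats)
    return sorted(a for a, cs in flat.items() if wanted <= cs)


flat = flatten(ARTICLE_CATS, ATTRIBUTE_PARENTS)
bridges_in_france = intersection(
    flat, "Bridges", "Buildings and structures in France")
```

With this toy data, "Pont du Gard" ends up in "Bridges" and "Buildings and structures in France", and the old "Bridges in France" listing can be recovered as the intersection of the two.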
(Now that I do this on an example, I see that this algorithm would probably have to be tweaked to deal with subcategories of [[Category:Categories by topic]], but that's not too bad.)
I suspect there are more corner cases than we imagine.
I'm not sure, but I have downloaded the latest db to start playing with this.
For instance, how do you connect the districts of Paris to the category Paris? What is a subset of the parent attribute "Paris": "Districts of Paris", or "Quartier Latin", or neither? Does it bother you if the article on a French district is now in a subcategory of "Capitals in Europe"?
[[Category:Paris]] is a theme, not an attribute, so [[Category:Paris]] should not be a subcategory of [[Category:Capitals in Europe]].
One problem I'm seeing here is that your proposal focuses on a single type of hierarchy. Paris, France, Europe are now all themes. So if I ask for an intersection of categories "Bridges" and "France", the results include bridges that for any reason are related to the "France" theme. There is no way to ask for Bridges in France, or Paris.
We could always make "Buildings and structures in France" a subcategory of "Stuff in France". But you're right, it's not quite as clean as I had expected.
I wonder though if this is just another aspect of the [[Category:Categories by topic]] issue.
One solution would be to create "is part of" or "is in" categories in addition to the "is a" categories. Then you can have a hierarchy with "Pont du Gard" -> "France" -> "Europe", or "cylinder" -> "engine" -> "car".
We'd have two independent types of categories, plus a third, the catch-all "theme".
female human beings and your example [[Category:Feminine hygiene]]? How about [[Category:Women's rights]]? Add an umbrella cat "Somehow related to women" maybe?
[[Category:Women]] could be a subcategory of [[Category:Woman]].
Heh. Looking at [[Category:Women]], it's now a subcat of both [[Category:People]] and [[Category:Humans]].
The latter should probably be [[Category:Humanity]] :).
Here's another fun fact. [[Category:Humans]] is a subcat of [[Category:People]]. Ick.
And now that I look at it, [[Category:Humans]] is also a subcat of [[Category:Apes]] (presumably a theme). 'Course that wouldn't have interested me if I was acquainted with the terminology that "humans" are "apes" :).
OK, this plural/singular thing probably isn't going to work.
mistakes all over again. The advantage of my proposal to not allow themes as subcategories of attributes is that it can be implemented today, without much disruption, and without modifying any code.
What I missed in your proposal is how you retain existing theme information. Duplicate categories in a theme namespace?
Roger
Yeah. So if [[Category:Humans]] was a theme and an attribute, we'd split it into [[Category:Humans]] (the attribute) and [[Category:Humanity]] (the theme), or something.
I suppose the plural/singular thing is still useful as a rough guideline. If things are kept separated it doesn't have to be followed strictly.
Anyway, I'm starting to think things are too far gone by now. Might as well just wait for the features first and then start reorganizing things. But I've got a little bit of hope, and my import script of the downloaded data is still running.
Anthony
Anthony DiPierro wrote:
The latter should probably be [[Category:Humanity]] :).
Here's another fun fact. [[Category:Humans]] is a subcat of [[Category:People]]. Ick.
[[Category:Klingons]] could be another sub-category. :-)
And now that I look at it, [[Category:Humans]] is also a subcat of [[Category:Apes]] (presumably a theme). 'Course that wouldn't have interested me if I was acquainted with the terminology that "humans" are "apes" :).
It's taxonomic among zoölogists.
OK, this plural/singular thing probably isn't going to work.
Not in the general case, but there are still places where it can be useful. "Science" could be about science generally; "sciences" would list the various branches.
I suppose the plural/singular thing is still useful as a rough guideline. If things are kept separated it doesn't have to be followed strictly.
Anyway, I'm starting to think things are too far gone by now. Might as well just wait for the features first and then start reorganizing things. But I've got a little bit of hope, and my import script of the downloaded data is still running.
Reorganization can keep happening, but not on any massive scale.
Ec
Roger Luethi wrote:
On Sun, 04 Jun 2006 11:31:02 -0400, Anthony DiPierro wrote:
Once the software is written to compute intersections of categories within the Mediawiki software, it would be relatively simple to recategorize the articles into their parent categories, such that no information was lost. The way this would be done is that all articles in a subcategory which had multiple parent attribute categories would be automatically moved into the parent categories. This would be repeated until no such situations continued to exist. The ad-hoc structure could still be kept, but it could be calculated on the fly (along with new types of intersections which could be easily added).
So instead of in the category "Bridges in France", the "Pont du Gard" would now be in the categories "Bridges" and "France"!?
There's no benefit from this unless we have a good search system.
(Now that I do this on an example, I see that this algorithm would probably have to be tweaked to deal with subcategories of [[Category:Categories by topic]], but that's not too bad.)
I suspect there are more corner cases than we imagine.
Absolutely. See http://en.wiktionary.org/wiki/Wiktionary:Semantic_relations , and this could be taken even further.
For instance, how do you connect the districts of Paris to the category Paris? What is a subset of the parent attribute "Paris": "Districts of Paris", or "Quartier Latin", or neither? Does it bother you if the article on a French district is now in a subcategory of "Capitals in Europe"?
[[Category:Paris]] is a theme, not an attribute, so [[Category:Paris]] should not be a subcategory of [[Category:Capitals in Europe]].
One problem I'm seeing here is that your proposal focuses on a single type of hierarchy. Paris, France, Europe are now all themes. So if I ask for an intersection of categories "Bridges" and "France", the results include bridges that for any reason are related to the "France" theme. There is no way to ask for Bridges in France, or Paris.
One solution would be to create "is part of" or "is in" categories in addition to the "is a" categories. Then you can have a hierarchy with "Pont du Gard" -> "France" -> "Europe", or "cylinder" -> "engine" -> "car".
We'd have two independent types of categories, plus a third, the catch-all "theme".
That still doesn't disambiguate "cylinder" -> "geometric solid", or "engine" -> "aircraft", or "engine" -> "railway"
Ec
Anthony DiPierro wrote:
On 6/4/06, Roger Luethi collector@hellgate.ch wrote:
On Sat, 03 Jun 2006 17:27:59 -0400, Anthony DiPierro wrote:
Categories based on such intersections of attributes are conceptually bad. Look at the categories for an article like [[Marie Curie]]: She's French three times, female four times, Polish four times (not counting "Natives of Warsaw"), etc. Why not create [[Category:Polish women who were born in 1867 and died in 1934 and won a Nobel Prize in Chemistry and in Physics]]?
Because there would only be one person in that category.
That's why nobody made it, but not why it shouldn't be done.
I'd say it's both. There shouldn't be categories with only one article in them. IMO that's just common sense.
They should be avoided, but I would not proscribe them. If you are sub-categorizing mammals you still need to deal with the ones that are so different (like the platypus) that they will end up in a one article category.
In the current system categories should have a fair number of articles in them. If there are too many, they should be broken up. If there are too few, they should be combined. There isn't a crystal clear line what constitutes too many and what constitutes too few, but a category with only one article in it clearly has too few.
The problem of categories having too many articles in them wouldn't really be a problem if the software allowed you to automatically compute category intersections. But the software doesn't do this, so people make do with what they've got.
Exactly
For instance, how do you connect the districts of Paris to the category Paris? What is a subset of the parent attribute "Paris": "Districts of Paris", or "Quartier Latin", or neither? Does it bother you if the article on a French district is now in a subcategory of "Capitals in Europe"?
[[Category:Paris]] is a theme, not an attribute, so [[Category:Paris]] should not be a subcategory of [[Category:Capitals in Europe]].
Is it practical to have people debating whether something is a theme or an attribute?
Or going back to [[Category:Women]]: You could declare that only articles on instances of women (i.e. biographies) can ever be under that category, and that only sets of such articles can ever be subcategories of the category women. -- You could even create a separate [[Category:Woman]], subcategories like "female reproductive organs" containing articles like uterus. -- But how would you express the undisputed relationship between female human beings and your example [[Category:Feminine hygiene]]? How about [[Category:Women's rights]]? Add an umbrella cat "Somehow related to women" maybe?
Roger
[[Category:Women]] could be a subcategory of [[Category:Woman]]. Making an attribute a subcategory of a theme is allowed; it is the reverse that is not allowed.
Avoid distinctions that will have to be re-explained every time another newbie joins.
In any event, things wouldn't be perfect. Ultimately the best solution would involve fixing the category system itself, a process which should be approached carefully so as to avoid making the same mistakes all over again. The advantage of my proposal to not allow themes as subcategories of attributes is that it can be implemented today, without much disruption, and without modifying any code. Plus, it allows for a relatively straightforward upgrade path when the category system is fixed. The proposal itself is not the fix, it's a temporary workaround.
As an alternative, it would probably be possible to do all of this even without enforcing the subcategory rule. But all purely attribute categories would have to be identified as such. I'll have to think about that.
One can work towards this, but any enforcement is a bit like passing a law that requires everybody to think logically.
Ec
On 6/7/06, Ray Saintonge saintonge@telus.net wrote:
Anthony DiPierro wrote:
On 6/4/06, Roger Luethi collector@hellgate.ch wrote:
For instance, how do you connect the districts of Paris to the category Paris? What is a subset of the parent attribute "Paris": "Districts of Paris", or "Quartier Latin", or neither? Does it bother you if the article on a French district is now in a subcategory of "Capitals in Europe"?
[[Category:Paris]] is a theme, not an attribute, so [[Category:Paris]] should not be a subcategory of [[Category:Capitals in Europe]].
Is it practical to have people debating whether something is a theme or an attribute?
There's nothing to debate. Either the subjects of articles within a category have an "is a" relationship, and are attributes, or they don't, and are themes.
[[Category:Women]] could be a subcategory of [[Category:Woman]]. Making an attribute a subcategory of a theme is allowed; it is the reverse that is not allowed.
Avoid distinctions that will have to be re-explained every time another newbie joins.
Why? The MoS is filled with rules that have to be re-explained every time another newbie joins. Do you capitalize "External Links" or do you write "External links"? The MoS says it should be the latter, but this is by no means obvious.
In order to lower the learning curve for newbies, should we abandon all attempts at having a consistent style?
In any event, things wouldn't be perfect. Ultimately the best solution would involve fixing the category system itself, a process which should be approached carefully so as to avoid making the same mistakes all over again. The advantage of my proposal to not allow themes as subcategories of attributes is that it can be implemented today, without much disruption, and without modifying any code. Plus, it allows for a relatively straightforward upgrade path when the category system is fixed. The proposal itself is not the fix, it's a temporary workaround.
As an alternative, it would probably be possible to do all of this even without enforcing the subcategory rule. But all purely attribute categories would have to be identified as such. I'll have to think about that.
One can work towards this, but any enforcement is a bit like passing a law that requires everybody to think logically.
Not at all. All that's required is that people who do think logically are allowed to fix things up and have somewhere to point to if they are challenged.
Anthony
On 6/8/06, Anthony DiPierro wikilegal@inbox.org wrote:
One can work towards this, but any enforcement is a bit like passing a law that requires everybody to think logically.
Not at all. All that's required is that people who do think logically are allowed to fix things up and have somewhere to point to if they are challenged.
Any system that requires logic and thought to operate well, but can be operated by anyone, is likely to live in a state of semi-functioning. The people who are good at categorising will be overworked, and thus there will be parts of the system that will live for days, weeks, or months, badly categorised.
There's nothing particularly wrong with that.
Steve
On 6/8/06, Steve Bennett stevagewp@gmail.com wrote:
On 6/8/06, Anthony DiPierro wikilegal@inbox.org wrote:
One can work towards this, but any enforcement is a bit like passing a law that requires everybody to think logically.
Not at all. All that's required is that people who do think logically are allowed to fix things up and have somewhere to point to if they are challenged.
Any system that requires logic and thought to operate well, but can be operated by anyone, is likely to live in a state of semi-functioning. The people who are good at categorising will be overworked, and thus there will be parts of the system that will live for days, weeks, or months, badly categorised.
There's nothing particularly wrong with that.
No, there isn't. In fact I'd say it's pretty much how wikis work.
Anthony
On 6/4/06, Roger Luethi collector@hellgate.ch wrote:
I've personally run into this when trying to automatically create, for example, a list of all Wikipedia articles on people. You can't just start at [[Category:People]] and work your way down, because you wind up going to [[Category:Women]] (fine, all women are people) then [[Category:Feminine hygiene]] (bad).
Okay, that's equivalent to Steve's Amelie-Paris relation. I agree that's a problem.
The problem there, now that I think about it, is that Paris should not be in the category "Paris" (as was pointed out by someone else).
Amelie should be in the category "Paris".
The article Paris should be in the category "European capitals" (say).
The article Paris should not be in the category "Paris".
The category "Paris" should not be in the category "European capitals".
Actually, even simply obeying this last rule would solve it: Paris *the article* belongs in the taxonomic category (Paris *is a* European capital), but "Paris" the category is thematic, so should only belong to thematic categories: maybe "Europe" in this case.
That actually seems to fix the problem. I saw this with The Beatles for example. John Lennon was in category "The Beatles" (thematic), and that category was in "British rock bands" (taxonomic), leading to the conclusion that Lennon was a British rock band.
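The Lennon problem can be reproduced with a small traversal sketch in Python. The category and article names are from this thread; the `KIND` labels, the dictionaries, and the `instances` function are hypothetical, and the "strict" mode is only one possible reading of the rule that a thematic category should not sit under a taxonomic one.

```python
from typing import Dict, List, Set

TAXONOMIC = "taxonomic"
THEMATIC = "thematic"

# Hypothetical toy data mirroring the Beatles example.
KIND: Dict[str, str] = {
    "British rock bands": TAXONOMIC,
    "The Beatles": THEMATIC,
}
SUBCATS: Dict[str, List[str]] = {
    "British rock bands": ["The Beatles"],
}
ARTICLES: Dict[str, List[str]] = {
    "British rock bands": ["The Beatles"],  # the article about the band
    "The Beatles": ["John Lennon"],         # related to the theme
}


def instances(root: str, strict: bool) -> Set[str]:
    """Collect every article reachable from `root`.

    With strict=True, only taxonomic subcategories are followed, so the
    "X is a <root>" reading is not carried into thematic categories.
    """
    found: Set[str] = set()
    visited: Set[str] = set()
    stack = [root]
    while stack:
        cat = stack.pop()
        if cat in visited:
            continue
        visited.add(cat)
        found.update(ARTICLES.get(cat, []))
        for sub in SUBCATS.get(cat, []):
            if strict and KIND.get(sub) != TAXONOMIC:
                continue  # do not descend into a theme
            stack.append(sub)
    return found


naive = instances("British rock bands", strict=False)
careful = instances("British rock bands", strict=True)
```

Here `naive` includes "John Lennon" as if he were a British rock band, while `careful` contains only the band's own article.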
But even if we maintained a complete and up-to-date system of subcats, we'd still make it hard for people to find articles using categories. For some fairly sensible reasons, the rule is to include articles only to the
I have never completely understood these "sensible reasons". It's redundant from a taxonomic point of view, but since category navigation works so badly (there is no way to easily see everything in a category and all its subcats), it often seems to work well from a practical perspective, so that the item actually shows up where you expect it to.
It's ok if a category consists entirely of subcategories, but if there are both articles and subcats in it, then not having an article in the category by virtue of the fact that it's included in a subcat is awkward and doesn't work well.
subcategory, but not to the parent. There is no way to list articles based on a subset of criteria (the articles in subcategories are effectively hidden on separate pages which is only helpful if you know which one to pick).
Yep.
The devil is in the details.
For instance, how do you connect the districts of Paris to the category Paris? What is a subset of the parent attribute "Paris": "Districts of Paris", or "Quartier Latin", or neither? Does it bother you if the article on a French district is now in a subcategory of "Capitals in Europe"?
Hmm, the difficulty is deciding what "subcategory" really means. I assume you're getting at the fact that a taxonomic subcategory should simply be getting more specific, and leading to more specialised subjects (so "Capitals of Europe" might have subcat "Capitals of Western Europe" or "Capitals of the European Union"), maintaining the "X is a Capital of Europe" mantra.
In this case, it would seem best that "Districts of Paris" was a category of the thematic category "Paris".
Or going back to [[Category:Women]]: You could declare that only articles on instances of women (i.e. biographies) can ever be under that category, and that only sets of such articles can ever be subcategories of the category women. -- You could even create a separate [[Category:Woman]],
This is a perfect example of a problem alluded to in the MoS on categories: Women is both a taxonomic category (it's a plural) and a theme (eg, Women throughout the ages, or whatever). Disaster is inevitable from that point onward.
The taxonomic category "Women" could be split immediately into fictional women and real women, then into living and dead women, then again by various means. The thematic category "Women" could be broken into feminism, biology, etc etc.
subcategories like "female reproductive organs" containing articles like uterus. -- But how would you express the undisputed relationship between female human beings and your example [[Category:Feminine hygiene]]? How about [[Category:Women's rights]]? Add an umbrella cat "Somehow related to women" maybe?
With a separate, thematic, category. What would you call it? I don't know. In practice, this would probably only work through a whole separate structure, leaving Categories only for taxonomic categories ("X is a Y"), and creating a structure called Subjects or Themes or something.
Steve
Steve Bennett wrote:
The problem there, now that I think about it, is that Paris should not be in the category "Paris" (as was pointed out by someone else).
Actually, if we consider "Paris" a thematic category, then it makes sense for Paris to be in it, since it certainly fits the theme.
But I do agree that the category "Paris" should not be a subcategory of "European capitals", since the things in the "Paris" category (with one possible exception) are not European capitals.
Hmm, the difficulty is deciding what "subcategory" really means. I assume you're getting at the fact that a taxonomic subcategory should simply be getting more specific, and leading to more specialised subjects (so "Capitals of Europe" might have subcat "Capitals of Western Europe" or "Capitals of the European Union"), maintaining the "X is a Capital of Europe" mantra.
In this case, it would seem best that "Districts of Paris" was a category of the thematic category "Paris".
...which could, however, be a subcategory of "France", which in turn could be a subcategory of "Europe" -- all of these being thematic categories, and each theme being a subset of its parent theme.
In fact, we _could_ have a parallel taxonomic category tree that focused solely on geography, with "Districts of Paris" being a subset of "Places in Paris", which in turn is a subset of "Places in France" (but _not_ "Cities in France") and "Places in Europe" and ultimately "Places". So we'd have a tree of taxonomic "Places in [Region]" categories, each one having subcategories named either "[Divisions] in [Region]" or "Places in [Subregion]", with the root of the tree being "Places". (I'm torn on whether we'd need a second-level category "Places on Earth", though.)
So, to summarize what I believe has been proposed:
Taxonomic categories (plural):
- Subcategories are subsets (and always taxonomic).
- Members are instances.
Thematic categories (singular):
- Subcategories are subthemes (thematic) or sets of related things (taxonomic).
- Members are related things.
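The subcategory rules in this summary could be checked mechanically. The following Python sketch assumes every category has already been labelled as taxonomic or thematic, which is exactly the part the thread has not settled; the labels, the sample graph, and the `violations` function are illustrative only.

```python
from typing import Dict, List, Tuple

# Hypothetical labels, following the classifications suggested in the thread.
KIND: Dict[str, str] = {
    "European capitals": "taxonomic",
    "Districts of Paris": "taxonomic",
    "Paris": "thematic",
    "France": "thematic",
    "Europe": "thematic",
}

# Hypothetical parent -> subcategories graph.
SUBCATS: Dict[str, List[str]] = {
    "European capitals": ["Paris"],   # the link the thread objects to
    "Europe": ["France"],
    "France": ["Paris"],
    "Paris": ["Districts of Paris"],
}


def violations(kind: Dict[str, str],
               subcats: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (parent, child) pairs where a taxonomic category has a
    non-taxonomic subcategory. Thematic parents may hold either kind."""
    bad = []
    for parent, children in subcats.items():
        if kind[parent] != "taxonomic":
            continue
        for child in children:
            if kind[child] != "taxonomic":
                bad.append((parent, child))
    return sorted(bad)


found = violations(KIND, SUBCATS)
```

On this sample graph the only pair reported is ("European capitals", "Paris"); the thematic chain Europe, France, Paris and the taxonomic "Districts of Paris" under the theme "Paris" both pass.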
The plural/singular distinction may not be ideal, but does feel very natural in many cases. It's not without problems, though: what about, for example, the category "Sheep"?
On Mon, 05 Jun 2006 00:01:26 +0300, Ilmari Karonen wrote:
The problem there, now that I think about it, is that Paris should not be in the category "Paris" (as was pointed out by someone else).
Actually, if we consider "Paris" a thematic category, then it makes sense for Paris to be in it, since it certainly fits the theme.
Heh. Valid point, and it seems that many editors would agree with you.
In fact, we _could_ have a parallel taxonomic category tree that focused solely on geography, with "Districts of Paris" being a subset of "Places in Paris", which in turn is a subset of "Places in France" (but _not_ "Cities in France") and "Places in Europe" and ultimately "Places". So we'd have a tree of taxonomic "Places in [Region]" categories, each one having subcategories named either "[Divisions] in [Region]" or "Places in [Subregion]", with the root of the tree being "Places". (I'm torn on whether we'd need a second-level category "Places on Earth", though.)
[[Category:Subdivisions by country]] does something like that.
The plural/singular distinction may not be ideal, but does feel very natural in many cases. It's not without problems, though: what about, for example, the category "Sheep"?
Now _that_ is fortunately a minor problem. Call one of them "sheep (thematic)" or something. It's not like namespace collisions are unheard of in WP.
Roger
On 6/4/06, Ilmari Karonen nospam@vyznev.net wrote:
So, to summarize what I believe has been proposed:
Taxonomic categories (plural):
- Subcategories are subsets (and always taxonomic).
- Members are instances.
Thematic categories (singular):
- Subcategories are subthemes (thematic) or sets of related things (taxonomic).
- Members are related things.
Now we're getting somewhere. We still don't have a solution/policy for attributes or meta-attributes though.
With attributes, one problem is perhaps that people force an "is a" relation where it's not natural, as in "X is a winner of the Y prize", whereas the more natural relation in most cases would be "did" or "has a", as in "Amélie did win the César Award" and "Paris has a world heritage listed site".
Just for fun, here are the categories currently on the Amélie article (ironically, Paris is not one of them...)
- Incomplete lists (meta-attribute)
- Articles with unsourced statements (meta-attribute)
- Fantasy films (taxonomic)
- Romantic comedy films (taxonomic)
- Comedy-drama films (taxonomic)
- French films (taxonomic)
- French-language films (taxonomic)
- 2001 films (taxonomic)
- César Award winners (attribute)
- Best Foreign Language Film Oscar nominee (attribute)
- Films shot in Super 35 (?)
Some of these "taxonomic" categories would really be better split up into taxonomies and attributes that only apply to certain taxonomic categories. In this case:
- Films (taxonomic)
- French (film) (attribute)
- French-language (attribute)
etc.
The plural/singular distinction may not be ideal, but does feel very natural in many cases. It's not without problems, though: what about, for example, the category "Sheep"?
I can't think of many items in that taxonomic category other than Dolly, in which case you could probably "cheat" by renaming the taxonomic category "Famous sheep" or "Notable sheep" or something. Breeds would be under "Breeds of sheep" and everything else sheepish would be under the thematic category "Sheep".
Unless we wanted to start a new naming convention of "Sheep-related" or something.
Steve
On Sun, 04 Jun 2006 18:07:15 +0200, Steve Bennett wrote:
Okay, that's equivalent to Steve's Amelie-Paris relation. I agree that's a problem.
The problem there, now that I think about it, is that Paris should not be in the category "Paris" (as was pointed out by someone else).
Yeah, although it's a very common thing to do. The alternative is to write "See also: [[Category:Paris]]" in the Paris article, which is harder to find (especially if there are many external links or references). That's a very practical reason, though.
Amelie should be in the category "Paris"
If Amelie goes into that category, then the category means "related to Paris" and becomes almost meaningless. Do we add every movie that's got an Eiffel Tower in it? Every article on people who were born there, died there, or lived there? Even if the category remained small, how is such a vague piece of information useful? ... It seems that Amelie should be in "Films set in Paris" (or something like that), and that category should be in the category "Paris".
The article Paris should be in the category "European capitals" (say).
The article Paris should not be in the category "Paris".
The category "Paris" should not be in the category "European capitals".
Actually, even simply obeying this last rule would solve it: Paris *the article* belongs in the taxonomic category (Paris *is a* European capital), but "Paris" the category is thematic, so should only belong to thematic categories: maybe "Europe" in this case.
That actually seems to fix the problem. I saw this with The Beatles for example. John Lennon was in category "The Beatles" (thematic), and that category was in "British rock bands" (taxonomic), leading to the conclusion that Lennon was a British rock band.
It's mostly scripts that get confused. I suspect most normal users won't be confused but find that convenient.
Say I navigated down to "British rock bands" (taxonomic), discover The Beatles and would like to see what WP has got on them. With the proposed system, I need to open the article, scroll down to the bottom -- nope, if there is a category for them, the article can't be in it, so scroll back up several pages worth of navigation boxes, external links, and references to spot the "See also" section which hopefully contains a link to the category "The Beatles" if there is one.
I'm not sure we can convince people that it's a win overall.
But even if we maintained a complete and up-to-date system of subcats, we'd still make it hard for people to find articles using categories. For some fairly sensible reasons, the rule is to include articles only to the
I have never completely understood these "sensible reasons". It's redundant from a taxonomic point of view, but since category navigation works so badly (there is no way to easily see everything in a category and all its subcats), it often seems to work well from a
It's a trade-off. The sensible reason is that while your argument is correct, a category that contains hundreds of entries is equally unusable.
It's ok if a category consists entirely of subcategories, but if there are both articles and subcats in it, then not having an article in the category by virtue of the fact that it's included in a subcat is awkward and doesn't work well.
The higher category often serves as a waiting room for articles that have not been sorted into a subcat or would make for only a tiny subcat (several subcats of [[Category:Astronauts by nationality]] contain only one or two articles).
Hmm, the difficulty is deciding what "subcategory" really means. I assume you're getting at the fact that a taxonomic subcategory should simply be getting more specific, and leading to more specialised subjects (so "Capitals of Europe" might have subcat "Capitals of Western Europe" or "Capitals of the European Union"), maintaining the "X is a Capital of Europe" mantra.
In this case, it would seem best that "Districts of Paris" was a category of the thematic category "Paris".
(I know it doesn't work too well with this example, but bear with me) And "Districts of Paris", being an attribute, is also in (taxonomic) "Districts of European Capitals" which in turn is in (thematic) category "Europe", right?
Or going back to [[Category:Women]]: You could declare that only articles on instances of women (i.e. biographies) can ever be under that category, and that only sets of such articles can ever be subcategories of the category women. -- You could even create a separate [[Category:Woman]],
This is a perfect example of a problem alluded to in the MoS on categories: Women is both a taxonomic category (it's a plural) and a theme (eg, Women throughout the ages, or whatever). Disaster is inevitable from that point onward.
We can fix that.
The taxonomic category "Women" could be split immediately into fictional women and real women, then into living and dead women, then again by various means.
So "living women" is taxonomy, but "Living persons" is an attribute?
One obvious problem here is that you don't have the strict hierarchy that you proposed in your initial posting. For Bridges and France, there are natural hierarchies of higher or more generic concepts. The relations between France and Europe or between bridge and structure are directed. Women, living, and fictional have no directed relations. You could use them in any order.
And another thing I just noticed: The taxonomy in Category:Bridges breaks down after only one level. Its parent, [[Category:Buildings and structures]], is a subcategory of four themes and nothing else. [[Category:Nobel Peace Prize winners]], one of your examples for an attribute, on the other hand, sits under a stack of "Nobel laureates", "Prize winners", "People".
Your definitions of taxonomies and attributes need work :-).
subcategories like "female reproductive organs" containing articles like uterus. -- But how would you express the undisputed relationship between female human beings and your example [[Category:Feminine hygiene]]? How about [[Category:Women's rights]]? Add an umbrella cat "Somehow related to women" maybe?
With a separate, thematic, category. What would you call it? I don't know. In practice, this would probably only work through a whole separate structure, leaving Categories only for taxonomic categories ("X is a Y"), and creating a structure called Subjects or Themes or something.
Distinct namespaces for different types of categories. It would involve some coding and the migration must be planned, but it might be easier to explain and easier to maintain. It would also be another small step towards a semantic web.
Roger
On 6/4/06, Roger Luethi collector@hellgate.ch wrote:
On Sun, 04 Jun 2006 18:07:15 +0200, Steve Bennett wrote:
That actually seems to fix the problem. I saw this with The Beatles for example. John Lennon was in category "The Beatles" (thematic), and that category was in "British rock bands" (taxonomic), leading to the conclusion that Lennon was a British rock band.
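A minimal sketch of how a script ends up with that conclusion, assuming it simply follows every "is in" link upward. The graph contents and the helper name `ancestors` are made up for illustration; this is not how any particular tool is implemented.

```python
# Hypothetical category graph: page -> set of categories it is filed under.
PARENTS = {
    "John Lennon": {"Category:The Beatles"},                   # article in a thematic category
    "Category:The Beatles": {"Category:British rock bands"},   # thematic category filed under a taxonomic one
}

def ancestors(node, parents):
    """Return every category reachable by following 'is in' links upward."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

lennon_categories = ancestors("John Lennon", PARENTS)
# A naive script now believes John Lennon "is a" British rock band.
print(sorted(lennon_categories))
```

The traversal itself is fine; the wrong conclusion comes from treating a thematic link and a taxonomic link as the same kind of edge.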
It's mostly scripts that get confused. I suspect most normal users won't be confused but find that convenient.
Say I navigated down to "British rock bands" (taxonomic), discover The Beatles and would like to see what WP has got on them. With the proposed system, I need to open the article, scroll down to the bottom -- nope, if there is a category for them, the article can't be in it, so scroll back up several pages worth of navigation boxes, external links, and references to spot the "See also" section which hopefully contains a link to the category "The Beatles" if there is one.
I'm not sure we can convince people that it's a win overall.
You could always put "See also: [[:Category:The Beatles]]" (I think that's the syntax) in the description for [[Category:British rock bands]].
On 6/5/06, Anthony DiPierro wikilegal@inbox.org wrote:
You could always put "See also: [[:Category:The Beatles]]" (I think that's the syntax) in the description for [[Category:British rock bands]].
That's probably not bad. The page would have a basic structure like:
Description
Related categories <-- new
Subcategories
Articles in this category
People *want* to put "related categories", but they break the category system if they make them subcats. We need to channel that desire.
Steve
On 6/4/06, Roger Luethi collector@hellgate.ch wrote:
Yeah, although it's a very common thing to do. The alternative is to write "See also: [[Category:Paris]]" in the Paris article, which is harder to find (especially if there are many external links or references). That's a very practical reason, though.
Perhaps something like the current boxes we have for Commons would do. "Wikipedia has a category dedicated to '''[[:Category:Paris|Paris]]'''."
If Amelie goes into that category, then the category means "related to Paris" and becomes almost meaningless. Do we add every movie that's got an Eiffel Tower in it? Every article on people who were born there, died there, or lived there? Even if the category remained small, how is such a vague piece of information useful? ... It seems that Amelie should be in "Films set in Paris" (or something like that), and that category should be in the category "Paris".
That would be good. (Pity for this example that Amélie is currently not in that category...bad choice on my part)
It's mostly scripts that get confused. I suspect most normal users won't be confused but find that convenient.
It confused me :/
Say I navigated down to "British rock bands" (taxonomic), discover The Beatles and would like to see what WP has got on them. With the proposed system, I need to open the article, scroll down to the bottom -- nope, if there is a category for them, the article can't be in it, so scroll back up several pages worth of navigation boxes, external links, and references to spot the "See also" section which hopefully contains a link to the category "The Beatles" if there is one.
Ok, taking into account the previous message, it seemed that the proposed system is actually:
For thematic categories ("The Beatles"), the archetypical article ([[The Beatles]]) *should* be in the category.
For taxonomic categories ("British rock bands"), the archetypical article ([[British rock music]]??) should not be in the category, but should be prominently linked in the description of the category. The article should also link to the category with a [[:Category:]] link.
That seems workable to me and keeps all the benefits. Note that in this case, The Beatles should be in both categories "The Beatles" and "British rock bands", but the category "The Beatles" should not be in the latter (maybe it could be in "British rock music" if that thematic category exists).
It's a trade-off. The sensible reason is that while your argument is correct, a category that contains hundreds of entries is equally unusable.
I think being able to see all the hundreds of entries in a category (and its subcats) - on request - is useful.
The higher category often serves as a waiting room for articles that have not been sorted into a subcat or make for a tiny subcat only (several subcats [[Category:Astronauts by nationality]] contain only one or two articles).
Yeah, and they end up getting greater prominence than they really deserve. Also, the existence of those articles makes the reader think that that's *all* the articles we have in that cat.
In this case, it would seem best that "Districts of Paris" was a category of the thematic category "Paris".
(I know it doesn't work too well with this example, but bear with me) And "Districts of Paris", being an attribute, is also in (taxonomic) "Districts of European Capitals" which in turn is in (thematic) category "Europe", right?
Sure. Given the proposed rule "thematic categories can have taxonomic subcategories, but not the reverse", it is possible to have an article which multiply inherits to the same supercategory. Given that tn is a thematic category, and xn is a taxonomic category, that gives:
xn(tn(X)) is not possible
tn(xn(X)) is possible
t1(x1(X)), x2(x1(X)) is possible (X is in category x1, which is in both a thematic category t1 and another taxonomic category x2)
t1(X), t1(x1(X)) is possible but probably not good (X is in thematic category t1, and also in taxonomic category x1 which is in the same thematic category...)
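The proposed rule is simple enough to check mechanically. A rough sketch, assuming every category has already been labelled thematic or taxonomic by hand; the labels below and the function names are illustrative assumptions, and same-kind nesting is assumed to be allowed.

```python
THEMATIC = "thematic"
TAXONOMIC = "taxonomic"

def subcategory_allowed(parent_kind, child_kind):
    """Proposed rule: a taxonomic category may not contain a thematic subcategory."""
    return not (parent_kind == TAXONOMIC and child_kind == THEMATIC)

def invalid_links(kinds, links):
    """Return the (child, parent) links that break the rule."""
    return [(child, parent) for child, parent in links
            if not subcategory_allowed(kinds[parent], kinds[child])]

# Hand-assigned labels, for illustration only.
kinds = {
    "Paris": THEMATIC,
    "Districts of Paris": TAXONOMIC,
    "European cities": TAXONOMIC,
}
links = [
    ("Districts of Paris", "Paris"),   # taxonomic inside thematic: fine
    ("Paris", "European cities"),      # thematic inside taxonomic: rejected
]
bad = invalid_links(kinds, links)
print(bad)
```

The hard part is the hand labelling, not the check.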
This is a perfect example of a problem alluded to in the MoS on categories: Women is both a taxonomic category (it's a plural) and a theme (eg, Women throughout the ages, or whatever). Disaster is inevitable from that point onward.
We can fix that.
Yep. We need to document these new rigid rules somewhere. I do like the idea of making all categories inherit from some ultimate "thematic" or "taxonomic" category. Failing that, a template to go on every category like "This is a *taxonomic* category. Only items that "are ____s" should go in it."
The taxonomic category "Women" could be split immediately into fictional women and real women, then into living and dead women, then again by various means.
So "living women" is taxonomy, but "Living persons" is an attribute?
I haven't thought this one through - ideas would be good. Ideally, we would have proper attributes, such that "living" could be stamped on someone (but that "living" could not be applied to artworks, tv shows or whatever...). I don't know a good way to taxonomise people, but I'm sure others do.
One obvious problem here is that you don't have the strict hierarchy that you proposed in your initial posting. For Bridges and France, there are natural hierarchies of higher or more generic concepts. The relations between France and Europe or between bridge and structure are directed. Women, living, and fictional have no directed relations. You could use them in any order.
Yep. (and all the other problems with my Bridges in France example.)
And another thing I just noticed: The taxonomy in Category:Bridges breaks down after only one level, category [[Category:Buildings and structures]] which is a subcategory of four themes and nothing else. [[Category:Nobel Peace Prize winners]], one of your examples for an attribute, on the other hand, contains a stack of "Nobel laureates", "Prize winners", "People".
Your definitions of taxonomies and attributes need work :-).
Heh :) Input welcome! I think the distinction between "taxonomy" and "attribute" is probably a sliding scale. It comes down to what is natural. Do we really think in terms of "nobel laureates"? I doubt it - I think we think in terms of "scientists" who *also* "won the nobel prize".
Distinct namespaces for different types of categories. It would involve some coding and the migration must be planned, but it might be easier to explain and easier to maintain. It would also be another small step towards a semantic web.
How many? What would you call them? What are the arguments against?
Steve
On Mon, 05 Jun 2006 11:43:30 +0200, Steve Bennett wrote:
theme (eg, Women throughout the ages, or whatever). Disaster is inevitable from that point onward.
We can fix that.
Yep. We need to document these new rigid rules somewhere. I do like
First we need to find out which rules make sense and then make a convincing case: Demonstrate the benefits, have answers to apparent drawbacks, a realistic migration path. Compare the problems solved to the problems left.
If we ever got that far, there should be a sandbox that challenges people to add problem and corner cases. New rules have little credibility unless you can demonstrate that you don't have to rethink parts of your system every time a new example comes up.
Unfortunately, I'm not aware of a good method for presenting and editing the kind of graphs we're talking about in a wiki.
So "living women" is taxonomy, but "Living persons" is an attribute?
I haven't thought this one through - ideas would be good. Ideally, we would have proper attributes, such that "living" could be stamped on someone (but that "living" could not be applied to artworks, tv shows or whatever...). I don't know a good way to taxonomise people, but I'm sure others do.
Ah, here I agree. So we have attributes for state (dead/alive). You probably want them for location, too. Being able to slap "in France" on an article would be helpful. Problem is, not everything is a bridge where "in <location>" has an unambiguous meaning in relation to the subject of the article. An American movie may be "set in France", or a movie set in the US may be "shot in France". And people may be "born in France" or have "died in France".
I guess the reason I am only mildly interested in hierarchies is that many interesting attributes (dead/alive, colors, professions) don't fit well into hierarchies. I think the real power comes from combining attributes.
The German WP is much closer to that. For instance, they don't have categories like "Polish Chemists". They only have the attribute categories "Polish" and "Chemist". From a practical point of view, that's less usable than what we have (they basically need to use CatScan which is fairly limited, and casual users don't know about it anyway). But it's conceptually cleaner, and they are in a better position for making interesting experiments.
Your definitions of taxonomies and attributes need work :-).
Heh :) Input welcome! I think the distinction between "taxonomy" and "attribute" is probably a sliding scale. It comes down to what is natural. Do we really think in terms of "nobel laureates"? I doubt it
Combining rigid rules with common sense is hard. I am tempted to quote your line about inevitable disaster.
Distinct namespaces for different types of categories. It would involve some coding and the migration must be planned, but it might be easier to explain and easier to maintain. It would also be another small step towards a semantic web.
How many? What would you call them? What are the arguments against?
Your first question answers the third :-). As for the second, I don't think implementation details matter much at this stage.
Roger
On 6/6/06, Roger Luethi collector@hellgate.ch wrote:
First we need to find out which rules make sense and then make a convincing case: Demonstrate the benefits, have answers to apparent drawbacks, a realistic migration path. Compare the problems solved to the problems left.
Ok. I suspect rules like "Express X by doing Y and Z" are going to work better than "Don't do X".
If we ever got that far, there should be a sandbox that challenges people to add problem and corner cases. New rules have little credibility unless you can demonstrate that you don't have to rethink parts of your system every time a new example comes up.
Would it be credible to say that for 90% of the time the new system is better, and for the other 10% we leave it the way it is?
Unfortunately, I'm not aware of a good method for presenting and editing the kind of graphs we're talking about in a wiki.
No. I'd like to try doing some experiments though. We don't necessarily need "graphs". Tables and hierarchical lists may be a start, depending on what you're talking about.
Ah, here I agree. So we have attributes for state (dead/alive). You probably want them for location, too. Being able to slap "in France" on an article would be helpful. Problem is, not everything is a bridge where "in <location>" has an unambiguous meaning in relation to the subject of the article. An American movie may be "set in France", or a movie set in the US may be "shot in France". And people may be "born in France" or have "died in France".
Yeah, I know. But I would actually rather see a film article labelled "Films", "Made in France", "Made in US" rather than labelled "Films made in France", "Films made in the US".
I guess the reason I am only mildly interested in hierarchies is that many interesting attributes (dead/alive, colors, professions) don't fit well into hierarchies. I think the real power comes from combining attributes.
Yep. But there's no software support for that atm.
The German WP is much closer to that. For instance, they don't have categories like "Polish Chemists". They only have the attribute categories "Polish" and "Chemist". From a practical point of view, that's less usable than what we have (they basically need to use CatScan which is fairly limited, and casual users don't know about it anyway). But it's conceptually cleaner, and they are in a better position for making interesting experiments.
What would actually be good would be being able to define categories in terms of attributes. Stick a {{Category:Polish chemists}} template on an article, which substitutes [[Attribute:Polish]] and [[Attribute:Chemists]], as well as containing a link to the category "Polish chemists". This category would be nothing more than a description and some sort of link to the two attributes, causing all articles with both attributes to be displayed.
IMHO it would not be a huge amount of work to implement, could be phased in gradually, and would be a huge improvement.
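To make the idea concrete, here is a small sketch of a category defined purely as a set of required attributes. All names (`ARTICLE_ATTRIBUTES`, `CATEGORY_DEFINITIONS`, `category_members`) and the sample tagging are assumptions for illustration; nothing like this exists in the software today.

```python
# Hypothetical attribute tags per article.
ARTICLE_ATTRIBUTES = {
    "Marie Curie": {"Polish", "Chemists", "Physicists"},
    "Frédéric Chopin": {"Polish", "Composers"},
    "Antoine Lavoisier": {"French", "Chemists"},
}

# A category is nothing more than a name plus the attributes it requires.
CATEGORY_DEFINITIONS = {
    "Polish chemists": {"Polish", "Chemists"},
}

def category_members(category, definitions, article_attributes):
    """List every article that carries all attributes the category requires."""
    required = definitions[category]
    return sorted(title for title, attrs in article_attributes.items()
                  if required <= attrs)

members = category_members("Polish chemists", CATEGORY_DEFINITIONS, ARTICLE_ATTRIBUTES)
print(members)
```

The category page would then be a description plus this computed listing, rather than something each article has to be filed into by hand.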
Steve
On Tue, 06 Jun 2006 13:51:33 +0200, Steve Bennett wrote:
Ok. I suspect rules like "Express X by doing Y and Z" are going to work better than "Don't do X".
Right. But more importantly, the way categories are used is thoroughly entrenched and I suspect your chances to change it are close to zero even assuming you could make it a policy.
It would be easier to create clean trees in a parallel namespace. Say, leave [[Category:Bridges]] alone and create [[is a:Bridge]] (or, without software changes, [[Category:is a Bridge]]).
That said, it seems that you are overestimating the importance of one type of relationships at the expense of others.
Some hierarchies are perfectly natural and useful but are not "is a" relationships (Europe - France - Paris, Family - Genus - Species).
Many attributes are perfectly natural and useful but they tend not to fit hierarchies well. You started using them as soon as you sketched out the supposedly taxonomic category women. There simply is no natural taxonomic hierarchy for women, just a bunch of attributes.
Now _if_ you want to draw a natural hierarchy with women in it, try genealogy. But guess what? That's another type of relationship that we can't deal with (X is ancestor of Y, Y is ancestor of Z).
Would it be credible to say that for 90% of the time the new system is better, and for the other 10% we leave it the way it is?
Not without supporting evidence :-).
Unfortunately, I'm not aware of a good method for presenting and editing the kind of graphs we're talking about in a wiki.
No. I'd like to try doing some experiments though. We don't necessarily need "graphs". Tables and hierarchical lists may be a start, depending on what you're talking about.
Yes. The example Anthony posted today in another thread was useful:
---------------------------- snip ----------------------
Coastal_construction
*Ports_and_harbours
**Port_cities
***Edinburgh
****Education_in_Edinburgh
[...]
---------------------------- snip ----------------------
article. An American movie may be "set in France", or a movie set in the US may be "shot in France". And people may be "born in France" or have "died in France".
Yeah, I know. But I would actually rather see a film article labelled "Films", "Made in France", "Made in US" rather than labelled "Films made in France", "Films made in the US".
Agreed. But the long-term goal should be for "Made in US" to be dynamically generated. It's just a bunch of relationships ("made in", "died in") and a list of attributes -- hierarchical even, in this case ("New York", "US", "North America"). You can have all kinds of fun with that, until someone adds a relationship like "was named after" and your software concludes that people named after London are also named after Great Britain :-).
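A rough sketch of what "dynamically generated" could look like, and of one way around the "named after" trap: mark each relationship with whether it is allowed to travel up the location hierarchy. The `INHERITS_UP` flag and the function name `expand` are invented for this example.

```python
# Location containment: place -> the larger place that contains it.
LOCATION_PARENT = {
    "New York": "US",
    "US": "North America",
    "London": "Great Britain",
}

# Hypothetical per-relationship flag: does the statement still hold for containing places?
INHERITS_UP = {
    "made in": True,
    "died in": True,
    "was named after": False,
}

def expand(relationship, place):
    """Return every place for which '<relationship> <place>' may be asserted."""
    places = [place]
    if INHERITS_UP[relationship]:
        while places[-1] in LOCATION_PARENT:
            places.append(LOCATION_PARENT[places[-1]])
    return places

made_in = expand("made in", "New York")
named_after = expand("was named after", "London")
print(made_in)
print(named_after)
```

Someone still has to decide the flag for every new relationship, which is where the fun starts.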
I guess the reason I am only mildly interested in hierarchies is that many interesting attributes (dead/alive, colors, professions) don't fit well into hierarchies. I think the real power comes from combining attributes.
Yep. But there's no software support for that atm.
But you have to deal with them anyway. You suggested something like this:
women
*real women
*-living women
*-dead women
*fictional women
You _are_ using attributes here. So what if I'm looking for the biography of a female Polish chemist but don't know whether that woman is still alive? Do I have to check both categories, or do we maintain trees for every possible order of attributes (which is pretty much what we are doing right now, manually)?
What would actually be good would be being able to define categories in terms of attributes. Stick a {{Category:Polish chemists}} template on an article, which substitutes [[Attribute:Polish]] and [[Attribute:Chemists]], as well as containing a link to the category "Polish chemists". This category would be nothing more than a description and some sort of link to the two attributes, causing all articles with both attributes to be displayed.
Where's {{Category:Polish chemists}} coming from? Defined on a separate page? And do we also add {{Category:Female chemists}} and {{Category:Polish physicists}} and {{Category:Polish women}} to the same article?
Roger
On 6/6/06, Roger Luethi collector@hellgate.ch wrote:
You ask some good questions here!
Right. But more importantly, the way categories are used is thoroughly entrenched and I suspect your chances to change it are close to zero even assuming you could make it a policy.
It depends whether it can be done in an evolutionary rather than revolutionary way. Slowly migrating hardcoded "Polish scientist" categories to some more advanced method should work - once critical mass is achieved (assuming that people can be convinced that it's a better method), it could become the dominant method. Suddenly wading in and reorganising category hierarchies is probably doomed, otoh.
It would be easier to create clean trees in a parallel namespace. Say, leave [[Category:Bridges]] alone and create [[is a:Bridge]] (or, without software changes, [[Category:is a Bridge]]).
Or even {{isa|bridge}} ? Can anything useful be done with templates' "what links here"?
That said, it seems that you are overestimating the importance of one type of relationships at the expense of others.
I think I agree. Let me try something: say we make a shallow "taxonomy" tree (or not even tree?) and allow attributes instead to be hierarchical:
Paris "isa city" +in France
("isa city" is the taxonomy, +in France is the attribute)
Now, +in France can be a subattribute of +in Europe (and it could have been made +in Ile de France or whatever)

Britney Spears "isa person" +female +singer +alive
+singer could be a subattribute of +entertainer

Pont du Gard "isa aqueduct" +in France, +Roman-built
"isa aqueduct" can be a subcategory of "isa bridge" and "isa construction"
This seems to be relatively clean, despite the fact that the attribute hierarchies have different meanings: "in" as opposed to "is a specialisation of".
Basically what I'm proposing now is keeping taxonomies quite strict, and allowing greater flexibility in attributes. So we'll always know whether an item is a soccer team or a city, but we may lose information on the finer details if the attributes aren't managed carefully. Still a better situation than currently not being able to distinguish between a rock band and a person.
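Here is a small sketch of that split: one strict "isa" kind per article, plus a free set of attributes that have their own hierarchy. The table names and helpers are hypothetical, and "isa aqueduct" is modelled as a single chain (aqueduct, bridge, construction), which is my simplification.

```python
# Strict taxonomy: kind -> more general kind.
ISA_PARENT = {"aqueduct": "bridge", "bridge": "construction"}

# Looser attribute hierarchy: attribute -> superattribute.
ATTRIBUTE_PARENT = {"in France": "in Europe", "singer": "entertainer"}

# Each article has exactly one "isa" kind and any number of attributes.
ARTICLES = {
    "Paris": ("city", {"in France"}),
    "Britney Spears": ("person", {"female", "singer", "alive"}),
    "Pont du Gard": ("aqueduct", {"in France", "Roman-built"}),
}

def chain(start, parent_of):
    """Follow parent links from start up to the top of its hierarchy."""
    result = [start]
    while result[-1] in parent_of:
        result.append(parent_of[result[-1]])
    return result

def describe(title):
    """Return (all 'isa' kinds, all attributes including superattributes)."""
    kind, attributes = ARTICLES[title]
    expanded = set()
    for attribute in attributes:
        expanded.update(chain(attribute, ATTRIBUTE_PARENT))
    return chain(kind, ISA_PARENT), expanded

kinds, attributes = describe("Pont du Gard")
print(kinds, sorted(attributes))
```

The two hierarchies are walked the same way even though one means "is a specialisation of" and the other means "in".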
More examples to try and break things?
Some hierarchies are perfectly natural and useful but are not "is a" relationships (Europe - France - Paris, Family - Genus - Species).
I don't quite understand your second example. "Rattus rattus" "is a" "Rattus" is ok isn't it?
Many attributes are perfectly natural and useful but they tend not to fit hierarchies well. You started using them as soon as you sketched out the supposedly taxonomic category women. There simply is no natural taxonomic hierarchy for women, just a bunch of attributes.
Yeah, I see that. So, we stop the taxonomy at "person", and instead have hierarchical attributes? Is this actually far from the current situation? Hmm.
Now _if_ you want to draw a natural hierarchy with women in it, try genealogy. But guess what? That's another type of relationship that we can't deal with (X is ancestor of Y, Y is ancestor of Z).
I think overall, having objects in a hierarchy is not the goal in itself - the goal is organising information, being able to group related information, and being able to make meaningful statements such as "43% of our articles are about people".
Agreed. But the long-term goal should be for "Made in US" to be dynamically generated. It's just a bunch of relationships ("made in", "died in") and a list of attributes -- hierarchical even, in this case ("New York", "US", "North America"). You can have all kinds of fun with that, until someone adds a relationship like "was named after" and your software concludes that people named after London are also named after Great Britain :-).
Heh, yeah attention has to be paid to what meaning can be extrapolated from a supercategory/superattribute relationship.
But you have to deal with them anyway. You suggested something like this:
women
*real women
*-living women
*-dead women
*fictional women
You _are_ using attributes here. So what if I'm looking for the biography of a female Polish chemist but don't know whether that woman is still alive? Do I have to check both categories, or do we maintain trees for every possible order of attributes (which is pretty much what we are doing right now, manually)?
Yeah, that doesn't work well. Better to use semantic attributes, possibly with antonym relationships built in (not sure of the immediate use, but it's probably helpful to distinguish between living/not living/unknown). So, to look for your female Polish chemist, you simply look for person (or possibly, chemist), +female +Polish.
Where's {{Category:Polish chemists}} coming from? Defined on a separate
Defined on a separate page by someone who thought it was a meaningful and useful category, and worth spending 2 minutes making.
page? And do we also add {{Category:Female chemists}} and {{Category:Polish physicists}} and {{Category:Polish women}} to the same article?
You could, and the software (small matter of programming) would be smart enough to take the superset of all these things:
Person +chemist +Polish
Person +female +chemist
Person +physicist +Polish
Person +female +Polish
Net result: Person +chemist +physicist +female +Polish
Alternatively if you knew the attributes directly you could just do {{Polish chemists}} +female +physicist
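The "small matter of programming" might look roughly like this: each template expands to a kind plus a set of attributes, and the article ends up with the union of everything. The `TEMPLATES` table and `combine` function are assumptions for illustration only.

```python
# Hypothetical template definitions: name -> (kind, attributes it stands for).
TEMPLATES = {
    "Polish chemists": ("Person", {"chemist", "Polish"}),
    "Female chemists": ("Person", {"female", "chemist"}),
    "Polish physicists": ("Person", {"physicist", "Polish"}),
    "Polish women": ("Person", {"female", "Polish"}),
}

def combine(template_names, extra_attributes=()):
    """Merge templates and loose attributes into one kind plus one attribute set."""
    kinds = set()
    attributes = set(extra_attributes)
    for name in template_names:
        kind, template_attributes = TEMPLATES[name]
        kinds.add(kind)
        attributes |= template_attributes
    if len(kinds) != 1:
        raise ValueError("templates must agree on exactly one kind")
    return kinds.pop(), attributes

all_four = combine(["Polish chemists", "Female chemists", "Polish physicists", "Polish women"])
shortcut = combine(["Polish chemists"], ["female", "physicist"])
print(all_four)
```

Both routes give the same net result, so it does not matter which templates an editor happens to know about.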
Steve
On Tue, 06 Jun 2006 21:06:57 +0200, Steve Bennett wrote:
better method), it could become the dominant method. Suddenly wading in and reorganising category hierarchies is probably doomed, otoh.
Yes, something to keep in mind. Finding a better system is not trivial, but it is the easiest part.
leave [[Category:Bridges]] alone and create [[is a:Bridge]] (or, without software changes, [[Category:is a Bridge]]).
Or even {{isa|bridge}} ? Can anything useful be done with templates' "what links here"?
Not as far as I can tell. It would only distinguish relation types. Plus it's a hack.
That said, it seems that you are overestimating the importance of one type of relationships at the expense of others.
I think I agree. Let me try something: say we make a shallow "taxonomy" tree (or not even tree?) and allow attributes instead to be hierarchical:
Paris "isa city" +in France
("isa city" is the taxonomy, +in France is the attribute)
Now, +in France can be a subattribute of +in Europe (and it could have been made +in Ile de France or whatever)
With the current software, that could be implemented as:
[[Paris]]
[[Category:is a city]]
[[Category:is a (is there a name for "city or town or village or some other place"?)]]
[[Category:in France]] (or [[Category:located in France]])
[[Category:in Europe]]
Okay.
Britney Spears "isa person" +female +singer +alive
+singer could be a subattribute of +entertainer
What I see is (I added more information for illustration):
[[Britney Spears]]
[[Category:is a singer]]
[[Category:is an entertainer]]
[[Category:is a person]]
[[Category:is a child actor]]
[[Category:is an entertainer]]
[[Category:is a person]]
[[Category:is alive]]
[[Category:is female]]
[[Category:born 1981]]
[[Category:born in McComb, Mississippi]] (that kind of category would be hard to maintain manually)
[[Category:born in Mississippi]]
[[Category:born in the United States]]
Here are some other fun (existing!) categories from said article:
Worst Actress Razzie: "won a Worst Actress Razzie"
Soubrettes: "is a soubrette", I guess!?
American child actors: ...? "is a child actor" or "was a child actor?"
Hollywood Walk of Fame: Ugh.
High school dropouts: Ahm?
Pont du Gard "isa aqueduct" +in France, +Roman-built
"isa aqueduct" can be a subcategory of "isa bridge" and "isa construction"
[[Pont du Gard]]
[[Category:is an aqueduct]]
[[Category:is a bridge]]
[[Category:is a construction]]
[[Category:in France]]
[[Category:built by Romans]] (?)
This seems to be relatively clean, despite the fact that the attribute hierarchies have different meanings: "in" as opposed to "is a specialisation of".
Agreed.
Basically what I'm proposing now is keeping taxonomies quite strict, and allowing greater flexibility in attributes. So we'll always know whether an item is a soccer team or a city, but we may lose information on the finer details if the attributes aren't managed
Examples?
Some hierarchies are perfectly natural and useful but are not "is a" relationships (Europe - France - Paris, Family - Genus - Species).
I don't quite understand your second example. "Rattus rattus" "is a" "Rattus" is ok isn't it?
It is. But "species" is not a genus. Like city is not a country.
Many attributes are perfectly natural and useful but they tend not to fit hierarchies well. You started using them as soon as you sketched out the supposedly taxonomic category women. There simply is no natural taxonomic hierarchy for women, just a bunch of attributes.
Yeah, I see that. So, we stop the taxonomy at "person", and instead have hierarchical attributes? Is this actually far from the current situation? Hmm.
I see two major problems with the status quo:
* multi-concept categories (American child actors) force us to maintain a complex system of subcategories (but they paper over shortcomings in the software). The German WP shows it doesn't have to be this way, but it might be difficult to convince people on WP:en until Mediawiki can create intersections
* categories with unclear relations that are used for everything
We can fix both problems without changes to the software (but it still comes at a cost). However, we are dangerously close to inventing a poor man's version of a semantic wiki.
I think overall, having objects in a hierarchy is not the goal in itself - the goal is organising information, being able to group related information, and being able to make meaningful statements such as "43% of our articles are about people".
My impression is that the German WP is pretty close to that. But categories are also an important navigation aid, and that's where WP:de falls short.
Yeah, that doesn't work well. Better to use semantic attributes, possibly with antonym relationships built in (not sure of the immediate use, but it's probably helpful to distinguish between living/not living/unknown). So, to look for your female Polish chemist, you simply look for person (or possibly, chemist), +female +Polish.
Yup.
Where's {{Category:Polish chemists}} coming from? Defined on a separate
Defined on a separate page by someone who thought it was a meaningful and useful category, and worth spending 2 minutes making.
Number of countries: > 200
Occupations: hundreds
Not every country has people in every occupation, but these are just two attributes. That's many times 2 minutes. We should not have to do this manually.
page? And do we also add {{Category:Female chemists}} and {{Category:Polish physicists}} and {{Category:Polish women}} to the same article?
You could, and the software (small matter of programming) would be smart enough to take the superset of all these things:
Person +chemist +Polish
Person +female +chemist
Person +physicist +Polish
Person +female +Polish
Net result: Person +chemist +physicist +female +Polish
Alternatively if you knew the attributes directly you could just do {{Polish chemists}} +female +physicist
_Or_ editors could simply add all the attributes and forget about the template. Attributes on [[de:Marie Curie]] (I'm not making this up):
woman
chemist
physicist
polish
(+ some more)
I guess splitting woman into person and female seemed too awkward to the Germans. Wimps.
Roger
On 6/6/06, Roger Luethi collector@hellgate.ch wrote:
I see two major problems with the status quo:
- multi-concept categories (American child actors) force us to maintain a complex system of subcategories (but they paper over shortcomings in the software). The German WP shows it doesn't have to be this way, but it might be difficult to convince people on WP:en until Mediawiki can create intersections
Convincing people won't just require that intersections can be created by hand, but that the common intersections can be kept easily at hand.
The benefit would be great, though. You could get to "Polish female nobel prize winners" by navigating "People -> Polish people -> Female Polish People -> Polish female nobel prize winners" or "People -> Females -> Polish females -> Polish female nobel prize winners" or any of the other combinations.
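As a quick illustration of "any of the other combinations", this sketch enumerates every navigation path to the same intersection by adding one attribute at a time in every possible order. The function name and the "+attribute" node labels are made up for the example.

```python
from itertools import permutations

def navigation_paths(root, attributes):
    """List every path from root to the full intersection, one attribute per step."""
    paths = []
    for order in permutations(attributes):
        path = [root]
        chosen = []
        for attribute in order:
            chosen.append(attribute)
            # Label each node by the sorted attributes chosen so far,
            # so the same intersection always gets the same name.
            path.append(root + " +" + " +".join(sorted(chosen)))
        paths.append(path)
    return paths

paths = navigation_paths("People", ["Polish", "female", "Nobel prize winner"])
for path in paths:
    print(" -> ".join(path))
```

With three attributes there are six routes, all ending on the same page, and none of them would need to be maintained by hand if intersections were computed.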
- categories with unclear relations that are used for everything
On 6/6/06, Roger Luethi collector@hellgate.ch wrote:
With the current software, that could be implemented as:
[[Paris]]
[[Category:is a city]]
[[Category:is a (is there a name for "city or town or village or some other place"?)]]
[[Category:in France]] (or [[Category:located in France]])
[[Category:in Europe]]
Okay.
Why do I get the feeling the current structure probably already looks a lot like that, but with different names?
What I see is (I added more information for illustration):
[[Britney Spears]]
[[Category:is a singer]]
[[Category:is an entertainer]]
[[Category:is a person]]
[[Category:is a child actor]]
[[Category:is an entertainer]]
[[Category:is a person]]
[[Category:is alive]]
[[Category:is female]]
[[Category:born 1981]]
[[Category:born in McComb, Mississippi]] (that kind of category would be hard to maintain manually)
Good, good - why is that cat hard to maintain manually?
Here are some other fun (existing!) categories from said article:
Worst Actress Razzie: "won a Worst Actress Razzie"
Soubrettes: "is a soubrette", I guess!?
American child actors: ...? "is a child actor" or "was a child actor?"
Yes. The fact that these are hard to express clearly is very telling.
Hollywood Walk of Fame: Ugh.
High school dropouts: Ahm?
Dropped out of high school?
[[Pont du Gard]]
[[Category:is an aqueduct]]
[[Category:is a bridge]]
[[Category:is a construction]]
[[Category:in France]]
[[Category:built by Romans]] (?)
Yep, why not. (Built by Ancient Romans if you prefer...)
Basically what I'm proposing now is keeping taxonomies quite strict, and allowing greater flexibility in attributes. So we'll always know whether an item is a soccer team or a city, but we may lose information on the finer details if the attributes aren't managed
Examples?
Erm, I mean, people will probably end up being "casual" with attributes...but if we could make the taxonomic classifications a bit more firm...not sure what I'm getting at (it's late).
I don't quite understand your second example. "Rattus rattus" "is a" "Rattus" is ok isn't it?
It is. But "species" is not a genus. Like city is not a country.
Yes, but I'm not sure what your point is - are you talking about "species" the article? Of course it shouldn't belong to "genus" the category...probably missing something.
I see two major problems with the status quo:
- multi-concept categories (American child actors) force us to maintain a complex system of subcategories (but they paper over shortcomings in the software). The German WP shows it doesn't have to be this way, but it might be difficult to convince people on WP:en until MediaWiki can create intersections
Yay, how hard can it be?
- categories with unclear relations that are used for everything
Like [[Category:Lasers]]? ;)
We can fix both problems without changes to the software (but it still comes at a cost). However, we are dangerously close to inventing a poor man's version of a semantic wiki.
Would a fully fledged semantic wiki ever work on Wikipedia scale?
My impression is that the German WP is pretty close to that. But categories are also an important navigation aid, and that's where WP:de falls short.
Trust the Germans to be organised! Obviously the shortcoming is the lack of intersections though. You know, what would be really awesome would be seeing at the bottom of each article:
Categories: Bridges, in France, built by Romans
See other: Bridges in France (200 articles), Bridges built by Romans (137 articles), Bridges in France built by Romans (15 articles).
Is that feasible?
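To get a feel for what that footer would involve, here is a rough offline sketch. It assumes an inverted index from attribute to article titles already exists; the `INDEX` contents (apart from Pont du Gard) and the `see_other` helper are invented for illustration, and the counts are toy numbers, not the real ones.

```python
from itertools import combinations

# Hypothetical inverted index: attribute -> set of article titles carrying it.
INDEX = {
    "Bridges": {"Pont du Gard", "Bridge A", "Bridge B"},
    "in France": {"Pont du Gard", "Bridge A", "Castle C"},
    "built by Romans": {"Pont du Gard", "Bridge B"},
}


def see_other(article_attrs, index, min_size=2):
    """List every combination of two or more of the article's attributes with its article count."""
    results = []
    for size in range(min_size, len(article_attrs) + 1):
        for combo in combinations(article_attrs, size):
            # Articles that carry every attribute in this combination.
            members = set.intersection(*(index[a] for a in combo))
            results.append((combo, len(members)))
    return results


links = see_other(["Bridges", "in France", "built by Romans"], INDEX)
```

Each entry in `links` is one candidate "See other" link together with the number of articles it would lead to.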
Where's {{Category:Polish chemists}} coming from? Defined on a separate
Defined on a separate page by someone who thought it was a meaningful and useful category, and worth spending 2 minutes making.
Number of countries: > 200
Occupations: hundreds
Not every country has people in every occupation, but these are just two attributes. That's many times 2 minutes. We should not have to do this manually.
Ok, I think we got sidetracked. I was saying you could put the relevant attributes directly on a page, and manually search for the appropriate intersection. However, people could also create templates, or links to predefined groupings of attributes, if they find them interesting and relevant.
(Note that this would allow people to create "Jewish mass murderers" categories easily :))
_Or_ editors could simply add all the attributes and forget about the template. Attributes on [[de:Marie Curie]] (I'm not making this up):
Ok, you did understand, I was just unclear.
woman chemist physicist polish (+ some more)
I guess splitting woman into person and female seemed too awkward to the Germans. Wimps.
Heh, what do they do for young girls?
Steve
On Wed, 07 Jun 2006 00:39:52 +0200, Steve Bennett wrote:
On 6/6/06, Roger Luethi collector@hellgate.ch wrote:
With the current software, that could be implemented as:
[[Paris]]
[[Category:is a city]]
[[Category:is a (is there a name for "city or town or village or some other place"?)]]
[[Category:in France]] (or [[Category:located in France]])
[[Category:in Europe]]
Okay.
Why do I get the feeling the current structure probably already looks a lot like that, but with different names?
Because it does. Where it's different is that the category names make it much clearer what does and what does not belong into the category, the catch-all "thematic" categories are eliminated (or made explicit by saying "is related to").
What I see is (I added more information for illustration):
[[Britney Spears]]
[[Category:is a singer]]
[[Category:is an entertainer]]
[[Category:is a person]]
[[Category:is a child actor]]
[[Category:is an entertainer]]
[[Category:is a person]]
[[Category:is alive]]
[[Category:is female]]
[[Category:born 1981]]
[[Category:born in McComb, Mississippi]] (that kind of category would be hard to maintain manually)
Good, good - why is that cat hard to maintain manually?
Because for every place someone was born in, you have to recreate the hierarchy of geography, which is large. An automatic system could infer from existing information that [[Category:born in McComb, Mississippi]] also means, say, [[Category:born in Mississippi]], or if you want to be really fancy, that being born in Prague implies a different country depending on the year of birth.
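A small sketch of that inference, assuming a place hierarchy has been entered once. The `PARENT` table and both helper names are hypothetical; the fancy Prague case (country depending on year of birth) is deliberately left out.

```python
# Hypothetical "is part of" table for places; built once and reused everywhere.
PARENT = {
    "McComb, Mississippi": "Mississippi",
    "Mississippi": "United States",
}


def implied_places(place):
    """Return the place followed by every enclosing place, nearest first."""
    chain = [place]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def born_in_categories(place):
    """Expand one hand-entered birthplace into all the 'born in' categories it implies."""
    return [f"born in {p}" for p in implied_places(place)]


categories = born_in_categories("McComb, Mississippi")
```

The editor only ever types the most specific place; the broader categories fall out of the hierarchy.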
Here are some other fun (existing!) categories from said article:
Worst Actress Razzie: "won a Worst Actress Razzie"
Soubrettes: "is a soubrette", I guess!?
American child actors: ...? "is a child actor" or "was a child actor?"
Yes. The fact that these are hard to express clearly is very telling.
What does it tell you?
Hollywood Walk of Fame: Ugh.
High school dropouts: Ahm?
Dropped out of high school?
I guess <shrug>.
Erm, I mean, people will probably end up being "casual" with attributes...but if we could make the taxonomic classifications a bit more firm...not sure what I'm getting at (it's late).
The easiest way to make them more firm is by making them self-explanatory. Otherwise it's not unreasonable for people to assume that a category is thematic.
It is. But "species" is not a genus. Like city is not a country.
Yes, but I'm not sure what your point is - are you talking about "species" the article? Of course it shouldn't belong to "genus" the category...probably missing something.
Never mind. It wasn't a great example, Europe-France-Paris was better.
I see two major problems with the status quo:
- multi-concept categories (American child actors) force us to maintain a complex system of subcategories (but they paper over shortcomings in the software). The German WP shows it doesn't have to be this way, but it might be difficult to convince people on WP:en until MediaWiki can create intersections
Yay, how hard can it be?
Convincing or intersections? Hard enough, either way.
- categories with unclear relations that are used for everything
Like [[Category:Lasers]]? ;)
Exactly. If it was called [[Category:is a Laser]], there would be less confusion.
comes at a cost). However, we are dangerously close to inventing a poor man's version of a semantic wiki.
Would a fully fledged semantic wiki ever work on Wikipedia scale?
Of course. The question is not if, but when. Evaluating all relations and attributes on-the-fly may be way out there, but you could use it for offline-processing today, and that's what _you_ and many other people seem to be interested in.
Semantic Wiki does have challenges:
* The use needs to extend beyond nice statistics. Editors must see an immediate benefit. There are major concerns about hidden metadata that is invisible in the Wikipedia proper (unless you check the source, that is).
* It must try to prevent an ontology mess that we have with categories. It can't ever be the same mess by virtue of its very own nature, but you can for instance create confusion with "Relation:Is located in", "Relation:Located in", and "Relation:Has location".
Check out, for example: http://wiki.ontoworld.org/index.php/Help:Relation You will find it eerily familiar :-).
lack of intersections though. You know, what would be really awesome would be seeing at the bottom of each article:
Categories: Bridges, in France, built by Romans
See other: Bridges in France (200 articles), Bridges built by Romans (137 articles), Bridges in France built by Romans (15 articles).
It does sound cool, but if you have a dozen categories on an article rather than three, you have more intersection categories than you want to put at the end of the article.
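To put a number on "more than you want": if every combination of two or more attributes counts as a possible intersection category, n attributes give 2**n - n - 1 of them. A quick check (the function name is just for illustration):

```python
from math import comb


def intersection_count(n):
    """Number of ways to pick two or more of n attributes."""
    return 2**n - n - 1


def intersection_count_by_sum(n):
    """Same quantity, counted size by size, as a sanity check on the formula."""
    return sum(comb(n, k) for k in range(2, n + 1))


counts = {n: intersection_count(n) for n in (3, 12)}
```

Three attributes give 4 possible intersections; a dozen give 4083, so some kind of filtering or ranking would clearly be needed.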
Is that feasible?
Not with the current mess, but other than that, I can't see why not.
woman chemist physicist polish (+ some more)
I guess splitting woman into person and female seemed too awkward to the Germans. Wimps.
Heh, what do they do for young girls?
Female child actors are tagged "Frau", "Kinderstar" ("woman", "child star"). So they use "woman" to mean "female human being" (even though Frau implies "adult", as "woman" does in English).
(who would have guessed that the German WP has an article about an American girl (born 1996) who played in a bunch of TV shows, including Star Trek, while the English WP has no article? I am shocked, shocked I tell you!)
Roger
On 6/7/06, Roger Luethi collector@hellgate.ch wrote:
Because it does. Where it's different is that the category names make it much clearer what does and what does not belong into the category, the catch-all "thematic" categories are eliminated (or made explicit by saying "is related to").
Maybe the way to go is have only attributes, but then make certain attributes themselves belong to thematic categories. Something like:
The Outsider +Novel +By Albert Camus +Existentialist
Then make a thematic category "Existentialism" into which the constructed (that is, automatically built by an intersection) category "Existentialist novels" is grouped.
Not a good example come to think of it.
Should we try to work out what the basic point of thematic categories is? The problem is that the articles in them are related in such different ways that it would seem much better to have a page that actually explains the links, and fits them in in a way that makes sense: existentialist authors, existentialist works, history of existentialism etc. Even a structured way of displaying items in subcategories on the page itself would be handy.
Just to continue this example, look at [[Category:Existentialism]]. There are subcategorical conceptual articles ("Existential desire"), related conceptual articles (Absurdism), works (Exile and the Kingdom), and miscellaneous (Kierkegaard and Nietzsche comparisons). Then there's the subcategory "Existentialists" where we find Sartre and Camus. And even better, a perfect example of a thematic subcategory to a taxonomic category to a thematic category: "Søren Kierkegaard", which mostly contains his works, but also mentions a research centre and the man himself.
From a navigational point of view it would be so much more helpful to have a single page that quickly defined the topic, linked to the key articles about the topic, then presented sections "Existentialist authors" that transcluded the relevant category, "Existentialist works" that did the same, and finally ended with "related categories". If we had that, then Wikipedia would be really starting to get somewhere as a structured encyclopaedia, as opposed to simply a massive number of articles on various topics.
Good, good - why is that cat hard to maintain manually?
Because for every place someone was born in, you have to recreate the hierarchy of geography, which is large. An automatic system could infer from existing information that [[Category:born in McComb, Mississippi]] also means, say, [[Category:born in Mississippi]], or if you want to be really fancy, that being born in Prague implies a different country depending on the year of birth.
Hmm, well apart from the Prague bit, it would certainly be cool to be able to hijack a geographical structure that had been built once to create other attributes/categories. You would need something fancy like categorical operators though, which is definitely getting more advanced and complicated.
Category:Bridges in France would become: Category: (is a bridge) (in) (place:France). Since (place:Lyon) would be defined once and for all as being a subcategory (subplace?), any bridge in Lyon would automatically be a bridge in France.
To solve your example, a different "operator" (born in) would be used, combined with a place. The end result is to replace a number of parallel hierarchies with a single one combined with several different operators:
in (place)
born in (place)
died in (place)
famous in (place) ?
buried in (place) ?
banned in (place) ?
One obvious advantage is that people could immediately start assigning articles to categories like "Born in New York" or even "Born in Queens", without losing them from the broader categories "American musicians" etc.
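Roughly what I have in mind, as a sketch only: one place hierarchy, and articles tagged with (operator, place) pairs. The article titles, the `SUBPLACE_OF` table and the `find` helper are all invented for illustration.

```python
# Hypothetical place hierarchy, defined once and shared by every operator.
SUBPLACE_OF = {"Lyon": "France", "Queens": "New York", "New York": "United States"}


def is_within(place, region):
    """True if place equals region or lies inside it according to SUBPLACE_OF."""
    while place is not None:
        if place == region:
            return True
        place = SUBPLACE_OF.get(place)
    return False


# Each article carries (operator, place) pairs instead of pre-combined categories.
ARTICLES = {
    "Example bridge": [("in", "Lyon")],
    "Example musician": [("born in", "Queens"), ("died in", "Lyon")],
}


def find(operator, region):
    """Return the articles that have `operator` applied to any place inside `region`."""
    return sorted(
        title
        for title, relations in ARTICLES.items()
        if any(op == operator and is_within(place, region) for op, place in relations)
    )
```

So `find("in", "France")` picks up the Lyon bridge without anyone having tagged it "in France", and the same hierarchy serves "born in" and "died in" unchanged.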
American child actors: ...? "is a child actor" or "was a child actor?"
Yes. The fact that these are hard to express clearly is very telling.
What does it tell you?
That the categories are badly designed and ambiguous.
Erm, I mean, people will probably end up being "casual" with attributes...but if we could make the taxonomic classifications a bit more firm...not sure what I'm getting at (it's late).
The easiest way to make them more firm is by making them self-explanatory. Otherwise it's not unreasonable for people to assume that a category is thematic.
Yep. If the category doesn't tell you exactly what it's about, it must be "random stuff with some tangential connection to X" :)
Like [[Category:Lasers]]? ;)
Exactly. If it was called [[Category:is a Laser]], there would be less confusion.
Actually [[Category:Types of lasers]] would be better - "blue laser" is not really "a laser". Very few individual lasers would deserve entries.
Hard to see how "laser" would be a good name for a thematic category though. But where else would laser eye surgery or light sabers go?
Of course. The question is not if, but when. Evaluating all relations and attributes on-the-fly may be way out there, but you could use it for offline-processing today, and that's what _you_ and many other people seem to be interested in.
Depends what "on the fly" means. "Each time an attribute is updated" would be one thing. "Each time a user requests" it would be another. The former would fit the mediawiki model a lot better.
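A sketch of the "each time an attribute is updated" variant: all the work happens when an article is saved, and a reader's request is only a lookup. The `AttributeIndex` class is hypothetical and has nothing to do with how MediaWiki actually stores categories.

```python
from collections import defaultdict


class AttributeIndex:
    """Inverted index that is kept current on every save, not on every read."""

    def __init__(self):
        self._by_article = {}              # article -> set of attributes
        self._by_attr = defaultdict(set)   # attribute -> set of articles

    def update(self, article, attrs):
        """Called when an article is saved: adjust only the attributes that changed."""
        attrs = set(attrs)
        old = self._by_article.get(article, set())
        for removed in old - attrs:
            self._by_attr[removed].discard(article)
        for added in attrs - old:
            self._by_attr[added].add(article)
        self._by_article[article] = attrs

    def intersection(self, *attrs):
        """Called on a reader's request: intersect the stored sets, nothing is re-scanned."""
        sets = [self._by_attr.get(a, set()) for a in attrs]
        return set.intersection(*sets) if sets else set()


index = AttributeIndex()
index.update("Marie Curie", ["Polish", "chemist", "physicist", "female"])
index.update("Pont du Gard", ["bridge", "in France"])
polish_chemists = index.intersection("Polish", "chemist")
```

The cost of an edit is proportional to the number of attributes that changed, which is why this seems a better fit than recomputing per request.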
Semantic Wiki does have challenges:
- The use needs to extend beyond nice statistics. Editors must see an immediate benefit. There are major concerns about hidden metadata that is invisible in the Wikipedia proper (unless you check the source, that is).
Clearer groupings, better portals, automatic navigational boxes, more powerful categories (like "German-born Polish scientists buried in France (2 articles)")
- It must try to prevent an ontology mess that we have with categories. It can't ever be the same mess by virtue of its very own nature, but you can for instance create confusion with "Relation:Is located in", "Relation:Located in", and "Relation:Has location".
Redirects solve so many of these problems.
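For what it's worth, a sketch of the redirect idea applied to relation names. It assumes the three names really are meant as synonyms (arguably "Has location" is something else), and the `REDIRECTS` table and `canonical` helper are invented here.

```python
# Hypothetical redirect table: variant relation name -> preferred name.
REDIRECTS = {
    "Is located in": "Located in",
    "Has location": "Located in",
}


def canonical(name):
    """Follow redirects until a name with no redirect is reached; stop on a loop."""
    seen = set()
    while name in REDIRECTS and name not in seen:
        seen.add(name)
        name = REDIRECTS[name]
    return name


names = [canonical(n) for n in ("Is located in", "Located in", "Has location")]
```

Whatever an editor types, the stored relation ends up under one name.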
Categories: Bridges, in France, built by Romans
See other: Bridges in France (200 articles), Bridges built by Romans (137 articles), Bridges in France built by Romans (15 articles).
It does sound cool, but if you have a dozen categories on an article rather than three, you have more intersection categories than you want to put at the end of the article.
Some way where editors can guide the system would be valuable. Some way where power users can ignore the editors' recommendations and see all the possible intersections (ok, within reason) would be nice.
Is that feasible?
Not with the current mess, but other than that, I can't see why not.
We seem to have a couple of different proposals bubbling away, each requiring different amounts of work to implement. I'll see if I can summarise somewhere and synthesise something palatable, to really work out feasibility.
(who would have guessed that the German WP has an article about an American girl (born 1996) who played in a bunch of TV shows, including Star Trek, while the English WP has no article? I am shocked, shocked I tell you!)
And the English Paris article is better than the French one, and the German New York article is better than the English one.
Steve
On 6/7/06, Steve Bennett stevagewp@gmail.com wrote:
On 6/7/06, Roger Luethi collector@hellgate.ch wrote:
Because it does. Where it's different is that the category names make it much clearer what does and what does not belong into the category, the catch-all "thematic" categories are eliminated (or made explicit by saying "is related to").
Maybe the way to go is have only attributes, but then make certain attributes themselves belong to thematic categories. Something like:
The Outsider +Novel +By Albert Camus +Existentialist
I was considering once in passing the structural and technical challenges of formatting a WP-like project in which an underlying structure of "Facts" and categorizations of those facts was set up underneath the articles. Each Fact would have attributes like references, links, etc. Articles would have prioritized lists of relevant facts, with a system to allow rating and adding and tagging the facts at the meta level.
Articles could then be written based on a snapshot of the agreed-upon facts base for the particular article.
One of the additional uses would be as an input database to various inference engines.
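To make the idea slightly more concrete, a very rough sketch of what such a structure might look like. Every name here (`Fact`, `ArticleOutline`, the fields, the rating scale) is an assumption for illustration, and the references are placeholders, not real citations.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    statement: str
    references: list = field(default_factory=list)  # citations backing the statement
    tags: set = field(default_factory=set)           # meta-level tags on the fact
    rating: int = 0                                  # hypothetical community rating


@dataclass
class ArticleOutline:
    title: str
    facts: list = field(default_factory=list)        # (priority, Fact); lower = more prominent

    def add(self, priority, fact):
        self.facts.append((priority, fact))

    def snapshot(self, min_rating=0):
        """Statements with at least min_rating, most prominent first."""
        chosen = [(p, f) for p, f in self.facts if f.rating >= min_rating]
        return [f.statement for p, f in sorted(chosen, key=lambda pair: pair[0])]


outline = ArticleOutline("RMS Titanic")
outline.add(2, Fact("RMS Titanic sank in 1912", ["(reference placeholder)"], {"event"}, rating=5))
outline.add(1, Fact("RMS Titanic was a ship", ["(reference placeholder)"], {"is a"}, rating=5))
outline.add(3, Fact("An unreviewed claim", rating=-1))
agreed = outline.snapshot(min_rating=0)
```

An article would then be written against `agreed`, and the same facts could be exported elsewhere.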
At a technical and operational level, it's a completely different project concept than Wikipedia, even though the end result might be similar from the article standpoint.
It worries me that the startup effort required probably is impractical.
It seems to me that a lot of the categorization analysis I'm seeing here is similar to a half-attempt to set up such a structure project.
It's interesting to me that conventional online references and encyclopedias don't seem to have attempted this. That's not a comment that it might not be a good thing, but they may not have tried because it's too hard, not because they didn't think of it.
Part of the hard part seems to be that getting the initial framing and conceptual rules down close enough to right is a lot harder than the initial energy required to start something like Wikipedia, and even borrowing WP resources and making it part of WP to do it as a WP adjunct in this category structure may be impractically hard.
Can you or others in this thread address your thoughts on how to move this into the realm of the practical? I would sort of like to be wrong on the near-term practicality of an approach like this, since it seems extremely useful in the long term.
Hi. I know it's been mentioned before, but can I strongly suggest everyone speculating about how this might work have a look at the Semantic MediaWiki demo here ->
http://wiki.ontoworld.org/index.php/Main_Page
It needs a lot of work (transitive relations in particular), but it's an interesting start.
On 6/6/06, Roger Luethi collector@hellgate.ch wrote:
I guess the reason I am only mildly interested in hierarchies is that many interesting attributes (dead/alive, colors, professions) don't fit well into hierarchies. I think the real power comes from combining attributes.
The German WP is much closer to that. For instance, they don't have categories like "Polish Chemists". They only have the attribute categories "Polish" and "Chemist". From a practical point of view, that's less usable than what we have (they basically need to use CatScan which is fairly limited, and casual users don't know about it anyway). But it's conceptually cleaner, and they are in a better position for making interesting experiments.
That brings up another, longer term, to-do for categories: they should be language independent. For instance [[Marie Curie]] is in de: and en: (they happen to have the same title, but even if they don't they are linked via interwiki links). [[Kategorie:Pole]] is linked to [[Category:Polish people]]. So there should be no need to categorize Marie Curie twice (multiply by the actual number of languages which have a Polish people category and an article on Marie Curie).
This is pretty simple theoretically. The only real problem is getting the multiple category schemes in sync. Considering your point about how the German categorization scheme differs from the English one, this might be a lot harder in practice than it is in theory.
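The theoretical part can be sketched in a few lines, assuming both article and category interwiki links are available as lookup tables. The table names, the `import_categories` helper and the "Chemiker" entry (standing in for a German category with no linked English counterpart) are illustrative assumptions.

```python
# Hypothetical interwiki tables.
ARTICLE_LINKS = {("de", "Marie Curie"): ("en", "Marie Curie")}
CATEGORY_LINKS = {("de", "Pole"): ("en", "Polish people")}

# Categories as entered on the German article; "Chemiker" has no link in this toy data.
DE_ARTICLE_CATEGORIES = {"Marie Curie": ["Pole", "Chemiker"]}


def import_categories(de_title):
    """Carry a German article's categories over to its English counterpart where links exist."""
    en_article = ARTICLE_LINKS.get(("de", de_title))
    mapped, unmapped = [], []
    for cat in DE_ARTICLE_CATEGORIES.get(de_title, []):
        target = CATEGORY_LINKS.get(("de", cat))
        if target is None:
            unmapped.append(cat)       # the schemes are out of sync here
        else:
            mapped.append(target[1])
    return en_article, mapped, unmapped


result = import_categories("Marie Curie")
```

The `unmapped` list is where the practice gets harder than the theory: every entry is a place where the two schemes disagree.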
Your definitions of taxonomies and attributes need work :-).
Heh :) Input welcome! I think the distinction between "taxonomy" and "attribute" is probably a sliding scale. It comes down to what is natural. Do we really think in terms of "Nobel laureates"? I doubt it
Combining rigid rules with common sense is hard. I am tempted to quote your line about inevitable disaster.
Personally I don't see the difference between taxonomies and attributes, as described. But I suppose one (taxonomies?) could be described as partitioning (an article can only be in one taxonomy category) whereas attributes can be mixed. Under that definition though, all taxonomies are attributes (but not vice-versa). I'm not sure how close that definition is to reality though.
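One way to test how close the partitioning definition is to reality would be to check it mechanically. A sketch, with made-up sample data and a made-up helper name:

```python
def partition_violations(taxonomy_of):
    """Return the articles that are not in exactly one taxonomy category."""
    return {article: cats for article, cats in taxonomy_of.items() if len(cats) != 1}


# Hypothetical sample: article -> set of taxonomy categories it has been given.
sample = {
    "Example ship": {"Ships"},
    "Example polymath": {"Musicians", "Scientists"},
    "Uncategorised stub": set(),
}
violations = partition_violations(sample)
```

If the list of violations on real data is short, the partition definition is workable; if it is long, taxonomies are just attributes by another name.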
Anthony
On 6/6/06, Anthony DiPierro wikilegal@inbox.org wrote:
Heh :) Input welcome! I think the distinction between "taxonomy" and "attribute" is probably a sliding scale. It comes down to what is natural. Do we really think in terms of "Nobel laureates"? I doubt it
Combining rigid rules with common sense is hard. I am tempted to quote your line about inevitable disaster.
Personally I don't see the difference between taxonomies and attributes, as described. But I suppose one (taxonomies?) could be described as partitioning (an article can only be in one taxonomy category) whereas attributes can be mixed. Under that definition though, all taxonomies are attributes (but not vice-versa). I'm not sure how close that definition is to reality though.
I'm inclined to think that in practice they pretty much work the same way. However, semantically, I would really like to distinguish concrete, significant, basic categories like "Ships" from much less salient, significant facts like "Born in 1793" or "Winners of Golden Raspberries". However, I suspect that even the most basic taxonomies will have bastard children with two parents: Something could be both a sport and a television show. Someone could be both a musician and a scientist. But maybe the "one taxonomy per article except in strange cases" goal is reasonable?
Steve
On Tue, 06 Jun 2006 14:07:23 +0200, Steve Bennett wrote:
I'm inclined to think that in practice they pretty much work the same way. However, semantically, I would really like to distinguish concrete, significant, basic categories like "Ships" from much less salient, significant facts like "Born in 1793" or "Winners of Golden Raspberries".
Significant? Salient? You don't want to go there.
I'd argue the most significant thing about the Titanic is "major disaster" ("Shipwrecks in the Atlantic Ocean", actually). It's not significant or salient for being a ship, but for taking 1500 people down.
And what is more salient about Halle Berry, being a woman (or actress, for all I care) _or_ having won a "Worst Actress Razzie"? Well?
I'm afraid your "I know it when I see it" approach to identifying taxonomic categories is hopelessly POV.
However, I suspect that even the most basic taxonomies will have bastard children with two parents: Something could be both a sport and a television show. Someone could be both a musician and a scientist. But maybe the "one taxonomy per article except in strange cases" goal is reasonable?
Hardly. For starters, many people have held several jobs in their life. Halle Berry is also a model. Albert Einstein was also a Patent Clerk. Duke Ellington was a composer, bandleader, and pianist. Etc. pp.
Roger
On 6/6/06, Roger Luethi collector@hellgate.ch wrote:
Significant? Salient? You don't want to go there.
I'd argue the most significant thing about the Titanic is "major disaster" ("Shipwrecks in the Atlantic Ocean", actually). It's not significant or salient for being a ship, but for taking 1500 people down.
But at the end of the day, it's a ship. Indisputably so. You would certainly want to put an attribute "shipwrecked" on it, and possibly "shipwrecked in 1903" or whatever. Instead of "significant" and "salient" how about "concrete"? The fact that it's a ship is concrete and essential. The fact that it shipwrecked is ancillary.
(why do I feel like I'm getting talked into trying to reinvent the field of semantics from scratch)
And what is more salient about Halle Berry, being a woman (or actress, for all I care) _or_ having won a "Worst Actress Razzie"? Well?
I didn't know she had. Woman.
I'm afraid your "I know it when I see it" approach to identifying taxonomic categories is hopelessly POV.
Not many people are going to deny that Halle is a woman, or that Titanic was a ship.
Hardly. For starters, many people have held several jobs in their life. Halle Berry is also a model. Albert Einstein was also a Patent Clerk. Duke Ellington was a composer, bandleader, and pianist. Etc. pp.
Halle Berry is primarily an actress. Einstein was primarily a physicist. Ellington was primarily a jazz musician. But I see your point, maybe the taxonomies should stop at "person", and the rest can be attributes.
Steve
On Tue, 06 Jun 2006 22:09:28 +0200, Steve Bennett wrote:
"salient" how about "concrete"? The fact that it's a ship is concrete and essential. The fact that it shipwrecked is ancillary.
That is POV. My POV is that Titanic is remembered as a tragic disaster. Now the Titanic _is_ of course a ship and not a disaster, but that has nothing to do with significant, salient, or concrete.
(why do I feel like I'm getting talked into trying to reinvent the field of semantics from scratch)
Alternatively, you can read the pertinent literature and come back with a solid proposal :-P.
And what is more salient about Halle Berry, being a woman (or actress, for all I care) _or_ having won a "Worst Actress Razzie"? Well?
I didn't know she had. Woman.
POV. There are billions of women. Only a few people got a Razzie. It's much more remarkable. In my opinion, anyhow.
I'm afraid your "I know it when I see it" approach to identifying taxonomic categories is hopelessly POV.
Not many people are going to deny that Halle is a woman, or that Titanic was a ship.
Not many will deny either that the Titanic disaster was one of the most remarkable shipwrecks in history, nor that Halle Berry won a Razzie. What's your point?
What I'm trying to show here is that I don't think I could spot with some certainty what you consider taxonomy in articles we haven't talked about. It seems you really mean everything of type "is a"; "Winners of Golden Raspberries" should really be "won a Golden Raspberry".
The basic "is a" categories I see on Berry's article: actor (including Bond girls, etc., maybe Worst Actress Razzie), model (including Beauty pageant contestants, Versace models, etc.), African-American(?). Of course you could say that occupations are "works as" relationships and everything falls apart.
Halle Berry is primarily an actress. Einstein was primarily a physicist. Ellington was primarily a jazz musician. But I see your point, maybe the taxonomies should stop at "person", and the rest can be attributes.
What does "stopping" entail? I mean, how would WP be different if we did, or didn't stop at person?
Roger
On 06/06/06, Steve Bennett stevagewp@gmail.com wrote:
On 6/6/06, Roger Luethi collector@hellgate.ch wrote:
Significant? Salient? You don't want to go there.
I'd argue the most significant thing about the Titanic is "major disaster" ("Shipwrecks in the Atlantic Ocean", actually). It's not significant or salient for being a ship, but for taking 1500 people down.
But at the end of the day, it's a ship. Indisputably so. You would certainly want to put an attribute "shipwrecked" on it, and possibly "shipwrecked in 1903" or whatever. Instead of "significant" and "salient" how about "concrete"? The fact that it's a ship is concrete and essential. The fact that it shipwrecked is ancillary.
But what's our *article* about? It's evenly split between an article on the ship, and an article on the sinking of the ship - if the name of the Titanic herself hadn't entered into common usage, there'd be a good argument for calling the article [[1912 sinking of RMS Titanic]].
(Compare [[Exxon Valdez]] and [[Exxon Valdez oil spill]] - I'm surprised they're not merged)
On 06/06/06, Anthony DiPierro wikilegal@inbox.org wrote:
That brings up another, longer term, to-do for categories: they should be language independent. For instance [[Marie Curie]] is in de: and en: (they happen to have the same title, but even if they don't they are linked via interwiki links). [[Kategorie:Pole]] is linked to [[Category:Polish people]]. So there should be no need to categorize Marie Curie twice (multiply by the actual number of languages which have a Polish people category and an article on Marie Curie).
Hmm... it won't work well.
Basically, there is no hard and fast en:Article <-> de:Artikel relationship, there's no single "meta topic" which manifests itself in specific articles in different languages. For some things, like people, it does appear so; for others, it'll break down.
This is partly due to the incomplete nature of the project, but also because different language communities - which, especially for languages like German and Polish, represent individual and reasonably distinct cultures in a way that en: doesn't - will naturally have different emphasis, there'll be different levels of coverage and different approaches to fragmenting articles.
Let's say, oh, [[History of Country]].
In one language, this might be a single article. In another, time-divided articles (overview; ancient history; history to 1500; 1500 to 1900; modern history). In a third, it might be a thematic divide (political history; religious history; military history; overview).
What combination of categories would work best for *all* of these pages?
On Tue, 06 Jun 2006 07:54:27 -0400, Anthony DiPierro wrote:
That brings up another, longer term, to-do for categories: they should be language independent. For instance [[Marie Curie]] is in de: and en: (they happen to have the same title, but even if they don't they are linked via interwiki links). [[Kategorie:Pole]] is linked to [[Category:Polish people]]. So there should be no need to categorize Marie Curie twice (multiply by the actual number of languages which have a Polish people category and an article on Marie Curie).
http://meta.wikimedia.org/wiki/Wikidata
This is pretty simple theoretically. The only real problem is getting the multiple category schemes in sync. Considering your point about how the German categorization scheme differs from the English one, this might be a lot harder in practice than it is in theory.
Right. You _could_ make it work on some categories, say "Women" or "Nobel Prize winners".
It cannot possibly work for all categories, though, because different languages know different categories. For instance, security and safety are the same word in German: Sicherheit. Thus, [[de:Lifebelt]] is in the German category that is linked to the English category Security.
Numbers are much easier to share. Population of a country. Weight of a molecule.
Personally I don't see the difference between taxonomies and attributes, as described. But I suppose one (taxonomies?) could be described as partitioning (an article can only be in one taxonomy category) whereas attributes can be mixed. Under that definition
I suspect this is impossible. But it's hard to tell if you're not even sure what counts as a taxonomy.
Roger
On 6/6/06, Roger Luethi collector@hellgate.ch wrote:
On Tue, 06 Jun 2006 07:54:27 -0400, Anthony DiPierro wrote:
That brings up another, longer term, to-do for categories: they should be language independent. For instance [[Marie Curie]] is in de: and en: (they happen to have the same title, but even if they don't they are linked via interwiki links). [[Kategorie:Pole]] is linked to [[Category:Polish people]]. So there should be no need to categorize Marie Curie twice (multiply by the actual number of languages which have a Polish people category and an article on Marie Curie).
http://meta.wikimedia.org/wiki/Wikidata
From what I know of that project, it's a lot more extravagant than what I'm talking about. Infoboxes are a lot more complicated than categories. Categories are just sets of articles, and the interwiki links are already there. Infoboxes have multiple types of fields, with various constraints on each of them, and various translation issues most of which haven't even been begun.
This is pretty simple theoretically. The only real problem is getting the multiple category schemes in sync. Considering your point about how the German categorization scheme differs from the English one, this might be a lot harder in practice than it is in theory.
Right. You _could_ make it work on some categories, say "Women" or "Nobel Prize winners".
It cannot possibly work for all categories, though, because different languages know different categories. For instance, security and safety are the same word in German: Sicherheit. Thus, [[de:Lifebelt]] is in the German category that is linked to the English category Security.
I'd say then that either the German [[Kategorie:Sicherheit]] should be disambiguated into two different categories, or that [[Kategorie:Sicherheit]] shouldn't be linked to [[Category:Security]], because they don't define the same set. Maybe an English [[Category:Security and Safety]] could be made, with "Security" and "Safety" as subcats - then [[Kategorie:Sicherheit]] could link to [[Category:Security and Safety]].
But maybe this is a common enough thing that that's not going to be reasonable. At some point someone should look at how the en categories differ from the de ones. I figure there will be 5 major points of difference:
1) Things being categorized at different levels ([[Category:Polish women]] vs. [[Kategorie:Frau]]).
2) Interwiki links between articles on different things.
3) Interwiki links between categories defining different sets.
4) Articles categorized where they shouldn't be.
5) Articles missing from categories where they should be.
1) is the reason why I call this a "longer-term solution". 2 and/or 3 are what you describe above. 4 and 5 are the reason why it would be useful to coordinate things.
It'd be interesting to get a decent size sample and sort the differences into those 5 categories. If 2 and/or 3 were significant, then I suppose this idea fails, at least initially. If 1 is significant, and I think it might be, then the idea rests upon reaching a consensus among the different Wikipedias as to what level to put things. And that's probably going to be dependent on having on-the-fly intersection categories.
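The sampling exercise described above could start from something like the following. This is only a minimal sketch: the article titles other than Marie Curie are placeholders, the function name is made up, and it assumes a one-to-one interwiki mapping (which point 2 says does not always hold). It cannot tell point 4 from point 5; it only surfaces the disagreements for a human to sort.

```python
def compare_linked_categories(en_members, de_members, en_to_de):
    """Sort members of two interwiki-linked categories into rough buckets."""
    # Assumes the interwiki mapping is one-to-one, so it can be inverted.
    de_to_en = {de: en for en, de in en_to_de.items()}
    report = {"agree": set(), "only_en": set(), "only_de": set(), "unlinked": set()}
    for en_title in en_members:
        de_title = en_to_de.get(en_title)
        if de_title is None:
            report["unlinked"].add(en_title)      # no article in the other language
        elif de_title in de_members:
            report["agree"].add(en_title)         # categorized the same way in both
        else:
            report["only_en"].add(en_title)       # candidate for point 4 or 5
    for de_title in de_members:
        en_title = de_to_en.get(de_title)
        if en_title is None:
            report["unlinked"].add(de_title)
        elif en_title not in en_members:
            report["only_de"].add(de_title)       # candidate for point 4 or 5
    return report


# Hypothetical sample data, for illustration only.
en_to_de = {"Marie Curie": "Marie Curie", "Article B": "Artikel B", "Article C": "Artikel C"}
en_members = {"Marie Curie", "Article B", "Article D"}
de_members = {"Marie Curie", "Artikel C", "Artikel E"}
report = compare_linked_categories(en_members, de_members, en_to_de)
```

The "only_en" and "only_de" buckets are the ones that would need manual review to decide which of the five points applies.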
Numbers are much easier to share. Population of a country. Weight of a molecule.
Personally I don't see the difference between taxonomies and attributes, as described. But I suppose one (taxonomies?) could be described as partitioning (an article can only be in one taxonomy category) whereas attributes can be mixed. Under that definition
I suspect this is impossible. But it's hard to tell if you're not even sure what counts as a taxonomy.
Roger
Yeah, I don't know. I'm not the one who made the initial distinction between taxonomies and attributes. I was just trying to make sense of it, rather poorly I guess. :)
Anthony
On Tue, 06 Jun 2006 17:36:07 -0400, Anthony DiPierro wrote:
I'd say then that either the German [[Kategorie:Sicherheit]] should be disambiguated into two different categories, or that [[Kategorie:Sicherheit]] shouldn't be linked to [[Category:Security]], because they don't define the same set. Maybe an English [[Category:Security and Safety]] could be made, with "Security" and "Safety" as subcats - then [[Kategorie:Sicherheit]] could link to [[Category:Security and Safety]].
These are all possible solutions. However, Sicherheit is not exactly security+safety. It's just much closer than to security alone. I picked that example because it's rather extreme. Here's another one: there are two German words for technology: Technik and Technologie (each has a separate category in the German WP).
More importantly, though, many translated words mean almost the same, but not exactly, and some articles will be the corner cases that will make the category in one language but not in the other.
You would need a mechanism to exclude categories that won't translate well enough even though an interwiki link exists between them. Opt-in or opt-out, I don't know.
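A minimal sketch of the opt-out variant, assuming categorization is copied from en: to de: through existing interwiki links. The function name, the "Firewall" article and the data layout are assumptions made for illustration; nothing like this exists in MediaWiki as far as this thread establishes.

```python
def propagate_categories(en_article_cats, article_links, category_links, opt_out=frozenset()):
    """Derive de: categories from en: ones, skipping opted-out categories."""
    derived = {}
    for en_title, cats in en_article_cats.items():
        de_title = article_links.get(en_title)
        if de_title is None:
            continue  # no interwiki link for the article, nothing to propagate
        for cat in cats:
            if cat in opt_out:
                continue  # category pair known not to translate well enough
            de_cat = category_links.get(cat)
            if de_cat is not None:
                derived.setdefault(de_title, set()).add(de_cat)
    return derived


# Hypothetical data: category links as described in the thread, articles invented.
en_article_cats = {"Marie Curie": {"Polish people"}, "Firewall": {"Security"}}
article_links = {"Marie Curie": "Marie Curie", "Firewall": "Firewall"}
category_links = {"Polish people": "Pole", "Security": "Sicherheit"}

without_opt_out = propagate_categories(en_article_cats, article_links, category_links)
with_opt_out = propagate_categories(
    en_article_cats, article_links, category_links, opt_out={"Security"}
)
```

An opt-in scheme would be the same loop with the test inverted: propagate only when the category is in an explicit allow list.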
But maybe this is a common enough thing that that's not going to be reasonable. At some point someone should look at how the en categories differ from the de ones. I figure there will be 5 major points of difference:
1) Things being categorized at different levels ([[Category:Polish women]] vs. [[Kategorie:Frau]]).
2) Interwiki links between articles on different things.
3) Interwiki links between categories defining different sets.
4) Articles categorized where they shouldn't be.
5) Articles missing from categories where they should be.
1) is the reason why I call this a "longer-term solution". 2 and/or 3 are what you describe above. 4 and 5 are the reason why it would be useful to coordinate things.
It'd be interesting to get a decent size sample and sort the differences into those 5 categories. If 2 and/or 3 were significant, then I suppose this idea fails, at least initially. If 1 is
2) is significant if only because the German WP tends to merge subjects into one article, so you have several English articles pointing to the same German one.
3) Extremes are rare, but subtle differences are pretty much inevitable.
6) Different understanding of what categories should exist.
Roger
On 6/6/06, Roger Luethi collector@hellgate.ch wrote:
On Tue, 06 Jun 2006 17:36:07 -0400, Anthony DiPierro wrote:
I'd say then that either the German [[Kategorie:Sicherheit]] should be disambiguated into two different categories, or that [[Kategorie:Sicherheit]] shouldn't be linked to [[Category:Security]], because they don't define the same set. Maybe an English [[Category:Security and Safety]] could be made, with "Security" and "Safety" as subcats - then [[Kategorie:Sicherheit]] could link to [[Category:Security and Safety]].
These are all possible solutions. However, Sicherheit is not exactly security+safety. It's just much closer than to security alone. I picked that example because it's rather extreme. Here's another one: there are two German words for technology: Technik and Technologie (each has a separate category in the German WP).
[[Kategorie:Technologie]] isn't very full, and I didn't dive into enough that I can say for sure which solution would be best, but this seems to be just the reverse of the previous example.
More importantly, though, many translated words mean almost the same, but not exactly, and some articles will be the corner cases that will make the category in one language but not in the other.
You would need a mechanism to exclude categories that won't translate well enough even though an interwiki link exists between them. Opt-in or opt-out, I don't know.
Well, for the moment I'll accept these criticisms, because I simply don't have any solid numbers to throw back.
But maybe this is a common enough thing that that's not going to be reasonable. At some point someone should look at how the en categories differ from the de ones. I figure there will be 5 major points of difference:
1) Things being categorized at different levels ([[Category:Polish women]] vs. [[Kategorie:Frau]]).
2) Interwiki links between articles on different things.
3) Interwiki links between categories defining different sets.
4) Articles categorized where they shouldn't be.
5) Articles missing from categories where they should be.
1) is the reason why I call this a "longer-term solution". 2 and/or 3 are what you describe above. 4 and 5 are the reason why it would be useful to coordinate things.
It'd be interesting to get a decent size sample and sort the differences into those 5 categories. If 2 and/or 3 were significant, then I suppose this idea fails, at least initially. If 1 is
2) is significant if only because the German WP tends to merge subjects into one article, so you have several English articles pointing to the same German one.
3) Extremes are rare, but subtle differences are pretty much inevitable.
6) Different understanding of what categories should exist.
Roger
Regarding 2), I assume the accepted solution is to wikilink the merged article to all the individual ones? Wouldn't it make more sense to wikilink the redirect, which covers the individual concept? Are wikilinks between redirects and articles even allowed (I guess I should just test this one)?
6) isn't a problem in the scheme I'm considering, because categories which exist in one language but not in another simply wouldn't be translated.
In any case, I guess it's back to the drawing board unless I can get some solid numbers that suggest it isn't as big a problem as you say.
Anthony
Anthony DiPierro wrote:
That brings up another, longer term, to-do for categories: they should be language independent. For instance [[Marie Curie]] is in de: and en: (they happen to have the same title, but even if they don't they are linked via interwiki links). [[Kategorie:Pole]] is linked to [[Category:Polish people]]. So there should be no need to categorize Marie Curie twice (multiply by the actual number of languages which have a Polish people category and an article on Marie Curie).
This is pretty simple theoretically. The only real problem is getting the multiple category schemes in sync. Considering your point about how the German categorization scheme differs from the English one, this might be a lot harder in practice than it is in theory.
I think that this would be very difficult across projects, if only because the terminology is different, and there is a cultural element to determining distinctions.
To some extent I have dealt with some of this on the English Wiktionary, as a byproduct of its covering all words in all languages, but with English definitions. Category names there would also all be in English. Thus [[Category:Mammals]] could have its usual hierarchical subcategories [[Category:Dogs]], [[Category:Rabbits]], etc. It would also have [[Category:de:Mammals]], [[Category:ja:Mammals]], etc. which could be developed in a parallel manner for those other languages. To facilitate sorting the language codes would always be lower cased, and the categories themselves would be upper cased.
Your definitions of taxonomies and attributes need work :-).
Heh :) Input welcome! I think the distinction between "taxonomy" and "attribute" is probably a sliding scale. It comes down to what is natural. Do we really think in terms of "nobel laureates"? I doubt it
Combining rigid rules with common sense is hard. I am tempted to quote your line about inevitable disaster.
Personally I don't see the difference between taxonomies and attributes, as described. But I suppose one (taxonomies?) could be described as partitioning (an article can only be in one taxonomy category) whereas attributes can be mixed. Under that definition though, all taxonomies are attributes (but not vice-versa). I'm not sure how close that definition is to reality though.
It's best not to get hung up on pedantic distinctions.
Ec
Steve Bennett wrote:
On 6/4/06, Roger Luethi collector@hellgate.ch wrote:
I've personally run into this when trying to automatically create, for example, a list of all Wikipedia articles on people. You can't just start at [[category:people]] and work your way down, because you wind up going to [[Category:Women]] (fine, all women are people) then [[Category:Feminine hygiene]] (bad).
Okay, that's equivalent to Steve's Amelie-Paris relation. I agree that's a problem.
The problem there, now that I think about it, is that Paris should not be in the category "Paris" (as was pointed out by someone else).
Amelie should be in the category "Paris".
The article Paris should be in the category "European capitals" (say).
The article Paris should not be in the category "Paris".
The category "Paris" should not be in the category "European capitals".
Actually, even simply obeying this last rule would solve it: Paris *the article* belongs in the taxonomic category (Paris *is a* European capital), but "Paris" the category is thematic, so should only belong to thematic categories: maybe "Europe" in this case.
That actually seems to fix the problem. I saw this with The Beatles for example. John Lennon was in category "The Beatles" (thematic), and that category was in "British rock bands" (taxonomic), leading to the conclusion that Lennon was a British rock band.
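The Beatles case can be sketched as a tree walk that refuses to descend from a taxonomic category into a thematic one. This is a minimal sketch: the "taxonomic"/"thematic" labels are an assumption (categories carry no such marker today), and the data is a toy reconstruction of the example.

```python
from collections import deque

# Hypothetical labels and memberships, reconstructed from the Beatles example.
category_kind = {"British rock bands": "taxonomic", "The Beatles": "thematic"}
subcategories = {"British rock bands": ["The Beatles"]}
articles = {"British rock bands": ["The Beatles"], "The Beatles": ["John Lennon"]}


def collect_articles(root, taxonomic_only):
    """Breadth-first walk from root, gathering every article reached."""
    seen = {root}
    found = set()
    queue = deque([root])
    while queue:
        cat = queue.popleft()
        found.update(articles.get(cat, []))
        for sub in subcategories.get(cat, []):
            if sub in seen:
                continue
            if taxonomic_only and category_kind.get(sub) != "taxonomic":
                continue  # do not follow thematic subcategories
            seen.add(sub)
            queue.append(sub)
    return found


naive = collect_articles("British rock bands", taxonomic_only=False)
strict = collect_articles("British rock bands", taxonomic_only=True)
```

The naive walk concludes that John Lennon is a British rock band; the strict walk returns only the band itself.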
You can't avoid multiple categorizations. The category "European capitals" would not exclude Paris from the category "Cities of France". The Beatles would still be in "1960s rock bands". Do you need to prioritize time over place? Eponymous recursions are really a very minor problem since they never involve more than a single article in a category.
Ec
Anthony DiPierro wrote:
On 6/3/06, Roger Luethi collector@hellgate.ch wrote:
On Sat, 03 Jun 2006 19:54:27 +0200, Steve Bennett wrote:
I'm probably not the only one who envisages all the wonderful things that could be done with this massive collection of information that is Wikipedia, *if only* we could do something clever with the categories. And then you realise that you can't really do anything clever because "category" has all sorts of different meanings to different people.
Agreed. Still: can you give some specific examples of wonderful things that could be done but are not possible now? That would tell us what problem you are trying to solve.
I've personally run into this when trying to automatically create, for example, a list of all Wikipedia articles on people. You can't just start at [[category:people]] and work your way down, because you wind up going to [[Category:Women]] (fine, all women are people) then [[Category:Feminine hygiene]] (bad).
A high level category like people does not need to have direct elements. To be simplistic about it, it would do fine with only two sub-categories, men and women. An element of a sub-category is an element of the superset category.
Categories based on such intersections of attributes are conceptually bad. Look at the categories for an article like [[Marie Curie]]: She's French three times, female four times, Polish four times (not counting "Natives of Warsaw"), etc. Why not create [[Category:Polish women who were born in 1867 and died in 1934 and won a Nobel Prize in Chemistry and in Physics]]?
Because there would only be one person in that category.
Such a category would be theoretically acceptable but totally impractical. There is an element of art to the design of category hierarchies. A category that's too narrow (like your example) is unfindable; you simply never know which ones exist. At the other extreme, if the category is too broad it becomes more difficult to find things within it. In Wiktionary people have established [[Category:English nouns]] which now has numerous elements, but what user would ever look there to find something? The purpose of categories is to help the passive user to find things. It requires some idea of which Googling strategies work and which don't, and how to modify a strategy which initially doesn't work. Just think of what works when you are searching for something.
In my mind a category should not have more than 200 direct elements, this being the number of items that will appear on a single page by default when we ask for a category to be listed. Anything longer should be subdivided. Even so, a person should have the option to have an "include sub-categories" to a determinable level when listing the contents of a higher level category.
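Both halves of that suggestion can be sketched briefly: a listing that includes sub-categories down to a chosen depth, and a check for categories with more than 200 direct elements. This is a minimal sketch with made-up names and toy data; it is not how MediaWiki lists categories.

```python
PAGE_LIMIT = 200  # direct elements shown on one category page, per the text above


def list_category(tree, members, root, max_depth):
    """List members of root plus its sub-categories down to max_depth levels."""
    result = []
    visited = set()

    def walk(cat, depth):
        if cat in visited:
            return  # guard against loops in the category graph
        visited.add(cat)
        result.extend(members.get(cat, []))
        if depth < max_depth:
            for sub in tree.get(cat, []):
                walk(sub, depth + 1)

    walk(root, 0)
    return result


def oversized(members, limit=PAGE_LIMIT):
    """Names of categories whose direct elements exceed the limit."""
    return sorted(cat for cat, items in members.items() if len(items) > limit)


# Hypothetical toy hierarchy.
tree = {"Mammals": ["Dogs", "Rabbits"], "Dogs": ["Terriers"]}
members = {
    "Mammals": ["mammal"],
    "Dogs": ["dog"],
    "Rabbits": ["rabbit"],
    "Terriers": ["terrier"],
    "English nouns": [f"noun{i}" for i in range(201)],
}

top_only = list_category(tree, members, "Mammals", max_depth=0)
one_level = list_category(tree, members, "Mammals", max_depth=1)
too_big = oversized(members)
```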
If we don't have a term for (or an article about) it, there probably shouldn't be a category for it, either (I'm sure a determined mind could come up with an exception).
If the category system could effectively build these intersection categories on the fly, I'd agree. But the category system can't currently do that. (And it's been around a reasonably long time, with that as an obvious flaw, and no one has fixed it.)
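Building an intersection on the fly amounts to a set intersection over an inverted index. A minimal sketch, assuming each article carries only simple attribute categories; the category names and the articles other than Marie Curie are placeholders.

```python
from collections import defaultdict


def build_index(article_categories):
    """Invert article -> categories into category -> set of articles."""
    index = defaultdict(set)
    for article, cats in article_categories.items():
        for cat in cats:
            index[cat].add(article)
    return index


def intersection(index, *categories):
    """Articles that are in every one of the named categories."""
    if not categories:
        return set()
    return set.intersection(*(index.get(cat, set()) for cat in categories))


# Hypothetical data, for illustration only.
article_categories = {
    "Marie Curie": {"Polish people", "Women", "Nobel laureates in Physics"},
    "Article B": {"Polish people"},
    "Article C": {"Women", "Nobel laureates in Physics"},
}
index = build_index(article_categories)
polish_women = intersection(index, "Polish people", "Women")
```

With this in place, a category such as "Polish women" never has to be maintained by hand; it is computed when somebody asks for it.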
I suggested something of the sort before categories were implemented, but more from the searching end. The real problem is with the search function, which is remarkably unsophisticated for a project the size of Wikipedia.
Attributes: The category exists to denote some very specific small detail of a subject, such that it would be conceivable to have dozens or more such categories on an article. Examples: 1943 deaths, Living persons, Winners of Nobel Peace Prize, etc. These tend to hierarchies that start strict then end up fuzzy. Eg, 1943 deaths is only in 1943 and "1940s deaths", and these have parent categories of "1940s","Years" and so forth, eventually ending up in "History", whereupon things become chaos.
There is no way to make hierarchies not suck, especially if you have to maintain them manually (as we do now). Don't try to impose hierarchies unless they emerge quite naturally from the subject.
I made a proposal. All subcategories of attributes must be a subset of the parent attribute. Seems like a perfectly reasonable way to make hierarchies not suck.
It's an idea that I have tried to implement for some time at Wiktionary. The difficulty with such hierarchies is that they require people to think logically, and to be able to trace a path back to a single top level hierarchy.
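The subset rule can at least be checked mechanically where we know which articles each category is meant to describe. A minimal sketch: the "extension" mapping (category to the set of articles it truly describes) is an assumption, since no such data is recorded anywhere today, and the hygiene article title is a placeholder.

```python
def subset_violations(edges, extension):
    """Find (parent, child) pairs where the child is not a subset of the parent."""
    violations = []
    for parent, child in edges:
        stray = extension.get(child, set()) - extension.get(parent, set())
        if stray:
            violations.append((parent, child, stray))
    return violations


# Hypothetical data based on the People / Women example earlier in the thread.
edges = [("People", "Women"), ("Women", "Feminine hygiene")]
extension = {
    "People": {"Marie Curie", "John Lennon"},
    "Women": {"Marie Curie"},
    "Feminine hygiene": {"Some hygiene article"},
}
violations = subset_violations(edges, extension)
```

Here "Women" passes as a subcategory of "People", while "Feminine hygiene" is flagged because its members are not women.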
Ec
Roger Luethi wrote:
On Sat, 03 Jun 2006 19:54:27 +0200, Steve Bennett wrote:
I'm probably not the only one who envisages all the wonderful things that could be done with this massive collection of information that is Wikipedia, *if only* we could do something clever with the categories. And then you realise that you can't really do anything clever because "category" has all sorts of different meanings to different people.
Agreed. Still: can you give some specific examples of wonderful things that could be done but are not possible now? That would tell us what problem you are trying to solve.
So far I have identified four rough types of categories. I'll invent the notion a(X) to mean that article X is in category a. a(b(X)) means that a is a subcategory of b, and X is in b.
ITYM "b is a subcategory of a".
Taxonomies: Tend to end in "s" and satisfy the rule that "If a(X) then X is an a" is a logical sentence. Tend to form strict hierarchies, where if a(X) and b(a), then it's perfectly natural and normal that b(a(X)). Eg, Bridges in France is a subcat of Bridges, and every entry in "bridges in France" is definitely a Bridge. It's rare for an article to be in more than two taxonomic categories at once.
"Bridges in France" may not be the best example. "Bridges in France" is just an intersection of two attributes ("in France", "Bridges"), and their relative position in a hierarchy is undefined. Hence more than one hierarchy: You can drill down "France" ... "Buildings and structures in France" or "Bridges" ... "Bridges by country".
<snip>
There are a few proposals on how to do this; most of them are in the "See also" section of http://en.wikipedia.org/wiki/Wikipedia:Category_math_feature, specifically CatScan http://tools.wikimedia.de/~daniel/WikiSense/CategoryIntersect.php and the MediaWiki extension DynamicPageList http://meta.wikimedia.org/wiki/DynamicPageList; an extended version, DPL2, is described at http://meta.wikimedia.org/wiki/DynamicPageList2.
Incidentally, I realised one great thing that could be done with a hierarchical taxonomy: The very top branch can be "Fictional" vs "Real". Easy way to prune out huge sections of fancruft :)
Steve
On 6/7/06, Steve Bennett stevagewp@gmail.com wrote:
Incidentally, I realised one great thing that could be done with a hierarchical taxonomy: The very top branch can be "Fictional" vs "Real". Easy way to prune out huge sections of fancruft :)
Steve
Not really. I don't see what such a division gains. You still have a demarcation problem. Is [[Darth Vader]] fictional or real? On the one hand he's obviously fictional, since such a fellow never was born, lived, fell to the dark side, and was redeemed; on the other hand, you can't say he is completely unreal and unmoored from reality, because there are countless real manifestations of him: physical objects like comics, books, movies, costumes with actors in them, etc.
~maru 'Shuzan held out his short staff and said, "If you call this a short staff, you oppose its reality. If you do not call it a short staff, you ignore the fact. Now what do you wish to call this?"'