With the advent of __HIDDENCAT__, I've been wondering about using hidden categories to create indexes. My initial hope with Wikipedia was that we could reorganize categories so that they could function as broad indexes of single attributes such as "People", "Films", "Bridges", etc., and hide all the intersection categories of parents. Later, if and when category intersection was implemented, all the hidden categories would no longer be needed. However, implementing major changes seems to be near impossible in a project as large and set in its ways as Wikipedia. There is just too much resistance to change. If category intersection were implemented there would be a compelling technical reason to make the change, but short of that upgrade, it seems like a very difficult -- if not impossible -- sell.
It really bothers me (and others, especially librarians), that Wikipedia is not indexed. You cannot find a master index of People, places, books, films, etc... To find anything you have to know in advance, where it is subcategorized. This only works if you know where to browse, and you only want to browse in a small, well-defined place. One of the big joys of libraries is the ability to find things you didn't know about in broad swaths of knowledge. This ability is often lacking in Wikipedia because categories are constantly broken into smaller pieces. For example, if I want to browse through the bridges in Europe, I have to look at a category for each country separately, and in some countries (like the UK) I have to look at one for each county. It is just too difficult and time-consuming a task to be a pleasurable leisurely browse.
So I've been thinking of alternative approaches. One possibility is to use hidden categories to create index categories. For instance, [[Category:Index-Films]] could contain all films, [[Category:Index-People]] could contain all people, etc... However, this would be difficult to maintain because the categories would be hidden, and it would take a tremendous amount of work to populate these categories. It seems crazy to have people doing all the mindless busywork necessary to create categories like these. That is why we have computers.
This is where developers come in...
I'm wondering about creating a new namespace, called (you guessed it) INDEX. Any category of people could be put in an index by adding [[Index:People]] on the category page. The "People" INDEX page, into which the category gets put, would have links to all the articles and subcategories from the categories in the INDEX. The contents of the subcategories of those categories would NOT be added automatically. Each would have to be manually added to the index if appropriate. Just like a category, there would be text that could be edited for each INDEX page. So in essence, an INDEX is a way to do category unions. This would be much, much easier than trying to create and maintain these indexes manually using categories.
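To make the "category union" idea concrete, here is a minimal sketch in Python. It is not MediaWiki code: the CATEGORY_MEMBERS table, the category names and the build_index function are all hypothetical, and the sketch assumes an index only collects the direct members of the categories tagged into it, as proposed above.

```python
from typing import Dict, List, Set

# Hypothetical data: category -> its direct members (articles and subcategories).
CATEGORY_MEMBERS: Dict[str, List[str]] = {
    "Category:Bridges in France": [
        "Pont Neuf",
        "Millau Viaduct",
        "Category:Bridges in Paris",
    ],
    "Category:Bridges in Paris": ["Pont Neuf", "Pont Alexandre III"],
    "Category:Bridges in Italy": ["Rialto Bridge", "Ponte Vecchio"],
}


def build_index(tagged_categories: List[str]) -> List[str]:
    """Union the direct members of every category tagged into the index.

    Subcategories appear as links, but their contents are NOT pulled in
    unless the subcategory has itself been tagged.
    """
    seen: Set[str] = set()
    for category in tagged_categories:
        seen.update(CATEGORY_MEMBERS.get(category, []))
    return sorted(seen, key=str.casefold)


index_bridges = build_index(
    ["Category:Bridges in France", "Category:Bridges in Italy"]
)
```

Because "Category:Bridges in Paris" was not tagged in this example, it shows up in the index as a link, but "Pont Alexandre III" does not appear until someone adds that category to the index as well.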
It would be great if an INDEX page could be viewed two different ways (and easily switched). The first way would look similar to current categories, showing a category tree at the top, and all the articles below arranged alphabetically. It would also be great to see categories viewed hierarchically, like an index in a book. So the categories would be listed alphabetically and then all the subcategories and articles in the categories would be listed together alphabetically and indented. The categories could be differentiated by making them bold or italic, or by labeling them as categories. If the subcategories have also been included in the index, their contents would also appear indented one more level (this could be closed at first and opened using a "+", the same way category trees look). Users might also be able to set the default number of levels that appear -- perhaps two?
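A rough sketch of the book-style view, again in plain Python with made-up data. The MEMBERS table, the render_index function and the max_levels parameter are assumptions for illustration only. Subcategories that are also in the index are expanded until max_levels is reached, and shown collapsed with a "+" after that.

```python
from typing import Dict, FrozenSet, List

# Hypothetical data: category -> its direct members.
MEMBERS: Dict[str, List[str]] = {
    "Bridges in France": ["Millau Viaduct", "Bridges in Paris"],
    "Bridges in Paris": ["Pont Neuf"],
}


def render_index(tagged: List[str], max_levels: int = 2) -> List[str]:
    """Return the indented lines of a hierarchical index view."""
    tagged_set = set(tagged)
    lines: List[str] = []

    def walk(category: str, level: int, path: FrozenSet[str]) -> None:
        lines.append("  " * level + category + " (category)")
        for member in sorted(MEMBERS.get(category, []), key=str.casefold):
            # Only subcategories that were themselves added to the index
            # can be expanded; the path check avoids looping on cycles.
            nested = member in tagged_set and member not in path
            if nested and level + 1 < max_levels:
                walk(member, level + 1, path | {member})
            elif nested:
                lines.append("  " * (level + 1) + "+ " + member)
            else:
                lines.append("  " * (level + 1) + member)

    for category in sorted(tagged_set, key=str.casefold):
        walk(category, 0, frozenset({category}))
    return lines


two_levels = render_index(["Bridges in France", "Bridges in Paris"])
one_level = render_index(["Bridges in France", "Bridges in Paris"], max_levels=1)
```

With max_levels=1 the Paris subcategory is listed under France as "+ Bridges in Paris" instead of being expanded, which is roughly the "closed at first" behaviour.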
I don't think there is any need to be able to add anything but categories to an INDEX. Adding anything else would probably make it harder to maintain the INDEX, and would probably confuse newbies. Of course, you should be able to create a link to an index page by typing [[:Index:People|Index of people]].
If you think this idea has merit and is a possibility, would it be difficult to implement? It has long been my understanding that category unions would be much less server intensive than category intersections. Perhaps each INDEX display process could be done dynamically?
Thanks,
Samuel Wantman [[en:User:Sam]]
In general, I agree. Information should be served in a variety of ways: cognitively - human understanding; index - lists; category - "tag" (as MW does it); and semantically - machine and human understanding (obviously there are more ways...).
From a development/architectural point-of-view, I would say that both indexes and categories should be machine generated. Categories are tagged by the "article writer(s)" and indexes should be generated from properties of an article.
The solution to your idea/request exists in the combination of SemanticMediaWiki and the Halo Extension, and in fact implementation could be quite easy: add the semantic properties to the different taxonomy templates. For example, a taxonomy dealing with people could have:
  Name (property): <nameofperson>
  Profession (property): <professionofperson>
and so on.
This would make it possible to create a list of people by name (operator: and/or/(if?)/(while?)/others...) profession.
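As an illustration only, the kind of property-driven list meant here can be sketched in plain Python. This is not the SemanticMediaWiki or Halo API or query syntax; the PAGES records and the index_by_name function are hypothetical stand-ins for pages whose templates set Name and Profession properties.

```python
from typing import Dict, List

# Hypothetical page records, as if extracted from template properties.
PAGES: List[Dict[str, str]] = [
    {"Name": "Ada Lovelace", "Profession": "Mathematician"},
    {"Name": "Isambard Kingdom Brunel", "Profession": "Engineer"},
    {"Name": "Gustave Eiffel", "Profession": "Engineer"},
]


def index_by_name(pages: List[Dict[str, str]], **required: str) -> List[str]:
    """List names of pages whose properties match ALL required values."""
    matches = [
        page
        for page in pages
        if all(page.get(prop) == value for prop, value in required.items())
    ]
    return sorted(page["Name"] for page in matches)


engineers = index_by_name(PAGES, Profession="Engineer")
everyone = index_by_name(PAGES)
```

Here the keyword arguments are combined with "and"; an "or" operator would be a small variation using any() instead of all().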
-- Wiredtape ----------------------- Wiredtape.com - A guide to Bureaucracy.. -Help make life simpler!
Samuel Wantman wrote:
It really bothers me (and others, especially librarians), that Wikipedia is not indexed. You cannot find a master index of People, places, books, films, etc... To find anything you have to know in advance, where it is subcategorized.
Perhaps you need to think again. Perhaps the librarians are wrong. What is a person is not always clear. To what extent is pharaoh Cheops a person, a fictional character or a deity? The same question can be asked of Winston Churchill and Harry Potter. Is London a town or several towns? Is computer science a branch of mathematics or of electrical engineering? Wikipedia doesn't take a clear stand on any of these issues, but remains open to ambiguity. This is not because we have never heard of the hierarchical Dewey Decimal, but because we reject it. It was a necessary tool for organizing printed information in the 20th century. But now you can search.
Nonetheless, the Persondata project does exist, exactly for indexing persons. But it still doesn't include Mr. Cheops.
I realize the lines I quoted from your posting are only an introduction to your new ideas for indexing. These ideas might be quite useful. You should try to implement them. But please don't give us that old "librarians are bothered", because we have been there long ago and we have moved on.
Perhaps you need to think again. Perhaps the librarians are wrong. What is a person is not always clear. To what extent is pharaoh Cheops a person, a fictional character or a deity? The same question can be asked of Winston Churchill and Harry Potter.
Winston Churchill is (was) quite definitely a person...
I'm sure I missed lots of previous discussion, but it should be possible to have an extension add category tags to a page for all of the parents of any category that is added manually. Thus, if you put something in [[Category:Bridges in the UK]], the extension would walk up the subcategory relationships for Category:Bridges in the UK and add [[Category:Bridges]], [[Category:Bridges in EU countries]], etc.
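A minimal sketch of that walk, assuming the extension can look up the parent categories of any category. The PARENTS table and the ancestor_categories function are hypothetical, not an existing MediaWiki hook. The visited set matters because subcategory graphs can contain loops, so the example data includes a deliberate cycle.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical data: category -> the categories it is a subcategory of.
PARENTS: Dict[str, List[str]] = {
    "Bridges in the UK": ["Bridges in EU countries", "Bridges by country"],
    "Bridges in EU countries": ["Bridges"],
    "Bridges by country": ["Bridges"],
    "Bridges": ["Bridges by country"],  # deliberate cycle for the example
}


def ancestor_categories(category: str) -> List[str]:
    """Collect every category reachable by walking up the parent links."""
    seen: Set[str] = set()
    queue = deque([category])
    while queue:
        current = queue.popleft()
        for parent in PARENTS.get(current, []):
            # Skip anything already collected so cycles terminate.
            if parent not in seen and parent != category:
                seen.add(parent)
                queue.append(parent)
    return sorted(seen)


implied_tags = ancestor_categories("Bridges in the UK")
```

An article placed in "Bridges in the UK" would then also be tagged with each category in implied_tags.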
This makes the top level categories so large as to be pretty useless for browsing, but that might be OK... except for the problem of how subcategories are displayed in the Category pages. That is to say, they're not, unless they are in the return of the initial limited SQL query. Solving that has been bandied about in both recent and older threads, and it seems to be nontrivial on the Wikipedia scale.
This also leads to massive issues about whether Categories in Wikipedia are a well-formed ontology (which is a fancy way of expressing Lars Aronsson's reply). I'm barely conversant in ontologies through my participation in Gene Ontology activities as a newbie, but my gut reaction is .... not even close.
Jim
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
===================================== Jim Hu Associate Professor Dept. of Biochemistry and Biophysics 2128 TAMU Texas A&M Univ. College Station, TX 77843-2128 979-862-4054
On Thu, Feb 28, 2008 at 3:39 AM, Samuel Wantman wantman@earthlink.net wrote:
I'm wondering about creating a new namespace, called (you guessed it) INDEX. Any category of people could be put in an index by adding [[Index:People]] on the category page. The "People" INDEX page, into which the category gets put, would have links to all the articles and subcategories from the categories in the INDEX. The contents of the subcategories of those categories would NOT be added automatically. Each would have to be manually added to the index if appropriate. Just like a category, there would be text that could be edited for each INDEX page. So in essence, an INDEX is a way to do category unions. This would be much, much easier than trying to create and maintain these indexes manually using categories.
So you're basically suggesting manually-created but automatically-populated category unions. Category unions are not so hard to do on the backend. They aren't great, though, if you want to retrieve in sorted order. It's possible to do so if you're okay with some fairly sharp restrictions, like unioning a max of three categories. But in MySQL, I'm not sure there'd be an efficient way to union a *large* number of categories and retrieve the results in sorted order.
For a small number of categories, you can just do a MySQL UNION, like this:
mysql> EXPLAIN (SELECT * FROM categorylinks WHERE cl_to='Living_people' ORDER BY cl_sortkey LIMIT 200)
       UNION ALL
       (SELECT * FROM categorylinks WHERE cl_to='Vegetables' ORDER BY cl_sortkey LIMIT 200)
       ORDER BY cl_sortkey LIMIT 200;
+------+--------------+---------------+------+-------------------------+------------+---------+-------+--------+----------------+
| id   | select_type  | table         | type | possible_keys           | key        | key_len | ref   | rows   | Extra          |
+------+--------------+---------------+------+-------------------------+------------+---------+-------+--------+----------------+
| 1    | PRIMARY      | categorylinks | ref  | cl_sortkey,cl_timestamp | cl_sortkey | 257     | const | 543730 | Using where    |
| 2    | UNION        | categorylinks | ref  | cl_sortkey,cl_timestamp | cl_sortkey | 257     | const | 31     | Using where    |
| NULL | UNION RESULT | <union1,2>    | ALL  | NULL                    | NULL       | NULL    | NULL  | NULL   | Using filesort |
+------+--------------+---------------+------+-------------------------+------------+---------+-------+--------+----------------+
3 rows in set (0.04 sec)
This filesorts, but only a limited number of rows: the maximum number of rows times the number of categories. This is potentially acceptable (although undesirable) for a small number of categories in the union, especially if the limit (in this case 200) is small, say more like 20. For a large number of categories with a reasonable limit size you could easily be talking filesorts of thousands of rows, which isn't really acceptable.
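For anyone who wants to play with the shape of this query without a MediaWiki database, here is a small sketch using Python's built-in sqlite3 module. It is an assumption-laden stand-in: the table is a cut-down categorylinks with only three columns, the rows are invented, the union_page helper is hypothetical, and SQLite needs each limited branch wrapped in a subquery, so the SQL differs slightly from the MySQL above and says nothing about MySQL's query plan.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
# Simplified stand-in for MediaWiki's categorylinks table.
conn.execute(
    "CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT, cl_sortkey TEXT)"
)
conn.execute("CREATE INDEX idx_cl_sortkey ON categorylinks (cl_to, cl_sortkey)")
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, ?, ?)",
    [
        (1, "Living_people", "Adams"),
        (2, "Living_people", "Baker"),
        (3, "Living_people", "Clark"),
        (4, "Vegetables", "Artichoke"),
        (5, "Vegetables", "Beet"),
    ],
)


def union_page(categories: List[str], limit: int) -> List[str]:
    """First `limit` sort keys of the union of the given categories."""
    if not categories:
        return []
    # Each branch reads at most `limit` rows in index order, so the final
    # sort sees at most limit * len(categories) rows.
    branch = (
        "SELECT * FROM (SELECT cl_sortkey FROM categorylinks "
        "WHERE cl_to = ? ORDER BY cl_sortkey LIMIT ?)"
    )
    sql = " UNION ALL ".join([branch] * len(categories))
    sql += " ORDER BY cl_sortkey LIMIT ?"
    params: list = []
    for category in categories:
        params += [category, limit]
    params.append(limit)
    return [row[0] for row in conn.execute(sql, params)]


first_page = union_page(["Living_people", "Vegetables"], 3)
```

The bound in the comment is the same one described above: the number of rows handed to the final sort grows with the limit times the number of categories in the union.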
The thing is, I'm pretty sure (although I'm not a computer science whiz) that MySQL should be able to use a merge sort here, rather than an explicit sort. That might be acceptably fast. You'd still have to scan a lot of index rows, but at least you wouldn't have to sort them. I don't know if there's any way to get it to do a merge sort here, though.
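To illustrate what a merge would buy, here is a sketch done on the application side in Python rather than inside MySQL. It assumes each category's rows can already be read in cl_sortkey order (which is what the index provides); the sample lists and the merged_page function are made up for the example.

```python
import heapq
from itertools import islice
from typing import Iterable, List

# Hypothetical per-category results, each already sorted by sort key.
living_people = ["Adams", "Baker", "Clark", "Davis"]
vegetables = ["Artichoke", "Beet", "Carrot"]
bridges = ["Avon Bridge", "Brooklyn Bridge"]


def merged_page(sorted_categories: Iterable[Iterable[str]], limit: int) -> List[str]:
    """Lazily merge already-sorted streams and keep the first `limit` rows.

    heapq.merge holds one pending row per stream, so no full sort of the
    combined rows is needed.
    """
    streams = [iter(rows) for rows in sorted_categories]
    return list(islice(heapq.merge(*streams), limit))


page = merged_page([living_people, vegetables, bridges], 4)
```

The merge stops as soon as the page is full, so it reads only about limit rows plus one pending row per category, however large the categories are.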
On Thu, Feb 28, 2008 at 10:06 AM, Ben chuwiey@gmail.com wrote:
The solution to your idea/request exists in the combination of SemanticMediaWiki and the Halo Extension - and in fact, implementation could be quite easy, by adding the semantic properties in the different taxonomy templates.. So for an example taxonomy dealing with people: Name (property) : <nameofperson> Profession (property) : <professionofperson> and so on..
Unfortunately, Semantic MediaWiki is not efficient enough to be enabled on Wikipedia. This kind of problem is very easy to solve inefficiently but hard to do scalably.
On Thu, Feb 28, 2008 at 12:20 PM, Jim Hu jimhu@tamu.edu wrote:
This also leads to massive issues about whether Categories in Wikipedia are a well-formed ontology (which is a fancy way of expressing Lars Aronsson's reply). I'm barely conversant in ontologies through my participation in Gene Ontology activities as a newbie, but my gut reaction is .... not even close.
It has been previously observed that there are quite a few cyclic subcategory relationships on Wikipedia, so if that precludes being a "well-formed ontology", then yeah, it's not.
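Checking for such loops is a standard depth-first search. The sketch below is a generic version in Python, with an invented SUBCATS graph and a hypothetical find_cycle function. It is recursive, so a real category graph would want an iterative version or a raised recursion limit.

```python
from typing import Dict, List, Optional

# Hypothetical data: category -> its subcategories, with one loop.
SUBCATS: Dict[str, List[str]] = {
    "Category:A": ["Category:B"],
    "Category:B": ["Category:C"],
    "Category:C": ["Category:A"],
    "Category:D": [],
}


def find_cycle(subcats: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a list of categories, or None if there is none."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for child in subcats.get(node, []):
            child_state = state.get(child, UNSEEN)
            if child_state == IN_PROGRESS:
                # Child is on the current path, so we have closed a loop.
                return path[path.index(child):] + [child]
            if child_state == UNSEEN:
                cycle = visit(child)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in list(subcats):
        if state.get(node, UNSEEN) == UNSEEN:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


cycle = find_cycle(SUBCATS)
```

A result of None would mean the subcategory graph is at least acyclic, which is a necessary condition for a strict hierarchy but says nothing about whether the categories are otherwise consistent.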