On Thu, Feb 28, 2008 at 3:39 AM, Samuel Wantman wantman@earthlink.net wrote:
I'm wondering about creating a new namespace, called (you guessed it) INDEX. Any category of people could be put in an index by adding [[Index:People]] on the category page. The "People" INDEX page, into which the category gets put, would have links to all the articles and subcategories from the categories in the INDEX. The contents of the subcategories of those categories would NOT be added automatically. Each would have to be manually added to the index if appropriate. Just like a category, there would be text that could be edited for each INDEX page. So in essence, an INDEX is a way to do category unions. This would be much, much easier than trying to create and maintain these indexes manually using categories.
So you're basically suggesting manually-created but automatically-populated category unions. Category unions are not so hard to do on the backend. They aren't great, though, if you want to retrieve in sorted order. It's possible to do so if you're okay with some fairly sharp restrictions, like unioning a max of three categories. But in MySQL, I'm not sure there'd be an efficient way to union a *large* number of categories and retrieve the results in sorted order.
For a small number of categories, you can just do a MySQL UNION, like this:
mysql> EXPLAIN
    -> (SELECT * FROM categorylinks WHERE cl_to='Living_people' ORDER BY cl_sortkey LIMIT 200)
    -> UNION ALL
    -> (SELECT * FROM categorylinks WHERE cl_to='Vegetables' ORDER BY cl_sortkey LIMIT 200)
    -> ORDER BY cl_sortkey LIMIT 200;
+------+--------------+---------------+------+-------------------------+------------+---------+-------+--------+----------------+
| id   | select_type  | table         | type | possible_keys           | key        | key_len | ref   | rows   | Extra          |
+------+--------------+---------------+------+-------------------------+------------+---------+-------+--------+----------------+
| 1    | PRIMARY      | categorylinks | ref  | cl_sortkey,cl_timestamp | cl_sortkey | 257     | const | 543730 | Using where    |
| 2    | UNION        | categorylinks | ref  | cl_sortkey,cl_timestamp | cl_sortkey | 257     | const | 31     | Using where    |
| NULL | UNION RESULT | <union1,2>    | ALL  | NULL                    | NULL       | NULL    | NULL  | NULL   | Using filesort |
+------+--------------+---------------+------+-------------------------+------------+---------+-------+--------+----------------+
3 rows in set (0.04 sec)
This filesorts, but only a limited number of rows: at most the row limit times the number of categories. This is potentially acceptable (although undesirable) for a small number of categories in the union, especially if the limit (in this case 200) is small, say more like 20. For a large number of categories with a reasonable limit size you could easily be talking filesorts of thousands of rows (50 categories at a limit of 200 means sorting up to 10,000 rows), which isn't really acceptable.
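To make the scaling concrete, here is a rough sketch that builds the same kind of UNION ALL query for an arbitrary list of categories and reports the worst-case number of rows the final filesort has to handle. The function name build_union_query is made up for illustration, and the %s placeholders assume the parameter style used by common Python MySQL drivers; nothing here is actual MediaWiki code.

```python
def build_union_query(categories, limit):
    """Build a UNION ALL over categorylinks, one branch per category.

    Hypothetical helper for illustration only. Returns the SQL text,
    the parameter list, and the worst-case filesort row count.
    """
    if not categories:
        raise ValueError("need at least one category")
    limit = int(limit)  # interpolated into the SQL, so force it to an integer

    # Each branch reads at most `limit` rows in cl_sortkey order.
    branch = (
        "(SELECT * FROM categorylinks WHERE cl_to = %s "
        f"ORDER BY cl_sortkey LIMIT {limit})"
    )
    sql = " UNION ALL ".join([branch] * len(categories))
    # The outer ORDER BY is the step that shows up as "Using filesort".
    sql += f" ORDER BY cl_sortkey LIMIT {limit}"

    worst_case_rows = limit * len(categories)
    return sql, list(categories), worst_case_rows


sql, params, worst_case = build_union_query(["Living_people", "Vegetables"], 200)
```

With two categories and a limit of 200 the worst case is 400 rows; the same function with 50 categories gives 10,000, which is where the filesort stops looking cheap.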
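The thing is, I'm pretty sure (although I'm not a computer science whiz) that MySQL should be able to do a merge here, rather than an explicit sort. Each branch of the union already comes off the cl_sortkey index in sorted order, so in principle all that's needed is the merge step of a merge sort: repeatedly take the smallest row from the heads of the sorted inputs. That might be acceptably fast. You'd still have to scan a lot of index rows, but at least you wouldn't have to sort them. I don't know if there's any way to get MySQL to do a merge here, though.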
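For what it's worth, this is the kind of merge I have in mind, sketched in Python rather than inside MySQL. It assumes each category's members arrive already sorted by sort key (as they would off the index); the function name merged_page and the sample data are made up for illustration.

```python
import heapq
from itertools import islice


def merged_page(sorted_lists, limit):
    """Return the first `limit` items of the union of pre-sorted inputs.

    heapq.merge only ever compares the current head of each input, so
    nothing is re-sorted and at most `limit` items are pulled in total
    (plus one look-ahead item per input).
    """
    return list(islice(heapq.merge(*sorted_lists), limit))


# Hypothetical sort keys, each list already in sorted order.
living_people = ["Adams", "Baker", "Clark", "Davis"]
vegetables = ["Beet", "Carrot", "Daikon"]
fruits = ["Apple", "Cherry"]

page = merged_page([living_people, vegetables, fruits], limit=5)
print(page)  # ['Adams', 'Apple', 'Baker', 'Beet', 'Carrot']
```

The cost per returned row grows with the logarithm of the number of categories rather than with the total number of candidate rows, which is why a merge would scale better than a filesort as the union gets wider.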
On Thu, Feb 28, 2008 at 10:06 AM, Ben chuwiey@gmail.com wrote:
The solution to your idea/request exists in the combination of SemanticMediaWiki and the Halo Extension, and in fact implementation could be quite easy: add the semantic properties in the different taxonomy templates. For example, a taxonomy dealing with people could have:
Name (property): <nameofperson>
Profession (property): <professionofperson>
and so on.
Unfortunately, Semantic MediaWiki is not efficient enough to be enabled on Wikipedia. This kind of problem is very easy to solve inefficiently but hard to do scalably.
On Thu, Feb 28, 2008 at 12:20 PM, Jim Hu jimhu@tamu.edu wrote:
This also leads to massive issues about whether Categories in Wikipedia are a well-formed ontology (which is a fancy way of expressing Lars Aronsson's reply). I'm barely conversant in ontologies through my participation in Gene Ontology activities as a newbie, but my gut reaction is .... not even close.
It has been previously observed that there are quite a few cyclic subcategory relationships on Wikipedia, so if that precludes being a "well-formed ontology", then yeah, it's not.
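If anyone wants to check for such cycles on a dump, a depth-first search is enough. The sketch below assumes the subcategory relation has been loaded into a plain dict mapping each category to its direct subcategories; the function name find_cycle and the sample data are made up for illustration.

```python
def find_cycle(subcats):
    """Return one subcategory cycle as a list of names, or None.

    `subcats` maps a category name to an iterable of its direct
    subcategories. Recursive DFS, so very deep hierarchies may need
    an iterative rewrite to stay under Python's recursion limit.
    """
    UNSEEN, ON_PATH, DONE = 0, 1, 2
    state = {}
    path = []  # categories on the current DFS path, in order

    def visit(node):
        state[node] = ON_PATH
        path.append(node)
        for child in subcats.get(node, ()):
            child_state = state.get(child, UNSEEN)
            if child_state == ON_PATH:
                # Back edge: child is an ancestor of node, so we have a cycle.
                return path[path.index(child):] + [child]
            if child_state == UNSEEN:
                found = visit(child)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in list(subcats):
        if state.get(node, UNSEEN) == UNSEEN:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical data: A contains B, B contains C, C contains A.
cyclic = {"A": ["B"], "B": ["C"], "C": ["A"]}
acyclic = {"A": ["B", "C"], "B": ["C"], "C": []}

cycle = find_cycle(cyclic)
no_cycle = find_cycle(acyclic)
```

A non-None result means the category graph is not a DAG, which is the property most ontology tools would expect of a subclass hierarchy.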