So, it seems (if I interpret Jimbo's mail on wikitech and the discussion here correctly) that most of us would like *some kind* of category scheme in wikipedia. I do, too! But, we seem to differ on the details (shocked silence!).
So far, I saw three concepts:
1. Simple categories like "Person", "Event", etc.; about a dozen total.
2. Categories and subcategories, like "Science/Biology/Biochemistry/Proteomics", which can be "scaled down" to #1 as well ("Humankind/Person" or something)
3. Complex object structures with machine-readable meta-knowledge encoded into the articles, which would allow for quite complex queries/summaries, like "biologists born after 1860".
Pros:
1. Easy to edit (the wiki way!)
2. Still easy to edit, but making wikipedia browseable by category, fine-tune Recent Changes, etc.
3. Strong improvement in search functions, meta-knowledge available for data-mining.
Cons:
1. Not much of a help...
2. We'd need to agree on a category scheme, and maintenance might get a *little* complicated.
3. Quite complex to edit (e.g., "<category type='person' occupation='biologist' birth_month='5' birth_day='24' birth_year='1874' birth_place='London' death_month=.....>")
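To make concept #3 a little more concrete, here is a minimal Python sketch of the kind of query it would allow ("biologists born after 1860"). It assumes the single-quoted attribute syntax from the example tag above; the article titles, the attribute values, and the helper names `metadata` and `query` are made up for illustration and are not part of any existing wiki software.

```python
import re

# Hypothetical articles. The tag syntax follows the example in the post;
# titles and attribute values are invented for illustration only.
ARTICLES = {
    "Example Biologist A": "<category type='person' occupation='biologist' birth_year='1874' birth_place='London'> ...",
    "Example Biologist B": "<category type='person' occupation='biologist' birth_year='1822'> ...",
    "Example Painter": "<category type='person' occupation='painter' birth_year='1881'> ...",
    "Example Untagged": "An article without any category tag.",
}

ATTRIBUTE = re.compile(r"(\w+)='([^']*)'")


def metadata(text):
    """Return the attributes of the first <category ...> tag as a dict."""
    tag = re.search(r"<category\b([^>]*)>", text)
    return dict(ATTRIBUTE.findall(tag.group(1))) if tag else {}


def query(articles, occupation, born_after):
    """Titles of articles whose tag matches the occupation and birth year."""
    hits = []
    for title, text in articles.items():
        meta = metadata(text)
        if (meta.get("occupation") == occupation
                and int(meta.get("birth_year", 0)) > born_after):
            hits.append(title)
    return sorted(hits)


print(query(ARTICLES, "biologist", 1860))
```

The query itself is short; the cost sits on the editing side, since every article has to carry a correctly filled-in tag.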
For a wikipedia I'd have to write myself, I'd choose #3, but with respect to the wiki way, #2 seems more likely to achieve consensus (if there is such a thing;-)
Magnus
Magnus Manske magnus.manske@epost.de writes:
Cons:
- Not much of a help...
Yes, that's it. It does not help at all (we already have much too many lists). Better encourage people to write overview articles! ;-)
Better searching features are missing. Wildcards and logical expressions (and/or) would be nice things to have, too (okay, one can use Google, but Google is always some time behind).
- Quite complex to edit (e.g., "<category type='person' occupation='biologist' birth_month='5' birth_day='24' birth_year='1874' birth_place='London' death_month=.....>")
On those issues you can read the documentation on the TEI DTD (Text Encoding Initiative - http://www.tei-c.org).
And how would one categorize artists?
Karl Eichwalder wrote:
- Quite complex to edit (e.g., "<category type='person' occupation='biologist' birth_month='5' birth_day='24' birth_year='1874' birth_place='London' death_month=.....>")
On those issues you can read the documentation on the TEI DTD (Text Encoding Initiative - http://www.tei-c.org).
I took a look at it. The complexity seems over the top for some of the more technophobic people we get here. Some of the markups may even be in conflict with Wiki ways. Besides that it's not clear that it accomplishes the task of developing a useful kind of indexing for our purposes.
Eclecticology
Ray Saintonge saintonge@telus.net writes:
I took a look at it. The complexity seems over the top for some of the more technophobic people we get here.
Yes, it's complex, but very mature.
Some of the markups may even be in conflict with Wiki ways.
One could add namespaces to solve this issue. Okay, that's even more complexity ;)
Besides that it's not clear that it accomplishes the task of developing a useful kind of indexing for our purposes.
Yes, that's the question. Since I don't have that much experience with the wikipedia I don't want to judge.
OTOH, I'd like to base a more specialized wiki on the TEI DTD, just as a proof of concept. Does anybody know of such wiki software? In other words: I'm interested in a wiki implemented as a conforming SGML or XML application (for my own research purposes, not as a replacement for wikipedia).
Magnus Manske wrote:
So, it seems (if I interpret Jimbo's mail on wikitech and the discussion here correctly) that most of us would like *some kind* of category scheme in wikipedia. I do, too! But, we seem to differ on the details (shocked silence!).
So far, I saw three concepts:
1. Simple categories like "Person", "Event", etc.; about a dozen total.
2. Categories and subcategories, like "Science/Biology/Biochemistry/Proteomics", which can be "scaled down" to #1 as well ("Humankind/Person" or something)
3. Complex object structures with machine-readable meta-knowledge encoded into the articles, which would allow for quite complex queries/summaries, like "biologists born after 1860".
Pros:
1. Easy to edit (the wiki way!)
2. Still easy to edit, but making wikipedia browseable by category, fine-tune Recent Changes, etc.
3. Strong improvement in search functions, meta-knowledge available for data-mining.
Cons:
1. Not much of a help...
2. We'd need to agree on a category scheme, and maintenance might get a *little* complicated.
3. Quite complex to edit (e.g., "<category type='person' occupation='biologist' birth_month='5' birth_day='24' birth_year='1874' birth_place='London' death_month=.....>")
For a wikipedia I'd have to write myself, I'd choose #3, but with respect to the wiki way, #2 seems more likely to achieve consensus (if there is such a thing;-)
I want to thank Magnus for summarizing this recent thread. It makes it easier to see where to jump in.
A few months ago I suggested a system of boxes where a person could provide a subject codification or categorization. I also suggested at the same time using the Library of Congress Classification as a starting point, which could be modified to suit our needs. The suggestion did not fare well. Among the objections were that it would require a lot of work to change every article to apply codification, and that people would need to learn a lot of difficult-to-remember codes. When I went so far as to suggest that an "XX" code be used by parents to prevent their children from downloading certain articles, a few objected on the grounds that this would be permitting censorship, even though the criteria for using it this way would reside on an individual's own computer. Bowdlerization would remain a personal option.
There are really two issues (boxes and codes) in my proposal, and they can be considered separately and mostly independently.
The boxes are the more important of the two, and could function with whole words as easily as with classification codes. "Person" could function as easily as "CT", the LOC code for biography. The boxes could easily go beside the summary box on the edit page. To facilitate cross-referencing, it would need to be possible to enter more than one category. This would allow, for example, a person looking for mathematicians to search the articles which show boxes for both mathematics and persons. I would leave it up to the techies to determine whether this is done as a series of boxes, each with a single category, or as a single box with a number of appropriately delimited entries.
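As a rough illustration of the cross-referencing idea, the Python sketch below treats the box as a single field with delimited entries and finds the articles that carry every requested category. The semicolon delimiter, the article titles, and the names `parse_box` and `search` are assumptions made for the example only; a series of single-category boxes would reduce to the same set of categories per article.

```python
def parse_box(entry, sep=";"):
    """Split one delimited category box into a set of lower-cased categories."""
    return {part.strip().lower() for part in entry.split(sep) if part.strip()}


def search(boxes, *wanted):
    """Titles of articles whose box contains all of the wanted categories."""
    wanted_set = {w.strip().lower() for w in wanted}
    return sorted(title for title, entry in boxes.items()
                  if wanted_set <= parse_box(entry))


# Hypothetical box contents, keyed by made-up article titles.
BOXES = {
    "Example Mathematician": "Mathematics; Person",
    "Example Theorem": "Mathematics",
    "Example Painter": "Person; Art",
    "Example Unfilled": "",
}

print(search(BOXES, "mathematics", "person"))
```

Whole words and classification codes behave the same way here, since the search only compares strings.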
Whatever is devised would involve a certain amount of anticipation or pre-emption. Some of the categories that we originally employ may end up totally useless as the scheme develops. In one sense, however, this is nothing more than scaling up something that we already do when we wikify articles. When we do this we have no idea which of the links will lead to an existing article, and which to one that will never be created. One of the functions of our naming conventions is to optimize the probability that what we write and what we link will converge. If I add to a list of Oscar-nominated movies, particularly if I'm adding a movie with a one-word title, I have no certain idea whether that title will have some other encyclopediable subject, or whether there have been other movies with that title. Researching every such instance as much as I would like is totally impractical. Much of our work here depends on making serendipitous guesses. Then somebody else finds a use for the term in a totally unfamiliar area.
There is absolutely no doubt that a lot of work would be required to classify all the existing articles. I do note that someone commented in the last couple of days that some articles had not been revised or reviewed for a long time. That's fine; the boxes could be filled as part of this review. There's no need to do everything overnight. I would suggest, though, if this approach were adopted, that from the beginning every article be botted with the code "AAA" to mean unclassified, and any new article created without classification would automatically be coded with "AAA". In due course this would be useful to contributors looking for things to classify. Nobody should need to feel burdened by the idea that every contributor is obliged to classify. Our present summary box is often left unfilled, and as much as it may annoy some veterans, it is relatively harmless. The same could be said of a classification box, perhaps with the caution that inexperienced Wikipedians might do better to leave it blank until they are familiar with the categories.
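The "AAA" default could look something like the following Python sketch: an empty box is read as "AAA", and a contributor looking for work gets the list of articles still carrying that code. The function names and the sample articles are hypothetical; only the "AAA" code itself comes from the proposal.

```python
UNCLASSIFIED = "AAA"  # code proposed above for "not yet classified"


def effective_codes(codes):
    """Return the cleaned codes of an article, or ["AAA"] if none were given."""
    cleaned = [c.strip().upper() for c in codes if c.strip()]
    return cleaned or [UNCLASSIFIED]


def worklist(articles):
    """Titles of articles that still need someone to classify them."""
    return sorted(title for title, codes in articles.items()
                  if UNCLASSIFIED in effective_codes(codes))


# Hypothetical articles with the codes entered in their classification box.
ARTICLES = {
    "Example Old Stub": [],
    "Example Classified": ["qa"],
    "Example Blank Box": ["  "],
}

print(worklist(ARTICLES))
```

Nothing here forces anyone to classify; leaving the box empty simply keeps the article on the list.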
This is already getting long, and I have other obligations. I'll write about the second issue later.
Eclecticology
When I suggested a system of codes to enter into coding boxes, I made reference to a modified version of the Library of Congress Classification system. This is not to suggest that it is any better than anything else; it's just an established system that is a convenient jumping-off point. Starting from scratch would involve reinventing a lot of basics. The LCC is a system based on one, two and sometimes three letters, a number up to four digits long, a decimal point, and decimalized alphanumerics of varying lengths. For practical purposes I see no reason to go beyond the three letters at the beginning, if we even go that far. The only conceivable exception for the future would be for a very big subject area that is just crying for further subdivision. We would not be using the system to put numbers on the spines of books to ensure that the books are put on the right shelf in the library.
The classification system should allow nesting of categories. Under the existing LCC "Q" represents sciences in general, "QA" represents mathematics and no three letter codes are defined in the QAs. Without prejudice, this would leave us free to define "QAG" to represent geometry. From the searcher's point of view, he could find his geometric subject by searching either Q or QA or QAG, but the result from searching on Q could be big and mostly useless to him. An article could be classified in more than one category; one that deals with both calculus and geometry could show both a QAG and a QAC category. Also, a person who wants to be able to contribute through classifying articles should have the option to choose only those items which are unextended when he's looking for work.
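Nesting of this kind amounts to prefix matching on the codes. The Python sketch below is one way to picture it, under some loud assumptions: "QAG" and "QAC" are the hypothetical extensions mentioned above (not real LCC codes), the article titles are invented, and "unextended" is read here as "carrying a code shorter than three letters".

```python
def search(articles, query):
    """Titles of articles with at least one code starting with the query."""
    q = query.upper()
    return sorted(title for title, codes in articles.items()
                  if any(code.upper().startswith(q) for code in codes))


def unextended(articles, full_length=3):
    """Titles with a code shorter than full_length (an assumed reading)."""
    return sorted(title for title, codes in articles.items()
                  if any(len(code) < full_length for code in codes))


# Hypothetical articles; an article may carry more than one code.
ARTICLES = {
    "Example Geometry": ["QAG"],
    "Example Calculus and Geometry": ["QAG", "QAC"],
    "Example General Maths": ["QA"],
    "Example Physics": ["QC"],
}

print(search(ARTICLES, "Q"))    # broadest search, every science article
print(search(ARTICLES, "QAG"))  # narrowest search, geometry only
print(unextended(ARTICLES))     # candidates for a more specific code
```

Searching on "Q" returns everything under the sciences, which is exactly the "big and mostly useless" result described above; the longer the prefix, the narrower the list.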
In my previous post I mentioned code "AAA" for unclassified, but there are also other codes that could serve Wikipedia's own special purposes.
Eclecticology
Looks like UNESCO has declared 2006 to be the Year of African Languages. See http://news.bbc.co.uk/2/hi/africa/4536450.stm for more.
Ec
wikipedia-l@lists.wikimedia.org