Interesting new paper: http://arxiv.org/abs/cs.IR/0512085
"Out of the 78,977 categories, 12,252 are not assigned to any article" Is it that bad, or are these categories just assigned to user talk/wikipedia/etc. namespace pages and therefore not counted?
On 9/1/06, Erik Moeller eloquence@gmail.com wrote:
Interesting new paper: http://arxiv.org/abs/cs.IR/0512085
-- Peace & Love, Erik _______________________________________________ Wikipedia-l mailing list Wikipedia-l@Wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikipedia-l
On 01/09/06, Akash Mehta draicone@gmail.com wrote:
"Out of the 78,977 categories, 12,252 are not assigned to any article" Is it that bad, or are these categories just assigned to user talk/wikipedia/etc. namespace pages and therefore not counted?
There's quite a few cats which serve only to group subcategories, and don't have any articles that belong in them.
[[Category:American people by occupation]], for example - all the articles belong in subcategories, and as these are pretty disparate categories there's no article really dealing with the subject of the main category.
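A rough sketch of the distinction being drawn here, in Python on made-up data: a category with no directly assigned articles may still be a container for subcategories rather than truly unused. Apart from "American people by occupation", every category name and article title below is invented for illustration, and this is not claimed to be how the paper (or MediaWiki) does its counting.

```python
# Hypothetical toy data: each category lists its direct articles and direct subcategories.
categories = {
    "American people by occupation": {
        "articles": [],
        "subcats": ["American actors", "American engineers"],
    },
    "American actors": {"articles": ["Actor A", "Actor B"], "subcats": []},
    "American engineers": {"articles": ["Engineer C"], "subcats": []},
    "Unused category": {"articles": [], "subcats": []},
}


def classify(cats):
    """Split categories that have no direct articles into two groups:
    containers (they hold subcategories) and truly empty ones."""
    containers, empty = [], []
    for name, members in cats.items():
        if members["articles"]:
            continue  # has at least one article, so it would not be counted
        if members["subcats"]:
            containers.append(name)
        else:
            empty.append(name)
    return sorted(containers), sorted(empty)


containers, empty = classify(categories)
print("no articles, but groups subcategories:", containers)
print("no articles and no subcategories:", empty)
```

A count of "categories not assigned to any article" would include both groups, so on its own it says little about how many categories are actually unused.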
http://mangalorean.com/news.php?newstype=local&newsid=32933
Wikipedia for more Indian languages on its website
By Frederick Noronha/IANS
Bangalore, Sep 1 (IANS) More and more Indians can contribute to the volunteer-edited Wikipedia encyclopedia project, now rated among the top 20 websites globally.
So says Jimmy "Jimbo" Wales, founder of the Web-based free content multilingual project.
During a visit to India, Wales noted that volunteer contributions to the Kannada Wikipedia had been growing 22 percent and Bengali 35 percent a month.
"These growth rates are fairly high. Of course, they're growing from a small base. But Kannada already has over 5,000 articles and is still growing. That's really exciting. Bengali too has a growth rate of 35 percent," he said.
"It's not as bad as it was a year ago. We had almost nothing then. Now, languages like Bengali, Kannada, Marathi are in the 3,000-5,000 article range. Hindi, Assamese over 1,000. But Hindi, a very large language, has only 1,500 entries. That's a little surprising," Wales told IANS.
"We still have an enormous amount of work left to do. India has 23 official languages. English has more than 10,000 articles. We aim to have 200,000 articles for every language spoken by a million people," he said.
Currently, Japanese is the only non-European language among the 'big 10' of Wikipedia.
Farsi, Arabic, Korean, Thai, Chinese and Bahasa Indonesia are among those with over 10,000 articles. In the same category are Urdu, Bangla, Hindi, Kannada, Marathi, Tamil and Telugu.
Wales, who took part in a conference on open content in India, said he had just started a new project called the Wikiversity.
"We want to provide a free encyclopaedia to every single person in the planet. We also want to provide all the tools to become literate," he said.
"Our mission reaches far beyond the Internet - even people who don't have access to electricity. There are over one million articles in English. And English is less than one-third of the total work of Wikipedia.
"Unless you have a thousand articles, I don't count it as a fully active community," he told audiences in India, urging them to do more in this regard.
With just four full-time employees, this volunteer-driven project has become the 17th most visited website in the world, and currently has some 200 servers across the globe.
Wales said Wikipedia was drawing more hits than the "BBC and the CNN combined". He cited tests that show Wikipedia had only four mistakes per article as against the Britannica's three.
"People should have this idea that Wikipedia is pretty good and that it's getting better all the time. I think what was surprising for most people was not that Wikipedia had four errors per article but that Britannica had three (per article)!" he added.
"The focus of my effort is to see how to get the initial communities going," he said, explaining that he planned to test-hire his fifth employee in India to encourage content to be built up.
Wales referred to the Wikipedia on a CD or on mobile phones.
Webaroo.com, a project founded by IITian Rakesh Mathur that offers software for users to download highly compressed web content, has also offered portions of the Wikipedia through this format.
In keeping with its free nature, anyone could copy Wikipedia content, print it and even sell it, Wales said.
Indian techies who interacted with him suggested the time was "right" for Indian language content creation, since the tools for doing this had been created, including projects such as IndLinux.
Wikipedia is run on a website that allows any visitor to edit its content. It is written collaboratively by volunteers, allowing most articles to be changed by almost anyone with access to the website. Wikipedia's main servers are in Tampa, Florida, with additional servers in Amsterdam and Seoul.
Wikipedia started as an English language project Jan 15, 2001, as a complement to the now defunct Nupedia.
As of August 2006, Wikipedia has over five million articles in many languages, including more than 1.3 million in the English-language version. There are 229 language editions of Wikipedia, 16 of which have over 50,000 articles each.
There has been controversy over Wikipedia's reliability and accuracy, with the site receiving criticism for its susceptibility to vandalism, uneven quality and inconsistency, systemic bias, and preference for consensus over credentials.
IANS
--
Frederick Noronha http://fn.goa-india.org 9822122436 +91-832-240-9490 http://fredericknoronha.wordpress.com fredericknoronha@gmail.com
Frederick "FN" Noronha wrote:
"It's not as bad as it was a year ago. We had almost nothing then. Now, languages like Bengali, Kannada, Marathi are in the 3,000-5,000 article range. Hindi, Assamese over 1,000. But Hindi, a very large language, has only 1,500 entries. That's a little surprising," Wales told IANS.
This actually doesn't surprise me too much. Hindi is a lingua franca of sorts in India, but so is English. In particular, the Hindi speakers most likely to be editing an encyclopedia on the internet (middle to upper class, well-educated) are highly likely to also speak very good English. Anecdotally, there are indeed a lot of Indians making good contributions on the English-language Wikipedia, so it seems that many seem to prefer to put their efforts there, probably for a variety of reasons (bigger base to start from; feeling that their contribution will have more global impact; etc.).
-Mark
I think it has to do with the fact that Internet access in India is still very limited, and, in some ways, very new.
As a journalist (based in Goa), I've been on e-mail access since 1994. But till just about 12-18 months back, I paid for every minute of internet time. Around that time, I managed to get unlimited dial-up access. Some six months back, I got access to broadband. And only about two months back, got unlimited 256 kbps broadband (at an affordable Rs 900 per month).
This may not reflect everyone's situation, since I live some ten kms out of a town in the smallest state. But outside of the metros (the four big cities), most people are in a similar predicament.
Also, not enough people have discovered the Wikipedia yet in India. Or so it seems. But you can expect the tide to turn in not too long a time. Let's hope so! FN
Delirium wrote:
This actually doesn't surprise me too much. Hindi is a lingua franca of sorts in India, but so is English. In particular, the Hindi speakers most likely to be editing an encyclopedia on the internet (middle to upper class, well-educated) are highly likely to also speak very good English. Anecdotally, there are indeed a lot of Indians making good contributions on the English-language Wikipedia,
This is the case in Scandinavia too. I would say that a majority of the Scandinavian contributors speak excellent English, even if the user base is starting to expand to older, less educated contributors. Some Scandinavian wikipedians think the Scandinavian language versions are jokes, and only contribute to the English Wikipedia. But the more Scandinavians who contribute to the English Wikipedia, the more contributors spill over to the "native" languages, of which 4 are among the top 20 languages at Wikipedia. Danish with 5 million speakers has 47,000 articles.
So even if these supposedly young, male, upper-class wikipedians in India speak good English, that is not enough to explain why Hindi is lagging behind. Perhaps the issue is whether they grew up with an English or a Hindi encyclopedia on their parents' bookshelf. Do they consider Hindi to be "a language of encyclopedias"? I don't see the word "Hindi" in http://en.wikipedia.org/wiki/Encyclopedia or http://en.wikipedia.org/wiki/List_of_encyclopedias
I also don't see the word "encyclopedia" in any of http://en.wikipedia.org/wiki/Hindi http://en.wikipedia.org/wiki/Hindi_literature or http://en.wikipedia.org/wiki/Indian_literature
In the longer run, I hope Wikipedia will have not only a full, good Hindi version, but one edited by contributors in the right proportions of gender, class, and age. We probably have a very long way to go.
On 9/1/06, Andrew Gray shimgray@gmail.com wrote:
On 01/09/06, Akash Mehta draicone@gmail.com wrote:
"Out of the 78,977 categories, 12,252 are not assigned to any article" Is it that bad, or are these categories just assigned to user talk/wikipedia/etc. namespace pages and therefore not counted?
There's quite a few cats which serve only to group subcategories, and don't have any articles that belong in them.
[[Category:American people by occupation]], for example - all the articles belong in subcategories, and as these are pretty disparate categories there's no article really dealing with the subject of the main category.
--
- Andrew Gray
Wait, isn't that encouraged? I had thought that most categories were supposed to categorize categories, and only the terminal categories were supposed to have articles in them - ex. [[Category:Free software]] should have only categories in it, not articles on software.
~maru
On 02/09/06, maru dubshinki marudubshinki@gmail.com wrote:
Wait, isn't that encouraged? I had thought that most categories were supposed to categorize categories, and only the terminal categories were supposed to have articles in them - ex. [[Category:Free software]] should have only categories in it, not articles on software.
Eh? I don't recall that being required at all. Else you'll end up with a lot of "other x" subcats.
- d.
And wouldn't there be a lot of database space taken up by redundant categories like this? We could have articles for them, maybe, but at this rate we'll need to start 'WikiProject WikiDbCleanup'. If there are 12,000 categories, that has to make up a significant body of data.
What are you talking about, redundant?? It's called structure, man!
http://stats.wikimedia.org/wikispecial/EN/CategoryOverviewIndex.htm
Our category system serves many purposes which sometimes clash, but surprisingly infrequently, IMO.
Good general rule: don't create a category until it's needed. Therefore it's not at all surprising that categories other than 'leaves' have contents.
Why is everyone so obsessed with cleanup? It's not even fun, let alone necessary (most of the time).
cheers, Brianna user:pfctdayelise
On 9/2/06, Akash Mehta draicone@gmail.com wrote:
And wouldn't there be a lot of database space taken up by redundant categories like this? We could have articles for them, maybe, but at this rate we'll need to start 'WikiProject WikiDbCleanup'. If there are 12,000 categories, that has to make up a significant body of data.
At a rough estimate, the entire category structure of the English Wikipedia takes up as much database space as the edit history of [[George W Bush]]. I wouldn't worry about it.
wikipedia-l@lists.wikimedia.org