Hi, everyone,
I want to run some classification experiments on Wikipedia articles. I already have the web page archive; the experiments also need the following category information:
1. What is the category (or categories) of a web page (an article)? For example, two kinds of facts would be enough: a. web page P1 belongs to category C1; b. category C1 is under two parent categories, CC1 and CC2, each of which has its own chain of parent categories. With these I can build a tree whose leaves are the web pages.
2. How do Wikipedia editors manage categories across such a huge number of articles? For example, the categorization method, the levels, or the inheritance between categories.
Could you give me some advice or URLs where I can find this information?
Thanks & best wishes,
2010/6/13 杨杰 xtyangjie@gmail.com:
> Hi, everyone,
Hello Yang Jie,
> I want to do some experiments on classification using web pages of wikipedia. Now that I have got the web page archive, the experiment needs the following category information:
> - what is the category (or categories) of a web page (an article)?
> e.g. once I can get the two tips, the information is enough. a. Web page P1 belongs to category C1; b. Category C1 is under two parent categories CC1 and CC2, while the two categories own their parent category chains separately. Then I can build a tree whose leaves are the web pages.
> - how do guys in wikipedia deal with the category work upon the huge amount of articles, for example, category method, level or inheritance between categories.
> Could you give me some advice or URLs to find them?
The best page for learning how Wikipedia users use categories is [1].
I'm not sure I fully understand your questions, so don't hesitate to ask more precise ones once you have read [1].
[1] http://en.wikipedia.org/wiki/Wikipedia:Categories
> Xi’an Jiaotong University
Hey! I've been there!
> once i didn't know software is not free, but found it days later; now i realize that it's indeed free.
Yes, it's free!
Yours sincerely
-- Peter Potrowl http://www.mediawiki.org/wiki/User:Peter17
On Sun, Jun 13, 2010 at 5:55 AM, 杨杰 xtyangjie@gmail.com wrote:
> - what is the category (or categories) of a web page (an article)?
> e.g. once I can get the two tips, the information is enough. a. Web page P1 belongs to category C1; b. Category C1 is under two parent categories CC1 and CC2, while the two categories own their parent category chains separately. Then I can build a tree whose leaves are the web pages.
You can use the API [1] with "prop=categories" to query the categories of any page. Alternatively, you could obtain a database dump [2] and query the `categorylinks` table.
[1] http://en.wikipedia.org/w/api.php
[2] http://dumps.wikimedia.org/backup-index.html
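As a rough illustration of the "prop=categories" query, here is a minimal Python sketch. It assumes the HTTPS form of the endpoint in [1] and the default JSON response layout (pages keyed by page ID, each with a "categories" list). The helper names and the User-Agent string are made up for this example. It only reads the first batch of results and does not follow continuation for pages with many categories.

```python
import json
import urllib.parse
import urllib.request

# Assumption: HTTPS form of the api.php endpoint given in [1].
API_URL = "https://en.wikipedia.org/w/api.php"


def build_categories_url(title, limit="max"):
    """Build a query URL asking for the categories of one page title."""
    params = {
        "action": "query",
        "prop": "categories",
        "titles": title,
        "cllimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_categories(response):
    """Map each page title in a decoded JSON response to its category titles."""
    result = {}
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        # Pages without categories simply have no "categories" key.
        result[page["title"]] = [c["title"] for c in page.get("categories", [])]
    return result


def fetch_categories(title):
    """Fetch and parse the first batch of categories for a page (needs network)."""
    request = urllib.request.Request(
        build_categories_url(title),
        # Hypothetical User-Agent; replace with one describing your own tool.
        headers={"User-Agent": "category-experiment/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as resp:
        return parse_categories(json.load(resp))
```

Calling fetch_categories("Some article title") should give a dictionary from the page title to its "Category:..." titles; calling it again on each category title would walk up to the parent categories.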
> - how do guys in wikipedia deal with the category work upon the huge amount of articles, for example, category method, level or inheritance between categories.
They are stored in MySQL, see [3] and [4].
[3] http://www.mediawiki.org/wiki/Manual:Category_table
[4] http://www.mediawiki.org/wiki/Manual:Categorylinks_table
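To show how the `categorylinks` data could answer the P1 / C1 / CC1, CC2 question, here is a small Python sketch using an in-memory SQLite database. This is only an assumption-laden toy: the real dump is MySQL, the two tables below keep just a few columns (`page_id`, `page_namespace`, `page_title`, `cl_from`, `cl_to`) of the schema described in the manual pages, and the exact columns may differ between MediaWiki versions. The function names are made up for illustration.

```python
import sqlite3

CATEGORY_NS = 14  # namespace number of category pages in MediaWiki

# Toy data mirroring the example in the question: P1 -> C1 -> {CC1, CC2}.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY,
                   page_namespace INTEGER, page_title TEXT);
CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
INSERT INTO page VALUES (1, 0, 'P1'), (2, 14, 'C1'),
                        (3, 14, 'CC1'), (4, 14, 'CC2');
INSERT INTO categorylinks VALUES (1, 'C1'), (2, 'CC1'), (2, 'CC2');
""")


def parent_categories(conn, namespace, title):
    """Return the categories that the given page (or category) is filed under."""
    rows = conn.execute(
        "SELECT cl.cl_to FROM categorylinks AS cl "
        "JOIN page AS p ON p.page_id = cl.cl_from "
        "WHERE p.page_namespace = ? AND p.page_title = ? "
        "ORDER BY cl.cl_to",
        (namespace, title),
    )
    return [row[0] for row in rows]


def ancestor_categories(conn, title, namespace=0):
    """Collect every category reachable upward from a page."""
    seen = set()
    frontier = parent_categories(conn, namespace, title)
    while frontier:
        category = frontier.pop()
        if category in seen:
            continue  # guards against loops in the category graph
        seen.add(category)
        frontier.extend(parent_categories(conn, CATEGORY_NS, category))
    return seen


print(parent_categories(conn, 0, "P1"))             # direct categories of P1
print(parent_categories(conn, CATEGORY_NS, "C1"))   # parents of C1
print(sorted(ancestor_categories(conn, "P1")))      # all ancestors of P1
```

Because a category can have several parents, as in the question, the result is a graph rather than a strict tree, which is why the walk keeps a set of already visited categories.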
> Department of Computer Science and Technology, Xi’an Jiaotong University
I'm in Xi'an too :P
-- Jimmy Xu
Thanks, Jimmy and Peter17!
Looking at the data dump is probably a good choice. I will report back once I have some results.
Thank you again!
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
杨杰 wrote:
> - how do guys in wikipedia deal with the category work upon the huge amount of articles, for example, category method, level or inheritance between categories.
Categories are edited by normal users, just like articles. Sometimes a category is missing, and then it is created, often following a pattern of other categories. For example if there are categories for French food and German food, I might create a category for Swedish food.
Sometimes a user analyses a bigger part of the category structure and proposes that food categories should be named by ethnic origin rather than country, or that food and drink should be one category instead of separate categories. But the user who does this for food and drink is not the same user that does this for pet animals or vintage cars.
And in another language edition of Wikipedia, the category structure could be completely different. Many language editions have a page for discussions about the category structure, such as http://en.wikipedia.org/wiki/Wikipedia:Categories_for_discussion
So far, this is not a "technical" issue that is discussed on wikitech-l. The technical part is just making sure that the creation and editing of categories can be done. This also involves some semi-automatic tools such as using a "bot" to move articles from one category to another.