On Wed, 30 May 2007 16:56:11 +0200, Frank Schumacher wrote:
FS: Dear NG,
FS:
FS: I use the static HTML dump of Wikipedia to extract a network of main
FS: categories and subcategories together with their linked articles.
FS:
FS: To achieve this, I parse all Category~*.* pages.
FS:
FS: It turns out that categories with more than 200 members (e.g.
FS: subcategories) are not represented completely in the HTML dump. The
FS: page contains only the first 200 elements; the remaining ones are
FS: missing. The "next 200" link points back to the same page, and no page
FS: holding the next 200 elements can be found.
FS:
FS: So I can only extract the first 200 elements. Can anything be done about
FS: this?
You can work with the XML dumps instead. Import them into MySQL and have a look at
http://meta.wikimedia.org/wiki/Database_layout
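
Once the data is in a database, the category net can be read from the link
tables rather than from paginated HTML, so the 200-element limit does not
apply. Below is a minimal sketch of that idea, not a tested import pipeline.
It assumes the `page` table (page_id, page_namespace, page_title) and the
`categorylinks` table (cl_from, cl_to) as described on the layout page above;
please check the column names against your MediaWiki version. To stay
self-contained it uses an in-memory SQLite database with made-up sample rows
in place of MySQL, and the names build_category_net and demo_connection are
only illustrative.

```python
import sqlite3
from collections import defaultdict

CATEGORY_NS = 14  # MediaWiki namespace number of Category pages
ARTICLE_NS = 0    # MediaWiki main (article) namespace


def build_category_net(conn):
    """Return (subcats, articles): dicts mapping a category title to its members."""
    subcats = defaultdict(list)
    articles = defaultdict(list)
    # Each categorylinks row says: page cl_from is a member of category cl_to.
    query = """
        SELECT cl.cl_to, p.page_namespace, p.page_title
        FROM categorylinks AS cl
        JOIN page AS p ON p.page_id = cl.cl_from
        ORDER BY cl.cl_to, p.page_title
    """
    for parent, namespace, title in conn.execute(query):
        if namespace == CATEGORY_NS:
            subcats[parent].append(title)
        elif namespace == ARTICLE_NS:
            articles[parent].append(title)
    return dict(subcats), dict(articles)


def demo_connection():
    """Build a tiny stand-in database (assumed, simplified schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE page (
            page_id INTEGER PRIMARY KEY,
            page_namespace INTEGER,
            page_title TEXT
        );
        CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
        INSERT INTO page VALUES
            (1, 14, 'Science'), (2, 14, 'Physics'),
            (3, 0, 'Atom'), (4, 0, 'Scientist');
        INSERT INTO categorylinks VALUES
            (2, 'Science'), (3, 'Physics'), (4, 'Science');
    """)
    return conn


if __name__ == "__main__":
    subcats, articles = build_category_net(demo_connection())
    print(subcats)
    print(articles)
```

Against a real MySQL import, the same SELECT should work through any Python
DB-API driver; only the connection setup would differ.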
--
Emmanuel