On Mon, 16 Nov 2009, Emmanuel Engelhart wrote:
Yes, when building the ZIM file, I add the titles of the redirect pages pointing to this page to "the already there keywords".
What are "the already there keywords"?
I'm not enthusiastic about dumping category pages... but this is only part of the issue. The other part is that I have no way of knowing, given a list of articles, which categories I have to integrate in the final dump! Do you?
You could do it iteratively?
You must have a method of unlinking 'red-linked' pages - links in articles that point to pages not in our collection.
Include all categories, remove those that point to zero or one article in our collection.
You can leave all categories in the article, just unlink the ones that do not 'make the cut'.
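Here is a rough sketch of what I mean, in Python. It assumes you can already build a mapping from each page (articles and category pages alike) to the categories it lists. The names `select_categories`, `category_links` and `page_categories` are made up for illustration and are not part of any existing tool.

```python
from collections import defaultdict


def select_categories(page_categories, articles, min_members=2):
    """Return the set of categories worth keeping in the dump.

    page_categories: dict mapping a page title to the categories it lists
                     (hypothetical input; category pages may appear as keys too).
    articles:        set of article titles in our collection.
    min_members:     categories with fewer members than this are dropped.
    """
    # Start by including every category that is mentioned anywhere.
    kept = {cat for cats in page_categories.values() for cat in cats}
    while True:
        # Count members among our articles and the categories still kept.
        members = defaultdict(set)
        for page in set(articles) | kept:
            for cat in page_categories.get(page, ()):
                if cat in kept:
                    members[cat].add(page)
        too_small = {cat for cat in kept if len(members[cat]) < min_members}
        if not too_small:
            return kept
        # Dropping a category can shrink its parents, so go round again.
        kept -= too_small


def category_links(article, page_categories, kept):
    """List (category, is_linked) pairs: every category stays, some unlinked."""
    return [(cat, cat in kept) for cat in page_categories.get(article, ())]


page_categories = {
    "Gamma-ray burst": ["Category:Astronomy", "Category:Explosions"],
    "Supernova": ["Category:Astronomy"],
    "Category:Astronomy": ["Category:Science"],
    "Category:Explosions": ["Category:Science"],
}
articles = {"Gamma-ray burst", "Supernova"}
kept = select_categories(page_categories, articles)
links = category_links("Gamma-ray burst", page_categories, kept)
```

In this toy collection only Category:Astronomy survives: Category:Explosions has a single member, and once it is gone Category:Science is left with a single member too, which is why the pruning has to be repeated.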
My point is - if references make up half the text dump, categories surely deserve to be in there.
Re: References - could you perhaps link to http://en.wikipedia.org/wiki/Gamma-ray_burst#References?
Then, if you *do* have internet access, you can get to the refs?
(Still not very satisfactory - the correspondence between the article ref and the actual one is lost - you have to look for it).
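Building that link per article is cheap. A minimal sketch, assuming the online article lives under the same title and that its reference section is actually called "References" (not true of every article); `references_url` is a hypothetical helper name:

```python
from urllib.parse import quote

# Base URL taken from the example link above.
WIKI_BASE = "http://en.wikipedia.org/wiki/"


def references_url(title, section="References"):
    """Build a link to the online article's reference section."""
    # Wikipedia URLs use underscores where titles have spaces.
    page = quote(title.replace(" ", "_"))
    return WIKI_BASE + page + "#" + section


url = references_url("Gamma-ray burst")
```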
I appreciate all the work that has already gone into this, but I see a lot of effort going into one or two ZIM files, and not enough into the process - where you could create another ZIM file which is just chemistry-related, or Africa-related, or the top 1000 articles.
Cheers, Andy!