Francesco Cosoleto wrote:
Pietro Battiston wrote:
- adding to replace.py the capability to remember already seen pages (I'd like some feedback on this patch: I'm testing it intensively, but the implementation itself could probably be better)
This feature may be general, not related only to replace.py.
Logging an "exclude" file is, as I intend it, mainly useful with XML dumps: I don't want to download again all the pages of a dump that I already know I won't change.
That said, I must admit I was surprised to find that the XML dump page generator lives in replace.py rather than in pagegenerators.py. Is there a reason for that? This too could be a general feature.
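To make the "exclude" file idea concrete, here is a minimal sketch in plain Python. It is not taken from the patch and does not use any pywikipedia API: the function names `load_excluded` and `skip_excluded` and the one-title-per-line file format are assumptions made for illustration only.

```python
import os


def load_excluded(path):
    """Return the set of titles stored one per line in `path`.

    A missing file is treated as an empty exclude list.
    (Hypothetical helper; the file format is an assumption.)
    """
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def skip_excluded(titles, excluded):
    """Yield only the titles that are not in the exclude set.

    `titles` can be any iterable of page titles, for example the
    titles read from a dump, so excluded pages are dropped before
    anything is downloaded.
    """
    for title in titles:
        if title not in excluded:
            yield title
```

Because the filter wraps an arbitrary iterable of titles, it would work the same way with any page source, which is the sense in which the feature could be general.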
More specifically, the "exclude" logging feature may be general, but in the replace.py case (and maybe in general too) I think it should evolve to provide separate logs for:
1) pages fixed, or automatically skipped because they were already fixed
2) pages skipped manually because the replacement doesn't apply
For example, if I use replace.py with a dump and then get a new dump, I'll delete log 1), but I'll want to keep, in most cases, log 2).
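The two-log scheme could look roughly like the sketch below. This is only an illustration in plain Python, not the patch itself: the class name `PageLogs`, its methods, and the plaintext one-title-per-line format are all assumptions.

```python
import os


class PageLogs:
    """Two separate plaintext logs, one title per line (hypothetical).

    Log 1 ("processed"): pages fixed, or skipped automatically
    because they were already fixed.
    Log 2 ("rejected"): pages skipped manually because the
    replacement does not apply.
    """

    def __init__(self, processed_path, rejected_path):
        self.processed_path = processed_path
        self.rejected_path = rejected_path
        self.processed = self._load(processed_path)
        self.rejected = self._load(rejected_path)

    @staticmethod
    def _load(path):
        # A missing log file simply means nothing was logged yet.
        if not os.path.exists(path):
            return set()
        with open(path, encoding="utf-8") as f:
            return {line.strip() for line in f if line.strip()}

    @staticmethod
    def _append(path, title):
        with open(path, "a", encoding="utf-8") as f:
            f.write(title + "\n")

    def mark_processed(self, title):
        if title not in self.processed:
            self.processed.add(title)
            self._append(self.processed_path, title)

    def mark_rejected(self, title):
        if title not in self.rejected:
            self.rejected.add(title)
            self._append(self.rejected_path, title)

    def should_skip(self, title):
        return title in self.processed or title in self.rejected

    def reset_processed(self):
        """Forget log 1 (e.g. when a new dump arrives); keep log 2."""
        self.processed.clear()
        if os.path.exists(self.processed_path):
            os.remove(self.processed_path)
```

With a new dump one would call `reset_processed()` and leave the rejected log alone, so pages already judged irrelevant by hand are not proposed again.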
Putting this in wikipedia.py may be redundant for many bots, but I don't think it would be a bad idea. When I find time, I'll write a better patch (XML or still plaintext?).
Pietro Battiston