[Pywikipedia-l] [ pywikipediabot-Patches-1843798 ] Add capability to remember pages to replace.py

Wed Jan 16 14:20:33 UTC 2008

Patches item #1843798, was opened at 2007-12-03 18:45
Message generated for change (Comment added) made by nobody
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=603140&aid=1843798&group_id=93107

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Pietro Battiston (toobaz)
Assigned to: Nobody/Anonymous (nobody)
Summary: Add capability to remember pages to replace.py

Initial Comment:
During very long semi-automatic replacement runs, you may have to kill the bot and start it again. You then have to say "no" again to all the unwanted replacements. It is even worse if you're using an XML dump: it can be several weeks old, so the bot will make you download a lot of pages that were ALREADY corrected.

This patch consists of two parts:
1) a patch to replace.py that adds a new parameter, "-exclude", which takes the path to a file used both for:
-> knowing which articles to exclude from substitution
-> logging the pages where a replacement was denied and the pages already known not to need replacement
2) a patch to pagegenerators.py that adds a generator filter that yields only pages not appearing in a given list
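
As a rough illustration of the two parts described above, here is a minimal sketch that works on plain title strings. It is not the attached patch: the function names (load_excluded, remember_title, exclude_filter) and the one-title-per-line file layout are assumptions made for this example, and it uses the built-in open() with an explicit encoding rather than the codecs module.

```python
import os


def load_excluded(path):
    """Return the set of titles stored in the exclude file (one per line).

    Hypothetical helper: a missing file simply means nothing is excluded yet.
    """
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def remember_title(path, title):
    """Append a title to the exclude file so a later run can skip it."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(title + "\n")


def exclude_filter(titles, excluded):
    """Generator filter: yield only the titles that are not in `excluded`."""
    for title in titles:
        if title not in excluded:
            yield title
```

In a bot, remember_title would be called whenever the user declines a replacement or a page turns out not to need one, and the page generator would be wrapped in exclude_filter at startup.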

The only doubt I have is: should replace.py log in some other way? XML? The wikipedia module's predefined functions? Log to a given Wikipedia user page (so that logs can easily be shared)?

As I've done it, it needs to import the os and codecs modules... I don't know if that's a problem.

Anyway, a patch like this is really needed; if necessary, I can try to improve it.

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2008-01-16 06:20

Message:
Logged In: NO 

replace.py already has the option -xmlstart:page when using an xml dump,
to skip all entries before "page".

----------------------------------------------------------------------

Comment By: Daniel Herding (wikipedian)
Date: 2008-01-16 04:35

Message:
Logged In: YES 
user_id=880694
Originator: NO

We already have something very similar for solve_disambiguation.py. When
you run it with the -primary parameter, e.g. on [[en:London]], it saves all
page titles where the user pressed 'N' to the 'disambiguations' directory,
and skips these pages when you run the same command later.

It saves the URL-encoded titles into a text file, one title per line,
without [[brackets]].
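
The log format described above could be sketched roughly as follows. This is only an illustration, not code from solve_disambiguation.py: the helper names are hypothetical, and the use of UTF-8 percent-encoding via urllib.parse is an assumption about what "URL-encoded" means here.

```python
from urllib.parse import quote, unquote


def encode_title(title):
    """Turn a page title into one URL-encoded log line (without newline).

    Surrounding [[brackets]] are stripped first, since the log stores
    bare titles. Assumes UTF-8 percent-encoding.
    """
    if title.startswith("[[") and title.endswith("]]"):
        title = title[2:-2]
    return quote(title, safe="")


def decode_line(line):
    """Recover the page title from one line of the log file."""
    return unquote(line.strip())
```

Because each encoded title contains no raw spaces or newlines, one title per line is unambiguous and the file can be read back with a simple line loop.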

It would be nice if some code could be shared, although I'm not sure if
that's possible (I haven't yet looked at your code, but
solve_disambiguation.py is a bit complicated). But we should keep
solve_disambiguation's format because there are probably people who want to
keep using their logs.

----------------------------------------------------------------------
