https://bugzilla.wikimedia.org/show_bug.cgi?id=54574
Web browser: ---
Bug ID: 54574
Summary: Re 1843798: Add capabiliy to remember pages to replace.py
Product: Pywikibot
Version: unspecified
Hardware: All
OS: All
Status: ASSIGNED
Severity: normal
Priority: Unprioritized
Component: General
Assignee: Pywikipedia-bugs@lists.wikimedia.org
Reporter: legoktm.wikipedia@gmail.com
Classification: Unclassified
Mobile Platform: ---

Originally from: http://sourceforge.net/p/pywikipediabot/patches/326/
Reported by: sigmaoctantis
Created on: 2009-05-12 04:30:27
Subject: Re 1843798: Add capabiliy to remember pages to replace.py
Assigned to: nicdumz
Original description:
A new patch to implement toobaz's function with the changes suggested by wikipedian. https://sourceforge.net/tracker/?func=detail&aid=1843798&group%5C_id...
- solve_disambiguation.py and pagegenerators.py:
1. Generator and logging function for -primary option moved from solve_disambiguation.py to pagegenerators.py
2. TODO in solve_disambiguation.py done: generator now starts yielding before all referring pages have been found
3. makes use of new TextfilePageGenerator
4. code is a few lines shorter
- replace.py:
5. "-exclude" option from toobaz's patch implemented. It filters the generator through a list of previously edited pages. New pages are appended to the filter file based on the choices made:
   -exclude: logs choice "N" to the filter file
6. additional command line options for other settings:
   -editonce: logs choices "Y", "A" to the filter file
   -treatonce: logs choices "Y", "A", "N" to the filter file
   -scanonce: logs choices "Y", "A", "N" and "no change" to the filter file
7. uses generator and file format from solve_disambiguation.py (suggested by wikipedian below)
8. default filter filename is the name of the fix. Files are placed in a subdirectory "replace".
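The option-to-choice mapping from items 5 and 6 can be written down as a small lookup table. This is only a sketch of the described behaviour, not code from the patch: `LOGGED_CHOICES` and `should_log` are hypothetical names, and the string "no change" stands in for however the patch marks pages the regex did not alter.

```python
# Hypothetical sketch: which user choices each option logs to the filter file.
# The table follows the option descriptions above; names are not from the patch.
LOGGED_CHOICES = {
    "-exclude": frozenset({"N"}),
    "-editonce": frozenset({"Y", "A"}),
    "-treatonce": frozenset({"Y", "A", "N"}),
    "-scanonce": frozenset({"Y", "A", "N", "no change"}),
}


def should_log(option, choice):
    """Return True if a page with this choice is appended to the filter file."""
    return choice in LOGGED_CHOICES.get(option, frozenset())


if __name__ == "__main__":
    for opt in LOGGED_CHOICES:
        print(opt, sorted(LOGGED_CHOICES[opt]))
```

With no option given, `should_log` returns False for every choice, so nothing is logged.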
--- Comment #1 from Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com --- patch for replace.py solve_disambiguation.py pagegenerators.py
--- Comment #2 from Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com --- patch for replace.py solve_disambiguation.py pagegenerators.py (revised)
--- Comment #3 from Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com --- Thanks for the quick review. I will try to address the various points and have included a new version of the patch.
a. I added a bit more text to the source and reformatted part of the code, but I didn't want to change existing code more than needed.
b. generator:
   - checks if the filter file exists
   - reads it
   - runs the next generator and skips pages that are in memory
Previously, it first ran the next generator and then deleted from its result the pages that were in the filter file.
c. replace.py command line options
I added several command line options to define which pages should be skipped the next time. One could edit replace.py directly, but it seemed cleaner to provide all options at command line level.
toobaz excluded pages where a replacement was manually rejected ("N"). The option "-exclude" will keep this functionality.
Personally, I find it more useful to filter pages that were edited in a previous run. This keeps the bot from repeating the same edit later, after someone has reverted it. Option "-editonce" provides this.
"-treatonce" combines the two.
"-scanonce" keeps the bot from re-fetching the same page in a second run, even if the regex didn't match it in the first run. (I fixed an omission for "skipped" in the second patch.)
Without the different options, the additions to replace.py would be much shorter.
d. I had to insert several "break" statements in replace.py so that nothing but "N" reaches the stage confusingly labeled "choice must be 'N'" in the code.
e. FilterFileAppend is based on the function from solve_disambiguation. The advantage of writing each page to the file is that it won't miss one if it's interrupted or crashes. This mode from solve_disambiguation remains unchanged.
f. The same goes for the file format. Up to now, I didn't have any problems with it, and it worked OK with a title "臺灣Taiwan&āàäà" I just tested. urlname was also used by PrimaryIgnoreManager. For backward compatibility, maybe it should be kept.
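Points b and e can be illustrated with a minimal standard-library sketch. It is not the patch itself: `load_filter_titles`, `skip_filtered` and `append_title` are hypothetical names, pages are represented by plain title strings, and the file is assumed to hold one raw title per line (the patch writes urlname-encoded titles instead).

```python
import os


def load_filter_titles(filename):
    """Read the filter file into memory; a missing file means nothing to skip."""
    if not os.path.exists(filename):
        return set()
    with open(filename, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def skip_filtered(titles, filename):
    """Wrap another generator and skip titles already in the filter file."""
    seen = load_filter_titles(filename)
    for title in titles:
        # Yield lazily, so the wrapped generator is never exhausted up front.
        if title not in seen:
            yield title


def append_title(filename, title):
    """Append one title immediately, so an interrupted run loses nothing."""
    with open(filename, "a", encoding="utf-8") as f:
        f.write(title + "\n")
```

Opening and closing the file on every call is what makes each entry survive a crash, at the cost of one open/close per page.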
--- Comment #4 from Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com --- Wow, that's a big patch =)
* codecs is fine with me
* can you avoid lines > 80 characters? I know that this is not something we do everywhere, but that's bad looking code. Same goes for if foo: bar. Please skip a line.
* can you document thoroughly what's being done? parameters in the generators? In replace.py ? I find it really hard to understand the "choice" table in the docstring explaining -scanonce & others.
* What's this:
+ f = codecs.open(filename, 'r', 'utf-8')
+ f.close()
??
I am also not convinced by the fact that FilterFileAppend is called after each page, so that each time (#1) the path is computed and (#2) a file is opened, written to, and closed. I'm thinking that a possibly cleaner way to do this would be to have a Filter object: put everything you need in it (an opened file descriptor, a list of titles to ignore if you need to use this, etc...) and keep a reference to it from the replace & disambig bots. How does that sound to you?
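A rough sketch of what such a Filter object might look like, using only the standard library. The class name `PageFilter` and its methods are made up for illustration, titles are plain strings stored one per line, and the directory holding the file is assumed to exist already.

```python
import os


class PageFilter:
    """Hypothetical filter: a set of known titles plus one open file handle."""

    def __init__(self, filename):
        self.titles = set()
        if os.path.exists(filename):
            with open(filename, encoding="utf-8") as f:
                self.titles = {line.strip() for line in f if line.strip()}
        # The path is resolved and the file opened once, not once per page.
        self._file = open(filename, "a", encoding="utf-8")

    def __contains__(self, title):
        return title in self.titles

    def add(self, title):
        """Remember a title and write it out right away."""
        if title in self.titles:
            return
        self.titles.add(title)
        self._file.write(title + "\n")
        self._file.flush()  # keep the crash-safety of per-page appends

    def close(self):
        self._file.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
```

A bot could then hold one instance and write `if title in page_filter: continue` in its loop, calling `page_filter.add(title)` after each choice that should be logged.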
I also know that Daniel wanted first to keep the same file format, but... a couple of things are wrong here:
* if you output titles with page.urlname() it will not be possible to read the file with TextfilePageGenerator afaik. Think of special characters being url-encoded and not decoded.
* if you want to use a Page title for a filename, you want Page.titleforFilename, not Page.urlname
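To illustrate the encoding concern, here is a small standard-library sketch. It assumes that page.urlname() yields percent-encoded output comparable to `urllib.parse.quote`; that is an assumption for illustration, not a statement about the actual method.

```python
from urllib.parse import quote, unquote

# Title taken from the test mentioned earlier in this thread.
title = "臺灣Taiwan&āàäà"

# Stand-in for a URL-encoded title as it would be written to the file.
encoded = quote(title, safe="")

# A reader that takes each line as a literal title sees a different string.
matches_raw = (encoded == title)

# Only an explicit decoding step recovers the original title.
matches_decoded = (unquote(encoded) == title)

print(matches_raw, matches_decoded)
```

So a file written with encoded titles can only be read back by code that decodes each line first.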
Thank you!
--- Comment #5 from Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com --- Assigning to nicdumz for processing.
--- Comment #6 from Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com --- Nicdumz, do you have the time to work on this? It's been stale for.... a while.
Sigmaoctantis, sorry for the very slow uptake. It's a general problem for most patches that are larger than the 'glance over it, looks ok, commit' language updates. I'll see if I can find the time to review it.
In any case, the patch does not apply cleanly currently, so it needs some more fiddling.
--- Comment #7 from Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com --- - **assigned_to**: nobody --> nicdumz
--- Comment #8 from Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com --- - **priority**: 5 --> 7
Kunal Mehta (Legoktm) legoktm.wikipedia@gmail.com changed:
What     |Removed |Added
----------------------------------------------------------------------------
See Also |        |https://sourceforge.net/p/pywikipediabot/patches/326
xqt info@gno.de changed:
What   |Removed  |Added
----------------------------------------------------------------------------
Status |ASSIGNED |NEW
CC     |         |info@gno.de
Ricordisamoa ricordisamoa@openmailbox.org changed:
What      |Removed |Added
----------------------------------------------------------------------------
Component |General |Other scripts