RFC: Counting and saving replacements in replace.py - pywikibot

4 Sep 2017


I have corrected the misspelled subject and added one more good reason to
do it. Please reply to this message, not the previous one.
Happy Monday to all!
I have had a dream for many years: when I use replace.py to correct
grammatical and spelling errors (which is what I do in most of the time I
dedicate to bot work), it would be nice and useful to count the
replacements and to extract the old-new pairs for later use. This task
requires changing replaceExcept() in textlib, so for a long time I was not
brave enough: I thought it would be more complicated, and I was afraid of
community rejection, as often happens when a task is important to
somebody but others don't see it that way.
But now compat is deserted, so I was brave enough to experiment with my
copy. I still use replace.py in compat for several reasons that I won't
detail here. I want to show you what I have done and why, and to ask for
opinions on whether this is a good direction and whether it could be
ported to core. The benefit of this change is much greater than the pain
it causes.
Of course, the solution below gives only an estimate. It cannot count
modifications made with the built-in editor. Still, it is a good and
useful estimate.
== Example ==
Please have a look at
https://hu.wikipedia.org/wiki/Szerkeszt%C5%91:BinBot/munka#2017._szeptember_4.
The first numerical value in the table is the number of modified pages, the
last one is the total number of replacements. The difference is
astonishing: for some tasks the two numbers are equal, while for one task
the last number is 12 times the first. I think this is worth showing.
== Motivation ==
=== Counting the replacements ===
* Statistics
* Choosing bot tasks by efficiency
* Printing the number of replacements to the screen after each page makes
the work safer. Sometimes not every diff is properly coloured (e.g. think
of a space), and the work is tiring, so it is easy to overlook a change,
but the number may draw the user's attention to it.
* Providing data for the community (e.g. which common errors are dangerous,
where we need further steps)
* Natural curiosity of a bot owner
* Scientific purpose
* etc.
=== Saving the old-new pairs to a file or a wikipage ===
* Preparing new bot tasks, developing fixes and regexes
* Creating lists of common errors for the community
* There is a common spelling error which is quite easy to detect when the
word is [[link]]ed, but almost impossible without linking (due to the
enormous number of false positives). My idea is to save the hits from the
linked version and use them for the unlinked one as a list of errors
rather than a pattern.
* There is another common error which is not worth treating by bot due to
the enormous number of false positives. But if I could save the list
automatically (without modifying pages), it could be revised by volunteers
and used later as a list of errors rather than a pattern.
* Showing this list to users or interest groups in order to teach them
which errors to avoid in the future.
* Scientific purpose
* etc.
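To illustrate the "list of errors rather than a pattern" idea, here is a
minimal sketch. It assumes the saved pairs are stored one per line as
old<TAB>new, which is only a hypothetical file format, and the helper names
load_pairs() and apply_pairs() are invented for this example; they exist
neither in compat nor in core.

```python
import re


def load_pairs(lines):
    """Parse 'old<TAB>new' lines into a dict (hypothetical file format)."""
    pairs = {}
    for line in lines:
        line = line.rstrip('\n')
        if '\t' not in line:
            continue  # skip blank or malformed lines
        old, new = line.split('\t', 1)
        if old != new:  # a no-op pair is useless as a fix
            pairs[old] = new
    return pairs


def apply_pairs(text, pairs):
    """Replace every saved error that occurs as a whole word in text."""
    if not pairs:
        return text
    # Longest first, so a longer error is tried before its own prefix.
    alternatives = sorted(pairs, key=len, reverse=True)
    regex = re.compile(
        r'\b(?:%s)\b' % '|'.join(re.escape(old) for old in alternatives))
    return regex.sub(lambda match: pairs[match.group()], text)
```

Because every alternative is an escaped literal, only the exact forms that
were seen (and possibly revised by volunteers) are touched, which is what
keeps the false positives away. The \b anchors assume the saved errors
begin and end with word characters.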
== Solution ==
Sorry, I cannot create a diff now, because this directory is not versioned.
However, these 4 steps are not complicated to follow.
=== textlib.py ===
def replaceExcept(text, old, new, exceptions, caseInsensitive=False,
                  allowoverlap=False, marker='', site=None):
became:
def replaceExcept(text, old, new, exceptions, caseInsensitive=False,
                  allowoverlap=False, marker='', site=None, returnPairs=False):
Just within 80 characters. :-) It won't cause any harm when called from
anywhere without the new argument: the behaviour is unchanged for existing
calls.
A new initialization:
pairs = []
At the end of the main if, bottom of this branch:
        else:
            # We found a valid match. Replace it.
the last line:
            markerpos = match.start() + len(replacement)
became:
            markerpos = match.start() + len(replacement)
            pairs.append((match.group(), replacement))
And at the very end of the function, instead of return text, I now have:
    if returnPairs:
        return (text, pairs)
    else:
        return text
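For readers who want to try the idea without a compat checkout, here is a
self-contained sketch of the same pattern. It is not the real
replaceExcept(): the function replace_with_pairs() is a simplified stand-in
invented for this example, and it ignores exceptions, markers,
case-insensitivity and overlapping matches.

```python
import re


def replace_with_pairs(text, old, new, return_pairs=False):
    """Simplified stand-in for replaceExcept() that can report its work."""
    pairs = []

    def substitute(match):
        # Expand backreferences such as \1 in the replacement template.
        replacement = match.expand(new)
        pairs.append((match.group(), replacement))
        return replacement

    new_text = re.sub(old, substitute, text)
    if return_pairs:
        return new_text, pairs
    # Default behaviour: callers that don't ask for pairs get a plain string.
    return new_text
```

Called without the extra argument it returns just the text, as before;
with return_pairs=True it returns a (text, pairs) tuple.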
=== replace.py ===
replaceExcept() is called from doReplacements(). Skipping the details:
instead of returning new_text, it now does
        return (new_text, replaceList)
where replaceList is a list of (old, new) tuples.
Generally it is not recommended to mix returning values with side effects,
such as storing the pairs in a list that lives outside the method, so I
decided to return the pairs. The main method of the bot (run()) can handle
them according to the given parameters: increment a counter, save the
(old, new) pairs to a file or a wikipage, or do nothing beyond the classic
task of replacement. This needs some memory, but at this point only the
pairs of the current page are stored. Unless you explicitly create a huge
list of all the occurring pairs, which is not necessary, it won't cause a
problem.
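As a rough sketch of how run() could consume the returned pairs, assume
pages are plain (title, text) tuples and the pair log is any writable
file-like object. Both are simplifications for this example, and
do_replacements() and run() below are stand-ins, not the actual replace.py
code.

```python
import re


def do_replacements(text, replacements):
    """Apply each (pattern, new) in turn; return (new_text, replace_list)."""
    replace_list = []
    for pattern, new in replacements:
        def substitute(match, new=new):
            replacement = match.expand(new)
            replace_list.append((match.group(), replacement))
            return replacement
        text = re.sub(pattern, substitute, text)
    return text, replace_list


def run(pages, replacements, pair_log=None):
    """Return (number of modified pages, total number of replacements)."""
    modified_pages = 0
    total_replacements = 0
    for title, text in pages:
        new_text, replace_list = do_replacements(text, replacements)
        if new_text == text:
            continue  # nothing changed on this page
        modified_pages += 1
        total_replacements += len(replace_list)
        print('%s: %d replacement(s)' % (title, len(replace_list)))
        if pair_log is not None:
            # Only the current page's pairs are in memory; write them out now.
            for old, new in replace_list:
                pair_log.write('%s\t%s\n' % (old, new))
    return modified_pages, total_replacements
```

The two returned numbers correspond to the first and last columns of the
table in the example above, and the per-page list is discarded as soon as
it has been counted and logged.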
-- 
Bináris

