I have corrected the misspelled subject and added one more good reason to do it. Please reply to this one, not the previous.
Happy Monday to all!
I have had a dream for many years: when I use replace.py to correct grammatical and spelling errors (that is what I do in most of the time I dedicate to bot work), it would be nice and useful to count the replacements and to extract the old-new pairs for later use. This task requires changing replaceExcept() in textlib, so for a long time I was not brave enough: I thought it would be more complicated, and I was afraid of community rejection, as often happens when a task is important to somebody but others do not feel the same.
But now compat is deserted, so I was brave enough to experiment with my own copy. I still use replace.py in compat for several reasons that I will not detail here. I want to show you what I have done and why, and ask for opinions on whether this is a good direction and whether it could be ported to core. The benefit of this change is much greater than the pain it causes. Of course, the solution below gives only an estimate: it cannot count modifications made with the built-in editor. Still, it is a good and useful estimate.
== Example ==
Please have a look at https://hu.wikipedia.org/wiki/Szerkeszt%C5%91:BinBot/munka#2017._szeptember_4. The first numerical value in the table is the number of modified pages; the last one is the total number of replacements. The difference is astonishing: for some tasks the two numbers are equal, while there is one where the last number is 12 times as big as the first. I think this is worth showing.
== Motivation ==
=== Counting the replacements ===
* Statistics
* Choosing bot tasks by efficiency
* Printing the number of replacements to the screen after each page increases the security of the work. Sometimes not every diff is properly coloured (e.g. think of a space), and the work is tiring, so it is easy to skip a change, but the number may make the user focus on it.
* Giving data to the community (e.g. which are dangerous common errors, where we need further steps)
* Natural curiosity of a bot owner
* Scientific purpose
* etc.
=== Saving the old-new pairs to a file or a wikipage ===
* Preparing new bot tasks, developing fixes and regexes
* Creating lists of common errors for the community
* There is a common spelling error which is quite easy to detect when the word is [[link]]ed, but almost impossible without linking (due to the enormous number of false positives). My idea is to save the hits from the linked version and use them for the unlinked one as a list of errors rather than a pattern.
* There is another common error which is not worth treating by bot due to the enormous number of false positives. But if I could save the list automatically (without modifying pages), it could be reviewed by volunteers and used later as a list of errors rather than a pattern.
* Showing this list to users or interest groups in order to teach them which errors to avoid in the future
* Scientific purpose
* etc.
== Solution ==
Sorry, I cannot create a diff now, because this directory is not versioned. However, these 4 steps are not complicated to follow.
=== textlib.py ===
    def replaceExcept(text, old, new, exceptions, caseInsensitive=False,
                      allowoverlap=False, marker='', site=None):
became:
    def replaceExcept(text, old, new, exceptions, caseInsensitive=False,
                      allowoverlap=False, marker='', site=None, returnPairs=False):
Just within 80 characters. :-) It won't cause any harm when called from anywhere without the new argument: the behaviour is unchanged for existing calls.

A new initialization:
    pairs = []

At the end of the main if, at the bottom of this branch:
        else:
            # We found a valid match. Replace it.
the last line:
    markerpos = match.start() + len(replacement)
became:
    markerpos = match.start() + len(replacement)
    pairs.append((match.group(), replacement))

And at the very end of the method, instead of
    return text
now I have:
    if returnPairs:
        return (text, pairs)
    else:
        return text
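To make the idea easier to try out without a compat checkout, here is a minimal standalone sketch of the same pattern. It is not the real replaceExcept(): the function name replace_with_pairs and its return_pairs argument are made up for illustration, and exceptions, markers, overlap handling and site are all left out. It only shows how pairs can be collected during substitution while the default return value stays a plain string.

```python
import re


def replace_with_pairs(text, old, new, return_pairs=False):
    """Simplified, hypothetical stand-in for replaceExcept().

    Replaces every match of the regex `old` with the template `new`
    and optionally also returns the list of (old, new) pairs.
    """
    pattern = re.compile(old) if isinstance(old, str) else old
    pairs = []

    def substitute(match):
        # Expand backreferences such as \1 in the replacement template.
        replacement = match.expand(new)
        # Record what was found and what it became.
        pairs.append((match.group(), replacement))
        return replacement

    new_text = pattern.sub(substitute, text)
    if return_pairs:
        return new_text, pairs
    # Existing callers that do not pass return_pairs still get a string.
    return new_text


if __name__ == "__main__":
    fixed, found = replace_with_pairs(
        "Recieve this. I recieve that.",
        r"\b([Rr])ecieve\b", r"\1eceive",
        return_pairs=True)
    print(fixed)
    print(found)
```

As in the change described above, a caller that does not ask for pairs sees no difference at all.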
=== replace.py ===
replaceExcept() is called from doReplacements(). Without going into details: instead of returning new_text, it will now return (new_text, replaceList), where replaceList is a list of (old, new) tuples. Generally it is not recommended to mix returning values with side effects, such as storing the pairs in a list that lives outside the method, so I decided to give the pairs back. The main method of the bot (run()) can handle them according to the given parameters: either increment a counter, or save the (old, new) pairs to a file or a wikipage, or do nothing beyond the classic task of replacement. This needs some memory, but at this point only the pairs of the current page are stored. Unless you explicitly create a huge list with all the occurring pairs, which is not necessary, it won't cause a problem.
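The flow described above could look roughly like the sketch below. It is only an illustration under assumptions: do_replacements() and run() here are simplified stand-ins written for this mail, not the real methods of replace.py; pages are a plain dict of title to text instead of a page generator; and the count and save_path parameters are hypothetical names. Note that only the pairs of the current page are kept in memory, as they are written out or counted page by page.

```python
import re


def do_replacements(text, replacements):
    """Hypothetical stand-in for doReplacements().

    Applies each (pattern, template) in turn and returns
    (new_text, pair_list) for this one page.
    """
    page_pairs = []
    for old, new in replacements:
        def substitute(match, new=new):
            replacement = match.expand(new)
            page_pairs.append((match.group(), replacement))
            return replacement
        text = re.sub(old, substitute, text)
    return text, page_pairs


def run(pages, replacements, count=True, save_path=None):
    """Hypothetical main loop over a {title: text} dict."""
    results = {}
    modified_pages = 0
    total_replacements = 0
    for title, text in pages.items():
        new_text, pairs = do_replacements(text, replacements)
        results[title] = new_text
        if new_text != text:
            modified_pages += 1
        if count:
            total_replacements += len(pairs)
        if save_path and pairs:
            # Append this page's pairs; nothing accumulates in memory.
            with open(save_path, "a", encoding="utf-8") as f:
                for old, new in pairs:
                    f.write(f"{title}\t{old}\t{new}\n")
    return results, modified_pages, total_replacements


if __name__ == "__main__":
    demo_pages = {"A": "teh cat and teh dog", "B": "nothing here"}
    print(run(demo_pages, [(r"\bteh\b", "the")]))
```

The two counters returned by run() correspond to the two numbers in the table mentioned in the Example section: modified pages and total replacements.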