Hi folks,
I still use trunk/compat for many reasons, but judging by the new code at https://github.com/wikimedia/pywikibot/blob/master/scripts/replace.py, the core version must suffer from the same problem.
If we use -namespace for namespace filtering, class XmlDumpReplacePageGenerator goes through ALL pages, and only THEN is the result filtered by a namespace generator. This may MULTIPLY the running time in some cases, and may cost hours or even days for a fix with complicated, slow regexes. I have just checked that the dump does contain namespace information. So why don't we filter during the scan?
I made an experiment. I modified my copy to display the count of articles and the count of matching pages. The replacement was: (ur'(\d)\s*%', ur'\1%'), which seems pretty slow. :-( The bot scanned the latest huwiki dump for 14 hours(!). (Not the whole dump, I used -xmlstart.) It went through 820 thousand pages and found 240+ matches (I displayed every 10th match). Then the bot worked a further 30-40 minutes to check the actual pages on the live wiki, this time with namespace filtering on. (I don't replace in this phase, just save the list, so no human interaction is involved in this time.) Guess the result! 62 out of 240 remained. This means that the bigger part of these 14 hours went into /dev/null. Now I realize how much time I wasted in the past 10 years. :-(
I am sure that passing namespaces to XmlDumpReplacePageGenerator is worth it.
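To illustrate what filtering during the scan could look like, here is a minimal sketch. It does not use the real pywikibot classes: DumpEntry and matching_titles() are hypothetical stand-ins, and the sample pages are made up. The only point is the order of the checks: the cheap namespace test comes before the slow regex.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Iterator, Set


@dataclass
class DumpEntry:
    """Hypothetical stand-in for one page record read from an XML dump."""
    title: str
    ns: int
    text: str


def matching_titles(entries: Iterable[DumpEntry],
                    pattern: re.Pattern,
                    namespaces: Set[int]) -> Iterator[str]:
    """Yield titles of pages in the wanted namespaces whose text matches."""
    for entry in entries:
        # Cheap integer check first: pages in unwanted namespaces are
        # skipped before the (possibly slow) regex ever runs.
        # An empty set means "no namespace filtering".
        if namespaces and entry.ns not in namespaces:
            continue
        if pattern.search(entry.text):
            yield entry.title


# Made-up sample data; namespace 0 is the article namespace, 1 is Talk.
entries = [
    DumpEntry("Foo", 0, "growth of 5 %"),
    DumpEntry("Talk:Foo", 1, "about 7 % here"),
    DumpEntry("Bar", 0, "no percent sign"),
]
pattern = re.compile(r"(\d)\s*%")
print(list(matching_titles(entries, pattern, {0})))
```

With the namespace set {0} the talk page is never handed to the regex at all, which is where the saving would come from on a real dump.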
There is another issue with that generator: it always checks for replacements but does not apply them, which means replacements are always done twice, and this might slow down the run too. I think we should open a Phabricator task for it.
Best
Xqt
On 16.09.2018 at 22:03, Bináris wikiposta@gmail.com wrote:
[...]
-- Bináris
pywikibot mailing list pywikibot@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/pywikibot
I have done the work for compat, now it is running, and I plan to open the ticket when I get the numbers. As far as I know, compat is unfortunately totally deprecated. Is there nevertheless any possibility to upload a patch? Otherwise I can describe here what I did. I know that people still use compat.
Let's talk about the second problem. I am not sure it can easily be solved to everybody's satisfaction, but I was already thinking about it. (I have plenty of plans concerning replace.py, which is quite poor now.)
Advanced use of replace.py needs a two-run approach. First we collect candidates from the dump with no human interaction, while the bot owner sleeps or works or does any other useful thing. The titles are collected into a file, and in the second run the owner processes them interactively, much faster. Everything belonging to this process that I implemented in compat is totally missing from core now, making replace.py useless for me, but this is obviously a temporary state. So let's keep in mind that replace.py must support direct, immediate replacements as well as two-run replacements.
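The two-run approach can be sketched roughly like this. This is not the compat implementation: collect_candidates() and load_candidates() are hypothetical names, the pages are made-up (title, text) tuples instead of a real dump, and the file format (one title per line) is an assumption.

```python
import re
import tempfile
from pathlib import Path
from typing import Iterable, List, Tuple


def collect_candidates(pages: Iterable[Tuple[str, str]],
                       pattern: re.Pattern,
                       out_file: Path) -> int:
    """First run: write the titles of matching pages to a file, one per line."""
    count = 0
    with out_file.open("w", encoding="utf-8") as f:
        for title, text in pages:
            if pattern.search(text):
                f.write(title + "\n")
                count += 1
    return count


def load_candidates(in_file: Path) -> List[str]:
    """Second run: read the saved titles back for interactive processing."""
    with in_file.open(encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]


# Demo with made-up page data standing in for an unattended dump scan.
pages = [("Foo", "growth of 5 %"), ("Bar", "nothing to fix")]
with tempfile.TemporaryDirectory() as tmp:
    title_file = Path(tmp) / "candidates.txt"
    found = collect_candidates(pages, re.compile(r"(\d)\s*%"), title_file)
    titles = load_candidates(title_file)
print(found, titles)
```

The first function is the part that can run overnight; the second only feeds the short list to the interactive run.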
If you want to replace immediately in one run, the replacement should be done only once to save time. But where? XMLdumpgenerator is a separate class that can yield pages, and I don't think we should remove it. The main class has the complexity of handling the separate cases and human interactions. I don't think we should transfer that to XMLdumpgenerator. Perhaps the best solution is that XMLdumpgenerator does not try to replace, just to search. This will be somewhat faster.
If you save the titles for later human processing, XMLdumpgenerator does not have to do the replacement in most cases, just search, again. There is a third case: when I develop new fixes, I often do experiments. It is useful to see the planned replacements during the first run, as this helps me to enhance the fix. So I wouldn't totally remove the replacing ability. This needs a separate switch, which can be ON by default when we use -xml.
Please keep in mind that accelerating the generator is important, but keeping the speed of the main replace bot high is even more important. If you want to totally avoid double work, you don't use the dump at all.
So I have three ideas for how this could work:
1. The switch tells the generator to replace the second parameter of the replacement tuples with ''. I don't have numbers on how much faster this would be. This carries some danger, so the bot must ensure that the switch is effective only if we save titles to a file or work in simulation mode, so as not to destroy the wiki.
2. The generator will search instead of replace. I don't like this idea, because textlib.py has the complexity to take care of exceptions, comments, nowikis etc.
3. We enhance textlib.py so that replaceExcept() gets a new parameter. This makes replaceExcept() use a search rather than a replace. *This is the good solution.* In this case the function could return a dummy text which differs from the original, so that we don't have to rewrite the scripts which use it.
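Idea 3 could look roughly like the sketch below. This is NOT the real textlib.replaceExcept(): the function name replace_except_sketch, the search_only parameter, the "\x00" dummy marker and the simplified exception handling (a match is skipped if it overlaps any span matched by an exception regex) are all assumptions for illustration.

```python
import re
from typing import Sequence


def replace_except_sketch(text: str,
                          old: re.Pattern,
                          new: str,
                          exceptions: Sequence[re.Pattern] = (),
                          search_only: bool = False) -> str:
    """Replace matches of `old` that lie outside the exception spans.

    With search_only=True no replacement is built: at the first eligible
    match a dummy text that differs from `text` is returned, so callers
    that only compare the result with the original keep working.
    """
    # Spans of text that must not be touched (e.g. nowiki sections).
    spans = [m.span() for exc in exceptions for m in exc.finditer(text)]

    def is_protected(match: re.Match) -> bool:
        return any(start < match.end() and match.start() < end
                   for start, end in spans)

    pieces = []
    pos = 0
    for match in old.finditer(text):
        if is_protected(match):
            continue
        if search_only:
            return text + "\x00"  # dummy result, guaranteed != text
        pieces.append(text[pos:match.start()])
        pieces.append(match.expand(new))
        pos = match.end()
    pieces.append(text[pos:])
    return "".join(pieces)


old = re.compile(r"(\d)\s*%")
nowiki = [re.compile(r"<nowiki>.*?</nowiki>", re.DOTALL)]
text = "rate 5 % <nowiki>6 %</nowiki>"
only_protected = "<nowiki>6 %</nowiki>"

replaced = replace_except_sketch(text, old, r"\1%", nowiki)
print(replaced)
print(replace_except_sketch(text, old, r"\1%", nowiki, search_only=True) != text)
print(replace_except_sketch(only_protected, old, r"\1%", nowiki, search_only=True)
      != only_protected)
```

A caller that tests "new text != old text" to decide whether a page is a candidate would behave the same in both modes, but in search mode the scan of a page stops at the first hit.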
Anyhow, replaceExcept() needs another enhancement, which I have already made in my copy. It should optionally return (old, new) pairs for further processing. This is very useful for developing fixes, measuring efficiency, creating statistics etc. This will be a separate task, but if you agree with this solution, we may add the two new parameters in one go.
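Returning the (old, new) pairs could be sketched as follows. This is not the code from my compat copy or from textlib.py: replace_with_pairs() is a hypothetical helper that ignores exceptions entirely and only shows how the pairs can be recorded while replacing.

```python
import re
from typing import List, Tuple


def replace_with_pairs(text: str,
                       old: re.Pattern,
                       new: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Return the replaced text plus a list of (matched, replacement) pairs."""
    pairs: List[Tuple[str, str]] = []

    def record(match: re.Match) -> str:
        # Expand group references such as \1 in the replacement template.
        replacement = match.expand(new)
        pairs.append((match.group(0), replacement))
        return replacement

    new_text = old.sub(record, text)
    return new_text, pairs


new_text, pairs = replace_with_pairs("5 % and 10  %",
                                     re.compile(r"(\d)\s*%"), r"\1%")
print(new_text)
for before, after in pairs:
    print(repr(before), "->", repr(after))
```

The list of pairs is what makes statistics possible, for example counting which pair of a fix actually fires most often.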
So we have three tickets now. :-)
2018-09-17 8:10 GMT+02:00 info@gno.de:
There is another issue with that generator: it always checks for replacements but does not apply them which means replacements are always done twice which might slow down the run too. I think we should open a Phabricator task for it. Best Xqt
On 17 September 2018 at 09:03, Bináris wikiposta@gmail.com wrote:
I have done the work for compat, now it is running, and I plan to open the ticket when I get the numbers. As far as I know, compat is unfortunately totally deprecated. Is there nevertheless any possibility to upload a patch?
Not for the compat branch, but for the core repository. You may use the Gerrit Patch Uploader [1] for it.
If you are unable to merge your patch into core, send me your compat patch and I'll try to merge it.
[1] https://www.mediawiki.org/wiki/Gerrit_patch_uploader
Best
Xqt
2018-09-17 9:03 GMT+02:00 Bináris wikiposta@gmail.com:
We enhance textlib.py so that replaceExcept() gets a new parameter. This makes replaceExcept() use a search rather than a replace. *This is the good solution.* In this case the function could return a dummy text which differs from the original, so that we don't have to rewrite the scripts which use it.
Also, if we have multiple (old, new) pairs in a fix, then with this switch replaceExcept() can return at the first match, so the page gets listed. This will speed it up again.
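The early return over several pairs can be sketched like this. page_is_candidate() is a hypothetical helper, not part of replace.py or textlib.py, and the second pair of the sample fix is made up. any() stops at the first pattern that matches, so the remaining pairs are never tried on that page.

```python
import re
from typing import List, Tuple

Replacement = Tuple[re.Pattern, str]


def page_is_candidate(text: str, replacements: List[Replacement]) -> bool:
    """True as soon as any 'old' pattern of the fix matches the text."""
    # The generator expression is evaluated lazily, so any() stops
    # searching at the first pair whose pattern matches.
    return any(old.search(text) for old, _new in replacements)


# Hypothetical fix with two pairs; only the first one is from this thread.
fix = [
    (re.compile(r"(\d)\s*%"), r"\1%"),
    (re.compile(r"\bteh\b"), "the"),
]
print(page_is_candidate("growth of 5 %", fix))
print(page_is_candidate("teh typo only", fix))
print(page_is_candidate("nothing to fix", fix))
```

For listing titles this is enough, because one hit already puts the page on the list.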
2018-09-16 22:03 GMT+02:00 Bináris wikiposta@gmail.com:
The bot scanned the latest huwiki dump for 14 hours(!). (Not the whole dump, I used -xmlstart.) It went through 820 thousand pages and found 240+ matches (I displayed every 10th match). Then the bot worked a further 30-40 minutes to check the actual pages on the live wiki, this time with namespace filtering on. (I don't replace in this phase, just save the list, so no human interaction is involved in this time.) Guess the result! 62 out of 240 remained. This means that the bigger part of these 14 hours went into /dev/null. Now I realize how much time I wasted in the past 10 years. :-(
I was not quite right. With the modified code it took 12 hours instead of 14, 630,000 pages were scanned instead of 820,000, and 83 matches were found instead of 240+ (of which 62 are real). But this is still not the same.
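For what it is worth, the ratios behind these numbers can be worked out as below. All figures are the ones quoted above; "240+" is treated as exactly 240, so the share of real matches in the unfiltered run is an upper bound.

```python
# Figures from the two runs reported in this thread.
hours_before, hours_after = 14, 12
pages_before, pages_after = 820_000, 630_000
matches_before, matches_after = 240, 83   # "240+" taken as 240
real_matches = 62

time_saved = 100 * (hours_before - hours_after) / hours_before
pages_skipped = 100 * (pages_before - pages_after) / pages_before
real_share_before = 100 * real_matches / matches_before
real_share_after = 100 * real_matches / matches_after

print(f"running time saved:      {time_saved:.1f} %")
print(f"pages skipped:           {pages_skipped:.1f} %")
print(f"real matches, old code:  {real_share_before:.1f} % (at most)")
print(f"real matches, new code:  {real_share_after:.1f} %")
```

So roughly a quarter of the pages were skipped and about one seventh of the time was saved, while 21 of the 83 listed matches still turned out not to be real.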