I have done the work for compat; it is running now, and I plan to open the
ticket when I get the numbers.
As far as I know, compat is unfortunately completely deprecated. Is there
nevertheless any possibility to upload a patch? Otherwise I can describe
here what I did. I know that people still use compat.
Let's talk about the second problem. I am not sure it can easily be solved
to everybody's satisfaction, but I was already thinking about it. (I
have plenty of plans concerning replace.py, which is quite poor now.)
Advanced use of replace.py needs a two-run approach. First we collect
candidates from the dump with no human interaction, while the bot owner
sleeps or works or does any other useful thing.
The titles are collected to a file, and in the second run the owner
processes them interactively, much faster. All the parts of this
process that I implemented in compat are still missing from core,
making replace.py useless for me, but this is obviously a temporary state.
So let's design replace.py so that it supports direct immediate
replacements as well as two-run replacements.
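The two-run workflow described above could be sketched roughly like this (the function names and the one-title-per-line file format are illustrative assumptions, not the actual compat implementation):

```python
import os
import re
import tempfile

def collect_candidates(pages, pattern, out_path):
    """First run: scan dump pages unattended, save titles of matching pages."""
    with open(out_path, "w", encoding="utf-8") as f:
        for title, text in pages:
            if re.search(pattern, text):
                f.write(title + "\n")

def load_candidates(path):
    """Second run: read back the saved titles for interactive processing."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Demo with dummy pages instead of a real XML dump:
pages = [("Foo", "teh cat"), ("Bar", "the cat"), ("Baz", "teh dog")]
path = os.path.join(tempfile.mkdtemp(), "candidates.txt")
collect_candidates(pages, r"\bteh\b", path)
titles = load_candidates(path)
print(titles)  # → ['Foo', 'Baz']
```

The point of the split is that the slow, unattended scan and the fast, interactive pass never have to run in the same session.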
If you want to replace immediately in one run, the replacement should be
done only once to save time. But where? XmlDumpReplacePageGenerator is a
separate class that can yield pages; I don't think we should remove it.
The main class has the complexity of handling the separate cases and
human interaction. I don't think we should transfer that to
XmlDumpReplacePageGenerator. Perhaps the best solution is for
XmlDumpReplacePageGenerator not to replace, just to search. This will be
somewhat faster.
If you save the titles for later human processing, XmlDumpReplacePageGenerator
does not have to do the replacement in most cases, just search again.
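The difference between the two modes comes down to this: to decide whether a page is a candidate, a plain search is enough, whereas a replacement has to build a whole new copy of the text before we can even compare it to the original (an illustrative sketch, not pywikibot code):

```python
import re

text = "word " * 10000 + "target"
pattern = re.compile(r"target")

# Full replacement constructs a new string even when we only
# need a yes/no answer about whether the page matches:
new_text = pattern.sub("replacement", text)
changed = new_text != text  # True, but we paid for the whole copy

# A plain search answers the same question without building new text:
found = pattern.search(text) is not None  # True, and cheaper
```

For a dump scan that only has to yield candidate pages, the second form does strictly less work per page.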
There is a third case: when I develop new fixes, I often do experiments.
It is useful to see the planned replacements during the first run; this
helps me enhance the fix. So I wouldn't remove the replacing ability entirely.
This needs a separate switch which can be ON by default when we use -xml.
Please keep in mind that accelerating the generator is important, but
keeping the speed of the main replace bot high is even more important.
When you want to totally avoid double work, you don't use a dump at all.
So I have three ideas for how this could work:
1. The switch tells the generator to replace the second member of the
replacement tuples with ''. I don't have numbers on how much faster this
would be. This has some danger, so the bot must ensure that the switch is
effective only if we save titles to a file or work in simulation mode,
so as not to destroy the wiki.
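Idea 1 amounts to something like the following transformation of the replacement tuples (a hypothetical sketch; I'm assuming the (old, new) tuple shape replace.py uses, with compiled patterns):

```python
import re

replacements = [(re.compile(r"\bcolour\b"), "color"),
                (re.compile(r"\bteh\b"), "the")]

# Neutralize the "new" side: the generator still performs the
# substitution, but with an empty replacement string, which is
# presumably cheaper than expanding the real one.
detect = [(old, "") for old, new in replacements]

# This must only ever run when titles go to a file or the bot is in
# simulation mode -- saving a page after this would destroy content.
text = "teh colour of the sky"
changed = any(old.sub(new, text) != text for old, new in detect)
print(changed)  # → True
```

Whether substituting '' is measurably faster than substituting the real string is exactly the number the experiment would have to produce.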
2. The generator will search instead of replace. I don't like this idea,
because textlib.py has the complexity of listening for exceptions,
comments, nowikis etc.
3. We enhance textlib.py so that replaceExcept() gets a new parameter.
This will make replaceExcept() use a search rather than a replace.
*This is the good solution.* In this case the function could return a
dummy text which differs from the original, so that we don't have to
rewrite the scripts which use it.
Anyhow, replaceExcept() needs another enhancement, which I have already
done in my copy: it should optionally return (old, new) pairs for further
processing. This is very useful for developing fixes, measuring
efficiency, creating statistics etc. This will be a separate task, but if
you agree with this solution, we may add the two new parameters in one go.
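Collecting the (old, new) pairs could be sketched like this (again a toy stand-in with an assumed `collect` parameter, not the real textlib code):

```python
import re

def replace_except(text, old, new, collect=None):
    """Toy replaceExcept() that optionally records each (old, new) pair."""
    def repl(match):
        replaced = match.expand(new)
        if collect is not None:
            # Record what was matched and what it became, for statistics.
            collect.append((match.group(0), replaced))
        return replaced
    return re.sub(old, repl, text)

pairs = []
result = replace_except("teh cat saw teh dog", r"\bteh\b", "the", collect=pairs)
print(result)  # → 'the cat saw the dog'
print(pairs)   # → [('teh', 'the'), ('teh', 'the')]
```

Having the concrete pairs makes it easy to eyeball a new fix's behaviour or count which sub-rules of a fix actually fire.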
So we have three tickets now. :-)
2018-09-17 8:10 GMT+02:00 <info(a)gno.de>:
There is another issue with that generator: it always computes the
replacements but does not apply them, which means replacements are always
done twice, which might slow down the run too. I think we should open a
Phabricator task for it.
On 16.09.2018 at 22:03, Bináris <wikiposta(a)gmail.com> wrote:
I still use trunk/compat for many reasons, but as I see from the new
code, the core version must suffer from the same problem.
If we use -namespace for namespace filtering, the class
XmlDumpReplacePageGenerator will go through ALL pages, and THEN the result
is filtered by a namespace generator. This may MULTIPLY the running time
in some cases, and may cost hours or even days for a complicated fix.
I have just checked that the dump does contain namespace information.
So why don't we filter during the scan?
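Filtering during the scan could be as simple as checking the page's <ns> element before doing any text work. A sketch with the stdlib XML parser on a simplified dump (real dumps carry an XML namespace on every tag, and pywikibot's own reader is xmlreader.XmlDump, so this is only an illustration of the idea):

```python
import io
import xml.etree.ElementTree as ET

DUMP = """<mediawiki>
  <page><title>Main</title><ns>0</ns><text>teh cat</text></page>
  <page><title>Talk:Main</title><ns>1</ns><text>teh dog</text></page>
</mediawiki>"""

def pages_in_namespaces(source, namespaces):
    """Yield (title, text) only for pages in the wanted namespaces."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "page":
            continue
        if int(elem.findtext("ns")) in namespaces:
            yield elem.findtext("title"), elem.findtext("text")
        elem.clear()  # free memory; real dumps are huge

titles = [t for t, _ in pages_in_namespaces(io.StringIO(DUMP), {0})]
print(titles)  # → ['Main']
```

Pages in unwanted namespaces are skipped before any regex ever touches their text, which is where the 14-hour scan below was losing most of its time.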
I made an experiment: I modified my copy to display the count of articles
and the count of matching pages. The replacement was:
which seems pretty slow. :-(
The bot scanned the latest huwiki dump for 14 hours(!). (Not the whole
dump, I used -xmlstart.) It went through 820 thousand pages and found 240+
matches (I displayed every 10th match).
Then the bot worked a further 30-40 minutes to check the actual pages
from the live wiki, this time with namespace filtering on. (I don't
replace in this phase, just save the list, so no human interaction is
involved in this time.)
Guess the result! 62 out of 240 remained. This means that the bigger part
of these 14 hours went into /dev/null.
Now I realize how much time I wasted in the past 10 years. :-(
I am sure that passing namespaces to XmlDumpReplacePageGenerator is worthwhile.
pywikibot mailing list