https://bugzilla.wikimedia.org/show_bug.cgi?id=68705
Bug ID: 68705
Summary: 'category remove' regression when category contains only files
Product: Pywikibot
Version: core (2.0)
Hardware: All
OS: All
Status: NEW
Severity: normal
Priority: Unprioritized
Component: category.py
Assignee: Pywikipedia-bugs@lists.wikimedia.org
Reporter: jayvdb@gmail.com
Web browser: ---
Mobile Platform: ---
(Creating bug from https://gerrit.wikimedia.org/r/#/c/149645/)
In May 2014 this changeset broke 'category.py remove' for categories with only files and no pages. https://gerrit.wikimedia.org/r/#/c/130301/
It is also a significant performance regression: 'categoryinfo' requires the server to fetch all of the category members, which is slow, only to return counts. Since category.py needs to iterate over the category members anyway, it must hit the server again.
Fabian CommodoreFabianus@gmx.de changed:
What       | Removed | Added
-----------+---------+---------------------------
CC         |         | CommodoreFabianus@gmx.de
--- Comment #1 from Fabian CommodoreFabianus@gmx.de --- I'm confused: I don't see a way to get the counts without hitting the server. And the server must be hit at least once anyway, to get the pages or categories in the “worst case”.
So the previous handling without using 'categoryinfo' seems to be the better solution. Of course, it could be written a bit more pythonically by simply using

    articles = cat.articles()
    if articles:
        # remove them
    else:
        # "complain"
Even the change in 'CategoryTreeRobot' should be reverted, because the database should be the fastest way to get the number of articles in a category.
Speaking of the database: it seems to be the only way to avoid hitting the server in the “best case” (|articles| == 0). But 'CategoryMoveRobot' doesn't use the database, and the database is only used for read-only operations.
But the changes above line 644 (old)/642 (new) look fine to me.
Now of course I'm fairly new to the bot's internals, so my assessment is probably off.
John Mark Vandenberg jayvdb@gmail.com changed:
What       | Removed | Added
-----------+---------+---------------------------
CC         |         | info@gno.de, valhallasw@arctus.nl
--- Comment #2 from John Mark Vandenberg jayvdb@gmail.com --- Your assessment is exactly the same as mine. CC:ing the people involved in the previous changeset in case they can see some problem with our assessment.
I didn't +2 your changeset because it looks to me like the previous changeset needs to be mostly reverted, except for the formatting changes. Do you want to put up a 'revert' patch for review?
(Yes, the server must be hit at least once. Invoking cat.categoryinfo is the unnecessary performance regression: the client logic needs all of the contents anyway, so it isn't useful to ask the server to compute the counts when the client can do it.)
I believe the database is only used for read-only operations because it isn't a significant problem if read-only operations use stale data. Write operations need to use fresh data to minimise errors.
--- Comment #3 from xqt info@gno.de --- cat.articles() is a generator and bool(cat.articles()) is always True.
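[A quick standalone sketch, using plain Python with no Pywikibot required, of why truthiness can't be used here, plus a cheap alternative with next() and a sentinel default:]

```python
def empty_gen():
    """A generator function that yields nothing."""
    return
    yield  # unreachable, but makes this a generator function

# A generator object is always truthy, whether or not it yields anything.
g = empty_gen()
print(bool(g))  # True, even though g produces no items

# Pulling one item with a default is a cheap emptiness test,
# but note that it consumes the first item if one exists.
first = next(empty_gen(), None)
print(first is None)  # True -> the generator was empty
```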
You may try a simple performance test like:

    456626

You'll get the result above immediately.

    g = c.articles()
    len(set(g))

You may wait several minutes to get a result; I gave up after 5.
--- Comment #4 from John Mark Vandenberg jayvdb@gmail.com --- (In reply to xqt from comment #3)
I see the server doesn't spend much time working on this, as it loads the data from the category table.
https://www.mediawiki.org/wiki/Manual:Category_table
However, it is another API call, which adds more network latency (think of someone in the jungle of Borneo on a 3G modem), server workload, and delays (an issue on slower wikis), etc.
len(set(g)) is obviously not the best way to determine if the generator has any results.
The best way is, I think, to process the generator with an else block to detect whether the generator had no results. There are other tricks to instantaneously determine whether a generator is empty.
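[One such trick, sketched here as a hypothetical helper that is not part of Pywikibot: peek one item with next() and re-attach it with itertools.chain, so nothing is lost.]

```python
import itertools

_SENTINEL = object()


def non_empty(gen):
    """Return None if gen yields nothing; otherwise return an
    equivalent iterator with the peeked first item re-attached."""
    first = next(gen, _SENTINEL)
    if first is _SENTINEL:
        return None
    return itertools.chain([first], gen)


print(non_empty(iter([])))                # None -> complain, nothing to do
print(list(non_empty(iter(['A', 'B']))))  # ['A', 'B'] -> no item was lost
```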
--- Comment #5 from Fabian CommodoreFabianus@gmx.de --- Okay, my mistake: 'bool(cat.articles())' returns True whenever it's not None. But you could then do something like:

    empty = True
    for page in cat.articles():
        empty = False
    if empty:
        # complain
Now, I'm not sure if this is slower (it doesn't appear to be). And maybe there is a more pythonic way. The for-else does not differentiate between an empty list and a non-empty one.
--- Comment #6 from John Mark Vandenberg jayvdb@gmail.com --- (In reply to Fabian from comment #5)
> The for-else does not differentiate between an empty list and a non-empty one.
This will work:

    for i, page in enumerate(generator, 1):
        ...
    else:
        if 'i' not in locals():
            ...
(I don't like locals() being used.)
There are many other ways to do it, including wrapping the generator in a generator class that allows peeking.
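[A minimal sketch of such a wrapper class. This is a hypothetical illustration, not an existing or proposed Pywikibot API.]

```python
class Peekable:
    """Iterator wrapper that can inspect the next item without consuming it."""

    _SENTINEL = object()

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._cache = self._SENTINEL

    def __iter__(self):
        return self

    def __next__(self):
        # Hand back the cached item first, if peek() stashed one.
        if self._cache is not self._SENTINEL:
            item, self._cache = self._cache, self._SENTINEL
            return item
        return next(self._it)

    def peek(self, default=None):
        """Return the next item without consuming it, or default if empty."""
        if self._cache is self._SENTINEL:
            try:
                self._cache = next(self._it)
            except StopIteration:
                return default
        return self._cache

    def __bool__(self):
        # Unlike a raw generator, truthiness now reflects emptiness.
        return self.peek(self._SENTINEL) is not self._SENTINEL


print(bool(Peekable([])))   # False -> safe to "complain"
p = Peekable(['A'])
print(bool(p), list(p))     # True ['A'] -> the peeked item is not lost
```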
Ricordisamoa ricordisamoa@openmailbox.org changed:
What       | Removed | Added
-----------+---------+---------------------------
CC         |         | ricordisamoa@openmailbox.org
--- Comment #7 from Ricordisamoa ricordisamoa@openmailbox.org --- https://gerrit.wikimedia.org/r/149645/ has been merged; is this fixed?
John Mark Vandenberg jayvdb@gmail.com changed:
What       | Removed | Added
-----------+---------+---------------------------
Status     | NEW     | RESOLVED
Resolution | ---     | FIXED