jenkins-bot has submitted this change and it was merged. ( https://gerrit.wikimedia.org/r/475613 )
Change subject: proofreadpage.py: OCR needs BeautifulSoup
......................................................................
proofreadpage.py: OCR needs BeautifulSoup
In proofreadpage.py, OCR needs BeautifulSoup in:
- url_image()
- _do_hocr()
Soup() is defined at import time only if bs4 is available.
Define it also when bs4 is not available and make it raise
ImportError when called.
Rename Soup() to _bs4_soup() to comply with function naming rules.
OCR tests are already skipped if bs4 is not available:
- see Iaeabb046660b294fa19025282a344356f756c5bf
Bug: T210335
Change-Id: I5e3d235cdb1cba9b4ed52ba2442a9bfb1802d9bf
---
M pywikibot/proofreadpage.py
1 file changed, 18 insertions(+), 7 deletions(-)
Approvals:
Xqt: Looks good to me, approved
jenkins-bot: Verified
diff --git a/pywikibot/proofreadpage.py b/pywikibot/proofreadpage.py
index e22c2e6..5432d70 100644
--- a/pywikibot/proofreadpage.py
+++ b/pywikibot/proofreadpage.py
@@ -38,20 +38,29 @@
from bs4 import BeautifulSoup, FeatureNotFound
except ImportError as e:
BeautifulSoup = e
+
+ def _bs4_soup(*args, **kwargs):
+ """Raise BeautifulSoup when called, if bs4 is not
available."""
+ raise BeautifulSoup
else:
try:
BeautifulSoup('', 'lxml')
except FeatureNotFound:
- Soup = partial(BeautifulSoup, features='html.parser')
+ _bs4_soup = partial(BeautifulSoup, features='html.parser')
else:
- Soup = partial(BeautifulSoup, features='lxml')
+ _bs4_soup = partial(BeautifulSoup, features='lxml')
import pywikibot
from pywikibot.comms import http
from pywikibot.data.api import Request
+from pywikibot.tools import ModuleDeprecationWrapper
_logger = 'proofreadpage'
+wrapper = ModuleDeprecationWrapper(__name__)
+wrapper._add_deprecated_attr('Soup', _bs4_soup, replacement_name='_bs4_soup',
+                             since='20181128')
+
class FullHeader(object):
@@ -524,9 +533,10 @@
@rtype: str/unicode
@raises Exception: in case of http errors
+ @raise ImportError: if bs4 is not installed, _bs4_soup() will raise
@raises ValueError: in case of no prp_page_image src found for scan
"""
- # wrong link fail with various possible Exceptions.
+ # wrong link fails with various possible Exceptions.
if not hasattr(self, '_url_image'):
if self.exists():
@@ -541,7 +551,7 @@
pywikibot.error('Error fetching HTML for %s.' % self)
raise
- soup = Soup(response.text)
+ soup = _bs4_soup(response.text)
try:
self._url_image = soup.find(class_='prp-page-image')
@@ -623,10 +633,11 @@
This is the main method for 'phetools'.
Fallback method is ocr.
+ @raise ImportError: if bs4 is not installed, _bs4_soup() will raise
"""
def parse_hocr_text(txt):
"""Parse hocr text."""
- soup = Soup(txt)
+ soup = _bs4_soup(txt)
res = []
for ocr_page in soup.find_all(class_='ocr_page'):
@@ -823,7 +834,7 @@
del self._parsed_text
self._parsed_text = self._get_parsed_page()
- self._soup = Soup(self._parsed_text)
+ self._soup = _bs4_soup(self._parsed_text)
# Do not search for "new" here, to avoid to skip purging if links
# to non-existing pages are present.
attrs = {'class': re.compile('prp-pagequality')}
@@ -845,7 +856,7 @@
self.purge()
del self._parsed_text
self._parsed_text = self._get_parsed_page()
- self._soup = Soup(self._parsed_text)
+ self._soup = _bs4_soup(self._parsed_text)
if not self._soup.find_all('a', attrs=attrs):
raise ValueError(
'Missing class="qualityN prp-pagequality-N" or '
--
To view, visit https://gerrit.wikimedia.org/r/475613
To unsubscribe, or for help writing mail filters, visit https://gerrit.wikimedia.org/r/settings
Gerrit-Project: pywikibot/core
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I5e3d235cdb1cba9b4ed52ba2442a9bfb1802d9bf
Gerrit-Change-Number: 475613
Gerrit-PatchSet: 6
Gerrit-Owner: Mpaa <mpaa.wiki(a)gmail.com>
Gerrit-Reviewer: John Vandenberg <jayvdb(a)gmail.com>
Gerrit-Reviewer: Mpaa <mpaa.wiki(a)gmail.com>
Gerrit-Reviewer: Xqt <info(a)gno.de>
Gerrit-Reviewer: jenkins-bot (75)