jenkins-bot has submitted this change and it was merged.
Change subject: [FIX] reflinks: Expect that urlopen return bytes
......................................................................
[FIX] reflinks: Expect that urlopen return bytes
With Python 3 an error occurred on using the regex, because re requires
that the pattern, the replacement and the searched string are either all
bytes or all str, while the result from urlopen is bytes.
On Python 2 all strs are actually bytes by default, so it worked fine
there. But with unicode_literals (introduced in 1e54a7d6) the pattern
and the replacement became unicode. While a unicode pattern still works
in Python 2, a unicode replacement forces the searched content to be
converted to unicode. As that conversion happens implicitly, it uses
ASCII and raises a UnicodeDecodeError when the content contains
non-ASCII characters.
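As a minimal illustration of the Python 3 failure mode (sample data made
up for this sketch; the real content comes from urlopen):

```python
import re

# urlopen(...).read() returns bytes on Python 3
data = b'<meta http-equiv="Content-Type" content="text/html">'

# A str (unicode) pattern cannot be applied to bytes; re raises TypeError
try:
    re.search(r'(?i)content\-type', data)
except TypeError as exc:
    print('str pattern on bytes:', exc)

# A bytes pattern (the br'...' prefix used by this change) matches fine
match = re.search(br'(?i)content\-type', data)
print('bytes pattern on bytes:', match.group())
```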
The getheader method has been removed in Python 3, so its replacement
get_content_type is used there.
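A sketch of that API difference, using email.message.Message (the base
class of the headers object that urlopen(...).info() returns on
Python 3) as a stand-in for real response headers:

```python
import sys
from email.message import Message

# Stand-in for the headers object returned by urlopen(...).info()
headers = Message()
headers['Content-Type'] = 'text/html; charset=utf-8'

if sys.version_info[0] > 2:
    # getheader() is gone in Python 3; get_content_type() returns only
    # the lowercased maintype/subtype, dropping parameters like charset
    contentType = headers.get_content_type()
else:
    # Python 2: httplib's HTTPMessage still provides getheader(), which
    # returns the full header value including any parameters
    contentType = headers.getheader('Content-Type')

print(contentType)
```

Note that the two calls are not byte-for-byte equivalent: getheader
would include '; charset=utf-8', while get_content_type yields only
'text/html', which is sufficient for the MIME check in this script.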
Bug: T94688
Change-Id: Icf7b5a02d12cea3ec969fc6a3420ea870ea37f0f
---
M scripts/reflinks.py
1 file changed, 8 insertions(+), 5 deletions(-)
Approvals:
John Vandenberg: Looks good to me, approved
jenkins-bot: Verified
diff --git a/scripts/reflinks.py b/scripts/reflinks.py
index 41f8fc9..009e808 100755
--- a/scripts/reflinks.py
+++ b/scripts/reflinks.py
@@ -454,14 +454,14 @@
             raise
         # Regex to grasp content-type meta HTML tag in HTML source
-        self.META_CONTENT = re.compile(r'(?i)<meta[^>]*content\-type[^>]*>')
+        self.META_CONTENT = re.compile(br'(?i)<meta[^>]*content\-type[^>]*>')
         # Extract the encoding from a charset property (from content-type !)
-        self.CHARSET = re.compile(r'(?i)charset\s*=\s*(?P<enc>[^\'",;>/]*)')
+        self.CHARSET = re.compile(br'(?i)charset\s*=\s*(?P<enc>[^\'",;>/]*)')
         # Extract html title from page
         self.TITLE = re.compile(r'(?is)(?<=<title>).*?(?=</title>)')
         # Matches content inside <script>/<style>/HTML comments
         self.NON_HTML = re.compile(
-            r'(?is)<script[^>]*>.*?</script>|<style[^>]*>.*?</style>|<!--.*?-->|<!\[CDATA\[.*?\]\]>')
+            br'(?is)<script[^>]*>.*?</script>|<style[^>]*>.*?</style>|<!--.*?-->|<!\[CDATA\[.*?\]\]>')
         # Authorized mime types for HTML pages
         self.MIME = re.compile(
@@ -555,7 +555,10 @@
         f = urlopen(ref.url)
         # Try to get Content-Type from server
         headers = f.info()
-        contentType = headers.getheader('Content-Type')
+        if sys.version_info[0] > 2:
+            contentType = headers.get_content_type()
+        else:
+            contentType = headers.getheader('Content-Type')
         if contentType and not self.MIME.search(contentType):
             if ref.link.lower().endswith('.pdf') and \
                     not self.getOption('ignorepdf'):
@@ -645,7 +648,7 @@
         f.close()
         # remove <script>/<style>/comments/CDATA tags
-        linkedpagetext = self.NON_HTML.sub('', linkedpagetext)
+        linkedpagetext = self.NON_HTML.sub(b'', linkedpagetext)
         meta_content = self.META_CONTENT.search(linkedpagetext)
         enc = []
--
To view, visit https://gerrit.wikimedia.org/r/205095
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings
Gerrit-MessageType: merged
Gerrit-Change-Id: Icf7b5a02d12cea3ec969fc6a3420ea870ea37f0f
Gerrit-PatchSet: 2
Gerrit-Project: pywikibot/core
Gerrit-Branch: master
Gerrit-Owner: XZise <CommodoreFabianus(a)gmx.de>
Gerrit-Reviewer: John Vandenberg <jayvdb(a)gmail.com>
Gerrit-Reviewer: Ladsgroup <ladsgroup(a)gmail.com>
Gerrit-Reviewer: Merlijn van Deen <valhallasw(a)arctus.nl>
Gerrit-Reviewer: Rubin <rubin(a)wikimedia.ru>
Gerrit-Reviewer: jenkins-bot <>