jenkins-bot has submitted this change and it was merged.
Change subject: [FIX] reflinks: Expect that urlopen return bytes
[FIX] reflinks: Expect that urlopen return bytes
With Python 3 an error occurred when using the regex, because it requires that the pattern, the replacement and the searched content are either all bytes or all str, but the result from urlopen is bytes.
On Python 2 all strs are actually bytes by default, so it worked fine there. But with unicode_literals in 1e54a7d6 the replacement and the pattern changed into unicode. While the pattern works in Python 2 regardless, the replacement forces the searched content to be converted into unicode. As that happens implicitly, it uses ASCII and causes an error when the content contains non-ASCII characters.
The getheader method has been removed in Python 3, so its replacement is used there.
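The type rule can be shown with a minimal sketch. This is not the reflinks code: the sample markup and the helper names `strip_scripts` and `str_pattern_fails` are hypothetical, and the payload only stands in for what urlopen would return.

```python
import re

# Sample payload standing in for the bytes read from a urlopen response.
page = b'<html><script>var x = 1;</script><title>Example</title></html>'

STR_PATTERN = re.compile(r'(?is)<script[^>]*>.*?</script>')
BYTES_PATTERN = re.compile(br'(?is)<script[^>]*>.*?</script>')


def strip_scripts(data):
    """Remove <script> blocks using a bytes pattern and a bytes replacement."""
    return BYTES_PATTERN.sub(b'', data)


def str_pattern_fails(data):
    """Return True if applying the str pattern to bytes raises TypeError."""
    try:
        STR_PATTERN.sub('', data)
    except TypeError:
        return True
    return False
```

With a bytes pattern and a bytes replacement the substitution works on the raw response. The str pattern is rejected on Python 3 as soon as it is applied to bytes.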
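The implicit conversion described above cannot be reproduced directly on Python 3, but its effect can be sketched by decoding explicitly with ASCII, which is the codec Python 2 uses by default. The helper names and the sample text are hypothetical.

```python
# UTF-8 encoded bytes containing a non-ASCII character, as a page might.
content = 'Caf\u00e9'.encode('utf-8')


def decode_like_python2_implicit(data):
    """Mimic Python 2's implicit bytes-to-unicode coercion, which used ASCII."""
    return data.decode('ascii')


def implicit_decode_fails(data):
    """Return True if the ASCII decode raises UnicodeDecodeError."""
    try:
        decode_like_python2_implicit(data)
    except UnicodeDecodeError:
        return True
    return False
```

Pure ASCII content passes through unchanged, which is why the problem only appears on pages with non-ASCII characters.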
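A small offline sketch of the Python 3 header API, assuming a header object built with the standard `email` module rather than a real response. On Python 3 the object returned by `urlopen(...).info()` is an `email.message.Message` subclass, so the same methods should apply. The sample header value is made up.

```python
import email
from email.message import Message

# Build a header object without any network access.
headers = email.message_from_string(
    'Content-Type: text/html; charset=utf-8\n\n')

mime_type = headers.get_content_type()    # MIME type only, parameters dropped
raw_value = headers.get('Content-Type')   # full header value, parameters kept
charset = headers.get_content_charset()   # charset parameter, if present

# A message without a Content-Type header reports the default type.
default_type = Message().get_content_type()
```

Note that `get_content_type()` is not an exact equivalent of `getheader('Content-Type')`: it returns only the MIME type and falls back to `text/plain` when the header is missing, so a truthiness check on its result always passes.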
Bug: T94688
Change-Id: Icf7b5a02d12cea3ec969fc6a3420ea870ea37f0f
---
M scripts/reflinks.py
1 file changed, 8 insertions(+), 5 deletions(-)

Approvals:
  John Vandenberg: Looks good to me, approved
  jenkins-bot: Verified
diff --git a/scripts/reflinks.py b/scripts/reflinks.py
index 41f8fc9..009e808 100755
--- a/scripts/reflinks.py
+++ b/scripts/reflinks.py
@@ -454,14 +454,14 @@
                 raise

         # Regex to grasp content-type meta HTML tag in HTML source
-        self.META_CONTENT = re.compile(r'(?i)<meta[^>]*content-type[^>]*>')
+        self.META_CONTENT = re.compile(br'(?i)<meta[^>]*content-type[^>]*>')
         # Extract the encoding from a charset property (from content-type !)
-        self.CHARSET = re.compile(r'(?i)charset\s*=\s*(?P<enc>[^\'",;>/]*)')
+        self.CHARSET = re.compile(br'(?i)charset\s*=\s*(?P<enc>[^\'",;>/]*)')
         # Extract html title from page
         self.TITLE = re.compile(r'(?is)(?<=<title>).*?(?=</title>)')
         # Matches content inside <script>/<style>/HTML comments
         self.NON_HTML = re.compile(
-            r'(?is)<script[^>]*>.*?</script>|<style[^>]*>.*?</style>|<!--.*?-->|<!\[CDATA\[.*?\]\]>')
+            br'(?is)<script[^>]*>.*?</script>|<style[^>]*>.*?</style>|<!--.*?-->|<!\[CDATA\[.*?\]\]>')

         # Authorized mime types for HTML pages
         self.MIME = re.compile(
@@ -555,7 +555,10 @@
                     f = urlopen(ref.url)
                     # Try to get Content-Type from server
                     headers = f.info()
-                    contentType = headers.getheader('Content-Type')
+                    if sys.version_info[0] > 2:
+                        contentType = headers.get_content_type()
+                    else:
+                        contentType = headers.getheader('Content-Type')
                     if contentType and not self.MIME.search(contentType):
                         if ref.link.lower().endswith('.pdf') and \
                            not self.getOption('ignorepdf'):
@@ -645,7 +648,7 @@
                     f.close()

                 # remove <script>/<style>/comments/CDATA tags
-                linkedpagetext = self.NON_HTML.sub('', linkedpagetext)
+                linkedpagetext = self.NON_HTML.sub(b'', linkedpagetext)

                 meta_content = self.META_CONTENT.search(linkedpagetext)
                 enc = []
pywikibot-commits@lists.wikimedia.org