[Gerrit] [FIX] reflinks: Expect that urlopen return bytes - change (pywikibot/core) - Pywikibot-commits

20 Apr 2015

jenkins-bot has submitted this change and it was merged.
Change subject: [FIX] reflinks: Expect that urlopen return bytes
......................................................................
[FIX] reflinks: Expect that urlopen return bytes
With Python 3 an error occurred when using the regex, because re requires
that the pattern, the replacement and the searched content are either all
bytes or all str, and the result from urlopen is bytes.
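
The type rule can be reproduced in isolation. The following is a minimal
sketch for Python 3, not code from reflinks.py: `page` is a made-up value
standing in for what urlopen returns, `strip_scripts` is a hypothetical
helper, and the pattern is a shortened form of NON_HTML.

```python
import re

# Hypothetical stand-in for the bytes that urlopen(...).read() returns.
page = b'<html><script>var x = 1;</script><title>Caf\xc3\xa9</title></html>'

str_pattern = re.compile(r'(?is)<script[^>]*>.*?</script>')
bytes_pattern = re.compile(br'(?is)<script[^>]*>.*?</script>')


def strip_scripts(pattern, replacement, text):
    """Return the substituted text, or the TypeError message on a type mix."""
    try:
        return pattern.sub(replacement, text)
    except TypeError as error:
        return 'TypeError: {}'.format(error)


# str pattern on bytes content: rejected by Python 3.
mixed_pattern = strip_scripts(str_pattern, '', page)
# bytes pattern but str replacement: also rejected.
mixed_replacement = strip_scripts(bytes_pattern, '', page)
# Pattern, replacement and content are all bytes: accepted.
all_bytes = strip_scripts(bytes_pattern, b'', page)
```

Only the last call returns the page with the script element removed; the
other two return the TypeError text.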
On Python 2 all strs are actually bytes by default, so it worked fine
there. But with unicode_literals introduced in 1e54a7d6, the replacement
and pattern became unicode. The pattern works in Python 2 regardless, but
the replacement forces the searched content to be converted into unicode.
As that happens implicitly, it uses ASCII and causes an error when the
content contains non-ASCII characters.
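
Python 2's implicit conversion cannot be run on Python 3, but its effect
matches an explicit ASCII decode. A small sketch under that assumption,
with a hypothetical `content` value and helper name:

```python
# Hypothetical page content as bytes, containing a UTF-8 encoded 'é'.
content = b'<title>Caf\xc3\xa9</title>'


def decode_like_python2_implicit(data):
    """Mimic Python 2's implicit bytes-to-unicode step, which uses ASCII."""
    try:
        return data.decode('ascii')
    except UnicodeDecodeError as error:
        return 'UnicodeDecodeError at byte offset {}'.format(error.start)


# Pure ASCII content survives the conversion.
ascii_only = decode_like_python2_implicit(b'<title>Cafe</title>')
# The first non-ASCII byte (offset 10 here) makes the conversion fail.
non_ascii = decode_like_python2_implicit(content)
```

This is why the failure only showed up on pages with non-ASCII characters.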
The getheader method has been removed in Python 3, so the script uses its
replacement, get_content_type, there.
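
For illustration, the Python 3 header object can be built by hand instead
of fetched. This sketch assumes a manually constructed
`http.client.HTTPMessage` as a stand-in for `urlopen(ref.url).info()`; the
header value is made up.

```python
from http.client import HTTPMessage

# Stand-in for urlopen(url).info(); no network access is needed here.
headers = HTTPMessage()
headers['Content-Type'] = 'text/html; charset=utf-8'

# Python 3 has no getheader() on this object.
has_getheader = hasattr(headers, 'getheader')

# get() returns the raw header value, parameters included.
raw_value = headers.get('Content-Type')

# get_content_type() returns only the lower-cased MIME type.
mime_type = headers.get_content_type()

# Without a Content-Type header it falls back to 'text/plain', not None.
missing = HTTPMessage().get_content_type()
```

One difference worth keeping in mind: getheader in Python 2 returns None
for a missing header, while get_content_type always returns a string, so a
truthiness check on its result never fails in Python 3.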
Bug: T94688
Change-Id: Icf7b5a02d12cea3ec969fc6a3420ea870ea37f0f
---
M scripts/reflinks.py
1 file changed, 8 insertions(+), 5 deletions(-)
Approvals:
  John Vandenberg: Looks good to me, approved
  jenkins-bot: Verified

diff --git a/scripts/reflinks.py b/scripts/reflinks.py
index 41f8fc9..009e808 100755
--- a/scripts/reflinks.py
+++ b/scripts/reflinks.py
@@ -454,14 +454,14 @@
             raise
         # Regex to grasp content-type meta HTML tag in HTML source
-        self.META_CONTENT = re.compile(r'(?i)<meta[^>]*content-type[^>]*>')
+        self.META_CONTENT = re.compile(br'(?i)<meta[^>]*content-type[^>]*>')
         # Extract the encoding from a charset property (from content-type !)
-        self.CHARSET = re.compile(r'(?i)charset\s*=\s*(?P<enc>[^\'",;>/]*)')
+        self.CHARSET = re.compile(br'(?i)charset\s*=\s*(?P<enc>[^\'",;>/]*)')
         # Extract html title from page
         self.TITLE = re.compile(r'(?is)(?<=<title>).*?(?=</title>)')
         # Matches content inside <script>/<style>/HTML comments
         self.NON_HTML = re.compile(
-            r'(?is)<script[^>]*>.*?</script>|<style[^>]*>.*?</style>|<!--.*?-->|<!\[CDATA\[.*?\]\]>')
+            br'(?is)<script[^>]*>.*?</script>|<style[^>]*>.*?</style>|<!--.*?-->|<!\[CDATA\[.*?\]\]>')
         # Authorized mime types for HTML pages
         self.MIME = re.compile(
@@ -555,7 +555,10 @@
                         f = urlopen(ref.url)
                     # Try to get Content-Type from server
                     headers = f.info()
-                    contentType = headers.getheader('Content-Type')
+                    if sys.version_info[0] > 2:
+                        contentType = headers.get_content_type()
+                    else:
+                        contentType = headers.getheader('Content-Type')
                     if contentType and not self.MIME.search(contentType):
                         if ref.link.lower().endswith('.pdf') and \
                            not self.getOption('ignorepdf'):
@@ -645,7 +648,7 @@
                         f.close()
                 # remove <script>/<style>/comments/CDATA tags
-                linkedpagetext = self.NON_HTML.sub('', linkedpagetext)
+                linkedpagetext = self.NON_HTML.sub(b'', linkedpagetext)
                 meta_content = self.META_CONTENT.search(linkedpagetext)
                 enc = []
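
How the bytes patterns fit together can be sketched outside the script.
The following is an assumption-laden illustration, not the code of
reflinks.py: `decode_page`, its `fallback` parameter and the sample `raw`
page are hypothetical, and only the three patterns are taken from the diff.

```python
import re

META_CONTENT = re.compile(br'(?i)<meta[^>]*content-type[^>]*>')
CHARSET = re.compile(br'(?i)charset\s*=\s*(?P<enc>[^\'",;>/]*)')
NON_HTML = re.compile(
    br'(?is)<script[^>]*>.*?</script>|<style[^>]*>.*?</style>'
    br'|<!--.*?-->|<!\[CDATA\[.*?\]\]>')


def decode_page(raw, fallback='utf-8'):
    """Strip non-HTML parts from raw bytes, then decode with the declared charset."""
    # Both the replacement and the content are bytes, matching the pattern.
    cleaned = NON_HTML.sub(b'', raw)
    encoding = fallback
    meta = META_CONTENT.search(cleaned)
    if meta:
        charset = CHARSET.search(meta.group())
        if charset:
            # The captured group is bytes; decode() needs the codec name as str.
            encoding = charset.group('enc').decode('ascii')
    # Unknown or empty charset names are not handled in this sketch.
    return cleaned.decode(encoding, 'replace')


# Hypothetical page served as ISO-8859-1, so 'é' is the single byte 0xE9.
raw = (b'<html><head><meta http-equiv="Content-Type" '
       b'content="text/html; charset=iso-8859-1">'
       b'<script>var x = 1;</script><title>Caf\xe9</title></head></html>')
text = decode_page(raw)
```

Keeping everything as bytes until the charset is known avoids decoding the
page with the wrong codec first.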
-- 
To view, visit https://gerrit.wikimedia.org/r/205095

Gerrit-MessageType: merged
Gerrit-Change-Id: Icf7b5a02d12cea3ec969fc6a3420ea870ea37f0f
Gerrit-PatchSet: 2
Gerrit-Project: pywikibot/core
Gerrit-Branch: master
Gerrit-Owner: XZise CommodoreFabianus@gmx.de
Gerrit-Reviewer: John Vandenberg jayvdb@gmail.com
Gerrit-Reviewer: Ladsgroup ladsgroup@gmail.com
Gerrit-Reviewer: Merlijn van Deen valhallasw@arctus.nl
Gerrit-Reviewer: Rubin rubin@wikimedia.ru
Gerrit-Reviewer: jenkins-bot <>