jenkins-bot has submitted this change and it was merged.
Change subject: [FIX] reflinks: Expect that urlopen return bytes
......................................................................
[FIX] reflinks: Expect that urlopen return bytes
With Python 3 an error occurred on using the regex, because re requires
that the pattern, the replacement and the searched string are either all
bytes or all str, while the result from urlopen is bytes.
On Python 2 all strs are actually bytes by default, so it worked fine
there. But with unicode_literals (introduced in 1e54a7d6) the pattern
and the replacement became unicode. While a unicode pattern still works
in Python 2, a unicode replacement forces the searched content to be
converted to unicode. As that conversion happens implicitly, it uses
ASCII and raises a UnicodeDecodeError when the content contains
non-ASCII characters.
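As a minimal illustration of the Python 3 failure mode (sample data made
up for this sketch; the real content comes from urlopen):

```python
import re

# urlopen(...).read() returns bytes on Python 3
data = b'<meta http-equiv="Content-Type" content="text/html">'

# A str (unicode) pattern cannot be applied to bytes; re raises TypeError
try:
    re.search(r'(?i)content\-type', data)
except TypeError as exc:
    print('str pattern on bytes:', exc)

# A bytes pattern (the br'...' prefix used by this change) matches fine
match = re.search(br'(?i)content\-type', data)
print('bytes pattern on bytes:', match.group())
```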
The getheader method has been removed in Python 3, so its replacement
get_content_type is used there.
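A sketch of that API difference, using email.message.Message (the base
class of the headers object that urlopen(...).info() returns on
Python 3) as a stand-in for real response headers:

```python
import sys
from email.message import Message

# Stand-in for the headers object returned by urlopen(...).info()
headers = Message()
headers['Content-Type'] = 'text/html; charset=utf-8'

if sys.version_info[0] > 2:
    # getheader() is gone in Python 3; get_content_type() returns only
    # the lowercased maintype/subtype, dropping parameters like charset
    contentType = headers.get_content_type()
else:
    # Python 2: httplib's HTTPMessage still provides getheader(), which
    # returns the full header value including any parameters
    contentType = headers.getheader('Content-Type')

print(contentType)
```

Note that the two calls are not byte-for-byte equivalent: getheader
would include '; charset=utf-8', while get_content_type yields only
'text/html', which is sufficient for the MIME check in this script.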
Bug: T94688
Change-Id: Icf7b5a02d12cea3ec969fc6a3420ea870ea37f0f
---
M scripts/reflinks.py
1 file changed, 8 insertions(+), 5 deletions(-)
Approvals:
John Vandenberg: Looks good to me, approved
jenkins-bot: Verified
diff --git a/scripts/reflinks.py b/scripts/reflinks.py
index 41f8fc9..009e808 100755
--- a/scripts/reflinks.py
+++ b/scripts/reflinks.py
@@ -454,14 +454,14 @@
             raise
         # Regex to grasp content-type meta HTML tag in HTML source
-        self.META_CONTENT = re.compile(r'(?i)<meta[^>]*content\-type[^>]*>')
+        self.META_CONTENT = re.compile(br'(?i)<meta[^>]*content\-type[^>]*>')
         # Extract the encoding from a charset property (from content-type !)
-        self.CHARSET = re.compile(r'(?i)charset\s*=\s*(?P<enc>[^\'",;>/]*)')
+        self.CHARSET = re.compile(br'(?i)charset\s*=\s*(?P<enc>[^\'",;>/]*)')
         # Extract html title from page
         self.TITLE = re.compile(r'(?is)(?<=<title>).*?(?=</title>)')
         # Matches content inside <script>/<style>/HTML comments
         self.NON_HTML = re.compile(
-            r'(?is)<script[^>]*>.*?</script>|<style[^>]*>.*?</style>|<!--.*?-->|<!\[CDATA\[.*?\]\]>')
+            br'(?is)<script[^>]*>.*?</script>|<style[^>]*>.*?</style>|<!--.*?-->|<!\[CDATA\[.*?\]\]>')
         # Authorized mime types for HTML pages
         self.MIME = re.compile(
@@ -555,7 +555,10 @@
         f = urlopen(ref.url)
         # Try to get Content-Type from server
         headers = f.info()
-        contentType = headers.getheader('Content-Type')
+        if sys.version_info[0] > 2:
+            contentType = headers.get_content_type()
+        else:
+            contentType = headers.getheader('Content-Type')
         if contentType and not self.MIME.search(contentType):
             if ref.link.lower().endswith('.pdf') and \
                     not self.getOption('ignorepdf'):
@@ -645,7 +648,7 @@
         f.close()
         # remove <script>/<style>/comments/CDATA tags
-        linkedpagetext = self.NON_HTML.sub('', linkedpagetext)
+        linkedpagetext = self.NON_HTML.sub(b'', linkedpagetext)
         meta_content = self.META_CONTENT.search(linkedpagetext)
         enc = []
--
To view, visit https://gerrit.wikimedia.org/r/205095
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings
Gerrit-MessageType: merged
Gerrit-Change-Id: Icf7b5a02d12cea3ec969fc6a3420ea870ea37f0f
Gerrit-PatchSet: 2
Gerrit-Project: pywikibot/core
Gerrit-Branch: master
Gerrit-Owner: XZise <CommodoreFabianus(a)gmx.de>
Gerrit-Reviewer: John Vandenberg <jayvdb(a)gmail.com>
Gerrit-Reviewer: Ladsgroup <ladsgroup(a)gmail.com>
Gerrit-Reviewer: Merlijn van Deen <valhallasw(a)arctus.nl>
Gerrit-Reviewer: Rubin <rubin(a)wikimedia.ru>
Gerrit-Reviewer: jenkins-bot <>