[Pywikibot-commits] [Gerrit] Decoding text: catch exception - change (pywikibot/core)

20 Aug 2014

jenkins-bot has submitted this change and it was merged.
Change subject: Decoding text: catch exception
......................................................................
Decoding text: catch exception
Wrapped the decode instruction derived from
change Ia2051a2a80851b15b1a04a135763291bd633d4e3
in a "try: except:" block, as suggested in comment 9 of bug 67410.
Also: added a comma to the characters excluded by the "self.CHARSET"
regex, in case "contentType" contains a list of values from the HTML meta tag.
Change-Id: I3af86d3386ea919001287fe1c057932c16537eb4
---
M scripts/reflinks.py
1 file changed, 6 insertions(+), 2 deletions(-)
Approvals:
  John Vandenberg: Looks good to me, approved
  jenkins-bot: Verified

diff --git a/scripts/reflinks.py b/scripts/reflinks.py
index 65dcb5b..a7f50f0 100644
--- a/scripts/reflinks.py
+++ b/scripts/reflinks.py
@@ -434,7 +434,7 @@
         # Regex to grasp content-type meta HTML tag in HTML source
         self.META_CONTENT = re.compile(r'(?i)<meta[^>]*content-type[^>]*>')
         # Extract the encoding from a charset property (from content-type !)
-        self.CHARSET = re.compile(r'(?i)charset\s*=\s*(?P<enc>[^\'";>/]*)')
+        self.CHARSET = re.compile(r'(?i)charset\s*=\s*(?P<enc>[^\'",;>/]*)')
         # Extract html title from page
         self.TITLE = re.compile(r'(?is)(?<=<title>).*?(?=</title>)')
         # Matches content inside <script>/<style>/HTML comments
@@ -683,7 +683,11 @@
                 if 'utf-8' not in enc:
                     enc.append('utf-8')
-                u = linkedpagetext.decode(enc[0])   # Bug 67410
+                try:
+                    u = linkedpagetext.decode(enc[0])   # Bug 67410
+                except (UnicodeDecodeError, LookupError) as e:
+                    pywikibot.output(u'%s : Decoding error - %s' % (ref.link, e))
+                    continue
                 # Retrieves the first non empty string inside <title> tags
                 for m in self.TITLE.finditer(u):
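
For illustration only, here is a minimal standalone sketch of the two behaviours this change touches: the effect of adding the comma to the CHARSET character class, and guarding a decode call against UnicodeDecodeError and LookupError. The helper names extract_encoding and safe_decode, the sample header value, the use of print in place of pywikibot.output, and returning None in place of "continue" are all assumptions made for this sketch; they are not part of scripts/reflinks.py.

```python
import re

# The two patterns from the diff, with the single quote escaped so that the
# literals are valid Python.
OLD_CHARSET = re.compile(r'(?i)charset\s*=\s*(?P<enc>[^\'";>/]*)')
NEW_CHARSET = re.compile(r'(?i)charset\s*=\s*(?P<enc>[^\'",;>/]*)')


def extract_encoding(pattern, content_type):
    """Return the charset captured by pattern, or None (hypothetical helper)."""
    match = pattern.search(content_type)
    return match.group('enc') if match else None


def safe_decode(raw, encoding):
    """Decode raw bytes, returning None on failure (hypothetical helper).

    UnicodeDecodeError covers bytes that are invalid in the given encoding;
    LookupError covers encoding names that Python does not know.
    """
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError) as e:
        # The script logs via pywikibot.output and skips the reference;
        # this sketch prints and returns None instead.
        print('Decoding error - %s' % e)
        return None


# Made-up content-type value holding a comma-separated list.
header = 'text/html; charset=utf-8, text/html'

old_enc = extract_encoding(OLD_CHARSET, header)  # capture runs past the comma
new_enc = extract_encoding(NEW_CHARSET, header)  # capture stops at the comma

print(repr(old_enc), repr(new_enc))
print(safe_decode(b'caf\xc3\xa9', new_enc))  # valid UTF-8 bytes
print(safe_decode(b'\xff', 'utf-8'))         # invalid byte sequence
print(safe_decode(b'abc', old_enc))          # unknown encoding name
```

With the old pattern the captured name includes the text after the comma, which is not a valid codec name, so the decode raises LookupError; this is why catching both exception types is useful alongside the regex fix.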
-- 
To view, visit https://gerrit.wikimedia.org/r/155226
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I3af86d3386ea919001287fe1c057932c16537eb4
Gerrit-PatchSet: 2
Gerrit-Project: pywikibot/core
Gerrit-Branch: master
Gerrit-Owner: Beta16 l.rabinelli@gmail.com
Gerrit-Reviewer: John Vandenberg jayvdb@gmail.com
Gerrit-Reviewer: Ladsgroup ladsgroup@gmail.com
Gerrit-Reviewer: Merlijn van Deen valhallasw@arctus.nl
Gerrit-Reviewer: jenkins-bot <>


    
