jenkins-bot has submitted this change and it was merged.
Change subject: Decoding text: catch exception
Decoding text: catch exception
Wrapped the decode instruction derived from change Ia2051a2a80851b15b1a04a135763291bd633d4e3 in a "try: except:" block, as suggested in comment 9 of bug 67410
Also added a comma to the "self.CHARSET" regex character class, for the case where "contentType" contains a list of values taken from the HTML meta tag.
Change-Id: I3af86d3386ea919001287fe1c057932c16537eb4
---
M scripts/reflinks.py
1 file changed, 6 insertions(+), 2 deletions(-)
Approvals:
  John Vandenberg: Looks good to me, approved
  jenkins-bot: Verified
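The effect of the extra comma in the character class can be illustrated with a small standalone sketch. The two patterns follow the diff, with the single quote inside the character class written as \' so that the raw string parses; the sample "content_type" value is a made-up assumption, not taken from the bug report.

```python
import re

# Patterns as in the diff (old and new); the quote is escaped so the literal parses.
OLD_CHARSET = re.compile(r'(?i)charset\s*=\s*(?P<enc>[^\'";>/]*)')
NEW_CHARSET = re.compile(r'(?i)charset\s*=\s*(?P<enc>[^\'",;>/]*)')

# Hypothetical contentType value that carries a comma-separated list.
content_type = 'text/html; charset=utf-8, text/html'

old_enc = OLD_CHARSET.search(content_type).group('enc')
new_enc = NEW_CHARSET.search(content_type).group('enc')

print(repr(old_enc))  # the old pattern runs past the comma
print(repr(new_enc))  # the new pattern stops at the comma
```

With the old pattern the captured name is not a valid codec name, which is one way the later decode call could fail with a LookupError.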
diff --git a/scripts/reflinks.py b/scripts/reflinks.py
index 65dcb5b..a7f50f0 100644
--- a/scripts/reflinks.py
+++ b/scripts/reflinks.py
@@ -434,7 +434,7 @@
         # Regex to grasp content-type meta HTML tag in HTML source
         self.META_CONTENT = re.compile(r'(?i)<meta[^>]*content-type[^>]*>')
         # Extract the encoding from a charset property (from content-type !)
-        self.CHARSET = re.compile(r'(?i)charset\s*=\s*(?P<enc>[^\'";>/]*)')
+        self.CHARSET = re.compile(r'(?i)charset\s*=\s*(?P<enc>[^\'",;>/]*)')
         # Extract html title from page
         self.TITLE = re.compile(r'(?is)(?<=<title>).*?(?=</title>)')
         # Matches content inside <script>/<style>/HTML comments
@@ -683,7 +683,11 @@
 
                 if 'utf-8' not in enc:
                     enc.append('utf-8')
-                u = linkedpagetext.decode(enc[0])   # Bug 67410
+                try:
+                    u = linkedpagetext.decode(enc[0])   # Bug 67410
+                except (UnicodeDecodeError, LookupError) as e:
+                    pywikibot.output(u'%s : Decoding error - %s' % (ref.link, e))
+                    continue
 
                 # Retrieves the first non empty string inside <title> tags
                 for m in self.TITLE.finditer(u):
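
As a rough illustration of why the new except clause names both exception types, here is a minimal standalone sketch. It assumes Python 3 byte strings, whereas the script at the time of this change targeted Python 2, and "decode_or_none" is a hypothetical helper that stands in for the "continue" in the loop; it is not part of reflinks.py.

```python
def decode_or_none(raw, encodings):
    """Hypothetical helper: decode raw bytes with the first candidate encoding.

    Returns None where the loop in the diff would 'continue'.
    """
    try:
        return raw.decode(encodings[0])
    except (UnicodeDecodeError, LookupError) as e:
        # UnicodeDecodeError: the bytes are not valid in that encoding.
        # LookupError: the encoding name is not a known codec.
        print('Decoding error - %s' % e)
        return None


ok = decode_or_none(b'caf\xc3\xa9', ['utf-8'])        # valid UTF-8
bad_bytes = decode_or_none(b'caf\xe9', ['utf-8'])     # Latin-1 bytes, invalid as UTF-8
bad_name = decode_or_none(b'abc', ['no-such-codec'])  # unknown codec name
```

Catching only UnicodeDecodeError would still let an unknown or malformed charset name abort the run, since codec lookup failures are reported as LookupError.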