jenkins-bot has submitted this change and it was merged.
Change subject: Decoding text: catch exception
......................................................................
Decoding text: catch exception
Wrapped the decode instruction from
change Ia2051a2a80851b15b1a04a135763291bd633d4e3
in a "try: except:" block, as suggested in comment 9 of bug 67410.
Also added a comma to the excluded characters in the "self.CHARSET"
regex, for the case where "contentType" contains a list of values
taken from the HTML meta tag.
Change-Id: I3af86d3386ea919001287fe1c057932c16537eb4
---
M scripts/reflinks.py
1 file changed, 6 insertions(+), 2 deletions(-)
Approvals:
John Vandenberg: Looks good to me, approved
jenkins-bot: Verified
diff --git a/scripts/reflinks.py b/scripts/reflinks.py
index 65dcb5b..a7f50f0 100644
--- a/scripts/reflinks.py
+++ b/scripts/reflinks.py
@@ -434,7 +434,7 @@
# Regex to grasp content-type meta HTML tag in HTML source
self.META_CONTENT = re.compile(r'(?i)<meta[^>]*content\-type[^>]*>')
# Extract the encoding from a charset property (from content-type !)
- self.CHARSET = re.compile(r'(?i)charset\s*=\s*(?P<enc>[^\'";>/]*)')
+ self.CHARSET = re.compile(r'(?i)charset\s*=\s*(?P<enc>[^\'",;>/]*)')
# Extract html title from page
self.TITLE = re.compile(r'(?is)(?<=<title>).*?(?=</title>)')
# Matches content inside <script>/<style>/HTML comments
@@ -683,7 +683,11 @@
if 'utf-8' not in enc:
    enc.append('utf-8')
- u = linkedpagetext.decode(enc[0]) # Bug 67410
+ try:
+     u = linkedpagetext.decode(enc[0]) # Bug 67410
+ except (UnicodeDecodeError, LookupError) as e:
+     pywikibot.output(u'%s : Decoding error - %s' % (ref.link, e))
+     continue
# Retrieves the first non empty string inside <title> tags
for m in self.TITLE.finditer(u):
--
To view, visit
https://gerrit.wikimedia.org/r/155226
Gerrit-MessageType: merged
Gerrit-Change-Id: I3af86d3386ea919001287fe1c057932c16537eb4
Gerrit-PatchSet: 2
Gerrit-Project: pywikibot/core
Gerrit-Branch: master
Gerrit-Owner: Beta16 <l.rabinelli(a)gmail.com>
Gerrit-Reviewer: John Vandenberg <jayvdb(a)gmail.com>
Gerrit-Reviewer: Ladsgroup <ladsgroup(a)gmail.com>
Gerrit-Reviewer: Merlijn van Deen <valhallasw(a)arctus.nl>
Gerrit-Reviewer: jenkins-bot <>
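
A minimal standalone sketch of the two changes in this patch, runnable outside pywikibot. The two patterns are copied from the diff. The helper names (extract_encoding, decode_or_none) and the sample header value are hypothetical and exist only for illustration. The sketch uses print() where the patch calls pywikibot.output(), and returns None where the patch uses "continue" inside its loop.

```python
import re

# Patterns copied from the diff: before and after the change.
CHARSET_OLD = re.compile(r'(?i)charset\s*=\s*(?P<enc>[^\'";>/]*)')
CHARSET_NEW = re.compile(r'(?i)charset\s*=\s*(?P<enc>[^\'",;>/]*)')


def extract_encoding(pattern, content_type):
    """Return the charset named in content_type, or None if absent."""
    match = pattern.search(content_type)
    return match.group('enc') if match else None


def decode_or_none(raw, encoding):
    """Decode raw bytes, returning None instead of raising.

    UnicodeDecodeError covers bytes that are invalid in the encoding;
    LookupError covers an encoding name that Python does not know.
    """
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError) as error:
        print('Decoding error - %s' % error)
        return None


# Hypothetical content-type value holding a comma-separated list.
header = 'text/html; charset=utf-8, text/html'

old_enc = extract_encoding(CHARSET_OLD, header)  # comma is not excluded
new_enc = extract_encoding(CHARSET_NEW, header)  # match stops at the comma

page = b'caf\xc3\xa9'
print(repr(old_enc), decode_or_none(page, old_enc))
print(repr(new_enc), decode_or_none(page, new_enc))
```

With the old pattern the captured name runs past the comma, so it is not a valid codec name and decoding raises LookupError. With the new pattern only the first charset is captured. Catching both exception types lets the caller skip a page whose encoding is unknown or whose bytes do not match the declared encoding.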