jenkins-bot has submitted this change and it was merged. (
https://gerrit.wikimedia.org/r/517756 )
Change subject: proofreadpage_tests.py: fix failing OCR-tests
......................................................................
proofreadpage_tests.py: fix failing OCR-tests
Text returned by OCR agorithm seems not determistic (e.g. one
possible reason might be that OCR tools used by such services
are updated and have better performances).
Test for text similarity instead of perfect text equality.
This is acceptabe as purpose of the test is to test the call to
OCR tools API, rather than the OCR process itself.
Bug: T225595
Change-Id: I500ddc6e76e0b5d4032b0f7a79e1632a9b5b3b24
---
M tests/proofreadpage_tests.py
1 file changed, 13 insertions(+), 8 deletions(-)
Approvals:
Dalba: Looks good to me, approved
jenkins-bot: Verified
diff --git a/tests/proofreadpage_tests.py b/tests/proofreadpage_tests.py
index efa1518..282b6a7 100644
--- a/tests/proofreadpage_tests.py
+++ b/tests/proofreadpage_tests.py
@@ -7,6 +7,7 @@
#
from __future__ import absolute_import, division, unicode_literals
+import difflib
import json
import pywikibot
@@ -335,10 +336,10 @@
'year 1872,\nBY D. APPLETON & CO.,\nIn the Office
'
'of the Librarian of Congress, at '
'Washington.\n\n'),
- 'ocr': (False, 'lam-mam, according to Act of Congress, in the
'
- 'year 157-2,\nBY D. APPLEION Av CO.,\nIn the '
- 'Of\ufb01ce or the Librarian of '
- 'Congress, at Washington.\n\n'),
+ 'ocr': (False, 'EsTEnen, according to Act of Congress, in the
'
+ 'year 1872,\nBy D. APPLETON & CO.,\nIn the '
+ 'Office of the Librarian of Congress, at '
+ 'Washington.\n\u000c'),
'googleOCR': (False, 'ENTERED, according to Act of Congress, in
'
'the year 1572,\nBY D. APPLETON & CO.\n'
'In the Office of the Librarian of '
@@ -363,7 +364,8 @@
self.skipTest(text)
ref_error, ref_text = self.data['hocr']
self.assertEqual(error, ref_error)
- self.assertEqual(text, ref_text)
+ s = difflib.SequenceMatcher(None, text, ref_text)
+ self.assertGreater(s.ratio(), 0.9)
def test_do_ocr_phetools(self):
"""Test
page._do_ocr(ocr_tool='phetools')."""
@@ -372,7 +374,8 @@
if error:
self.skipTest(text)
self.assertEqual(error, ref_error)
- self.assertEqual(text, ref_text)
+ s = difflib.SequenceMatcher(None, text, ref_text)
+ self.assertGreater(s.ratio(), 0.9)
def test_do_ocr_googleocr(self):
"""Test
page._do_ocr(ocr_tool='googleOCR')."""
@@ -381,7 +384,8 @@
self.skipTest(text)
ref_error, ref_text = self.data['googleOCR']
self.assertEqual(error, ref_error)
- self.assertEqual(text, ref_text)
+ s = difflib.SequenceMatcher(None, text, ref_text)
+ self.assertGreater(s.ratio(), 0.9)
def test_ocr_googleocr(self):
"""Test page.ocr(ocr_tool='googleOCR')."""
@@ -391,7 +395,8 @@
self.assertIsInstance(exc, ValueError)
else:
ref_error, ref_text = self.data['googleOCR']
- self.assertEqual(text, ref_text)
+ s = difflib.SequenceMatcher(None, text, ref_text)
+ self.assertGreater(s.ratio(), 0.9)
@require_modules('bs4')
--
To view, visit
https://gerrit.wikimedia.org/r/517756
To unsubscribe, or for help writing mail filters, visit
https://gerrit.wikimedia.org/r/settings
Gerrit-Project: pywikibot/core
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I500ddc6e76e0b5d4032b0f7a79e1632a9b5b3b24
Gerrit-Change-Number: 517756
Gerrit-PatchSet: 2
Gerrit-Owner: Mpaa <mpaa.wiki(a)gmail.com>
Gerrit-Reviewer: Dalba <dalba.wiki(a)gmail.com>
Gerrit-Reviewer: John Vandenberg <jayvdb(a)gmail.com>
Gerrit-Reviewer: jenkins-bot (75)
Gerrit-CC: Xqt <info(a)gno.de>