jenkins-bot has submitted this change and it was merged.
Change subject: harvest_template.py perf: don't parse and discard
......................................................................
harvest_template.py perf: don't parse and discard
Currently the wikipage is parsed even when all of the parsed values
will be discarded. This is slow and noisy.
When an item has claims for all properties to be processed,
skip the item before fetching and parsing the wikitext.
Improve existing claim merge TODO to properly describe the complexity.
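The skip condition described above can be sketched outside pywikibot as a plain subset test. This is a minimal illustration, not the script's actual code path: the helper name has_all_claims and the sample dictionaries are hypothetical, with fields standing in for the template-parameter-to-property mapping and claims for the item's existing claims keyed by property ID.

```python
def has_all_claims(fields, claims):
    """Return True when every target property already has a claim.

    fields: hypothetical mapping of template parameter -> property ID.
    claims: hypothetical mapping of property ID -> list of existing claims.
    """
    # The item can be skipped when the set of wanted properties is a
    # subset of the properties that already carry at least one claim.
    return set(fields.values()) <= set(claims.keys())


# Illustrative data only; the property IDs are placeholders.
fields = {'population': 'P1082', 'country': 'P17'}
claims_complete = {'P1082': ['claim'], 'P17': ['claim'], 'P31': ['claim']}
claims_partial = {'P17': ['claim']}

for label, claims in (('complete', claims_complete),
                      ('partial', claims_partial)):
    if has_all_claims(fields, claims):
        print('%s: all properties have claims. Skipping' % label)
    else:
        print('%s: fetch and parse the wikitext' % label)
```

Note that an empty fields mapping makes the subset test true, so under this sketch an item would be skipped whenever no properties are requested.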
Change-Id: Ib3df4cce7125591e8fa585d18988e2a605456a09
---
M scripts/harvest_template.py
1 file changed, 6 insertions(+), 2 deletions(-)
Approvals:
Xqt: Looks good to me, approved
jenkins-bot: Verified
diff --git a/scripts/harvest_template.py b/scripts/harvest_template.py
index d85084e..65ec3b3 100755
--- a/scripts/harvest_template.py
+++ b/scripts/harvest_template.py
@@ -94,6 +94,9 @@
if not item.exists():
pywikibot.output('%s doesn\'t have a wikidata item :(' % page)
#TODO FIXME: We should provide an option to create the page
+ item.get()
+ if set(self.fields.values()) <= set(item.claims.keys()):
+ pywikibot.output('%s item %s has claims for all properties. Skipping' % (page, item.title()))
else:
pagetext = page.get()
templates = pywikibot.extract_templates_and_params(pagetext)
@@ -121,8 +124,9 @@
pywikibot.output(
u'A claim for %s already exists. Skipping'
% claim.getID())
- # TODO FIXME: This is a very crude way of dupe
- # checking
+ # TODO: Implement smarter approach to merging
+ # harvested values with existing claims esp.
+ # without overwriting humans unintentionally.
else:
if claim.getType() == 'wikibase-item':
# Try to extract a valid page
--
To view, visit
https://gerrit.wikimedia.org/r/135314
To unsubscribe, visit
https://gerrit.wikimedia.org/r/settings
Gerrit-MessageType: merged
Gerrit-Change-Id: Ib3df4cce7125591e8fa585d18988e2a605456a09
Gerrit-PatchSet: 3
Gerrit-Project: pywikibot/core
Gerrit-Branch: master
Gerrit-Owner: John Vandenberg <jayvdb(a)gmail.com>
Gerrit-Reviewer: Ladsgroup <ladsgroup(a)gmail.com>
Gerrit-Reviewer: Merlijn van Deen <valhallasw(a)arctus.nl>
Gerrit-Reviewer: Xqt <info(a)gno.de>
Gerrit-Reviewer: jenkins-bot <>