jenkins-bot has submitted this change and it was merged.
Change subject: harvest_template.py perf: dont parse and discard
......................................................................
harvest_template.py perf: dont parse and discard
Currently the wikipage is parsed even when all of the parsed values will be discarded. This is slow and noisy.
When an item has claims for all properties to be processed, skip the item before fetching and parsing the wikitext.
Improve existing claim merge TODO to properly describe the complexity.
Change-Id: Ib3df4cce7125591e8fa585d18988e2a605456a09
---
M scripts/harvest_template.py
1 file changed, 6 insertions(+), 2 deletions(-)
Approvals:
  Xqt: Looks good to me, approved
  jenkins-bot: Verified
diff --git a/scripts/harvest_template.py b/scripts/harvest_template.py
index d85084e..65ec3b3 100755
--- a/scripts/harvest_template.py
+++ b/scripts/harvest_template.py
@@ -94,6 +94,9 @@
         if not item.exists():
             pywikibot.output('%s doesn\'t have a wikidata item :(' % page)
             #TODO FIXME: We should provide an option to create the page
+        item.get()
+        if set(self.fields.values()) <= set(item.claims.keys()):
+            pywikibot.output('%s item %s has claims for all properties. Skipping' % (page, item.title()))
         else:
             pagetext = page.get()
             templates = pywikibot.extract_templates_and_params(pagetext)
@@ -121,8 +124,9 @@
                     pywikibot.output(
                         u'A claim for %s already exists. Skipping'
                         % claim.getID())
-                    # TODO FIXME: This is a very crude way of dupe
-                    # checking
+                    # TODO: Implement smarter approach to merging
+                    # harvested values with existing claims esp.
+                    # without overwriting humans unintentionally.
                 else:
                     if claim.getType() == 'wikibase-item':
                         # Try to extract a valid page
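
For illustration, the skip condition added above boils down to a set-subset check: the item is skipped when every property the bot would harvest (self.fields.values()) already appears among the item's existing claims, so fetching and parsing the wikitext would only produce values that get discarded. A minimal, self-contained sketch of that check, assuming fields maps template parameter names to property IDs and existing_claims maps property IDs to the claims already on the item (the property IDs below are made up, not taken from the patch):

    # Sketch of the skip check; identifiers and IDs are illustrative only.
    fields = {'birth_date': 'P569', 'death_date': 'P570'}
    existing_claims = {'P569': ['claim'], 'P570': ['claim'], 'P21': ['claim']}

    # True when every property we would harvest already has a claim,
    # so there is no need to fetch or parse the page's wikitext.
    skip = set(fields.values()) <= set(existing_claims.keys())
    print(skip)  # True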