jenkins-bot submitted this change.


Approvals:
  Xqt: Looks good to me, approved
  jenkins-bot: Verified

[maintenance] Add a preload_sites.py script to preload site information

Instantiating a BaseLink causes site and user information to be loaded.
This is especially true for Page, Link, SiteLink objects and other
subclasses. The information is cached for 30 days by default. Loading
this site/user information in bulk for a wikibase object can add a long
wait during normal bot operation, and bot operators do not expect such
a delay.

preload_sites.py preloads site/user information in one go and greatly
reduces the wait during normal bot operation if this information isn't
cached already. To force preloading even if the cache has not expired,
the global config option -API_config_expiry:0 can be set.
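
The expiry rule described above can be pictured with a minimal sketch.
The helper needs_reload is hypothetical and only mirrors the behaviour
stated in this message (a 30-day default, and an expiry of 0 forcing a
reload); it is not Pywikibot's actual cache code.

```python
from datetime import datetime, timedelta


def needs_reload(cached_at, now, expiry_days=30):
    """Return True if an entry cached at `cached_at` should be fetched again.

    Hypothetical helper: an entry is stale once its age exceeds
    `expiry_days`; with expiry_days=0 every older entry is stale.
    """
    return cached_at + timedelta(days=expiry_days) < now


now = datetime(2021, 1, 20, 12, 0)
fresh = now - timedelta(days=3)  # cached three days ago

print(needs_reload(fresh, now))                 # default expiry of 30 days
print(needs_reload(fresh, now, expiry_days=0))  # expiry 0, as with the override
```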

preload_sites.py decreases loading time by 65% due to parallel workers.
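
Why parallel workers help here can be shown with a small, self-contained
sketch. It assumes the loads are network-bound and replaces them with a
made-up fake_load that only sleeps; the site names, the delay and the
worker count are illustrative assumptions, and the timings it prints say
nothing about the figure quoted above. The script itself submits one
task per family; the sketch uses one task per item only for brevity.

```python
import time
from concurrent.futures import ThreadPoolExecutor

SITES = ['site{}'.format(i) for i in range(8)]  # made-up site names


def fake_load(site):
    """Stand-in for a network-bound site info request (assumption)."""
    time.sleep(0.05)
    return site


def timed(func):
    """Run func() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


def load_sequential():
    """Load every site one after another."""
    return [fake_load(site) for site in SITES]


def load_parallel():
    """Load the sites with a pool of four worker threads."""
    with ThreadPoolExecutor(max_workers=4) as executor:
        return list(executor.map(fake_load, SITES))


sequential, t_seq = timed(load_sequential)
parallel, t_par = timed(load_parallel)

print('sequential: {:.2f}s, parallel: {:.2f}s'.format(t_seq, t_par))
```

Threads are enough for this kind of work because the time is spent
waiting on I/O rather than computing.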

Bug: T226157
Change-Id: I91f6fe7cd3257c1ec8496c74892ff006b9f86449
---
M docs/scripts/scripts.maintenance.rst
M scripts/README.rst
A scripts/maintenance/preload_sites.py
3 files changed, 90 insertions(+), 1 deletion(-)

diff --git a/docs/scripts/scripts.maintenance.rst b/docs/scripts/scripts.maintenance.rst
index c7e5bf9..c6441df 100644
--- a/docs/scripts/scripts.maintenance.rst
+++ b/docs/scripts/scripts.maintenance.rst
@@ -32,6 +32,11 @@

.. automodule:: scripts.maintenance.make_i18n_dict

+scripts.maintenance.preload_sites script
+-----------------------------------------
+
+.. automodule:: scripts.maintenance.preload_sites
+
scripts.maintenance.sorting\_order script
-----------------------------------------

@@ -42,7 +47,6 @@

.. automodule:: scripts.maintenance.update_linktrails

-
scripts.maintenance.wikimedia\_sites script
-------------------------------------------

diff --git a/scripts/README.rst b/scripts/README.rst
index e2feed6..8a2ec98 100644
--- a/scripts/README.rst
+++ b/scripts/README.rst
@@ -285,6 +285,8 @@
+------------------------+---------------------------------------------------------+
| make_i18n_dict.py | Generate an i18n file from a given script. |
+------------------------+---------------------------------------------------------+
+ | preload_sites.py | Preload and cache site information for each WM family. |
+ +------------------------+---------------------------------------------------------+
| sorting_order.py | Updates interwiki sorting order in family.py file. |
+------------------------+---------------------------------------------------------+
| update_linktrails.py | Script that updates the linktrails in family.py file. |
diff --git a/scripts/maintenance/preload_sites.py b/scripts/maintenance/preload_sites.py
new file mode 100644
index 0000000..5eb4f2a
--- /dev/null
+++ b/scripts/maintenance/preload_sites.py
@@ -0,0 +1,83 @@
+#!/usr/bin/python
+"""Script that preloads site and user info for all sites of given family.
+
+The following parameters are supported:
+
+-worker:<num>     The number of parallel tasks to be run. Default is the
+                  number of processors on the machine

+Usage:

+    python pwb.py preload_sites [{<family>}] [-worker:<num>]

+To force preloading, change the global expiry value to 0:

+    python pwb.py -API_config_expiry:0 preload_sites [{<family>}]
+
+"""
+#
+# (C) Pywikibot team, 2021
+#
+# Distributed under the terms of the MIT license.
+#
+from concurrent.futures import ThreadPoolExecutor, wait
+from datetime import datetime
+
+import pywikibot
+
+from pywikibot.family import Family
+
+# supported families by this script
+families_list = [
+    'wikibooks',
+    'wikinews',
+    'wikipedia',
+    'wikiquote',
+    'wikisource',
+    'wikiversity',
+    'wikivoyage',
+    'wiktionary',
+]
+
+exceptions = {
+}
+
+
+def preload_family(family):
+    """Preload all sites of a single family file."""
+    msg = 'Preloading sites of {} family{}'
+    pywikibot.output(msg.format(family, '...'))

+    codes = Family.load(family).languages_by_size
+    for code in exceptions.get(family, []):
+        if code in codes:
+            codes.remove(code)
+    obsolete = Family.load(family).obsolete

+    for code in codes:
+        if code not in obsolete:
+            site = pywikibot.Site(code, family)
+            pywikibot.Page(site, 'Main page')  # title does not matter

+    pywikibot.output(msg.format(family, ' completed.'))
+
+
+def preload_families(families, worker):
+    """Preload all sites of all given family files."""
+    start = datetime.now()
+    with ThreadPoolExecutor(worker) as executor:
+        futures = {executor.submit(preload_family, family):
+                   family for family in families}
+        wait(futures)
+    pywikibot.output('Loading time used: {}'.format(datetime.now() - start))


+if __name__ == '__main__':
+    fam = set()
+    worker = None
+    for arg in pywikibot.handle_args():
+        if arg in families_list:
+            fam.add(arg)
+        elif arg.startswith('-worker'):
+            worker = int(arg.partition(':')[2])
+    preload_families(fam or families_list, worker)

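The argument loop at the end of the new script reads the worker count
with int(arg.partition(':')[2]), which raises a bare ValueError when
-worker is given without a value. One possible hardening is sketched
below; it is not part of this change, and parse_args and its error
message are hypothetical names chosen for illustration.

```python
def parse_args(args, known_families):
    """Split args into (families, worker); worker is None if not given."""
    families = set()
    worker = None
    for arg in args:
        if arg in known_families:
            families.add(arg)
        elif arg.startswith('-worker'):
            # Text after the colon; empty when only '-worker' is given.
            value = arg.partition(':')[2]
            if not value.isdigit() or int(value) < 1:
                raise ValueError('-worker expects a positive integer, '
                                 'e.g. -worker:4')
            worker = int(value)
    return families, worker


known = ['wikipedia', 'wiktionary']  # shortened list for the example
print(parse_args(['wikipedia', '-worker:4'], known))
```

Arguments that match neither a known family nor -worker are ignored,
which is also what the loop in the script does.
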

Gerrit-Project: pywikibot/core
Gerrit-Branch: master
Gerrit-Change-Id: I91f6fe7cd3257c1ec8496c74892ff006b9f86449
Gerrit-Change-Number: 656997
Gerrit-PatchSet: 4
Gerrit-Owner: Xqt <info@gno.de>
Gerrit-Reviewer: Bugreporter <bugreporter1@sina.com>
Gerrit-Reviewer: D3r1ck01 <xsavitar.wiki@aol.com>
Gerrit-Reviewer: JJMC89 <JJMC89.Wikimedia@gmail.com>
Gerrit-Reviewer: Matěj Suchánek <matejsuchanek97@gmail.com>
Gerrit-Reviewer: Xqt <info@gno.de>
Gerrit-Reviewer: jenkins-bot
Gerrit-MessageType: merged