http://www.mediawiki.org/wiki/Special:Code/pywikipedia/10528
Revision: 10528
Author: xqt
Date: 2012-09-16 13:48:36 +0000 (Sun, 16 Sep 2012)
Log Message:
-----------
old python 2.3 scripts
Added Paths:
-----------
archive/old python 2.3 scripts/
archive/old python 2.3 scripts/interwiki.py
archive/old python 2.3 scripts/wikipedia.py
Copied: archive/old python 2.3 scripts/interwiki.py (from rev 10463, trunk/pywikipedia/interwiki.py)
===================================================================
--- archive/old python 2.3 scripts/interwiki.py (rev 0)
+++ archive/old python 2.3 scripts/interwiki.py 2012-09-16 13:48:36 UTC (rev 10528)
@@ -0,0 +1,2585 @@
+#!/usr/bin/python
+# -*- coding: utf-8 -*-
+"""
+Script to check language links for general pages. This works by downloading the
+page, and using existing translations plus hints from the command line to
+download the equivalent pages from other languages. All such pages are
+downloaded as well and checked for interwiki links recursively until no new
+links are encountered. A rationalization process then selects the
+right interwiki links, and if this is unambiguous, the interwiki links in the
+original page will be automatically updated and the modified page uploaded.
+
+These command-line arguments can be used to specify which pages to work on:
+
+&pagegenerators_help;
+
+ -days: Like -years, but runs through all date pages. Stops at
+ Dec 31. If the argument is given in the form -days:X,
+ it will start at month no. X through Dec 31. If the
+ argument is simply given as -days, it will run from
+ Jan 1 through Dec 31. E.g. for -days:9 it will run
+ from Sep 1 through Dec 31.
+
+ -years: run on all year pages in numerical order. Stop at year 2050.
+ If the argument is given in the form -years:XYZ, it
+ will run from [[XYZ]] through [[2050]]. If XYZ is a
+ negative value, it is interpreted as a year BC. If the
+ argument is simply given as -years, it will run from 1
+ through 2050.
+
+ This implies -noredirect.
+
+ -new: Work on the 100 newest pages. If given as -new:x, will work
+ on the x newest pages.
+ When multiple -namespace parameters are given, x pages are
+ inspected, and only the ones in the selected name spaces are
+ processed. Use -namespace:all for all namespaces. Without
+ -namespace, only article pages are processed.
+
+ This implies -noredirect.
+
+ -restore: restore a set of "dumped" pages the robot was working on
+ when it terminated. The dump file will be subsequently
+ removed.
+
+   -restore:all restore the "dumped" pages of all dump files for a given
+                family remaining in the "interwiki-dumps" directory. All
+                these dump files will be subsequently removed. If the restore
+                process is interrupted again, all unprocessed pages are saved
+                in one new dump file for the given site.
+
+ -continue: like restore, but after having gone through the dumped pages,
+ continue alphabetically starting at the last of the dumped
+ pages. The dump file will be subsequently removed.
+
+ -warnfile: used as -warnfile:filename, reads all warnings from the
+ given file that apply to the home wiki language,
+                 and reads the rest of each warning as a hint. It then
+                 treats all the mentioned pages. A quicker way to
+ implement warnfile suggestions without verifying them
+ against the live wiki is using the warnfile.py
+ script.
+
+Additionally, these arguments can be used to restrict the bot to certain pages:
+
+ -namespace:n Number or name of namespace to process. The parameter can be
+ used multiple times. It works in combination with all other
+ parameters, except for the -start parameter. If you e.g.
+ want to iterate over all categories starting at M, use
+ -start:Category:M.
+
+ -number: used as -number:#, specifies that the robot should process
+ that amount of pages and then stop. This is only useful in
+ combination with -start. The default is not to stop.
+
+ -until: used as -until:title, specifies that the robot should
+ process pages in wiki default sort order up to, and
+ including, "title" and then stop. This is only useful in
+ combination with -start. The default is not to stop.
+ Note: do not specify a namespace, even if -start has one.
+
+ -bracket only work on pages that have (in the home language)
+ parenthesis in their title. All other pages are skipped.
+ (note: without ending colon)
+
+ -skipfile: used as -skipfile:filename, skip all links mentioned in
+ the given file. This does not work with -number!
+
+ -skipauto use to skip all pages that can be translated automatically,
+ like dates, centuries, months, etc.
+ (note: without ending colon)
+
+ -lack: used as -lack:xx with xx a language code: only work on pages
+ without links to language xx. You can also add a number nn
+ like -lack:xx:nn, so that the bot only works on pages with
+ at least nn interwiki links (the default value for nn is 1).
+
+These arguments control miscellaneous bot behaviour:
+
+ -quiet Use this option to get less output
+ (note: without ending colon)
+
+    -async         Put the page on a queue to be saved to the wiki
+                   asynchronously. This enables loading pages while the save
+                   throttle is waiting and gives better performance.
+                   NOTE: For post-processing it always assumes that saving
+                   the pages was successful.
+                   (note: without ending colon)
+
+ -summary: Set an additional action summary message for the edit. This
+                could be used to further explain the bot's action.
+ This will only be used in non-autonomous mode.
+
+ -hintsonly The bot does not ask for a page to work on, even if none of
+ the above page sources was specified. This will make the
+                first existing page of -hint or -hintfile slip in as the start
+ page, determining properties like namespace, disambiguation
+ state, and so on. When no existing page is found in the
+ hints, the bot does nothing.
+ Hitting return without input on the "Which page to check:"
+ prompt has the same effect as using -hintsonly.
+ Options like -back, -same or -wiktionary are in effect only
+ after a page has been found to work on.
+ (note: without ending colon)
+
+These arguments are useful to provide hints to the bot:
+
+ -hint: used as -hint:de:Anweisung to give the robot a hint
+ where to start looking for translations. If no text
+ is given after the second ':', the name of the page
+ itself is used as the title for the hint, unless the
+ -hintnobracket command line option (see there) is also
+ selected.
+
+ There are some special hints, trying a number of languages
+ at once:
+ * all: All languages with at least ca. 100 articles.
+ * 10: The 10 largest languages (sites with most
+ articles). Analogous for any other natural
+ number.
+ * arab: All languages using the Arabic alphabet.
+ * cyril: All languages that use the Cyrillic alphabet.
+ * chinese: All Chinese dialects.
+ * latin: All languages using the Latin script.
+ * scand: All Scandinavian languages.
+
+ Names of families that forward their interlanguage links
+ to the wiki family being worked upon can be used (with
+                   -family=wikipedia only); they are:
+                   * commons: Interlanguage links of Wikimedia Commons.
+                   * incubator: Links in pages on the Wikimedia Incubator.
+                   * meta: Interlanguage links of named pages on Meta.
+                   * species: Interlanguage links of the Wikispecies wiki.
+                   * strategy: Links in pages on Wikimedia's strategy wiki.
+ * test: Take interwiki links from Test Wikipedia
+
+ Languages, groups and families having the same page title
+ can be combined, as -hint:5,scand,sr,pt,commons:New_York
+
+ -hintfile: similar to -hint, except that hints are taken from the given
+                file, each enclosed in [[]], instead of the command line.
+
+    -askhints: for each page, the bot asks for one or more hints. See -hint: above
+ for the format, one can for example give "en:something" or
+ "20:" as hint.
+
+ -same looks over all 'serious' languages for the same title.
+ -same is equivalent to -hint:all:
+ (note: without ending colon)
+
+ -wiktionary: similar to -same, but will ONLY accept names that are
+ identical to the original. Also, if the title is not
+ capitalized, it will only go through other wikis without
+ automatic capitalization.
+
+ -untranslated: works normally on pages with at least one interlanguage
+ link; asks for hints for pages that have none.
+
+ -untranslatedonly: same as -untranslated, but pages which already have a
+ translation are skipped. Hint: do NOT use this in
+ combination with -start without a -number limit, because
+ you will go through the whole alphabet before any queries
+ are performed!
+
+ -showpage when asking for hints, show the first bit of the text
+ of the page always, rather than doing so only when being
+ asked for (by typing '?'). Only useful in combination
+ with a hint-asking option like -untranslated, -askhints
+ or -untranslatedonly.
+ (note: without ending colon)
+
+ -noauto Do not use the automatic translation feature for years and
+ dates, only use found links and hints.
+ (note: without ending colon)
+
+ -hintnobracket used to make the robot strip everything in brackets,
+ and surrounding spaces from the page name, before it is
+ used in a -hint:xy: where the page name has been left out,
+ or -hint:all:, -hint:10:, etc. without a name, or
+ an -askhint reply, where only a language is given.
+
+These arguments define how much user confirmation is required:
+
+ -autonomous run automatically, do not ask any questions. If a question
+ -auto to an operator is needed, write the name of the page
+ to autonomous_problems.dat and continue on the next page.
+ (note: without ending colon)
+
+ -confirm ask for confirmation before any page is changed on the
+ live wiki. Without this argument, additions and
+ unambiguous modifications are made without confirmation.
+ (note: without ending colon)
+
+ -force do not ask permission to make "controversial" changes,
+ like removing a language because none of the found
+ alternatives actually exists.
+ (note: without ending colon)
+
+ -cleanup like -force but only removes interwiki links to non-existent
+ or empty pages.
+
+ -select ask for each link whether it should be included before
+ changing any page. This is useful if you want to remove
+ invalid interwiki links and if you do multiple hints of
+ which some might be correct and others incorrect. Combining
+ -select and -confirm is possible, but seems like overkill.
+ (note: without ending colon)
+
+These arguments specify in which way the bot should follow interwiki links:
+
+ -noredirect do not follow redirects nor category redirects.
+ (note: without ending colon)
+
+ -initialredirect work on its target if a redirect or category redirect is
+ entered on the command line or by a generator (note: without
+ ending colon). It is recommended to use this option with the
+ -movelog pagegenerator.
+
+ -neverlink: used as -neverlink:xx where xx is a language code:
+ Disregard any links found to language xx. You can also
+ specify a list of languages to disregard, separated by
+ commas.
+
+ -ignore: used as -ignore:xx:aaa where xx is a language code, and
+ aaa is a page title to be ignored.
+
+ -ignorefile: similar to -ignore, except that the pages are taken from
+ the given file instead of the command line.
+
+    -localright do not follow interwiki links from pages other than the
+ starting page. (Warning! Should be used very sparingly,
+ only when you are sure you have first gotten the interwiki
+ links on the starting page exactly right).
+ (note: without ending colon)
+
+ -hintsareright do not follow interwiki links to sites for which hints
+                   on existing pages are given. Note that hints given
+                   interactively via the -askhint command line option
+                   are only effective once they have been entered; thus
+                   interwiki links on the starting page are followed
+                   regardless of hints given when prompted.
+ (Warning! Should be used with caution!)
+ (note: without ending colon)
+
+ -back only work on pages that have no backlink from any other
+ language; if a backlink is found, all work on the page
+ will be halted. (note: without ending colon)
+
+The following arguments are only important for users who have accounts for
+multiple languages, and specify on which sites the bot should modify pages:
+
+ -localonly only work on the local wiki, not on other wikis in the
+ family I have a login at. (note: without ending colon)
+
+ -limittwo only update two pages - one in the local wiki (if logged-in)
+ and one in the top available one.
+ For example, if the local page has links to de and fr,
+ this option will make sure that only the local site and
+ the de: (larger) sites are updated. This option is useful
+                   to quickly set two-way links without updating all of the
+                   wiki family's sites.
+ (note: without ending colon)
+
+ -whenneeded works like limittwo, but other languages are changed in the
+ following cases:
+ * If there are no interwiki links at all on the page
+ * If an interwiki link must be removed
+ * If an interwiki link must be changed and there has been
+ a conflict for this page
+ Optionally, -whenneeded can be given an additional number
+ (for example -whenneeded:3), in which case other languages
+ will be changed if there are that number or more links to
+ change or add. (note: without ending colon)
+
+The following arguments influence how many pages the bot works on at once:
+
+    -array: The number of pages the bot tries to work on at once.
+ If the number of pages loaded is lower than this number,
+ a new set of pages is loaded from the starting wiki. The
+ default is 100, but can be changed in the config variable
+ interwiki_min_subjects
+
+ -query: The maximum number of pages that the bot will load at once.
+ Default value is 60.
+
+Some configuration options can be used to change the working of this robot:
+
+interwiki_min_subjects: the minimum number of subjects that should be processed
+ at the same time.
+
+interwiki_backlink: if set to True, all problems in foreign wikis will
+ be reported
+
+interwiki_shownew: should interwiki.py display every new link it discovers?
+
+interwiki_graph: output a graph PNG file on conflicts? You need pydot for
+                 this: http://dkbza.org/pydot.html
+
+interwiki_graph_format: the file format for interwiki graphs
+
+without_interwiki: save file with local articles without interwikis
+
+All these options can be changed through the user-config.py configuration file.
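+
+For example, a user-config.py might contain entries like the following (the
+values shown here are only illustrative, not recommendations):
+
+    interwiki_min_subjects = 100
+    interwiki_backlink = True
+    interwiki_shownew = True
+    interwiki_graph = False
+    without_interwiki = False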
+
+If interwiki.py is terminated before it is finished, it will write a dump file
+to the interwiki-dumps subdirectory. The program will read it if invoked with
+the "-restore" or "-continue" option, and finish all the subjects in
that list.
+After finishing, the dump file will be deleted. To run the interwiki-bot on all
+pages on a language, run it with option "-start:!", and if it takes so long
+that you have to break it off, use "-continue" next time.
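+
+For example, an illustrative invocation (adjust the interpreter and any extra
+options to your own setup) might be:
+
+    python interwiki.py -start:! -autonomous
+
+and, after breaking it off:
+
+    python interwiki.py -continue -autonomous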
+
+"""
+#
+# (C) Rob W.W. Hooft, 2003
+# (C) Daniel Herding, 2004
+# (C) Yuri Astrakhan, 2005-2006
+# (C) xqt, 2009-2012
+# (C) Pywikipedia bot team, 2007-2012
+#
+# Distributed under the terms of the MIT license.
+#
+__version__ = '$Id$'
+#
+
+import sys, copy, re, os
+import time
+import codecs
+import socket
+
+try:
+ set # introduced in Python 2.4: faster and future
+except NameError:
+ from sets import Set as set
+
+try: sorted ## Introduced in 2.4
+except NameError:
+ def sorted(seq, cmp=None, key=None, reverse=False):
+ """Copy seq and sort and return it.
+ >>> sorted([3, 1, 2])
+ [1, 2, 3]
+ """
+ seq2 = copy.copy(seq)
+ if key:
+ if cmp is None:
+ cmp = __builtins__.cmp
+ seq2.sort(lambda x,y: cmp(key(x), key(y)))
+ else:
+ if cmp is None:
+ seq2.sort()
+ else:
+ seq2.sort(cmp)
+ if reverse:
+ seq2.reverse()
+ return seq2
+
+import wikipedia as pywikibot
+import config
+import catlib
+import pagegenerators
+from pywikibot import i18n
+import titletranslate, interwiki_graph
+import webbrowser
+
+docuReplacements = {
+ '&pagegenerators_help;': pagegenerators.parameterHelp
+}
+
+class SaveError(pywikibot.Error):
+ """
+ An attempt to save a page with changed interwiki has failed.
+ """
+
+class LinkMustBeRemoved(SaveError):
+ """
+ An interwiki link has to be removed, but this can't be done because of user
+ preferences or because the user chose not to change the page.
+ """
+
+class GiveUpOnPage(pywikibot.Error):
+ """
+ The user chose not to work on this page and its linked pages any more.
+ """
+
+# Subpage templates. Must be in lower case,
+# whereas subpage itself must be case sensitive
+moved_links = {
+ 'bn' : (u'documentation', u'/doc'),
+ 'ca' : (u'ús de la plantilla', u'/ús'),
+ 'cs' : (u'dokumentace', u'/doc'),
+ 'da' : (u'dokumentation', u'/doc'),
+ 'de' : (u'dokumentation', u'/Meta'),
+    'dsb': ([u'dokumentacija', u'doc'], u'/Dokumentacija'),
+ 'en' : ([u'documentation',
+ u'template documentation',
+ u'template doc',
+ u'doc',
+ u'documentation, template'], u'/doc'),
+    'es' : ([u'documentación', u'documentación de plantilla'], u'/doc'),
+ 'eu' : (u'txantiloi dokumentazioa', u'/dok'),
+ 'fa' : ([u'documentation',
+ u'template documentation',
+ u'template doc',
+ u'doc',
+ u'توضیحات',
+ u'زیرصفحه توضیحات'], u'/doc'),
+ # fi: no idea how to handle this type of subpage at :Metasivu:
+ 'fi' : (u'mallineohje', None),
+    'fr' : ([u'/documentation', u'documentation', u'doc_modèle',
+ u'documentation modèle', u'documentation modèle compliqué',
+ u'documentation modèle en sous-page',
+ u'documentation modèle compliqué en sous-page',
+ u'documentation modèle utilisant les parserfunctions en sous-page',
+ ],
+ u'/Documentation'),
+    'hsb': ([u'dokumentacija', u'doc'], u'/Dokumentacija'),
+ 'hu' : (u'sablondokumentáció', u'/doc'),
+ 'id' : (u'template doc', u'/doc'),
+ 'ja' : (u'documentation', u'/doc'),
+ 'ka' : (u'თარგის ინფო', u'/ინფო'),
+ 'ko' : (u'documentation', u'/설명문서'),
+ 'ms' : (u'documentation', u'/doc'),
+ 'no' : (u'dokumentasjon', u'/dok'),
+ 'nn' : (u'dokumentasjon', u'/dok'),
+ 'pl' : (u'dokumentacja', u'/opis'),
+ 'pt' : ([u'documentação', u'/doc'], u'/doc'),
+ 'ro' : (u'documentaţie', u'/doc'),
+ 'ru' : (u'doc', u'/doc'),
+ 'sv' : (u'dokumentation', u'/dok'),
+ 'uk' : ([u'документація',
+ u'doc',
+ u'documentation'], u'/Документація'),
+ 'vi' : (u'documentation', u'/doc'),
+ 'zh' : ([u'documentation', u'doc'], u'/doc'),
+}
+
+# A list of template names in different languages.
+# Pages which contain these shouldn't be changed.
+ignoreTemplates = {
+ '_default': [u'delete'],
+ 'ar' : [u'قيد الاستخدام'],
+ 'cs' : [u'Pracuje_se'],
+    'de' : [u'inuse', 'in use', u'in bearbeitung', u'inbearbeitung',
+ u'löschen', u'sla',
+ u'löschantrag', u'löschantragstext',
+ u'falschschreibung',
+ u'obsolete schreibung', 'veraltete schreibweise'],
+ 'en' : [u'inuse', u'softredirect'],
+ 'fa' : [u'در دست ویرایش ۲', u'حذف سریع'],
+ 'pdc': [u'lösche'],
+}
+
+class Global(object):
+ """
+ Container class for global settings.
+ Use of globals outside of this is to be avoided.
+ """
+ autonomous = False
+ confirm = False
+ always = False
+ select = False
+ followredirect = True
+ initialredirect = False
+ force = False
+ cleanup = False
+ remove = []
+ maxquerysize = 60
+ same = False
+ skip = set()
+ skipauto = False
+ untranslated = False
+ untranslatedonly = False
+ auto = True
+ neverlink = []
+ showtextlink = 0
+ showtextlinkadd = 300
+ localonly = False
+ limittwo = False
+ strictlimittwo = False
+ needlimit = 0
+ ignore = []
+ parenthesesonly = False
+ rememberno = False
+ followinterwiki = True
+ minsubjects = config.interwiki_min_subjects
+ nobackonly = False
+ askhints = False
+ hintnobracket = False
+ hints = []
+ hintsareright = False
+ contentsondisk = config.interwiki_contents_on_disk
+ lacklanguage = None
+ minlinks = 0
+ quiet = False
+ restoreAll = False
+ async = False
+ summary = u''
+
+ def readOptions(self, arg):
+        """ Read all commandline parameters for the global container """
+ if arg == '-noauto':
+ self.auto = False
+ elif arg.startswith('-hint:'):
+ self.hints.append(arg[6:])
+ elif arg.startswith('-hintfile'):
+ hintfilename = arg[10:]
+ if (hintfilename is None) or (hintfilename == ''):
+                hintfilename = pywikibot.input(u'Please enter the hint filename:')
+ f = codecs.open(hintfilename, 'r', config.textfile_encoding)
+            R = re.compile(ur'\[\[(.+?)(?:\]\]|\|)') # hint or title ends either before | or before ]]
+ for pageTitle in R.findall(f.read()):
+ self.hints.append(pageTitle)
+ f.close()
+ elif arg == '-force':
+ self.force = True
+ elif arg == '-cleanup':
+ self.cleanup = True
+ elif arg == '-same':
+ self.same = True
+ elif arg == '-wiktionary':
+ self.same = 'wiktionary'
+ elif arg == '-untranslated':
+ self.untranslated = True
+ elif arg == '-untranslatedonly':
+ self.untranslated = True
+ self.untranslatedonly = True
+ elif arg == '-askhints':
+ self.untranslated = True
+ self.untranslatedonly = False
+ self.askhints = True
+ elif arg == '-hintnobracket':
+ self.hintnobracket = True
+ elif arg == '-confirm':
+ self.confirm = True
+ elif arg == '-select':
+ self.select = True
+ elif arg == '-autonomous' or arg == '-auto':
+ self.autonomous = True
+ elif arg == '-noredirect':
+ self.followredirect = False
+ elif arg == '-initialredirect':
+ self.initialredirect = True
+ elif arg == '-localonly':
+ self.localonly = True
+ elif arg == '-limittwo':
+ self.limittwo = True
+ self.strictlimittwo = True
+ elif arg.startswith('-whenneeded'):
+ self.limittwo = True
+ self.strictlimittwo = False
+ try:
+ self.needlimit = int(arg[12:])
+ except KeyError:
+ pass
+ except ValueError:
+ pass
+ elif arg.startswith('-skipfile:'):
+ skipfile = arg[10:]
+ skipPageGen = pagegenerators.TextfilePageGenerator(skipfile)
+ for page in skipPageGen:
+ self.skip.add(page)
+ del skipPageGen
+ elif arg == '-skipauto':
+ self.skipauto = True
+ elif arg.startswith('-neverlink:'):
+ self.neverlink += arg[11:].split(",")
+ elif arg.startswith('-ignore:'):
+            self.ignore += [pywikibot.Page(None,p) for p in arg[8:].split(",")]
+ elif arg.startswith('-ignorefile:'):
+ ignorefile = arg[12:]
+ ignorePageGen = pagegenerators.TextfilePageGenerator(ignorefile)
+ for page in ignorePageGen:
+ self.ignore.append(page)
+ del ignorePageGen
+ elif arg == '-showpage':
+ self.showtextlink += self.showtextlinkadd
+ elif arg == '-graph':
+ # override configuration
+ config.interwiki_graph = True
+ elif arg == '-bracket':
+ self.parenthesesonly = True
+ elif arg == '-localright':
+ self.followinterwiki = False
+ elif arg == '-hintsareright':
+ self.hintsareright = True
+ elif arg.startswith('-array:'):
+ self.minsubjects = int(arg[7:])
+ elif arg.startswith('-query:'):
+ self.maxquerysize = int(arg[7:])
+ elif arg == '-back':
+ self.nobackonly = True
+ elif arg == '-quiet':
+ self.quiet = True
+ elif arg == '-async':
+ self.async = True
+ elif arg.startswith('-summary'):
+ if len(arg) == 8:
+                self.summary = pywikibot.input(u'What summary do you want to use?')
+ else:
+ self.summary = arg[9:]
+ elif arg.startswith('-lack:'):
+ remainder = arg[6:].split(':')
+ self.lacklanguage = remainder[0]
+ if len(remainder) > 1:
+ self.minlinks = int(remainder[1])
+ else:
+ self.minlinks = 1
+ else:
+ return False
+ return True
+
+class StoredPage(pywikibot.Page):
+ """
+    Store the Page contents on disk to avoid using too much
+    memory when a large number of Page objects is loaded
+    at the same time.
+ """
+
+ # Please prefix the class members names by SP
+ # to avoid possible name clashes with pywikibot.Page
+
+ # path to the shelve
+ SPpath = None
+ # shelve
+ SPstore = None
+
+ # attributes created by pywikibot.Page.__init__
+ SPcopy = [ '_editrestriction',
+ '_site',
+ '_namespace',
+ '_section',
+ '_title',
+ 'editRestriction',
+ 'moveRestriction',
+ '_permalink',
+ '_userName',
+ '_ipedit',
+ '_editTime',
+ '_startTime',
+ '_revisionId',
+ '_deletedRevs' ]
+
+ def SPdeleteStore():
+ if StoredPage.SPpath:
+ del StoredPage.SPstore
+ os.unlink(StoredPage.SPpath)
+ SPdeleteStore = staticmethod(SPdeleteStore)
+
+ def __init__(self, page):
+ for attr in StoredPage.SPcopy:
+ setattr(self, attr, getattr(page, attr))
+
+ if not StoredPage.SPpath:
+ import shelve
+ index = 1
+ while True:
+                path = config.datafilepath('cache', 'pagestore' + str(index))
+ if not os.path.exists(path): break
+ index += 1
+ StoredPage.SPpath = path
+ StoredPage.SPstore = shelve.open(path)
+
+ self.SPkey = str(self)
+ self.SPcontentSet = False
+
+ def SPgetContents(self):
+ return StoredPage.SPstore[self.SPkey]
+
+ def SPsetContents(self, contents):
+ self.SPcontentSet = True
+ StoredPage.SPstore[self.SPkey] = contents
+
+ def SPdelContents(self):
+ if self.SPcontentSet:
+ del StoredPage.SPstore[self.SPkey]
+
+ _contents = property(SPgetContents, SPsetContents, SPdelContents)
+
+class PageTree(object):
+ """
+ Structure to manipulate a set of pages.
+ Allows filtering efficiently by Site.
+ """
+ def __init__(self):
+ # self.tree :
+ # Dictionary:
+ # keys: Site
+ # values: list of pages
+ # All pages found within Site are kept in
+ # self.tree[site]
+
+ # While using dict values would be faster for
+ # the remove() operation,
+ # keeping list values is important, because
+ # the order in which the pages were found matters:
+ # the earlier a page is found, the closer it is to the
+ # Subject.originPage. Chances are that pages found within
+ # 2 interwiki distance from the originPage are more related
+ # to the original topic than pages found later on, after
+ # 3, 4, 5 or more interwiki hops.
+
+ # Keeping this order is hence important to display an ordered
+ # list of pages to the user when he'll be asked to resolve
+ # conflicts.
+ self.tree = {}
+ self.size = 0
+
+ def filter(self, site):
+ """
+ Iterates over pages that are in Site site
+ """
+ try:
+ for page in self.tree[site]:
+ yield page
+ except KeyError:
+ pass
+
+ def __len__(self):
+ return self.size
+
+ def add(self, page):
+ site = page.site
+ if not site in self.tree:
+ self.tree[site] = []
+ self.tree[site].append(page)
+ self.size += 1
+
+ def remove(self, page):
+ try:
+ self.tree[page.site].remove(page)
+ self.size -= 1
+ except ValueError:
+ pass
+
+ def removeSite(self, site):
+ """
+ Removes all pages from Site site
+ """
+ try:
+ self.size -= len(self.tree[site])
+ del self.tree[site]
+ except KeyError:
+ pass
+
+ def siteCounts(self):
+ """
+ Yields (Site, number of pages in site) pairs
+ """
+ for site, d in self.tree.iteritems():
+ yield site, len(d)
+
+ def __iter__(self):
+ for site, plist in self.tree.iteritems():
+ for page in plist:
+ yield page
+
+class Subject(object):
+ """
+ Class to follow the progress of a single 'subject' (i.e. a page with
+ all its translations)
+
+
+ Subject is a transitive closure of the binary relation on Page:
+ "has_a_langlink_pointing_to".
+
+ A formal way to compute that closure would be:
+
+ With P a set of pages, NL ('NextLevel') a function on sets defined as:
+ NL(P) = { target | ∃ source ∈ P, target ∈ source.langlinks() }
+ pseudocode:
+ todo <- [originPage]
+ done <- []
+ while todo != []:
+ pending <- todo
+ todo <-NL(pending) / done
+ done <- NL(pending) U done
+ return done
+
+
+ There is, however, one limitation that is induced by implementation:
+ to compute efficiently NL(P), one has to load the page contents of
+ pages in P.
+ (Not only the langlinks have to be parsed from each Page, but we also want
+ to know if the Page is a redirect, a disambiguation, etc...)
+
+ Because of this, the pages in pending have to be preloaded.
+ However, because the pages in pending are likely to be in several sites
+ we cannot "just" preload them as a batch.
+
+    Instead of doing "pending <- todo" at each iteration, we have to elect a
+ Site, and we put in pending all the pages from todo that belong to that
+ Site:
+
+ Code becomes:
+ todo <- {originPage.site:[originPage]}
+ done <- []
+ while todo != {}:
+ site <- electSite()
+ pending <- todo[site]
+
+ preloadpages(site, pending)
+
+ todo[site] <- NL(pending) / done
+ done <- NL(pending) U done
+ return done
+
+
+ Subject objects only operate on pages that should have been preloaded before.
+ In fact, at any time:
+ * todo contains new Pages that have not been loaded yet
+ * done contains Pages that have been loaded, and that have been treated.
+ * If batch preloadings are successful, Page._get() is never called from
+ this Object.
+ """
+
+ def __init__(self, originPage=None, hints=None):
+ """Constructor. Takes as arguments the Page on the home wiki
+ plus optionally a list of hints for translation"""
+
+ if globalvar.contentsondisk:
+ if originPage:
+ originPage = StoredPage(originPage)
+
+ # Remember the "origin page"
+ self.originPage = originPage
+ # todo is a list of all pages that still need to be analyzed.
+ # Mark the origin page as todo.
+ self.todo = PageTree()
+ if originPage:
+ self.todo.add(originPage)
+
+ # done is a list of all pages that have been analyzed and that
+ # are known to belong to this subject.
+ self.done = PageTree()
+ # foundIn is a dictionary where pages are keys and lists of
+ # pages are values. It stores where we found each page.
+ # As we haven't yet found a page that links to the origin page, we
+ # start with an empty list for it.
+ if originPage:
+ self.foundIn = {self.originPage:[]}
+ else:
+ self.foundIn = {}
+ # This is a list of all pages that are currently scheduled for
+ # download.
+ self.pending = PageTree()
+ if globalvar.hintsareright:
+ # This is a set of sites that we got hints to
+ self.hintedsites = set()
+ self.translate(hints, globalvar.hintsareright)
+ self.confirm = globalvar.confirm
+ self.problemfound = False
+ self.untranslated = None
+ self.hintsAsked = False
+ self.forcedStop = False
+ self.workonme = True
+
+ def getFoundDisambig(self, site):
+ """
+ If we found a disambiguation on the given site while working on the
+ subject, this method returns it. If several ones have been found, the
+ first one will be returned.
+ Otherwise, None will be returned.
+ """
+ for tree in [self.done, self.pending]:
+ for page in tree.filter(site):
+ if page.exists() and page.isDisambig():
+ return page
+ return None
+
+ def getFoundNonDisambig(self, site):
+ """
+ If we found a non-disambiguation on the given site while working on the
+ subject, this method returns it. If several ones have been found, the
+ first one will be returned.
+ Otherwise, None will be returned.
+ """
+ for tree in [self.done, self.pending]:
+ for page in tree.filter(site):
+ if page.exists() and not page.isDisambig() \
+ and not page.isRedirectPage() and not page.isCategoryRedirect():
+ return page
+ return None
+
+ def getFoundInCorrectNamespace(self, site):
+ """
+ If we found a page that has the expected namespace on the given site
+ while working on the subject, this method returns it. If several ones
+ have been found, the first one will be returned.
+ Otherwise, None will be returned.
+ """
+ for tree in [self.done, self.pending, self.todo]:
+ for page in tree.filter(site):
+ # -hintsonly: before we have an origin page, any namespace will do.
+ if self.originPage and page.namespace() == self.originPage.namespace():
+ if page.exists() and not page.isRedirectPage() and not
page.isCategoryRedirect():
+ return page
+ return None
+
+ def translate(self, hints = None, keephintedsites = False):
+        """Add the given translation hints to the todo list"""
+ if globalvar.same and self.originPage:
+ if hints:
+                pages = titletranslate.translate(self.originPage, hints = hints + ['all:'],
+                                                 auto = globalvar.auto, removebrackets = globalvar.hintnobracket)
+ else:
+                pages = titletranslate.translate(self.originPage, hints = ['all:'],
+                                                 auto = globalvar.auto, removebrackets = globalvar.hintnobracket)
+ else:
+ pages = titletranslate.translate(self.originPage, hints=hints,
+ auto=globalvar.auto, removebrackets=globalvar.hintnobracket,
+ site=pywikibot.getSite())
+ for page in pages:
+ if globalvar.contentsondisk:
+ page = StoredPage(page)
+ self.todo.add(page)
+ self.foundIn[page] = [None]
+ if keephintedsites:
+ self.hintedsites.add(page.site)
+
+ def openSites(self):
+ """
+ Iterator. Yields (site, count) pairs:
+ * site is a site where we still have work to do on
+ * count is the number of items in that Site that need work on
+ """
+ return self.todo.siteCounts()
+
+ def whatsNextPageBatch(self, site):
+ """
+ By calling this method, you 'promise' this instance that you will
+ preload all the 'site' Pages that are in the todo list.
+
+ This routine will return a list of pages that can be treated.
+ """
+ # Bug-check: Isn't there any work still in progress? We can't work on
+ # different sites at a time!
+ if len(self.pending) > 0:
+            raise 'BUG: Can\'t start to work on %s; still working on %s' % (site, self.pending)
+ # Prepare a list of suitable pages
+ result = []
+ for page in self.todo.filter(site):
+ self.pending.add(page)
+ result.append(page)
+
+ self.todo.removeSite(site)
+
+ # If there are any, return them. Otherwise, nothing is in progress.
+ return result
+
+ def makeForcedStop(self,counter):
+ """
+ Ends work on the page before the normal end.
+ """
+ for site, count in self.todo.siteCounts():
+ counter.minus(site, count)
+ self.todo = PageTree()
+ self.forcedStop = True
+
+ def addIfNew(self, page, counter, linkingPage):
+ """
+ Adds the pagelink given to the todo list, but only if we didn't know
+ it before. If it is added, update the counter accordingly.
+
+ Also remembers where we found the page, regardless of whether it had
+ already been found before or not.
+
+ Returns True if the page is new.
+ """
+ if self.forcedStop:
+ return False
+ # cannot check backlink before we have an origin page
+ if globalvar.nobackonly and self.originPage:
+ if page == self.originPage:
+ try:
+ pywikibot.output(u"%s has a backlink from %s."
+ % (page, linkingPage))
+ except UnicodeDecodeError:
+ pywikibot.output(u"Found a backlink for a page.")
+ self.makeForcedStop(counter)
+ return False
+
+ if page in self.foundIn:
+ # not new
+ self.foundIn[page].append(linkingPage)
+ return False
+ else:
+ if globalvar.contentsondisk:
+ page = StoredPage(page)
+ self.foundIn[page] = [linkingPage]
+ self.todo.add(page)
+ counter.plus(page.site)
+ return True
+
+ def skipPage(self, page, target, counter):
+ return self.isIgnored(target) or \
+ self.namespaceMismatch(page, target, counter) or \
+ self.wiktionaryMismatch(target)
+
+ def namespaceMismatch(self, linkingPage, linkedPage, counter):
+ """
+        Checks whether or not the given page is in another namespace
+ than the origin page.
+
+ Returns True if the namespaces are different and the user
+ has selected not to follow the linked page.
+ """
+ if linkedPage in self.foundIn:
+ # We have seen this page before, don't ask again.
+ return False
+ elif self.originPage and self.originPage.namespace() != linkedPage.namespace():
+ # Allow for a mapping between different namespaces
+            crossFrom = self.originPage.site.family.crossnamespace.get(self.originPage.namespace(), {})
+            crossTo = crossFrom.get(self.originPage.site.language(), crossFrom.get('_default', {}))
+            nsmatch = crossTo.get(linkedPage.site.language(), crossTo.get('_default', []))
+ if linkedPage.namespace() in nsmatch:
+ return False
+ if globalvar.autonomous:
+                pywikibot.output(u"NOTE: Ignoring link from page %s in namespace %i to page %s in namespace %i."
+ % (linkingPage, linkingPage.namespace(),
+ linkedPage, linkedPage.namespace()))
+ # Fill up foundIn, so that we will not write this notice
+ self.foundIn[linkedPage] = [linkingPage]
+ return True
+ else:
+ preferredPage = self.getFoundInCorrectNamespace(linkedPage.site)
+ if preferredPage:
+                    pywikibot.output(u"NOTE: Ignoring link from page %s in namespace %i to page %s in namespace %i because page %s in the correct namespace has already been found."
+                                     % (linkingPage, linkingPage.namespace(), linkedPage,
+ linkedPage.namespace(), preferredPage))
+ return True
+ else:
+ choice = pywikibot.inputChoice(
+u'WARNING: %s is in namespace %i, but %s is in namespace %i. Follow it anyway?'
+ % (self.originPage, self.originPage.namespace(),
+ linkedPage, linkedPage.namespace()),
+                        ['Yes', 'No', 'Add an alternative', 'give up'],
+ ['y', 'n', 'a', 'g'])
+ if choice != 'y':
+ # Fill up foundIn, so that we will not ask again
+ self.foundIn[linkedPage] = [linkingPage]
+ if choice == 'g':
+ self.makeForcedStop(counter)
+ elif choice == 'a':
+                        newHint = pywikibot.input(u'Give the alternative for language %s, not using a language code:'
+ % linkedPage.site.language())
+ if newHint:
+                            alternativePage = pywikibot.Page(linkedPage.site, newHint)
+ if alternativePage:
+ # add the page that was entered by the user
+ self.addIfNew(alternativePage, counter, None)
+ else:
+ pywikibot.output(
+ u"NOTE: ignoring %s and its interwiki links"
+ % linkedPage)
+ return True
+ else:
+ # same namespaces, no problem
+ # or no origin page yet, also no problem
+ return False
+
+ def wiktionaryMismatch(self, page):
+ if self.originPage and globalvar.same=='wiktionary':
+ if page.title().lower() != self.originPage.title().lower():
+                pywikibot.output(u"NOTE: Ignoring %s for %s in wiktionary mode" % (page, self.originPage))
+ return True
+            elif page.title() != self.originPage.title() and self.originPage.site.nocapitalize and page.site.nocapitalize:
+                pywikibot.output(u"NOTE: Ignoring %s for %s in wiktionary mode because both languages are uncapitalized."
+ % (page, self.originPage))
+ return True
+ return False
+
+ def disambigMismatch(self, page, counter):
+ """
+        Checks whether or not the given page has a different disambiguation
+ status than the origin page.
+
+ Returns a tuple (skip, alternativePage).
+
+ skip is True if the pages have mismatching statuses and the bot
+ is either in autonomous mode, or the user chose not to use the
+ given page.
+
+ alternativePage is either None, or a page that the user has
+ chosen to use instead of the given page.
+ """
+ if not self.originPage:
+ return (False, None) # any page matches until we have an origin page
+ if globalvar.autonomous:
+ if self.originPage.isDisambig() and not page.isDisambig():
+                pywikibot.output(u"NOTE: Ignoring link from disambiguation page %s to non-disambiguation %s"
+ % (self.originPage, page))
+ return (True, None)
+ elif not self.originPage.isDisambig() and page.isDisambig():
+                pywikibot.output(u"NOTE: Ignoring link from non-disambiguation page %s to disambiguation %s"
+ % (self.originPage, page))
+ return (True, None)
+ else:
+ choice = 'y'
+ if self.originPage.isDisambig() and not page.isDisambig():
+ disambig = self.getFoundDisambig(page.site)
+ if disambig:
+ pywikibot.output(
+                        u"NOTE: Ignoring non-disambiguation page %s for %s because disambiguation page %s has already been found."
+ % (page, self.originPage, disambig))
+ return (True, None)
+ else:
+ choice = pywikibot.inputChoice(
+                        u'WARNING: %s is a disambiguation page, but %s doesn\'t seem to be one. Follow it anyway?'
+ % (self.originPage, page),
+                        ['Yes', 'No', 'Add an alternative', 'Give up'],
+ ['y', 'n', 'a', 'g'])
+ elif not self.originPage.isDisambig() and page.isDisambig():
+ nondisambig = self.getFoundNonDisambig(page.site)
+ if nondisambig:
+                    pywikibot.output(u"NOTE: Ignoring disambiguation page %s for %s because non-disambiguation page %s has already been found."
+ % (page, self.originPage, nondisambig))
+ return (True, None)
+ else:
+ choice = pywikibot.inputChoice(
+                        u'WARNING: %s doesn\'t seem to be a disambiguation page, but %s is one. Follow it anyway?'
+ % (self.originPage, page),
+                        ['Yes', 'No', 'Add an alternative', 'Give up'],
+ ['y', 'n', 'a', 'g'])
+ if choice == 'n':
+ return (True, None)
+ elif choice == 'a':
+            newHint = pywikibot.input(u'Give the alternative for language %s, not using a language code:'
+ % page.site.language())
+ alternativePage = pywikibot.Page(page.site, newHint)
+ return (True, alternativePage)
+ elif choice == 'g':
+ self.makeForcedStop(counter)
+ return (True, None)
+ # We can follow the page.
+ return (False, None)
+
+ def isIgnored(self, page):
+ if page.site.language() in globalvar.neverlink:
+            pywikibot.output(u"Skipping link %s to an ignored language" % page)
+ return True
+ if page in globalvar.ignore:
+ pywikibot.output(u"Skipping link %s to an ignored page" % page)
+ return True
+ return False
+
+ def reportInterwikilessPage(self, page):
+ if not globalvar.quiet or pywikibot.verbose:
+ pywikibot.output(u"NOTE: %s does not have any interwiki links"
+ % self.originPage)
+ if config.without_interwiki:
+ f = codecs.open(
+ pywikibot.config.datafilepath('without_interwiki.txt'),
+ 'a', 'utf-8')
+ f.write(u"# %s \n" % page)
+ f.close()
+
+ def askForHints(self, counter):
+ if not self.workonme:
+ # Do not ask hints for pages that we don't work on anyway
+ return
+ if (self.untranslated or globalvar.askhints) and not self.hintsAsked \
+ and self.originPage and self.originPage.exists() \
+ and not self.originPage.isRedirectPage() and not
self.originPage.isCategoryRedirect():
+ # Only once!
+ self.hintsAsked = True
+ if globalvar.untranslated:
+ newhint = None
+ t = globalvar.showtextlink
+ if t:
+ pywikibot.output(self.originPage.get()[:t])
+ # loop
+ while True:
+                    newhint = pywikibot.input(u'Give a hint (? to see pagetext):')
+ if newhint == '?':
+ t += globalvar.showtextlinkadd
+ pywikibot.output(self.originPage.get()[:t])
+ elif newhint and not ':' in newhint:
+                        pywikibot.output(u'Please enter a hint in the format language:pagename or type nothing if you do not have a hint.')
+ elif not newhint:
+ break
+ else:
+                        pages = titletranslate.translate(self.originPage, hints=[newhint],
+                                                         auto = globalvar.auto, removebrackets=globalvar.hintnobracket)
+ for page in pages:
+ self.addIfNew(page, counter, None)
+ if globalvar.hintsareright:
+ self.hintedsites.add(page.site)
+
+ def batchLoaded(self, counter):
+ """
+ This is called by a worker to tell us that the promised batch of
+ pages was loaded.
+ In other words, all the pages in self.pending have already
+ been preloaded.
+
+ The only argument is an instance
+ of a counter class, that has methods minus() and plus() to keep
+ counts of the total work todo.
+ """
+ # Loop over all the pages that should have been taken care of
+ for page in self.pending:
+ # Mark the page as done
+ self.done.add(page)
+
+ # make sure that none of the linked items is an auto item
+ if globalvar.skipauto:
+ dictName, year = page.autoFormat()
+ if dictName is not None:
+ if self.originPage:
+                        pywikibot.output(u'WARNING: %s:%s relates to %s:%s, which is an auto entry %s(%s)'
+                                         % (self.originPage.site.language(), self.originPage,
+ page.site.language(), page, dictName, year))
+
+ # Abort processing if the bot is running in autonomous mode.
+ if globalvar.autonomous:
+ self.makeForcedStop(counter)
+
+ # Register this fact at the todo-counter.
+ counter.minus(page.site)
+
+ # Now check whether any interwiki links should be added to the
+ # todo list.
+
+ if not page.exists():
+ globalvar.remove.append(unicode(page))
+ if not globalvar.quiet or pywikibot.verbose:
+ pywikibot.output(u"NOTE: %s does not exist. Skipping."
+ % page)
+ if page == self.originPage:
+ # The page we are working on is the page that does not exist.
+ # No use in doing any work on it in that case.
+ for site, count in self.todo.siteCounts():
+ counter.minus(site, count)
+ self.todo = PageTree()
+                    # In some rare cases it might be we already did check some 'automatic' links
+ self.done = PageTree()
+ continue
+
+ elif page.isRedirectPage() or page.isCategoryRedirect():
+ if page.isRedirectPage():
+ redir = u''
+ else:
+ redir = u'category '
+ try:
+ if page.isRedirectPage():
+ redirectTargetPage = page.getRedirectTarget()
+ else:
+ redirectTargetPage = page.getCategoryRedirectTarget()
+ except pywikibot.InvalidTitle:
+ # MW considers #redirect [[en:#foo]] as a redirect page,
+ # but we can't do anything useful with such pages
+ if not globalvar.quiet or pywikibot.verbose:
+ pywikibot.output(
+ u"NOTE: %s redirects to an invalid title" % page)
+ continue
+ if not globalvar.quiet or pywikibot.verbose:
+ pywikibot.output(u"NOTE: %s is %sredirect to %s"
+ % (page, redir, redirectTargetPage))
+ if self.originPage is None or page == self.originPage:
+                # the 1st existing page becomes the origin page, if none was supplied
+ if globalvar.initialredirect:
+ if globalvar.contentsondisk:
+ redirectTargetPage = StoredPage(redirectTargetPage)
+ # don't follow another redirect; it might be a self loop
+ if not redirectTargetPage.isRedirectPage() \
+ and not redirectTargetPage.isCategoryRedirect():
+ self.originPage = redirectTargetPage
+ self.todo.add(redirectTargetPage)
+ counter.plus(redirectTargetPage.site)
+ else:
+ # This is a redirect page to the origin. We don't need to
+ # follow the redirection.
+ # In this case we can also stop all hints!
+ for site, count in self.todo.siteCounts():
+ counter.minus(site, count)
+ self.todo = PageTree()
+ elif not globalvar.followredirect:
+ if not globalvar.quiet or pywikibot.verbose:
+ pywikibot.output(u"NOTE: not following %sredirects."
+ % redir)
+ elif page.isStaticRedirect():
+ if not globalvar.quiet or pywikibot.verbose:
+ pywikibot.output(
+                            u"NOTE: not following static %sredirects." % redir)
+ elif page.site.family == redirectTargetPage.site.family \
+ and not self.skipPage(page, redirectTargetPage, counter):
+ if self.addIfNew(redirectTargetPage, counter, page):
+ if config.interwiki_shownew or pywikibot.verbose:
+ pywikibot.output(u"%s: %s gives new %sredirect %s"
+ % (self.originPage, page, redir,
+ redirectTargetPage))
+ continue
+
+ # must be behind the page.isRedirectPage() part
+ # otherwise a redirect error would be raised
+ elif page.isEmpty() and not page.isCategory():
+ globalvar.remove.append(unicode(page))
+ if not globalvar.quiet or pywikibot.verbose:
+ pywikibot.output(u"NOTE: %s is empty. Skipping." % page)
+ if page == self.originPage:
+ for site, count in self.todo.siteCounts():
+ counter.minus(site, count)
+ self.todo = PageTree()
+ self.done = PageTree()
+ self.originPage = None
+ continue
+
+ elif page.section():
+ if not globalvar.quiet or pywikibot.verbose:
+ pywikibot.output(u"NOTE: %s is a page section. Skipping."
+ % page)
+ continue
+
+            # Page exists, isn't a redirect, and is a plain link (no section)
+ if self.originPage is None:
+                # the 1st existing page becomes the origin page, if none was supplied
+ self.originPage = page
+ try:
+ iw = page.interwiki()
+ except pywikibot.NoSuchSite:
+ if not globalvar.quiet or pywikibot.verbose:
+                    pywikibot.output(u"NOTE: site %s does not exist" % page.site())
+ continue
+
+ (skip, alternativePage) = self.disambigMismatch(page, counter)
+ if skip:
+ pywikibot.output(u"NOTE: ignoring %s and its interwiki links"
+ % page)
+ self.done.remove(page)
+ iw = ()
+ if alternativePage:
+ # add the page that was entered by the user
+ self.addIfNew(alternativePage, counter, None)
+
+ duplicate = None
+ for p in self.done.filter(page.site):
+                if p != page and p.exists() and not p.isRedirectPage() and not p.isCategoryRedirect():
+ duplicate = p
+ break
+
+ if self.originPage == page:
+ self.untranslated = (len(iw) == 0)
+ if globalvar.untranslatedonly:
+ # Ignore the interwiki links.
+ iw = ()
+ if globalvar.lacklanguage:
+ if globalvar.lacklanguage in [link.site.language() for link in iw]:
+ iw = ()
+ self.workonme = False
+ if len(iw) < globalvar.minlinks:
+ iw = ()
+ self.workonme = False
+
+ elif globalvar.autonomous and duplicate and not skip:
+                pywikibot.output(u"Stopping work on %s because duplicate pages"\
+                                 " %s and %s are found" % (self.originPage, duplicate, page))
+ self.makeForcedStop(counter)
+ try:
+ f = codecs.open(
+                        pywikibot.config.datafilepath('autonomous_problems.dat'),
+ 'a', 'utf-8')
+ f.write(u"* %s {Found more than one link for %s}"
+ % (self.originPage, page.site))
+ if config.interwiki_graph and config.interwiki_graph_url:
+                        filename = interwiki_graph.getFilename(self.originPage, extension = config.interwiki_graph_formats[0])
+                        f.write(u" [%s%s graph]" % (config.interwiki_graph_url, filename))
+ f.write("\n")
+ f.close()
+ # FIXME: What errors are we catching here?
+ # except: should be avoided!!
+ except:
+ #raise
+                    pywikibot.output(u'File autonomous_problems.dat open or corrupted! Try again with -restore.')
+ sys.exit()
+ iw = ()
+ elif page.isEmpty() and not page.isCategory():
+ globalvar.remove.append(unicode(page))
+ if not globalvar.quiet or pywikibot.verbose:
+                    pywikibot.output(u"NOTE: %s is empty; ignoring it and its interwiki links"
+ % page)
+ # Ignore the interwiki links
+ self.done.remove(page)
+ iw = ()
+
+ for linkedPage in iw:
+ if globalvar.hintsareright:
+ if linkedPage.site in self.hintedsites:
+                        pywikibot.output(u"NOTE: %s: %s extra interwiki on hinted site ignored %s"
+ % (self.originPage, page, linkedPage))
+ break
+ if not self.skipPage(page, linkedPage, counter):
+ if globalvar.followinterwiki or page == self.originPage:
+ if self.addIfNew(linkedPage, counter, page):
+ # It is new. Also verify whether it is the second on the
+ # same site
+ lpsite=linkedPage.site
+ for prevPage in self.foundIn:
+ if prevPage != linkedPage and prevPage.site == lpsite:
+                                    # Still, this could be "no problem" as either may be a
+ # redirect to the other. No way to find out quickly!
+                                    pywikibot.output(u"NOTE: %s: %s gives duplicate interwiki on same site %s"
+ % (self.originPage, page,
+ linkedPage))
+ break
+ else:
+ if config.interwiki_shownew or pywikibot.verbose:
+                                    pywikibot.output(u"%s: %s gives new interwiki %s"
+ % (self.originPage,
+ page, linkedPage))
+ if self.forcedStop:
+ break
+ # These pages are no longer 'in progress'
+ self.pending = PageTree()
+ # Check whether we need hints and the user offered to give them
+ if self.untranslated and not self.hintsAsked:
+ self.reportInterwikilessPage(page)
+ self.askForHints(counter)
+
+ def isDone(self):
+        """Return True if all the work for this subject has completed."""
+ return len(self.todo) == 0
+
+ def problem(self, txt, createneed = True):
+        """Report a problem with the resolution of this subject."""
+ pywikibot.output(u"ERROR: %s" % txt)
+ self.confirm = True
+ if createneed:
+ self.problemfound = True
+
+ def whereReport(self, page, indent=4):
+ for page2 in sorted(self.foundIn[page]):
+ if page2 is None:
+ pywikibot.output(u" "*indent + "Given as a hint.")
+ else:
+ pywikibot.output(u" "*indent + unicode(page2))
+
+
+ def assemble(self):
+ # No errors have been seen so far, except....
+ errorCount = self.problemfound
+ mysite = pywikibot.getSite()
+ # Build up a dictionary of all pages found, with the site as key.
+ # Each value will be a list of pages.
+ new = {}
+ for page in self.done:
+            if page.exists() and not page.isRedirectPage() and not page.isCategoryRedirect():
+ site = page.site
+ if site.family.interwiki_forward:
+ #TODO: allow these cases to be propagated!
+ continue # inhibit the forwarding families pages to be updated.
+ if site == self.originPage.site:
+ if page != self.originPage:
+ self.problem(u"Found link to %s" % page)
+ self.whereReport(page)
+ errorCount += 1
+ else:
+ if site in new:
+ new[site].append(page)
+ else:
+ new[site] = [page]
+ # See if new{} contains any problematic values
+ result = {}
+ for site, pages in new.iteritems():
+ if len(pages) > 1:
+ errorCount += 1
+ self.problem(u"Found more than one link for %s" % site)
+
+ if not errorCount and not globalvar.select:
+ # no errors, so all lists have only one item
+ for site, pages in new.iteritems():
+ result[site] = pages[0]
+ return result
+
+        # There are errors.
+ if config.interwiki_graph:
+ graphDrawer = interwiki_graph.GraphDrawer(self)
+ graphDrawer.createGraph()
+
+ # We don't need to continue with the rest if we're in autonomous
+ # mode.
+ if globalvar.autonomous:
+ return None
+
+ # First loop over the ones that have more solutions
+ for site, pages in new.iteritems():
+ if len(pages) > 1:
+ pywikibot.output(u"=" * 30)
+ pywikibot.output(u"Links to %s" % site)
+ i = 0
+ for page2 in pages:
+ i += 1
+ pywikibot.output(u" (%d) Found link to %s in:"
+ % (i, page2))
+ self.whereReport(page2, indent = 8)
+ while True:
+ #TODO: allow answer to repeat previous or go back after a mistake
+                    answer = pywikibot.input(u"Which variant should be used? (<number>, [n]one, [g]ive up) ").lower()
+ if answer:
+ if answer == 'g':
+ return None
+ elif answer == 'n':
+ # None acceptable
+ break
+ elif answer.isdigit():
+ answer = int(answer)
+ try:
+ result[site] = pages[answer - 1]
+ except IndexError:
+ # user input is out of range
+ pass
+ else:
+ break
+ # Loop over the ones that have one solution, so are in principle
+ # not a problem.
+ acceptall = False
+ for site, pages in new.iteritems():
+ if len(pages) == 1:
+ if not acceptall:
+ pywikibot.output(u"=" * 30)
+ page2 = pages[0]
+ pywikibot.output(u"Found link to %s in:" % page2)
+ self.whereReport(page2, indent = 4)
+ while True:
+ if acceptall:
+ answer = 'a'
+ else:
+                        #TODO: allow answer to repeat previous or go back after a mistake
+                        answer = pywikibot.inputChoice(u'What should be done?', ['accept', 'reject', 'give up', 'accept all'], ['a', 'r', 'g', 'l'], 'a')
+ if answer == 'l': # accept all
+ acceptall = True
+ answer = 'a'
+ if answer == 'a': # accept this one
+ result[site] = pages[0]
+ break
+ elif answer == 'g': # give up
+ return None
+ elif answer == 'r': # reject
+ # None acceptable
+ break
+ return result
+
+ def finish(self, bot = None):
+        """Round up the subject, making any necessary changes. This method
+ should be called exactly once after the todo list has gone empty.
+
+ This contains a shortcut: if a subject list is given in the argument
+ bot, just before submitting a page change to the live wiki it is
+ checked whether we will have to wait. If that is the case, the bot will
+ be told to make another get request first."""
+
+ #from clean_sandbox
+ def minutesDiff(time1, time2):
+ if type(time1) is long:
+ time1 = str(time1)
+ if type(time2) is long:
+ time2 = str(time2)
+ t1 = (((int(time1[0:4]) * 12 + int(time1[4:6])) * 30 +
+ int(time1[6:8])) * 24 + int(time1[8:10])) * 60 + \
+ int(time1[10:12])
+ t2 = (((int(time2[0:4]) * 12 + int(time2[4:6])) * 30 +
+ int(time2[6:8])) * 24 + int(time2[8:10])) * 60 + \
+ int(time2[10:12])
+ return abs(t2-t1)
+
+ if not self.isDone():
+ raise "Bugcheck: finish called before done"
+ if not self.workonme:
+ return
+ if self.originPage:
+ if self.originPage.isRedirectPage():
+ return
+ if self.originPage.isCategoryRedirect():
+ return
+ else:
+ return
+ if not self.untranslated and globalvar.untranslatedonly:
+ return
+ if self.forcedStop: # autonomous with problem
+ pywikibot.output(u"======Aborted processing %s======"
+ % self.originPage)
+ return
+ # The following check is not always correct and thus disabled.
+ # self.done might contain no interwiki links because of the -neverlink
+ # argument or because of disambiguation conflicts.
+# if len(self.done) == 1:
+# # No interwiki at all
+# return
+ pywikibot.output(u"======Post-processing %s======" % self.originPage)
+ # Assemble list of accepted interwiki links
+ new = self.assemble()
+ if new is None: # User said give up
+ pywikibot.output(u"======Aborted processing %s======"
+ % self.originPage)
+ return
+
+ # Make sure new contains every page link, including the page we are processing
+        # TODO: should be moved to assemble()
+ # replaceLinks will skip the site it's working on.
+ if self.originPage.site not in new:
+ #TODO: make this possible as well.
+ if not self.originPage.site.family.interwiki_forward:
+ new[self.originPage.site] = self.originPage
+
+ #self.replaceLinks(self.originPage, new, True, bot)
+
+ updatedSites = []
+ notUpdatedSites = []
+ # Process all languages here
+ globalvar.always = False
+ if globalvar.limittwo:
+ lclSite = self.originPage.site
+ lclSiteDone = False
+ frgnSiteDone = False
+
+ for siteCode in lclSite.family.languages_by_size:
+ site = pywikibot.getSite(code = siteCode)
+ if (not lclSiteDone and site == lclSite) or \
+ (not frgnSiteDone and site != lclSite and site in new):
+ if site == lclSite:
+ lclSiteDone = True # even if we fail the update
+                    if site.family.name in config.usernames and site.lang in config.usernames[site.family.name]:
+ try:
+ if self.replaceLinks(new[site], new, bot):
+ updatedSites.append(site)
+ if site != lclSite:
+ frgnSiteDone = True
+ except SaveError:
+ notUpdatedSites.append(site)
+ except GiveUpOnPage:
+ break
+ elif not globalvar.strictlimittwo and site in new \
+ and site != lclSite:
+ old={}
+ try:
+ for page in new[site].interwiki():
+ old[page.site] = page
+ except pywikibot.NoPage:
+                            pywikibot.output(u"BUG>>> %s no longer exists?"
+ % new[site])
+ continue
+ mods, mcomment, adding, removing, modifying \
+ = compareLanguages(old, new, insite = lclSite)
+ if (len(removing) > 0 and not globalvar.autonomous) or \
+ (len(modifying) > 0 and self.problemfound) or \
+ len(old) == 0 or \
+ (globalvar.needlimit and \
+ len(adding) + len(modifying) >= globalvar.needlimit +1):
+ try:
+ if self.replaceLinks(new[site], new, bot):
+ updatedSites.append(site)
+ except SaveError:
+ notUpdatedSites.append(site)
+ except pywikibot.NoUsername:
+ pass
+ except GiveUpOnPage:
+ break
+ else:
+ for (site, page) in new.iteritems():
+ # edit restriction on is-wiki
+ # http://is.wikipedia.org/wiki/Wikipediaspjall:V%C3%A9lmenni
+ # allow edits for the same conditions as -whenneeded
+ # or the last edit wasn't a bot
+ # or the last edit was 1 month ago
+ smallWikiAllowed = True
+ if globalvar.autonomous and page.site.sitename() == 'wikipedia:is':
+ old={}
+ try:
+ for mypage in new[page.site].interwiki():
+ old[mypage.site] = mypage
+ except pywikibot.NoPage:
+ pywikibot.output(u"BUG>>> %s no longer
exists?"
+ % new[site])
+ continue
+ mods, mcomment, adding, removing, modifying \
+ = compareLanguages(old, new, insite=site)
+ #cannot create userlib.User with IP
+ smallWikiAllowed = page.isIpEdit() or \
+ len(removing) > 0 or len(old) == 0 or \
+ len(adding) + len(modifying) > 2 or \
+ len(removing) + len(modifying) == 0 and \
+ adding == [page.site]
+ if not smallWikiAllowed:
+ import userlib
+ user = userlib.User(page.site, page.userName())
+ if not 'bot' in user.groups() \
+ and not 'bot' in page.userName().lower(): # for now, also skip user names containing "bot"
+ smallWikiAllowed = True
+ else:
+ diff = minutesDiff(page.editTime(),
+ time.strftime("%Y%m%d%H%M%S",
+ time.gmtime()))
+ if diff > 30*24*60:
+ smallWikiAllowed = True
+ else:
+ pywikibot.output(
+u'NOTE: number of edits are restricted at %s'
+ % page.site.sitename())
+
+ # if we have an account for this site
+ if site.family.name in config.usernames \
+ and site.lang in config.usernames[site.family.name] \
+ and smallWikiAllowed:
+ # Try to do the changes
+ try:
+ if self.replaceLinks(page, new, bot):
+ # Page was changed
+ updatedSites.append(site)
+ except SaveError:
+ notUpdatedSites.append(site)
+ except GiveUpOnPage:
+ break
+
+ # disabled graph drawing for minor problems: it just takes too long
+ #if notUpdatedSites != [] and config.interwiki_graph:
+ # # at least one site was not updated, save a conflict graph
+ # self.createGraph()
+
+ # don't report backlinks for pages we already changed
+ if config.interwiki_backlink:
+ self.reportBacklinks(new, updatedSites)
+
+ def clean(self):
+ """
+ Delete the contents that are stored on disk for this Subject.
+
+ We cannot afford to define this in a StoredPage destructor because
+ StoredPage instances can get referenced cyclically: that would stop the
+ garbage collector from destroying some of those objects.
+
+ It is also not necessary to make this a Subject destructor: deleting
+ all stored content entry by entry when bailing out after a
+ KeyboardInterrupt, for example, would be redundant, because the whole
+ storage file will eventually be removed.
+ """
+ if globalvar.contentsondisk:
+ for page in self.foundIn:
+ # foundIn can contain either Page or StoredPage objects
+ # calling the destructor on _contents will delete the
+ # disk records if necessary
+ if hasattr(page, '_contents'):
+ del page._contents
+
+ def replaceLinks(self, page, newPages, bot):
+ """
+ Returns True if saving was successful.
+ """
+ if globalvar.localonly:
+ # In this case only continue on the Page we started with
+ if page != self.originPage:
+ raise SaveError(u'-localonly and page != originPage')
+ if page.section():
+ # This is not a page, but a subpage. Do not edit it.
+ pywikibot.output(u"Not editing %s: not doing interwiki on
subpages"
+ % page)
+ raise SaveError(u'Link has a #section')
+ try:
+ pagetext = page.get()
+ except pywikibot.NoPage:
+ pywikibot.output(u"Not editing %s: page does not exist" % page)
+ raise SaveError(u'Page doesn\'t exist')
+ if page.isEmpty() and not page.isCategory():
+ pywikibot.output(u"Not editing %s: page is empty" % page)
+ raise SaveError
+
+ # clone original newPages dictionary, so that we can modify it to the
+ # local page's needs
+ new = dict(newPages)
+ interwikis = page.interwiki()
+
+ # remove interwiki links to ignore
+ for iw in re.finditer('<!-- *\[\[(.*?:.*?)\]\] *-->', pagetext):
+ try:
+ ignorepage = pywikibot.Page(page.site, iw.groups()[0])
+ except (pywikibot.NoSuchSite, pywikibot.InvalidTitle):
+ continue
+ try:
+ if (new[ignorepage.site] == ignorepage) and \
+ (ignorepage.site != page.site):
+ if (ignorepage not in interwikis):
+ pywikibot.output(
+ u"Ignoring link to %(to)s for %(from)s"
+ % {'to': ignorepage,
+ 'from': page})
+ new.pop(ignorepage.site)
+ else:
+ pywikibot.output(
+ u"NOTE: Not removing interwiki from %(from)s to %(to)s
(exists both commented and non-commented)"
+ % {'to': ignorepage,
+ 'from': page})
+ except KeyError:
+ pass
+
+ # sanity check - the page we are fixing must be the only one for that
+ # site.
+ pltmp = new[page.site]
+ if pltmp != page:
+ s = u"None"
+ if pltmp is not None: s = pltmp
+ pywikibot.output(
+ u"BUG>>> %s is not in the list of new links! Found
%s."
+ % (page, s))
+ raise SaveError(u'BUG: sanity check failed')
+
+ # Avoid adding an iw link back to itself
+ del new[page.site]
+ # Do not add interwiki links to foreign families that page.site() does not forward to
+ for stmp in new.keys():
+ if stmp.family != page.site.family:
+ if stmp.family.name != page.site.family.interwiki_forward:
+ del new[stmp]
+
+ # Put interwiki links into a map
+ old={}
+ for page2 in interwikis:
+ old[page2.site] = page2
+
+ # Check what needs to get done
+ mods, mcomment, adding, removing, modifying = compareLanguages(old,
+ new,
+ insite=page.site)
+
+ # When running in autonomous mode without -force switch, make sure we
+ # don't remove any items, but allow addition of the new ones
+ if globalvar.autonomous and (not globalvar.force or
+ pywikibot.unicode_error
+ ) and len(removing) > 0:
+ for rmsite in removing:
+ # Sometimes sites have an erroneous link to itself as an
+ # interwiki
+ if rmsite == page.site:
+ continue
+ rmPage = old[rmsite]
+ #put it to new means don't delete it
+ if not globalvar.cleanup and not globalvar.force or \
+ globalvar.cleanup and \
+ unicode(rmPage) not in globalvar.remove or \
+ rmPage.site.lang in ['hak', 'hi', 'cdo'] and \
+ pywikibot.unicode_error: # work-around for bug #3081100 (do not remove affected pages)
+ new[rmsite] = rmPage
+ pywikibot.output(
+ u"WARNING: %s is either deleted or has a mismatching
disambiguation state."
+ % rmPage)
+ # Re-Check what needs to get done
+ mods, mcomment, adding, removing, modifying = compareLanguages(old,
+ new,
+ insite=page.site)
+ if not mods:
+ if not globalvar.quiet or pywikibot.verbose:
+ pywikibot.output(u'No changes needed on page %s' % page)
+ return False
+
+ # Show a message in purple.
+ pywikibot.output(
+ u"\03{lightpurple}Updating links on page %s.\03{default}" % page)
+ pywikibot.output(u"Changes to be made: %s" % mods)
+ oldtext = page.get()
+ template = (page.namespace() == 10)
+ newtext = pywikibot.replaceLanguageLinks(oldtext, new,
+ site=page.site,
+ template=template)
+ # This is for now. Later there should be different functions for each
+ # kind
+ if not botMayEdit(page):
+ if template:
+ pywikibot.output(
+ u'SKIPPING: %s should have interwiki links on subpage.'
+ % page)
+ else:
+ pywikibot.output(
+ u'SKIPPING: %s is under construction or to be deleted.'
+ % page)
+ return False
+ if newtext == oldtext:
+ return False
+ pywikibot.showDiff(oldtext, newtext)
+
+ # pywikibot.output(u"NOTE: Replace %s" % page)
+ # Determine whether we need permission to submit
+ ask = False
+
+ # Allow for special case of a self-pointing interwiki link
+ if removing and removing != [page.site]:
+ self.problem(u'Found incorrect link to %s in %s'
+ % (", ".join([x.lang for x in removing]), page),
+ createneed=False)
+ if pywikibot.unicode_error:
+ for x in removing:
+ if x.lang in ['hi', 'cdo']:
+ pywikibot.output(
+u'\03{lightred}WARNING: This may be false positive due to unicode bug #3081100\03{default}')
+ break
+ ask = True
+ if globalvar.force or globalvar.cleanup:
+ ask = False
+ if globalvar.confirm and not globalvar.always:
+ ask = True
+ # If we need to ask, do so
+ if ask:
+ if globalvar.autonomous:
+ # If we cannot ask, deny permission
+ answer = 'n'
+ else:
+ answer = pywikibot.inputChoice(u'Submit?',
+ ['Yes', 'No', 'open in Browser',
+ 'Give up', 'Always'],
+ ['y', 'n', 'b', 'g', 'a'])
+ if answer == 'b':
+ webbrowser.open("http://%s%s" % (
+ page.site.hostname(),
+ page.site.nice_get_address(page.title())
+ ))
+ pywikibot.input(u"Press Enter when finished in browser.")
+ return True
+ elif answer == 'a':
+ # don't ask for the rest of this subject
+ globalvar.always = True
+ answer = 'y'
+ else:
+ # If we do not need to ask, allow
+ answer = 'y'
+ # If we got permission to submit, do so
+ if answer == 'y':
+ # Check whether we will have to wait for pywikibot. If so, make
+ # another get-query first.
+ if bot:
+ while pywikibot.get_throttle.waittime() + 2.0 < pywikibot.put_throttle.waittime():
+ if not globalvar.quiet or pywikibot.verbose:
+ pywikibot.output(
+ u"NOTE: Performing a recursive query first to save
time....")
+ qdone = bot.oneQuery()
+ if not qdone:
+ # Nothing more to do
+ break
+ if not globalvar.quiet or pywikibot.verbose:
+ pywikibot.output(u"NOTE: Updating live wiki...")
+ timeout=60
+ while True:
+ try:
+ if globalvar.async:
+ page.put_async(newtext, comment=mcomment)
+ status = 302
+ else:
+ status, reason, data = page.put(newtext, comment=mcomment)
+ except pywikibot.LockedPage:
+ pywikibot.output(u'Page %s is locked. Skipping.' % page)
+ raise SaveError(u'Locked')
+ except pywikibot.EditConflict:
+ pywikibot.output(
+ u'ERROR putting page: An edit conflict occurred. Giving up.')
+ raise SaveError(u'Edit conflict')
+ except (pywikibot.SpamfilterError), error:
+ pywikibot.output(
+ u'ERROR putting page: %s blacklisted by spamfilter. Giving up.'
+ % (error.url,))
+ raise SaveError(u'Spam filter')
+ except (pywikibot.PageNotSaved), error:
+ pywikibot.output(u'ERROR putting page: %s' % (error.args,))
+ raise SaveError(u'PageNotSaved')
+ except (socket.error, IOError), error:
+ if timeout>3600:
+ raise
+ pywikibot.output(u'ERROR putting page: %s' % (error.args,))
+ pywikibot.output(u'Sleeping %i seconds before trying again.'
+ % (timeout,))
+ timeout *= 2
+ time.sleep(timeout)
+ except pywikibot.ServerError:
+ if timeout > 3600:
+ raise
+ pywikibot.output(u'ERROR putting page: ServerError.')
+ pywikibot.output(u'Sleeping %i seconds before trying again.'
+ % (timeout,))
+ timeout *= 2
+ time.sleep(timeout)
+ else:
+ break
+ if str(status) == '302':
+ return True
+ else:
+ pywikibot.output(u'%s %s' % (status, reason))
+ return False
+ elif answer == 'g':
+ raise GiveUpOnPage(u'User asked us to give up')
+ else:
+ raise LinkMustBeRemoved(u'Found incorrect link to %s in %s'
+ % (", ".join([x.lang for x in removing]),
+ page))
+
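The throttle check near the top of the submit branch above (comparing get_throttle and put_throttle wait times) is the shortcut described in finish()'s docstring: while the bot would otherwise sit idle waiting for write permission, it spends that time on one more read query. A minimal sketch of the idea, assuming hypothetical throttle objects with a waittime() method and a one_query() callback in the spirit of InterwikiBot.oneQuery():

    def use_idle_time(get_throttle, put_throttle, one_query):
        # While a read would finish well before the next write slot opens,
        # run another preloading query instead of sleeping.
        while get_throttle.waittime() + 2.0 < put_throttle.waittime():
            if not one_query():
                break   # nothing left to preload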
+ def reportBacklinks(self, new, updatedSites):
+ """
+ Report missing back links. This will be called from finish() if needed.
+
+ updatedSites is a list that contains all sites we changed, to avoid
+ reporting of missing backlinks for pages we already fixed
+
+ """
+ # use sets because searching an element is faster than in lists
+ expectedPages = set(new.itervalues())
+ expectedSites = set(new)
+ try:
+ for site in expectedSites - set(updatedSites):
+ page = new[site]
+ if not page.section():
+ try:
+ linkedPages = set(page.interwiki())
+ except pywikibot.NoPage:
+ pywikibot.output(u"WARNING: Page %s does no longer
exist?!" % page)
+ break
+ # To speed things up, create a dictionary which maps sites to pages.
+ # This assumes that there is only one interwiki link per language.
+ linkedPagesDict = {}
+ for linkedPage in linkedPages:
+ linkedPagesDict[linkedPage.site] = linkedPage
+ for expectedPage in expectedPages - linkedPages:
+ if expectedPage != page:
+ try:
+ linkedPage = linkedPagesDict[expectedPage.site]
+ pywikibot.output(
+ u"WARNING: %s: %s does not link to %s but to
%s"
+ % (page.site.family.name,
+ page, expectedPage, linkedPage))
+ except KeyError:
+ pywikibot.output(
+ u"WARNING: %s: %s does not link to %s"
+ % (page.site.family.name,
+ page, expectedPage))
+ # Check for superfluous links
+ for linkedPage in linkedPages:
+ if linkedPage not in expectedPages:
+ # Check whether there is an alternative page on that language.
+ # In this case, it was already reported above.
+ if linkedPage.site not in expectedSites:
+ pywikibot.output(
+ u"WARNING: %s: %s links to incorrect %s"
+ % (page.site.family.name,
+ page, linkedPage))
+ except (socket.error, IOError):
+ pywikibot.output(u'ERROR: could not report backlinks')
+
+class InterwikiBot(object):
+ """A class keeping track of a list of subjects, controlling which
pages
+ are queried from which languages when."""
+
+ def __init__(self):
+ """Constructor. We always start with empty
lists."""
+ self.subjects = []
+ # We count how many pages still need to be loaded per site.
+ # This allows us to find out from which site to retrieve pages next
+ # in a way that saves bandwidth.
+ # sites are keys, integers are values.
+ # Modify this only via plus() and minus()!
+ self.counts = {}
+ self.pageGenerator = None
+ self.generated = 0
+
+ def add(self, page, hints = None):
+ """Add a single subject to the list"""
+ subj = Subject(page, hints = hints)
+ self.subjects.append(subj)
+ for site, count in subj.openSites():
+ # Keep correct counters
+ self.plus(site, count)
+
+ def setPageGenerator(self, pageGenerator, number = None, until = None):
+ """Add a generator of subjects. Once the list of subjects gets
+ too small, this generator is called to produce more Pages"""
+ self.pageGenerator = pageGenerator
+ self.generateNumber = number
+ self.generateUntil = until
+
+ def dump(self, append = True):
+ site = pywikibot.getSite()
+ dumpfn = pywikibot.config.datafilepath(
+ 'interwiki-dumps',
+ 'interwikidump-%s-%s.txt' % (site.family.name, site.lang))
+ if append: mode = 'appended'
+ else: mode = 'written'
+ f = codecs.open(dumpfn, mode[0], 'utf-8')
+ for subj in self.subjects:
+ if subj.originPage:
+ f.write(subj.originPage.title(asLink=True)+'\n')
+ f.close()
+ pywikibot.output(u'Dump %s (%s) %s.' % (site.lang, site.family.name, mode))
+ return dumpfn
+
+ def generateMore(self, number):
+ """Generate more subjects. This is called internally when the
+ list of subjects becomes too small, but only if there is a
+ PageGenerator"""
+ fs = self.firstSubject()
+ if fs and (not globalvar.quiet or pywikibot.verbose):
+ pywikibot.output(u"NOTE: The first unfinished subject is %s"
+ % fs.originPage)
+ pywikibot.output(u"NOTE: Number of pages queued is %d, trying to add %d
more."
+ % (len(self.subjects), number))
+ for i in xrange(number):
+ try:
+ while True:
+ try:
+ page = self.pageGenerator.next()
+ except IOError:
+ pywikibot.output(u'IOError occurred; skipping')
+ continue
+ if page in globalvar.skip:
+ pywikibot.output(u'Skipping: %s is in the skip list' % page)
+ continue
+ if globalvar.skipauto:
+ dictName, year = page.autoFormat()
+ if dictName is not None:
+ pywikibot.output(u'Skipping: %s is an auto entry %s(%s)' % (page, dictName, year))
+ continue
+ if globalvar.parenthesesonly:
+ # Only yield pages that have ( ) in titles
+ if "(" not in page.title():
+ continue
+ if page.isTalkPage():
+ pywikibot.output(u'Skipping: %s is a talk page' % page)
+ continue
+ #doesn't work: page must be preloaded for this test
+ #if page.isEmpty():
+ #    pywikibot.output(u'Skipping: %s is a empty page' % page.title())
+ # continue
+ if page.namespace() == 10:
+ loc = None
+ try:
+ tmpl, loc = moved_links[page.site.lang]
+ del tmpl
+ except KeyError:
+ pass
+ if loc is not None and loc in page.title():
+ pywikibot.output(u'Skipping: %s is a templates subpage' % page.title())
+ continue
+ break
+
+ if self.generateUntil:
+ until = self.generateUntil
+ if page.site.lang not in page.site.family.nocapitalize:
+ until = until[0].upper()+until[1:]
+ if page.title(withNamespace=False) > until:
+ raise StopIteration
+ self.add(page, hints = globalvar.hints)
+ self.generated += 1
+ if self.generateNumber:
+ if self.generated >= self.generateNumber:
+ raise StopIteration
+ except StopIteration:
+ self.pageGenerator = None
+ break
+
+ def firstSubject(self):
+ """Return the first subject that is still being worked
on"""
+ if self.subjects:
+ return self.subjects[0]
+
+ def maxOpenSite(self):
+ """Return the site that has the most
+ open queries plus the number. If there is nothing left, return
+ None. Only languages that are TODO for the first Subject
+ are returned."""
+ max = 0
+ maxlang = None
+ if not self.firstSubject():
+ return None
+ oc = dict(self.firstSubject().openSites())
+ if not oc:
+ # The first subject is done. This might be a recursive call made because we
+ # have to wait before submitting another modification to go live. Select
+ # any language from counts.
+ oc = self.counts
+ if pywikibot.getSite() in oc:
+ return pywikibot.getSite()
+ for lang in oc:
+ count = self.counts[lang]
+ if count > max:
+ max = count
+ maxlang = lang
+ return maxlang
+
+ def selectQuerySite(self):
+ """Select the site the next query should go out
for."""
+ # How many home-language queries do we still have?
+ mycount = self.counts.get(pywikibot.getSite(), 0)
+ # Do we still have enough subjects to work on for which the
+ # home language has been retrieved? This is rough, because
+ # some subjects may need to retrieve a second home-language page!
+ if len(self.subjects) - mycount < globalvar.minsubjects:
+ # Can we make more home-language queries by adding subjects?
+ if self.pageGenerator and mycount < globalvar.maxquerysize:
+ timeout = 60
+ while timeout<3600:
+ try:
+ self.generateMore(globalvar.maxquerysize - mycount)
+ except pywikibot.ServerError:
+ # Could not extract allpages special page?
+ pywikibot.output(u'ERROR: could not retrieve more pages. Will try again in %d seconds' % timeout)
+ time.sleep(timeout)
+ timeout *= 2
+ else:
+ break
+ # If we have a few, getting the home language is a good thing.
+ if not globalvar.restoreAll:
+ try:
+ if self.counts[pywikibot.getSite()] > 4:
+ return pywikibot.getSite()
+ except KeyError:
+ pass
+ # If getting the home language doesn't make sense, see how many
+ # foreign page queries we can find.
+ return self.maxOpenSite()
+
+ def oneQuery(self):
+ """
+ Perform one step in the solution process.
+
+ Returns True if pages could be preloaded, or false
+ otherwise.
+ """
+ # First find the best language to work on
+ site = self.selectQuerySite()
+ if site is None:
+ pywikibot.output(u"NOTE: Nothing left to do")
+ return False
+ # Now assemble a reasonable list of pages to get
+ subjectGroup = []
+ pageGroup = []
+ for subject in self.subjects:
+ # Promise the subject that we will work on the site.
+ # We will get a list of pages we can do.
+ pages = subject.whatsNextPageBatch(site)
+ if pages:
+ pageGroup.extend(pages)
+ subjectGroup.append(subject)
+ if len(pageGroup) >= globalvar.maxquerysize:
+ # We have found enough pages to fill the bandwidth.
+ break
+ if len(pageGroup) == 0:
+ pywikibot.output(u"NOTE: Nothing left to do 2")
+ return False
+ # Get the content of the assembled list in one blow
+ gen = pagegenerators.PreloadingGenerator(iter(pageGroup))
+ for page in gen:
+ # we don't want to do anything with them now. The
+ # page contents will be read via the Subject class.
+ pass
+ # Tell all of the subjects that the promised work is done
+ for subject in subjectGroup:
+ subject.batchLoaded(self)
+ return True
+
+ def queryStep(self):
+ self.oneQuery()
+ # Delete the ones that are done now.
+ for i in xrange(len(self.subjects)-1, -1, -1):
+ subj = self.subjects[i]
+ if subj.isDone():
+ subj.finish(self)
+ subj.clean()
+ del self.subjects[i]
+
+ def isDone(self):
+ """Check whether there is still more work to do"""
+ return len(self) == 0 and self.pageGenerator is None
+
+ def plus(self, site, count=1):
+ """This is a routine that the Subject class expects in a
counter"""
+ try:
+ self.counts[site] += count
+ except KeyError:
+ self.counts[site] = count
+
+ def minus(self, site, count=1):
+ """This is a routine that the Subject class expects in a
counter"""
+ self.counts[site] -= count
+
+ def run(self):
+ """Start the process until finished"""
+ while not self.isDone():
+ self.queryStep()
+
+ def __len__(self):
+ return len(self.subjects)
+
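The scheduling in InterwikiBot boils down to a per-site counter of pages still to be fetched (self.counts, maintained only through plus() and minus()), from which maxOpenSite() picks the busiest site. A stripped-down sketch of that bookkeeping, with language codes standing in for Site objects:

    counts = {}

    def plus(site, count=1):
        counts[site] = counts.get(site, 0) + count

    def minus(site, count=1):
        counts[site] -= count

    plus('en', 5); plus('de', 2); minus('en', 1)
    # pick the site with the most pending pages ('en' with 4 here)
    busiest, best = None, 0
    for site, count in counts.items():
        if count > best:
            busiest, best = site, count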
+def compareLanguages(old, new, insite):
+
+ oldiw = set(old)
+ newiw = set(new)
+
+ # sort by language code
+ adding = sorted(newiw - oldiw)
+ removing = sorted(oldiw - newiw)
+ modifying = sorted(site for site in oldiw & newiw if old[site] != new[site])
+
+ if not globalvar.summary and \
+ len(adding) + len(removing) + len(modifying) <= 3:
+ # Use an extended format for the string linking to all added pages.
+ fmt = lambda d, site: unicode(d[site])
+ else:
+ # Use short format, just the language code
+ fmt = lambda d, site: site.lang
+
+ mods = mcomment = u''
+
+ commentname = 'interwiki'
+ if adding:
+ commentname += '-adding'
+ if removing:
+ commentname += '-removing'
+ if modifying:
+ commentname += '-modifying'
+
+ if adding or removing or modifying:
+ #Version info marks bots without unicode error
+ #This also prevents abuse filter blocking on de-wiki
+ if not pywikibot.unicode_error:
+ mcomment += u'r%s) (' % sys.version.split()[0]
+
+ mcomment += globalvar.summary
+
+ changes = {'adding': ', '.join([fmt(new, x) for x in adding]),
+ 'removing': ', '.join([fmt(old, x) for x in removing]),
+ 'modifying': ', '.join([fmt(new, x) for x in modifying])}
+
+ mcomment += i18n.twtranslate(insite.lang, commentname) % changes
+ mods = i18n.twtranslate('en', commentname) % changes
+
+ return mods, mcomment, adding, removing, modifying
+
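compareLanguages() classifies the interwiki sites into added, removed and modified sets by comparing the old and new mappings, and builds the edit summary from those sets. A small illustration of the set arithmetic using plain dicts keyed by language code (the real code keys on Site objects and maps to Page objects):

    old = {'de': 'Foo', 'fr': 'Foo'}
    new = {'de': 'Foo', 'nl': 'Foo', 'fr': 'Foo (Bar)'}
    adding    = sorted(set(new) - set(old))                  # ['nl']
    removing  = sorted(set(old) - set(new))                  # []
    modifying = sorted(k for k in set(old) & set(new)
                       if old[k] != new[k])                  # ['fr']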
+def botMayEdit (page):
+ tmpl = []
+ try:
+ tmpl, loc = moved_links[page.site.lang]
+ except KeyError:
+ pass
+ if type(tmpl) != list:
+ tmpl = [tmpl]
+ try:
+ tmpl += ignoreTemplates[page.site.lang]
+ except KeyError:
+ pass
+ tmpl += ignoreTemplates['_default']
+ if tmpl != []:
+ templates = page.templatesWithParams(get_redirect=True)
+ for template in templates:
+ if template[0].lower() in tmpl:
+ return False
+ return True
+
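botMayEdit() refuses to touch a page that transcludes one of the per-language templates collected from moved_links and ignoreTemplates. A reduced sketch of the same check, with plain lower-cased name lists standing in for those lookups:

    def bot_may_edit(page_template_names, ignore_template_names):
        # page_template_names: lower-cased template names found on the page
        # ignore_template_names: lower-cased names that forbid bot edits
        for name in page_template_names:
            if name in ignore_template_names:
                return False
        return True

    # bot_may_edit(['infobox person', 'inuse'], ['inuse', 'delete']) -> False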
+def readWarnfile(filename, bot):
+ import warnfile
+ reader = warnfile.WarnfileReader(filename)
+ # we won't use removeHints
+ (hints, removeHints) = reader.getHints()
+ for page, pagelist in hints.iteritems():
+ # The WarnfileReader gives us a list of pagelinks, but titletranslate.py expects a list of strings, so we convert it back.
+ # TODO: This is quite an ugly hack; in the future we should maybe make titletranslate expect a list of pagelinks.
+ hintStrings = ['%s:%s' % (hintedPage.site.language(), hintedPage.title()) for hintedPage in pagelist]
+ bot.add(page, hints = hintStrings)
+
+def main():
+ singlePageTitle = []
+ opthintsonly = False
+ start = None
+ # Which namespaces should be processed?
+ # default to [] which means all namespaces will be processed
+ namespaces = []
+ number = None
+ until = None
+ warnfile = None
+ # a normal PageGenerator (which doesn't give hints, only Pages)
+ hintlessPageGen = None
+ optContinue = False
+ optRestore = False
+ restoredFiles = []
+ File2Restore = []
+ dumpFileName = ''
+ append = True
+ newPages = None
+ # This factory is responsible for processing command line arguments
+ # that are also used by other scripts and that determine which pages
+ # to work on.
+ genFactory = pagegenerators.GeneratorFactory()
+
+ for arg in pywikibot.handleArgs():
+ if globalvar.readOptions(arg):
+ continue
+ elif arg.startswith('-warnfile:'):
+ warnfile = arg[10:]
+ elif arg.startswith('-years'):
+ # Look if user gave a specific year at which to start
+ # Must be a natural number or negative integer.
+ if len(arg) > 7 and (arg[7:].isdigit() or (arg[7] == "-" and
arg[8:].isdigit())):
+ startyear = int(arg[7:])
+ else:
+ startyear = 1
+ # avoid problems where year pages link to centuries etc.
+ globalvar.followredirect = False
+ hintlessPageGen = pagegenerators.YearPageGenerator(startyear)
+ elif arg.startswith('-days'):
+ if len(arg) > 6 and arg[5] == ':' and arg[6:].isdigit():
+ # Looks as if the user gave a specific month at which to start
+ # Must be a natural number.
+ startMonth = int(arg[6:])
+ else:
+ startMonth = 1
+ hintlessPageGen = pagegenerators.DayPageGenerator(startMonth)
+ elif arg.startswith('-new'):
+ if len(arg) > 5 and arg[4] == ':' and arg[5:].isdigit():
+ # Looks as if the user gave a specific number of pages
+ newPages = int(arg[5:])
+ else:
+ newPages = 100
+ elif arg.startswith('-restore'):
+ globalvar.restoreAll = arg[9:].lower() == 'all'
+ optRestore = not globalvar.restoreAll
+ elif arg == '-continue':
+ optContinue = True
+ elif arg == '-hintsonly':
+ opthintsonly = True
+ elif arg.startswith('-namespace:'):
+ try:
+ namespaces.append(int(arg[11:]))
+ except ValueError:
+ namespaces.append(arg[11:])
+ # deprecated for consistency with other scripts
+ elif arg.startswith('-number:'):
+ number = int(arg[8:])
+ elif arg.startswith('-until:'):
+ until = arg[7:]
+ else:
+ if not genFactory.handleArg(arg):
+ singlePageTitle.append(arg)
+
+ # Do not use additional summary with autonomous mode
+ if globalvar.autonomous:
+ globalvar.summary = u''
+ elif globalvar.summary:
+ globalvar.summary += u'; '
+
+ # ensure that we don't try to change main page
+ try:
+ site = pywikibot.getSite()
+ try:
+ mainpagename = site.siteinfo()['mainpage']
+ except TypeError: #pywikibot module handle
+ mainpagename = site.siteinfo['mainpage']
+ globalvar.skip.add(pywikibot.Page(site, mainpagename))
+ except pywikibot.Error:
+ pywikibot.output(u'Missing main page name')
+
+ if newPages is not None:
+ if len(namespaces) == 0:
+ ns = 0
+ elif len(namespaces) == 1:
+ ns = namespaces[0]
+ if ns != 'all':
+ if isinstance(ns, unicode) or isinstance(ns, str):
+ index = site.getNamespaceIndex(ns)
+ if index is None:
+ raise ValueError(u'Unknown namespace: %s' % ns)
+ ns = index
+ namespaces = []
+ else:
+ ns = 'all'
+ hintlessPageGen = pagegenerators.NewpagesPageGenerator(newPages, namespace=ns)
+
+ elif optRestore or optContinue or globalvar.restoreAll:
+ site = pywikibot.getSite()
+ if globalvar.restoreAll:
+ import glob
+ for FileName in glob.iglob('interwiki-dumps/interwikidump-*.txt'):
+ s = FileName.split('\\')[1].split('.')[0].split('-')
+ sitename = s[1]
+ for i in xrange(0,2):
+ s.remove(s[0])
+ sitelang = '-'.join(s)
+ if site.family.name == sitename:
+ File2Restore.append([sitename, sitelang])
+ else:
+ File2Restore.append([site.family.name, site.lang])
+ for sitename, sitelang in File2Restore:
+ dumpfn = pywikibot.config.datafilepath(
+ 'interwiki-dumps',
+ u'interwikidump-%s-%s.txt'
+ % (sitename, sitelang))
+ pywikibot.output(u'Reading interwikidump-%s-%s.txt' % (sitename, sitelang))
+ site = pywikibot.getSite(sitelang, sitename)
+ if not hintlessPageGen:
+ hintlessPageGen = pagegenerators.TextfilePageGenerator(dumpfn, site)
+ else:
+ hintlessPageGen = pagegenerators.CombinedPageGenerator([hintlessPageGen, pagegenerators.TextfilePageGenerator(dumpfn, site)])
+ restoredFiles.append(dumpfn)
+ if hintlessPageGen:
+ hintlessPageGen = pagegenerators.DuplicateFilterPageGenerator(hintlessPageGen)
+ if optContinue:
+ # We waste this generator to find out the last page's title
+ # This is an ugly workaround.
+ nextPage = "!"
+ namespace = 0
+ searchGen = pagegenerators.TextfilePageGenerator(dumpfn, site)
+ for page in searchGen:
+ lastPage = page.title(withNamespace=False)
+ if lastPage > nextPage:
+ nextPage = lastPage
+ namespace = page.namespace()
+ if nextPage == "!":
+ pywikibot.output(u"Dump file is empty?! Starting at the
beginning.")
+ else:
+ nextPage += '!'
+ hintlessPageGen = pagegenerators.CombinedPageGenerator([hintlessPageGen, pagegenerators.AllpagesPageGenerator(nextPage, namespace, includeredirects = False)])
+ if not hintlessPageGen:
+ pywikibot.output(u'No Dumpfiles found.')
+ return
+
+ bot = InterwikiBot()
+
+ if not hintlessPageGen:
+ hintlessPageGen = genFactory.getCombinedGenerator()
+ if hintlessPageGen:
+ if len(namespaces) > 0:
+ hintlessPageGen = pagegenerators.NamespaceFilterPageGenerator(hintlessPageGen, namespaces)
+ # we'll use iter() to make a next() function available.
+ bot.setPageGenerator(iter(hintlessPageGen), number = number, until=until)
+ elif warnfile:
+ # TODO: filter namespaces if -namespace parameter was used
+ readWarnfile(warnfile, bot)
+ else:
+ singlePageTitle = ' '.join(singlePageTitle)
+ if not singlePageTitle and not opthintsonly:
+ singlePageTitle = pywikibot.input(u'Which page to check:')
+ if singlePageTitle:
+ singlePage = pywikibot.Page(pywikibot.getSite(), singlePageTitle)
+ else:
+ singlePage = None
+ bot.add(singlePage, hints = globalvar.hints)
+
+ try:
+ try:
+ append = not (optRestore or optContinue or globalvar.restoreAll)
+ bot.run()
+ except KeyboardInterrupt:
+ dumpFileName = bot.dump(append)
+ except:
+ dumpFileName = bot.dump(append)
+ raise
+ finally:
+ if globalvar.contentsondisk:
+ StoredPage.SPdeleteStore()
+ if dumpFileName:
+ try:
+ restoredFiles.remove(dumpFileName)
+ except ValueError:
+ pass
+ for dumpFileName in restoredFiles:
+ try:
+ os.remove(dumpFileName)
+ pywikibot.output(u'Dumpfile %s deleted' % dumpFileName.split('\\')[-1])
+ except WindowsError:
+ pass
+
+#===========
+globalvar=Global()
+
+if __name__ == "__main__":
+ try:
+ main()
+ finally:
+ pywikibot.stopme()
Copied: archive/old python 2.3 scripts/wikipedia.py (from rev 10463, trunk/pywikipedia/wikipedia.py)
===================================================================
--- archive/old python 2.3 scripts/wikipedia.py (rev 0)
+++ archive/old python 2.3 scripts/wikipedia.py 2012-09-16 13:48:36 UTC (rev 10528)
@@ -0,0 +1,8639 @@
+# -*- coding: utf-8 -*-
+"""
+Library to get and put pages on a MediaWiki.
+
+Contents of the library (objects and functions to be used outside)
+
+Classes:
+ Page(site, title): A page on a MediaWiki site
+ ImagePage(site, title): An image descriptor Page
+ Site(lang, fam): A MediaWiki site
+
+Factory functions:
+ Family(name): Import the named family
+ getSite(lang, fam): Return a Site instance
+
+Exceptions:
+ Error: Base class for all exceptions in this module
+ NoUsername: Username is not in user-config.py
+ NoPage: Page does not exist on the wiki
+ NoSuchSite: Site does not exist
+ IsRedirectPage: Page is a redirect page
+ IsNotRedirectPage: Page is not a redirect page
+ LockedPage: Page is locked
+ SectionError: The section specified in the Page title does not exist
+ PageNotSaved: Saving the page has failed
+ EditConflict: PageNotSaved due to edit conflict while uploading
+ SpamfilterError: PageNotSaved due to MediaWiki spam filter
+ LongPageError: PageNotSaved due to length limit
+ ServerError: Got unexpected response from wiki server
+ BadTitle: Server responded with BadTitle
+ UserBlocked: Client's username or IP has been blocked
+ PageNotFound: Page not found in list
+
+Objects:
+ get_throttle: Call to limit rate of read-access to wiki
+ put_throttle: Call to limit rate of write-access to wiki
+
+Other functions:
+ getall(): Load a group of pages
+ handleArgs(): Process all standard command line arguments (such as
+ -family, -lang, -log and others)
+ translate(xx, dict): dict is a dictionary, giving text depending on
+ language, xx is a language. Returns the text in the most applicable
+ language for the xx: wiki
+ setAction(text): Use 'text' instead of "Wikipedia python library" in
+ edit summaries
+ setUserAgent(text): Sets the string being passed to the HTTP server as
+ the User-agent: header. Defaults to 'Pywikipediabot/1.0'.
+
+ output(text): Prints the text 'text' in the encoding of the user's
+ console. **Use this instead of "print" statements**
+ input(text): Asks input from the user, printing the text 'text' first.
+ inputChoice: Shows user a list of choices and returns user's selection.
+
+ showDiff(oldtext, newtext): Prints the differences between oldtext and
+ newtext on the screen
+
+Wikitext manipulation functions: each of these takes a unicode string
+containing wiki text as its first argument, and returns a modified version
+of the text unless otherwise noted --
+
+ replaceExcept: replace all instances of 'old' by 'new', skipping any
+ instances of 'old' within comments and other special text blocks
+ removeDisabledParts: remove text portions exempt from wiki markup
+ isDisabled(text,index): return boolean indicating whether text[index] is
+ within a non-wiki-markup section of text
+ decodeEsperantoX: decode Esperanto text using the x convention.
+ encodeEsperantoX: convert wikitext to the Esperanto x-encoding.
+ findmarker(text, startwith, append): return a string which is not part
+ of text
+ expandmarker(text, marker, separator): return marker string expanded
+ backwards to include separator occurrences plus whitespace
+
+Wikitext manipulation functions for interlanguage links:
+
+ getLanguageLinks(text,xx): extract interlanguage links from text and
+ return in a dict
+ removeLanguageLinks(text): remove all interlanguage links from text
+ removeLanguageLinksAndSeparator(text, site, marker, separator = ''):
+ remove language links, whitespace, preceding separators from text
+ replaceLanguageLinks(oldtext, new): remove the language links and
+ replace them with links from a dict like the one returned by
+ getLanguageLinks
+ interwikiFormat(links): convert a dict of interlanguage links to text
+ (using same dict format as getLanguageLinks)
+ interwikiSort(sites, inSite): sorts a list of sites according to interwiki
+ sort preference of inSite.
+ url2link: Convert urlname of a wiki page into interwiki link format.
+
+Wikitext manipulation functions for category links:
+
+ getCategoryLinks(text): return list of Category objects corresponding
+ to links in text
+ removeCategoryLinks(text): remove all category links from text
+ replaceCategoryLinksAndSeparator(text, site, marker, separator = ''):
+ remove language links, whitespace, preceding separators from text
+ replaceCategoryLinks(oldtext,new): replace the category links in oldtext by
+ those in a list of Category objects
+ replaceCategoryInPlace(text,oldcat,newtitle): replace a single link to
+ oldcat with a link to category given by newtitle
+ categoryFormat(links): return a string containing links to all
+ Categories in a list.
+
+Unicode utility functions:
+ UnicodeToAsciiHtml: Convert unicode to a bytestring using HTML entities.
+ url2unicode: Convert url-encoded text to unicode using a site's encoding.
+ unicode2html: Ensure unicode string is encodable; if not, convert it to
+ ASCII for HTML.
+ html2unicode: Replace HTML entities in text with unicode characters.
+
+stopme(): Put this on a bot when it is no longer communicating with the Wiki.
+ It will remove the bot from the list of running processes, and thus
+ not slow down other bot threads anymore.
+
+"""
+from __future__ import generators
+#
+# (C) Pywikipedia bot team, 2003-2012
+#
+# Distributed under the terms of the MIT license.
+#
+__version__ = '$Id$'
+
+import os, sys
+import httplib, socket, urllib, urllib2, cookielib
+import traceback
+import time, threading, Queue
+import math
+import re, codecs, difflib, locale
+try:
+ from hashlib import md5
+except ImportError: # Python 2.4 compatibility
+ from md5 import new as md5
+import xml.sax, xml.sax.handler
+import htmlentitydefs
+import warnings
+import unicodedata
+import xmlreader
+from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup, SoupStrainer
+import weakref
+# Splitting the bot into library parts
+from pywikibot import *
+
+# Set the locale to system default. This will ensure correct string
+# handling for non-latin characters on Python 2.3.x. For Python 2.4.x it's no
+# longer needed.
+locale.setlocale(locale.LC_ALL, '')
+
+import config, login, query, version
+
+try:
+ set # introduced in Python2.4: faster and future
+except NameError:
+ from sets import Set as set
+
+# Check Unicode support (is this a wide or narrow python build?)
+# See http://www.python.org/doc/peps/pep-0261/
+try:
+ unichr(66365) # a character in th: alphabet, uses 32 bit encoding
+ WIDEBUILD = True
+except ValueError:
+ WIDEBUILD = False
+
+
+SaxError = xml.sax._exceptions.SAXParseException
+
+# Pre-compile re expressions
+reNamespace = re.compile("^(.+?) *: *(.*)$")
+Rwatch = re.compile(
+ r"<input type='hidden' value=\"(.*?)\"
name=\"wpEditToken\"")
+Rwatchlist = re.compile(r"<input tabindex='[\d]+' type='checkbox'
"
+ r"name='wpWatchthis'
checked='checked'")
+Rlink = re.compile(r'\[\[(?P<title>[^\]\|\[]*)(\|[^\]]*)?\]\]')
+
+
+# Page objects (defined here) represent the page itself, including its contents.
+class Page(object):
+ """Page: A MediaWiki page
+
+ Constructor has two required parameters:
+ 1) The wiki Site on which the page resides [note that, if the
+ title is in the form of an interwiki link, the Page object may
+ have a different Site than this]
+ 2) The title of the page as a unicode string
+
+ Optional parameters:
+ insite - the wiki Site where this link was found (to help decode
+ interwiki links)
+ defaultNamespace - A namespace to use if the link does not contain one
+
+ Methods available:
+
+ title : The name of the page, including namespace and
+ section if any
+ urlname : Title, in a form suitable for a URL
+ namespace : The namespace in which the page is found
+ section : The section of the page (the part of the title
+ after '#', if any)
+ sectionFreeTitle : Title, without the section part
+ site : The wiki this page is in
+ encoding : The encoding of the page
+ isAutoTitle : Title can be translated using the autoFormat method
+ autoFormat : Auto-format certain dates and other standard
+ format page titles
+ isCategory : True if the page is a category
+ isDisambig (*) : True if the page is a disambiguation page
+ isImage : True if the page is an image
+ isRedirectPage (*) : True if the page is a redirect, false otherwise
+ getRedirectTarget (*) : The page the page redirects to
+ isTalkPage : True if the page is in any "talk" namespace
+ toggleTalkPage : Return the talk page (if this is one, return the
+ non-talk page)
+ get (*) : The text of the page
+ getSections (*) : Retrieve page section heading and assign them to
+ the byte offset
+ latestRevision (*) : The page's current revision id
+ userName : Last user to edit page
+ userNameHuman : Last human (non-bot) user to edit page
+ isIpEdit : True if last editor was unregistered
+ editTime : Timestamp of the last revision to the page
+ previousRevision (*) : The revision id of the previous version
+ permalink (*) : The url of the permalink of the current version
+ getOldVersion(id) (*) : The text of a previous version of the page
+ getRestrictions : Returns a protection dictionary
+ getVersionHistory : Load the version history information from wiki
+ getVersionHistoryTable: Create a wiki table from the history data
+ fullVersionHistory : Return all past versions including wikitext
+ contributingUsers : Return set of users who have edited page
+ getCreator : Function to get the first editor of a page
+ getLatestEditors : Function to get the last editors of a page
+ exists (*) : True if the page actually exists, false otherwise
+ isEmpty (*) : True if the page has 4 characters or less content,
+ not counting interwiki and category links
+ interwiki (*) : The interwiki links from the page (list of Pages)
+ categories (*) : The categories the page is in (list of Pages)
+ linkedPages (*) : The normal pages linked from the page (list of
+ Pages)
+ imagelinks (*) : The pictures on the page (list of ImagePages)
+ templates (*) : All templates referenced on the page (list of
+ Pages)
+ templatesWithParams(*): All templates on the page, with list of parameters
+ getReferences : List of pages linking to the page
+ canBeEdited (*) : True if page is unprotected or user has edit
+ privileges
+ protection(*) : This page protection level
+ botMayEdit (*) : True if bot is allowed to edit page
+ put(newtext) : Saves the page
+ put_async(newtext) : Queues the page to be saved asynchronously
+ append(newtext) : Append to page section
+ watch : Add the page to the watchlist
+ unwatch : Remove the page from the watchlist
+ move : Move the page to another title
+ delete : Deletes the page (requires being logged in)
+ protect : Protect or unprotect a page (requires sysop status)
+ removeImage : Remove all instances of an image from this page
+ replaceImage : Replace all instances of an image with another
+ loadDeletedRevisions : Load all deleted versions of this page
+ getDeletedRevision : Return a particular deleted revision
+ markDeletedRevision : Mark a version to be undeleted, or not
+ undelete : Undelete past version(s) of the page
+ purgeCache : Purge page from server cache
+
+ (*) : This loads the page if it has not been loaded before; permalink might
+ even reload it if it has been loaded before
+
+ """
+ def __init__(self, site, title, insite=None, defaultNamespace=0):
+ """Instantiate a Page object.
+
+ """
+ try:
+ # if _editrestriction is True, it means that the page has been found
+ # to have an edit restriction, but we do not know yet whether the
+ # restriction affects us or not
+ self._editrestriction = False
+
+ if site is None or isinstance(site, basestring):
+ site = getSite(site)
+ self._site = site
+
+ if not insite:
+ insite = site
+
+ # Clean up the name, it can come from anywhere.
+ # Convert HTML entities to unicode
+ t = html2unicode(title)
+
+ # Convert URL-encoded characters to unicode
+ # Sometimes users copy the link to a site from one to another.
+ # Try both the source site and the destination site to decode.
+ try:
+ t = url2unicode(t, site=insite, site2=site)
+ except UnicodeDecodeError:
+ raise InvalidTitle(u'Bad page title : %s' % t)
+
+ # Normalize unicode string to a NFC (composed) format to allow
+ # proper string comparisons. According to
+ # http://svn.wikimedia.org/viewvc/mediawiki/branches/REL1_6/phase3/includes/n…
+ # the mediawiki code normalizes everything to NFC, not NFKC
+ # (which might result in information loss).
+ t = unicodedata.normalize('NFC', t)
+
+ if u'\ufffd' in t:
+ raise InvalidTitle("Title contains illegal char (\\uFFFD)")
+
+ # Replace underscores by spaces
+ t = t.replace(u"_", u" ")
+ # replace multiple spaces with a single space
+ while u"  " in t: t = t.replace(u"  ", u" ")
+ # Strip spaces at both ends
+ t = t.strip()
+ # Remove left-to-right and right-to-left markers.
+ t = t.replace(u'\u200e', '').replace(u'\u200f', '')
+ # leading colon implies main namespace instead of the default
+ if t.startswith(':'):
+ t = t[1:]
+ self._namespace = 0
+ else:
+ self._namespace = defaultNamespace
+
+ if not t:
+ raise InvalidTitle(u"Invalid title '%s'" % title )
+
+ self._namespace = defaultNamespace
+ #
+ # This code was adapted from Title.php : secureAndSplit()
+ #
+ # Namespace or interwiki prefix
+ while True:
+ m = reNamespace.match(t)
+ if not m:
+ break
+ p = m.group(1)
+ lowerNs = p.lower()
+ ns = self._site.getNamespaceIndex(lowerNs)
+ if ns:
+ t = m.group(2)
+ self._namespace = ns
+ break
+
+ if lowerNs in self._site.family.langs.keys():
+ # Interwiki link
+ t = m.group(2)
+
+ # Redundant interwiki prefix to the local wiki
+ if lowerNs == self._site.lang:
+ if t == '':
+ raise Error("Can't have an empty self-link")
+ else:
+ self._site = getSite(lowerNs, self._site.family.name)
+ if t == '':
+ t = self._site.mediawiki_message('Mainpage')
+
+ # If there's an initial colon after the interwiki, that also
+ # resets the default namespace
+ if t != '' and t[0] == ':':
+ self._namespace = 0
+ t = t[1:]
+ elif lowerNs in self._site.family.get_known_families(site = self._site):
+ if self._site.family.get_known_families(site = self._site)[lowerNs] == self._site.family.name:
+ t = m.group(2)
+ else:
+ # This page is from a different family
+ if verbose:
+ output(u"Target link '%s' has different family
'%s'" % (title, lowerNs))
+ if self._site.family.name in ['commons', 'meta']:
+ #When the source wiki is commons or meta,
+ #w:page redirects you to w:en:page
+ otherlang = 'en'
+ else:
+ otherlang = self._site.lang
+ familyName = self._site.family.get_known_families(site = self._site)[lowerNs]
+ if familyName in ['commons', 'meta']:
+ otherlang = familyName
+ try:
+ self._site = getSite(otherlang, familyName)
+ except ValueError:
+ raise NoPage("""\
+%s is not a local page on %s, and the %s family is
+not supported by PyWikipediaBot!"""
+ % (title, self._site, familyName))
+ t = m.group(2)
+ else:
+ # If there's no recognized interwiki or namespace,
+ # then let the colon expression be part of the title.
+ break
+
+ sectionStart = t.find(u'#')
+ # But maybe there are magic words like {{#time|}}
+ # TODO: recognize magic word and templates inside links
+ # see http://la.wikipedia.org/w/index.php?title=997_Priska&diff=prev&oldi…
+ if sectionStart > 0:
+ # Categories do not have sections.
+ if self._namespace == 14:
+ raise InvalidTitle(u"Invalid section in category
'%s'" % t)
+ else:
+ t, sec = t.split(u'#', 1)
+ self._section = sec.lstrip() or None
+ t = t.rstrip()
+ elif sectionStart == 0:
+ raise InvalidTitle(u"Invalid title starting with a #:
'%s'" % t)
+ else:
+ self._section = None
+
+ if t:
+ if not self._site.nocapitalize:
+ t = t[:1].upper() + t[1:]
+
+ # reassemble the title from its parts
+ if self._namespace != 0:
+ t = u'%s:%s' % (self._site.namespace(self._namespace), t)
+ if self._section:
+ t += u'#' + self._section
+
+ self._title = t
+ self.editRestriction = None
+ self.moveRestriction = None
+ self._permalink = None
+ self._userName = None
+ self._ipedit = None
+ self._editTime = None
+ self._startTime = '0'
+ # For the Flagged Revisions MediaWiki extension
+ self._revisionId = None
+ self._deletedRevs = None
+ except NoSuchSite:
+ raise
+ except:
+ if verbose:
+ output(u"Exception in Page constructor")
+ output(
+ u"site=%s, title=%s, insite=%s, defaultNamespace=%i"
+ % (site, title, insite, defaultNamespace)
+ )
+ raise
+
+ @property
+ def site(self):
+ """Return the Site object for the wiki on which this Page
resides."""
+ return self._site
+
+ def namespace(self):
+ """Return the number of the namespace of the page.
+
+ Only recognizes those namespaces defined in family.py.
+ If not defined, it will return 0 (the main namespace).
+
+ """
+ return self._namespace
+
+ def encoding(self):
+ """Return the character encoding used on this Page's wiki
Site."""
+ return self._site.encoding()
+
+ @deprecate_arg("decode", None)
+ def title(self, underscore=False, savetitle=False, withNamespace=True,
+ withSection=True, asUrl=False, asLink=False,
+ allowInterwiki=True, forceInterwiki=False, textlink=False,
+ as_filename=False):
+ """Return the title of this Page, as a Unicode string.
+
+ @param underscore: if true, replace all ' ' characters with '_'
+ @param withNamespace: if false, omit the namespace prefix
+ @param withSection: if false, omit the section
+ @param asUrl: - not implemented yet -
+ @param asLink: if true, return the title in the form of a wikilink
+ @param allowInterwiki: (only used if asLink is true) if true, format
+ the link as an interwiki link if necessary
+ @param forceInterwiki: (only used if asLink is true) if true, always
+ format the link as an interwiki link
+ @param textlink: (only used if asLink is true) if true, place a ':'
+ before Category: and Image: links
+ @param as_filename: - not implemented yet -
+ @param savetitle: if True, encode any wiki syntax in the title.
+
+ """
+ title = self._title
+ if not withNamespace and self.namespace() != 0:
+ title = title.split(':', 1)[1]
+ if asLink:
+ iw_target_site = getSite()
+ iw_target_family = getSite().family
+ if iw_target_family.interwiki_forward:
+ iw_target_family = pywikibot.Family(iw_target_family.interwiki_forward)
+
+ if allowInterwiki and (forceInterwiki or self._site != iw_target_site):
+ colon = ""
+ if textlink:
+ colon = ":"
+ if self._site.family != iw_target_family \
+ and self._site.family.name != self._site.lang:
+ title = u'[[%s%s:%s:%s]]' % (colon, self._site.family.name,
+ self._site.lang, title)
+ else:
+ title = u'[[%s%s:%s]]' % (colon, self._site.lang, title)
+ elif textlink and (self.isImage() or self.isCategory()):
+ title = u'[[:%s]]' % title
+ else:
+ title = u'[[%s]]' % title
+ if savetitle or asLink:
+ # Ensure there's no wiki syntax in the title
+ title = title.replace(u"''", u'%27%27')
+ if underscore:
+ title = title.replace(' ', '_')
+ if not withSection:
+ sectionName = self.section(underscore=underscore)
+ if sectionName:
+ title = title[:-len(sectionName)-1]
+ return title
+
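The flag combinations in title() map onto the different link forms documented in its parameter list. Assuming the working (default) site is en.wikipedia and the page lives on de.wikipedia, the expected results would be roughly:

    # p = Page(getSite('de', 'wikipedia'), u'Beispiel')
    # p.title()                           -> u'Beispiel'
    # p.title(underscore=True)            -> u'Beispiel'   (spaces, if any, become '_')
    # p.title(asLink=True)                -> u'[[de:Beispiel]]'
    # p.title(asLink=True, textlink=True) -> u'[[:de:Beispiel]]'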
+ #(a)deprecated("Page.title(withNamespace=False)")
+ def titleWithoutNamespace(self, underscore=False):
+ """Return title of Page without namespace and without
section."""
+ return self.title(underscore=underscore, withNamespace=False,
+ withSection=False)
+
+ def titleForFilename(self):
+ """
+ Return the title of the page in a form suitable for a filename on
+ the user's file system.
+ """
+ result = self.title()
+ # Replace characters that are not possible in file names on some
+ # systems.
+ # Spaces are possible on most systems, but are bad for URLs.
+ for forbiddenChar in ':*?/\\ ':
+ result = result.replace(forbiddenChar, '_')
+ return result
+
+ @deprecate_arg("decode", None)
+ def section(self, underscore = False):
+ """Return the name of the section this Page refers to.
+
+ The section is the part of the title following a '#' character, if
+ any. If no section is present, return None.
+
+ """
+ section = self._section
+ if section and underscore:
+ section = section.replace(' ', '_')
+ return section
+
+ def sectionFreeTitle(self, underscore=False):
+ """Return the title of this Page, without the section (if
any)."""
+ sectionName = self.section(underscore=underscore)
+ title = self.title(underscore=underscore)
+ if sectionName:
+ return title[:-len(sectionName)-1]
+ else:
+ return title
+
+ def urlname(self, withNamespace=True):
+ """Return the Page title encoded for use in an
URL."""
+ title = self.title(withNamespace=withNamespace, underscore=True)
+ encodedTitle = title.encode(self.site().encoding())
+ return urllib.quote(encodedTitle)
+
+ def __str__(self):
+ """Return a console representation of the
pagelink."""
+ return self.title(asLink=True, forceInterwiki=True
+ ).encode(config.console_encoding,
+ "xmlcharrefreplace")
+
+ def __unicode__(self):
+ return self.title(asLink=True, forceInterwiki=True)
+
+ def __repr__(self):
+ """Return a more complete string
representation."""
+ return "%s{%s}" % (self.__class__.__name__,
+ self.title(asLink=True).encode(config.console_encoding))
+
+ def __cmp__(self, other):
+ """Test for equality and inequality of Page objects.
+
+ Page objects are "equal" if and only if they are on the same site
+ and have the same normalized title, including section if any.
+
+ Page objects are sortable by namespace first, then by title.
+
+ """
+ if not isinstance(other, Page):
+ # especially, return -1 if other is None
+ return -1
+ if self._site == other._site:
+ return cmp(self._title, other._title)
+ else:
+ return cmp(self._site, other._site)
+
+ def __hash__(self):
+ # Pseudo method that makes it possible to store Page objects as keys
+ # in hash-tables. This relies on the fact that the string
+ # representation of an instance can not change after the construction.
+ return hash(unicode(self))
+
+ @deprecated("Page.title(asLink=True)")
+ def aslink(self, forceInterwiki=False, textlink=False, noInterwiki=False):
+ """Return a string representation in the form of a wikilink.
+
+ If forceInterwiki is True, return an interwiki link even if it
+ points to the home wiki. If False, return an interwiki link only if
+ needed.
+
+ If textlink is True, always return a link in text form (that is,
+ interwiki links and internal links to the Category: and Image:
+ namespaces will be preceded by a : character).
+
+ DEPRECATED to merge to rewrite branch:
+ use self.title(asLink=True) instead.
+ """
+ return self.title(asLink=True, forceInterwiki=forceInterwiki,
+ allowInterwiki=not noInterwiki, textlink=textlink)
+
+ def autoFormat(self):
+ """Return (dictName, value) if title is in date.autoFormat
dictionary.
+
+ Value can be a year, date, etc., and dictName is 'YearBC',
+ 'Year_December', or another dictionary name. Please note that two
+ entries may have exactly the same autoFormat, but be in two
+ different namespaces, as some sites have categories with the
+ same names. Regular titles return (None, None).
+
+ """
+ if not hasattr(self, '_autoFormat'):
+ import date
+ self._autoFormat = date.getAutoFormat(self.site().language(),
+ self.title(withNamespace=False))
+ return self._autoFormat
+
+ def isAutoTitle(self):
+ """Return True if title of this Page is in the autoFormat
dictionary."""
+ return self.autoFormat()[0] is not None
+
+ def get(self, force=False, get_redirect=False, throttle=True,
+ sysop=False, change_edit_time=True, expandtemplates=False):
+ """Return the wiki-text of the page.
+
+ This will retrieve the page from the server if it has not been
+ retrieved yet, or if force is True. This can raise the following
+ exceptions that should be caught by the calling code:
+
+ @exception NoPage The page does not exist
+ @exception IsRedirectPage The page is a redirect. The argument of the
+ exception is the title of the page it
+ redirects to.
+ @exception SectionError The section does not exist on a page with
+ a # link
+
+ @param force reload all page attributes, including errors.
+ @param get_redirect return the redirect text, do not follow the
+ redirect, do not raise an exception.
+ @param sysop if the user has a sysop account, use it to
+ retrieve this page
+ @param change_edit_time if False, do not check this version for
+ changes before saving. This should be used only
+ if the page has been loaded previously.
+ @param expandtemplates all templates in the page content are fully
+ resolved too (if API is used).
+
+ """
+ # NOTE: The following few NoPage exceptions could already be thrown at
+ # the Page() constructor. They are raised here instead for convenience,
+ # because all scripts are prepared for NoPage exceptions raised by
+ # get(), but not for such raised by the constructor.
+ # \ufffd represents a badly encoded character, the other characters are
+ # disallowed by MediaWiki.
+ for illegalChar in u'#<>[]|{}\n\ufffd':
+ if illegalChar in self.sectionFreeTitle():
+ if verbose:
+ output(u'Illegal character in %s!'
+ % self.title(asLink=True))
+ raise NoPage('Illegal character in %s!'
+ % self.title(asLink=True))
+ if self.namespace() == -1:
+ raise NoPage('%s is in the Special namespace!'
+ % self.title(asLink=True))
+ if self.site().isInterwikiLink(self.title()):
+ raise NoPage('%s is not a local page on %s!'
+ % (self.title(asLink=True), self.site()))
+ if force:
+ # When forcing, we retry the page no matter what:
+ # * Old exceptions and contents do not apply any more
+ # * Deleting _contents and _expandcontents to force reload
+ for attr in ['_redirarg', '_getexception',
+ '_contents', '_expandcontents',
+ '_sections']:
+ if hasattr(self, attr):
+ delattr(self, attr)
+ else:
+ # Make sure we re-raise an exception we got on an earlier attempt
+ if hasattr(self, '_redirarg') and not get_redirect:
+ raise IsRedirectPage, self._redirarg
+ elif hasattr(self, '_getexception'):
+ if self._getexception == IsRedirectPage and get_redirect:
+ pass
+ else:
+ raise self._getexception
+ # Make sure we did try to get the contents once
+ if expandtemplates:
+ attr = '_expandcontents'
+ else:
+ attr = '_contents'
+ if not hasattr(self, attr):
+ try:
+ contents = self._getEditPage(get_redirect=get_redirect, throttle=throttle, sysop=sysop,
+ expandtemplates = expandtemplates)
+ if expandtemplates:
+ self._expandcontents = contents
+ else:
+ self._contents = contents
+ hn = self.section()
+ if hn:
+ m = re.search("=+[ ']*%s[ ']*=+" % re.escape(hn),
+ self._contents)
+ if verbose and not m:
+ output(u"WARNING: Section does not exist: %s" % self)
+ # Store any exceptions for later reference
+ except NoPage:
+ self._getexception = NoPage
+ raise
+ except IsRedirectPage, arg:
+ self._getexception = IsRedirectPage
+ self._redirarg = arg
+ if not get_redirect:
+ raise
+ except SectionError:
+ self._getexception = SectionError
+ raise
+ except UserBlocked:
+ if self.site().loggedInAs(sysop=sysop):
+ raise UserBlocked(self.site(), unicode(self))
+ else:
+ if verbose:
+ output("The IP address is blocked, retry by login.")
+ self.site().forceLogin(sysop=sysop)
+ return self.get(force, get_redirect, throttle, sysop, change_edit_time)
+ if expandtemplates:
+ return self._expandcontents
+ return self._contents
+
+ def _getEditPage(self, get_redirect=False, throttle=True, sysop=False,
+ oldid=None, change_edit_time=True, expandtemplates=False):
+ """Get the contents of the Page via API query
+
+ Do not use this directly, use get() instead.
+
+ Arguments:
+ oldid - Retrieve an old revision (by id), not the current one
+ get_redirect - Get the contents, even if it is a redirect page
+ expandtemplates - Fully resolve templates within page content
+ (if API is used)
+
+ This method returns the raw wiki text as a unicode string.
+ """
+ if not self.site().has_api() or self.site().versionnumber() < 12:
+ return self._getEditPageOld(get_redirect, throttle, sysop, oldid,
change_edit_time)
+ params = {
+ 'action': 'query',
+ 'titles': self.title(),
+ 'prop': ['revisions', 'info'],
+            'rvprop': ['content', 'ids', 'flags', 'timestamp', 'user', 'comment', 'size'],
+ 'rvlimit': 1,
+ #'talkid' valid for release > 1.12
+ #'url', 'readable' valid for release > 1.14
+ 'inprop': ['protection', 'subjectid'],
+ #'intoken': 'edit',
+ }
+ if oldid:
+ params['rvstartid'] = oldid
+ if expandtemplates:
+ params[u'rvexpandtemplates'] = u''
+
+ if throttle:
+ get_throttle()
+ textareaFound = False
+ # retrying loop is done by query.GetData
+ data = query.GetData(params, self.site(), sysop=sysop)
+ if 'error' in data:
+ raise RuntimeError("API query error: %s" % data)
+ if not 'pages' in data['query']:
+ raise RuntimeError("API query error, no pages found: %s" % data)
+ pageInfo = data['query']['pages'].values()[0]
+ if data['query']['pages'].keys()[0] == "-1":
+ if 'missing' in pageInfo:
+ raise NoPage(self.site(), unicode(self),
+"Page does not exist. In rare cases, if you are certain the page does exist, look
into overriding family.RversionTab")
+ elif 'invalid' in pageInfo:
+ raise BadTitle('BadTitle: %s' % self)
+ elif 'revisions' in pageInfo: #valid Title
+ lastRev = pageInfo['revisions'][0]
+ if isinstance(lastRev['*'], basestring):
+ textareaFound = True
+        # I got page data with 'revisions' in pageInfo but
+        # lastRev['*'] = False instead of the content. The page itself was
+        # deleted, but 'missing' was not in pageInfo as expected.
+        # For now a ServerError() is raised, but maybe it should be NoPage().
+ if not textareaFound:
+ if verbose:
+ print pageInfo
+ raise ServerError('ServerError: No textarea found in %s' % self)
+
+ self.editRestriction = ''
+ self.moveRestriction = ''
+
+ # Note: user may be hidden and mw returns 'userhidden' flag
+ if 'userhidden' in lastRev:
+ self._userName = None
+ else:
+ self._userName = lastRev['user']
+ self._ipedit = 'anon' in lastRev
+ for restr in pageInfo['protection']:
+ if restr['type'] == 'edit':
+ self.editRestriction = restr['level']
+ elif restr['type'] == 'move':
+ self.moveRestriction = restr['level']
+
+ self._revisionId = lastRev['revid']
+
+ if change_edit_time:
+ self._editTime = parsetime2stamp(lastRev['timestamp'])
+ if "starttimestamp" in pageInfo:
+ self._startTime = parsetime2stamp(pageInfo["starttimestamp"])
+
+ self._isWatched = False #cannot handle in API in my research for now.
+
+ pagetext = lastRev['*']
+ pagetext = pagetext.rstrip()
+ # pagetext must not decodeEsperantoX() if loaded via API
+ m = self.site().redirectRegex().match(pagetext)
+ if m:
+ # page text matches the redirect pattern
+ if self.section() and not "#" in m.group(1):
+ redirtarget = "%s#%s" % (m.group(1), self.section())
+ else:
+ redirtarget = m.group(1)
+ if get_redirect:
+ self._redirarg = redirtarget
+ else:
+ raise IsRedirectPage(redirtarget)
+
+ if self.section() and \
+ not textlib.does_text_contain_section(pagetext, self.section()):
+ try:
+ self._getexception
+ except AttributeError:
+ raise SectionError # Page has no section by this name
+ return pagetext
+
+ def _getEditPageOld(self, get_redirect=False, throttle=True, sysop=False,
+ oldid=None, change_edit_time=True):
+        """Get the contents of the Page via the edit page."""
+
+ if verbose:
+ output(u'Getting page %s' % self.title(asLink=True))
+ path = self.site().edit_address(self.urlname())
+ if oldid:
+ path += "&oldid="+oldid
+ # Make sure Brion doesn't get angry by waiting if the last time a page
+ # was retrieved was not long enough ago.
+ if throttle:
+ get_throttle()
+ textareaFound = False
+ retry_idle_time = 1
+ while not textareaFound:
+ text = self.site().getUrl(path, sysop = sysop)
+
+ if "<title>Wiki does not exist</title>" in text:
+ raise NoSuchSite(u'Wiki %s does not exist yet' % self.site())
+
+ # Extract the actual text from the textarea
+ m1 = re.search('<textarea([^>]*)>', text)
+ m2 = re.search('</textarea>', text)
+ if m1 and m2:
+ i1 = m1.end()
+ i2 = m2.start()
+ textareaFound = True
+ else:
+                # search for messages with no "view source" (aren't used in new versions)
+                if self.site().mediawiki_message('whitelistedittitle') in text:
+                    raise NoPage(u'Page editing is forbidden for anonymous users.')
+                elif self.site().has_mediawiki_message('nocreatetitle') and self.site().mediawiki_message('nocreatetitle') in text:
+ raise NoPage(self.site(), unicode(self))
+ # Bad title
+ elif 'var wgPageName = "Special:Badtitle";' in text \
+ or self.site().mediawiki_message('badtitle') in text:
+ raise BadTitle('BadTitle: %s' % self)
+ # find out if the username or IP has been blocked
+ elif self.site().isBlocked():
+ raise UserBlocked(self.site(), unicode(self))
+ # If there is no text area and the heading is 'View Source'
+ # but user is not blocked, the page does not exist, and is
+ # locked
+ elif self.site().mediawiki_message('viewsource') in text:
+ raise NoPage(self.site(), unicode(self))
+                # Some of the newest versions don't have a "view source" tag for
+                # non-existent pages.
+                # Also check the div class, because if the language is not English
+                # the bot cannot see that the page is blocked.
+ elif self.site().mediawiki_message('badaccess') in text or \
+ "<div class=\"permissions-errors\">" in text:
+ raise NoPage(self.site(), unicode(self))
+ elif config.retry_on_fail:
+ if "<title>Wikimedia Error</title>" in text:
+ output( u"Wikimedia has technical problems; will retry in %i
minutes." % retry_idle_time)
+ else:
+ output( unicode(text) )
+                    # We assume that the server is down. Wait some time, then try again.
+                    output( u"WARNING: No text area found on %s%s. Maybe the server is down. Retrying in %i minutes..." % (self.site().hostname(), path, retry_idle_time) )
+ time.sleep(retry_idle_time * 60)
+ # Next time wait longer, but not longer than half an hour
+ retry_idle_time *= 2
+ if retry_idle_time > 30:
+ retry_idle_time = 30
+ else:
+ output( u"Failed to access wiki")
+ sys.exit(1)
+ # Check for restrictions
+ m = re.search('var wgRestrictionEdit = \\["(\w+)"\\]', text)
+ if m:
+ if verbose:
+ output(u"DBG> page is locked for group %s" % m.group(1))
+ self.editRestriction = m.group(1);
+ else:
+ self.editRestriction = ''
+ m = re.search('var wgRestrictionMove = \\["(\w+)"\\]', text)
+ if m:
+ self.moveRestriction = m.group(1);
+ else:
+ self.moveRestriction = ''
+        m = re.search('name=["\']baseRevId["\'] type=["\']hidden["\'] value="(\d+)"', text)
+ if m:
+ self._revisionId = m.group(1)
+ if change_edit_time:
+ # Get timestamps
+ m = re.search('value="(\d+)"
name=["\']wpEdittime["\']', text)
+ if m:
+ self._editTime = m.group(1)
+ else:
+ self._editTime = "0"
+ m = re.search('value="(\d+)"
name=["\']wpStarttime["\']', text)
+ if m:
+ self._startTime = m.group(1)
+ else:
+ self._startTime = "0"
+ # Find out if page actually exists. Only existing pages have a
+ # version history tab.
+ if self.site().family.RversionTab(self.site().language()):
+ # In case a family does not have version history tabs, or in
+ # another form
+ RversionTab =
re.compile(self.site().family.RversionTab(self.site().language()))
+ else:
+            RversionTab = re.compile(r'<li id="ca-history"><a href=".*?title=.*?&action=history".*?>.*?</a></li>', re.DOTALL)
+ matchVersionTab = RversionTab.search(text)
+ if not matchVersionTab and not self.site().family.name == 'wikitravel':
+ raise NoPage(self.site(), unicode(self),
+"Page does not exist. In rare cases, if you are certain the page does exist, look
into overriding family.RversionTab" )
+ # Look if the page is on our watchlist
+ matchWatching = Rwatchlist.search(text)
+ if matchWatching:
+ self._isWatched = True
+ else:
+ self._isWatched = False
+ # Now process the contents of the textarea
+ # Unescape HTML characters, strip whitespace
+ pagetext = text[i1:i2]
+ pagetext = unescape(pagetext)
+ pagetext = pagetext.rstrip()
+ if self.site().lang == 'eo':
+ pagetext = decodeEsperantoX(pagetext)
+ m = self.site().redirectRegex().match(pagetext)
+ if m:
+ # page text matches the redirect pattern
+ if self.section() and not "#" in m.group(1):
+ redirtarget = "%s#%s" % (m.group(1), self.section())
+ else:
+ redirtarget = m.group(1)
+ if get_redirect:
+ self._redirarg = redirtarget
+ else:
+ raise IsRedirectPage(redirtarget)
+
+ if self.section() and \
+ not textlib.does_text_contain_section(text, self.section()):
+ try:
+ self._getexception
+ except AttributeError:
+ raise SectionError # Page has no section by this name
+
+ return pagetext
+
+ def getOldVersion(self, oldid, force=False, get_redirect=False,
+ throttle=True, sysop=False, change_edit_time=True):
+        """Return text of an old revision of this page; same options as get().
+
+ @param oldid: The revid of the revision desired.
+
+ """
+ # TODO: should probably check for bad pagename, NoPage, and other
+ # exceptions that would prevent retrieving text, as get() does
+
+ # TODO: should this default to change_edit_time = False? If we're not
+ # getting the current version, why change the timestamps?
+ return self._getEditPage(
+ get_redirect=get_redirect, throttle=throttle,
+ sysop=sysop, oldid=oldid,
+ change_edit_time=change_edit_time
+ )
+
+ ## @since r10309
+ # @remarks needed by various bots
+ def getSections(self, minLevel=2, sectionsonly=False, force=False):
+ """Parses the page with API and return section information.
+
+ @param minLevel: The minimal level of heading for section to be reported.
+ @type minLevel: int
+ @param sectionsonly: Report only the result from API call, do not assign
+ the headings to wiki text (for compression e.g.).
+ @type sectionsonly: bool
+ @param force: Use API for full section list resolution, works always but
+ is extremely slow, since each single section has to be
retrieved.
+ @type force: bool
+
+        Returns a list with entries: (byteoffset, level, wikiline, line, anchor)
+        This list may be empty, and if sections are embedded by a template, the
+        corresponding byteoffset and wikiline entries are None. The wikiline is
+        the wiki text, line is the parsed text and anchor is the (unique) link label.
+        """
+        # ALWAYS replace 'byteoffset' by a self-calculated value, since the parsed
+        # text does not match the wiki text
+ # bug fix; JIRA: DRTRIGON-82
+
+ # was there already a call? already some info available?
+ if hasattr(self, '_sections'):
+ return self._sections
+
+ # Old exceptions and contents do not apply any more.
+ for attr in ['_sections']:
+ if hasattr(self, attr):
+ delattr(self,attr)
+
+ # call the wiki to get info
+ params = {
+ u'action' : u'parse',
+ u'page' : self.title(),
+ u'prop' : u'sections',
+ }
+
+ pywikibot.get_throttle()
+        pywikibot.output(u"Reading section info from %s via API..." % self.title(asLink=True))
+
+ result = query.GetData(params, self.site())
+ # JIRA: DRTRIGON-90; catch and convert error (convert it such that the whole page
gets processed later)
+ try:
+ r = result[u'parse'][u'sections']
+        except KeyError: # sequence of sometimes occurring "KeyError: u'parse'"
+            pywikibot.output(u'WARNING: Query result (gS): %r' % result)
+            raise pywikibot.Error('Problem occurred during data retrieval for sections in %s!' % self.title(asLink=True))
+ #debug_data = str(r) + '\n'
+ debug_data = str(result) + '\n'
+
+ if not sectionsonly:
+ # assign sections with wiki text and section byteoffset
+            #pywikibot.output(u"  Reading wiki page text (if not already done).")
+
+            debug_data += str(len(self.__dict__.get('_contents',u''))) + '\n'
+ self.get()
+ debug_data += str(len(self._contents)) + '\n'
+ debug_data += self._contents + '\n'
+
+ # code debugging
+ if verbose:
+ debugDump( 'Page.getSections', self.site,
'Page.getSections', debug_data.encode(config.textfile_encoding) )
+
+        for setting in [(0.05,0.95), (0.4,0.8), (0.05,0.8), (0.0,0.8)]: # 0.6 is default upper border
+ try:
+ pos = 0
+ for i, item in enumerate(r):
+ item[u'level'] = int(item[u'level'])
+ # byteoffset may be 0; 'None' means template
+                    #if (item[u'byteoffset'] != None) and item[u'line']:
+                    # (empty index means also template - workaround for bug:
+                    #  https://bugzilla.wikimedia.org/show_bug.cgi?id=32753)
+                    if (item[u'byteoffset'] != None) and item[u'line'] and item[u'index']:
+                        # section on this page and index in format u"%i"
+                        self._getSectionByteOffset(item, pos, force, cutoff=setting) # raises 'Error' if not successful!
+ pos = item[u'wikiline_bo'] +
len(item[u'wikiline'])
+ item[u'byteoffset'] = item[u'wikiline_bo']
+ else:
+                        # section embedded from a template (index in format u"T-%i") or the
+                        # parser was not able to recognize the section correctly (e.g. html) at all
+ # (the byteoffset, index, ... may be correct or not)
+ item[u'wikiline'] = None
+ r[i] = item
+ break
+ except pywikibot.Error:
+ pos = None
+ if (pos == None):
+ raise # re-raise
+
+ # check min. level
+ data = []
+ for item in r:
+ if (item[u'level'] < minLevel): continue
+ data.append( item )
+ r = data
+
+ # prepare resulting data
+        self._sections = [ (item[u'byteoffset'], item[u'level'], item[u'wikiline'], item[u'line'], item[u'anchor']) for item in r ]
+
+ return self._sections
+
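+    # Illustrative sketch of the tuple layout documented above (not part of the
+    # original script; the page title is hypothetical):
+    #
+    #   page = wikipedia.Page(wikipedia.getSite(), u'Example page')
+    #   for byteoffset, level, wikiline, line, anchor in page.getSections(minLevel=2):
+    #       # byteoffset and wikiline are None for sections embedded by a template
+    #       wikipedia.output(u'%s (level %i, anchor %s)' % (line, level, anchor))
+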
+ ## @since r10309
+ # @remarks needed by Page.getSections()
+ def _getSectionByteOffset(self, section, pos, force=False, cutoff=(0.05, 0.95)):
+        """Determine the byteoffset of the given section (can be slow due to another API call).
+ """
+ wikitextlines = self._contents[pos:].splitlines()
+ possible_headers = []
+ #print section[u'line']
+
+ if not force:
+ # how the heading should look like (re)
+ l = section[u'level']
+            headers = [ u'^(\s*)%(spacer)s(.*?)%(spacer)s(\s*)((<!--(.*?)-->)?)(\s*)$' % {'line': section[u'line'], 'spacer': u'=' * l},
+                        u'^(\s*)<h%(level)i>(.*?)</h%(level)i>(.*?)$' % {'line': section[u'line'], 'level': l}, ]
+
+ # try to give exact match for heading (remove HTML comments)
+ for h in headers:
+ #ph = re.search(h, pywikibot.removeDisabledParts(self._contents[pos:]),
re.M)
+ ph = re.search(h, self._contents[pos:], re.M)
+ if ph:
+ ph = ph.group(0).strip()
+ possible_headers += [ (ph, section[u'line']) ]
+
+ # how the heading could look like (difflib)
+            headers = [ u'%(spacer)s %(line)s %(spacer)s' % {'line': section[u'line'], 'spacer': u'=' * l},
+                        u'<h%(level)i>%(line)s</h%(level)i>' % {'line': section[u'line'], 'level': l}, ]
+
+ # give possible match for heading
+            # http://stackoverflow.com/questions/2923420/fuzzy-string-matching-algorithm-…
+            # http://docs.python.org/library/difflib.html
+            # ( http://mwh.geek.nz/2009/04/26/python-damerau-levenshtein-distance/)
+ for h in headers:
+ ph = difflib.get_close_matches(h, wikitextlines, cutoff=cutoff[1]) #
cutoff=0.6 (default)
+ possible_headers += [ (p, section[u'line']) for p in ph ]
+ #print h, possible_headers
+
+        if not possible_headers and section[u'index']: # nothing found, try 'prop=revisions (rv)'
+ # call the wiki to get info
+ params = {
+ u'action' : u'query',
+ u'titles' : self.title(),
+ u'prop' : u'revisions',
+ u'rvprop' : u'content',
+ u'rvsection' : section[u'index'],
+ }
+
+ pywikibot.get_throttle()
+ pywikibot.output(u" Reading section %s from %s via API..." %
(section[u'index'], self.title(asLink=True)))
+
+ result = query.GetData(params, self.site())
+ # JIRA: DRTRIGON-90; catch and convert error (convert it such that the whole
page gets processed later)
+ try:
+ r = result[u'query'][u'pages'].values()[0]
+ pl = r[u'revisions'][0][u'*'].splitlines()
+            except KeyError: # sequence of sometimes occurring "KeyError: u'parse'"
+                pywikibot.output(u'WARNING: Query result (gSBO): %r' % result)
+                raise pywikibot.Error('Problem occurred during data retrieval for sections in %s!' % self.title(asLink=True))
+
+ if pl:
+ possible_headers = [ (pl[0], pl[0]) ]
+
+ # find the most probable match for heading
+ #print possible_headers
+ best_match = (0.0, None)
+ for i, (ph, header) in enumerate(possible_headers):
+ #print u' ', i, difflib.SequenceMatcher(None, header, ph).ratio(),
header, ph
+ mr = difflib.SequenceMatcher(None, header, ph).ratio()
+ if mr >= best_match[0]: best_match = (mr, ph)
+ if (i in [0, 1]) and (mr >= cutoff[0]): break # use first (exact; re)
match directly (if good enough)
+ #print u' ', best_match
+
+ # prepare resulting data
+ section[u'wikiline'] = best_match[1]
+ section[u'wikiline_mq'] = best_match[0] # match quality
+ section[u'wikiline_bo'] = -1 # byteoffset
+ if section[u'wikiline']:
+            section[u'wikiline_bo'] = self._contents.find(section[u'wikiline'], pos)
+        if section[u'wikiline_bo'] < 0: # nothing found, report/raise error!
+            #page._getexception = ...
+            raise pywikibot.Error('Problem occurred during attempt to retrieve and resolve sections in %s!' % self.title(asLink=True))
+ #pywikibot.output(...)
+ # (or create a own error, e.g. look into interwiki.py)
+
+ def permalink(self):
+        """Return the permalink URL for current revision of this page."""
+ return "%s://%s%s&oldid=%i" % (self.site().protocol(),
+ self.site().hostname(),
+ self.site().get_address(self.title()),
+ self.latestRevision())
+
+ def latestRevision(self):
+ """Return the current revision id for this
page."""
+ if not self._permalink:
+ # When we get the page with getall, the permalink is received
+ # automatically
+ getall(self.site(),[self],force=True)
+ # Check for exceptions
+ if hasattr(self, '_getexception'):
+ raise self._getexception
+ return int(self._permalink)
+
+ def userName(self):
+ """Return name or IP address of last user to edit page.
+
+ Returns None unless page was retrieved with getAll().
+
+ """
+ return self._userName
+
+ ## @since r10310
+ # @remarks needed by various bots
+ def userNameHuman(self):
+        """Return name or IP address of the last human (non-bot) user to edit the page.
+
+        Returns the most recent human editor out of the last revisions
+        (optimally used with getAll()). If no human user could be
+        retrieved, returns None.
+ """
+
+ # was there already a call? already some info available?
+ if hasattr(self, '_userNameHuman'):
+ return self._userNameHuman
+
+ # get history (use preloaded if available)
+ (revid, timestmp, username, comment) = self.getVersionHistory(revCount=1)[0][:4]
+
+ # is the last/actual editor already a human?
+ import botlist # like watchlist
+ if not botlist.isBot(username):
+ self._userNameHuman = username
+ return username
+
+ # search the last human
+ self._userNameHuman = None
+ for vh in self.getVersionHistory()[1:]:
+ (revid, timestmp, username, comment) = vh[:4]
+
+ if username and (not botlist.isBot(username)):
+ # user is a human (not a bot)
+ self._userNameHuman = username
+ break
+
+ # store and return info
+ return self._userNameHuman
+
+ def isIpEdit(self):
+ """Return True if last editor was unregistered.
+
+ Returns None unless page was retrieved with getAll() or _getEditPage().
+
+ """
+ return self._ipedit
+
+ def editTime(self, datetime=False):
+ """Return timestamp (in MediaWiki format) of last revision to
page.
+
+ Returns None unless page was retrieved with getAll() or _getEditPage().
+
+ """
+ if self._editTime and datetime:
+ import datetime
+ return datetime.datetime.strptime(str(self._editTime),
'%Y%m%d%H%M%S')
+
+ return self._editTime
+
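+    # Illustrative sketch (not part of the original script): editTime() returns the
+    # MediaWiki timestamp stored by get()/getAll(); datetime=True converts it.
+    #
+    #   page.get()
+    #   stamp = page.editTime()              # e.g. 20120916134836 (format %Y%m%d%H%M%S)
+    #   when = page.editTime(datetime=True)  # datetime.datetime object
+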
+ def previousRevision(self):
+ """Return the revision id for the previous revision of this
Page."""
+ vh = self.getVersionHistory(revCount=2)
+ return vh[1][0]
+
+ def exists(self):
+        """Return True if page exists on the wiki, even if it's a redirect.
+
+ If the title includes a section, return False if this section isn't
+ found.
+
+ """
+ try:
+ self.get()
+ except NoPage:
+ return False
+ except IsRedirectPage:
+ return True
+ except SectionError:
+ return False
+ return True
+
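+    # Illustrative sketch (not part of the original script): exists() swallows the
+    # exceptions documented above, so it is safe to call before get().
+    #
+    #   page = wikipedia.Page(wikipedia.getSite(), u'Example page')   # hypothetical title
+    #   if page.exists() and not page.isRedirectPage():
+    #       text = page.get()
+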
+ def pageAPInfo(self):
+ """Return the last revid if page exists on the wiki,
+ Raise IsRedirectPage if it's a redirect
+ Raise NoPage if the page doesn't exist
+
+        Using the API should be a lot faster.
+        This function exists to improve the scripts' performance.
+
+ """
+ params = {
+ 'action' :'query',
+ 'prop' :'info',
+ 'titles' :self.title(),
+ }
+        data = query.GetData(params, self.site(), encodeTitle = False)['query']['pages'].values()[0]
+ if 'redirect' in data:
+ raise IsRedirectPage
+ elif 'missing' in data:
+ raise NoPage
+ elif 'lastrevid' in data:
+ return data['lastrevid'] # if ok, return the last revid
+ else:
+            # should not exist, OR we have problems.
+            # better to double-check in these situations
+ x = self.get()
+ return True # if we reach this point, we had no problems.
+
+ def getTemplates(self, tllimit = 5000):
+ #action=query&prop=templates&titles=Main Page
+ """
+ Returns the templates that are used in the page given by API.
+
+        If no templates are found, returns an empty list.
+
+ """
+ params = {
+ 'action': 'query',
+ 'prop': 'templates',
+ 'titles': self.title(),
+ 'tllimit': tllimit,
+ }
+ if tllimit > config.special_page_limit:
+ params['tllimit'] = config.special_page_limit
+        if tllimit > 5000 and self.site().isAllowed('apihighlimits'):
+ params['tllimit'] = 5000
+
+ tmpsFound = []
+ count = 0
+ while True:
+ data = query.GetData(params, self.site(), encodeTitle =
False)['query']['pages'].values()[0]
+ if "templates" not in data:
+ return []
+
+ for tmp in data['templates']:
+ count += 1
+ tmpsFound.append(Page(self.site(), tmp['title'],
defaultNamespace=tmp['ns']) )
+ if count >= tllimit:
+ break
+
+ if 'query-continue' in data and count < tllimit:
+ params["tlcontinue"] =
data["query-continue"]["templates"]["tlcontinue"]
+ else:
+ break
+
+ return tmpsFound
+
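+    # Illustrative sketch (not part of the original script): getTemplates() returns
+    # Page objects in the Template namespace.
+    #
+    #   for tmpl in page.getTemplates():
+    #       wikipedia.output(tmpl.title())
+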
+ def isRedirectPage(self):
+ """Return True if this is a redirect, False if not or not
existing."""
+ try:
+ self.get()
+ except NoPage:
+ return False
+ except IsRedirectPage:
+ return True
+ except SectionError:
+ return False
+ return False
+
+ def isStaticRedirect(self, force=False):
+ """Return True if this is a redirect containing the magic word
+ __STATICREDIRECT__, False if not or not existing.
+
+ """
+ found = False
+ if self.isRedirectPage() and self.site().versionnumber() > 13:
+ staticKeys = self.site().getmagicwords('staticredirect')
+ text = self.get(get_redirect=True, force=force)
+ if staticKeys:
+ for key in staticKeys:
+ if key in text:
+ found = True
+ break
+ return found
+
+ def isCategoryRedirect(self, text=None):
+ """Return True if this is a category redirect page, False
otherwise."""
+
+ if not self.isCategory():
+ return False
+ if not hasattr(self, "_catredirect"):
+ if not text:
+ try:
+ text = self.get(get_redirect=True)
+ except NoPage:
+ return False
+ catredirs = self.site().category_redirects()
+ for (t, args) in self.templatesWithParams(thistxt=text):
+ template = Page(self.site(), t, defaultNamespace=10
+ ).title(withNamespace=False) # normalize title
+ if template in catredirs:
+ # Get target (first template argument)
+ if not args:
+ pywikibot.output(u'Warning: redirect target for %s is
missing'
+ % self.title(asLink=True))
+ self._catredirect = False
+ else:
+ self._catredirect = self.site().namespace(14) + ":" +
args[0]
+ break
+ else:
+ self._catredirect = False
+ return bool(self._catredirect)
+
+ def getCategoryRedirectTarget(self):
+ """If this is a category redirect, return the target category
title."""
+ if self.isCategoryRedirect():
+ import catlib
+ return catlib.Category(self.site(), self._catredirect)
+ raise IsNotRedirectPage
+
+ def isEmpty(self):
+ """Return True if the page text has less than 4 characters.
+
+ Character count ignores language links and category links.
+ Can raise the same exceptions as get().
+
+ """
+ txt = self.get()
+ txt = removeLanguageLinks(txt, site = self.site())
+ txt = removeCategoryLinks(txt, site = self.site())
+ if len(txt) < 4:
+ return True
+ else:
+ return False
+
+ def isTalkPage(self):
+ """Return True if this page is in any talk
namespace."""
+ ns = self.namespace()
+ return ns >= 0 and ns % 2 == 1
+
+ def toggleTalkPage(self):
+ """Return other member of the article-talk page pair for this
Page.
+
+ If self is a talk page, returns the associated content page;
+ otherwise, returns the associated talk page.
+ Returns None if self is a special page.
+
+ """
+ ns = self.namespace()
+ if ns < 0: # Special page
+ return None
+ if self.isTalkPage():
+ ns -= 1
+ else:
+ ns += 1
+
+ if ns == 6:
+ return ImagePage(self.site(), self.title(withNamespace=False))
+
+ return Page(self.site(), self.title(withNamespace=False),
+ defaultNamespace=ns)
+
+ def isCategory(self):
+ """Return True if the page is a Category, False
otherwise."""
+ return self.namespace() == 14
+
+ def isImage(self):
+ """Return True if this is an image description page, False
otherwise."""
+ return self.namespace() == 6
+
+ def isDisambig(self, get_Index=True):
+ """Return True if this is a disambiguation page, False otherwise.
+
+ Relies on the presence of specific templates, identified in
+ the Family file or on a wiki page, to identify disambiguation
+ pages.
+
+ By default, loads a list of template names from the Family file;
+        if the value in the Family file is None (no entry was made), looks for
+        the list on [[MediaWiki:Disambiguationspage]]. If this page does not
+        exist, the MediaWiki message is used instead.
+
+ If get_Index is True then also load the templates for index articles
+ which are given on en-wiki
+
+ Template:Disambig is always assumed to be default, and will be
+ appended regardless of its existence.
+
+ """
+ if not hasattr(self, "_isDisambig"):
+ if not hasattr(self._site, "_disambigtemplates"):
+ try:
+ default = set(self._site.family.disambig('_default'))
+ except KeyError:
+ default = set([u'Disambig'])
+ try:
+ distl = self._site.family.disambig(self._site.lang,
+ fallback=False)
+ except KeyError:
+ distl = None
+ if distl is None:
+ try:
+ disambigpages = Page(self._site,
+ "MediaWiki:Disambiguationspage")
+ disambigs = set(link.title(withNamespace=False)
+ for link in disambigpages.linkedPages()
+ if link.namespace() == 10)
+ # add index article templates
+ if get_Index and \
+ self._site.sitename() == 'wikipedia:en':
+ regex = re.compile('\(\((.+?)\)\)')
+ content = disambigpages.get()
+ for index in regex.findall(content):
+ disambigs.add(index[:1].upper() + index[1:])
+ except NoPage:
+ disambigs = set([self._site.mediawiki_message(
+ 'Disambiguationspage').split(':', 1)[1]])
+ # add the default template(s)
+ self._site._disambigtemplates = disambigs | default
+ else:
+ # Normalize template capitalization
+ self._site._disambigtemplates = set(
+ t[:1].upper() + t[1:] for t in distl
+ )
+ disambigInPage = self._site._disambigtemplates.intersection(
+ self.templates())
+ self._isDisambig = self.namespace() != 10 and \
+ len(disambigInPage) > 0
+ return self._isDisambig
+
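+    # Illustrative sketch (not part of the original script): skip disambiguation
+    # pages while iterating over a page generator ('gen' is hypothetical).
+    #
+    #   for page in gen:
+    #       if page.isDisambig():
+    #           continue
+    #       # ... work on the non-disambiguation page ...
+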
+ def canBeEdited(self):
+ """Return bool indicating whether this page can be edited.
+
+ This returns True if and only if:
+ - page is unprotected, and bot has an account for this site, or
+ - page is protected, and bot has a sysop account for this site.
+
+ """
+ try:
+ self.get()
+ except:
+ pass
+ if self.editRestriction == 'sysop':
+ userdict = config.sysopnames
+ else:
+ userdict = config.usernames
+ try:
+ userdict[self.site().family.name][self.site().lang]
+ return True
+ except:
+ # We don't have a user account for that wiki, or the
+ # page is locked and we don't have a sysop account.
+ return False
+
+ def botMayEdit(self, username):
+ """Return True if this page allows bots to edit it.
+
+ This will be True if the page doesn't contain {{bots}} or
+ {{nobots}}, or it contains them and the active bot is allowed to
+ edit this page. (This method is only useful on those sites that
+ recognize the bot-exclusion protocol; on other sites, it will always
+ return True.)
+
+ The framework enforces this restriction by default. It is possible
+ to override this by setting ignore_bot_templates=True in
+ user-config.py, or using page.put(force=True).
+
+ """
+
+ if self.site().family.name == 'wikitravel': # Wikitravel's bot
control.
+ self.site().family.bot_control(self.site())
+
+ if config.ignore_bot_templates: #Check the "master ignore switch"
+ return True
+
+ try:
+ templates = self.templatesWithParams(get_redirect=True);
+ except (NoPage, IsRedirectPage, SectionError):
+ return True
+
+ for template in templates:
+ if template[0].lower() == 'nobots':
+ return False
+ elif template[0].lower() == 'bots':
+ if len(template[1]) == 0:
+ return True
+ else:
+ (ttype, bots) = template[1][0].split('=', 1)
+ bots = bots.split(',')
+ if ttype == 'allow':
+ if 'all' in bots or username in bots:
+ return True
+ else:
+ return False
+ if ttype == 'deny':
+ if 'all' in bots or username in bots:
+ return False
+ else:
+ return True
+ # no restricting template found
+ return True
+
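+    # Illustrative sketch (not part of the original script): respect {{bots}}/{{nobots}}
+    # explicitly before saving (put() already enforces this unless force=True);
+    # 'newtext' is a hypothetical unicode string.
+    #
+    #   username = page.site().loggedInAs()
+    #   if page.botMayEdit(username):
+    #       page.put(newtext, comment=u'Bot: updating page')
+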
+ def getReferences(self, follow_redirects=True, withTemplateInclusion=True,
+ onlyTemplateInclusion=False, redirectsOnly=False, internal = False):
+ """Yield all pages that link to the page by API
+
+ If you need a full list of referring pages, use this:
+ pages = [page for page in s.getReferences()]
+ Parameters:
+ * follow_redirects - if True, also returns pages that link to a
+ redirect pointing to the page.
+ * withTemplateInclusion - if True, also returns pages where self is
+ used as a template.
+ * onlyTemplateInclusion - if True, only returns pages where self is
+ used as a template.
+ * redirectsOnly - if True, only returns redirects to self.
+
+ """
+ if not self.site().has_api():
+ for s in self.getReferencesOld(follow_redirects, withTemplateInclusion,
onlyTemplateInclusion, redirectsOnly):
+ yield s
+ return
+
+ params = {
+ 'action': 'query',
+ 'list': [],
+ }
+ if not onlyTemplateInclusion:
+ params['list'].append('backlinks')
+ params['bltitle'] = self.title()
+ params['bllimit'] = config.special_page_limit
+ params['blfilterredir'] = 'all'
+ if follow_redirects:
+ params['blredirect'] = 1
+ if redirectsOnly:
+ params['blfilterredir'] = 'redirects'
+ if not self.site().isAllowed('apihighlimits') and
config.special_page_limit > 500:
+ params['bllimit'] = 500
+
+ if withTemplateInclusion or onlyTemplateInclusion:
+ params['list'].append('embeddedin')
+ params['eititle'] = self.title()
+ params['eilimit'] = config.special_page_limit
+ params['eifilterredir'] = 'all'
+ if follow_redirects:
+ params['eiredirect'] = 1
+ if redirectsOnly:
+ params['eifilterredir'] = 'redirects'
+ if not self.site().isAllowed('apihighlimits') and
config.special_page_limit > 500:
+ params['eilimit'] = 500
+
+ allDone = False
+
+ while not allDone:
+ if not internal:
+ output(u'Getting references to %s via API...'
+ % self.title(asLink=True))
+
+ datas = query.GetData(params, self.site())
+ data = datas['query'].values()
+ if len(data) == 2:
+ data = data[0] + data[1]
+ else:
+ data = data[0]
+
+ refPages = set()
+ for blp in data:
+ pg = Page(self.site(), blp['title'], defaultNamespace =
blp['ns'])
+ if pg in refPages:
+ continue
+
+ yield pg
+ refPages.add(pg)
+                if follow_redirects and 'redirect' in blp and 'redirlinks' in blp:
+                    for p in blp['redirlinks']:
+                        plk = Page(self.site(), p['title'], defaultNamespace = p['ns'])
+ if plk in refPages:
+ continue
+
+ yield plk
+ refPages.add(plk)
+ if follow_redirects and 'redirect' in p and plk != self:
+                                for zms in plk.getReferences(follow_redirects, withTemplateInclusion,
+                                                             onlyTemplateInclusion, redirectsOnly, internal=True):
+ yield zms
+ else:
+ continue
+ else:
+ continue
+
+ if 'query-continue' in datas:
+ if 'backlinks' in datas['query-continue']:
+ params['blcontinue'] =
datas['query-continue']['backlinks']['blcontinue']
+
+ if 'embeddedin' in datas['query-continue']:
+ params['eicontinue'] =
datas['query-continue']['embeddedin']['eicontinue']
+ else:
+ allDone = True
+
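+    # Illustrative sketch (not part of the original script): getReferences() is a
+    # generator, so it can be consumed lazily or collected into a list.
+    #
+    #   transclusions = [p for p in page.getReferences(onlyTemplateInclusion=True)]
+    #   redirects = [p for p in page.getReferences(redirectsOnly=True)]
+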
+
+ def getReferencesOld(self,
+ follow_redirects=True, withTemplateInclusion=True,
+ onlyTemplateInclusion=False, redirectsOnly=False):
+ """Yield all pages that link to the page.
+ """
+ # Temporary bug-fix while researching more robust solution:
+ if config.special_page_limit > 999:
+ config.special_page_limit = 999
+ site = self.site()
+ path = self.site().references_address(self.urlname())
+ if withTemplateInclusion:
+ path+=u'&hidetrans=0'
+ if onlyTemplateInclusion:
+            path+=u'&hidetrans=0&hidelinks=1&hideredirs=1&hideimages=1'
+        if redirectsOnly:
+            path+=u'&hideredirs=0&hidetrans=1&hidelinks=1&hideimages=1'
+ content = SoupStrainer("div", id=self.site().family.content_id)
+ try:
+ next_msg = self.site().mediawiki_message('whatlinkshere-next')
+ except KeyError:
+ next_msg = "next %i" % config.special_page_limit
+ plural = (config.special_page_limit == 1) and "\\1" or "\\2"
+ next_msg = re.sub(r"{{PLURAL:\$1\|(.*?)\|(.*?)}}", plural, next_msg)
+ nextpattern = re.compile("^%s$" % next_msg.replace("$1",
"[0-9]+"))
+ delay = 1
+ if self.site().has_mediawiki_message("Isredirect"):
+            self._isredirectmessage = self.site().mediawiki_message("Isredirect")
+        if self.site().has_mediawiki_message("Istemplate"):
+            self._istemplatemessage = self.site().mediawiki_message("Istemplate")
+ # to avoid duplicates:
+ refPages = set()
+ while path:
+ output(u'Getting references to %s' % self.title(asLink=True))
+ get_throttle()
+ txt = self.site().getUrl(path)
+ body = BeautifulSoup(txt,
+ convertEntities=BeautifulSoup.HTML_ENTITIES,
+ parseOnlyThese=content)
+ next_text = body.find(text=nextpattern)
+ if next_text is not None and next_text.parent.has_key('href'):
+                path = next_text.parent['href'].replace("&amp;", "&")
+ else:
+ path = ""
+ reflist = body.find("ul")
+ if reflist is None:
+ return
+ for page in self._parse_reflist(reflist,
+ follow_redirects, withTemplateInclusion,
+ onlyTemplateInclusion, redirectsOnly):
+ if page not in refPages:
+ yield page
+ refPages.add(page)
+
+ def _parse_reflist(self, reflist,
+ follow_redirects=True, withTemplateInclusion=True,
+ onlyTemplateInclusion=False, redirectsOnly=False):
+ """For internal use only
+
+ Parse a "Special:Whatlinkshere" list of references and yield Page
+ objects that meet the criteria (used by getReferences)
+ """
+ for link in reflist("li", recursive=False):
+ title = link.a.string
+ if title is None:
+ output(u"DBG> invalid <li> item in Whatlinkshere: %s"
% link)
+ try:
+ p = Page(self.site(), title)
+ except InvalidTitle:
+ output(u"DBG> Whatlinkshere:%s contains invalid link to %s"
+ % (self.title(), title))
+ continue
+ isredirect, istemplate = False, False
+ textafter = link.a.findNextSibling(text=True)
+ if textafter is not None:
+ if self.site().has_mediawiki_message("Isredirect") \
+ and self._isredirectmessage in textafter:
+ # make sure this is really a redirect to this page
+ # (MediaWiki will mark as a redirect any link that follows
+ # a #REDIRECT marker, not just the first one).
+ if p.getRedirectTarget().sectionFreeTitle() ==
self.sectionFreeTitle():
+ isredirect = True
+ if self.site().has_mediawiki_message("Istemplate") \
+ and self._istemplatemessage in textafter:
+ istemplate = True
+ if (withTemplateInclusion or onlyTemplateInclusion or not istemplate
+ ) and (not redirectsOnly or isredirect
+ ) and (not onlyTemplateInclusion or istemplate
+ ):
+ yield p
+ continue
+
+ if isredirect and follow_redirects:
+ sublist = link.find("ul")
+ if sublist is not None:
+ for p in self._parse_reflist(sublist,
+ follow_redirects, withTemplateInclusion,
+ onlyTemplateInclusion, redirectsOnly):
+ yield p
+
+ def _getActionUser(self, action, restriction = '', sysop = False):
+ """
+ Get the user to do an action: sysop or not sysop, or raise an exception
+ if the user cannot do that.
+
+ Parameters:
+ * action - the action to be done, which is the name of the right
+ * restriction - the restriction level or an empty string for no restriction
+ * sysop - initially use sysop user?
+ """
+ # Login
+ self.site().forceLogin(sysop = sysop)
+
+ # Check permissions
+ if not self.site().isAllowed(action, sysop):
+ if sysop:
+                raise LockedPage(u'The sysop user is not allowed to %s in site %s' % (action, self.site()))
+            else:
+                try:
+                    user = self._getActionUser(action, restriction, sysop = True)
+                    output(u'The user is not allowed to %s on site %s. Using sysop account.' % (action, self.site()))
+                    return user
+                except NoUsername:
+                    raise LockedPage(u'The user is not allowed to %s on site %s, and no sysop account is defined.' % (action, self.site()))
+ except LockedPage:
+ raise
+
+ # Check restrictions
+ if not self.site().isAllowed(restriction, sysop):
+ if sysop:
+                raise LockedPage(u'Page on %s is locked in a way that sysop user cannot %s it' % (self.site(), action))
+            else:
+                try:
+                    user = self._getActionUser(action, restriction, sysop = True)
+                    output(u'Page is locked on %s - cannot %s, using sysop account.' % (self.site(), action))
+                    return user
+                except NoUsername:
+                    raise LockedPage(u'Page is locked on %s - cannot %s, and no sysop account is defined.' % (self.site(), action))
+ except LockedPage:
+ raise
+
+ return sysop
+
+ def getRestrictions(self):
+ """
+ Get the protections on the page.
+ * Returns a restrictions dictionary. Keys are 'edit' and 'move',
+ Values are None (no restriction for that action) or [level, expiry] :
+ * level is the level of auth needed to perform that action
+ ('autoconfirmed' or 'sysop')
+ * expiry is the expiration time of the restriction
+ """
+ #, titles = None
+ #if titles:
+ # restrictions = {}
+ #else:
+ restrictions = { 'edit': None, 'move': None }
+ try:
+ api_url = self.site().api_address()
+ except NotImplementedError:
+ return restrictions
+
+ predata = {
+ 'action': 'query',
+ 'prop': 'info',
+ 'inprop': 'protection',
+ 'titles': self.title(),
+ }
+ #if titles:
+ # predata['titles'] = titles
+
+ text = query.GetData(predata, self.site())['query']['pages']
+
+ for pageid in text:
+ if 'missing' in text[pageid]:
+ self._getexception = NoPage
+ raise NoPage('Page %s does not exist' % self.title(asLink=True))
+ elif not 'pageid' in text[pageid]:
+ # Don't know what may happen here.
+ # We may want to have better error handling
+ raise Error("BUG> API problem.")
+ if text[pageid]['protection'] != []:
+ #if titles:
+                #    restrictions = dict([ detail['type'], [ detail['level'], detail['expiry'] ] ]
+                #                        for detail in text[pageid]['protection'])
+                #else:
+                restrictions = dict([ detail['type'], [ detail['level'], detail['expiry'] ] ]
+                                    for detail in text[pageid]['protection'])
+
+ return restrictions
+
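+    # Illustrative sketch (not part of the original script): the returned dict maps
+    # 'edit'/'move' to None or [level, expiry] as described above.
+    #
+    #   restrictions = page.getRestrictions()
+    #   if restrictions['edit'] and restrictions['edit'][0] == 'sysop':
+    #       wikipedia.output(u'%s is fully protected' % page.title())
+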
+ def put_async(self, newtext,
+ comment=None, watchArticle=None, minorEdit=True, force=False,
+ callback=None):
+ """Put page on queue to be saved to wiki asynchronously.
+
+ Asynchronous version of put (takes the same arguments), which places
+ pages on a queue to be saved by a daemon thread. All arguments are
+ the same as for .put(), except --
+
+ callback: a callable object that will be called after the page put
+ operation; this object must take two arguments:
+ (1) a Page object, and (2) an exception instance, which
+ will be None if the page was saved successfully.
+
+ The callback is intended to be used by bots that need to keep track
+ of which saves were successful.
+
+ """
+ try:
+ page_put_queue.mutex.acquire()
+ try:
+ _putthread.start()
+ except (AssertionError, RuntimeError):
+ pass
+ finally:
+ page_put_queue.mutex.release()
+ page_put_queue.put((self, newtext, comment, watchArticle, minorEdit,
+ force, callback))
+
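+    # Illustrative sketch (not part of the original script): track failed saves via
+    # the callback described above ('newtext' is a hypothetical unicode string).
+    #
+    #   def report(page, error):
+    #       if error is not None:
+    #           wikipedia.output(u'Saving %s failed: %s' % (page.title(), error))
+    #
+    #   page.put_async(newtext, comment=u'Bot: cleanup', callback=report)
+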
+ def put(self, newtext, comment=None, watchArticle=None, minorEdit=True,
+ force=False, sysop=False, botflag=True, maxTries=-1):
+        """Save the page with the contents of the first argument as the text.
+
+ Optional parameters:
+ comment: a unicode string that is to be used as the summary for
+ the modification.
+ watchArticle: a bool, add or remove this Page to/from bot user's
+ watchlist (if None, leave watchlist status unchanged)
+ minorEdit: mark this edit as minor if True
+ force: ignore botMayEdit() setting.
+ maxTries: the maximum amount of save attempts. -1 for infinite.
+ """
+ # Login
+ try:
+ self.get()
+ except:
+ pass
+        sysop = self._getActionUser(action = 'edit', restriction = self.editRestriction, sysop = sysop)
+ username = self.site().loggedInAs()
+
+ # Check blocks
+ self.site().checkBlocks(sysop = sysop)
+
+ # Determine if we are allowed to edit
+ if not force:
+ if not self.botMayEdit(username):
+ raise LockedPage(
+ u'Not allowed to edit %s because of a restricting template'
+ % self.title(asLink=True))
+ elif self.site().has_api() and self.namespace() in [2,3] \
+ and (self.title().endswith('.css') or \
+ self.title().endswith('.js')):
+ titleparts = self.title().split("/")
+ userpageowner = titleparts[0].split(":")[1]
+ if userpageowner != username:
+ # API enable: if title ends with .css or .js in ns2,3
+ # it needs permission to edit user pages
+ if self.title().endswith('css'):
+ permission = 'editusercss'
+ else:
+ permission = 'edituserjs'
+ sysop = self._getActionUser(action=permission,
+ restriction=self.editRestriction,
+ sysop=True)
+
+ # If there is an unchecked edit restriction, we need to load the page
+ if self._editrestriction:
+ output(
+u'Page %s is semi-protected. Getting edit page to find out if we are allowed to edit.'
+ % self.title(asLink=True))
+ oldtime = self.editTime()
+ # Note: change_edit_time=True is always True since
+ # self.get() calls self._getEditPage without this parameter
+ self.get(force=True, change_edit_time=True)
+ newtime = self.editTime()
+ ### TODO: we have different timestamp formats
+            if re.sub('\D', '', str(oldtime)) != re.sub('\D', '', str(newtime)): # page was changed
+ raise EditConflict(u'Page has been changed after first read.')
+ self._editrestriction = False
+ # If no comment is given for the change, use the default
+ comment = comment or action
+ if config.cosmetic_changes and not self.isTalkPage() and \
+ not calledModuleName() in ('cosmetic_changes', 'touch'):
+ if config.cosmetic_changes_mylang_only:
+                cc = (self.site().family.name == config.family and self.site().lang == config.mylang) or \
+                     self.site().family.name in config.cosmetic_changes_enable.keys() and \
+                     self.site().lang in config.cosmetic_changes_enable[self.site().family.name]
+            else:
+                cc = True
+            cc = cc and not \
+                 (self.site().family.name in config.cosmetic_changes_disable.keys() and \
+                  self.site().lang in config.cosmetic_changes_disable[self.site().family.name])
+ if cc:
+ old = newtext
+ if verbose:
+                    output(u'Cosmetic Changes for %s-%s enabled.' % (self.site().family.name, self.site().lang))
+                import cosmetic_changes
+                from pywikibot import i18n
+                ccToolkit = cosmetic_changes.CosmeticChangesToolkit(self.site(), redirect=self.isRedirectPage(), namespace = self.namespace(), pageTitle=self.title())
+                newtext = ccToolkit.change(newtext)
+                if comment and old.strip().replace('\r\n', '\n') != newtext.strip().replace('\r\n', '\n'):
+                    comment += i18n.twtranslate(self.site(), 'cosmetic_changes-append')
+
+ if watchArticle is None:
+ # if the page was loaded via get(), we know its status
+ if hasattr(self, '_isWatched'):
+ watchArticle = self._isWatched
+ else:
+ import watchlist
+ watchArticle = watchlist.isWatched(self.title(), site = self.site())
+ newPage = not self.exists()
+ # if posting to an Esperanto wiki, we must e.g. write Bordeauxx instead
+ # of Bordeaux
+ if self.site().lang == 'eo' and not self.site().has_api():
+ newtext = encodeEsperantoX(newtext)
+ comment = encodeEsperantoX(comment)
+
+ return self._putPage(newtext, comment, watchArticle, minorEdit,
+ newPage, self.site().getToken(sysop = sysop), sysop = sysop,
botflag=botflag, maxTries=maxTries)
+
+ def _encodeArg(self, arg, msgForError):
+ """Encode an ascii string/Unicode string to the site's
encoding"""
+ try:
+ return arg.encode(self.site().encoding())
+ except UnicodeDecodeError, e:
+ # happens when arg is a non-ascii bytestring :
+ # when reencoding bytestrings, python decodes first to ascii
+ e.reason += ' (cannot convert input %s string to unicode)' %
msgForError
+ raise e
+ except UnicodeEncodeError, e:
+ # happens when arg is unicode
+ e.reason += ' (cannot convert %s to wiki encoding %s)' %
(msgForError, self.site().encoding())
+ raise e
+
+ def _putPage(self, text, comment=None, watchArticle=False, minorEdit=True,
+ newPage=False, token=None, newToken=False, sysop=False,
+ captcha=None, botflag=True, maxTries=-1):
+ """Upload 'text' as new content of Page by API
+
+ Don't use this directly, use put() instead.
+
+ """
+ if not self.site().has_api() or self.site().versionnumber() < 13:
+ # api not enabled or version not supported
+ return self._putPageOld(text, comment, watchArticle, minorEdit,
+ newPage, token, newToken, sysop, captcha, botflag, maxTries)
+
+ retry_attempt = 0
+ retry_delay = 1
+ dblagged = False
+ params = {
+ 'action': 'edit',
+ 'title': self.title(),
+ 'text': self._encodeArg(text, 'text'),
+ 'summary': self._encodeArg(comment, 'summary'),
+ }
+
+ if token:
+ params['token'] = token
+ else:
+ params['token'] = self.site().getToken(sysop = sysop)
+
+ # Add server lag parameter (see config.py for details)
+ if config.maxlag:
+ params['maxlag'] = str(config.maxlag)
+
+ if self._editTime:
+ params['basetimestamp'] = self._editTime
+ else:
+ params['basetimestamp'] = time.strftime('%Y%m%d%H%M%S',
time.gmtime())
+
+ if self._startTime:
+ params['starttimestamp'] = self._startTime
+ else:
+ params['starttimestamp'] = time.strftime('%Y%m%d%H%M%S',
time.gmtime())
+
+ if botflag:
+ params['bot'] = 1
+
+ if minorEdit:
+ params['minor'] = 1
+ else:
+ params['notminor'] = 1
+
+ if watchArticle:
+ params['watch'] = 1
+ #else:
+ # params['unwatch'] = 1
+
+ if captcha:
+ params['captchaid'] = captcha['id']
+ params['captchaword'] = captcha['answer']
+
+ while True:
+ if (maxTries == 0):
+ raise MaxTriesExceededError()
+ maxTries -= 1
+ # Check whether we are not too quickly after the previous
+ # putPage, and wait a bit until the interval is acceptable
+ if not dblagged:
+ put_throttle()
+ # Which web-site host are we submitting to?
+ if newPage:
+ output(u'Creating page %s via API' % self.title(asLink=True))
+ params['createonly'] = 1
+ else:
+ output(u'Updating page %s via API' % self.title(asLink=True))
+ params['nocreate'] = 1
+ # Submit the prepared information
+ try:
+ response, data = query.GetData(params, self.site(), sysop=sysop,
back_response = True)
+ if isinstance(data,basestring):
+ raise KeyError
+ except httplib.BadStatusLine, line:
+ raise PageNotSaved('Bad status line: %s' % line.line)
+ except ServerError:
+ output(u''.join(traceback.format_exception(*sys.exc_info())))
+ retry_attempt += 1
+ if retry_attempt > config.maxretries:
+ raise
+                output(u'Got a server error when putting %s; will retry in %i minute%s.' % (self.title(asLink=True), retry_delay, retry_delay != 1 and "s" or ""))
+ time.sleep(60 * retry_delay)
+ retry_delay *= 2
+ if retry_delay > 30:
+ retry_delay = 30
+ continue
+ except ValueError: # API result cannot decode
+ output(u"Server error encountered; will retry in %i minute%s."
+ % (retry_delay, retry_delay != 1 and "s" or
""))
+ time.sleep(60 * retry_delay)
+ retry_delay *= 2
+ if retry_delay > 30:
+ retry_delay = 30
+ continue
+ # If it has gotten this far then we should reset dblagged
+ dblagged = False
+ # Check blocks
+ self.site().checkBlocks(sysop = sysop)
+            # A second text area means that an edit conflict has occurred.
+ if response.code == 500:
+ output(u"Server error encountered; will retry in %i minute%s."
+ % (retry_delay, retry_delay != 1 and "s" or
""))
+ time.sleep(60 * retry_delay)
+ retry_delay *= 2
+ if retry_delay > 30:
+ retry_delay = 30
+ continue
+ if 'error' in data:
+                #All available error keys in edit mode: (from ApiBase.php)
+                #    'noimageredirect-anon':"Anonymous users can't create image redirects",
+                #    'noimageredirect':"You don't have permission to create image redirects",
+                #    'filtered':"The filter callback function refused your edit",
+                #    'noedit-anon':"Anonymous users can't edit pages",
+                #    'noedit':"You don't have permission to edit pages",
+                #    'emptypage':"Creating new, empty pages is not allowed",
+                #    'badmd5':"The supplied MD5 hash was incorrect",
+                #    'notext':"One of the text, appendtext, prependtext and undo parameters must be set",
+                #    'emptynewsection':'Creating empty new sections is not possible.',
+                #    'revwrongpage':"r\$1 is not a revision of ``\$2''",
+                #    'undofailure':'Undo failed due to conflicting intermediate edits',
+
+ #for debug only
+ #------------------------
+ if verbose:
+                    output("error occurred, code:%s\ninfo:%s\nstatus:%s\nresponse:%s" % (
+                        data['error']['code'], data['error']['info'], response.code, response.msg))
+ faked = params
+ if 'text' in faked:
+ del faked['text']
+ output("OriginalData:%s" % faked)
+ del faked
+ #------------------------
+ errorCode = data['error']['code']
+ #cannot handle longpageerror and PageNoSave yet
+ if errorCode == 'maxlag' or response.code == 503:
+ # server lag; wait for the lag time and retry
+                    lagpattern = re.compile(r"Waiting for [\d.]+: (?P<lag>\d+) seconds? lagged")
+                    lag = lagpattern.search(data['error']['info'])
+                    timelag = int(lag.group("lag"))
+                    output(u"Pausing %d seconds due to database server lag." % min(timelag,300))
+ dblagged = True
+ time.sleep(min(timelag,300))
+ continue
+ elif errorCode == 'editconflict':
+ # 'editconflict':"Edit conflict detected",
+                    raise EditConflict(u'An edit conflict has occurred.')
+ elif errorCode == 'spamdetected':
+ # 'spamdetected':"Your edit was refused because it
contained a spam fragment: ``\$1''",
+ raise SpamfilterError(data['error']['info'][62:-2])
+ elif errorCode == 'pagedeleted':
+ # 'pagedeleted':"The page has been deleted since you
fetched its timestamp",
+ # Make sure your system clock is correct if this error occurs
+ # without any reason!
+ # raise EditConflict(u'Someone deleted the page.')
+ # No raise, simply define these variables and retry:
+ params['recreate'] = 1
+ if self._editTime:
+ params['basetimestamp'] = self._editTime
+ else:
+ params['basetimestamp'] =
time.strftime('%Y%m%d%H%M%S', time.gmtime())
+
+ if self._startTime:
+ params['starttimestamp'] = self._startTime
+ else:
+ params['starttimestamp'] =
time.strftime('%Y%m%d%H%M%S', time.gmtime())
+ continue
+ elif errorCode == 'readonly':
+                    # 'readonly':"The wiki is currently in read-only mode"
+                    output(u"The database is currently locked for write access; will retry in %i minute%s."
+                           % (retry_delay, retry_delay != 1 and "s" or ""))
+ time.sleep(60 * retry_delay)
+ retry_delay *= 2
+ if retry_delay > 30:
+ retry_delay = 30
+ continue
+ elif errorCode == 'contenttoobig':
+                    # 'contenttoobig':"The content you supplied exceeds the article size limit of \$1 kilobytes",
+                    raise LongPageError(len(params['text']), int(data['error']['info'][59:-10]))
+                elif errorCode in ['protectedpage', 'customcssjsprotected', 'cascadeprotected', 'protectednamespace', 'protectednamespace-interface']:
+                    # 'protectedpage':"The ``\$1'' right is required to edit this page"
+                    # 'cascadeprotected':"The page you're trying to edit is protected because it's included in a cascade-protected page"
+                    # 'customcssjsprotected': "You're not allowed to edit custom CSS and JavaScript pages"
+                    # 'protectednamespace': "You're not allowed to edit pages in the ``\$1'' namespace"
+                    # 'protectednamespace-interface':"You're not allowed to edit interface messages"
+ #
+ # The page is locked. This should have already been
+ # detected when getting the page, but there are some
+ # reasons why this didn't work, e.g. the page might be
+ # locked via a cascade lock.
+ try:
+                        # Page is locked - try using the sysop account, unless we're using one already
+                        if sysop: # Unknown permissions error
+                            raise LockedPage()
+                        else:
+                            self.site().forceLogin(sysop = True)
+                            output(u'Page is locked, retrying using sysop account.')
+                            return self._putPage(text, comment, watchArticle, minorEdit, newPage, token=self.site().getToken(sysop = True), sysop = True)
+ except NoUsername:
+ raise LockedPage()
+ elif errorCode == 'badtoken':
+ if newToken:
+ output(u"Edit token has failed. Giving up.")
+ else:
+ # We might have been using an outdated token
+ output(u"Edit token has failed. Retrying.")
+                        return self._putPage(text, comment, watchArticle, minorEdit, newPage, token=self.site().getToken(sysop = sysop, getagain = True), newToken = True, sysop = sysop)
+                # I think the error message title was changed from "Wikimedia Error"
+                # to "Wikipedia has a problem", but I'm not sure. Maybe we could
+                # just check for HTTP Status 500 (Internal Server Error)?
+                else:
+                    output("Unknown Error. API Error code:%s" % data['error']['code'] )
+                    output("Information:%s" % data['error']['info'])
+ else:
+ if data['edit']['result'] == u"Success":
+ #
+                    # The status code for a completed page update in ordinary mode is 302 - Found,
+                    # but the API always returns 200 - OK because it only sends "success" back as a string.
+                    # If the page update succeeded, we need to return code 302 to satisfy scripts
+                    # that check the status code.
+                    #
+ return 302, response.msg, data['edit']
+
+ solve = self.site().solveCaptcha(data)
+ if solve:
+ return self._putPage(text, comment, watchArticle, minorEdit, newPage,
token, newToken, sysop, captcha=solve)
+
+ return response.code, response.msg, data
+
+
+ def _putPageOld(self, text, comment=None, watchArticle=False, minorEdit=True,
+ newPage=False, token=None, newToken=False, sysop=False,
+ captcha=None, botflag=True, maxTries=-1):
+ """Upload 'text' as new content of Page by filling out the
edit form.
+
+ Don't use this directly, use put() instead.
+
+ """
+ host = self.site().hostname()
+ # Get the address of the page on that host.
+ address = self.site().put_address(self.urlname())
+ predata = {
+ 'wpSave': '1',
+ 'wpSummary': self._encodeArg(comment, 'edit summary'),
+ 'wpTextbox1': self._encodeArg(text, 'wikitext'),
+ # As of October 2008, MW HEAD requires wpSection to be set.
+ # We will need to fill this more smartly if we ever decide to edit by
section
+ 'wpSection': '',
+ }
+ if not botflag:
+ predata['bot']='0'
+ if captcha:
+ predata["wpCaptchaId"] = captcha['id']
+ predata["wpCaptchaWord"] = captcha['answer']
+ # Add server lag parameter (see config.py for details)
+ if config.maxlag:
+ predata['maxlag'] = str(config.maxlag)
+ # <s>Except if the page is new, we need to supply the time of the
+ # previous version to the wiki to prevent edit collisions</s>
+ # As of Oct 2008, these must be filled also for new pages
+ if self._editTime:
+ predata['wpEdittime'] = self._editTime
+ else:
+ predata['wpEdittime'] = time.strftime('%Y%m%d%H%M%S',
time.gmtime())
+ if self._startTime:
+ predata['wpStarttime'] = self._startTime
+ else:
+ predata['wpStarttime'] = time.strftime('%Y%m%d%H%M%S',
time.gmtime())
+ if self._revisionId:
+ predata['baseRevId'] = self._revisionId
+ # Pass the minorEdit and watchArticle arguments to the Wiki.
+ if minorEdit:
+ predata['wpMinoredit'] = '1'
+ if watchArticle:
+ predata['wpWatchthis'] = '1'
+ # Give the token, but only if one is supplied.
+ if token:
+ predata['wpEditToken'] = token
+
+ # Sorry, single-site exception...
+ if self.site().fam().name == 'loveto' and self.site().language() ==
'recipes':
+ predata['masteredit'] = '1'
+
+ retry_delay = 1
+ retry_attempt = 0
+ dblagged = False
+ wait = 5
+ while True:
+ if (maxTries == 0):
+ raise MaxTriesExceededError()
+ maxTries -= 1
+ # Check whether we are not too quickly after the previous
+ # putPage, and wait a bit until the interval is acceptable
+ if not dblagged:
+ put_throttle()
+ # Which web-site host are we submitting to?
+ if newPage:
+ output(u'Creating page %s' % self.title(asLink=True))
+ else:
+ output(u'Changing page %s' % self.title(asLink=True))
+ # Submit the prepared information
+ try:
+ response, data = self.site().postForm(address, predata, sysop)
+ if response.code == 503:
+ if 'x-database-lag' in response.msg.keys():
+ # server lag; Mediawiki recommends waiting 5 seconds
+ # and retrying
+ if verbose:
+ output(data, newline=False)
+ output(u"Pausing %d seconds due to database server
lag." % wait)
+ dblagged = True
+ time.sleep(wait)
+ wait = min(wait*2, 300)
+ continue
+ # Squid error 503
+ raise ServerError(response.code)
+ except httplib.BadStatusLine, line:
+ raise PageNotSaved('Bad status line: %s' % line.line)
+ except ServerError:
+ output(u''.join(traceback.format_exception(*sys.exc_info())))
+ retry_attempt += 1
+ if retry_attempt > config.maxretries:
+ raise
+ output(
+ u'Got a server error when putting %s; will retry in %i minute%s.'
+ % (self.title(asLink=True), retry_delay, retry_delay != 1 and
"s" or ""))
+ time.sleep(60 * retry_delay)
+ retry_delay *= 2
+ if retry_delay > 30:
+ retry_delay = 30
+ continue
+ # If it has gotten this far then we should reset dblagged
+ dblagged = False
+ # Check blocks
+ self.site().checkBlocks(sysop = sysop)
+            # A second text area means that an edit conflict has occurred.
+            editconflict1 = re.compile('id=["\']wpTextbox2[\'"] name="wpTextbox2"')
+            editconflict2 = re.compile('name="wpTextbox2" id="wpTextbox2"')
+            if editconflict1.search(data) or editconflict2.search(data):
+                raise EditConflict(u'An edit conflict has occurred.')
+
+ # remove the wpAntispam keyword before checking for Spamfilter
+ data = re.sub(u'(?s)<label for="wpAntispam">.*?</label>', '', data)
+ if self.site().has_mediawiki_message("spamprotectiontitle")\
+ and self.site().mediawiki_message('spamprotectiontitle') in data:
+ try:
+ reasonR = re.compile(re.escape(self.site().mediawiki_message('spamprotectionmatch')).replace('\$1', '(?P<url>[^<]*)'))
+ url = reasonR.search(data).group('url')
+ except:
+ # Some wikis have modified the spamprotectionmatch
+ # template in a way that the above regex doesn't work,
+ # e.g. on he.wikipedia the template includes a
+ # wikilink, and on fr.wikipedia there is bold text.
+ # This is a workaround for this: it takes the region
+ # which should contain the spamfilter report and the
+ # URL. It then searches for a plaintext URL.
+ relevant = data[data.find('<!-- start content -->')+22:data.find('<!-- end content -->')].strip()
+ # Throw away all the other links etc.
+ relevant = re.sub('<.*?>', '', relevant)
+ relevant = relevant.replace('&#58;', ':')
+ # MediaWiki only spam-checks HTTP links, and only the
+ # domain name part of the URL.
+ m = re.search('http://[\w\-\.]+', relevant)
+ if m:
+ url = m.group()
+ else:
+ # Can't extract the exact URL. Let the user search.
+ url = relevant
+ raise SpamfilterError(url)
+ if '<label for=\'wpRecreate\'' in data:
+ # Make sure your system clock is correct if this error occurs
+ # without any reason!
+ # raise EditConflict(u'Someone deleted the page.')
+ # No raise, simply define these variables and retry:
+ if self._editTime:
+ predata['wpEdittime'] = self._editTime
+ else:
+ predata['wpEdittime'] = time.strftime('%Y%m%d%H%M%S', time.gmtime())
+ if self._startTime:
+ predata['wpStarttime'] = self._startTime
+ else:
+ predata['wpStarttime'] = time.strftime('%Y%m%d%H%M%S', time.gmtime())
+ continue
+ if self.site().has_mediawiki_message("viewsource")\
+ and self.site().mediawiki_message('viewsource') in data:
+ # The page is locked. This should have already been
+ # detected when getting the page, but there are some
+ # reasons why this didn't work, e.g. the page might be
+ # locked via a cascade lock.
+ try:
+ # Page is locked - try using the sysop account, unless we're using one already
+ if sysop:
+ # Unknown permissions error
+ raise LockedPage()
+ else:
+ self.site().forceLogin(sysop = True)
+ output(u'Page is locked, retrying using sysop account.')
+ return self._putPageOld(text, comment, watchArticle, minorEdit, newPage, token=self.site().getToken(sysop = True), sysop = True)
+ except NoUsername:
+ raise LockedPage()
+ if not newToken and "<textarea" in data:
+ ##if "<textarea" in data: # for debug use only, if badtoken still happen
+ # We might have been using an outdated token
+ output(u"Changing page has failed. Retrying.")
+ return self._putPageOld(text, comment, watchArticle, minorEdit, newPage, token=self.site().getToken(sysop = sysop, getagain = True), newToken = True, sysop = sysop)
+ # I think the error message title was changed from "Wikimedia Error"
+ # to "Wikipedia has a problem", but I'm not sure. Maybe we could
+ # just check for HTTP Status 500 (Internal Server Error)?
+ if ("<title>Wikimedia Error</title>" in data or "has a problem</title>" in data) \
+ or response.code == 500:
+ output(u"Server error encountered; will retry in %i minute%s."
+ % (retry_delay, retry_delay != 1 and "s" or ""))
+ time.sleep(60 * retry_delay)
+ retry_delay *= 2
+ if retry_delay > 30:
+ retry_delay = 30
+ continue
+ if ("1213: Deadlock found when trying to get lock" in data):
+ output(u"Deadlock error encountered; will retry in %i minute%s."
+ % (retry_delay, retry_delay != 1 and "s" or ""))
+ time.sleep(60 * retry_delay)
+ retry_delay *= 2
+ if retry_delay > 30:
+ retry_delay = 30
+ continue
+ if self.site().mediawiki_message('readonly') in data or self.site().mediawiki_message('readonly_lag') in data:
+ output(u"The database is currently locked for write access; will retry in %i minute%s."
+ % (retry_delay, retry_delay != 1 and "s" or ""))
+ time.sleep(60 * retry_delay)
+ retry_delay *= 2
+ if retry_delay > 30:
+ retry_delay = 30
+ continue
+ if self.site().has_mediawiki_message('longpageerror'):
+ # FIXME: Long page error detection isn't working in Vietnamese Wikipedia.
+ long_page_errorR = re.compile(
+ # Some wikis (e.g. Lithuanian and Slovak Wikipedia) use {{plural}} in
+ # [[MediaWiki:longpageerror]]
+ re.sub(r'\\{\\{plural\\:.*?\\}\\}', '.*?',
+ re.escape(
+ html2unicode(
+ self.site().mediawiki_message('longpageerror')
+ )
+ )
+ ).replace("\$1", "(?P<length>[\d,.\s]+)", 1).replace("\$2", "(?P<limit>[\d,.\s]+)", 1),
+ re.UNICODE)
+
+ match = long_page_errorR.search(data)
+ if match:
+ # Some wikis (e.g. Lithuanian Wikipedia) don't use $2 parameter in
+ # [[MediaWiki:longpageerror]]
+ longpage_length = 0 ; longpage_limit = 0
+ if 'length' in match.groups():
+ longpage_length = match.group('length')
+ if 'limit' in match.groups():
+ longpage_limit = match.group('limit')
+ raise LongPageError(longpage_length, longpage_limit)
+
+ # We might have been prompted for a captcha if the
+ # account is not autoconfirmed, checking....
+ ## output('%s' % data) # WHY?
+ solve = self.site().solveCaptcha(data)
+ if solve:
+ return self._putPageOld(text, comment, watchArticle, minorEdit, newPage, token, newToken, sysop, captcha=solve)
+
+ # We are expecting a 302 to the action=view page. I'm not sure why this was removed in r5019
+ if response.code != 302 and data.strip() != u"":
+ # Something went wrong, and we don't know what. Show the
+ # HTML code that hopefully includes some error message.
+ output(u"ERROR: Unexpected response from wiki server.")
+ output(u" %s (%s) " % (response.code, response.msg))
+ output(data)
+ # Unexpected responses should raise an error and not pass,
+ # be it silently or loudly. This should raise an error
+
+ if 'name="wpTextbox1"' in data and 'var wgAction = "submit"' in data:
+ # We are on the preview page, so the page was not saved
+ raise PageNotSaved
+
+ return response.code, response.msg, data
+
+ ## @since r10311
+ # @remarks to support appending to single sections
+ def append(self, newtext, comment=None, minorEdit=True, section=0):
+ """Append the wiki-text to the page.
+
+ Returns the result of text append to page section number 'section'.
+ 0 for the top section, 'new' for a new section (end of page).
+ """
+
+ # If no comment is given for the change, use the default
+ comment = comment or pywikibot.action
+
+ # send data by POST request
+ params = {
+ 'action' : 'edit',
+ 'title' : self.title(),
+ 'section' : '%s' % section,
+ 'appendtext' : self._encodeArg(newtext, 'text'),
+ 'token' : self.site().getToken(),
+ 'summary' : self._encodeArg(comment, 'summary'),
+ 'bot' : 1,
+ }
+
+ if minorEdit:
+ params['minor'] = 1
+ else:
+ params['notminor'] = 1
+
+ response, data = query.GetData(params, self.site(), back_response = True)
+
+ if not (data['edit']['result'] == u"Success"):
+ raise PageNotSaved('Bad result returned: %s' % data['edit']['result'])
+
+ return response.code, response.msg, data
+
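A minimal usage sketch for the append() method above, assuming a configured pywikipedia installation where this module is imported as wikipedia; the page title and summary are illustrative:

    import wikipedia
    site = wikipedia.getSite()
    page = wikipedia.Page(site, u'Project:Sandbox')
    # section='new' appends the text as a new section at the end of the page
    code, msg, data = page.append(u'\n== Bot test ==\nHello, world.',
                                  comment=u'Bot: appending a test section',
                                  section='new')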
+ def protection(self):
+ """Return list of dicts of this page protection level. like:
+ [{u'expiry': u'2010-05-26T14:41:51Z', u'type':
u'edit', u'level': u'autoconfirmed'}, {u'expiry':
u'2010-05-26T14:41:51Z', u'type': u'move', u'level':
u'sysop'}]
+
+ if the page non protection, return []
+ """
+
+ params = {
+ 'action': 'query',
+ 'prop' : 'info',
+ 'inprop': 'protection',
+ 'titles' : self.title(),
+ }
+
+ datas = query.GetData(params, self.site())
+ data=datas['query']['pages'].values()[0]['protection']
+ return data
+
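A short, hedged example of reading the structure returned by protection(); the page title and expiry value are illustrative:

    import wikipedia
    page = wikipedia.Page(wikipedia.getSite(), u'Main Page')
    for entry in page.protection():
        # each entry is a dict such as {u'type': u'edit', u'level': u'sysop', u'expiry': u'infinity'}
        if entry['type'] == 'edit' and entry['level'] == 'sysop':
            wikipedia.output(u'%s is fully protected against editing' % page.title())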
+ def interwiki(self):
+ """Return a list of interwiki links in the page text.
+
+ This will retrieve the page to do its work, so it can raise
+ the same exceptions that are raised by the get() method.
+
+ The return value is a list of Page objects for each of the
+ interwiki links in the page text.
+
+ """
+ if hasattr(self, "_interwikis"):
+ return self._interwikis
+
+ text = self.get()
+
+ # Replace {{PAGENAME}} by its value
+ for pagenametext in self.site().pagenamecodes(
+ self.site().language()):
+ text = text.replace(u"{{%s}}" % pagenametext, self.title())
+
+ ll = getLanguageLinks(text, insite=self.site(), pageLink=self.title(asLink=True))
+
+ result = ll.values()
+
+ self._interwikis = result
+ return result
+
+
+
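A sketch of iterating the interwiki() result; it returns Page objects on the other-language sites (the example title is assumed):

    import wikipedia
    page = wikipedia.Page(wikipedia.getSite(), u'Python (programming language)')
    for linked in page.interwiki():
        wikipedia.output(u'%s: %s' % (linked.site().language(), linked.title()))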
+ def categories(self, get_redirect=False, api=False):
+ """Return a list of Category objects that the article is in.
+ Please be aware: the api call returns also categies which are included
+ by templates. This differs to the old non-api code. If you need only
+ these categories which are in the page text please use getCategoryLinks
+ (or set api=False but this could be deprecated in future).
+ """
+ if not (self.site().has_api() and api):
+ try:
+ category_links_to_return = getCategoryLinks(self.get(get_redirect=get_redirect), self.site())
+ except NoPage:
+ category_links_to_return = []
+ return category_links_to_return
+
+ else:
+ import catlib
+ params = {
+ 'action': 'query',
+ 'prop' : 'categories',
+ 'titles' : self.title(),
+ }
+ if not self.site().isAllowed('apihighlimits') and config.special_page_limit > 500:
+ params['cllimit'] = 500
+
+ output(u'Getting categories in %s via API...' % self.title(asLink=True))
+ allDone = False
+ cats=[]
+ while not allDone:
+ datas = query.GetData(params, self.site())
+ data=datas['query']['pages'].values()[0]
+ if "categories" in data:
+ for c in data['categories']:
+ if c['ns'] is 14:
+ cat = catlib.Category(self.site(), c['title'])
+ cats.append(cat)
+
+ if 'query-continue' in datas:
+ if 'categories' in datas['query-continue']:
+ params['clcontinue'] = datas['query-continue']['categories']['clcontinue']
+ else:
+ allDone = True
+ return cats
+
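An illustrative call to categories(); with api=True the list also contains categories added through templates, as noted in the docstring above (the title is assumed):

    import wikipedia
    page = wikipedia.Page(wikipedia.getSite(), u'Example article')
    for cat in page.categories(api=True):
        wikipedia.output(cat.title())    # includes template-added categories
    for cat in page.categories(api=False):
        wikipedia.output(cat.title())    # only categories written in the page text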
+ def linkedPages(self, withImageLinks = False):
+ """Return a list of Pages that this Page links to.
+
+ Excludes interwiki and category links, and also image links by default.
+ """
+ result = []
+ try:
+ thistxt = removeLanguageLinks(self.get(get_redirect=True),
+ self.site())
+ except NoPage:
+ raise
+ except IsRedirectPage:
+ raise
+ except SectionError:
+ return []
+ thistxt = removeCategoryLinks(thistxt, self.site())
+
+ # remove HTML comments, pre, nowiki, and includeonly sections
+ # from text before processing
+ thistxt = removeDisabledParts(thistxt)
+
+ # resolve {{ns:-1}} or {{ns:Help}}
+ thistxt = self.site().resolvemagicwords(thistxt)
+
+ for match in Rlink.finditer(thistxt):
+ title = match.group('title')
+ title = title.replace("_", " ").strip(" ")
+ if title.startswith("#"):
+ # this is an internal section link
+ continue
+ if not self.site().isInterwikiLink(title):
+ try:
+ page = Page(self.site(), title)
+ try:
+ hash(str(page))
+ except Exception:
+ raise Error(u"Page %s contains invalid link to [[%s]]."
+ % (self.title(), title))
+ except Error:
+ if verbose:
+ output(u"Page %s contains invalid link to [[%s]]."
+ % (self.title(), title))
+ continue
+ if not withImageLinks and page.isImage():
+ continue
+ if page.sectionFreeTitle() and page not in result:
+ result.append(page)
+ return result
+
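A hedged usage sketch for linkedPages(); the title is illustrative:

    import wikipedia
    page = wikipedia.Page(wikipedia.getSite(), u'Example article')
    targets = page.linkedPages()
    wikipedia.output(u'%d internal links found' % len(targets))
    for target in targets[:10]:
        wikipedia.output(target.title())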
+ def imagelinks(self, followRedirects=False, loose=False):
+ """Return a list of ImagePage objects for images displayed on this
Page.
+
+ Includes images in galleries.
+ If loose is True, this will find anything that looks like it
+ could be an image. This is useful for finding, say, images that are
+ passed as parameters to templates.
+
+ """
+ results = []
+ # Find normal images
+ for page in self.linkedPages(withImageLinks = True):
+ if page.isImage():
+ # convert Page object to ImagePage object
+ results.append( ImagePage(page.site(), page.title()) )
+ # Find images in galleries
+ pageText = self.get(get_redirect=followRedirects)
+ galleryR = re.compile('<gallery>.*?</gallery>', re.DOTALL)
+ galleryEntryR = re.compile('(?P<title>(%s|%s):.+?)(\|.+)?\n' % (self.site().image_namespace(), self.site().family.image_namespace(code = '_default')))
+ for gallery in galleryR.findall(pageText):
+ for match in galleryEntryR.finditer(gallery):
+ results.append( ImagePage(self.site(), match.group('title')) )
+ if loose:
+ ns = getSite().image_namespace()
+ imageR = re.compile('\w\w\w+\.(?:gif|png|jpg|jpeg|svg|JPG|xcf|pdf|mid|ogg|djvu)', re.IGNORECASE)
+ for imageName in imageR.findall(pageText):
+ results.append( ImagePage(self.site(), imageName) )
+ return list(set(results))
+
+ def templates(self, get_redirect=False):
+ """Return a list of titles (unicode) of templates used on this
Page.
+
+ Template parameters are ignored.
+ """
+ if not hasattr(self, "_templates"):
+ self._templates = list(set([template
+ for (template, param)
+ in self.templatesWithParams(
+ get_redirect=get_redirect)]))
+ return self._templates
+
+ def templatesWithParams(self, thistxt=None, get_redirect=False):
+ """Return a list of templates used on this Page.
+
+ Return value is a list of tuples. There is one tuple for each use of
+ a template in the page, with the template title as the first entry
+ and a list of parameters as the second entry.
+
+ If thistxt is set, it is used instead of current page content.
+ """
+ if not thistxt:
+ try:
+ thistxt = self.get(get_redirect=get_redirect)
+ except (IsRedirectPage, NoPage):
+ return []
+
+ # remove commented-out stuff etc.
+ thistxt = removeDisabledParts(thistxt)
+
+ # marker for inside templates or parameters
+ marker = findmarker(thistxt, u'@@', u'@')
+
+ # marker for links
+ marker2 = findmarker(thistxt, u'##', u'#')
+
+ # marker for math
+ marker3 = findmarker(thistxt, u'%%', u'%')
+
+ result = []
+ inside = {}
+ count = 0
+ Rtemplate = re.compile(
+ ur'{{(msg:)?(?P<name>[^{\|]+?)(\|(?P<params>[^{]*?))?}}')
+ Rlink = re.compile(ur'\[\[[^\]]+\]\]')
+ Rmath = re.compile(ur'<math>[^<]+</math>')
+ Rmarker = re.compile(ur'%s(\d+)%s' % (marker, marker))
+ Rmarker2 = re.compile(ur'%s(\d+)%s' % (marker2, marker2))
+ Rmarker3 = re.compile(ur'%s(\d+)%s' % (marker3, marker3))
+
+ # Replace math with markers
+ maths = {}
+ count = 0
+ for m in Rmath.finditer(thistxt):
+ count += 1
+ text = m.group()
+ thistxt = thistxt.replace(text, '%s%d%s' % (marker3, count, marker3))
+ maths[count] = text
+
+ while Rtemplate.search(thistxt) is not None:
+ for m in Rtemplate.finditer(thistxt):
+ # Make sure it is not detected again
+ count += 1
+ text = m.group()
+ thistxt = thistxt.replace(text,
+ '%s%d%s' % (marker, count, marker))
+ # Make sure stored templates don't contain markers
+ for m2 in Rmarker.finditer(text):
+ text = text.replace(m2.group(), inside[int(m2.group(1))])
+ for m2 in Rmarker3.finditer(text):
+ text = text.replace(m2.group(), maths[int(m2.group(1))])
+ inside[count] = text
+
+ # Name
+ name = m.group('name').strip()
+ m2 = Rmarker.search(name) or Rmath.search(name)
+ if m2 is not None:
+ # Doesn't detect templates whose name changes,
+ # or templates whose name contains math tags
+ continue
+ if self.site().isInterwikiLink(name):
+ continue
+
+ # {{#if: }}
+ if name.startswith('#'):
+ continue
+ # {{DEFAULTSORT:...}}
+ defaultKeys = self.site().versionnumber() > 13 and \
+ self.site().getmagicwords('defaultsort')
+ # It seems some wikis do not have this magic key
+ if defaultKeys:
+ found = False
+ for key in defaultKeys:
+ if name.startswith(key):
+ found = True
+ break
+ if found: continue
+
+ try:
+ name = Page(self.site(), name).title()
+ except InvalidTitle:
+ if name:
+ output(
+ u"Page %s contains invalid template name {{%s}}."
+ % (self.title(), name.strip()))
+ continue
+ # Parameters
+ paramString = m.group('params')
+ params = []
+ if paramString:
+ # Replace links to markers
+ links = {}
+ count2 = 0
+ for m2 in Rlink.finditer(paramString):
+ count2 += 1
+ text = m2.group()
+ paramString = paramString.replace(text,
+ '%s%d%s' % (marker2, count2, marker2))
+ links[count2] = text
+ # Parse string
+ markedParams = paramString.split('|')
+ # Replace markers
+ for param in markedParams:
+ for m2 in Rmarker.finditer(param):
+ param = param.replace(m2.group(),
+ inside[int(m2.group(1))])
+ for m2 in Rmarker2.finditer(param):
+ param = param.replace(m2.group(),
+ links[int(m2.group(1))])
+ for m2 in Rmarker3.finditer(param):
+ param = param.replace(m2.group(),
+ maths[int(m2.group(1))])
+ params.append(param)
+
+ # Add it to the result
+ result.append((name, params))
+ return result
+
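A small example of consuming templatesWithParams(); each entry is a (template name, parameter list) tuple, and the title is illustrative:

    import wikipedia
    page = wikipedia.Page(wikipedia.getSite(), u'Example article')
    for name, params in page.templatesWithParams():
        wikipedia.output(u'{{%s}} used with %d parameter(s)' % (name, len(params)))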
+ def getRedirectTarget(self):
+ """Return a Page object for the target this Page redirects to.
+
+ If this page is not a redirect page, will raise an IsNotRedirectPage
+ exception. This method also can raise a NoPage exception.
+
+ """
+ try:
+ self.get()
+ except NoPage:
+ raise
+ except IsRedirectPage, err:
+ # otherwise it will return error pages with " inside.
+ target = err[0].replace('&quot;', '"')
+
+ if '|' in target:
+ warnings.warn("'%s' has a | character, this makes no sense"
+ % target, Warning)
+ return Page(self.site(), target)
+ else:
+ raise IsNotRedirectPage(self)
+
+ def getVersionHistory(self, forceReload=False, reverseOrder=False,
+ getAll=False, revCount=500):
+ """Load the version history page and return history information.
+
+ Return value is a list of tuples, where each tuple represents one
+ edit and is built of revision id, edit date/time, user name,
+ edit summary, size and tags. Starts with the most current revision,
+ unless reverseOrder is True.
+ Defaults to getting the first revCount edits, unless getAll is True.
+
+ @param revCount: iterate no more than this number of revisions in total
+ """
+
+ # regular expression matching one edit in the version history.
+ # results will have 4 groups: oldid, edit date/time, user name, and edit
+ # summary.
+ thisHistoryDone = False
+ skip = False # Used in determining whether we need to skip the first page
+ dataQuery = []
+ hasData = False
+
+
+ # Are we getting by Earliest first?
+ if reverseOrder:
+ # Check if _versionhistoryearliest exists
+ if not hasattr(self, '_versionhistoryearliest') or forceReload:
+ self._versionhistoryearliest = []
+ elif getAll and len(self._versionhistoryearliest) == revCount:
+ # Cause a reload, or at least make the loop run
+ thisHistoryDone = False
+ skip = True
+ dataQuery = self._versionhistoryearliest
+ else:
+ thisHistoryDone = True
+ elif not hasattr(self, '_versionhistory') or forceReload or \
+ len(self._versionhistory) < revCount:
+ self._versionhistory = []
+ # ?? does not load if len(self._versionhistory) > revCount
+ # shouldn't it
+ elif getAll and len(self._versionhistory) == revCount:
+ # Cause a reload, or at least make the loop run
+ thisHistoryDone = False
+ skip = True
+ dataQuery = self._versionhistory
+ else:
+ thisHistoryDone = True
+
+ if not thisHistoryDone:
+ dataQuery.extend(self._getVersionHistory(getAll, skip, reverseOrder, revCount))
+
+ if reverseOrder:
+ # Return only revCount edits, even if the version history is extensive
+ if dataQuery != []:
+ self._versionhistoryearliest = dataQuery
+ del dataQuery
+ if len(self._versionhistoryearliest) > revCount and not getAll:
+ return self._versionhistoryearliest[:revCount]
+ return self._versionhistoryearliest
+
+ if dataQuery != []:
+ self._versionhistory = dataQuery
+ del dataQuery
+ # Return only revCount edits, even if the version history is extensive
+ if len(self._versionhistory) > revCount and not getAll:
+ return self._versionhistory[:revCount]
+ return self._versionhistory
+
+ def _getVersionHistory(self, getAll=False, skipFirst=False, reverseOrder=False,
+ revCount=500):
+ """Load history informations by API query.
+ Internal use for self.getVersionHistory(), don't use this function
directly.
+ """
+ if not self.site().has_api() or self.site().versionnumber() < 8:
+ return self._getVersionHistoryOld(getAll, skipFirst, reverseOrder, revCount)
+ dataQ = []
+ thisHistoryDone = False
+ params = {
+ 'action': 'query',
+ 'prop': 'revisions',
+ 'titles': self.title(),
+ 'rvprop': 'ids|timestamp|flags|comment|user|size|tags',
+ 'rvlimit': revCount,
+ }
+ while not thisHistoryDone:
+ if reverseOrder:
+ params['rvdir'] = 'newer'
+
+ result = query.GetData(params, self.site())
+ if 'error' in result:
+ raise RuntimeError("%s" % result['error'])
+ pageInfo = result['query']['pages'].values()[0]
+ if result['query']['pages'].keys()[0] == "-1":
+ if 'missing' in pageInfo:
+ raise NoPage(self.site(), unicode(self),
+ "Page does not exist.")
+ elif 'invalid' in pageInfo:
+ raise BadTitle('BadTitle: %s' % self)
+
+ if 'query-continue' in result and getAll:
+ params['rvstartid'] = result['query-continue']['revisions']['rvstartid']
+ else:
+ thisHistoryDone = True
+
+ if skipFirst:
+ skipFirst = False
+ else:
+ for r in pageInfo['revisions']:
+ c = ''
+ if 'comment' in r:
+ c = r['comment']
+ #revision id, edit date/time, user name, edit summary
+ (revidStrr, timestampStrr, userStrr) = (None, None, None)
+ if 'revid' in r:
+ revidStrr = r['revid']
+ if 'timestamp' in r:
+ timestampStrr = r['timestamp']
+ if 'user' in r:
+ userStrr = r['user']
+ s=-1 #Will return -1 if not found
+ if 'size' in r:
+ s = r['size']
+ tags=[]
+ if 'tags' in r:
+ tags = r['tags']
+ dataQ.append((revidStrr, timestampStrr, userStrr, c, s, tags))
+ if len(result['query']['pages'].values()[0]['revisions']) < revCount:
+ thisHistoryDone = True
+ return dataQ
+
+ def _getVersionHistoryOld(self, getAll = False, skipFirst = False,
+ reverseOrder = False, revCount=500):
+ """Load the version history page and return history information.
+ Internal use for self.getVersionHistory(), don't use this function
directly.
+ """
+ dataQ = []
+ thisHistoryDone = False
+ startFromPage = None
+ if self.site().versionnumber() < 4:
+ editR = re.compile('<li>\(.*?\)\s+\(.*\).*?<a href=".*?oldid=([0-9]*)" title=".*?">([^<]*)</a> <span class=\'user\'><a href=".*?" title=".*?">([^<]*?)</a></span>.*?(?:<span class=\'comment\'>(.*?)</span>)?</li>')
+ elif self.site().versionnumber() < 15:
+ editR = re.compile('<li>\(.*?\)\s+\(.*\).*?<a href=".*?oldid=([0-9]*)" title=".*?">([^<]*)</a> (?:<span class=\'history-user\'>|)<a href=".*?" title=".*?">([^<]*?)</a>.*?(?:</span>|).*?(?:<span class=[\'"]comment[\'"]>(.*?)</span>)?</li>')
+ elif self.site().versionnumber() < 16:
+ editR = re.compile(r'<li class=".*?">\((?:\w*|<a[^<]*</a>)\)\s\((?:\w*|<a[^<]*</a>)\).*?<a href=".*?([0-9]*)" title=".*?">([^<]*)</a> <span class=\'history-user\'><a [^>]*?>([^<]*?)</a>.*?</span></span>(?: <span class="minor">.*?</span>|)(?: <span class="history-size">.*?</span>|)(?: <span class=[\'"]comment[\'"]>\((?:<span class="autocomment">|)(.*?)(?:</span>|)\)</span>)?(?: \(<span class="mw-history-undo">.*?</span>\)|)\s*</li>', re.UNICODE)
+ else:
+ editR = re.compile(r'<li(?: class="mw-tag[^>]+)?>\((?:\w+|<a[^<]*</a>)\)\s\((?:\w+|<a[^<]*</a>)\).*?<a href=".*?([0-9]*)" title=".*?">([^<]*)</a> <span class=\'history-user\'><a [^>]*?>([^<]*?)</a>.*?</span></span>(?: <abbr class="minor"[^>]*?>.*?</abbr>|)(?: <span class="history-size">.*?</span>|)(?: <span class="comment">\((?:<span class="autocomment">|)(.*?)(?:</span>|)\)</span>)?(?: \(<span class="mw-history-undo">.*?</span>\))?(?: <span class="mw-tag-markers">.*?</span>\)</span>)?\s*</li>', re.UNICODE)
+
+ RLinkToNextPage = re.compile('&offset=(.*?)&')
+
+ while not thisHistoryDone:
+ path = self.site().family.version_history_address(self.site().language(), self.urlname(), config.special_page_limit)
+
+ if reverseOrder:
+ path += '&dir=prev'
+
+ if startFromPage:
+ path += '&offset=' + startFromPage
+
+ # this loop will run until the page could be retrieved
+ # Try to retrieve the page until it was successfully loaded (just in case
+ # the server is down or overloaded)
+ # wait for retry_idle_time minutes (growing!) between retries.
+ retry_idle_time = 1
+
+ if verbose:
+ if startFromPage:
+ output(u'Continuing to get version history of %s' % self)
+ else:
+ output(u'Getting version history of %s' % self)
+
+ txt = self.site().getUrl(path)
+
+ # save a copy of the text
+ self_txt = txt
+
+ #Find the nextPage link, if not exist, the page is last history page
+ matchObj = RLinkToNextPage.search(self_txt)
+ if getAll and matchObj:
+ startFromPage = matchObj.group(1)
+ else:
+ thisHistoryDone = True
+
+ if not skipFirst:
+ edits = editR.findall(self_txt)
+
+ if skipFirst:
+ # Skip the first page only,
+ skipFirst = False
+ else:
+ if reverseOrder:
+ edits.reverse()
+ #for edit in edits:
+ dataQ.extend(edits)
+ if len(edits) < revCount:
+ thisHistoryDone = True
+ return dataQ
+
+ def getVersionHistoryTable(self, forceReload=False, reverseOrder=False,
+ getAll=False, revCount=500):
+ """Return the version history as a wiki table."""
+
+ result = '{| class="wikitable"\n'
+ result += '! oldid || date/time || size || username || edit summary\n'
+ for oldid, time, username, summary, size, tags \
+ in self.getVersionHistory(forceReload=forceReload,
+ reverseOrder=reverseOrder,
+ getAll=getAll, revCount=revCount):
+ result += '|----\n'
+ result += '| %s || %s || %d || %s || <nowiki>%s</nowiki>\n' \
+ % (oldid, time, size, username, summary)
+ result += '|}\n'
+ return result
+
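A sketch of the version history accessors defined above; on API-capable sites the tuple layout is (revid, timestamp, user, comment, size, tags), and the title is illustrative:

    import wikipedia
    page = wikipedia.Page(wikipedia.getSite(), u'Example article')
    for revid, timestamp, user, comment, size, tags in page.getVersionHistory(revCount=5):
        wikipedia.output(u'%s %s %s' % (timestamp, user, comment))
    wikitable = page.getVersionHistoryTable(revCount=5)   # same data rendered as wikitext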
+ def fullVersionHistory(self, getAll=False, skipFirst=False, reverseOrder=False,
+ revCount=500):
+ """Iterate previous versions including wikitext.
+
+ Gives a list of tuples consisting of revision ID, edit date/time, user name and
+ content
+
+ """
+ if not self.site().has_api() or self.site().versionnumber() < 8:
+ address = self.site().export_address()
+ predata = {
+ 'action': 'submit',
+ 'pages': self.title()
+ }
+ get_throttle(requestsize = 10)
+ now = time.time()
+ response, data = self.site().postForm(address, predata)
+ data = data.encode(self.site().encoding())
+# get_throttle.setDelay(time.time() - now)
+ output = []
+ # TODO: parse XML using an actual XML parser instead of regex!
+ r = re.compile("\<revision\>.*?\<id\>(?P<id>.*?)\<\/id\>.*?\<timestamp\>(?P<timestamp>.*?)\<\/timestamp\>.*?\<(?:ip|username)\>(?P<user>.*?)\</(?:ip|username)\>.*?\<text.*?\>(?P<content>.*?)\<\/text\>",re.DOTALL)
+ #r = re.compile("\<revision\>.*?\<timestamp\>(.*?)\<\/timestamp\>.*?\<(?:ip|username)\>(.*?)\<",re.DOTALL)
+ return [ (match.group('id'),
+ match.group('timestamp'),
+ unescape(match.group('user')),
+ unescape(match.group('content')))
+ for match in r.finditer(data) ]
+
+ # Load history informations by API query.
+
+ dataQ = []
+ thisHistoryDone = False
+ params = {
+ 'action': 'query',
+ 'prop': 'revisions',
+ 'titles': self.title(),
+ 'rvprop': 'ids|timestamp|user|content',
+ 'rvlimit': revCount,
+ }
+ while not thisHistoryDone:
+ if reverseOrder:
+ params['rvdir'] = 'newer'
+
+ result = query.GetData(params, self.site())
+ if 'error' in result:
+ raise RuntimeError("%s" % result['error'])
+ pageInfo = result['query']['pages'].values()[0]
+ if result['query']['pages'].keys()[0] == "-1":
+ if 'missing' in pageInfo:
+ raise NoPage(self.site(), unicode(self),
+ "Page does not exist.")
+ elif 'invalid' in pageInfo:
+ raise BadTitle('BadTitle: %s' % self)
+
+ if 'query-continue' in result and getAll:
+ params['rvstartid'] = result['query-continue']['revisions']['rvstartid']
+ else:
+ thisHistoryDone = True
+
+ if skipFirst:
+ skipFirst = False
+ else:
+ for r in pageInfo['revisions']:
+ c = ''
+ if 'comment' in r:
+ c = r['comment']
+ #revision id, edit date/time, user name, edit summary
+ (revidStrr, timestampStrr, userStrr) = (None, None, None)
+ if 'revid' in r:
+ revidStrr = r['revid']
+ if 'timestamp' in r:
+ timestampStrr = r['timestamp']
+ if 'user' in r:
+ userStrr = r['user']
+ s='' #Will return -1 if not found
+ if '*' in r:
+ s = r['*']
+ dataQ.append((revidStrr, timestampStrr, userStrr, s))
+ if len(result['query']['pages'].values()[0]['revisions']) < revCount:
+ thisHistoryDone = True
+ return dataQ
+
+ def contributingUsers(self, step=None, total=None):
+ """Return a set of usernames (or IPs) of users who edited this
page.
+
+ @param step: limit each API call to this number of revisions
+ - not used yet, only in rewrite branch -
+ @param total: iterate no more than this number of revisions in total
+
+ """
+ if total is None:
+ total = 500 #set to default of getVersionHistory
+ edits = self.getVersionHistory(revCount=total)
+ users = set([edit[2] for edit in edits])
+ return users
+
+ def getCreator(self):
+ """ Function to get the first editor and time stamp of a page
"""
+ inf = self.getVersionHistory(reverseOrder=True, revCount=1)[0]
+ return inf[2], inf[1]
+
+ def getLatestEditors(self, limit=1):
+ """ Function to get the last editors of a page """
+
#action=query&prop=revisions&titles=API&rvprop=timestamp|user|comment
+ if hasattr(self, '_versionhistory'):
+ data = self.getVersionHistory(getAll=True, revCount=limit)
+ else:
+ data = self.getVersionHistory(revCount = limit)
+
+ result = []
+ for i in data:
+ result.append({'user':i[2], 'timestamp':i[1]})
+ return result
+
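An illustrative use of the editor-related helpers above (the title is assumed):

    import wikipedia
    page = wikipedia.Page(wikipedia.getSite(), u'Example article')
    wikipedia.output(u'%d distinct contributors' % len(page.contributingUsers()))
    creator, created = page.getCreator()
    wikipedia.output(u'created by %s at %s' % (creator, created))
    last = page.getLatestEditors(limit=3)   # list of {'user': ..., 'timestamp': ...} dicts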
+ def watch(self, unwatch=False):
+ """Add this page to the watchlist"""
+ if self.site().has_api:
+ params = {
+ 'action': 'watch',
+ 'title': self.title()
+ }
+ # watchtoken is needed for mw 1.18
+ # TODO: Find a better implementation for other actions too
+ # who needs a token
+ if self.site().versionnumber() >= 18:
+ api = {
+ 'action': 'query',
+ 'prop': 'info',
+ 'titles' : self.title(),
+ 'intoken' : 'watch',
+ }
+ data = query.GetData(api, self.site())
+ params['token'] = data['query']['pages'].values()[0]['watchtoken']
+ if unwatch:
+ params['unwatch'] = ''
+
+ data = query.GetData(params, self.site())
+ if 'error' in data:
+ raise RuntimeError("API query error: %s" % data['error'])
+ else:
+ urlname = self.urlname()
+ if not unwatch:
+ address = self.site().watch_address(urlname)
+ else:
+ address = self.site().unwatch_address(urlname)
+ response = self.site().getUrl(address)
+ return response
+
+ def unwatch(self):
+ self.watch(unwatch=True)
+
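A hedged sketch of watch()/unwatch(); both act on the watchlist of the logged-in bot account, and the title is illustrative:

    import wikipedia
    page = wikipedia.Page(wikipedia.getSite(), u'Example article')
    page.watch()      # add the page to the watchlist
    page.unwatch()    # and remove it again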
+ def move(self, newtitle, reason=None, movetalkpage=True, movesubpages=False,
+ sysop=False, throttle=True, deleteAndMove=False, safe=True,
+ fixredirects=True, leaveRedirect=True):
+ """Move this page to new title.
+
+ * fixredirects has no effect in MW < 1.13
+
+ @param newtitle: The new page title.
+ @param reason: The edit summary for the move.
+ @param movetalkpage: If true, move this page's talk page (if it exists)
+ @param sysop: Try to move using sysop account, if available
+ @param deleteAndMove: if move succeeds, delete the old page
+ (usually requires sysop privileges, depending on wiki settings)
+ @param safe: If false, attempt to delete existing page at newtitle
+ (if there is one) and then move this page to that title
+
+ """
+ if not self.site().has_api() or self.site().versionnumber() < 12:
+ return self._moveOld(newtitle, reason, movetalkpage, sysop,
+ throttle, deleteAndMove, safe, fixredirects, leaveRedirect)
+ # Login
+ try:
+ self.get()
+ except:
+ pass
+ sysop = self._getActionUser(action = 'move', restriction = self.moveRestriction, sysop = False)
+ if deleteAndMove:
+ sysop = self._getActionUser(action = 'delete', restriction = '', sysop = True)
+ Page(self.site(), newtitle).delete(self.site().mediawiki_message('delete_and_move_reason'), False, False)
+
+ # Check blocks
+ self.site().checkBlocks(sysop = sysop)
+
+ if throttle:
+ put_throttle()
+ if reason is None:
+ pywikibot.output(u'Moving %s to [[%s]].'
+ % (self.title(asLink=True), newtitle))
+ reason = input(u'Please enter a reason for the move:')
+ if self.isTalkPage():
+ movetalkpage = False
+
+ params = {
+ 'action': 'move',
+ 'from': self.title(),
+ 'to': newtitle,
+ 'token': self.site().getToken(sysop=sysop),
+ 'reason': reason,
+ }
+ if movesubpages:
+ params['movesubpages'] = 1
+
+ if movetalkpage:
+ params['movetalk'] = 1
+
+ if not leaveRedirect:
+ params['noredirect'] = 1
+
+ result = query.GetData(params, self.site(), sysop=sysop)
+ if 'error' in result:
+ err = result['error']['code']
+ if err == 'articleexists':
+ if safe:
+ output(u'Page move failed: Target page [[%s]] already exists.' % newtitle)
+ else:
+ try:
+ # Try to delete and move
+ return self.move(newtitle, reason, movetalkpage, movesubpages, throttle = throttle, deleteAndMove = True)
+ except NoUsername:
+ # We don't have the user rights to delete
+ output(u'Page move failed: Target page [[%s]] already exists.' % newtitle)
+ #elif err == 'protectedpage':
+ #
+ else:
+ output("Unknown Error: %s" % result)
+ return False
+ elif 'move' in result:
+ if deleteAndMove:
+ output(u'Page %s moved to %s, deleting the existing page' % (self.title(), newtitle))
+ else:
+ output(u'Page %s moved to %s' % (self.title(), newtitle))
+
+ if hasattr(self, '_contents'):
+ #self.__init__(self.site(), newtitle, defaultNamespace = self._namespace)
+ try:
+ self.get(force=True, get_redirect=True, throttle=False)
+ except NoPage:
+ output(u'Page %s was moved and no longer exists.' % self.title() )
+ #delattr(self, '_contents')
+ return True
+
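A minimal sketch of move(); the titles and summary are illustrative, and movetalkpage/leaveRedirect simply make the default behaviour explicit:

    import wikipedia
    page = wikipedia.Page(wikipedia.getSite(), u'Old title')
    page.move(u'New title', reason=u'Bot: renaming per naming convention',
              movetalkpage=True, leaveRedirect=True)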
+ def _moveOld(self, newtitle, reason=None, movetalkpage=True, movesubpages=False, sysop=False,
+ throttle=True, deleteAndMove=False, safe=True, fixredirects=True, leaveRedirect=True):
+
+ # Login
+ try:
+ self.get()
+ except:
+ pass
+ sysop = self._getActionUser(action = 'move', restriction = self.moveRestriction, sysop = False)
+ if deleteAndMove:
+ sysop = self._getActionUser(action = 'delete', restriction = '', sysop = True)
+
+ # Check blocks
+ self.site().checkBlocks(sysop = sysop)
+
+ if throttle:
+ put_throttle()
+ if reason is None:
+ reason = input(u'Please enter a reason for the move:')
+ if self.isTalkPage():
+ movetalkpage = False
+
+ host = self.site().hostname()
+ address = self.site().move_address()
+ token = self.site().getToken(sysop = sysop)
+ predata = {
+ 'wpOldTitle': self.title().encode(self.site().encoding()),
+ 'wpNewTitle': newtitle.encode(self.site().encoding()),
+ 'wpReason': reason.encode(self.site().encoding()),
+ }
+ if deleteAndMove:
+ predata['wpDeleteAndMove'] = self.site().mediawiki_message('delete_and_move_confirm')
+ predata['wpConfirm'] = '1'
+
+ if movetalkpage:
+ predata['wpMovetalk'] = '1'
+ else:
+ predata['wpMovetalk'] = '0'
+
+ if self.site().versionnumber() >= 13:
+ if fixredirects:
+ predata['wpFixRedirects'] = '1'
+ else:
+ predata['wpFixRedirects'] = '0'
+
+ if leaveRedirect:
+ predata['wpLeaveRedirect'] = '1'
+ else:
+ predata['wpLeaveRedirect'] = '0'
+
+ if movesubpages:
+ predata['wpMovesubpages'] = '1'
+ else:
+ predata['wpMovesubpages'] = '0'
+
+ if token:
+ predata['wpEditToken'] = token
+
+ response, data = self.site().postForm(address, predata, sysop = sysop)
+
+ if data == u'' or self.site().mediawiki_message('pagemovedsub') in data:
+ #Move Success
+ if deleteAndMove:
+ output(u'Page %s moved to %s, deleting the existing page' % (self.title(), newtitle))
+ else:
+ output(u'Page %s moved to %s' % (self.title(), newtitle))
+
+ if hasattr(self, '_contents'):
+ #self.__init__(self.site(), newtitle, defaultNamespace = self._namespace)
+ try:
+ self.get(force=True, get_redirect=True, throttle=False)
+ except NoPage:
+ output(u'Page %s was moved and no longer exists.' % self.title() )
+ #delattr(self, '_contents')
+
+ return True
+ else:
+ #Move Failure
+ self.site().checkBlocks(sysop = sysop)
+ if self.site().mediawiki_message('articleexists') in data or self.site().mediawiki_message('delete_and_move') in data:
+ if safe:
+ output(u'Page move failed: Target page [[%s]] already exists.' % newtitle)
+ return False
+ else:
+ try:
+ # Try to delete and move
+ return self._moveOld(newtitle, reason, movetalkpage, movesubpages, throttle = throttle, deleteAndMove = True)
+ except NoUsername:
+ # We don't have the user rights to delete
+ output(u'Page move failed: Target page [[%s]] already exists.' % newtitle)
+ return False
+ elif not self.exists():
+ raise NoPage(u'Page move failed: Source page [[%s]] does not exist.' % newtitle)
+ elif Page(self.site(),newtitle).exists():
+ # XXX : This might be buggy : if the move was successful, the target page *has* been created
+ raise PageNotSaved(u'Page move failed: Target page [[%s]] already exists.' % newtitle)
+ else:
+ output(u'Page move failed for unknown reason.')
+ try:
+ ibegin = data.index('<!-- start content -->') + 22
+ iend = data.index('<!-- end content -->')
+ except ValueError:
+ # if begin/end markers weren't found, show entire HTML file
+ output(data)
+ else:
+ # otherwise, remove the irrelevant sections
+ data = data[ibegin:iend]
+ output(data)
+ return False
+
+ def delete(self, reason=None, prompt=True, throttle=True, mark=False):
+ """Deletes the page from the wiki. Requires administrator status.
+
+ @param reason: The edit summary for the deletion. If None, ask for it.
+ @param prompt: If true, prompt user for confirmation before deleting.
+ @param mark: if true, and user does not have sysop rights, place a
+ speedy-deletion request on the page instead.
+
+ """
+ # Login
+ try:
+ self._getActionUser(action = 'delete', sysop = True)
+ except NoUsername:
+ if mark and self.exists():
+ text = self.get(get_redirect = True)
+ output(u'Cannot delete page %s - marking the page for deletion instead:' % self.title(asLink=True))
+ # Note: Parameters to {{delete}}, and their meanings, vary from one Wikipedia to another.
+ # If you want or need to use them, you must be careful not to break others. Else don't.
+ self.put(u'{{delete|bot=yes}}\n%s --~~~~\n----\n\n%s' % (reason, text), comment = reason)
+ return
+ else:
+ raise
+
+ # Check blocks
+ self.site().checkBlocks(sysop = True)
+
+ if throttle:
+ put_throttle()
+ if reason is None:
+ output(u'Deleting %s.' % (self.title(asLink=True)))
+ reason = input(u'Please enter a reason for the deletion:')
+ answer = u'y'
+ if prompt and not hasattr(self.site(), '_noDeletePrompt'):
+ answer = inputChoice(u'Do you want to delete %s?' % self,
+ ['yes', 'no', 'all'], ['y', 'N', 'a'], 'N')
+ if answer == 'a':
+ answer = 'y'
+ self.site()._noDeletePrompt = True
+ if answer == 'y':
+
+ token = self.site().getToken(self, sysop = True)
+ reason = reason.encode(self.site().encoding())
+
+ if self.site().has_api() and self.site().versionnumber() >= 12:
+ #API Mode
+ params = {
+ 'action': 'delete',
+ 'title': self.title(),
+ 'token': token,
+ 'reason': reason,
+ }
+ datas = query.GetData(params, self.site(), sysop = True)
+ if 'delete' in datas:
+ output(u'Page %s deleted' % self)
+ return True
+ else:
+ if datas['error']['code'] == 'missingtitle':
+ output(u'Page %s could not be deleted - it doesn\'t exist'
+ % self)
+ else:
+ output(u'Deletion of %s failed for an unknown reason. The response text is:'
+ % self)
+ output('%s' % datas)
+
+ return False
+ else:
+ #Ordinary mode from webpage.
+ host = self.site().hostname()
+ address = self.site().delete_address(self.urlname())
+
+ predata = {
+ 'wpDeleteReasonList': 'other',
+ 'wpReason': reason,
+ #'wpComment': reason, <- which version?
+ 'wpConfirm': '1',
+ 'wpConfirmB': '1',
+ 'wpEditToken': token,
+ }
+ response, data = self.site().postForm(address, predata, sysop = True)
+ if data:
+ self.site().checkBlocks(sysop = True)
+ if self.site().mediawiki_message('actioncomplete') in data:
+ output(u'Page %s deleted' % self)
+ return True
+ elif self.site().mediawiki_message('cannotdelete') in data:
+ output(u'Page %s could not be deleted - it doesn\'t exist'
+ % self)
+ return False
+ else:
+ output(u'Deletion of %s failed for an unknown reason. The response text is:'
+ % self)
+ try:
+ ibegin = data.index('<!-- start content -->') + 22
+ iend = data.index('<!-- end content -->')
+ except ValueError:
+ # if begin/end markers weren't found, show entire HTML file
+ output(data)
+ else:
+ # otherwise, remove the irrelevant sections
+ data = data[ibegin:iend]
+ output(data)
+ return False
+
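A usage sketch for delete(); without sysop rights, mark=True falls back to tagging the page with {{delete}} as described above (title and summary are illustrative):

    import wikipedia
    page = wikipedia.Page(wikipedia.getSite(), u'Some obsolete subpage')
    page.delete(reason=u'Bot: housekeeping', prompt=False, mark=True)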
+ def loadDeletedRevisions(self, step=None, total=None):
+ """Retrieve all deleted revisions for this Page from
Special/Undelete.
+
+ Stores all revisions' timestamps, dates, editors and comments in
+ self._deletedRevs attribute.
+
+ @return: list of timestamps (which can be used to retrieve
+ revisions later on).
+
+ """
+ # Login
+ self._getActionUser(action = 'deletedhistory', sysop = True)
+
+ #TODO: Handle image file revisions too.
+ output(u'Loading list of deleted revisions for [[%s]]...' % self.title())
+
+ self._deletedRevs = {}
+
+ if self.site().has_api() and self.site().versionnumber() >= 12:
+ params = {
+ 'action': 'query',
+ 'list': 'deletedrevs',
+ 'drfrom': self.title(withNamespace=False),
+ 'drnamespace': self.namespace(),
+ 'drprop': ['revid','user','comment','content'],#','minor','len','token'],
+ 'drlimit': 100,
+ 'drdir': 'older',
+ #'': '',
+ }
+ count = 0
+ while True:
+ data = query.GetData(params, self.site(), sysop=True)
+ for x in data['query']['deletedrevs']:
+ if x['title'] != self.title():
+ continue
+
+ for y in x['revisions']:
+ count += 1
+ self._deletedRevs[parsetime2stamp(y['timestamp'])] = [y['timestamp'], y['user'], y['comment'] , y['*'], False]
+
+ if 'query-continue' in data:
+ # get the continue key for backward compatibility
+ # with pre 1.20wmf8
+ contKey = data['query-continue']['deletedrevs'].keys()[0]
+ if data['query-continue']['deletedrevs'][contKey].split('|')[1] == self.title(withNamespace=False):
+ params[contKey] = data['query-continue']['deletedrevs'][contKey]
+ else: break
+ else:
+ break
+ self._deletedRevsModified = False
+
+ else:
+ address = self.site().undelete_view_address(self.urlname())
+ text = self.site().getUrl(address, sysop = True)
+ #TODO: Handle non-existent pages etc
+
+ rxRevs = re.compile(r'<input name="(?P<ts>(?:ts|fileid)\d+)".*?title=".*?">(?P<date>.*?)</a>.*?title=".*?">(?P<editor>.*?)</a>.*?<span class="comment">\((?P<comment>.*?)\)</span>',re.DOTALL)
+ for rev in rxRevs.finditer(text):
+ self._deletedRevs[rev.group('ts')] = [
+ rev.group('date'),
+ rev.group('editor'),
+ rev.group('comment'),
+ None, #Revision text
+ False, #Restoration marker
+ ]
+
+ self._deletedRevsModified = False
+
+ return self._deletedRevs.keys()
+
+ def getDeletedRevision(self, timestamp, retrieveText=False):
+ """Return a particular deleted revision by timestamp.
+
+ @return: a list of [date, editor, comment, text, restoration
+ marker]. text will be None, unless retrieveText is True (or has
+ been retrieved earlier). If timestamp is not found, returns
+ None.
+
+ """
+ if self._deletedRevs is None:
+ self.loadDeletedRevisions()
+ if timestamp not in self._deletedRevs:
+ #TODO: Throw an exception instead?
+ return None
+
+ if retrieveText and not self._deletedRevs[timestamp][3] and timestamp[:2]=='ts':
+ # Login
+ self._getActionUser(action = 'delete', sysop = True)
+
+ output(u'Retrieving text of deleted revision...')
+ address = self.site().undelete_view_address(self.urlname(),timestamp)
+ text = self.site().getUrl(address, sysop = True)
+ und = re.search('<textarea readonly="1" cols="80" rows="25">(.*?)</textarea><div><form method="post"',text,re.DOTALL)
+ if und:
+ self._deletedRevs[timestamp][3] = und.group(1)
+
+ return self._deletedRevs[timestamp]
+
+ def markDeletedRevision(self, timestamp, undelete=True):
+ """Mark the revision identified by timestamp for undeletion.
+
+ @param undelete: if False, mark the revision to remain deleted.
+
+ """
+ if self._deletedRevs is None:
+ self.loadDeletedRevisions()
+ if timestamp not in self._deletedRevs:
+ #TODO: Throw an exception?
+ return None
+ self._deletedRevs[timestamp][4] = undelete
+ self._deletedRevsModified = True
+
+ def undelete(self, comment=None, throttle=True):
+ """Undelete page based on the undeletion markers set by previous
calls.
+
+ If no calls have been made since loadDeletedRevisions(), everything
+ will be restored.
+
+ Simplest case:
+ Page(...).undelete('This will restore all revisions')
+
+ More complex:
+ pg = Page(...)
+ revs = pg.loadDeletedRevsions()
+ for rev in revs:
+ if ... #decide whether to undelete a revision
+ pg.markDeletedRevision(rev) #mark for undeletion
+ pg.undelete('This will restore only selected revisions.')
+
+ @param comment: The undeletion edit summary.
+
+ """
+ # Login
+ self._getActionUser(action = 'undelete', sysop = True)
+
+ # Check blocks
+ self.site().checkBlocks(sysop = True)
+
+ token = self.site().getToken(self, sysop=True)
+ if comment is None:
+ output(u'Preparing to undelete %s.'
+ % (self.title(asLink=True)))
+ comment = input(u'Please enter a reason for the undeletion:')
+
+ if throttle:
+ put_throttle()
+
+ if self.site().has_api() and self.site().versionnumber() >= 12:
+ params = {
+ 'action': 'undelete',
+ 'title': self.title(),
+ 'reason': comment,
+ 'token': token,
+ }
+ if self._deletedRevs and self._deletedRevsModified:
+ selected = []
+
+ for ts in self._deletedRevs:
+ if self._deletedRevs[ts][4]:
+ selected.append(ts)
+ params['timestamps'] = ts,
+
+ result = query.GetData(params, self.site(), sysop=True)
+ if 'error' in result:
+ raise RuntimeError("%s" % result['error'])
+ elif 'undelete' in result:
+ output(u'Page %s undeleted' % self.title(asLink=True))
+
+ return result
+
+ else:
+ address = self.site().undelete_address()
+
+ formdata = {
+ 'target': self.title(),
+ 'wpComment': comment,
+ 'wpEditToken': token,
+ 'restore': self.site().mediawiki_message('undeletebtn')
+ }
+
+ if self._deletedRevs and self._deletedRevsModified:
+ for ts in self._deletedRevs:
+ if self._deletedRevs[ts][4]:
+ formdata['ts'+ts] = '1'
+
+ self._deletedRevs = None
+ #TODO: Check for errors below (have we succeeded? etc):
+ result = self.site().postForm(address,formdata,sysop=True)
+ output(u'Page %s undeleted' % self.title(asLink=True))
+
+ return result
+
+ def protect(self, editcreate='sysop', move='sysop', unprotect=False,
+ reason=None, editcreate_duration='infinite',
+ move_duration = 'infinite', cascading = False, prompt = True, throttle = True):
+ """(Un)protect a wiki page. Requires administrator status.
+
+ If the page does not exist, only the ec (aka edit/create) protection is available.
+ If reason is None, asks for a reason. If prompt is True, asks the
+ user if he wants to protect the page. Valid values for editcreate and move
+ are:
+ * '' (equivalent to 'none')
+ * 'autoconfirmed'
+ * 'sysop'
+
+ """
+ # Login
+ self._getActionUser(action = 'protect', sysop = True)
+
+ # Check blocks
+ self.site().checkBlocks(sysop = True)
+
+ #if self.exists() and editcreate != move: # check protect level if edit/move not same
+ # if editcreate == 'sysop' and move != 'sysop':
+ # raise Error("The level configuration is not safe")
+
+ if unprotect:
+ editcreate = move = ''
+ else:
+ editcreate, move = editcreate.lower(), move.lower()
+ if throttle:
+ put_throttle()
+ if reason is None:
+ reason = input(
+ u'Please enter a reason for the change of the protection level:')
+ reason = reason.encode(self.site().encoding())
+ answer = 'y'
+ if prompt and not hasattr(self.site(), '_noProtectPrompt'):
+ answer = inputChoice(
+ u'Do you want to change the protection level of %s?' % self,
+ ['Yes', 'No', 'All'], ['Y', 'N', 'A'], 'N')
+ if answer == 'a':
+ answer = 'y'
+ self.site()._noProtectPrompt = True
+ if answer == 'y':
+ if not self.site().has_api() or self.site().versionnumber() < 12:
+ return self._oldProtect(editcreate, move, unprotect, reason,
+ editcreate_duration, move_duration,
+ cascading, prompt, throttle)
+
+ token = self.site().getToken(self, sysop = True)
+
+ # Translate 'none' to ''
+ protections = []
+ expiry = []
+ if editcreate == 'none':
+ editcreate = 'all'
+ if move == 'none':
+ move = 'all'
+
+ if editcreate_duration == 'none' or not editcreate_duration:
+ editcreate_duration = 'infinite'
+ if move_duration == 'none' or not move_duration:
+ move_duration = 'infinite'
+
+ if self.exists():
+ protections.append("edit=%s" % editcreate)
+
+ protections.append("move=%s" % move)
+ expiry.append(move_duration)
+ else:
+ protections.append("create=%s" % editcreate)
+
+ expiry.append(editcreate_duration)
+
+ params = {
+ 'action': 'protect',
+ 'title': self.title(),
+ 'token': token,
+ 'protections': protections,
+ 'expiry': expiry,
+ #'': '',
+ }
+ if reason:
+ params['reason'] = reason
+
+ if cascading:
+ if editcreate != 'sysop' or move != 'sysop' or not self.exists():
+ # You can't protect a page as autoconfirmed and cascading; prevent the error.
+ # Cascading is only available for existing pages, not for create protection.
+ output(u"NOTE: The page can't be protected with cascading unless it exists and is sysop-only. Setting cascading \"off\"")
+ else:
+ params['cascade'] = 1
+
+ result = query.GetData(params, self.site(), sysop=True)
+
+ if 'error' in result: #error occured
+ err = result['error']['code']
+ output('%s' % result)
+ #if err == '':
+ #
+ #elif err == '':
+ #
+ else:
+ if result['protect']:
+ output(u'Changed protection level of page %s.' % self.title(asLink=True))
+ return True
+
+ return False
+
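An illustrative protect() call (a sysop account is required); unprotect=True would lift the restrictions again, and the title and reason are assumptions:

    import wikipedia
    page = wikipedia.Page(wikipedia.getSite(), u'Example article')
    page.protect(editcreate='autoconfirmed', move='sysop',
                 reason=u'Bot: persistent vandalism', prompt=False)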
+ def _oldProtect(self, editcreate = 'sysop', move = 'sysop', unprotect = False, reason = None, editcreate_duration = 'infinite',
+ move_duration = 'infinite', cascading = False, prompt = True, throttle = True):
+ """Internal use: (un)protect a page via the ordinary web form."""
+ host = self.site().hostname()
+ token = self.site().getToken(sysop = True)
+
+ # Translate 'none' to ''
+ if editcreate == 'none': editcreate = ''
+ if move == 'none': move = ''
+
+ # Translate no duration to infinite
+ if editcreate_duration == 'none' or not editcreate_duration: editcreate_duration = 'infinite'
+ if move_duration == 'none' or not move_duration: move_duration = 'infinite'
+
+ # Get cascading
+ if cascading == False:
+ cascading = '0'
+ else:
+ if editcreate != 'sysop' or move != 'sysop' or not self.exists():
+ # You can't protect a page as autoconfirmed and cascading; prevent the error.
+ # Cascading is only available for existing pages, not for create protection.
+ cascading = '0'
+ output(u"NOTE: The page can't be protected with cascading unless it exists and is sysop-only. Setting cascading \"off\"")
+ else:
+ cascading = '1'
+
+ if unprotect:
+ address = self.site().unprotect_address(self.urlname())
+ else:
+ address = self.site().protect_address(self.urlname())
+
+ predata = {}
+ if self.site().versionnumber >= 10:
+ predata['mwProtect-cascade'] = cascading
+
+ predata['mwProtect-reason'] = reason
+
+ if not self.exists(): #and self.site().versionnumber() >= :
+ #create protect
+ predata['mwProtect-level-create'] = editcreate
+ predata['wpProtectExpirySelection-create'] = editcreate_duration
+ else:
+ #edit/move Protect
+ predata['mwProtect-level-edit'] = editcreate
+ predata['mwProtect-level-move'] = move
+
+ if self.site().versionnumber() >= 14:
+ predata['wpProtectExpirySelection-edit'] = editcreate_duration
+ predata['wpProtectExpirySelection-move'] = move_duration
+ else:
+ predata['mwProtect-expiry'] = editcreate_duration
+
+ if token:
+ predata['wpEditToken'] = token
+
+ response, data = self.site().postForm(address, predata, sysop=True)
+
+ if response.code == 302 and not data:
+ output(u'Changed protection level of page %s.' % self.title(asLink=True))
+ return True
+ else:
+ #Normally, we expect a 302 with no data, so this means an error
+ self.site().checkBlocks(sysop = True)
+ output(u'Failed to change protection level of page %s:'
+ % self.title(asLink=True))
+ output(u"HTTP response code %s" % response.code)
+ output(data)
+ return False
+
+ def removeImage(self, image, put=False, summary=None, safe=True):
+ """Remove all occurrences of an image from this
Page."""
+ # TODO: this should be grouped with other functions that operate on
+ # wiki-text rather than the Page object
+ return self.replaceImage(image, None, put, summary, safe)
+
+ def replaceImage(self, image, replacement=None, put=False, summary=None,
+ safe=True):
+ """Replace all occurences of an image by another image.
+
+ Giving None as argument for replacement will delink instead of
+ replace.
+
+ The argument image must be without namespace and all spaces replaced
+ by underscores.
+
+ If put is False, the new text will be returned. If put is True, the
+ edits will be saved to the wiki and True will be returned on succes,
+ and otherwise False. Edit errors propagate.
+
+ """
+ # TODO: this should be grouped with other functions that operate on
+ # wiki-text rather than the Page object
+
+ # Copyright (c) Orgullomoore, Bryan
+
+ # TODO: document and simplify the code
+ site = self.site()
+
+ text = self.get()
+ new_text = text
+
+ def capitalizationPattern(s):
+ """
+ Given a string, create a pattern that matches the string, with
+ the first letter case-insensitive if capitalization is switched
+ on for the site you're working on.
+ """
+ if self.site().nocapitalize:
+ return re.escape(s)
+ else:
+ return ur'(?:[%s%s]%s)' % (re.escape(s[0].upper()), re.escape(s[0].lower()), re.escape(s[1:]))
+
+ namespaces = set(site.namespace(6, all = True) + site.namespace(-2, all = True))
+ # note that the colon is already included here
+ namespacePattern = ur'\s*(?:%s)\s*\:\s*' % u'|'.join(namespaces)
+
+ imagePattern = u'(%s)' % capitalizationPattern(image).replace(r'\_', '[ _]')
+
+ def filename_replacer(match):
+ if replacement is None:
+ return u''
+ else:
+ old = match.group()
+ return old[:match.start('filename')] + replacement + old[match.end('filename'):]
+
+ # The group params contains parameters such as thumb and 200px, as well
+ # as the image caption. The caption can contain wiki links, but each
+ # link has to be closed properly.
+ paramPattern = r'(?:\|(?:(?!\[\[).|\[\[.*?\]\])*?)'
+ rImage = re.compile(ur'\[\[(?P<namespace>%s)(?P<filename>%s)(?P<params>%s*?)\]\]' % (namespacePattern, imagePattern, paramPattern))
+ if replacement is None:
+ new_text = rImage.sub('', new_text)
+ else:
+ new_text = rImage.sub('[[\g<namespace>%s\g<params>]]' % replacement, new_text)
+
+ # Remove the image from galleries
+ galleryR = re.compile(r'(?is)<gallery>(?P<items>.*?)</gallery>')
+ galleryItemR = re.compile(r'(?m)^%s?(?P<filename>%s)\s*(?P<label>\|.*?)?\s*$' % (namespacePattern, imagePattern))
+
+ def gallery_replacer(match):
+ return ur'<gallery>%s</gallery>' % galleryItemR.sub(filename_replacer, match.group('items'))
+
+ new_text = galleryR.sub(gallery_replacer, new_text)
+
+ if (text == new_text) or (not safe):
+ # All previous steps did not work, so the image is
+ # likely embedded in a complicated template.
+ # Note: this regular expression can't handle nested templates.
+ templateR = re.compile(ur'(?s)\{\{(?P<contents>.*?)\}\}')
+ fileReferenceR = re.compile(u'%s(?P<filename>(?:%s)?)' % (namespacePattern, imagePattern))
+
+ def template_replacer(match):
+ return fileReferenceR.sub(filename_replacer, match.group(0))
+
+ new_text = templateR.sub(template_replacer, new_text)
+
+ if put:
+ if text != new_text:
+ # Save to the wiki
+ self.put(new_text, summary)
+ return True
+ return False
+ else:
+ return new_text
+
+ ## @since 10310
+ # @remarks needed by various bots
+ def purgeCache(self):
+ """Purges the page cache with API.
+ ( non-api purge can be done with Page.purge_address() )
+ """
+
+ # Make sure we re-raise an exception we got on an earlier attempt
+ if hasattr(self, '_getexception'):
+ return self._getexception
+
+ # call the wiki to execute the request
+ params = {
+ u'action' : u'purge',
+ u'titles' : self.title(),
+ }
+
+ pywikibot.get_throttle()
+ pywikibot.output(u"Purging page cache for %s." % self.title(asLink=True))
+
+ result = query.GetData(params, self.site())
+ r = result[u'purge'][0]
+
+ # store and return info
+ if (u'missing' in r):
+ self._getexception = pywikibot.NoPage
+ raise pywikibot.NoPage(self.site(), self.title(asLink=True),"Page does not exist. Was not able to purge cache!" )
+
+ return (u'purged' in r)
+
+
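A short sketch of purgeCache(); it returns True when the server reports the page as purged (the title is illustrative):

    import wikipedia
    page = wikipedia.Page(wikipedia.getSite(), u'Example article')
    if page.purgeCache():
        wikipedia.output(u'Cache purged for %s' % page.title())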
+class ImagePage(Page):
+ """A subclass of Page representing an image descriptor wiki page.
+
+ Supports the same interface as Page, with the following added methods:
+
+ getImagePageHtml : Download image page and return raw HTML text.
+ fileURL : Return the URL for the image described on this
+ page.
+ fileIsOnCommons : Return True if image stored on Wikimedia
+ Commons.
+ fileIsShared : Return True if image stored on Wikitravel
+ shared repository.
+ getFileMd5Sum : Return image file's MD5 checksum.
+ getFileVersionHistory : Return the image file's version history.
+ getFileVersionHistoryTable: Return the version history in the form of a
+ wiki table.
+ usingPages : Yield Pages on which the image is displayed.
+ globalUsage : Yield Pages on which the image is used globally
+
+ """
+ def __init__(self, site, title, insite = None):
+ Page.__init__(self, site, title, insite, defaultNamespace=6)
+ if self.namespace() != 6:
+ raise ValueError(u'BUG: %s is not in the image namespace!' % title)
+ self._imagePageHtml = None
+ self._local = None
+ self._latestInfo = {}
+ self._infoLoaded = False
+
+ def getImagePageHtml(self):
+ """
+ Download the image page, and return the HTML, as a unicode string.
+
+ Caches the HTML code, so that if you run this method twice on the
+ same ImagePage object, the page will only be downloaded once.
+ """
+ if not self._imagePageHtml:
+ path = self.site().get_address(self.urlname())
+ self._imagePageHtml = self.site().getUrl(path)
+ return self._imagePageHtml
+
+ def _loadInfo(self, limit=1):
+ params = {
+ 'action': 'query',
+ 'prop': 'imageinfo',
+ 'titles': self.title(),
+ 'iiprop': ['timestamp', 'user', 'comment', 'url', 'size',
+ 'dimensions', 'sha1', 'mime', 'metadata', 'archivename',
+ 'bitdepth'],
+ 'iilimit': limit,
+ }
+ try:
+ data = query.GetData(params, self.site())
+ except NotImplementedError:
+ output("API did not work; loading page HTML instead.")
+ self.getImagePageHtml()
+ return
+
+ if 'error' in data:
+ raise RuntimeError("%s" %data['error'])
+ count = 0
+ pageInfo = data['query']['pages'].values()[0]
+ self._local = pageInfo["imagerepository"] != "shared"
+ if data['query']['pages'].keys()[0] == "-1":
+ if 'missing' in pageInfo and self._local:
+ raise NoPage(self.site(), unicode(self),
+ "Page does not exist.")
+ elif 'invalid' in pageInfo:
+ raise BadTitle('BadTitle: %s' % self)
+ infos = []
+
+ try:
+ while True:
+ for info in pageInfo['imageinfo']:
+ count += 1
+ if count == 1 and 'iistart' not in params:
+ # count 1 and no iicontinue mean first image revision is latest.
+ self._latestInfo = info
+ infos.append(info)
+ if limit == 1:
+ break
+
+ if 'query-continue' in data and limit != 1:
+ params['iistart'] = data['query-continue']['imageinfo']['iistart']
+ else:
+ break
+ except KeyError:
+ output("No image found on the image page.")
+ self._infoLoaded = True
+ if limit > 1:
+ return infos
+
+ def fileUrl(self):
+ """Return the URL for the image described on this page."""
+ # There are three types of image pages:
+ # * normal, small images with links like: filename.png (10KB, MIME type: image/png)
+ # * normal, large images with links like: Download high resolution version (1024x768, 200 KB)
+ # * SVG images with links like: filename.svg (1KB, MIME type: image/svg)
+ # This regular expression seems to work with all of them.
+ # The part after the | is required for copying .ogg files from en:, as they do not
+ # have a "full image link" div. This might change in the future; on commons, there
+ # is a full image link for .ogg and .mid files.
+ #***********************
+ # change to API query: action=query&titles=File:wiki.jpg&prop=imageinfo&iiprop=url
+ if not self._infoLoaded:
+ self._loadInfo()
+
+ if self._infoLoaded:
+ return self._latestInfo['url']
+
+ urlR = re.compile(r'<div class="fullImageLink" id="file">.*?<a href="(?P<url>[^ ]+?)"(?! class="image")|<span class="dangerousLink"><a href="(?P<url2>.+?)"', re.DOTALL)
+ m = urlR.search(self.getImagePageHtml())
+
+ url = m.group('url') or m.group('url2')
+ return url
+
+ def fileIsOnCommons(self):
+ """Return True if the image is stored on Wikimedia Commons."""
+ if not self._infoLoaded:
+ self._loadInfo()
+
+ if self._infoLoaded:
+ return not self._local
+
+ return
self.fileUrl().startswith(u'http://upload.wikimedia.org/wikipedia/commo…
+
+ def fileIsShared(self):
+ """Return True if image is stored on Wikitravel shared repository."""
+ if 'wikitravel_shared' in self.site().shared_image_repository():
+ return self.fileUrl().startswith(u'http://wikitravel.org/upload/shared/')
+ return self.fileIsOnCommons()
+
+ # FIXME: the MD5 might be computed over an incomplete file due to a server disconnection
+ # (see bug #1795683).
+ def getFileMd5Sum(self):
+ """Return image file's MD5 checksum."""
+ f = MyURLopener.open(self.fileUrl())
+ return md5(f.read()).hexdigest()
+
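A short sketch of how the accessors above might be used together; the file title is hypothetical and a configured user-config.py is assumed:

    import wikipedia

    site = wikipedia.getSite()
    image = wikipedia.ImagePage(site, u'File:Example.jpg')   # hypothetical title
    print image.fileUrl()           # direct URL of the file
    print image.fileIsOnCommons()   # True when the file lives on Wikimedia Commons
    print image.getFileMd5Sum()     # MD5 of the downloaded file content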
+ def getFileVersionHistory(self):
+ """Return the image file's version history.
+
+ Return value is a list of tuples containing (timestamp, username,
+ resolution, filesize, comment).
+
+ """
+ result = []
+ infos = self._loadInfo(500)
+ #API query
+ if infos:
+ for i in infos:
+ result.append((i['timestamp'], i['user'], u"%s×%s" % (i['width'], i['height']), i['size'], i['comment']))
+
+ return result
+
+ #from ImagePage HTML
+ history = re.search('(?s)<table class="wikitable filehistory">.+?</table>', self.getImagePageHtml())
+ if history:
+ lineR = re.compile(r'<tr>(?:<td>.*?</td>){1,2}<td.*?><a href=".+?">(?P<datetime>.+?)</a></td><td>.*?(?P<resolution>\d+\xd7\d+) <span.*?>\((?P<filesize>.+?)\)</span></td><td><a href=".+?"(?: class="new"|) title=".+?">(?P<username>.+?)</a>.*?</td><td>(?:.*?<span class="comment">\((?P<comment>.*?)\)</span>)?</td></tr>')
+ if not lineR.search(history.group()):
+ # b/c code
+ lineR = re.compile(r'<tr>(?:<td>.*?</td>){1,2}<td><a href=".+?">(?P<datetime>.+?)</a></td><td><a href=".+?"(?: class="new"|) title=".+?">(?P<username>.+?)</a>.*?</td><td>(?P<resolution>.*?)</td><td class=".+?">(?P<filesize>.+?)</td><td>(?P<comment>.*?)</td></tr>')
+ else:
+ # backward compatible code
+ history = re.search('(?s)<ul class="special">.+?</ul>', self.getImagePageHtml())
+ if history:
+ lineR = re.compile('<li> \(.+?\) \(.+?\) <a href=".+?" title=".+?">(?P<datetime>.+?)</a> . . <a href=".+?" title=".+?">(?P<username>.+?)</a> \(.+?\) . . (?P<resolution>\d+.+?\d+) \((?P<filesize>[\d,\.]+) .+?\)( <span class="comment">(?P<comment>.*?)</span>)?</li>')
+
+ if history:
+ for match in lineR.finditer(history.group()):
+ datetime = match.group('datetime')
+ username = match.group('username')
+ resolution = match.group('resolution')
+ size = match.group('filesize')
+ comment = match.group('comment') or ''
+ result.append((datetime, username, resolution, size, comment))
+ return result
+
+ def getFirstUploader(self):
+ """ Function that uses the APIs to detect the first uploader of the image """
+ inf = self.getFileVersionHistory()[-1]
+ return [inf[1], inf[0]]
+
+ def getLatestUploader(self):
+ """ Function that uses the APIs to detect the latest uploader of the image """
+ if not self._infoLoaded:
+ self._loadInfo()
+ if self._infoLoaded:
+ return [self._latestInfo['user'], self._latestInfo['timestamp']]
+
+ inf = self.getFileVersionHistory()[0]
+ return [inf[1], inf[0]]
+
+ def getHash(self):
+ """ Function that returns the hash of a file, in order to check whether two
+ files are the same or not.
+ """
+ if self.exists():
+ if not self._infoLoaded:
+ self._loadInfo()
+ try:
+ return self._latestInfo['sha1']
+ except (KeyError, IndexError, TypeError):
+ try:
+ self.get()
+ except NoPage:
+ output(u'%s has been deleted before getting the Hash. Skipping...' % self.title())
+ return None
+ except IsRedirectPage:
+ output("Skipping %s because it's a redirect." % self.title())
+ return None
+ else:
+ raise NoHash('No Hash found in the APIs! Maybe the regex to catch it is wrong or someone has changed the APIs structure.')
+ else:
+ output(u'File deleted before getting the Hash. Skipping...')
+ return None
+
+ def getFileVersionHistoryTable(self):
+ """Return the version history in the form of a wiki table."""
+ lines = []
+ for (datetime, username, resolution, size, comment) in self.getFileVersionHistory():
+ lines.append(u'| %s || %s || %s || %s || <nowiki>%s</nowiki>' % (datetime, username, resolution, size, comment))
+ return u'{| border="1"\n! date/time || username || resolution || size || edit summary\n|----\n' + u'\n|----\n'.join(lines) + '\n|}'
+
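An illustrative sketch of the two history accessors above; the file title is hypothetical:

    import wikipedia

    site = wikipedia.getSite()
    image = wikipedia.ImagePage(site, u'File:Example.jpg')   # hypothetical title
    for timestamp, username, resolution, size, comment in image.getFileVersionHistory():
        print timestamp, username, resolution, size
    print image.getFileVersionHistoryTable()    # the same data rendered as a wiki table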
+ def usingPages(self):
+ if not self.site().has_api() or self.site().versionnumber() < 11:
+ for a in self._usingPagesOld():
+ yield a
+ return
+
+ params = {
+ 'action': 'query',
+ 'list': 'imageusage',
+ 'iutitle': self.title(),
+ 'iulimit': config.special_page_limit,
+ #'': '',
+ }
+
+ while True:
+ data = query.GetData(params, self.site())
+ if 'error' in data:
+ raise RuntimeError("%s" % data['error'])
+
+ for iu in data['query']["imageusage"]:
+ yield Page(self.site(), iu['title'], defaultNamespace=iu['ns'])
+
+ if 'query-continue' in data:
+ params['iucontinue'] = data['query-continue']['imageusage']['iucontinue']
+ else:
+ break
+
+ def _usingPagesOld(self):
+ """Yield Pages on which the image is displayed."""
+ titleList = re.search('(?s)<h2 id="filelinks">.+?<!-- end content -->',
+ self.getImagePageHtml()).group()
+ lineR = re.compile(
+ '<li><a href="[^\"]+" title=".+?">(?P<title>.+?)</a></li>')
+
+ for match in lineR.finditer(titleList):
+ try:
+ yield Page(self.site(), match.group('title'))
+ except InvalidTitle:
+ output(
+ u"Image description page %s contains invalid reference to [[%s]]."
+ % (self.title(), match.group('title')))
+
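A small sketch of iterating over the pages that embed a file, again with a hypothetical title and a configured user-config.py assumed:

    import wikipedia

    site = wikipedia.getSite()
    image = wikipedia.ImagePage(site, u'File:Example.jpg')   # hypothetical title
    for page in image.usingPages():          # falls back to HTML scraping on old wikis
        wikipedia.output(page.title())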
+ def globalUsage(self):
+ '''
+ Yield Pages on which the image is used globally.
+ Currently this probably only works on Wikimedia Commons.
+ '''
+
+ if not self.site().has_api() or self.site().versionnumber() < 11:
+ # Not supported, just return none
+ return
+
+ params = {
+ 'action': 'query',
+ 'prop': 'globalusage',
+ 'titles': self.title(),
+ 'gulimit': config.special_page_limit,
+ #'': '',
+ }
+
+ while True:
+ data = query.GetData(params, self.site())
+ if 'error' in data:
+ raise RuntimeError("%s" % data['error'])
+
+ for (page, globalusage) in data['query']['pages'].items():
+ for gu in globalusage['globalusage']:
+ #FIXME: Should have a cleaner way to get the wiki where the image is used
+ siteparts = gu['wiki'].split('.')
+ if len(siteparts)==3:
+ if siteparts[0] in self.site().fam().alphabetic and siteparts[1] in ['wikipedia', 'wiktionary', 'wikibooks', 'wikiquote','wikisource']:
+ code = siteparts[0]
+ fam = siteparts[1]
+ elif siteparts[0] in ['meta', 'incubator'] and siteparts[1]==u'wikimedia':
+ code = siteparts[0]
+ fam = siteparts[0]
+ else:
+ code = None
+ fam = None
+ if code and fam:
+ site = getSite(code=code, fam=fam)
+ yield Page(site, gu['title'])
+
+ if 'query-continue' in data:
+ params['gucontinue'] = data['query-continue']['globalusage']['gucontinue']
+ else:
+ break
+
+
+class _GetAll(object):
+ """For internal use only - supports getall() function"""
+ def __init__(self, site, pages, throttle, force):
+ self.site = site
+ self.pages = []
+ self.throttle = throttle
+ self.force = force
+ self.sleeptime = 15
+
+ for page in pages:
+ if (not hasattr(page, '_contents') and not hasattr(page, '_getexception')) or force:
+ self.pages.append(page)
+ elif verbose:
+ output(u"BUGWARNING: %s already done!" % page.title(asLink=True))
+
+ def sleep(self):
+ time.sleep(self.sleeptime)
+ if self.sleeptime <= 60:
+ self.sleeptime += 15
+ elif self.sleeptime < 360:
+ self.sleeptime += 60
+
+ def run(self):
+ if self.pages:
+ # Sometimes the query does not contain revisions
+ if self.site.has_api() and debug:
+ while True:
+ try:
+ data = self.getDataApi()
+ except (socket.error, httplib.BadStatusLine, ServerError):
+ # Print the traceback of the caught exception
+ s = ''.join(traceback.format_exception(*sys.exc_info()))
+ if not isinstance(s, unicode):
+ s = s.decode('utf-8')
+ output(u'%s\nDBG> got network error in _GetAll.run. ' \
+ 'Sleeping for %d seconds...' % (s, self.sleeptime))
+ self.sleep()
+ else:
+ if 'error' in data:
+ raise RuntimeError(data['error'])
+ else:
+ break
+
+ self.headerDoneApi(data['query'])
+ if 'normalized' in data['query']:
+ self._norm = dict([(x['from'],x['to']) for x in data['query']['normalized']])
+ for vals in data['query']['pages'].values():
+ self.oneDoneApi(vals)
+ else: #read pages via Special:Export
+ while True:
+ try:
+ data = self.getData()
+ except (socket.error, httplib.BadStatusLine, ServerError):
+ # Print the traceback of the caught exception
+ s = ''.join(traceback.format_exception(*sys.exc_info()))
+ if not isinstance(s, unicode):
+ s = s.decode('utf-8')
+ output(u'%s\nDBG> got network error in _GetAll.run. ' \
+ 'Sleeping for %d seconds...' % (s, self.sleeptime))
+ self.sleep()
+ else:
+ if "<title>Wiki does not exist</title>" in data:
+ raise NoSuchSite(u'Wiki %s does not exist yet' % self.site)
+ elif "</mediawiki>" not in data[-20:]:
+ # HTML error Page got thrown because of an internal
+ # error when fetching a revision.
+ output(u'Received incomplete XML data. ' \
+ 'Sleeping for %d seconds...' % self.sleeptime)
+ self.sleep()
+ elif "<siteinfo>" not in data: # This probably means we got a 'temporarily unavailable'
+ output(u'Got incorrect export page. ' \
+ 'Sleeping for %d seconds...' % self.sleeptime)
+ self.sleep()
+ else:
+ break
+ R = re.compile(r"\s*<\?xml([^>]*)\?>(.*)",re.DOTALL)
+ m = R.match(data)
+ if m:
+ data = m.group(2)
+ handler = xmlreader.MediaWikiXmlHandler()
+ handler.setCallback(self.oneDone)
+ handler.setHeaderCallback(self.headerDone)
+ #f = open("backup.txt", "w")
+ #f.write(data)
+ #f.close()
+ try:
+ xml.sax.parseString(data, handler)
+ except (xml.sax._exceptions.SAXParseException, ValueError), err:
+ debugDump( 'SaxParseBug', self.site, err, data )
+ raise
+ except PageNotFound:
+ return
+ # All of the ones that have not been found apparently do not exist
+
+ for pl in self.pages:
+ if not hasattr(pl,'_contents') and not hasattr(pl,'_getexception'):
+ pl._getexception = NoPage
+
+ def oneDone(self, entry):
+ title = entry.title
+ username = entry.username
+ ipedit = entry.ipedit
+ timestamp = entry.timestamp
+ text = entry.text
+ editRestriction = entry.editRestriction
+ moveRestriction = entry.moveRestriction
+ revisionId = entry.revisionid
+
+ page = Page(self.site, title)
+ successful = False
+ for page2 in self.pages:
+ if page2.sectionFreeTitle() == page.sectionFreeTitle():
+ if not (hasattr(page2,'_contents') or \
+ hasattr(page2, '_getexception')) or self.force:
+ page2.editRestriction = entry.editRestriction
+ page2.moveRestriction = entry.moveRestriction
+ if editRestriction == 'autoconfirmed':
+ page2._editrestriction = True
+ page2._permalink = entry.revisionid
+ page2._userName = username
+ page2._ipedit = ipedit
+ page2._revisionId = revisionId
+ page2._editTime = timestamp
+ page2._versionhistory = [
+ (revisionId,
+ time.strftime("%Y-%m-%dT%H:%M:%SZ",
+ time.strptime(str(timestamp),
+ "%Y%m%d%H%M%S")),
+ username, entry.comment)]
+ section = page2.section()
+ # Store the content
+ page2._contents = text
+ m = self.site.redirectRegex().match(text)
+ if m:
+ ## output(u"%s is a redirect" % page2.title(asLink=True))
+ redirectto = m.group(1)
+ if section and not "#" in redirectto:
+ redirectto += "#" + section
+ page2._getexception = IsRedirectPage
+ page2._redirarg = redirectto
+
+ # This is used for checking deletion conflict.
+ # Use the data loading time.
+ page2._startTime = time.strftime('%Y%m%d%H%M%S',
+ time.gmtime())
+ if section:
+ m = re.search("=+[ ']*%s[ ']*=+" % re.escape(section), text)
+ if not m:
+ try:
+ page2._getexception
+ output(u"WARNING: Section not found: %s" % page2)
+ except AttributeError:
+ # There is no exception yet
+ page2._getexception = SectionError
+ successful = True
+ # Note that there is no break here. The reason is that there
+ # might be duplicates in the pages list.
+ if not successful:
+ output(u"BUG>> title %s (%s) not found in list" % (title, page))
+ output(u'Expected one of: %s'
+ % u','.join([unicode(page2) for page2 in self.pages]))
+ raise PageNotFound
+
+ def headerDone(self, header):
+ # Verify version
+ version = header.generator
+ p = re.compile('^MediaWiki (.+)$')
+ m = p.match(version)
+ if m:
+ version = m.group(1)
+ # only warn operator when versionnumber has been changed
+ versionnumber = self.site.family.versionnumber
+ if version != self.site.version() and \
+ versionnumber(self.site.lang,
+ version=version) != versionnumber(self.site.lang):
+ output(u'WARNING: Family file %s contains version number %s, but it should be %s'
+ % (self.site.family.name, self.site.version(), version))
+
+ # Verify case
+ if self.site.nocapitalize:
+ case = 'case-sensitive'
+ else:
+ case = 'first-letter'
+ if case != header.case.strip():
+ output(u'WARNING: Family file %s contains case %s, but it should be %s' % (self.site.family.name, case, header.case.strip()))
+
+ # Verify namespaces
+ lang = self.site.lang
+ ids = header.namespaces.keys()
+ ids.sort()
+ for id in ids:
+ nshdr = header.namespaces[id]
+ if self.site.family.isDefinedNSLanguage(id, lang):
+ ns = self.site.namespace(id) or u''
+ if ns != nshdr:
+ try:
+ dflt = self.site.family.namespace('_default', id)
+ except KeyError:
+ dflt = u''
+ if not ns and not dflt:
+ flag = u"is not set, but should be '%s'" % nshdr
+ elif dflt == ns:
+ flag = u"is set to default ('%s'), but should be '%s'" % (ns, nshdr)
+ elif dflt == nshdr:
+ flag = u"is '%s', but should be removed (default value '%s')" % (ns, nshdr)
+ else:
+ flag = u"is '%s', but should be '%s'" % (ns, nshdr)
+ output(u"WARNING: Outdated family file %s: namespace['%s'][%i] %s" % (self.site.family.name, lang, id, flag))
+ #self.site.family.namespaces[id][lang] = nshdr
+ else:
+ output(u"WARNING: Missing namespace in family file %s: namespace['%s'][%i] (it is set to '%s')" % (self.site.family.name, lang, id, nshdr))
+ for id in self.site.family.namespaces:
+ if self.site.family.isDefinedNSLanguage(id, lang) and id not in header.namespaces:
+ output(u"WARNING: Family file %s includes namespace['%s'][%i], but it should be removed (namespace doesn't exist in the site)" % (self.site.family.name, lang, id))
+
+ def getData(self):
+ address = self.site.export_address()
+ pagenames = [page.sectionFreeTitle() for page in self.pages]
+ # We need to use X convention for requested page titles.
+ if self.site.lang == 'eo':
+ pagenames = [encodeEsperantoX(pagetitle) for pagetitle in pagenames]
+ pagenames = u'\r\n'.join(pagenames)
+ if type(pagenames) is not unicode:
+ output(u'Warning: xmlreader.WikipediaXMLHandler.getData() got non-unicode page names. Please report this.')
+ print pagenames
+ # convert Unicode string to the encoding used on that wiki
+ pagenames = pagenames.encode(self.site.encoding())
+ predata = {
+ 'action': 'submit',
+ 'pages': pagenames,
+ 'curonly': 'True',
+ }
+ # Slow ourselves down
+ get_throttle(requestsize = len(self.pages))
+ # Now make the actual request to the server
+ now = time.time()
+ response, data = self.site.postForm(address, predata)
+ # The XML parser doesn't expect a Unicode string, but an encoded one,
+ # so we'll encode it back.
+ data = data.encode(self.site.encoding())
+ #get_throttle.setDelay(time.time() - now)
+ return data
+
+ def oneDoneApi(self, data):
+ title = data['title']
+ if not ('missing' in data or 'invalid' in data):
+ revisionId = data['lastrevid']
+ rev = None
+ try:
+ rev = data['revisions']
+ except KeyError:
+ raise KeyError(
+ u'NOTE: Last revision of [[%s]] not found' % title)
+ else:
+ username = rev[0]['user']
+ ipedit = 'anon' in rev[0]
+ timestamp = rev[0]['timestamp']
+ text = rev[0]['*']
+ editRestriction = ''
+ moveRestriction = ''
+ for revs in data['protection']:
+ if revs['type'] == 'edit':
+ editRestriction = revs['level']
+ elif revs['type'] == 'move':
+ moveRestriction = revs['level']
+
+ page = Page(self.site, title)
+ successful = False
+ for page2 in self.pages:
+ if hasattr(self, '_norm') and page2.sectionFreeTitle() in self._norm:
+ page2._title = self._norm[page2.sectionFreeTitle()]
+
+ if page2.sectionFreeTitle() == page.sectionFreeTitle():
+ if 'missing' in data:
+ page2._getexception = NoPage
+ successful = True
+ break
+
+ if 'invalid' in data:
+ page2._getexception = BadTitle
+ successful = True
+ break
+
+ if not (hasattr(page2,'_contents') or hasattr(page2,'_getexception')) or self.force:
+ page2.editRestriction = editRestriction
+ page2.moveRestriction = moveRestriction
+ if editRestriction == 'autoconfirmed':
+ page2._editrestriction = True
+ page2._permalink = revisionId
+ if rev:
+ page2._userName = username
+ page2._ipedit = ipedit
+ page2._editTime = timestamp
+ page2._contents = text
+ else:
+ raise KeyError(
+ u'BUG?>>: Last revision of [[%s]] not found'
+ % title)
+ page2._revisionId = revisionId
+ section = page2.section()
+ if 'redirect' in data:
+ ## output(u"%s is a redirect" % page2.title(asLink=True))
+ m = self.site.redirectRegex().match(text)
+ redirectto = m.group(1)
+ if section and not "#" in redirectto:
+ redirectto += "#" + section
+ page2._getexception = IsRedirectPage
+ page2._redirarg = redirectto
+
+ # This is used for checking deletion conflict.
+ # Use the data loading time.
+ page2._startTime = time.strftime('%Y%m%d%H%M%S', time.gmtime())
+ if section:
+ m = re.search("=+[ ']*%s[ ']*=+" % re.escape(section), text)
+ if not m:
+ try:
+ page2._getexception
+ output(u"WARNING: Section not found: %s"
+ % page2)
+ except AttributeError:
+ # There is no exception yet
+ page2._getexception = SectionError
+ successful = True
+ # Note that there is no break here. The reason is that there
+ # might be duplicates in the pages list.
+ if not successful:
+ output(u"BUG>> title %s (%s) not found in list" % (title, page))
+ output(u'Expected one of: %s'
+ % u','.join([unicode(page2) for page2 in self.pages]))
+ raise PageNotFound
+
+ def headerDoneApi(self, header):
+ p = re.compile('^MediaWiki (.+)$')
+ m = p.match(header['general']['generator'])
+ if m:
+ version = m.group(1)
+ # only warn operator when versionnumber has been changed
+ versionnumber = self.site.family.versionnumber
+ if version != self.site.version() and \
+ versionnumber(self.site.lang,
+ version=version) != versionnumber(self.site.lang):
+ output(u'WARNING: Family file %s contains version number %s, but it should be %s'
+ % (self.site.family.name, self.site.version(), version))
+
+ # Verify case
+ if self.site.nocapitalize:
+ case = 'case-sensitive'
+ else:
+ case = 'first-letter'
+ if case != header['general']['case'].strip():
+ output(u'WARNING: Family file %s contains case %s, but it should be %s' % (self.site.family.name, case, header['general']['case'].strip()))
+
+ # Verify namespaces
+ lang = self.site.lang
+ ids = header['namespaces'].keys()
+ ids.sort()
+ for id in ids:
+ nshdr = header['namespaces'][id]['*']
+ id = header['namespaces'][id]['id']
+ if self.site.family.isDefinedNSLanguage(id, lang):
+ ns = self.site.namespace(id) or u''
+ if ns != nshdr:
+ try:
+ dflt = self.site.family.namespace('_default', id)
+ except KeyError:
+ dflt = u''
+ if not ns and not dflt:
+ flag = u"is not set, but should be '%s'" % nshdr
+ elif dflt == ns:
+ flag = u"is set to default ('%s'), but should be '%s'" % (ns, nshdr)
+ elif dflt == nshdr:
+ flag = u"is '%s', but should be removed (default value '%s')" % (ns, nshdr)
+ else:
+ flag = u"is '%s', but should be '%s'" % (ns, nshdr)
+ output(u"WARNING: Outdated family file %s: namespace['%s'][%i] %s" % (self.site.family.name, lang, id, flag))
+ #self.site.family.namespaces[id][lang] = nshdr
+ else:
+ output(u"WARNING: Missing namespace in family file %s: namespace['%s'][%i] (it is set to '%s')" % (self.site.family.name, lang, id, nshdr))
+ for id in self.site.family.namespaces:
+ if self.site.family.isDefinedNSLanguage(id, lang) and u'%i' % id not in header['namespaces']:
+ output(u"WARNING: Family file %s includes namespace['%s'][%i], but it should be removed (namespace doesn't exist in the site)" % (self.site.family.name, lang, id))
+
+ def getDataApi(self):
+ pagenames = [page.sectionFreeTitle() for page in self.pages]
+ params = {
+ 'action': 'query',
+ 'meta':'siteinfo',
+ 'prop': ['info', 'revisions'],
+ 'titles': pagenames,
+ 'siprop': ['general', 'namespaces'],
+ 'rvprop': ['content', 'timestamp', 'user', 'comment', 'size'], #'ids',
+ 'inprop': ['protection', 'subjectid'], #, 'talkid', 'url', 'readable'
+ }
+
+ # Slow ourselves down
+ get_throttle(requestsize = len(self.pages))
+ # Now make the actual request to the server
+ now = time.time()
+
+ #get_throttle.setDelay(time.time() - now)
+ return query.GetData(params, self.site)
+
+def getall(site, pages, throttle=True, force=False):
+ """Bulk-retrieve a group of pages from site
+
+ Arguments: site = Site object
+ pages = iterable that yields Page objects
+
+ """
+ # TODO: why isn't this a Site method?
+ pages = list(pages) # if pages is an iterator, we need to make it a list
+ output(u'Getting %d page%s %sfrom %s...'
+ % (len(pages), (u'', u's')[len(pages) != 1],
+ (u'', u'via API ')[site.has_api() and debug], site))
+ limit = config.special_page_limit / 4 # default is 500/4; a smaller batch seems to be easier on the server.
+ if len(pages) > limit:
+ # separate export pages for bulk-retrieve
+
+ for pagg in range(0, len(pages), limit):
+ if pagg == range(0, len(pages), limit)[-1]: #latest retrieve
+ k = pages[pagg:]
+ output(u'Getting pages %d - %d of %d...' % (pagg + 1, len(pages), len(pages)))
+ _GetAll(site, k, throttle, force).run()
+ pages[pagg:] = k
+ else:
+ k = pages[pagg:pagg + limit]
+ output(u'Getting pages %d - %d of %d...' % (pagg + 1, pagg + limit, len(pages)))
+ _GetAll(site, k, throttle, force).run()
+ pages[pagg:pagg + limit] = k
+ get_throttle(requestsize = len(pages) / 10) # one retrieval takes about 7.7 sec.
+ else:
+ _GetAll(site, pages, throttle, force).run()
+
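A usage sketch for getall(); the page titles are hypothetical and a configured user-config.py is assumed. The fetched contents are cached on the Page objects, so later get() calls do not hit the wiki again:

    import wikipedia

    site = wikipedia.getSite()
    pages = [wikipedia.Page(site, t) for t in (u'Sandbox', u'Main Page')]  # hypothetical titles
    wikipedia.getall(site, pages)      # bulk-fetch in one request batch
    for page in pages:
        try:
            print page.title(), len(page.get())
        except wikipedia.NoPage:
            print page.title(), 'does not exist'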
+
+# Library functions
+
+def setAction(s):
+ """Set a summary to use for changed page submissions"""
+ global action
+ action = s
+
+# Default action
+setAction('Wikipedia python library')
+
+def setUserAgent(s):
+ """Set a User-agent: header passed to the HTTP server"""
+ global useragent
+ useragent = s
+
+# Default User-agent
+setUserAgent('PythonWikipediaBot/1.0')
+
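A sketch of overriding the two module-level defaults set above; the summary and agent strings are hypothetical:

    import wikipedia

    wikipedia.setAction(u'Robot: interwiki maintenance')     # hypothetical edit summary
    wikipedia.setUserAgent('MyBot/0.1 (run by Example)')     # hypothetical User-agent header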
+def url2link(percentname, insite, site):
+ """Convert urlname of a wiki page into interwiki link format.
+
+ 'percentname' is the page title as given by Page.urlname();
+ 'insite' specifies the target Site;
+ 'site' is the Site on which the page is found.
+
+ """
+ # Note: this is only needed if linking between wikis that use different
+ # encodings, so it is now largely obsolete. [CONFIRM]
+ percentname = percentname.replace('_', ' ')
+ x = url2unicode(percentname, site = site)
+ return unicode2html(x, insite.encoding())
+
+def decodeEsperantoX(text):
+ """
+ Decode Esperanto text encoded using the x convention.
+
+ E.g., Cxefpagxo and CXefpagXo will both be converted to Ĉefpaĝo.
+ Note that to encode non-Esperanto words like Bordeaux, one uses a
+ double x, i.e. Bordeauxx or BordeauxX.
+
+ """
+ chars = {
+ u'c': u'ĉ',
+ u'C': u'Ĉ',
+ u'g': u'ĝ',
+ u'G': u'Ĝ',
+ u'h': u'ĥ',
+ u'H': u'Ĥ',
+ u'j': u'ĵ',
+ u'J': u'Ĵ',
+ u's': u'ŝ',
+ u'S': u'Ŝ',
+ u'u': u'ŭ',
+ u'U': u'Ŭ',
+ }
+ for latin, esperanto in chars.iteritems():
+ # A regular expression that matches a letter combination which IS
+ # encoded using x-convention.
+ xConvR = re.compile(latin + '[xX]+')
+ pos = 0
+ result = ''
+ # Each matching substring will be regarded exactly once.
+ while True:
+ match = xConvR.search(text[pos:])
+ if match:
+ old = match.group()
+ if len(old) % 2 == 0:
+ # The first two chars represent an Esperanto letter.
+ # Following x's are doubled.
+ new = esperanto + ''.join([old[2 * i]
+ for i in xrange(1, len(old)/2)])
+ else:
+ # The first character stays latin; only the x's are doubled.
+ new = latin + ''.join([old[2 * i + 1]
+ for i in xrange(0, len(old)/2)])
+ result += text[pos : match.start() + pos] + new
+ pos += match.start() + len(old)
+ else:
+ result += text[pos:]
+ text = result
+ break
+ return text
+
+def encodeEsperantoX(text):
+ """
+ Convert standard wikitext to the Esperanto x-encoding.
+
+ Double X-es where necessary so that we can submit a page to an Esperanto
+ wiki. Again, we have to keep stupid stuff like cXxXxxX in mind. Maybe
+ someone wants to write about the Sony Cyber-shot DSC-Uxx camera series on
+ eo: ;)
+ """
+ # A regular expression that matches a letter combination which is NOT
+ # encoded in x-convention.
+ notXConvR = re.compile('[cghjsuCGHJSU][xX]+')
+ pos = 0
+ result = ''
+ while True:
+ match = notXConvR.search(text[pos:])
+ if match:
+ old = match.group()
+ # the first letter stays; add an x after each X or x.
+ new = old[0] + ''.join([old[i] + 'x' for i in xrange(1, len(old))])
+ result += text[pos : match.start() + pos] + new
+ pos += match.start() + len(old)
+ else:
+ result += text[pos:]
+ text = result
+ break
+ return text
+
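A self-contained check of the two helpers above; the expected values follow from the docstrings, assuming the functions are imported from this module:

    # -*- coding: utf-8 -*-
    from wikipedia import decodeEsperantoX, encodeEsperantoX

    print decodeEsperantoX(u'Cxefpagxo')   # u'Ĉefpaĝo'
    print decodeEsperantoX(u'Bordeauxx')   # u'Bordeaux' (a doubled x escapes a literal x)
    print encodeEsperantoX(u'Bordeaux')    # u'Bordeauxx'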
+######## Unicode library functions ########
+
+def UnicodeToAsciiHtml(s):
+ """Convert unicode to a bytestring using HTML entities."""
+ html = []
+ for c in s:
+ cord = ord(c)
+ if 31 < cord < 128:
+ html.append(c)
+ else:
+ html.append('&#%d;'%cord)
+ return ''.join(html)
+
+def url2unicode(title, site, site2 = None):
+ """Convert url-encoded text to unicode using site's encoding.
+
+ If site2 is provided, try its encodings as well. Uses the first encoding
+ that doesn't cause an error.
+
+ """
+ # create a list of all possible encodings for both hint sites
+ encList = [site.encoding()] + list(site.encodings())
+ if site2 and site2 <> site:
+ encList.append(site2.encoding())
+ encList += list(site2.encodings())
+ firstException = None
+ # try to handle all encodings (will probably retry utf-8)
+ for enc in encList:
+ try:
+ t = title.encode(enc)
+ t = urllib.unquote(t)
+ return unicode(t, enc)
+ except UnicodeError, ex:
+ if not firstException:
+ firstException = ex
+ pass
+ # Couldn't convert, raise the original exception
+ raise firstException
+
+def unicode2html(x, encoding):
+ """
+ Ensure unicode string is encodable, or else convert to ASCII for HTML.
+
+ Arguments are a unicode string and an encoding. Attempt to encode the
+ string into the desired format; if that doesn't work, encode the unicode
+ into html &#; entities. If it does work, return it unchanged.
+
+ """
+ try:
+ x.encode(encoding)
+ except UnicodeError:
+ x = UnicodeToAsciiHtml(x)
+ return x
+
+def html2unicode(text, ignore = []):
+ """Return text, replacing HTML entities by equivalent unicode characters."""
+ # This regular expression will match any decimal and hexadecimal entity and
+ # also entities that might be named entities.
+ entityR = re.compile(
+ r'&(?:amp;)?(#(?P<decimal>\d+)|#x(?P<hex>[0-9a-fA-F]+)|(?P<name>[A-Za-z]+));')
+ # These characters are Html-illegal, but sadly you *can* find some of
+ # these and converting them to unichr(decimal) is unsuitable
+ convertIllegalHtmlEntities = {
+ 128 : 8364, # €
+ 130 : 8218, # ‚
+ 131 : 402, # ƒ
+ 132 : 8222, # „
+ 133 : 8230, # …
+ 134 : 8224, # †
+ 135 : 8225, # ‡
+ 136 : 710, # ˆ
+ 137 : 8240, # ‰
+ 138 : 352, # Š
+ 139 : 8249, # ‹
+ 140 : 338, # Œ
+ 142 : 381, # Ž
+ 145 : 8216, # ‘
+ 146 : 8217, # ’
+ 147 : 8220, # “
+ 148 : 8221, # ”
+ 149 : 8226, # •
+ 150 : 8211, # –
+ 151 : 8212, # —
+ 152 : 732, # ˜
+ 153 : 8482, # ™
+ 154 : 353, # š
+ 155 : 8250, # ›
+ 156 : 339, # œ
+ 158 : 382, # ž
+ 159 : 376 # Ÿ
+ }
+ #ensuring that illegal &#129;, &#141; and &#157;, which have no known values,
+ #don't get converted to unichr(129), unichr(141) or unichr(157)
+ ignore = set(ignore) | set([129, 141, 157])
+ result = u''
+ i = 0
+ found = True
+ while found:
+ text = text[i:]
+ match = entityR.search(text)
+ if match:
+ unicodeCodepoint = None
+ if match.group('decimal'):
+ unicodeCodepoint = int(match.group('decimal'))
+ elif match.group('hex'):
+ unicodeCodepoint = int(match.group('hex'), 16)
+ elif match.group('name'):
+ name = match.group('name')
+ if name in htmlentitydefs.name2codepoint:
+ # We found a known HTML entity.
+ unicodeCodepoint = htmlentitydefs.name2codepoint[name]
+ result += text[:match.start()]
+ try:
+ unicodeCodepoint = convertIllegalHtmlEntities[unicodeCodepoint]
+ except KeyError:
+ pass
+ if unicodeCodepoint and unicodeCodepoint not in ignore and (WIDEBUILD or unicodeCodepoint < 65534):
+ result += unichr(unicodeCodepoint)
+ else:
+ # Leave the entity unchanged
+ result += text[match.start():match.end()]
+ i = match.end()
+ else:
+ result += text
+ found = False
+ return result
+
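A self-contained check of the entity helpers above; the expected values follow directly from the code:

    # -*- coding: utf-8 -*-
    from wikipedia import html2unicode, UnicodeToAsciiHtml

    print html2unicode(u'&eacute;&#233;&#xE9;')   # u'ééé' - named, decimal and hex forms
    print html2unicode(u'&#128;')                 # u'€' - illegal cp1252 entity remapped to U+20AC
    print UnicodeToAsciiHtml(u'café')             # 'caf&#233;'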
+# Warning! _familyCache does not necessarily have to be consistent between
+# two statements. Always ensure that a local reference is created when
+# accessing Family objects
+_familyCache = weakref.WeakValueDictionary()
+def Family(fam=None, fatal=True, force=False):
+ """Import the named family.
+
+ @param fam: family name (if omitted, uses the configured default)
+ @type fam: str
+ @param fatal: if True, the bot will stop running if the given family is
+ unknown. If False, it will only raise a ValueError exception.
+ @type fatal: bool
+ @return: a Family instance configured for the named family.
+
+ """
+ if fam is None:
+ fam = config.family
+
+ family = _familyCache.get(fam)
+ if family and not force:
+ return family
+
+ try:
+ # search for family module in the 'families' subdirectory
+ sys.path.append(config.datafilepath('families'))
+ myfamily = __import__('%s_family' % fam)
+ except ImportError:
+ if fatal:
+ output(u"""\
+Error importing the %s family. This probably means the family
+does not exist. Also check your configuration file."""
+ % fam)
+ import traceback
+ traceback.print_stack()
+ sys.exit(1)
+ else:
+ raise ValueError("Family %s does not exist" % repr(fam))
+
+ family = myfamily.Family()
+ _familyCache[fam] = family
+ return family
+
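A usage sketch for Family(), assuming the standard wikipedia family file ships in the families directory:

    import wikipedia

    fam = wikipedia.Family('wikipedia')     # imports families/wikipedia_family.py
    print fam.name                          # 'wikipedia'
    fam2 = wikipedia.Family('wikipedia')    # second call is served from the weak-reference cache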
+
+class Site(object):
+ """A MediaWiki site. Do not instantiate directly; use the getSite() function.
+
+ Constructor takes three arguments; only code is mandatory:
+ see __init__() param
+
+ Methods:
+
+ language: This Site's language code.
+ family: This Site's Family object.
+ sitename: A string representing this Site.
+ languages: A list of all languages contained in this site's Family.
+ validLanguageLinks: A list of language codes that can be used in interwiki
+ links.
+
+ loggedInAs: return current username, or None if not logged in.
+ forceLogin: require the user to log in to the site
+ messages: return True if there are new messages on the site
+ cookies: return user's cookies as a string
+
+ getUrl: retrieve an URL from the site
+ urlEncode: Encode a query to be sent using an http POST request.
+ postForm: Post form data to an address at this site.
+ postData: Post encoded form data to an http address at this site.
+
+ namespace(num): Return local name of namespace 'num'.
+ normalizeNamespace(value): Return preferred name for namespace 'value' in
+ this Site's language.
+ namespaces: Return list of canonical namespace names for this Site.
+ getNamespaceIndex(name): Return the int index of namespace 'name', or None
+ if invalid.
+
+ redirect: Return the localized redirect tag for the site.
+ redirectRegex: Return compiled regular expression matching on redirect
+ pages.
+ mediawiki_message: Retrieve the text of a specified MediaWiki message
+ has_mediawiki_message: True if this site defines specified MediaWiki
+ message
+ has_api: True if this site's family provides api interface
+
+ shared_image_repository: Return tuple of image repositories used by this
+ site.
+ category_on_one_line: Return True if this site wants all category links
+ on one line.
+ interwiki_putfirst: Return list of language codes for ordering of
+ interwiki links.
+ linkto(title): Return string in the form of a wikilink to 'title'
+ isInterwikiLink(s): Return True if 's' is in the form of an interwiki
+ link.
+ getSite(lang): Return Site object for wiki in same family, language
+ 'lang'.
+ version: Return MediaWiki version string from Family file.
+ versionnumber: Return int identifying the MediaWiki version.
+ live_version: Return version number read from Special:Version.
+ checkCharset(charset): Warn if charset doesn't match family file.
+ server_time: returns server time (currently userclock depending)
+
+ getParsedString: Parses the string with API and returns html content.
+ getExpandedString: Expands the string with API and returns wiki content.
+
+ linktrail: Return regex for trailing chars displayed as part of a link.
+ disambcategory: Category in which disambiguation pages are listed.
+
+ Methods that yield Page objects derived from a wiki's Special: pages
+ (note, some methods yield other information in a tuple along with the
+ Pages; see method docs for details) --
+
+ search(query): query results from Special:Search
+ allpages(): Special:Allpages
+ prefixindex(): Special:Prefixindex
+ protectedpages(): Special:ProtectedPages
+ newpages(): Special:Newpages
+ newimages(): Special:Log&type=upload
+ longpages(): Special:Longpages
+ shortpages(): Special:Shortpages
+ categories(): Special:Categories (yields Category objects)
+ deadendpages(): Special:Deadendpages
+ ancientpages(): Special:Ancientpages
+ lonelypages(): Special:Lonelypages
+ recentchanges(): Special:Recentchanges
+ unwatchedpages(): Special:Unwatchedpages (sysop accounts only)
+ uncategorizedcategories(): Special:Uncategorizedcategories (yields
+ Category objects)
+ uncategorizedpages(): Special:Uncategorizedpages
+ uncategorizedimages(): Special:Uncategorizedimages (yields
+ ImagePage objects)
+ uncategorizedtemplates(): Special:UncategorizedTemplates
+ unusedcategories(): Special:Unusedcategories (yields Category)
+ unusedfiles(): Special:Unusedimages (yields ImagePage)
+ randompage: Special:Random
+ randomredirectpage: Special:RandomRedirect
+ withoutinterwiki: Special:Withoutinterwiki
+ linksearch: Special:Linksearch
+
+ Convenience methods that provide access to properties of the wiki Family
+ object; all of these are read-only and return a unicode string unless
+ noted --
+
+ encoding: The current encoding for this site.
+ encodings: List of all historical encodings for this site.
+ category_namespace: Canonical name of the Category namespace on this
+ site.
+ category_namespaces: List of all valid names for the Category
+ namespace.
+ image_namespace: Canonical name of the Image namespace on this site.
+ template_namespace: Canonical name of the Template namespace on this
+ site.
+ protocol: Protocol ('http' or 'https') for access to this site.
+ hostname: Host portion of site URL.
+ path: URL path for index.php on this Site.
+ dbName: MySQL database name.
+
+ Methods that return addresses to pages on this site (usually in
+ Special: namespace); these methods only return URL paths, they do not
+ interact with the wiki --
+
+ export_address: Special:Export.
+ query_address: URL path + '?' for query.php
+ api_address: URL path + '?' for api.php
+ apipath: URL path for api.php
+ move_address: Special:Movepage.
+ delete_address(s): Delete title 's'.
+ undelete_view_address(s): Special:Undelete for title 's'
+ undelete_address: Special:Undelete.
+ protect_address(s): Protect title 's'.
+ unprotect_address(s): Unprotect title 's'.
+ put_address(s): Submit revision to page titled 's'.
+ get_address(s): Retrieve page titled 's'.
+ nice_get_address(s): Short URL path to retrieve page titled 's'.
+ edit_address(s): Edit form for page titled 's'.
+ purge_address(s): Purge cache and retrieve page 's'.
+ block_address: Block an IP address.
+ unblock_address: Unblock an IP address.
+ blocksearch_address(s): Search for blocks on IP address 's'.
+ linksearch_address(s): Special:Linksearch for target 's'.
+ search_address(q): Special:Search for query 'q'.
+ allpages_address(s): Special:Allpages.
+ newpages_address: Special:Newpages.
+ longpages_address: Special:Longpages.
+ shortpages_address: Special:Shortpages.
+ unusedfiles_address: Special:Unusedimages.
+ categories_address: Special:Categories.
+ deadendpages_address: Special:Deadendpages.
+ ancientpages_address: Special:Ancientpages.
+ lonelypages_address: Special:Lonelypages.
+ protectedpages_address: Special:ProtectedPages
+ unwatchedpages_address: Special:Unwatchedpages.
+ uncategorizedcategories_address: Special:Uncategorizedcategories.
+ uncategorizedimages_address: Special:Uncategorizedimages.
+ uncategorizedpages_address: Special:Uncategorizedpages.
+ uncategorizedtemplates_address: Special:UncategorizedTemplates.
+ unusedcategories_address: Special:Unusedcategories.
+ withoutinterwiki_address: Special:Withoutinterwiki.
+ references_address(s): Special:Whatlinkshere for page 's'.
+ allmessages_address: Special:Allmessages.
+ upload_address: Special:Upload.
+ double_redirects_address: Special:Doubleredirects.
+ broken_redirects_address: Special:Brokenredirects.
+ random_address: Special:Random.
+ randomredirect_address: Special:Random.
+ login_address: Special:Userlogin.
+ captcha_image_address(id): Special:Captcha for image 'id'.
+ watchlist_address: Special:Watchlist editor.
+ contribs_address(target): Special:Contributions for user 'target'.
+
+ """
+
+ @deprecate_arg("persistent_http", None)
+ def __init__(self, code, fam=None, user=None):
+ """
+ @param code: the site's language code
+ @type code: str
+ @param fam: wiki family name (optional)
+ @type fam: str or Family
+ @param user: bot user name (optional)
+ @type user: str
+
+ """
+ self.__code = code.lower()
+ if isinstance(fam, basestring) or fam is None:
+ self.__family = Family(fam, fatal = False)
+ else:
+ self.__family = fam
+
+ # if we got an outdated language code, use the new one instead.
+ if self.__code in self.__family.obsolete:
+ if self.__family.obsolete[self.__code] is not None:
+ self.__code = self.__family.obsolete[self.__code]
+ else:
+ # no such language anymore
+ raise NoSuchSite("Language %s in family %s is obsolete"
+ % (self.__code, self.__family.name))
+ if self.__code not in self.languages():
+ if self.__code == 'zh-classic' \
+ and 'zh-classical' in self.languages():
+ self.__code = 'zh-classical'
+ # database hack (database is varchar[10], so zh-classical
+ # is cut to zh-classic)
+ elif self.__family.name in self.__family.langs.keys() \
+ or len(self.__family.langs) == 1:
+ self.__code = self.__family.name
+ else:
+ raise NoSuchSite("Language %s does not exist in family %s"
+ % (self.__code, self.__family.name))
+
+ self._mediawiki_messages = {}
+ self._info = {}
+ self._userName = [None, None]
+ self.nocapitalize = self.code in self.family.nocapitalize
+ self.user = user
+ self._userData = [False, False]
+ self._isLoggedIn = [None, None]
+ self._isBlocked = [None, None]
+ self._messages = [None, None]
+ self._rights = [None, None]
+ self._token = [None, None]
+ self._patrolToken = [None, None]
+ self._cookies = [None, None]
+ # Calculating valid languages took quite long, so we calculate it once
+ # in initialization instead of each time it is used.
+ self._validlanguages = []
+ for language in self.languages():
+ if not language[0].upper() + language[1:] in self.namespaces():
+ self._validlanguages.append(language)
+
+ def __call__(self):
+ """Since the Page.site() method has a property decorator, return the
+ site object for backwards-compatibility if Page.site() call is still
+ used instead of Page.site as recommended.
+
+ """
+## # DEPRECATED warning. Should be uncommented if scripts are actualized
+## pywikibot.output('Page.site() method is DEPRECATED, '
+## 'use Page.site instead.')
+ return self
+
+ @property
+ def family(self):
+ """The Family object for this Site's wiki family."""
+
+ return self.__family
+
+ @property
+ def code(self):
+ """The identifying code for this Site.
+
+ By convention, this is usually an ISO language code, but it does
+ not have to be.
+
+ """
+ return self.__code
+
+ @property
+ def lang(self):
+ """The ISO language code for this Site.
+
+ Presumed to be equal to the wiki prefix, but this can be overridden.
+
+ """
+ return self.__code
+
+ def __cmp__(self, other):
+ """Perform equality and inequality tests on Site objects."""
+
+ if not isinstance(other, Site):
+ return 1
+ if self.family.name == other.family.name:
+ return cmp(self.code ,other.code)
+ return cmp(self.family.name, other.family.name)
+
+ def _userIndex(self, sysop = False):
+ """Returns the internal index of the user."""
+ if sysop:
+ return 1
+ else:
+ return 0
+
+ def username(self, sysop = False):
+ return self._userName[self._userIndex(sysop = sysop)]
+
+ def sitename(self):
+ """Return string representing this Site's name and code."""
+
+ return self.family.name+':'+self.code
+
+ def __repr__(self):
+ return '%s:%s' % (self.family.name, self.code)
+
+ def __hash__(self):
+ return hash(repr(self))
+
+ def linktrail(self):
+ """Return regex for trailing chars displayed as part of a link.
+
+ Returns a string, not a compiled regular expression object.
+
+ This reads from the family file, and ''not'' from
+ [[MediaWiki:Linktrail]], because the MW software currently uses a
+ built-in linktrail from its message files and ignores the wiki
+ value.
+
+ """
+ return self.family.linktrail(self.code)
+
+ def languages(self):
+ """Return list of all valid language codes for this site's Family."""
+
+ return self.family.iwkeys
+
+ def validLanguageLinks(self):
+ """Return list of language codes that can be used in interwiki links."""
+ return self._validlanguages
+
+ def namespaces(self):
+ """Return list of canonical namespace names for this Site."""
+
+ # n.b.: this does not return namespace numbers; to determine which
+ # numeric namespaces the framework recognizes for this Site (which
+ # may or may not actually exist on the wiki), use
+ # self.family.namespaces.keys()
+
+ if self in _namespaceCache:
+ return _namespaceCache[self]
+ else:
+ nslist = []
+ for n in self.family.namespaces:
+ try:
+ ns = self.family.namespace(self.lang, n)
+ except KeyError:
+ # No default namespace defined
+ continue
+ if ns is not None:
+ nslist.append(self.family.namespace(self.lang, n))
+ _namespaceCache[self] = nslist
+ return nslist
+
+ def redirect(self, default=False):
+ """Return the localized redirect tag for the site.
+
+ """
+ # return the magic word without the preceding '#' character
+ if default or self.versionnumber() <= 13:
+ return u'REDIRECT'
+ else:
+ return self.getmagicwords('redirect')[0].lstrip("#")
+
+ def loggedInAs(self, sysop = False):
+ """Return the current username if logged in, otherwise return None.
+
+ Checks if we're logged in by loading a page and looking for the login
+ link. We assume that we're not being logged out during a bot run, so
+ loading the test page is only required once.
+
+ """
+ index = self._userIndex(sysop)
+ if self._isLoggedIn[index] is None:
+ # Load the details only if you don't know the login status.
+ # Don't load them just because the other details aren't known.
+ self._load(sysop = sysop)
+ if self._isLoggedIn[index]:
+ return self._userName[index]
+ else:
+ return None
+
+ def forceLogin(self, sysop = False):
+ """Log the user in if not already logged in."""
+ if not self.loggedInAs(sysop = sysop):
+ loginMan = login.LoginManager(site = self, sysop = sysop)
+ #loginMan.logout()
+ if loginMan.login(retry = True):
+ index = self._userIndex(sysop)
+ self._isLoggedIn[index] = True
+ self._userName[index] = loginMan.username
+ # We know nothing about the new user (but its name)
+ # Old info is about the anonymous user
+ self._userData[index] = False
+
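A sketch of the login helpers above, assuming credentials are configured in user-config.py:

    import wikipedia

    site = wikipedia.getSite()
    if not site.loggedInAs():
        site.forceLogin()               # logs in with the username from user-config.py
    wikipedia.output(u'Logged in as %s' % site.loggedInAs())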
+ def checkBlocks(self, sysop = False):
+ """Check if the user is blocked, and raise an exception if so."""
+ self._load(sysop = sysop)
+ index = self._userIndex(sysop)
+ if self._isBlocked[index]:
+ # User blocked
+ raise UserBlocked('User is blocked in site %s' % self)
+
+ def isBlocked(self, sysop = False):
+ """Check if the user is blocked."""
+ self._load(sysop = sysop)
+ index = self._userIndex(sysop)
+ if self._isBlocked[index]:
+ # User blocked
+ return True
+ else:
+ return False
+
+ def _getBlock(self, sysop = False):
+ """Get user block data from the API."""
+ try:
+ params = {
+ 'action': 'query',
+ 'meta': 'userinfo',
+ 'uiprop': 'blockinfo',
+ }
+ data = query.GetData(params, self)
+ if not data or 'error' in data:
+ return False
+ if self.versionnumber() == 11: # fix for version 1.11 API.
+ data = data['userinfo']
+ else:
+ data = data['query']['userinfo']
+ return 'blockedby' in data
+ except NotImplementedError:
+ return False
+
+ def isAllowed(self, right, sysop = False):
+ """Check if the user has a specific right.
+ Among possible rights:
+ * Actions: edit, move, delete, protect, upload
+ * User levels: autoconfirmed, sysop, bot, empty string (always true)
+ """
+ if right == '' or right is None:
+ return True
+ else:
+ self._load(sysop = sysop)
+ index = self._userIndex(sysop)
+ # Handle obsolete editusercssjs permission
+ if right in ['editusercss', 'edituserjs'] \
+ and right not in self._rights[index]:
+ return 'editusercssjs' in self._rights[index]
+ return right in self._rights[index]
+
+ def server_time(self):
+ """Return a datetime object representing the server time."""
+ # It currently depends on the user's clock.
+ return self.family.server_time()
+
+ def messages(self, sysop = False):
+ """Returns true if the user has new messages, and false otherwise."""
+ self._load(sysop = sysop)
+ index = self._userIndex(sysop)
+ return self._messages[index]
+
+ def cookies(self, sysop = False):
+ """Return a string containing the user's current cookies."""
+ self._loadCookies(sysop = sysop)
+ index = self._userIndex(sysop)
+ if self._cookies[index]:
+ #convert cookies dictionary data to string.
+ outputDatas = ""
+ for k, v in self._cookies[index].iteritems():
+ if v:
+ outputDatas += "%s=%s; " % (k,v)
+ else:
+ # protection for value ''
+ outputDatas += "%s=none; " % k
+ return outputDatas
+ else:
+ return None
+
+ def _loadCookies(self, sysop = False):
+ """
+ Retrieve session cookies for login.
+ If the family data define cross projects, this function will also look for
+ a central login file written by this site or an available cross-project
+ site, and will read that cookie data if one of them exists.
+ """
+ index = self._userIndex(sysop)
+ if self._cookies[index] is not None:
+ return
+ try:
+ if sysop:
+ try:
+ username = config.sysopnames[self.family.name][self.lang]
+ except KeyError:
+ raise NoUsername("""\
+You tried to perform an action that requires admin privileges, but you haven't
+entered your sysop name in your user-config.py. Please add
+sysopnames['%s']['%s']='name' to your user-config.py"""
+ % (self.family.name, self.lang))
+ else:
+ username = config.usernames[self.family.name][self.lang]
+ except KeyError:
+ self._cookies[index] = None
+ self._isLoggedIn[index] = False
+ else:
+ # check central login data if cross_projects is available.
+ localFn = '%s-%s-%s-login.data' % (self.family.name, self.lang, username)
+ localPa = config.datafilepath('login-data', localFn)
+ if self.family.cross_projects:
+ for proj in [self.family.name] + self.family.cross_projects:
+ #find all central data in all cross_projects
+ centralFn = '%s-%s-central-login.data' % (proj, username)
+ centralPa = config.datafilepath('login-data', centralFn)
+ if os.path.exists(centralPa):
+ self._cookies[index] = self._readCookies(centralFn)
+ break
+
+ if os.path.exists(localPa):
+ #read and dump local logindata into self._cookies[index]
+ # if self._cookies[index] is not available, read the local data and set the dictionary.
+ if type(self._cookies[index]) == dict:
+ for k, v in self._readCookies(localFn).iteritems():
+ if k not in self._cookies[index]:
+ self._cookies[index][k] = v
+ else:
+ self._cookies[index] = dict([(k,v) for k,v in self._readCookies(localFn).iteritems()])
+ #self._cookies[index] = query.CombineParams(self._cookies[index], self._readCookies(localFn))
+ elif not os.path.exists(localPa) and not self.family.cross_projects:
+ #keep anonymous mode if not login and centralauth not enable
+ self._cookies[index] = None
+ self._isLoggedIn[index] = False
+
+ def _readCookies(self, filename):
+ """read login cookie file and return a dictionary."""
+ try:
+ f = open( config.datafilepath('login-data', filename), 'r')
+ ck = re.compile("(.*?)=(.*?)\r?\n")
+ data = dict([(x[0],x[1]) for x in ck.findall(f.read())])
+ #data = dict(ck.findall(f.read()))
+ f.close()
+ return data
+ except IOError:
+ return None
+
+ def _setupCookies(self, datas, sysop = False):
+ """Save the cookie dictionary to files.
+ If cross_project is enabled, the data are saved to two separate files: central data and local data.
+ """
+ index = self._userIndex(sysop)
+ if not self._cookies[index]:
+ self._cookies[index] = datas
+ cache = {0:"",1:""} #0 is central auth, 1 is local.
+ if not self.username(sysop):
+ if not self._cookies[index]:
+ return
+ elif self.family.cross_projects_cookie_username in self._cookies[index]:
+ # When centralauth provides cross-project login data, forceLogin is not
+ # necessary, but Site() does not know about it.
+ # So we need to add the centralauth username data to the site attributes.
+ self._userName[index] = self._cookies[index][self.family.cross_projects_cookie_username]
+
+
+ for k, v in datas.iteritems():
+ #put key and values into save cache
+ if self.family.cross_projects and k in self.family.cross_projects_cookies:
+ cache[0] += "%s=%s\n" % (k,v)
+ else:
+ cache[1] += "%s=%s\n" % (k,v)
+
+ # write the data.
+ if self.family.cross_projects and cache[0]:
+ filename = '%s-%s-central-login.data' % (self.family.name, self.username(sysop))
+ f = open(config.datafilepath('login-data', filename), 'w')
+ f.write(cache[0])
+ f.close()
+
+ filename = '%s-%s-%s-login.data' % (self.family.name, self.lang, self.username(sysop))
+ f = open(config.datafilepath('login-data', filename), 'w')
+ f.write(cache[1])
+ f.close()
+
+ def _removeCookies(self, name):
+ # remove cookies.
+ # ToDo: remove all local datas if cross_projects enable.
+ #
+ if self.family.cross_projects:
+ file = config.datafilepath('login-data', '%s-%s-central-login.data' % (self.family.name, name))
+ if os.path.exists(file):
+ os.remove( file )
+ file = config.datafilepath('login-data', '%s-%s-%s-login.data' % (self.family.name, self.lang, name))
+ if os.path.exists(file):
+ os.remove(file)
+
+ def updateCookies(self, datas, sysop = False):
+ """Check and update the current cookie data and save it back to the files."""
+ index = self._userIndex(sysop)
+ if not self._cookies[index]:
+ self._setupCookies(datas, sysop)
+
+ for k, v in datas.iteritems():
+ if k in self._cookies[index]:
+ if v != self._cookies[index][k]:
+ self._cookies[index][k] = v
+ else:
+ self._cookies[index][k] = v
+
+ self._setupCookies(self._cookies[index], sysop)
+
+ def urlEncode(self, query):
+ """Encode a query so that it can be sent using an http POST request."""
+ if not query:
+ return None
+ if hasattr(query, 'iteritems'):
+ iterator = query.iteritems()
+ else:
+ iterator = iter(query)
+ l = []
+ wpEditToken = None
+ for key, value in iterator:
+ if isinstance(key, unicode):
+ key = key.encode('utf-8')
+ if isinstance(value, unicode):
+ value = value.encode('utf-8')
+ key = urllib.quote(key)
+ value = urllib.quote(value)
+ if key == 'wpEditToken':
+ wpEditToken = value
+ continue
+ l.append(key + '=' + value)
+
+ # wpEditToken is explicitly added as last value.
+ # If a premature connection abort occurs while putting, the server will
+ # not have received an edit token and thus refuse saving the page
+ if wpEditToken is not None:
+ l.append('wpEditToken=' + wpEditToken)
+ return '&'.join(l)
+
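A sketch of urlEncode(); the form fields and token are hypothetical, and the output order of the other fields is not guaranteed, but wpEditToken is always emitted last:

    import wikipedia

    site = wikipedia.getSite()
    predata = {
        'wpEditToken': 'abc123+\\',     # hypothetical token, deliberately listed first
        'wpTextbox1': u'Some text',
        'action': 'submit',
    }
    print site.urlEncode(predata)
    # e.g. wpTextbox1=Some%20text&action=submit&wpEditToken=abc123%2B%5C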
+ def solveCaptcha(self, data):
+ if type(data) == dict: # API Mode result
+ if 'edit' in data and data['edit']['result'] != u"Success":
+ data = data['edit']
+ if "captcha" in data:
+ data = data['captcha']
+ captype = data['type']
+ id = data['id']
+ if captype in ['simple', 'math', 'question']:
+ answer = input('What is the answer to the captcha "%s" ?' % data['question'])
+ elif captype == 'image':
+ url = self.protocol() + '://' + self.hostname() + self.captcha_image_address(id)
+ answer = ui.askForCaptcha(url)
+ else: #no captcha id result, maybe ReCaptcha.
+ raise CaptchaError('We have been prompted for a ReCaptcha, but pywikipedia does not yet support ReCaptchas')
+ return {'id':id, 'answer':answer}
+ return None
+ else:
+ captchaW = re.compile('<label for="wpCaptchaWord">(?P<question>[^<]*)</label>')
+ captchaR = re.compile('<input type="hidden" name="wpCaptchaId" id="wpCaptchaId" value="(?P<id>\d+)" />')
+ match = captchaR.search(data)
+ if match:
+ id = match.group('id')
+ match = captchaW.search(data)
+ if match:
+ answer = input('What is the answer to the captcha "%s" ?' % match.group('question'))
+ else:
+ if not config.solve_captcha:
+ raise CaptchaError(id)
+ url = self.protocol() + '://' + self.hostname() + self.captcha_image_address(id)
+ answer = ui.askForCaptcha(url)
+ return {'id':id, 'answer':answer}
+ Recaptcha = re.compile('<script type="text/javascript"
src="http://api\.recaptcha\.net/[^"]*"></script>…)
+ if Recaptcha.search(data):
+ raise CaptchaError('We have been prompted for a ReCaptcha, but pywikipedia does not yet support ReCaptchas')
+ return None
+
+ def postForm(self, address, predata, sysop = False, cookies = None):
+ """Post http form data to the given address at this site.
+
+ address - the absolute path without hostname.
+ predata - a dict or any iterable that can be converted to a dict,
+ containing keys and values for the http form.
+ cookies - the cookies to send with the form. If None, send self.cookies
+
+ Return a (response, data) tuple, where response is the HTTP
+ response object and data is a Unicode string containing the
+ body of the response.
+
+ """
+ if ('action' in predata) and pywikibot.simulate and \
+ (predata['action'] in pywikibot.config.actions_to_block) and \
+ (address not in [self.export_address()]):
+ pywikibot.output(u'\03{lightyellow}SIMULATION: %s action blocked.\03{default}' % \
+ predata['action'])
+ import StringIO
+ f_dummy = StringIO.StringIO()
+ f_dummy.__dict__.update({u'code': 0, u'msg': u''})
+ return f_dummy, u''
+
+ data = self.urlEncode(predata)
+ try:
+ if cookies:
+ return self.postData(address, data, sysop=sysop,
+ cookies=cookies)
+ else:
+ return self.postData(address, data, sysop=sysop,
+ cookies=self.cookies(sysop = sysop))
+ except socket.error, e:
+ raise ServerError(e)
+
+ def postData(self, address, data,
+ contentType = 'application/x-www-form-urlencoded',
+ sysop = False, compress = True, cookies = None):
+ """Post encoded data to the given http address at this site.
+
+ address is the absolute path without hostname.
+ data is an ASCII string that has been URL-encoded.
+
+ Returns a (response, data) tuple where response is the HTTP
+ response object and data is a Unicode string containing the
+ body of the response.
+ """
+
+ if address[-1] == "?":
+ address = address[:-1]
+
+ headers = {
+ 'User-agent': useragent,
+ 'Content-Length': str(len(data)),
+ 'Content-type':contentType,
+ }
+ if cookies:
+ headers['Cookie'] = cookies
+
+ if compress:
+ headers['Accept-encoding'] = 'gzip'
+ #print '%s' % headers
+
+ url = '%s://%s%s' % (self.protocol(), self.hostname(), address)
+ # Try to retrieve the page until it was successfully loaded (just in
+ # case the server is down or overloaded).
+ # Wait for retry_idle_time minutes (growing!) between retries.
+ retry_idle_time = 1
+ retry_attempt = 0
+ while True:
+ try:
+ request = urllib2.Request(url, data, headers)
+ f = MyURLopener.open(request)
+
+ # read & info can raise socket.error
+ text = f.read()
+ headers = f.info()
+ break
+ except KeyboardInterrupt:
+ raise
+ except urllib2.HTTPError, e:
+ if e.code in [401, 404]:
+                    raise PageNotFound(u'Page %s could not be retrieved. Check your family file ?' % url)
+ # just check for HTTP Status 500 (Internal Server Error)?
+ elif e.code in [500, 502, 504]:
+ output(u'HTTPError: %s %s' % (e.code, e.msg))
+ if config.retry_on_fail:
+ retry_attempt += 1
+ if retry_attempt > config.maxretries:
+ raise MaxTriesExceededError()
+                        output(u"WARNING: Could not open '%s'.\nMaybe the server is down. Retrying in %i minutes..."
+ % (url, retry_idle_time))
+ time.sleep(retry_idle_time * 60)
+ # Next time wait longer, but not longer than half an hour
+ retry_idle_time *= 2
+ if retry_idle_time > 30:
+ retry_idle_time = 30
+ continue
+ raise
+ else:
+ output(u"Result: %s %s" % (e.code, e.msg))
+ raise
+ except Exception, e:
+ output(u'%s' %e)
+ if pywikibot.verbose:
+ import traceback
+ traceback.print_exc()
+
+ if config.retry_on_fail:
+ retry_attempt += 1
+ if retry_attempt > config.maxretries:
+ raise MaxTriesExceededError()
+                    output(u"WARNING: Could not open '%s'. Maybe the server or\n your connection is down. Retrying in %i minutes..."
+ % (url, retry_idle_time))
+ time.sleep(retry_idle_time * 60)
+ retry_idle_time *= 2
+ if retry_idle_time > 30:
+ retry_idle_time = 30
+ continue
+ raise
+
+ # check cookies return or not, if return, send its to update.
+ if hasattr(f, 'sheaders'):
+ ck = f.sheaders
+ else:
+ ck = f.info().getallmatchingheaders('set-cookie')
+ if ck:
+ Reat=re.compile(': (.*?)=(.*?);')
+ tmpc = {}
+ for d in ck:
+ m = Reat.search(d)
+ if m: tmpc[m.group(1)] = m.group(2)
+ if self.cookies(sysop):
+ self.updateCookies(tmpc, sysop)
+
+ resContentType = headers.get('content-type', '')
+ contentEncoding = headers.get('content-encoding', '')
+
+ # Ensure that all sent data is received
+ # In rare cases we found a douple Content-Length in the header.
+ # We need to split it to get a value
+        content_length = int(headers.get('content-length', '0').split(',')[0])
+ if content_length != len(text) and 'content-length' in headers:
+ output(
+ u'Warning! len(text) does not match content-length: %s != %s'
+ % (len(text), content_length))
+ return self.postData(address, data, contentType, sysop, compress,
+ cookies)
+
+ if compress and contentEncoding == 'gzip':
+ text = decompress_gzip(text)
+
+ R = re.compile('charset=([^\'\";]+)')
+ m = R.search(resContentType)
+ if m:
+ charset = m.group(1)
+ else:
+ if verbose:
+ output(u"WARNING: No character set found.")
+ # UTF-8 as default
+ charset = 'utf-8'
+ # Check if this is the charset we expected
+ self.checkCharset(charset)
+ # Convert HTML to Unicode
+ try:
+ text = unicode(text, charset, errors = 'strict')
+ except UnicodeDecodeError, e:
+ print e
+ output(u'ERROR: Invalid characters found on %s://%s%s, replaced by
\\ufffd.'
+ % (self.protocol(), self.hostname(), address))
+ # We use error='replace' in case of bad encoding.
+ text = unicode(text, charset, errors = 'replace')
+
+ # If a wiki page, get user data
+ self._getUserDataOld(text, sysop = sysop)
+
+ return f, text
+
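A small usage sketch (editorial, assuming an existing Site object `site` and that urlEncode accepts a plain dict of form fields): urlEncode builds the application/x-www-form-urlencoded body that postData expects, and postData returns the response object plus the decoded body:

    body = site.urlEncode({'title': 'Sandbox', 'action': 'edit'})
    response, text = site.postData(site.path(), body)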
+    #@deprecated("pywikibot.comms.http.request") # in 'trunk' not yet...
+ def getUrl(self, path, retry = None, sysop = False, data = None, compress = True,
+ no_hostname = False, cookie_only=False, refer=None, back_response=False):
+ """
+ Low-level routine to get a URL from the wiki. Tries to login if it is
+ another wiki.
+
+ Parameters:
+ path - The absolute path, without the hostname.
+ retry - If True, retries loading the page when a network error
+ occurs.
+ sysop - If True, the sysop account's cookie will be used.
+ data - An optional dict providing extra post request parameters.
+ cookie_only - Only return the cookie the server sent us back
+
+ Returns the HTML text of the page converted to unicode.
+ """
+ from pywikibot.comms import http
+
+ f, text = http.request(self, path, retry, sysop, data, compress,
+ no_hostname, cookie_only, refer, back_response = True)
+
+ # If a wiki page, get user data
+ self._getUserDataOld(text, sysop = sysop)
+
+ if back_response:
+ return f, text
+
+ return text
+
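For illustration (not part of the file), two ways getUrl is typically called, assuming a Site object `site`:

    # plain fetch: returns the page HTML as unicode
    html = site.getUrl(site.edit_address('Sandbox'))
    # with back_response=True the HTTP response object is returned as well
    response, html = site.getUrl(site.edit_address('Sandbox'), back_response=True)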
+ def _getUserData(self, text, sysop = False, force = True):
+ """
+ Get the user data from an API query dict.
+
+ Parameters:
+ * text - the page text
+ * sysop - is the user a sysop?
+ """
+
+ index = self._userIndex(sysop)
+ # Check for blocks
+
+ if 'blockedby' in text and not self._isBlocked[index]:
+ # Write a warning if not shown earlier
+ if sysop:
+ account = 'Your sysop account'
+ else:
+ account = 'Your account'
+            output(u'\nWARNING: %s on %s is blocked by %s.\nReason: %s\nEditing using this account will stop the run.\n'
+                   % (account, self, text['blockedby'], text['blockreason']))
+ self._isBlocked[index] = 'blockedby' in text
+
+ # Check for new messages, the data must had key 'messages' in dict.
+ if 'messages' in text:
+ if not self._messages[index]:
+ # User has *new* messages
+ if sysop:
+                    output(u'NOTE: You have new messages in your sysop account on %s' % self)
+ else:
+ output(u'NOTE: You have new messages on %s' % self)
+ self._messages[index] = True
+ else:
+ self._messages[index] = False
+
+ # Don't perform other checks if the data was already loaded
+ if self._userData[index] and not force:
+ return
+
+ # Get username.
+ # The data in anonymous mode had key 'anon'
+ # if 'anon' exist, username is IP address, not to collect it right now
+ if not 'anon' in text:
+ self._isLoggedIn[index] = True
+ self._userName[index] = text['name']
+ else:
+ self._isLoggedIn[index] = False
+ self._userName[index] = None
+
+ # Get user groups and rights
+ if 'groups' in text:
+ self._rights[index] = []
+ for group in text['groups']:
+ # Convert dictionaries to list items (bug 3311663)
+ if isinstance(group, dict):
+ self._rights[index].extend(group.keys())
+ else:
+ self._rights[index].append(group)
+ self._rights[index].extend(text['rights'])
+ # Warnings
+ # Don't show warnings for not logged in users, they will just fail to
+ # do any action
+ if self._isLoggedIn[index]:
+                if 'bot' not in self._rights[index] and config.notify_unflagged_bot:
+                    # Sysop + bot flag = Sysop flag in MediaWiki < 1.7.1?
+                    if sysop:
+                        output(u'Note: Your sysop account on %s does not have a bot flag. Its edits will be visible in the recent changes.' % self)
+                    else:
+                        output(u'WARNING: Your account on %s does not have a bot flag. Its edits will be visible in the recent changes and it may get blocked.' % self)
+                if sysop and 'sysop' not in self._rights[index]:
+                    output(u'WARNING: Your sysop account on %s does not seem to have sysop rights. You may not be able to perform any sysop-restricted actions using it.' % self)
+ else:
+ # 'groups' is not exists, set default rights
+ self._rights[index] = []
+ if self._isLoggedIn[index]:
+ # Logged in user
+ self._rights[index].append('user')
+ # Assume bot, and thus autoconfirmed
+ self._rights[index].extend(['bot', 'autoconfirmed'])
+ if sysop:
+ # Assume user reported as a sysop indeed has the sysop rights
+ self._rights[index].append('sysop')
+ # Assume the user has the default rights if API not query back
+                self._rights[index].extend(['read', 'createaccount', 'edit', 'upload', 'createpage', 'createtalk', 'move', 'upload'])
+ #remove Duplicate rights
+ self._rights[index] = list(set(self._rights[index]))
+
+ # Get token
+ if 'preferencestoken' in text:
+ self._token[index] = text['preferencestoken']
+ if self._rights[index] is not None:
+ # Token and rights are loaded - user data is now loaded
+ self._userData[index] = True
+ elif self.versionnumber() < 14:
+            # uiprop 'preferencestoken' is start from 1.14, if 1.8~13, we need to use other way to get token
+ params = {
+ 'action': 'query',
+ 'prop': 'info',
+ 'titles':'Non-existing page',
+ 'intoken': 'edit',
+ }
+            data = query.GetData(params, self, sysop=sysop)['query']['pages'].values()[0]
+ if 'edittoken' in data:
+ self._token[index] = data['edittoken']
+ self._userData[index] = True
+ else:
+                output(u'WARNING: Token not found on %s. You will not be able to edit any page.' % self)
+        else:
+            if not self._isBlocked[index]:
+                output(u'WARNING: Token not found on %s. You will not be able to edit any page.' % self)
+
+ def _getUserDataOld(self, text, sysop = False, force = True):
+ """
+ Get the user data from a wiki page data.
+
+ Parameters:
+ * text - the page text
+ * sysop - is the user a sysop?
+ """
+
+ index = self._userIndex(sysop)
+
+ if '<div id="globalWrapper">' not in text:
+ # Not a wiki page
+ return
+ # Check for blocks - but only if version is 1.11 (userinfo is available)
+ # and the user data was not yet loaded
+ if self.versionnumber() >= 11 and (not self._userData[index] or force):
+ blocked = self._getBlock(sysop = sysop)
+ if blocked and not self._isBlocked[index]:
+ # Write a warning if not shown earlier
+ if sysop:
+ account = 'Your sysop account'
+ else:
+ account = 'Your account'
+                output(u'WARNING: %s on %s is blocked. Editing using this account will stop the run.' % (account, self))
+ self._isBlocked[index] = blocked
+
+ # Check for new messages
+ if '<div class="usermessage">' in text:
+ if not self._messages[index]:
+ # User has *new* messages
+ if sysop:
+                    output(u'NOTE: You have new messages in your sysop account on %s' % self)
+ else:
+ output(u'NOTE: You have new messages on %s' % self)
+ self._messages[index] = True
+ else:
+ self._messages[index] = False
+ # Don't perform other checks if the data was already loaded
+ if self._userData[index] and not force:
+ return
+
+ # Search for the the user page link at the top.
+ # Note that the link of anonymous users (which doesn't exist at all
+ # in Wikimedia sites) has the ID pt-anonuserpage, and thus won't be
+ # found here.
+        userpageR = re.compile('<li id="pt-userpage".*?><a href=".+?".*?>(?P<username>.+?)</a></li>')
+ m = userpageR.search(text)
+ if m:
+ self._isLoggedIn[index] = True
+ self._userName[index] = m.group('username')
+ else:
+ self._isLoggedIn[index] = False
+ # No idea what is the user name, and it isn't important
+ self._userName[index] = None
+
+        if self.family.name == 'wikitravel': # fix for Wikitravel's user page link.
+ self = self.family.user_page_link(self,index)
+
+ # Check user groups, if possible (introduced in 1.10)
+ groupsR = re.compile(r'var wgUserGroups = \[\"(.+)\"\];')
+ m = groupsR.search(text)
+ checkLocal = True
+        if default_code in self.family.cross_allowed: # if current languages in cross allowed list, check global bot flag.
+            globalgroupsR = re.compile(r'var wgGlobalGroups = \[\"(.+)\"\];')
+            mg = globalgroupsR.search(text)
+            if mg: # the account had global permission
+                globalRights = mg.group(1)
+                globalRights = globalRights.split('","')
+                self._rights[index] = globalRights
+                if self._isLoggedIn[index]:
+                    if 'Global_bot' in globalRights: # This account has the global bot flag, no need to check local flags.
+                        checkLocal = False
+                    else:
+                        output(u'Your bot account does not have global the bot flag, checking local flag.')
+ else:
+                if verbose: output(u'Note: this language does not allow global bots.')
+ if m and checkLocal:
+ rights = m.group(1)
+ rights = rights.split('", "')
+ if '*' in rights:
+ rights.remove('*')
+ self._rights[index] = rights
+ # Warnings
+ # Don't show warnings for not logged in users, they will just fail to
+ # do any action
+ if self._isLoggedIn[index]:
+                if 'bot' not in self._rights[index] and config.notify_unflagged_bot:
+                    # Sysop + bot flag = Sysop flag in MediaWiki < 1.7.1?
+                    if sysop:
+                        output(u'Note: Your sysop account on %s does not have a bot flag. Its edits will be visible in the recent changes.' % self)
+                    else:
+                        output(u'WARNING: Your account on %s does not have a bot flag. Its edits will be visible in the recent changes and it may get blocked.' % self)
+                if sysop and 'sysop' not in self._rights[index]:
+                    output(u'WARNING: Your sysop account on %s does not seem to have sysop rights. You may not be able to perform any sysop-restricted actions using it.' % self)
+ else:
+ # We don't have wgUserGroups, and can't check the rights
+ self._rights[index] = []
+ if self._isLoggedIn[index]:
+ # Logged in user
+ self._rights[index].append('user')
+ # Assume bot, and thus autoconfirmed
+ self._rights[index].extend(['bot', 'autoconfirmed'])
+ if sysop:
+ # Assume user reported as a sysop indeed has the sysop rights
+ self._rights[index].append('sysop')
+ # Assume the user has the default rights
+                self._rights[index].extend(['read', 'createaccount', 'edit', 'upload', 'createpage', 'createtalk', 'move', 'upload'])
+            if 'bot' in self._rights[index] or 'sysop' in self._rights[index]:
+                self._rights[index].append('apihighlimits')
+            if 'sysop' in self._rights[index]:
+                self._rights[index].extend(['delete', 'undelete', 'block', 'protect', 'import', 'deletedhistory', 'unwatchedpages'])
+
+ # Search for a token
+        tokenR = re.compile(r"\<input type='hidden' value=\"(.*?)\" name=\"wpEditToken\"")
+ tokenloc = tokenR.search(text)
+ if tokenloc:
+ self._token[index] = tokenloc.group(1)
+ if self._rights[index] is not None:
+ # In this case, token and rights are loaded - user data is now loaded
+ self._userData[index] = True
+ else:
+ # Token not found
+ # Possible reason for this is the user is blocked, don't show a
+ # warning in this case, otherwise do show a warning
+ # Another possible reason is that the page cannot be edited - ensure
+ # there is a textarea and the tab "view source" is not shown
+            if u'<textarea' in text and u'<li id="ca-viewsource"' not in text and not self._isBlocked[index]:
+                # Token not found
+                output(u'WARNING: Token not found on %s. You will not be able to edit any page.' % self)
+
+ def siteinfo(self, key = 'general', force = False, dump = False):
+ """Get Mediawiki Site informations by API
+ dump - return all siteinfo datas
+
+        some siprop params is huge data for MediaWiki, they take long times to read by testment.
+ these params could get, but only one by one.
+
+ """
+ # protection for key in other datatype
+ if type(key) not in [str, unicode]:
+ key = 'general'
+
+ if self._info and key in self._info and not force:
+ if dump:
+ return self._info
+ else:
+ return self._info[key]
+
+ params = {
+ 'action':'query',
+ 'meta':'siteinfo',
+ 'siprop':['general', 'namespaces', ],
+ }
+ #ver 1.10 handle
+ if self.versionnumber() > 10:
+ params['siprop'].extend(['statistics', ])
+        if key in ['specialpagealiases', 'interwikimap', 'namespacealiases', 'usergroups', ]:
+ if verbose: print 'getting huge siprop %s...' % key
+ params['siprop'] = [key]
+
+ #ver 1.13 handle
+ if self.versionnumber() > 13:
+            if key not in ['specialpagealiases', 'interwikimap', 'namespacealiases', 'usergroups', ]:
+                params['siprop'].extend(['fileextensions', 'rightsinfo', ])
+ if key in ['magicwords', 'extensions', ]:
+ if verbose: print 'getting huge siprop %s...' % key
+ params['siprop'] = [key]
+ try:
+ data = query.GetData(params, self)['query']
+ except NotImplementedError:
+ return None
+
+ if not hasattr(self, '_info'):
+ self._info = data
+ else:
+ if key == 'magicwords':
+ if self.versionnumber() <= 13:
+ return None #Not implemented
+ self._info[key]={}
+ for entry in data[key]:
+ self._info[key][entry['name']] = entry['aliases']
+ else:
+ for k, v in data.iteritems():
+ self._info[k] = v
+ #data pre-process
+ if dump:
+ return self._info
+ else:
+ return self._info.get(key)
+
+ def mediawiki_message(self, key, forceReload = False):
+        """Return the MediaWiki message text for key "key" """
+ # Allmessages is retrieved once for all per created Site object
+ if (not self._mediawiki_messages) or forceReload:
+ api = self.has_api()
+ if verbose:
+                output(u"Retrieving mediawiki messages from Special:Allmessages")
+            # Only MediaWiki r27393/1.12 and higher support XML output for Special:Allmessages
+ if self.versionnumber() < 12:
+ usePHP = True
+ else:
+ usePHP = False
+ elementtree = True
+ try:
+ try:
+ from xml.etree.cElementTree import XML # 2.5
+ except ImportError:
+ try:
+ from cElementTree import XML
+ except ImportError:
+ from elementtree.ElementTree import XML
+ except ImportError:
+ if verbose:
+                    output(u'Elementtree was not found, using BeautifulSoup instead')
+ elementtree = False
+
+ if config.use_diskcache and not api:
+ import diskcache
+                _dict = lambda x : diskcache.CachedReadOnlyDictI(x, prefix = "msg-%s-%s-" % (self.family.name, self.lang))
+ else:
+ _dict = dict
+
+ retry_idle_time = 1
+ retry_attempt = 0
+ while True:
+                if api and self.versionnumber() >= 12 or self.versionnumber() >= 16:
+ params = {
+ 'action': 'query',
+ 'meta': 'allmessages',
+ 'ammessages': key,
+ }
+                    datas = query.GetData(params, self)['query']['allmessages'][0]
+                    if "missing" in datas:
+                        raise KeyError("message is not exist.")
+                    elif datas['name'] not in self._mediawiki_messages:
+                        self._mediawiki_messages[datas['name']] = datas['*']
+                    #self._mediawiki_messages = _dict([(tag['name'].lower(), tag['*'])
+                    #        for tag in datas if not 'missing' in tag])
+                elif usePHP:
+                    phppage = self.getUrl(self.get_address("Special:Allmessages") + "&ot=php")
+                    Rphpvals = re.compile(r"(?ms)'([^']*)' => '(.*?[^\\])',")
+                    # Previous regexp don't match empty messages. Fast workaround...
+                    phppage = re.sub("(?m)^('.*?' =>) '',", r"\1 ' ',", phppage)
+                    self._mediawiki_messages = _dict([(name.strip().lower(),
+                                                       html2unicode(message.replace("\\'", "'")))
+                                                      for (name, message) in Rphpvals.findall(phppage)])
+                else:
+                    xml = self.getUrl(self.get_address("Special:Allmessages") + "&ot=xml")
+ # xml structure is :
+ # <messages lang="fr">
+ # <message name="about">À propos</message>
+ # ...
+ # </messages>
+ if elementtree:
+ decode = xml.encode(self.encoding())
+
+ # Skip extraneous data such as PHP warning or extra
+ # whitespaces added from some MediaWiki extensions
+ xml_dcl_pos = decode.find('<?xml')
+ if xml_dcl_pos > 0:
+ decode = decode[xml_dcl_pos:]
+
+ tree = XML(decode)
+                        self._mediawiki_messages = _dict([(tag.get('name').lower(), tag.text)
+                               for tag in tree.getiterator('message')])
+                    else:
+                        tree = BeautifulStoneSoup(xml)
+                        self._mediawiki_messages = _dict([(tag.get('name').lower(), html2unicode(tag.string))
+                               for tag in tree.findAll('message') if tag.string])
+
+ if not self._mediawiki_messages:
+ # No messages could be added.
+ # We assume that the server is down.
+ # Wait some time, then try again.
+                    output(u'WARNING: No messages found in Special:Allmessages. Maybe the server is down. Retrying in %i minutes...' % retry_idle_time)
+ time.sleep(retry_idle_time * 60)
+ # Next time wait longer, but not longer than half an hour
+ retry_attempt += 1
+ if retry_attempt > config.maxretries:
+ raise ServerError()
+ retry_idle_time *= 2
+ if retry_idle_time > 30:
+ retry_idle_time = 30
+ continue
+ break
+
+        if self.family.name == 'wikitravel': # fix for Wikitravel's mediawiki message setting
+ self = self.family.mediawiki_message(self)
+
+ key = key.lower()
+ try:
+ return self._mediawiki_messages[key]
+ except KeyError:
+ if not forceReload:
+ return self.mediawiki_message(key, True)
+ else:
+                raise KeyError("MediaWiki key '%s' does not exist on %s" % (key, self))
+
+ def has_mediawiki_message(self, key):
+        """Return True if this site defines a MediaWiki message for 'key'."""
+ #return key in self._mediawiki_messages
+ try:
+ v = self.mediawiki_message(key)
+ return True
+ except KeyError:
+ return False
+
+ def has_api(self):
+        """Return True if this sites family has api interface."""
+ try:
+ if config.use_api:
+ x = self.apipath()
+ del x
+ return True
+ except NotImplementedError:
+ pass
+ return False
+
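A short sketch (editorial) of the guard pattern this class uses around has_api(), where API code paths are only taken when api.php is configured and the MediaWiki version is recent enough; `site` is an assumed Site object:

    if site.has_api() and site.versionnumber() >= 11:
        pass  # query api.php
    else:
        pass  # fall back to screen-scraping the special pages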
+ def _load(self, sysop = False, force = False):
+ """
+ Loads user data.
+ This is only done if we didn't do get any page yet and the information
+ is requested, otherwise we should already have this data.
+
+ Parameters:
+ * sysop - Get sysop user data?
+ """
+ index = self._userIndex(sysop)
+ if self._userData[index] and not force:
+ return
+ if verbose:
+ output(u'Getting information for site %s' % self)
+
+ # Get data
+ # API Userinfo is available from version 1.11
+ # preferencetoken available from 1.14
+ if self.has_api() and self.versionnumber() >= 11:
+ #Query userinfo
+ params = {
+ 'action': 'query',
+ 'meta': 'userinfo',
+                'uiprop': ['blockinfo','groups','rights','hasmsg'],
+ }
+ if self.versionnumber() >= 12:
+ params['uiprop'].append('ratelimits')
+ if self.versionnumber() >= 14:
+ params['uiprop'].append('preferencestoken')
+
+ data = query.GetData(params, self, sysop=sysop)
+
+ # Show the API error code instead making an index error
+ if 'error' in data:
+ raise RuntimeError('%s' % data['error'])
+
+ if self.versionnumber() == 11:
+ text = data['userinfo']
+ else:
+ text = data['query']['userinfo']
+
+ self._getUserData(text, sysop = sysop, force = force)
+ else:
+ url = self.edit_address('Non-existing_page')
+ text = self.getUrl(url, sysop = sysop)
+
+ self._getUserDataOld(text, sysop = sysop, force = force)
+
+ def search(self, key, number=10, namespaces=None):
+ """
+ Yield search results for query.
+ Use API when enabled use_api and version >= 1.11,
+ or use Special:Search.
+ """
+ if self.has_api() and self.versionnumber() >= 11:
+ #Yield search results (using api) for query.
+ params = {
+ 'action': 'query',
+ 'list': 'search',
+ 'srsearch': key,
+ }
+ if number:
+ params['srlimit'] = number
+ if namespaces:
+ params['srnamespace'] = namespaces
+
+ offset = 0
+ while offset < number or not number:
+ params['sroffset'] = offset
+ data = query.GetData(params, self)
+ if 'error'in data:
+                    raise NotImplementedError('%s' % data['error']['info'])
+ data = data['query']
+ if 'error' in data:
+ raise RuntimeError('%s' % data['error'])
+ if not data['search']:
+ break
+ for s in data['search']:
+ offset += 1
+ page = Page(self, s['title'])
+ if self.versionnumber() >= 16:
+                        yield page, s['snippet'], '', s['size'], s['wordcount'], s['timestamp']
+                    else:
+                        yield page, '', '', '', '', ''
+ else:
+ #Yield search results (using Special:Search page) for query.
+ throttle = True
+ path = self.search_address(urllib.quote_plus(key.encode('utf-8')),
+ n=number, ns=namespaces)
+ get_throttle()
+ html = self.getUrl(path)
+            entryR = re.compile(ur'<li><a href=".+?" title="(?P<title>.+?)">.+?</a>',
+ re.DOTALL)
+ for m in entryR.finditer(html):
+ page = Page(self, m.group('title'))
+ yield page, '', '', '', '', ''
+
+ # TODO: avoid code duplication for the following methods
+
+    def logpages(self, number = 50, mode = '', title = None, user = None, repeat = False,
+                 namespace = [], start = None, end = None, tag = None, newer = False, dump = False):
+
+ if not self.has_api() or self.versionnumber() < 11 or \
+           mode not in ('block', 'protect', 'rights', 'delete', 'upload',
+                        'move', 'import', 'patrol', 'merge', 'suppress',
+                        'review', 'stable', 'gblblock', 'renameuser',
+                        'globalauth', 'gblrights', 'abusefilter', 'newusers'):
+ raise NotImplementedError, mode
+ params = {
+ 'action' : 'query',
+ 'list' : 'logevents',
+ 'letype' : mode,
+ 'lelimit' : int(number),
+ 'ledir' : 'older',
+            'leprop'  : ['ids', 'title', 'type', 'user', 'timestamp', 'comment', 'details',],
+ }
+
+ if number > config.special_page_limit:
+ params['lelimit'] = config.special_page_limit
+ if number > 5000 and self.isAllowed('apihighlimits'):
+ params['lelimit'] = 5000
+ if newer:
+ params['ledir'] = 'newer'
+ if user:
+ params['leuser'] = user
+ if title:
+ params['letitle'] = title
+ if start:
+ params['lestart'] = start
+ if end:
+ params['leend'] = end
+ if tag and self.versionnumber() >= 16: # tag support from mw:r58399
+ params['letag'] = tag
+
+ nbresults = 0
+ while True:
+ result = query.GetData(params, self)
+ if 'error' in result or 'warnings' in result:
+ output('%s' % result)
+ raise Error
+ for c in result['query']['logevents']:
+ if (not namespace or c['ns'] in namespace) and \
+ not 'actionhidden' in c.keys():
+ if dump:
+ # dump result only.
+ yield c
+ else:
+ if c['ns'] == 6:
+ p_ret = ImagePage(self, c['title'])
+ else:
+                        p_ret = Page(self, c['title'], defaultNamespace=c['ns'])
+
+ yield (p_ret, c['user'],
+ parsetime2stamp(c['timestamp']),
+ c['comment'], )
+
+ nbresults += 1
+ if nbresults >= number:
+ break
+ if 'query-continue' in result and nbresults < number:
+                params['lestart'] = result['query-continue']['logevents']['lestart']
+ elif repeat:
+ nbresults = 0
+ try:
+ params.pop('lestart')
+ except KeyError:
+ pass
+ else:
+ break
+ return
+
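For illustration (not part of the file), consuming the upload log through logpages, assuming a Site object `site`; without dump=True each item is a (Page, user, timestamp, comment) tuple:

    for page, user, timestamp, comment in site.logpages(number=20, mode='upload'):
        print page.title(), user, timestamp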
+    def newpages(self, number = 10, get_redirect = False, repeat = False, namespace = 0,
+                 rcshow = ['!bot','!redirect'], user = None, returndict = False):
+ """Yield new articles (as Page objects) from Special:Newpages.
+
+ Starts with the newest article and fetches the number of articles
+ specified in the first argument. If repeat is True, it fetches
+ Newpages again. If there is no new page, it blocks until there is
+ one, sleeping between subsequent fetches of Newpages.
+
+ The objects yielded are dependent on parmater returndict.
+ When true, it yields a tuple composed of a Page object and a dict of attributes.
+ When false, it yields a tuple composed of the Page object,
+ timestamp (unicode), length (int), an empty unicode string, username
+ or IP address (str), comment (unicode).
+
+ """
+ # TODO: in recent MW versions Special:Newpages takes a namespace parameter,
+ # and defaults to 0 if not specified.
+ # TODO: Detection of unregistered users is broken
+ # TODO: Repeat mechanism doesn't make much sense as implemented;
+ # should use both offset and limit parameters, and have an
+ # option to fetch older rather than newer pages
+ seen = set()
+ while True:
+ if self.has_api() and self.versionnumber() >= 10:
+ params = {
+ 'action': 'query',
+ 'list': 'recentchanges',
+ 'rctype': 'new',
+ 'rcnamespace': namespace,
+ 'rclimit': int(number),
+                    'rcprop': ['ids','title','timestamp','sizes','user','comment'],
+ 'rcshow': rcshow,
+ }
+ if user: params['rcuser'] = user
+                data = query.GetData(params, self)['query']['recentchanges']
+
+ for np in data:
+ if np['pageid'] not in seen:
+ seen.add(np['pageid'])
+                        page = Page(self, np['title'], defaultNamespace=np['ns'])
+ if returndict:
+ yield page, np
+ else:
+                            yield page, np['timestamp'], np['newlen'], u'', np['user'], np['comment']
+ else:
+ path = self.newpages_address(n=number, namespace=namespace)
+ # The throttling is important here, so always enabled.
+ get_throttle()
+ html = self.getUrl(path)
+
+                entryR = re.compile('<li[^>]*>(?P<date>.+?) \S*?<a href=".+?"'
+                                    ' title="(?P<title>.+?)">.+?</a>.+?[\(\[](?P<length>[\d,.]+)[^\)\]]*[\)\]]'
+                                    ' .?<a href=".+?" title=".+?:(?P<username>.+?)">')
+ for m in entryR.finditer(html):
+ date = m.group('date')
+ title = m.group('title')
+                    title = title.replace('&quot;', '"')
+                    length = int(re.sub("[,.]", "", m.group('length')))
+ loggedIn = u''
+ username = m.group('username')
+ comment = u''
+
+ if title not in seen:
+ seen.add(title)
+ page = Page(self, title)
+ yield page, date, length, loggedIn, username, comment
+ if not repeat:
+ break
+
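A minimal consumption sketch (editorial, `site` assumed); with the default returndict=False each yielded tuple carries the Page plus timestamp, length, an empty string, the username and the edit comment:

    for page, timestamp, length, _, username, comment in site.newpages(number=20):
        print page.title(), username, length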
+ def longpages(self, number = 10, repeat = False):
+ """Yield Pages from Special:Longpages.
+
+ Return values are a tuple of Page object, length(int).
+
+ """
+ #TODO: should use offset and limit parameters; 'repeat' as now
+ # implemented is fairly useless
+ # this comment applies to all the XXXXpages methods following, as well
+ seen = set()
+ path = self.longpages_address(n=number)
+        entryR = re.compile(ur'<li>\(<a href=".+?" title=".+?">.+?</a>\) .<a href=".+?" title="(?P<title>.+?)">.+?</a> .\[(?P<length>[\d.,]+).*?\]</li>', re.UNICODE)
+
+ while True:
+ get_throttle()
+ html = self.getUrl(path)
+ for m in entryR.finditer(html):
+ title = m.group('title')
+                length = int(re.sub('[.,]', '', m.group('length')))
+ if title not in seen:
+ seen.add(title)
+ page = Page(self, title)
+ yield page, length
+ if not repeat:
+ break
+
+ def shortpages(self, number = 10, repeat = False):
+        """Yield Pages and lengths from Special:Shortpages."""
+        throttle = True
+        seen = set()
+        path = self.shortpages_address(n = number)
+        entryR = re.compile(ur'<li>\(<a href=".+?" title=".+?">.+?</a>\) .<a href=".+?" title="(?P<title>.+?)">.+?</a> .\[(?P<length>[\d.,]+).*?\]</li>', re.UNICODE)
+
+ while True:
+ get_throttle()
+ html = self.getUrl(path)
+
+ for m in entryR.finditer(html):
+ title = m.group('title')
+                length = int(re.sub('[., ]', '', m.group('length')))
+
+ if title not in seen:
+ seen.add(title)
+ page = Page(self, title)
+ yield page, length
+ if not repeat:
+ break
+
+ def categories(self, number=10, repeat=False):
+        """Yield Category objects from Special:Categories"""
+ import catlib
+ seen = set()
+ while True:
+ path = self.categories_address(n=number)
+ get_throttle()
+ html = self.getUrl(path)
+ entryR = re.compile(
+                '<li><a href=".+?" title="(?P<title>.+?)">.+?</a>.*?</li>')
+ for m in entryR.finditer(html):
+ title = m.group('title')
+ if title not in seen:
+ seen.add(title)
+ page = catlib.Category(self, title)
+ yield page
+ if not repeat:
+ break
+
+ def deadendpages(self, number = 10, repeat = False):
+        """Yield Page objects retrieved from Special:Deadendpages."""
+ seen = set()
+ while True:
+ path = self.deadendpages_address(n=number)
+ get_throttle()
+ html = self.getUrl(path)
+ entryR = re.compile(
+                '<li><a href=".+?" title="(?P<title>.+?)">.+?</a></li>')
+ for m in entryR.finditer(html):
+ title = m.group('title')
+
+ if title not in seen:
+ seen.add(title)
+ page = Page(self, title)
+ yield page
+ if not repeat:
+ break
+
+ def ancientpages(self, number = 10, repeat = False):
+        """Yield Pages, datestamps from Special:Ancientpages."""
+ seen = set()
+ while True:
+ path = self.ancientpages_address(n=number)
+ get_throttle()
+ html = self.getUrl(path)
+            entryR = re.compile('<li><a href=".+?" title="(?P<title>.+?)">.+?</a> (?P<date>.+?)</li>')
+ for m in entryR.finditer(html):
+ title = m.group('title')
+ date = m.group('date')
+ if title not in seen:
+ seen.add(title)
+ page = Page(self, title)
+ yield page, date
+ if not repeat:
+ break
+
+ def lonelypages(self, number = 10, repeat = False):
+        """Yield Pages retrieved from Special:Lonelypages."""
+ throttle = True
+ seen = set()
+ while True:
+ path = self.lonelypages_address(n=number)
+ get_throttle()
+ html = self.getUrl(path)
+ entryR = re.compile(
+                '<li><a href=".+?" title="(?P<title>.+?)">.+?</a></li>')
+ for m in entryR.finditer(html):
+ title = m.group('title')
+
+ if title not in seen:
+ seen.add(title)
+ page = Page(self, title)
+ yield page
+ if not repeat:
+ break
+
+ def unwatchedpages(self, number = 10, repeat = False):
+        """Yield Pages from Special:Unwatchedpages (requires Admin privileges)."""
+ seen = set()
+ while True:
+ path = self.unwatchedpages_address(n=number)
+ get_throttle()
+ html = self.getUrl(path, sysop = True)
+ entryR = re.compile(
+                '<li><a href=".+?" title="(?P<title>.+?)">.+?</a>.+?</li>')
+ for m in entryR.finditer(html):
+ title = m.group('title')
+ if title not in seen:
+ seen.add(title)
+ page = Page(self, title)
+ yield page
+ if not repeat:
+ break
+
+ def uncategorizedcategories(self, number = 10, repeat = False):
+        """Yield Categories from Special:Uncategorizedcategories."""
+ import catlib
+ seen = set()
+ while True:
+ path = self.uncategorizedcategories_address(n=number)
+ get_throttle()
+ html = self.getUrl(path)
+ entryR = re.compile(
+                '<li><a href=".+?" title="(?P<title>.+?)">.+?</a></li>')
+ for m in entryR.finditer(html):
+ title = m.group('title')
+ if title not in seen:
+ seen.add(title)
+ page = catlib.Category(self, title)
+ yield page
+ if not repeat:
+ break
+
+    def newimages(self, number = 100, lestart = None, leend = None, leuser = None, letitle = None, repeat = False):
+        """
+        Yield ImagePages from APIs, call: action=query&list=logevents&letype=upload&lelimit=500
+
+ Options directly from APIs:
+ ---
+ Parameters:
+ Default: ids|title|type|user|timestamp|comment|details
+ lestart - The timestamp to start enumerating from.
+ leend - The timestamp to end enumerating.
+ ledir - In which direction to enumerate.
+ One value: newer, older
+ Default: older
+ leuser - Filter entries to those made by the given user.
+ letitle - Filter entries to those related to a page.
+ lelimit - How many total event entries to return.
+ No more than 500 (5000 for bots) allowed.
+ Default: 10
+ """
+
+        for o, u, t, c in self.logpages(number = number, mode = 'upload', title = letitle, user = leuser,
+ repeat = repeat, start = lestart, end = leend):
+ yield o, t, u, c
+ return
+
+ def recentchanges(self, number=100, rcstart=None, rcend=None, rcshow=None,
+ rcdir='older', rctype='edit|new', namespace=None,
+ includeredirects=True, repeat=False, user=None,
+ returndict=False):
+ """
+ Yield recent changes as Page objects
+        uses API call: action=query&list=recentchanges&rctype=edit|new&rclimit=500
+
+ Starts with the newest change and fetches the number of changes
+ specified in the first argument. If repeat is True, it fetches
+ again.
+
+ Options directly from APIs:
+ ---
+ Parameters:
+ rcstart - The timestamp to start enumerating from.
+ rcend - The timestamp to end enumerating.
+ rcdir - In which direction to enumerate.
+ One value: newer, older
+ Default: older
+ rcnamespace - Filter log entries to only this namespace(s)
+ Values (separate with '|'):
+ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
+ rcprop - Include additional pieces of information
+ Values (separate with '|'):
+ user, comment, flags, timestamp, title, ids, sizes,
+ redirect, patrolled, loginfo
+ Default: title|timestamp|ids
+ rcshow - Show only items that meet this criteria.
+ For example, to see only minor edits done by
+ logged-in users, set show=minor|!anon
+ Values (separate with '|'):
+ minor, !minor, bot, !bot, anon, !anon,
+ redirect, !redirect, patrolled, !patrolled
+ rclimit - How many total changes to return.
+ No more than 500 (5000 for bots) allowed.
+ Default: 10
+ rctype - Which types of changes to show.
+ Values (separate with '|'): edit, new, log
+
+ The objects yielded are dependent on parmater returndict.
+ When true, it yields a tuple composed of a Page object and a dict of attributes.
+ When false, it yields a tuple composed of the Page object,
+ timestamp (unicode), length (int), an empty unicode string, username
+ or IP address (str), comment (unicode).
+
+ # TODO: Detection of unregistered users is broken
+ """
+ if rctype is None:
+ rctype = 'edit|new'
+ params = {
+ 'action' : 'query',
+ 'list' : 'recentchanges',
+ 'rcdir' : rcdir,
+ 'rctype' : rctype,
+            'rcprop'      : ['user', 'comment', 'timestamp', 'title', 'ids',
+                             'loginfo', 'sizes'], #', 'flags', 'redirect', 'patrolled'],
+ 'rcnamespace' : namespace,
+ 'rclimit' : int(number),
+ }
+ if user: params['rcuser'] = user
+ if rcstart: params['rcstart'] = rcstart
+ if rcend: params['rcend'] = rcend
+ if rcshow: params['rcshow'] = rcshow
+ if rctype: params['rctype'] = rctype
+
+ while True:
+ data = query.GetData(params, self, encodeTitle = False)
+ if 'error' in data:
+ raise RuntimeError('%s' % data['error'])
+ try:
+ rcData = data['query']['recentchanges']
+ except KeyError:
+                raise ServerError("The APIs don't return data, the site may be down")
+
+ for i in rcData:
+ page = Page(self, i['title'], defaultNamespace=i['ns'])
+ if returndict:
+ yield page, i
+ else:
+ comment = ''
+ if 'comment' in i:
+ comment = i['comment']
+                    yield page, i['timestamp'], i['newlen'], True, i['user'], comment
+ if not repeat:
+ break
+
+ def patrol(self, rcid, token = None):
+ if not self.has_api() or self.versionnumber() < 12:
+ raise Exception('patrol: no API: not implemented')
+
+ if not token:
+ token = self.getPatrolToken()
+
+ params = {
+ 'action': 'patrol',
+ 'rcid': rcid,
+ 'token': token,
+ }
+
+ result = query.GetData(params, self)
+ if 'error' in result:
+ raise RuntimeError("%s" % result['error'])
+
+ return True
+
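For illustration (not part of the file), how patrol can be combined with recentchanges(returndict=True), assuming a Site object `site` on a wiki recent enough for the API patrol module; 'rcid' is present in each change dict because 'ids' is requested in rcprop:

    for page, change in site.recentchanges(number=20, returndict=True):
        site.patrol(change['rcid'])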
+ def uncategorizedimages(self, number = 10, repeat = False):
+        """Yield ImagePages from Special:Uncategorizedimages."""
+ seen = set()
+ ns = self.image_namespace()
+ entryR = re.compile(
+            '<a href=".+?" title="(?P<title>%s:.+?)">.+?</a>' % ns)
+ while True:
+ path = self.uncategorizedimages_address(n=number)
+ get_throttle()
+ html = self.getUrl(path)
+ for m in entryR.finditer(html):
+ title = m.group('title')
+ if title not in seen:
+ seen.add(title)
+ page = ImagePage(self, title)
+ yield page
+ if not repeat:
+ break
+
+ def uncategorizedpages(self, number = 10, repeat = False):
+ """Yield Pages from Special:Uncategorizedpages."""
+ seen = set()
+ while True:
+ path = self.uncategorizedpages_address(n=number)
+ get_throttle()
+ html = self.getUrl(path)
+ entryR = re.compile(
+                '<li><a href=".+?" title="(?P<title>.+?)">.+?</a></li>')
+ for m in entryR.finditer(html):
+ title = m.group('title')
+
+ if title not in seen:
+ seen.add(title)
+ page = Page(self, title)
+ yield page
+ if not repeat:
+ break
+
+ def uncategorizedtemplates(self, number = 10, repeat = False):
+        """Yield Pages from Special:UncategorizedTemplates."""
+ seen = set()
+ while True:
+ path = self.uncategorizedtemplates_address(n=number)
+ get_throttle()
+ html = self.getUrl(path)
+ entryR = re.compile(
+                '<li><a href=".+?" title="(?P<title>.+?)">.+?</a></li>')
+ for m in entryR.finditer(html):
+ title = m.group('title')
+
+ if title not in seen:
+ seen.add(title)
+ page = Page(self, title)
+ yield page
+ if not repeat:
+ break
+
+ def unusedcategories(self, number = 10, repeat = False):
+        """Yield Category objects from Special:Unusedcategories."""
+ import catlib
+ seen = set()
+ while True:
+ path = self.unusedcategories_address(n=number)
+ get_throttle()
+ html = self.getUrl(path)
+            entryR = re.compile('<li><a href=".+?" title="(?P<title>.+?)">.+?</a></li>')
+ for m in entryR.finditer(html):
+ title = m.group('title')
+
+ if title not in seen:
+ seen.add(title)
+ page = catlib.Category(self, title)
+ yield page
+ if not repeat:
+ break
+
+ def wantedcategories(self, number=10, repeat=False):
+        """Yield Category objects from Special:wantedcategories."""
+ import catlib
+ seen = set()
+ while True:
+ path = self.wantedcategories_address(n=number)
+ get_throttle()
+ html = self.getUrl(path)
+ entryR = re.compile(
+                '<li><a href=".+?" class="new" title="(?P<title>.+?) \(page does not exist\)">.+?</a> .+?\)</li>')
+ for m in entryR.finditer(html):
+ title = m.group('title')
+
+ if title not in seen:
+ seen.add(title)
+ page = catlib.Category(self, title)
+ yield page
+ if not repeat:
+ break
+
+ def unusedfiles(self, number = 10, repeat = False, extension = None):
+        """Yield ImagePage objects from Special:Unusedimages."""
+ seen = set()
+ ns = self.image_namespace()
+ entryR = re.compile(
+            '<a href=".+?" title="(?P<title>%s:.+?)">.+?</a>' % ns)
+ while True:
+ path = self.unusedfiles_address(n=number)
+ get_throttle()
+ html = self.getUrl(path)
+ for m in entryR.finditer(html):
+ fileext = None
+ title = m.group('title')
+ if extension:
+ fileext = title[len(title)-3:]
+ if title not in seen and fileext == extension:
+ ## Check whether the media is used in a Proofread page
+ # code disabled because it slows this method down, and
+ # because it is unclear what it's supposed to do.
+ #basename = title[6:]
+ #page = Page(self, 'Page:' + basename)
+
+ #if not page.exists():
+ seen.add(title)
+ image = ImagePage(self, title)
+ yield image
+ if not repeat:
+ break
+
+ def withoutinterwiki(self, number=10, repeat=False):
+        """Yield Pages without language links from Special:Withoutinterwiki."""
+ seen = set()
+ while True:
+ path = self.withoutinterwiki_address(n=number)
+ get_throttle()
+ html = self.getUrl(path)
+            entryR = re.compile('<li><a href=".+?" title="(?P<title>.+?)">.+?</a></li>')
+ for m in entryR.finditer(html):
+ title = m.group('title')
+ if title not in seen:
+ seen.add(title)
+ page = Page(self, title)
+ yield page
+ if not repeat:
+ break
+
+ def randompage(self, redirect = False):
+ if self.has_api() and self.versionnumber() >= 12:
+ params = {
+ 'action': 'query',
+ 'list': 'random',
+ #'rnnamespace': '0',
+ 'rnlimit': '1',
+ #'': '',
+ }
+ if redirect:
+ params['rnredirect'] = 1
+
+ data = query.GetData(params, self)
+            return Page(self, data['query']['random'][0]['title'])
+ else:
+ if redirect:
+                """Yield random redirect page via Special:RandomRedirect."""
+ html = self.getUrl(self.randomredirect_address())
+ else:
+ """Yield random page via Special:Random"""
+ html = self.getUrl(self.random_address())
+            m = re.search('var wgPageName = "(?P<title>.+?)";', html)
+ if m is not None:
+ return Page(self, m.group('title'))
+
+ def randomredirectpage(self):
+ return self.randompage(redirect = True)
+
+ def allpages(self, start='!', namespace=None, includeredirects=True,
+ throttle=True):
+ """
+ Yield all Pages in alphabetical order.
+
+ Parameters:
+ start Start at this page. By default, it starts at '!', and yields
+ all pages.
+ namespace Yield all pages in this namespace; defaults to 0.
+ MediaWiki software will only return pages in one namespace
+ at a time.
+
+ If includeredirects is False, redirects will not be found.
+
+ It is advised not to use this directly, but to use the
+ AllpagesPageGenerator from pagegenerators.py instead.
+
+ """
+ if namespace is None:
+ page = Page(self, start)
+ namespace = page.namespace()
+ start = page.title(withNamespace=False)
+
+ if not self.has_api():
+ for page in self._allpagesOld(start, namespace, includeredirects, throttle):
+ yield page
+ return
+
+ params = {
+ 'action' : 'query',
+ 'list' : 'allpages',
+ 'aplimit' : config.special_page_limit,
+ 'apnamespace': namespace,
+ 'apfrom' : start
+ }
+
+ if not includeredirects:
+ params['apfilterredir'] = 'nonredirects'
+ elif includeredirects == 'only':
+ params['apfilterredir'] = 'redirects'
+
+ while True:
+ if throttle:
+ get_throttle()
+ data = query.GetData(params, self)
+ if verbose:
+ print 'DEBUG allpages>>> data.keys()', data.keys()
+ if 'warnings' in data:
+ warning = data['warnings']['allpages']['*']
+ raise RuntimeError("API query warning: %s" % warning)
+ if 'error' in data:
+ raise RuntimeError("API query error: %s" % data)
+ if not 'allpages' in data['query']:
+                raise RuntimeError("API query error, no pages found: %s" % data)
+ count = 0
+ for p in data['query']['allpages']:
+ count += 1
+ yield Page(self, p['title'])
+ if count >= config.special_page_limit:
+ break
+            if 'query-continue' in data and count < params['aplimit']:
+                # get the continue key for backward compatibility with pre 1.20wmf8
+                contKey = data['query-continue']['allpages'].keys()[0]
+                params[contKey] = data['query-continue']['allpages'][contKey]
+ else:
+ break
+
+ def _allpagesOld(self, start='!', namespace=0, includeredirects=True,
+ throttle=True):
+ """
+ Yield all Pages from Special:Allpages.
+
+ This method doesn't work with MediaWiki 1.14 because of a change to
+ Special:Allpages. It is only left here for compatibility with older
+ MediaWiki versions, which don't support the API.
+
+ Parameters:
+ start Start at this page. By default, it starts at '!', and yields
+ all pages.
+ namespace Yield all pages in this namespace; defaults to 0.
+ MediaWiki software will only return pages in one namespace
+ at a time.
+
+ If includeredirects is False, redirects will not be found.
+ If includeredirects equals the string 'only', only redirects
+ will be found. Note that this has not been tested on older
+ versions of the MediaWiki code.
+
+ It is advised not to use this directly, but to use the
+ AllpagesPageGenerator from pagegenerators.py instead.
+
+ """
+ monobook_error = True
+ if start == '':
+ start='!'
+
+ while True:
+ # encode Non-ASCII characters in hexadecimal format (e.g. %F6)
+ start = start.encode(self.encoding())
+ start = urllib.quote(start)
+ # load a list which contains a series of article names (always 480)
+ path = self.allpages_address(start, namespace)
+            output(u'Retrieving Allpages special page for %s from %s, namespace %i' % (repr(self), start, namespace))
+ returned_html = self.getUrl(path)
+ # Try to find begin and end markers
+ try:
+ # In 1.4, another table was added above the navigational links
+ if self.versionnumber() >= 4:
+ begin_s = '</table><hr /><table'
+ end_s = '</table'
+ else:
+ begin_s = '<table'
+ end_s = '</table'
+ ibegin = returned_html.index(begin_s)
+ iend = returned_html.index(end_s,ibegin + 3)
+ except ValueError:
+ if monobook_error:
+                    raise ServerError("Couldn't extract allpages special page. Make sure you're using MonoBook skin.")
+ else:
+ # No list of wikilinks
+ break
+ monobook_error = False
+ # remove the irrelevant sections
+ returned_html = returned_html[ibegin:iend]
+ if self.versionnumber()==2:
+                R = re.compile('/wiki/(.*?)\" *class=[\'\"]printable')
+            elif self.versionnumber()<5:
+                # Apparently the special code for redirects was added in 1.5
+                R = re.compile('title ?=\"(.*?)\"')
+            elif not includeredirects:
+                R = re.compile('\<td(?: width="33%")?\>\<a href=\"\S*\" +title ?="(.*?)"')
+            elif includeredirects == 'only':
+                R = re.compile('\<td(?: width="33%")?>\<[^\<\>]*allpagesredirect\"\>\<a href=\"\S*\" +title ?="(.*?)"')
+ else:
+ R = re.compile('title ?=\"(.*?)\"')
+ # Count the number of useful links on this page
+ n = 0
+ for hit in R.findall(returned_html):
+ # count how many articles we found on the current page
+ n = n + 1
+ if self.versionnumber()==2:
+ yield Page(self, url2link(hit, site = self, insite = self))
+ else:
+ yield Page(self, hit)
+ # save the last hit, so that we know where to continue when we
+                # finished all articles on the current page. Append a '!' so that
+ # we don't yield a page twice.
+ start = Page(self, hit).title(withNamespace=False) + '!'
+ # A small shortcut: if there are less than 100 pages listed on this
+ # page, there is certainly no next. Probably 480 would do as well,
+ # but better be safe than sorry.
+ if n < 100:
+ if (not includeredirects) or includeredirects == 'only':
+                    # Maybe there were only so few because the rest is or is not a redirect
+ R = re.compile('title ?=\"(.*?)\"')
+ allLinks = R.findall(returned_html)
+ if len(allLinks) < 100:
+ break
+ elif n == 0:
+ # In this special case, no pages of the requested type
+                        # were found, and "start" will remain and be double-encoded.
+ # Use the last page as the start of the next page.
+ start = Page(self,
+ allLinks[-1]).title(
+ withNamespace=False) + '!'
+ else:
+ break
+ #else:
+            #    # Don't send a new request if "Next page (pagename)" isn't present
+            #    Rnonext = re.compile(r'title="(Special|%s):.+?">%s</a></td></tr></table>' % (
+            #        self.mediawiki_message('nstab-special'),
+            #        re.escape(self.mediawiki_message('nextpage')).replace('\$1', '.*?')))
+ # if not Rnonext.search(full_returned_html):
+ # break
+
+ def prefixindex(self, prefix, namespace=0, includeredirects=True):
+ """Yield all pages with a given prefix.
+
+ Parameters:
+ prefix The prefix of the pages.
+ namespace Namespace number; defaults to 0.
+ MediaWiki software will only return pages in one namespace
+ at a time.
+
+ If includeredirects is False, redirects will not be found.
+ If includeredirects equals the string 'only', only redirects
+ will be found. Note that this has not been tested on older
+ versions of the MediaWiki code.
+
+ It is advised not to use this directly, but to use the
+ PrefixingPageGenerator from pagegenerators.py instead.
+ """
+        for page in self.allpages(start = prefix, namespace = namespace, includeredirects = includeredirects):
+ if page.title(withNamespace=False).startswith(prefix):
+ yield page
+ else:
+ break
+
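A usage sketch (editorial, `site` assumed): prefixindex simply filters the allpages stream and stops at the first title that no longer starts with the prefix:

    for page in site.prefixindex(u'Foo', namespace=0):
        print page.title()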
+ def protectedpages(self, namespace = None, type = 'edit', lvl = 0):
+ """ Yield all the protected pages, using Special:ProtectedPages
+ * namespace is a namespace number
+ * type can be 'edit' or 'move
+            * lvl : protection level, can be 0, 'autoconfirmed', or 'sysop'
+ """
+ # Avoid problems of encoding and stuff like that, let it divided please
+ url = self.protectedpages_address()
+ url += '&type=%s&level=%s' % (type, lvl)
+        if namespace is not None: # /!\ if namespace seems simpler, but returns false when ns=0
+
+ url += '&namespace=%s' % namespace
+ parser_text = self.getUrl(url)
+ while 1:
+            #<li><a href="/wiki/Pagina_principale" title="Pagina principale">Pagina principale</a> <small>(6.522 byte)</small> (protetta)</li>
+            m = re.findall(r'<li><a href=".*?" title=".*?">(.*?)</a>.*?<small>\((.*?)\)</small>.*?\((.*?)\)</li>', parser_text)
+            for data in m:
+                title = data[0]
+                size = data[1]
+                status = data[2]
+                yield Page(self, title)
+            nextpage = re.findall(r'<.ul>\(.*?\).*?\(.*?\).*?\(<a href="(.*?)".*?</a>\) +?\(<a href=', parser_text)
+            if nextpage != []:
+                parser_text = self.getUrl(nextpage[0].replace('&amp;', '&'))
+ continue
+ else:
+ break
+
+ def linksearch(self, siteurl, limit=500):
+        """Yield Pages from results of Special:Linksearch for 'siteurl'."""
+        cache = []
+        R = re.compile('title ?=\"([^<>]*?)\">[^<>]*</a></li>')
+ urlsToRetrieve = [siteurl]
+ if not siteurl.startswith('*.'):
+ urlsToRetrieve.append('*.' + siteurl)
+
+ if self.has_api() and self.versionnumber() >= 11:
+ output(u'Querying API exturlusage...')
+ for url in urlsToRetrieve:
+ params = {
+ 'action': 'query',
+ 'list' : 'exturlusage',
+ 'eulimit': limit,
+ 'euquery': url,
+ }
+ count = 0
+ while True:
+ data = query.GetData(params, self)
+ if data['query']['exturlusage'] == []:
+ break
+ for pages in data['query']['exturlusage']:
+ count += 1
+ if not siteurl in pages['title']:
+ # the links themselves have similar form
+ if pages['pageid'] not in cache:
+ cache.append(pages['pageid'])
+                            yield Page(self, pages['title'], defaultNamespace=pages['ns'])
+ if count >= limit:
+ break
+
+ if 'query-continue' in data and count < limit:
+                    params['euoffset'] = data[u'query-continue'][u'exturlusage'][u'euoffset']
+ else:
+ break
+ else:
+ output(u'Querying [[Special:Linksearch]]...')
+ for url in urlsToRetrieve:
+ offset = 0
+ while True:
+ path = self.linksearch_address(url, limit=limit, offset=offset)
+ get_throttle()
+ html = self.getUrl(path)
+ #restricting the HTML source :
+ #when in the source, this div marks the beginning of the input
+                    loc = html.find('<div class="mw-spcontent">')
+ if loc > -1:
+ html = html[loc:]
+ #when in the source, marks the end of the linklist
+ loc = html.find('<div class="printfooter">')
+ if loc > -1:
+ html = html[:loc]
+
+ #our regex fetches internal page links and the link they contain
+ links = R.findall(html)
+ if not links:
+ #no more page to be fetched for that link
+ break
+ for title in links:
+ if not siteurl in title:
+ # the links themselves have similar form
+ if title in cache:
+ continue
+ else:
+ cache.append(title)
+ yield Page(self, title)
+ offset += limit
+
+ def linkto(self, title, othersite = None):
+        """Return unicode string in the form of a wikilink to 'title'
+
+ Use optional Site argument 'othersite' to generate an interwiki link
+ from the other site to the current site.
+
+ """
+ if othersite and othersite.lang != self.lang:
+ return u'[[%s:%s]]' % (self.lang, title)
+ else:
+ return u'[[%s]]' % title
+
+ def isInterwikiLink(self, s):
+ """Return True if s is in the form of an interwiki link.
+
+        Interwiki links have the form "foo:bar" or ":foo:bar" where foo is a
+ known language code or family. Called recursively if the first part
+ of the link refers to this site's own family and/or language.
+
+ """
+        s = s.replace("_", " ").strip(" ").lstrip(":")
+ if not ':' in s:
+ return False
+ first, rest = s.split(':',1)
+ # interwiki codes are case-insensitive
+ first = first.lower().strip(" ")
+ # commons: forwards interlanguage links to wikipedia:, etc.
+ if self.family.interwiki_forward:
+ interlangTargetFamily = Family(self.family.interwiki_forward)
+ else:
+ interlangTargetFamily = self.family
+ if self.getNamespaceIndex(first):
+ return False
+ if first in interlangTargetFamily.langs:
+ if first == self.lang:
+ return self.isInterwikiLink(rest)
+ else:
+ return True
+ if first in self.family.get_known_families(site = self):
+ if first == self.family.name:
+ return self.isInterwikiLink(rest)
+ else:
+ return True
+ return False
+
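For illustration (not part of the file), how link targets are classified, assuming `site` is a Site object for a Wikipedia language edition:

    site.isInterwikiLink(u'de:Berlin')      # True  - 'de' is a known language code
    site.isInterwikiLink(u'Category:Maps')  # False - a namespace prefix, not an interwiki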
+ def getmagicwords(self, word):
+        """Return list of localized "word" magic words for the site."""
+ if self.versionnumber() <= 13:
+ raise NotImplementedError
+ return self.siteinfo('magicwords').get(word)
+
+ def redirectRegex(self):
+        """Return a compiled regular expression matching on redirect pages.
+
+ Group 1 in the regex match object will be the target title.
+
+ """
+ #NOTE: this is needed, since the API can give false positives!
+ default = 'REDIRECT'
+        keywords = self.versionnumber() > 13 and self.getmagicwords('redirect')
+ if keywords:
+ pattern = r'(?:' + '|'.join(keywords) + ')'
+ else:
+ # no localized keyword for redirects
+ pattern = r'#%s' % default
+ if self.versionnumber() > 12:
+ # in MW 1.13 (at least) a redirect directive can follow whitespace
+ prefix = r'\s*'
+ else:
+ prefix = r'[\r\n]*'
+ # A redirect starts with hash (#), followed by a keyword, then
+ # arbitrary stuff, then a wikilink. The wikilink may contain
+ # a label, although this is not useful.
+ return re.compile(prefix + pattern
+ + '\s*:?\s*\[\[(.+?)(?:\|.*?)?\]\]',
+ re.IGNORECASE | re.UNICODE | re.DOTALL)
+
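A minimal sketch (editorial, `site` assumed) of pulling the redirect target out of wikitext with the regex built above:

    m = site.redirectRegex().match(u'#REDIRECT [[Main Page]]')
    if m:
        target = m.group(1)  # u'Main Page'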
+ def pagenamecodes(self, default=True):
+        """Return list of localized PAGENAME tags for the site."""
+        return self.versionnumber() > 13 and self.getmagicwords('pagename') \
+               or u'PAGENAME'
+
+    def pagename2codes(self, default=True):
+        """Return list of localized PAGENAMEE tags for the site."""
+        return self.versionnumber() > 13 and self.getmagicwords('pagenamee') \
+ or u'PAGENAMEE'
+
+ def resolvemagicwords(self, wikitext):
+        """Replace the {{ns:xx}} marks in a wikitext with the namespace names"""
+
+ defaults = []
+ for namespace in self.family.namespaces.itervalues():
+ value = namespace.get('_default', None)
+ if value:
+ if isinstance(value, list):
+ defaults.append(value[0])
+ else:
+ defaults.append(value)
+
+        named = re.compile(u'{{ns:(' + '|'.join(defaults) + ')}}', re.I)
+
+ def replacenamed(match):
+ return self.normalizeNamespace(match.group(1))
+
+ wikitext = named.sub(replacenamed, wikitext)
+
+ numbered = re.compile('{{ns:(-?\d{1,2})}}', re.I)
+
+ def replacenumbered(match):
+ return self.namespace(int(match.group(1)))
+
+ return numbered.sub(replacenumbered, wikitext)
+
+ # The following methods are for convenience, so that you can access
+ # methods of the Family class easier.
+ def encoding(self):
+ """Return the current encoding for this site."""
+ return self.family.code2encoding(self.lang)
+
+ def encodings(self):
+        """Return a list of all historical encodings for this site."""
+ return self.family.code2encodings(self.lang)
+
+ def category_namespace(self):
+        """Return the canonical name of the Category namespace on this site."""
+ # equivalent to self.namespace(14)?
+ return self.family.category_namespace(self.lang)
+
+ def category_namespaces(self):
+        """Return a list of all valid names for the Category namespace."""
+ return self.family.category_namespaces(self.lang)
+
+ def category_redirects(self):
+ return self.family.category_redirects(self.lang)
+
+ def image_namespace(self, fallback = '_default'):
+        """Return the canonical name of the Image namespace on this site."""
+ # equivalent to self.namespace(6)?
+ return self.family.image_namespace(self.lang, fallback)
+
+ def template_namespace(self, fallback = '_default'):
+        """Return the canonical name of the Template namespace on this site."""
+ # equivalent to self.namespace(10)?
+ return self.family.template_namespace(self.lang, fallback)
+
+ def export_address(self):
+ """Return URL path for Special:Export."""
+ return self.family.export_address(self.lang)
+
+ def query_address(self):
+        """Return URL path + '?' for query.php (if enabled on this Site)."""
+ return self.family.query_address(self.lang)
+
+ def api_address(self):
+        """Return URL path + '?' for api.php (if enabled on this Site)."""
+ return self.family.api_address(self.lang)
+
+ def apipath(self):
+        """Return URL path for api.php (if enabled on this Site)."""
+ return self.family.apipath(self.lang)
+
+ def scriptpath(self):
+        """Return URL prefix for scripts on this site ({{SCRIPTPATH}} value)"""
+ return self.family.scriptpath(self.lang)
+
+ def protocol(self):
+        """Return protocol ('http' or 'https') for access to this site."""
+ return self.family.protocol(self.lang)
+
+ def hostname(self):
+ """Return host portion of site URL."""
+ return self.family.hostname(self.lang)
+
+ def path(self):
+ """Return URL path for index.php on this Site."""
+ return self.family.path(self.lang)
+
+ def dbName(self):
+ """Return MySQL database name."""
+ return self.family.dbName(self.lang)
+
+ def move_address(self):
+ """Return URL path for Special:Movepage."""
+ return self.family.move_address(self.lang)
+
+ def delete_address(self, s):
+ """Return URL path to delete title 's'."""
+ return self.family.delete_address(self.lang, s)
+
+ def undelete_view_address(self, s, ts=''):
+ """Return URL path to view Special:Undelete for title 's'
+
+ Optional argument 'ts' returns path to view specific deleted version.
+
+ """
+ return self.family.undelete_view_address(self.lang, s, ts)
+
+ def undelete_address(self):
+ """Return URL path to Special:Undelete."""
+ return self.family.undelete_address(self.lang)
+
+ def protect_address(self, s):
+        """Return URL path to protect title 's'."""
+ return self.family.protect_address(self.lang, s)
+
+ def unprotect_address(self, s):
+        """Return URL path to unprotect title 's'."""
+ return self.family.unprotect_address(self.lang, s)
+
+ def put_address(self, s):
+        """Return URL path to submit revision to page titled 's'."""
+ return self.family.put_address(self.lang, s)
+
+ def get_address(self, s):
+        """Return URL path to retrieve page titled 's'."""
+ title = s.replace(' ', '_')
+ return self.family.get_address(self.lang, title)
+
+ def nice_get_address(self, s):
+        """Return shorter URL path to retrieve page titled 's'."""
+ return self.family.nice_get_address(self.lang, s)
+
+ def edit_address(self, s):
+        """Return URL path for edit form for page titled 's'."""
+ return self.family.edit_address(self.lang, s)
+
+ def watch_address(self, s):
+        """Return URL path for watching the page titled 's'."""
+ return self.family.watch_address(self.lang, s)
+
+ def unwatch_address(self, s):
+        """Return URL path for unwatching the page titled 's'."""
+ return self.family.unwatch_address(self.lang, s)
+
+ def purge_address(self, s):
+        """Return URL path to purge cache and retrieve page 's'."""
+ return self.family.purge_address(self.lang, s)
+
+ def block_address(self):
+ """Return path to block an IP address."""
+ return self.family.block_address(self.lang)
+
+ def unblock_address(self):
+ """Return path to unblock an IP address."""
+ return self.family.unblock_address(self.lang)
+
+ def blocksearch_address(self, s):
+        """Return path to search for blocks on IP address 's'."""
+ return self.family.blocksearch_address(self.lang, s)
+
+ def linksearch_address(self, s, limit=500, offset=0):
+        """Return path to Special:Linksearch for target 's'."""
+ return self.family.linksearch_address(self.lang, s, limit=limit, offset=offset)
+
+ def search_address(self, q, n=50, ns=0):
+        """Return path to Special:Search for query 'q'."""
+ return self.family.search_address(self.lang, q, n, ns)
+
+ def allpages_address(self, s, ns = 0):
+ """Return path to Special:Allpages."""
+ return self.family.allpages_address(self.lang, start=s, namespace = ns)
+
+ def log_address(self, n=50, mode = '', user = ''):
+ """Return path to Special:Log."""
+ return self.family.log_address(self.lang, n, mode, user)
+
+ def newpages_address(self, n=50, namespace=0):
+ """Return path to Special:Newpages."""
+ return self.family.newpages_address(self.lang, n, namespace)
+
+ def longpages_address(self, n=500):
+ """Return path to Special:Longpages."""
+ return self.family.longpages_address(self.lang, n)
+
+ def shortpages_address(self, n=500):
+ """Return path to Special:Shortpages."""
+ return self.family.shortpages_address(self.lang, n)
+
+ def unusedfiles_address(self, n=500):
+ """Return path to Special:Unusedimages."""
+ return self.family.unusedfiles_address(self.lang, n)
+
+ def categories_address(self, n=500):
+ """Return path to Special:Categories."""
+ return self.family.categories_address(self.lang, n)
+
+ def deadendpages_address(self, n=500):
+ """Return path to Special:Deadendpages."""
+ return self.family.deadendpages_address(self.lang, n)
+
+ def ancientpages_address(self, n=500):
+ """Return path to Special:Ancientpages."""
+ return self.family.ancientpages_address(self.lang, n)
+
+ def lonelypages_address(self, n=500):
+ """Return path to Special:Lonelypages."""
+ return self.family.lonelypages_address(self.lang, n)
+
+ def protectedpages_address(self, n=500):
+ """Return path to Special:ProtectedPages"""
+ return self.family.protectedpages_address(self.lang, n)
+
+ def unwatchedpages_address(self, n=500):
+ """Return path to Special:Unwatchedpages."""
+ return self.family.unwatchedpages_address(self.lang, n)
+
+ def uncategorizedcategories_address(self, n=500):
+        """Return path to Special:Uncategorizedcategories."""
+ return self.family.uncategorizedcategories_address(self.lang, n)
+
+ def uncategorizedimages_address(self, n=500):
+ """Return path to Special:Uncategorizedimages."""
+ return self.family.uncategorizedimages_address(self.lang, n)
+
+ def uncategorizedpages_address(self, n=500):
+ """Return path to Special:Uncategorizedpages."""
+ return self.family.uncategorizedpages_address(self.lang, n)
+
+ def uncategorizedtemplates_address(self, n=500):
+        """Return path to Special:Uncategorizedtemplates."""
+ return self.family.uncategorizedtemplates_address(self.lang, n)
+
+ def unusedcategories_address(self, n=500):
+ """Return path to Special:Unusedcategories."""
+ return self.family.unusedcategories_address(self.lang, n)
+
+ def wantedcategories_address(self, n=500):
+ """Return path to Special:Wantedcategories."""
+ return self.family.wantedcategories_address(self.lang, n)
+
+ def withoutinterwiki_address(self, n=500):
+ """Return path to Special:Withoutinterwiki."""
+ return self.family.withoutinterwiki_address(self.lang, n)
+
+ def references_address(self, s):
+        """Return path to Special:Whatlinkshere for page 's'."""
+ return self.family.references_address(self.lang, s)
+
+ def allmessages_address(self):
+ """Return path to Special:Allmessages."""
+ return self.family.allmessages_address(self.lang)
+
+ def upload_address(self):
+ """Return path to Special:Upload."""
+ return self.family.upload_address(self.lang)
+
+ def double_redirects_address(self, default_limit = True):
+ """Return path to Special:Doubleredirects."""
+ return self.family.double_redirects_address(self.lang, default_limit)
+
+ def broken_redirects_address(self, default_limit = True):
+ """Return path to Special:Brokenredirects."""
+ return self.family.broken_redirects_address(self.lang, default_limit)
+
+ def random_address(self):
+ """Return path to Special:Random."""
+ return self.family.random_address(self.lang)
+
+ def randomredirect_address(self):
+ """Return path to Special:RandomRedirect."""
+ return self.family.randomredirect_address(self.lang)
+
+ def login_address(self):
+ """Return path to Special:Userlogin."""
+ return self.family.login_address(self.lang)
+
+ def captcha_image_address(self, id):
+        """Return path to Special:Captcha for image 'id'."""
+ return self.family.captcha_image_address(self.lang, id)
+
+ def watchlist_address(self):
+ """Return path to Special:Watchlist editor."""
+ return self.family.watchlist_address(self.lang)
+
+ def contribs_address(self, target, limit=500, offset=''):
+        """Return path to Special:Contributions for user 'target'."""
+ return self.family.contribs_address(self.lang,target,limit,offset)
+
+    def globalusers_address(self, target='', limit=500, offset='', group=''):
+        """Return path to Special:GlobalUsers for user 'target' and/or group 'group'."""
+ return self.family.globalusers_address(self.lang, target, limit, offset, group)
+
+ def version(self):
+ """Return MediaWiki version number as a string."""
+ return self.family.version(self.lang)
+
+ def versionnumber(self):
+ """Return an int identifying MediaWiki version.
+
+ Currently this is implemented as returning the minor version
+ number; i.e., 'X' in version '1.X.Y'
+
+ """
+ return self.family.versionnumber(self.lang)
+
+ def live_version(self):
+        """Return the 'real' version number found on [[Special:Version]]
+
+ Return value is a tuple (int, int, str) of the major and minor
+ version numbers and any other text contained in the version.
+
+ """
+ global htmldata
+ if not hasattr(self, "_mw_version"):
+ PATTERN = r"^(?:: )?([0-9]+)\.([0-9]+)(.*)$"
+ versionpage = self.getUrl(self.get_address("Special:Version"))
+ htmldata = BeautifulSoup(versionpage, convertEntities="html")
+ # try to find the live version
+ versionlist = []
+ # 1st try is for mw < 1.17wmf1
+ versionlist.append(lambda: htmldata.findAll(
+                text="MediaWiki")[1].parent.nextSibling )
+ # 2nd try is for mw >=1.17wmf1
+ versionlist.append(lambda: htmldata.body.table.findAll(
+ 'td')[1].contents[0] )
+ # 3rd uses family file which is not live
+ versionlist.append(lambda: self.family.version(self.lang) )
+ for versionfunc in versionlist:
+ try:
+ versionstring = versionfunc()
+ except:
+ continue
+ m = re.match(PATTERN, str(versionstring).strip())
+ if m:
+ break
+ else:
+ raise Error(u'Cannot find any live version!')
+ self._mw_version = (int(m.group(1)), int(m.group(2)), m.group(3))
+ return self._mw_version
+
+ def checkCharset(self, charset):
+        """Warn if charset returned by wiki doesn't match family file."""
+ fromFamily = self.encoding()
+ assert fromFamily.lower() == charset.lower(), \
+ "charset for %s changed from %s to %s" \
+ % (repr(self), fromFamily, charset)
+ if fromFamily.lower() != charset.lower():
+ raise ValueError(
+"code2encodings has wrong charset for %s. It should be %s, but is %s"
+ % (repr(self), charset, self.encoding()))
+
+ def shared_image_repository(self):
+        """Return a tuple of image repositories used by this site."""
+ return self.family.shared_image_repository(self.lang)
+
+ def category_on_one_line(self):
+        """Return True if this site wants all category links on one line."""
+ return self.lang in self.family.category_on_one_line
+
+ def interwiki_putfirst(self):
+        """Return list of language codes for ordering of interwiki links."""
+ return self.family.interwiki_putfirst.get(self.lang, None)
+
+ def interwiki_putfirst_doubled(self, list_of_links):
+ # TODO: is this even needed? No family in the framework has this
+ # dictionary defined!
+ if self.lang in self.family.interwiki_putfirst_doubled:
+            if len(list_of_links) >= self.family.interwiki_putfirst_doubled[self.lang][0]:
+ list_of_links2 = []
+ for lang in list_of_links:
+ list_of_links2.append(lang.language())
+ list = []
+ for lang in self.family.interwiki_putfirst_doubled[self.lang][1]:
+ try:
+ list.append(list_of_links[list_of_links2.index(lang)])
+ except ValueError:
+ pass
+ return list
+ else:
+ return False
+ else:
+ return False
+
+ def getSite(self, code):
+        """Return Site object for language 'code' in this Family."""
+ return getSite(code = code, fam = self.family, user=self.user)
+
+ def namespace(self, num, all = False):
+        """Return string containing local name of namespace 'num'.
+
+ If optional argument 'all' is true, return a tuple of all recognized
+ values for this namespace.
+
+ """
+ return self.family.namespace(self.lang, num, all = all)
+
+ def normalizeNamespace(self, value):
+        """Return canonical name for namespace 'value' in this Site's language.
+
+ 'Value' should be a string or unicode.
+ If no match, return 'value' unmodified.
+
+ """
+ if not self.nocapitalize:
+ # make sure first letter gets normalized; there is at least
+ # one case ("İ") in which s.lower().upper() != s
+ value = value[0].lower().upper() + value[1:]
+ return self.family.normalizeNamespace(self.lang, value)
+
+ def getNamespaceIndex(self, namespace):
+        """Given a namespace name, return its int index, or None if invalid."""
+ return self.family.getNamespaceIndex(self.lang, namespace)
+
+ def language(self):
+ """Return Site's language code."""
+ return self.lang
+
+ def fam(self):
+ """Return Family object for this Site."""
+ return self.family
+
+ def disambcategory(self):
+        """Return Category in which disambig pages are listed."""
+ import catlib
+ try:
+ return catlib.Category(self,
+ self.namespace(14)+':'+self.family.disambcatname[self.lang])
+ except KeyError:
+ raise NoPage
+
+ def getToken(self, getalways = True, getagain = False, sysop = False):
+ index = self._userIndex(sysop)
+ if getagain or (getalways and self._token[index] is None):
+ output(u'Getting a token.')
+ self._load(sysop = sysop, force = True)
+ if self._token[index] is not None:
+ return self._token[index]
+ else:
+ return False
+
+ def getPatrolToken(self, sysop = False):
+ index = self._userIndex(sysop)
+
+ if self._patrolToken[index] is None:
+ output(u'Getting a patrol token.')
+ params = {
+ 'action' : 'query',
+ 'list' : 'recentchanges',
+ 'rcshow' : '!patrolled',
+ 'rctoken' : 'patrol',
+ 'rclimit' : 1,
+ }
+ data = query.GetData(params, self, encodeTitle = False)
+ if 'error' in data:
+ raise RuntimeError('%s' % data['error'])
+ try:
+ rcData = data['query']['recentchanges']
+ except KeyError:
+                raise ServerError("The APIs don't return data, the site may be down")
+
+ self._patrolToken[index] = rcData[0]['patroltoken']
+
+ return self._patrolToken[index]
+
+ def getFilesFromAnHash(self, hash_found = None):
+        """ Function that uses APIs to give the images that have the same hash. Useful
+ to find duplicates or nowcommons.
+
+        NOTE: it also returns the image itself; if you don't want it, just
+ filter the list returned.
+
+ NOTE 2: it returns the image WITHOUT the image namespace.
+ """
+ if self.versionnumber() < 12:
+ return None
+
+        if hash_found is None: # if the hash is None, return None instead of continuing
+ return None
+ # Now get all the images with the same hash
+ #action=query&format=xml&list=allimages&aisha1=%s
+ image_namespace = "%s:" % self.image_namespace() # Image:
+ params = {
+ 'action' :'query',
+ 'list' :'allimages',
+ 'aisha1' :hash_found,
+ }
+        allimages = query.GetData(params, self, encodeTitle = False)['query']['allimages']
+ files = list()
+ for imagedata in allimages:
+ image = imagedata[u'name']
+ files.append(image)
+ return files
+
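A hedged usage sketch for getFilesFromAnHash(); `site` stands for an already constructed Site object and the SHA-1 value is made up:

    # find files sharing the same SHA-1, e.g. to detect duplicates or NowCommons candidates
    duplicates = site.getFilesFromAnHash(u'5a1b2c3d...')  # shortened, fictitious hash
    if duplicates and len(duplicates) > 1:
        output(u'Possible duplicates: %s' % u', '.join(duplicates))
    # remember: the titles come back without the image namespace prefix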
+ def getParsedString(self, string, keeptags = [u'*']):
+ """Parses the string with API and returns html content.
+
+ @param string: String that should be parsed.
+ @type string: string
+ @param keeptags: Defines which tags (wiki, HTML) should NOT be removed.
+ @type keeptags: list
+
+ Returns the string given, parsed through the wiki parser.
+ """
+
+ if not self.has_api():
+ raise Exception('parse: no API: not implemented')
+
+ # call the wiki to get info
+ params = {
+ u'action' : u'parse',
+ u'text' : string,
+ }
+
+ pywikibot.get_throttle()
+ pywikibot.output(u"Parsing string through the wiki parser via API.")
+
+ result = query.GetData(params, self)
+ r = result[u'parse'][u'text'][u'*']
+
+ # disable/remove comments
+ r = pywikibot.removeDisabledParts(r, tags = ['comments']).strip()
+
+ # disable/remove ALL tags
+ if not (keeptags == [u'*']):
+ r = removeHTMLParts(r, keeptags = keeptags).strip()
+
+ return r
+
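A minimal usage sketch for getParsedString(), assuming an API-enabled Site object named `site`; the exact HTML returned varies by MediaWiki version:

    html = site.getParsedString(u"'''bold''' and [[Main Page|a link]]")
    # roughly something like u'<p><b>bold</b> and <a ...>a link</a></p>'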
+ def getExpandedString(self, string):
+ """Expands the string with API and returns wiki content.
+
+ @param string: String that should be expanded.
+ @type string: string
+
+ Returns the string given, expanded through the wiki parser.
+ """
+
+ if not self.has_api():
+ raise Exception('expandtemplates: no API: not implemented')
+
+ # call the wiki to get info
+ params = {
+ u'action' : u'expandtemplates',
+ u'text' : string,
+ }
+
+ pywikibot.get_throttle()
+ pywikibot.output(u"Expanding string through the wiki parser via API.")
+
+ result = query.GetData(params, self)
+ r = result[u'expandtemplates'][u'*']
+
+ return r
+
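A similar hedged sketch for getExpandedString(), which returns wikitext rather than HTML:

    text = site.getExpandedString(u'{{CURRENTYEAR}}')
    # e.g. u'2012' at the time of this revision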
+# Caches to provide faster access
+_sites = {}
+_namespaceCache = {}
+
+@deprecate_arg("persistent_http", None)
+def getSite(code=None, fam=None, user=None, noLogin=False):
+ if code is None:
+ code = default_code
+ if fam is None:
+ fam = default_family
+ key = '%s:%s:%s' % (fam, code, user)
+ if key not in _sites:
+ _sites[key] = Site(code=code, fam=fam, user=user)
+ ret = _sites[key]
+ if not ret.family.isPublic(code) and not noLogin:
+ ret.forceLogin()
+ return ret
+
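A short sketch of how the getSite() cache behaves; the language and family values are only examples:

    s1 = getSite(code='en', fam='wikipedia')
    s2 = getSite(code='en', fam='wikipedia')
    assert s1 is s2   # the same object is served from the _sites cache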
+def setSite(site):
+ global default_code, default_family
+ default_code = site.language()
+ default_family = site.family
+
+# Command line parsing and help
+
+def calledModuleName():
+ """Return the name of the module calling this function.
+
+ This is required because the -help option loads the module's docstring
+ and because the module name will be used for the filename of the log.
+
+ """
+ # get commandline arguments
+ called = sys.argv[0].strip()
+ if ".py" in called: # could end with .pyc, .pyw, etc. on some platforms
+ # clip off the '.py?' filename extension
+ called = called[:called.rindex('.py')]
+ return os.path.basename(called)
+
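For example, a hypothetical invocation:

    # python interwiki.py -auto
    # calledModuleName() -> 'interwiki', later used for the default 'interwiki.log' logfile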
+def _decodeArg(arg):
+ # We may pass a Unicode string to a script upon importing and calling
+ # main() from another script.
+ if isinstance(arg,unicode):
+ return arg
+ if sys.platform == 'win32':
+ if config.console_encoding in ('cp437', 'cp850'):
+ # Western Windows versions give parameters encoded as windows-1252
+ # even though the console encoding is cp850 or cp437.
+ return unicode(arg, 'windows-1252')
+ elif config.console_encoding == 'cp852':
+ # Central/Eastern European Windows versions give parameters encoded
+ # as windows-1250 even though the console encoding is cp852.
+ return unicode(arg, 'windows-1250')
+ else:
+ return unicode(arg, config.console_encoding)
+ else:
+ # Linux uses the same encoding for both.
+ # I don't know how non-Western Windows versions behave.
+ return unicode(arg, config.console_encoding)
+
+def handleArgs(*args):
+ """Handle standard command line arguments, return the rest as a list.
+
+ Takes the commandline arguments, converts them to Unicode, processes all
+ global parameters such as -lang or -log. Returns a list of all arguments
+ that are not global. This makes sure that global arguments are applied
+ first, regardless of the order in which the arguments were given.
+
+ args may be passed as an argument, thereby overriding sys.argv
+
+ """
+ global default_code, default_family, verbose, debug, simulate
+ # get commandline arguments if necessary
+ if not args:
+ args = sys.argv[1:]
+ # get the name of the module calling this function. This is
+ # required because the -help option loads the module's docstring and because
+ # the module name will be used for the filename of the log.
+ moduleName = calledModuleName()
+ nonGlobalArgs = []
+ username = None
+ do_help = False
+ for arg in args:
+ arg = _decodeArg(arg)
+ if arg == '-help':
+ do_help = True
+ elif arg.startswith('-dir:'):
+            pass # config_dir = arg[5:] // currently handled in wikipediatools.py - possibly before this routine is called.
+ elif arg.startswith('-family:'):
+ default_family = arg[8:]
+ elif arg.startswith('-lang:'):
+ default_code = arg[6:]
+ elif arg.startswith("-user:"):
+ username = arg[len("-user:") : ]
+ elif arg.startswith('-putthrottle:'):
+ config.put_throttle = int(arg[len("-putthrottle:") : ])
+ put_throttle.setDelay()
+ elif arg.startswith('-pt:'):
+ config.put_throttle = int(arg[len("-pt:") : ])
+ put_throttle.setDelay()
+ elif arg.startswith("-maxlag:"):
+ config.maxlag = int(arg[len("-maxlag:") : ])
+ elif arg == '-log':
+ setLogfileStatus(True)
+ elif arg.startswith('-log:'):
+ setLogfileStatus(True, arg[5:])
+ elif arg == '-nolog':
+ setLogfileStatus(False)
+ elif arg in ['-verbose', '-v']:
+ verbose += 1
+ elif arg == '-daemonize':
+ import daemonize
+ daemonize.daemonize()
+ elif arg.startswith('-daemonize:'):
+ import daemonize
+ daemonize.daemonize(redirect_std = arg[11:])
+ elif arg in ['-cosmeticchanges', '-cc']:
+ config.cosmetic_changes = not config.cosmetic_changes
+            output(u'NOTE: option cosmetic_changes is %s\n' % config.cosmetic_changes)
+ elif arg == '-simulate':
+ simulate = True
+ elif arg == '-dry':
+ output(u"Usage of -dry is deprecated; use -simulate instead.")
+ simulate = True
+ # global debug option for development purposes. Normally does nothing.
+ elif arg == '-debug':
+ debug = True
+ config.special_page_limit = 500
+ else:
+ # the argument is not global. Let the specific bot script care
+ # about it.
+ nonGlobalArgs.append(arg)
+
+ if username:
+ config.usernames[default_family][default_code] = username
+
+ # TEST for bug #3081100
+ if unicode_error:
+ output("""
+
+================================================================================
+\03{lightyellow}WARNING:\03{lightred} your python version might trigger issue #3081100\03{default}
+More information: See https://sourceforge.net/support/tracker.php?aid=3081100
+\03{lightyellow}Please update python to 2.7.2+ if you are running on wikimedia sites!\03{default}
+================================================================================
+
+""")
+ if verbose:
+ output(u'Pywikipediabot %s' % (version.getversion()))
+ output(u'Python %s' % sys.version)
+
+ if do_help:
+ showHelp()
+ sys.exit(0)
+ return nonGlobalArgs
+
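A hedged sketch of the calling pattern a bot script would typically use; the bot-specific option shown is illustrative:

    # global options (-lang:, -family:, -log, ...) are consumed by handleArgs();
    # whatever is left over is returned for the bot to interpret itself
    local_args = handleArgs('-lang:de', '-family:wikipedia', '-start:!')
    # local_args == [u'-start:!'], and default_code/default_family now point at de.wikipedia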
+def showHelp(moduleName=None):
+ # the parameter moduleName is deprecated and should be left out.
+ moduleName = moduleName or calledModuleName()
+ try:
+ moduleName = moduleName[moduleName.rindex("\\")+1:]
+ except ValueError: # There was no \ in the module name, so presumably no problem
+ pass
+
+ globalHelp = u'''
+Global arguments available for all bots:
+
+-dir:PATH Read the bot's configuration data from directory given by
+ PATH, instead of from the default directory.
+
+-lang:xx Set the language of the wiki you want to work on, overriding
+ the configuration in user-config.py. xx should be the
+ language code.
+
+-family:xyz Set the family of the wiki you want to work on, e.g.
+ wikipedia, wiktionary, wikitravel, ...
+ This will override the configuration in user-config.py.
+
+-user:xyz Log in as user 'xyz' instead of the default username.
+
+-daemonize:xyz Immediately return control to the terminal and redirect
+ stdout and stderr to xyz (only use for bots that require
+ no input from stdin).
+
+-help Show this help text.
+
+-log Enable the logfile, using the default filename
+ "%s.log"
+ Logs will be stored in the logs subdirectory.
+
+-log:xyz Enable the logfile, using 'xyz' as the filename.
+
+-maxlag Sets a new maxlag parameter to a number of seconds. Defer bot
+ edits during periods of database server lag. Default is set by
+ config.py
+
+-nolog Disable the logfile (if it is enabled by default).
+
+-putthrottle:n Set the minimum time (in seconds) the bot will wait between
+-pt:n saving pages.
+
+-verbose Have the bot provide additional output that may be
+-v useful in debugging.
+
+-cosmeticchanges Toggles the cosmetic_changes setting made in config.py or
+-cc user_config.py to its inverse and overrules it. All other
+ settings and restrictions are untouched.
+
+-simulate Disables writing to the server. Useful for testing and
+(-dry) debugging of new code (if given, doesn't do any real
+ changes, but only shows what would have been changed).
+ DEPRECATED: please use -simulate instead of -dry
+''' % moduleName
+ output(globalHelp, toStdout=True)
+ try:
+ exec('import %s as module' % moduleName)
+ helpText = module.__doc__.decode('utf-8')
+ if hasattr(module, 'docuReplacements'):
+ for key, value in module.docuReplacements.iteritems():
+ helpText = helpText.replace(key, value.strip('\n\r'))
+ output(helpText, toStdout=True)
+ except:
+ output(u'Sorry, no help available for %s' % moduleName)
+
+#########################
+# Interpret configuration
+#########################
+
+# search for user interface module in the 'userinterfaces' subdirectory
+sys.path.append(config.datafilepath('userinterfaces'))
+exec "import %s_interface as uiModule" % config.userinterface
+ui = uiModule.UI()
+verbose = 0
+debug = False
+simulate = False
+
+# TEST for bug #3081100
+unicode_error = __import__('unicodedata').normalize(
+ 'NFC',
+    u'\u092e\u093e\u0930\u094d\u0915 \u091c\u093c\u0941\u0915\u0947\u0930\u092c\u0930\u094d\u0917'
+    ) != u'\u092e\u093e\u0930\u094d\u0915 \u091c\u093c\u0941\u0915\u0947\u0930\u092c\u0930\u094d\u0917'
+if unicode_error:
+ print u'unicode test: triggers problem #3081100'
+
+default_family = config.family
+default_code = config.mylang
+logfile = None
+# Check
+
+# if the default family+wiki is a non-public one,
+# getSite will try login in. We don't want that, the module
+# is not yet loaded.
+getSite(noLogin=True)
+
+# Set socket timeout
+socket.setdefaulttimeout(config.socket_timeout)
+
+def writeToCommandLogFile():
+ """
+ Save the name of the called module along with all parameters to
+ logs/commands.log so that the user can look it up later to track errors
+ or report bugs.
+ """
+ modname = os.path.basename(sys.argv[0])
+ # put quotation marks around all parameters
+    args = [_decodeArg(modname)] + [_decodeArg('"%s"' % s) for s in sys.argv[1:]]
+ commandLogFilename = config.datafilepath('logs', 'commands.log')
+ try:
+ commandLogFile = codecs.open(commandLogFilename, 'a', 'utf-8')
+ except IOError:
+ commandLogFile = codecs.open(commandLogFilename, 'w', 'utf-8')
+ # add a timestamp in ISO 8601 formulation
+ isoDate = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
+ commandLogFile.write("%s r%s Python %s "
+ % (isoDate, version.getversiondict()['rev'],
+ sys.version.split()[0]))
+ s = u' '.join(args)
+ commandLogFile.write(s + os.linesep)
+ commandLogFile.close()
+
+def setLogfileStatus(enabled, logname = None):
+ global logfile
+ if enabled:
+ if not logname:
+ logname = '%s.log' % calledModuleName()
+ logfn = config.datafilepath('logs', logname)
+ try:
+ logfile = codecs.open(logfn, 'a', 'utf-8')
+ except IOError:
+ logfile = codecs.open(logfn, 'w', 'utf-8')
+ else:
+ # disable the log file
+ logfile = None
+
+if '*' in config.log or calledModuleName() in config.log:
+ setLogfileStatus(True)
+
+writeToCommandLogFile()
+
+colorTagR = re.compile('\03{.*?}', re.UNICODE)
+
+def log(text):
+ """Write the given text to the logfile."""
+ if logfile:
+ # remove all color markup
+ plaintext = colorTagR.sub('', text)
+ # save the text in a logfile (will be written in utf-8)
+ logfile.write(plaintext)
+ logfile.flush()
+
+output_lock = threading.Lock()
+input_lock = threading.Lock()
+output_cache = []
+
+def output(text, decoder=None, newline=True, toStdout=False, **kwargs):
+ """Output a message to the user via the userinterface.
+
+ Works like print, but uses the encoding used by the user's console
+ (console_encoding in the configuration file) instead of ASCII.
+ If decoder is None, text should be a unicode string. Otherwise it
+ should be encoded in the given encoding.
+
+ If newline is True, a linebreak will be added after printing the text.
+
+ If toStdout is True, the text will be sent to standard output,
+ so that it can be piped to another process. All other text will
+    be sent to stderr. See: http://en.wikipedia.org/wiki/Pipeline_%28Unix%29
+
+ text can contain special sequences to create colored output. These
+ consist of the escape character \03 and the color name in curly braces,
+ e. g. \03{lightpurple}. \03{default} resets the color.
+
+ """
+ output_lock.acquire()
+ try:
+ if decoder:
+ text = unicode(text, decoder)
+ elif type(text) is not unicode:
+ if verbose and sys.platform != 'win32':
+                print "DBG> BUG: Non-unicode (%s) passed to wikipedia.output without decoder!" % type(text)
+                print traceback.print_stack()
+                print "DBG> Attempting to recover, but please report this problem"
+ try:
+ text = unicode(text, 'utf-8')
+ except UnicodeDecodeError:
+ text = unicode(text, 'iso8859-1')
+ if newline:
+ text += u'\n'
+ log(text)
+ if input_lock.locked():
+ cache_output(text, toStdout = toStdout)
+ else:
+ ui.output(text, toStdout = toStdout)
+ finally:
+ output_lock.release()
+
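Two small usage sketches of output(), illustrating the color markup and the toStdout switch described in the docstring:

    output(u'\03{lightpurple}Hello\03{default} world')   # colored on terminals that support it
    output(u'machine-readable data', toStdout=True)      # sent to stdout so it can be piped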
+def cache_output(*args, **kwargs):
+ output_cache.append((args, kwargs))
+
+def flush_output_cache():
+ while(output_cache):
+ (args, kwargs) = output_cache.pop(0)
+ ui.output(*args, **kwargs)
+
+# User input functions
+
+def input(question, password = False):
+ """Ask the user a question, return the user's answer.
+
+ Parameters:
+ * question - a unicode string that will be shown to the user. Don't add a
+ space after the question mark/colon, this method will do this
+ for you.
+ * password - if True, hides the user's input (for password entry).
+
+ Returns a unicode string.
+
+ """
+ input_lock.acquire()
+ try:
+ data = ui.input(question, password)
+ finally:
+ flush_output_cache()
+ input_lock.release()
+
+ return data
+
+def inputChoice(question, answers, hotkeys, default = None):
+    """Ask the user a question with several options, return the user's choice.
+
+ The user's input will be case-insensitive, so the hotkeys should be
+ distinctive case-insensitively.
+
+ Parameters:
+ * question - a unicode string that will be shown to the user. Don't add a
+ space after the question mark, this method will do this
+ for you.
+ * answers - a list of strings that represent the options.
+ * hotkeys - a list of one-letter strings, one for each answer.
+ * default - an element of hotkeys, or None. The default choice that will
+ be returned when the user just presses Enter.
+
+ Returns a one-letter string in lowercase.
+
+ """
+ input_lock.acquire()
+ try:
+ data = ui.inputChoice(question, answers, hotkeys, default).lower()
+ finally:
+ flush_output_cache()
+ input_lock.release()
+
+ return data
+
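A hedged example of a typical prompt; the question and hotkeys are illustrative only:

    choice = inputChoice(u'Do you want to accept these changes?',
                         ['Yes', 'No', 'Quit'], ['y', 'n', 'q'], 'n')
    # returns 'y', 'n' or 'q' in lowercase; pressing Enter yields the default 'n'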
+
+page_put_queue = Queue.Queue(config.max_queue_size)
+def async_put():
+    """Daemon; take pages from the queue and try to save them on the wiki."""
+ while True:
+ (page, newtext, comment, watchArticle,
+ minorEdit, force, callback) = page_put_queue.get()
+ if page is None:
+ # an explicit end-of-Queue marker is needed for compatibility
+ # with Python 2.4; in 2.5, we could use the Queue's task_done()
+ # and join() methods
+ return
+ try:
+ page.put(newtext, comment, watchArticle, minorEdit, force)
+ error = None
+ except Exception, error:
+ pass
+ if callback is not None:
+ callback(page, error)
+ # if callback is provided, it is responsible for exception handling
+ continue
+ if isinstance(error, SpamfilterError):
+ output(u"Saving page %s prevented by spam filter: %s"
+ % (page, error.url))
+ elif isinstance(error, PageNotSaved):
+ output(u"Saving page %s failed: %s" % (page, error))
+ elif isinstance(error, LockedPage):
+ output(u"Page %s is locked; not saved." % page)
+ elif isinstance(error, NoUsername):
+ output(u"Page %s not saved; sysop privileges required." % page)
+ elif error is not None:
+ tb = traceback.format_exception(*sys.exc_info())
+            output(u"Saving page %s failed:\n%s" % (page, "".join(tb)))
+
+_putthread = threading.Thread(target=async_put)
+# identification for debugging purposes
+_putthread.setName('Put-Thread')
+_putthread.setDaemon(True)
+## Don't start the queue if it is not necessary.
+#_putthread.start()
+
+def stopme():
+ """This should be run when a bot does not interact with the Wiki, or
+ when it has stopped doing so. After a bot has run stopme() it will
+ not slow down other bots any more.
+ """
+ get_throttle.drop()
+
+def _flush():
+ """Wait for the page-putter to flush its queue.
+
+ Called automatically upon exiting from Python.
+
+ """
+ def remaining():
+ import datetime
+ remainingPages = page_put_queue.qsize() - 1
+ # -1 because we added a None element to stop the queue
+ remainingSeconds = datetime.timedelta(
+ seconds=(remainingPages * put_throttle.getDelay(True)))
+ return (remainingPages, remainingSeconds)
+
+ page_put_queue.put((None, None, None, None, None, None, None))
+
+ if page_put_queue.qsize() > 1:
+ output(u'Waiting for %i pages to be put. Estimated time remaining: %s'
+ % remaining())
+
+ while(_putthread.isAlive()):
+ try:
+ _putthread.join(1)
+ except KeyboardInterrupt:
+ answer = inputChoice(u"""\
+There are %i pages remaining in the queue. Estimated time remaining: %s
+Really exit?"""
+ % remaining(),
+                                     ['yes', 'no'], ['y', 'N'], 'N')
+ if answer == 'y':
+ return
+ try:
+ get_throttle.drop()
+ except NameError:
+ pass
+ if config.use_diskcache and not config.use_api:
+ for site in _sites.itervalues():
+ if site._mediawiki_messages:
+ try:
+ site._mediawiki_messages.delete()
+ except OSError:
+ pass
+
+import atexit
+atexit.register(_flush)
+
+def debugDump(name, site, error, data):
+ import time
+ name = unicode(name)
+ error = unicode(error)
+ site = unicode(repr(site).replace(u':',u'_'))
+ filename = '%s_%s__%s.dump' % (name, site, time.asctime())
+    filename = filename.replace(' ','_').replace(':','-')
+ f = file(filename, 'wb') #trying to write it in binary
+ #f = codecs.open(filename, 'w', 'utf-8')
+ f.write(u'Error reported: %s\n\n' % error)
+ try:
+ f.write(data.encode("utf8"))
+ except UnicodeDecodeError:
+ f.write(data)
+ f.close()
+    output( u'ERROR: %s caused error %s. Dump %s created.' % (name,error,filename) )
+
+get_throttle = Throttle()
+put_throttle = Throttle(write=True)
+
+def decompress_gzip(data):
+ # Use cStringIO if available
+ # TODO: rewrite gzip.py such that it supports unseekable fileobjects.
+ if data:
+ try:
+ from cStringIO import StringIO
+ except ImportError:
+ from StringIO import StringIO
+ import gzip
+ try:
+ data = gzip.GzipFile(fileobj = StringIO(data)).read()
+ except IOError:
+ raise
+ return data
+
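A hedged sketch of how decompress_gzip() is meant to be used; `raw_body` is a placeholder for the body of an HTTP response served with Content-Encoding: gzip:

    page_html = decompress_gzip(raw_body)   # gunzipped bytes
    # empty or None input is returned unchanged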
+def parsetime2stamp(tz):
+ s = time.strptime(tz, "%Y-%m-%dT%H:%M:%SZ")
+ return int(time.strftime("%Y%m%d%H%M%S", s))
+
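For example:

    parsetime2stamp('2012-09-16T13:48:36Z')   # -> 20120916134836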
+
+#Redirect Handler for urllib2
+class U2RedirectHandler(urllib2.HTTPRedirectHandler):
+
+ def redirect_request(self, req, fp, code, msg, headers, newurl):
+        newreq = urllib2.HTTPRedirectHandler.redirect_request(self, req, fp, code, msg, headers, newurl)
+ if (newreq.get_method() == "GET"):
+            for cl in "Content-Length", "Content-length", "content-length", "CONTENT-LENGTH":
+ if newreq.has_header(cl):
+ del newreq.headers[cl]
+ return newreq
+
+ def http_error_301(self, req, fp, code, msg, headers):
+ result = urllib2.HTTPRedirectHandler.http_error_301(
+ self, req, fp, code, msg, headers)
+ result.code = code
+        result.sheaders = [v for v in headers.__str__().split('\n') if v.startswith('Set-Cookie:')]
+ return result
+
+ def http_error_302(self, req, fp, code, msg, headers):
+ result = urllib2.HTTPRedirectHandler.http_error_302(
+ self, req, fp, code, msg, headers)
+ result.code = code
+        result.sheaders = [v for v in headers.__str__().split('\n') if v.startswith('Set-Cookie:')]
+ return result
+
+# Site Cookies handler
+COOKIEFILE = config.datafilepath('login-data', 'cookies.lwp')
+cj = cookielib.LWPCookieJar()
+if os.path.isfile(COOKIEFILE):
+ cj.load(COOKIEFILE)
+
+cookieProcessor = urllib2.HTTPCookieProcessor(cj)
+
+
+MyURLopener = urllib2.build_opener(U2RedirectHandler)
+
+if config.proxy['host']:
+    proxyHandler = urllib2.ProxyHandler({'http':'http://%s/' % config.proxy['host'] })
+
+ MyURLopener.add_handler(proxyHandler)
+ if config.proxy['auth']:
+ proxyAuth = urllib2.HTTPPasswordMgrWithDefaultRealm()
+        proxyAuth.add_password(None, config.proxy['host'], config.proxy['auth'][0], config.proxy['auth'][1])
+ proxyAuthHandler = urllib2.ProxyBasicAuthHandler(proxyAuth)
+
+ MyURLopener.add_handler(proxyAuthHandler)
+
+if config.authenticate:
+ passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
+ for site in config.authenticate:
+        passman.add_password(None, site, config.authenticate[site][0], config.authenticate[site][1])
+ authhandler = urllib2.HTTPBasicAuthHandler(passman)
+
+ MyURLopener.add_handler(authhandler)
+
+MyURLopener.addheaders = [('User-agent', useragent)]
+
+# This is a temporary part for the 2012 version survey
+# http://thread.gmane.org/gmane.comp.python.pywikipediabot.general/12473
+# Upon removing the connected lines from config.py should be removed, too.
+if not config.suppresssurvey:
+ output(
+"""
+\03{lightyellow}Dear Pywikipedia user!\03{default}
+Pywikibot has detected that you use this outdated version of Python:
+%s.
+We would like to hear your voice before ceasing support of this version.
+Please update to \03{lightyellow}Python 2.7.2\03{default} or higher if possible or visit
+http://www.mediawiki.org/wiki/Pywikipediabot/Survey2012 to tell us why we
+should support your version and to learn how to hide this message.
+After collecting opinions for a time we will decide and announce the deadline
+of deprecating use of old Python versions for Pywikipedia.
+""" % sys.version)
+
+if __name__ == '__main__':
+ import doctest
+ print 'Pywikipediabot %s' % version.getversion()
+ print 'Python %s' % sys.version
+ doctest.testmod()
+