http://www.mediawiki.org/wiki/Special:Code/pywikipedia/10531
Revision: 10531
Author: xqt
Date: 2012-09-16 17:25:31 +0000 (Sun, 16 Sep 2012)
Log Message:
-----------
License formatting from rewrite
Modified Paths:
--------------
trunk/pywikipedia/LICENSE
Modified: trunk/pywikipedia/LICENSE
===================================================================
--- trunk/pywikipedia/LICENSE 2012-09-16 17:16:23 UTC (rev 10530)
+++ trunk/pywikipedia/LICENSE 2012-09-16 17:25:31 UTC (rev 10531)
@@ -1,22 +1,25 @@
-Copyright (c) 2005-2012 The PyWikipediaBot team
+Copyright (c) 2004-2012 Pywikipedia bot team
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"), to deal
-in the Software without restriction, including without limitation the rights to
-use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
-of the Software, and to permit persons to whom the Software is furnished to do
-so, subject to the following conditions:
+Permission is hereby granted, free of charge, to any person
+obtaining a copy of this software and associated documentation
+files (the "Software"), to deal in the Software without
+restriction, including without limitation the rights to use,
+copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the
+Software is furnished to do so, subject to the following
+conditions:
-The above copyright notice and this permission notice shall be included in all
-copies or substantial portions of the Software.
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-SOFTWARE.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
+OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
+HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+OTHER DEALINGS IN THE SOFTWARE.
The English-language wordlist is extended from the SCOWL-70 list from
http://wordlist.sourceforge.net/. This source has as its licensing information:
http://www.mediawiki.org/wiki/Special:Code/pywikipedia/10530
Revision: 10530
Author: xqt
Date: 2012-09-16 17:16:23 +0000 (Sun, 16 Sep 2012)
Log Message:
-----------
rename licence information to LICENSE
Added Paths:
-----------
branches/rewrite/LICENSE
Removed Paths:
-------------
branches/rewrite/COPYING
Deleted: branches/rewrite/COPYING
===================================================================
--- branches/rewrite/COPYING 2012-09-16 17:05:20 UTC (rev 10529)
+++ branches/rewrite/COPYING 2012-09-16 17:16:23 UTC (rev 10530)
@@ -1,23 +0,0 @@
-Copyright (c) 2004-2012 Pywikipedia bot team
-
-Permission is hereby granted, free of charge, to any person
-obtaining a copy of this software and associated documentation
-files (the "Software"), to deal in the Software without
-restriction, including without limitation the rights to use,
-copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the
-Software is furnished to do so, subject to the following
-conditions:
-
-The above copyright notice and this permission notice shall be
-included in all copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
-EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
-OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
-NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
-HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
-WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
-FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
-OTHER DEALINGS IN THE SOFTWARE.
-
Copied: branches/rewrite/LICENSE (from rev 10525, branches/rewrite/COPYING)
===================================================================
--- branches/rewrite/LICENSE (rev 0)
+++ branches/rewrite/LICENSE 2012-09-16 17:16:23 UTC (rev 10530)
@@ -0,0 +1,23 @@
+Copyright (c) 2004-2012 Pywikipedia bot team
+
+Permission is hereby granted, free of charge, to any person
+obtaining a copy of this software and associated documentation
+files (the "Software"), to deal in the Software without
+restriction, including without limitation the rights to use,
+copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the
+Software is furnished to do so, subject to the following
+conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
+OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
+HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+OTHER DEALINGS IN THE SOFTWARE.
+
http://www.mediawiki.org/wiki/Special:Code/pywikipedia/10529
Revision: 10529
Author: xqt
Date: 2012-09-16 17:05:20 +0000 (Sun, 16 Sep 2012)
Log Message:
-----------
content of CONTENTS revised
Modified Paths:
--------------
trunk/pywikipedia/CONTENTS
Modified: trunk/pywikipedia/CONTENTS
===================================================================
--- trunk/pywikipedia/CONTENTS 2012-09-16 13:48:36 UTC (rev 10528)
+++ trunk/pywikipedia/CONTENTS 2012-09-16 17:05:20 UTC (rev 10529)
@@ -9,40 +9,69 @@
To get started on proper usage of the bot framework, please refer to:
- http://meta.wikimedia.org/wiki/Using_the_python_wikipediabot
+ http://www.mediawiki.org/wiki/Manual:Pywikipediabot
The contents of the package are:
-=== Library routines ===
+=== readme and config files ===
+CONTENTS : THIS file
LICENSE : a reference to the MIT license
-wikipedia.py : The wikipedia library
-wiktionary.py : The wiktionary library
+README : Short info string used by PyWikipediaBot Nightlies
+setup.cfg : Setup file for automated tests of the package
+version : package version collected by PyWikipediaBot Nightlies,
+ otherwise omitted
config.py : Configuration module containing all defaults. Do not
change these! See below how to change values.
-titletranslate.py : rules and tricks to auto-translate wikipage titles
+fixes.py : Stores predefined replacements used by replace.py.
+generate_family_file.py: Creates a new family file.
+generate_user_files.py : Creates user-config.py or user-fixes.py.
+
+=== Library routines ===
+
+apispec.py : Library to handle special pages through API
+BeautifulSoup.py : is a Python HTML/XML parser designed for quick
+ turnaround projects like screen-scraping. See
+ more: http://www.crummy.com/software/BeautifulSoup
+botlist.py : Allows access to the site's bot user list.
+catlib.py : Library routines written especially to handle
+ category pages and recurse over category contents.
+daemonize.py : The process will fork to the background and return
+ control to the terminal
date.py : Date formats in various languages
+diskcache.py : Disk caching module
family.py : Abstract superclass for wiki families. Subclassed by
the classes in the 'families' subdirectory.
-catlib.py : Library routines written especially to handle
- category pages and recurse over category contents.
gui.py : Some GUI elements for solve_disambiguation.py
-mediawiki_messages.py : Access to the various translations of the MediaWiki
- software interface.
+logindata.py : Use of pywikipedia as a library.
+mysql_autoconnection.py: A small MySQL wrapper that catches dead MySQL
+ connections, and tries to reconnect them.
pagegenerators.py : Page generators.
-userlib.py : Library to work with users, their pages and talk pages.
-apispec.py : Library to handle special pages through API
-BeautifulSoup.py : is a Python HTML/XML parser designed for quick
- turnaround projects like screen-scraping. See
- more:
- http://www.crummy.com/software/BeautifulSoup
+pageimport.py : Import pages from a certain wiki to another.
+query.py : API query library
+rciw.py : An IRC script to check for Recent Changes through IRC,
+ and to check for interwikis in those recently modified
+ articles.
+simple_family.py : Family file used in conjunction with non-pywikipedia
+ config files
+titletranslate.py : rules and tricks to auto-translate wikipage titles
+userlib.py : Library to work with users, their pages and talk pages
+wikicomserver.py : This library allows the use of the pywikipediabot
+ directly from COM-aware applications.
+wikipedia.py : The wikipedia library
+wikipediatools.py : Returns package base directory
+wiktionary.py : The wiktionary library
+xmlreader.py : Reading and parsing XML dump files.
=== Utilities ===
+articlenos.py : Displays the ordinal number of the new articles being
+ created, as visible on the Recent Changes list.
basic.py : A template from which simple bots can be made.
-checkusage.py : Provides a way for users of the Wikimedia toolserver
- to check the use of images from Commons on
- other Wikimedia wikis.
+delinker.py : Delinks and replaces images.
+editarticle.py : Edit an article with your favourite editor. Run
+ the script with the "--help" option to get
+ detailed information on possibilities.
extract_wikilinks.py : Two bots to get all linked-to wiki pages from an
HTML-file. They differ in their output:
extract_names gives bare names (can be used for
@@ -57,6 +86,8 @@
output.
login.py : Log in to an account on your "home" wiki, or check the
login status
+maintainer.py : Shares tasks between workers.
+maintcont.py : The controller bot for maintainer.py
splitwarning.py : split an interwiki.log file into warning files for each
separate language. suggestion: Zip the created
files up, put them somewhere on the internet, and
@@ -64,87 +95,126 @@
mailinglist.
testfamily.py : Check whether you are logged in all known languages
in a family.
-xmltest.py : Read an XML file (e.g. the sax_parse_bug.txt sometimes
- created by interwiki.py), and if it contains an error,
- show a stacktrace with the location of the error.
-editarticle.py : Edit an article with your favourite editor. Run
- the script with the "--help" option to get
- detailed infortion on possiblities.
-sqldump.py : Extract information from local cur SQL dump
- files, like the ones at http://download.wikimedia.org
rcsort.py : A tool to see the recentchanges ordered by user instead
of by date.
-threadpool.py :
-xmlreader.py :
+upd-log.py : Update notification script
+version.py : Outputs Pywikipedia's revision number, Python's version
+ and OS used.
+warnfile.py : A robot that parses a warning file created by
+ interwiki.py on another language wiki, and
+ implements the suggested changes without verifying
+ them.
watchlist.py : Allows access to the bot account's watchlist.
-wikicomserver.py : This library allows the use of the pywikipediabot
- directly from COM-aware applications.
=== Robots ===
+add_text.py : Adds text at the top or end of pages
+archivebot.py : Archives discussion threads
+blockpagechecker.py : Deletes any protection templates that are on pages
+ which aren't actually protected.
capitalize_redirects.py: Script to create capitalized redirects to articles.
casechecker.py : Script to enumerate all pages in the wikipedia and
find all titles with mixed Latin and Cyrillic
alphabets.
-category.py : add a category link to all pages mentioned on a page,
+catall.py : Add or change categories on a number of pages.
+category.py : Add a category link to all pages mentioned on a page,
change or remove category tags
category_redirect.py : Maintain category redirects and replace links to
redirected categories.
-catall.py : Add or change categories on a number of pages.
-catmove.pl : Need Perl programming language for this; takes a list
- of category moves or removes to make and uses
- category.py.
+censure.py : Bad word checker bot.
+cfd.py : Processes the categories for discussion working page.
+ It parses out the actions that need to be taken as a
+ result of CFD discussions and performs them.
+checkimages.py : Check recently uploaded files. Checks if a file
+ description is present and if there are other problems
+ in the image's description.
clean_sandbox.py : This bot cleans the sandbox (test) page.
+commons_category_redirect.py: Cleans Commons:Category:Non-empty category
+ redirects by moving all the files, pages and categories
+ from redirected category to the target category.
commons_link.py : This robot includes the commons template to link Commons
and your wiki project.
+commonscat.py : Adds {{commonscat}} to Wikipedia categories (or
+ articles), if another language Wikipedia already has such
+ a template
copyright.py : This robot checks for copyrighted text in Google, Yahoo! and
Live Search.
+copyright_clean.py : Remove reports of copyright.py on wiki pages.
+ Uses YurikAPI.
+copyright_put.py : Put reports of copyright.py on wiki pages.
cosmetic_changes.py : Can do slight modifications to a wiki page source code
such that the code looks cleaner.
+create_categories.py : Program to batch create categories.
+data_ingestion.py : A generic bot to do batch uploading to Commons.
+deledpimage.py : Remove EDP images in non-article namespaces.
delete.py : This script can be used to delete pages en masse.
disambredir.py : Changing redirect names in disambiguation pages.
+djvutext.py : Extracts OCR text from djvu files and uploads onto
+ pages in the "Page" namespace on Wikisource.
featured.py : A robot to check featured articles.
-fixes.py : This is not a bot, perform one of the predefined
- replacements tasks, used for "replace.py
- -fix:replacement".
+fixing_redirects.py : Correct all redirect links of processed pages.
+flickrripper.py : Upload images from Flickr easily.
+followlive.py : follow new articles on a wikipedia and flag them
+ with a template.
image.py : This script can be used to change one image to another
or remove an image entirely.
+imagecopy.py : Copies images from a wikimedia wiki to Commons
+imagecopy_self.py : Copy self published files from the English Wikipedia to
+ Commons.
+imageharvest.py : Bot for getting multiple images from an external site.
+imagerecat.py : Try to find categories for media on Commons.
imagetransfer.py : Given a wiki page, check the interwiki links for
images, and let the user choose among them for
images to upload.
+imageuncat.py : Adds uncat template to images without categories at
+ Commons
inline_images.py : This bot looks for images that are linked inline
(i.e., they are hosted from an external server and
hotlinked).
interwiki.py : A robot to check interwiki links on all pages (or
a range of pages) of a wiki.
interwiki_graph.py : Makes it possible to create graphs with interwiki.py.
-imageharvest.py : Bot for getting multiple images from an external site.
isbn.py : Bot to convert all ISBN-10 codes to the ISBN-13
format.
+lonelypages.py : Place a template on pages which are not linked to by
+ other pages, and are therefore lonely
makecat.py : Given an existing or new category, find pages for that
category.
+match_images.py : Match two images based on histograms.
+misspelling.py : Similar to solve_disambiguation.py. It is supposed to
+ fix links that contain common spelling mistakes.
movepages.py : Bot to move pages to another title.
+ndashredir.py : Creates hyphenated redirects to articles with n dash
+ or m dash in their title.
+noreferences.py : Searches for pages where <references /> is missing
+ although a <ref> tag is present, and in that case adds
+ a new references section.
nowcommons.py : This bot can delete images with NowCommons template.
-ndashredir.py : Creates hyphenated redirects to articles with n dash
- or m dash in their title.
pagefromfile.py : This bot takes its input from a file that contains a
number of pages to be put on the wiki.
+panoramiopicker.py : Upload images from Panoramio easily.
+patrol.py : Obtains a list of pages and marks the edits as patrolled
+ based on a whitelist.
piper.py : Pipes article text through external program(s) on
STDIN and collects its STDOUT which is used as the
new article text if it differs from the original.
+protect.py : Protect and unprotect pages en masse.
redirect.py : Fix double redirects and broken redirects. Note:
solve_disambiguation also has functions which treat
redirects.
-refcheck.py : This script checks references to see if they are
- properly formatted.
+reflinks.py : Search for references which are only made of a link
+ without title and fetch the html title from the link to
+ use it as the title of the wiki link in the reference.
replace.py : Search articles for a text and replace it by another
text. Both texts are set in two configurable
text files. The bot can either work on a set of given
pages or crawl an SQL dump.
+revertbot.py : Revert edits.
saveHTML.py : Downloads the HTML-pages of articles and images.
selflink.py : This bot goes over multiple pages of the home wiki,
searches for selflinks, and allows removing them.
solve_disambiguation.py: Interactive robot doing disambiguation.
+spamremove.py : Remove links that are being or have been spammed.
speedy_delete.py : This bot loads a list of pages from the category of
 candidates for speedy deletion and gives the
user an interactive prompt to decide whether
@@ -156,10 +226,17 @@
separator, in the right order).
standardize_notes.py : Converts external links and notes/references to
: Footnote3 ref/note format. Rewrites References.
+statistics_in_wikitable.py: This bot renders statistics provided by
+ [[Special:Statistics]] in a table on a wiki page.
+ Thus it creates and updates a statistics wikitable.
+sum_disc.py : Summarize discussions spread over the whole wiki
+ including all namespaces
table2wiki.py : Semi-automatic converting HTML-tables to wiki-tables.
+tag_nowcommons.py : Tag files available at Commons with the Nowcommons
+ template
+template.py : change one template (that is {{...}}) into another.
templatecount.py : Display the list of pages transcluding a given list
of templates.
-template.py : change one template (that is {{...}}) into another.
touch.py : Bot goes over all pages of the home wiki, and edits
them without changing anything.
unlink.py : This bot unlinks a page on every page that links to it.
@@ -168,36 +245,34 @@
upload.py : upload an image to a wiki.
us-states.py : A robot to add redirects to cities for US state
abbreviations.
-warnfile.py : A robot that parses a warning file created by
- interwiki.py on another language wiki, and
- implements the suggested changes without verifying
- them.
weblinkchecker.py : Check if external links are still working.
welcome.py : Script to welcome new users.
=== Directories ===
+botlist : Contains cached bot users
+cache : Contains disk cached pages and data retrieved by
+ featured.py
category :
+commonsdelinker : Contains commons delinker bot maintained by Siebrand
copyright : Contains information retrieved by copyright.py
deadlinks : Contains information retrieved by weblinkchecker.py
disambiguations : If you run solve_disambiguation.py with the -primary
argument, the bot will save information here
families : Contains wiki-specific information like URLs,
languages, encodings etc.
-featured : Stored featured article in cache file.
+i18n : Contains i18n translations for bot edit summaries
interwiki_dumps : If the interwiki bot is interrupted, it will store
- a dump file here. This file will be read when using
+ a dump file here. These files will be read when using
the interwiki bot with -restore or -continue.
interwiki_graphs : Contains graphs for interwiki_graph.py
+login-data : login.py stores your cookies here (Your password won't
+ be stored as plaintext).
logs : Contains logfiles.
-mediawiki-messages : Information retrieved by mediawiki_messages.py will
- be stored here.
maintenance : contains maintenance scripts for the development team
-login-data : login.py stores your cookies here (Your password won't
- be stored as plaintext).
pywikibot : Contains some libraries and control files
simplejson : A simple, fast, extensible JSON encoder and decoder
- used by query.py.
+ used by query.py. Needed for Python releases prior to 2.6
spelling : Contains dictionaries for spellcheck.py.
test : Some test stuff for the development team
userinterfaces : Contains Tkinter, WxPython, terminal and
@@ -207,10 +282,6 @@
here.
wiktionary : Contains scripts used for the Wiktionary project.
-=== Unit tests ===
-
-wiktionarytest.py : Unit tests for wiktionary.py
-
External software can be used with PyWikipediaBot:
* Win32com library for use with wikicomserver.py
* Pydot, Pyparsing and Graphviz for use with interwiki_graph.py
@@ -231,28 +302,23 @@
python interwiki.py -help
-You need to have at least python version 2.4 or newer installed on your
-computer to be able to run any of the code in this package, but not 3.x,
-because pywikipediabot is still not updated to it! Support for older versions
-of python is not planned. (http://www.python.org/download/)
+You need to have at least Python version 2.7.2 (http://www.python.org/download/)
+or newer installed on your computer to be able to run any of the code in this
+package, but not 3.x, because pywikipediabot has not yet been updated for it!
+Support for older versions of Python is not planned. Some scripts may still run
+with older Python releases. Please refer to the manual at mediawiki.org for
+further details and restrictions.
-
You do not need to "install" this package to be able to make use of
it. You can actually just run it from the directory where you unpacked
it or where you have your copy of the SVN sources.
-Before you run any of the programs, you need to create a file named
-user-config.py in your current directory. It needs at least two lines:
-The first line should set your real name; this will be used to identify you
-when the robot is making changes, in case you are not logged in. The
-second line sets the code of your home language. The file should look like:
-
-===========
-username='My name'
-mylang='xx'
-===========
-
-There are other variables that can be set in the configuration file, please
+The first time you run a script, the package creates a file named user-config.py
+in your current directory. It asks for the family and language code you are
+working on and at least for the bot's user name; this will be used to identify
+you when the robot is making changes, in case you are not logged in. You may
+choose to create a small or an extended version of the config file with further
+information. For other variables that can be set in the configuration file, please
check config.py for ideas.
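
For illustration only (the values below are placeholders; the exact contents
depend on the answers you give when the file is generated), a minimal
user-config.py might look like:

===========
family = 'wikipedia'
mylang = 'xx'
usernames['wikipedia']['xx'] = u'ExampleBot'
===========
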
After that, you are advised to create a username + password for the bot, and
http://www.mediawiki.org/wiki/Special:Code/pywikipedia/10528
Revision: 10528
Author: xqt
Date: 2012-09-16 13:48:36 +0000 (Sun, 16 Sep 2012)
Log Message:
-----------
old python 2.3 scripts
Added Paths:
-----------
archive/old python 2.3 scripts/
archive/old python 2.3 scripts/interwiki.py
archive/old python 2.3 scripts/wikipedia.py
Copied: archive/old python 2.3 scripts/interwiki.py (from rev 10463, trunk/pywikipedia/interwiki.py)
===================================================================
--- archive/old python 2.3 scripts/interwiki.py (rev 0)
+++ archive/old python 2.3 scripts/interwiki.py 2012-09-16 13:48:36 UTC (rev 10528)
@@ -0,0 +1,2585 @@
+#!/usr/bin/python
+# -*- coding: utf-8 -*-
+"""
+Script to check language links for general pages. This works by downloading the
+page, and using existing translations plus hints from the command line to
+download the equivalent pages from other languages. All such pages are
+downloaded as well and checked for interwiki links recursively until no new
+links are encountered. A rationalization process then selects the
+right interwiki links, and if this is unambiguous, the interwiki links in the
+original page will be automatically updated and the modified page uploaded.
+
+These command-line arguments can be used to specify which pages to work on:
+
+&pagegenerators_help;
+
+ -days: Like -years, but runs through all date pages. Stops at
+ Dec 31. If the argument is given in the form -days:X,
+ it will start at month no. X through Dec 31. If the
+ argument is simply given as -days, it will run from
+ Jan 1 through Dec 31. E.g. for -days:9 it will run
+ from Sep 1 through Dec 31.
+
+ -years: run on all year pages in numerical order. Stop at year 2050.
+ If the argument is given in the form -years:XYZ, it
+ will run from [[XYZ]] through [[2050]]. If XYZ is a
+ negative value, it is interpreted as a year BC. If the
+ argument is simply given as -years, it will run from 1
+ through 2050.
+
+ This implies -noredirect.
+
+ -new: Work on the 100 newest pages. If given as -new:x, will work
+ on the x newest pages.
+ When multiple -namespace parameters are given, x pages are
+ inspected, and only the ones in the selected name spaces are
+ processed. Use -namespace:all for all namespaces. Without
+ -namespace, only article pages are processed.
+
+ This implies -noredirect.
+
+ -restore: restore a set of "dumped" pages the robot was working on
+ when it terminated. The dump file will be subsequently
+ removed.
+
+ -restore:all restore a set of "dumped" pages of all dumpfiles to a given
+ family remaining in the "interwiki-dumps" directory. All
+ these dump files will be subsequently removed. If restoring
+ process interrupts again, it saves all unprocessed pages in
+ one new dump file of the given site.
+
+ -continue: like restore, but after having gone through the dumped pages,
+ continue alphabetically starting at the last of the dumped
+ pages. The dump file will be subsequently removed.
+
+ -warnfile: used as -warnfile:filename, reads all warnings from the
+ given file that apply to the home wiki language,
+ and read the rest of the warning as a hint. Then
+ treats all the mentioned pages. A quicker way to
+ implement warnfile suggestions without verifying them
+ against the live wiki is using the warnfile.py
+ script.
+
+Additionally, these arguments can be used to restrict the bot to certain pages:
+
+ -namespace:n Number or name of namespace to process. The parameter can be
+ used multiple times. It works in combination with all other
+ parameters, except for the -start parameter. If you e.g.
+ want to iterate over all categories starting at M, use
+ -start:Category:M.
+
+ -number: used as -number:#, specifies that the robot should process
+ that amount of pages and then stop. This is only useful in
+ combination with -start. The default is not to stop.
+
+ -until: used as -until:title, specifies that the robot should
+ process pages in wiki default sort order up to, and
+ including, "title" and then stop. This is only useful in
+ combination with -start. The default is not to stop.
+ Note: do not specify a namespace, even if -start has one.
+
+ -bracket only work on pages that have (in the home language)
+ parenthesis in their title. All other pages are skipped.
+ (note: without ending colon)
+
+ -skipfile: used as -skipfile:filename, skip all links mentioned in
+ the given file. This does not work with -number!
+
+ -skipauto use to skip all pages that can be translated automatically,
+ like dates, centuries, months, etc.
+ (note: without ending colon)
+
+ -lack: used as -lack:xx with xx a language code: only work on pages
+ without links to language xx. You can also add a number nn
+ like -lack:xx:nn, so that the bot only works on pages with
+ at least nn interwiki links (the default value for nn is 1).
+
+These arguments control miscellaneous bot behaviour:
+
+ -quiet Use this option to get less output
+ (note: without ending colon)
+
+ -async Put page on queue to be saved to wiki asynchronously. This
+ enables loading pages during save throttling and gives
+ better performance.
+ NOTE: For post-processing it always assumes that saving
+ the pages was successful.
+ (note: without ending colon)
+
+ -summary: Set an additional action summary message for the edit. This
+ could be used to further explain the bot's action.
+ This will only be used in non-autonomous mode.
+
+ -hintsonly The bot does not ask for a page to work on, even if none of
+ the above page sources was specified. This will make the
+ first existing page of -hint or -hintfile slip in as the start
+ page, determining properties like namespace, disambiguation
+ state, and so on. When no existing page is found in the
+ hints, the bot does nothing.
+ Hitting return without input on the "Which page to check:"
+ prompt has the same effect as using -hintsonly.
+ Options like -back, -same or -wiktionary are in effect only
+ after a page has been found to work on.
+ (note: without ending colon)
+
+These arguments are useful to provide hints to the bot:
+
+ -hint: used as -hint:de:Anweisung to give the robot a hint
+ where to start looking for translations. If no text
+ is given after the second ':', the name of the page
+ itself is used as the title for the hint, unless the
+ -hintnobracket command line option (see there) is also
+ selected.
+
+ There are some special hints, trying a number of languages
+ at once:
+ * all: All languages with at least ca. 100 articles.
+ * 10: The 10 largest languages (sites with most
+ articles). Analogous for any other natural
+ number.
+ * arab: All languages using the Arabic alphabet.
+ * cyril: All languages that use the Cyrillic alphabet.
+ * chinese: All Chinese dialects.
+ * latin: All languages using the Latin script.
+ * scand: All Scandinavian languages.
+
+ Names of families that forward their interlanguage links
+ to the wiki family being worked upon can be used (with
+ -family=wikipedia only), they are:
+ * commons: Interlanguage links of Wikimedia Commons.
+ * incubator: Links in pages on the Wikimedia Incubator.
+ * meta: Interlanguage links of named pages on Meta.
+ * species: Interlanguage links of the Wikispecies wiki.
+ * strategy: Links in pages on Wikimedia's strategy wiki.
+ * test: Take interwiki links from Test Wikipedia
+
+ Languages, groups and families having the same page title
+ can be combined, as -hint:5,scand,sr,pt,commons:New_York
+
+ -hintfile: similar to -hint, except that hints are taken from the given
+ file, enclosed in [[]] each, instead of the command line.
+
+ -askhints: for each page one or more hints are asked. See hint: above
+ for the format, one can for example give "en:something" or
+ "20:" as hint.
+
+ -same looks over all 'serious' languages for the same title.
+ -same is equivalent to -hint:all:
+ (note: without ending colon)
+
+ -wiktionary: similar to -same, but will ONLY accept names that are
+ identical to the original. Also, if the title is not
+ capitalized, it will only go through other wikis without
+ automatic capitalization.
+
+ -untranslated: works normally on pages with at least one interlanguage
+ link; asks for hints for pages that have none.
+
+ -untranslatedonly: same as -untranslated, but pages which already have a
+ translation are skipped. Hint: do NOT use this in
+ combination with -start without a -number limit, because
+ you will go through the whole alphabet before any queries
+ are performed!
+
+ -showpage when asking for hints, show the first bit of the text
+ of the page always, rather than doing so only when being
+ asked for (by typing '?'). Only useful in combination
+ with a hint-asking option like -untranslated, -askhints
+ or -untranslatedonly.
+ (note: without ending colon)
+
+ -noauto Do not use the automatic translation feature for years and
+ dates, only use found links and hints.
+ (note: without ending colon)
+
+ -hintnobracket used to make the robot strip everything in brackets,
+ and surrounding spaces from the page name, before it is
+ used in a -hint:xy: where the page name has been left out,
+ or -hint:all:, -hint:10:, etc. without a name, or
+ an -askhint reply, where only a language is given.
+
+These arguments define how much user confirmation is required:
+
+ -autonomous run automatically, do not ask any questions. If a question
+ -auto to an operator is needed, write the name of the page
+ to autonomous_problems.dat and continue on the next page.
+ (note: without ending colon)
+
+ -confirm ask for confirmation before any page is changed on the
+ live wiki. Without this argument, additions and
+ unambiguous modifications are made without confirmation.
+ (note: without ending colon)
+
+ -force do not ask permission to make "controversial" changes,
+ like removing a language because none of the found
+ alternatives actually exists.
+ (note: without ending colon)
+
+ -cleanup like -force but only removes interwiki links to non-existent
+ or empty pages.
+
+ -select ask for each link whether it should be included before
+ changing any page. This is useful if you want to remove
+ invalid interwiki links and if you do multiple hints of
+ which some might be correct and others incorrect. Combining
+ -select and -confirm is possible, but seems like overkill.
+ (note: without ending colon)
+
+These arguments specify in which way the bot should follow interwiki links:
+
+ -noredirect do not follow redirects nor category redirects.
+ (note: without ending colon)
+
+ -initialredirect work on its target if a redirect or category redirect is
+ entered on the command line or by a generator (note: without
+ ending colon). It is recommended to use this option with the
+ -movelog pagegenerator.
+
+ -neverlink: used as -neverlink:xx where xx is a language code:
+ Disregard any links found to language xx. You can also
+ specify a list of languages to disregard, separated by
+ commas.
+
+ -ignore: used as -ignore:xx:aaa where xx is a language code, and
+ aaa is a page title to be ignored.
+
+ -ignorefile: similar to -ignore, except that the pages are taken from
+ the given file instead of the command line.
+
+ -localright do not follow interwiki links from other pages than the
+ starting page. (Warning! Should be used very sparingly,
+ only when you are sure you have first gotten the interwiki
+ links on the starting page exactly right).
+ (note: without ending colon)
+
+ -hintsareright do not follow interwiki links to sites for which hints
+ on existing pages are given. Note that hints given
+ interactively, via the -askhint command line option,
+ are only effective once they have been entered; thus
+ interwiki links on the starting page are followed
+ regardless of hints given when prompted.
+ (Warning! Should be used with caution!)
+ (note: without ending colon)
+
+ -back only work on pages that have no backlink from any other
+ language; if a backlink is found, all work on the page
+ will be halted. (note: without ending colon)
+
+The following arguments are only important for users who have accounts for
+multiple languages, and specify on which sites the bot should modify pages:
+
+ -localonly only work on the local wiki, not on other wikis in the
+ family I have a login at. (note: without ending colon)
+
+ -limittwo only update two pages - one in the local wiki (if logged-in)
+ and one in the top available one.
+ For example, if the local page has links to de and fr,
+ this option will make sure that only the local site and
+ the de: (larger) site are updated. This option is useful
+ to quickly set two-way links without updating all of the
+ wiki family's sites.
+ (note: without ending colon)
+
+ -whenneeded works like limittwo, but other languages are changed in the
+ following cases:
+ * If there are no interwiki links at all on the page
+ * If an interwiki link must be removed
+ * If an interwiki link must be changed and there has been
+ a conflict for this page
+ Optionally, -whenneeded can be given an additional number
+ (for example -whenneeded:3), in which case other languages
+ will be changed if there are that number or more links to
+ change or add. (note: without ending colon)
+
+The following arguments influence how many pages the bot works on at once:
+
+ -array: The number of pages the bot tries to work on at once.
+ If the number of pages loaded is lower than this number,
+ a new set of pages is loaded from the starting wiki. The
+ default is 100, but can be changed in the config variable
+ interwiki_min_subjects
+
+ -query: The maximum number of pages that the bot will load at once.
+ Default value is 60.
+
+Some configuration options can be used to change the way this robot works:
+
+interwiki_min_subjects: the minimum amount of subjects that should be processed
+ at the same time.
+
+interwiki_backlink: if set to True, all problems in foreign wikis will
+ be reported
+
+interwiki_shownew: should interwiki.py display every new link it discovers?
+
+interwiki_graph: output a graph PNG file on conflicts? You need pydot for
+ this: http://dkbza.org/pydot.html
+
+interwiki_graph_format: the file format for interwiki graphs
+
+without_interwiki: save file with local articles without interwikis
+
+All these options can be changed through the user-config.py configuration file.
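+
+For example, adding lines like the following to user-config.py (the values
+here are only illustrative) enables backlink reports and conflict graphs:
+
+    interwiki_backlink = True
+    interwiki_graph = True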
+
+If interwiki.py is terminated before it is finished, it will write a dump file
+to the interwiki-dumps subdirectory. The program will read it if invoked with
+the "-restore" or "-continue" option, and finish all the subjects in that list.
+After finishing, the dump file will be deleted. To run the interwiki-bot on all
+pages on a language, run it with option "-start:!", and if it takes so long
+that you have to break it off, use "-continue" next time.
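+
+For example, a typical sequence might be:
+
+    python interwiki.py -start:!        # work through all pages
+    python interwiki.py -continue       # resume after an interruption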
+
+"""
+#
+# (C) Rob W.W. Hooft, 2003
+# (C) Daniel Herding, 2004
+# (C) Yuri Astrakhan, 2005-2006
+# (C) xqt, 2009-2012
+# (C) Pywikipedia bot team, 2007-2012
+#
+# Distributed under the terms of the MIT license.
+#
+__version__ = '$Id$'
+#
+
+import sys, copy, re, os
+import time
+import codecs
+import socket
+
+try:
+ set # introduced in Python 2.4: faster and future
+except NameError:
+ from sets import Set as set
+
+try: sorted ## Introduced in 2.4
+except NameError:
+ def sorted(seq, cmp=None, key=None, reverse=False):
+ """Copy seq and sort and return it.
+ >>> sorted([3, 1, 2])
+ [1, 2, 3]
+ """
+ seq2 = copy.copy(seq)
+ if key:
+ if cmp is None:
+ cmp = __builtins__.cmp
+ seq2.sort(lambda x,y: cmp(key(x), key(y)))
+ else:
+ if cmp is None:
+ seq2.sort()
+ else:
+ seq2.sort(cmp)
+ if reverse:
+ seq2.reverse()
+ return seq2
+
+import wikipedia as pywikibot
+import config
+import catlib
+import pagegenerators
+from pywikibot import i18n
+import titletranslate, interwiki_graph
+import webbrowser
+
+docuReplacements = {
+ '&pagegenerators_help;': pagegenerators.parameterHelp
+}
+
+class SaveError(pywikibot.Error):
+ """
+ An attempt to save a page with changed interwiki has failed.
+ """
+
+class LinkMustBeRemoved(SaveError):
+ """
+ An interwiki link has to be removed, but this can't be done because of user
+ preferences or because the user chose not to change the page.
+ """
+
+class GiveUpOnPage(pywikibot.Error):
+ """
+ The user chose not to work on this page and its linked pages any more.
+ """
+
+# Subpage templates. Must be in lower case,
+# whereas subpage itself must be case sensitive
+moved_links = {
+ 'bn' : (u'documentation', u'/doc'),
+ 'ca' : (u'ús de la plantilla', u'/ús'),
+ 'cs' : (u'dokumentace', u'/doc'),
+ 'da' : (u'dokumentation', u'/doc'),
+ 'de' : (u'dokumentation', u'/Meta'),
+ 'dsb': ([u'dokumentacija', u'doc'], u'/Dokumentacija'),
+ 'en' : ([u'documentation',
+ u'template documentation',
+ u'template doc',
+ u'doc',
+ u'documentation, template'], u'/doc'),
+ 'es' : ([u'documentación', u'documentación de plantilla'], u'/doc'),
+ 'eu' : (u'txantiloi dokumentazioa', u'/dok'),
+ 'fa' : ([u'documentation',
+ u'template documentation',
+ u'template doc',
+ u'doc',
+ u'توضیحات',
+ u'زیرصفحه توضیحات'], u'/doc'),
+ # fi: no idea how to handle this type of subpage at :Metasivu:
+ 'fi' : (u'mallineohje', None),
+ 'fr' : ([u'/documentation', u'documentation', u'doc_modèle',
+ u'documentation modèle', u'documentation modèle compliqué',
+ u'documentation modèle en sous-page',
+ u'documentation modèle compliqué en sous-page',
+ u'documentation modèle utilisant les parserfunctions en sous-page',
+ ],
+ u'/Documentation'),
+ 'hsb': ([u'dokumentacija', u'doc'], u'/Dokumentacija'),
+ 'hu' : (u'sablondokumentáció', u'/doc'),
+ 'id' : (u'template doc', u'/doc'),
+ 'ja' : (u'documentation', u'/doc'),
+ 'ka' : (u'თარგის ინფო', u'/ინფო'),
+ 'ko' : (u'documentation', u'/설명문서'),
+ 'ms' : (u'documentation', u'/doc'),
+ 'no' : (u'dokumentasjon', u'/dok'),
+ 'nn' : (u'dokumentasjon', u'/dok'),
+ 'pl' : (u'dokumentacja', u'/opis'),
+ 'pt' : ([u'documentação', u'/doc'], u'/doc'),
+ 'ro' : (u'documentaţie', u'/doc'),
+ 'ru' : (u'doc', u'/doc'),
+ 'sv' : (u'dokumentation', u'/dok'),
+ 'uk' : ([u'документація',
+ u'doc',
+ u'documentation'], u'/Документація'),
+ 'vi' : (u'documentation', u'/doc'),
+ 'zh' : ([u'documentation', u'doc'], u'/doc'),
+}
+
+# A list of template names in different languages.
+# Pages which contain these shouldn't be changed.
+ignoreTemplates = {
+ '_default': [u'delete'],
+ 'ar' : [u'قيد الاستخدام'],
+ 'cs' : [u'Pracuje_se'],
+ 'de' : [u'inuse', 'in use', u'in bearbeitung', u'inbearbeitung',
+ u'löschen', u'sla',
+ u'löschantrag', u'löschantragstext',
+ u'falschschreibung',
+ u'obsolete schreibung', 'veraltete schreibweise'],
+ 'en' : [u'inuse', u'softredirect'],
+ 'fa' : [u'در دست ویرایش ۲', u'حذف سریع'],
+ 'pdc': [u'lösche'],
+}
+
+class Global(object):
+ """
+ Container class for global settings.
+ Use of globals outside of this is to be avoided.
+ """
+ autonomous = False
+ confirm = False
+ always = False
+ select = False
+ followredirect = True
+ initialredirect = False
+ force = False
+ cleanup = False
+ remove = []
+ maxquerysize = 60
+ same = False
+ skip = set()
+ skipauto = False
+ untranslated = False
+ untranslatedonly = False
+ auto = True
+ neverlink = []
+ showtextlink = 0
+ showtextlinkadd = 300
+ localonly = False
+ limittwo = False
+ strictlimittwo = False
+ needlimit = 0
+ ignore = []
+ parenthesesonly = False
+ rememberno = False
+ followinterwiki = True
+ minsubjects = config.interwiki_min_subjects
+ nobackonly = False
+ askhints = False
+ hintnobracket = False
+ hints = []
+ hintsareright = False
+ contentsondisk = config.interwiki_contents_on_disk
+ lacklanguage = None
+ minlinks = 0
+ quiet = False
+ restoreAll = False
+ async = False
+ summary = u''
+
+ def readOptions(self, arg):
+ """ Read all commandline parameters for the global container """
+ if arg == '-noauto':
+ self.auto = False
+ elif arg.startswith('-hint:'):
+ self.hints.append(arg[6:])
+ elif arg.startswith('-hintfile'):
+ hintfilename = arg[10:]
+ if (hintfilename is None) or (hintfilename == ''):
+ hintfilename = pywikibot.input(u'Please enter the hint filename:')
+ f = codecs.open(hintfilename, 'r', config.textfile_encoding)
+ R = re.compile(ur'\[\[(.+?)(?:\]\]|\|)') # hint or title ends either before | or before ]]
+ for pageTitle in R.findall(f.read()):
+ self.hints.append(pageTitle)
+ f.close()
+ elif arg == '-force':
+ self.force = True
+ elif arg == '-cleanup':
+ self.cleanup = True
+ elif arg == '-same':
+ self.same = True
+ elif arg == '-wiktionary':
+ self.same = 'wiktionary'
+ elif arg == '-untranslated':
+ self.untranslated = True
+ elif arg == '-untranslatedonly':
+ self.untranslated = True
+ self.untranslatedonly = True
+ elif arg == '-askhints':
+ self.untranslated = True
+ self.untranslatedonly = False
+ self.askhints = True
+ elif arg == '-hintnobracket':
+ self.hintnobracket = True
+ elif arg == '-confirm':
+ self.confirm = True
+ elif arg == '-select':
+ self.select = True
+ elif arg == '-autonomous' or arg == '-auto':
+ self.autonomous = True
+ elif arg == '-noredirect':
+ self.followredirect = False
+ elif arg == '-initialredirect':
+ self.initialredirect = True
+ elif arg == '-localonly':
+ self.localonly = True
+ elif arg == '-limittwo':
+ self.limittwo = True
+ self.strictlimittwo = True
+ elif arg.startswith('-whenneeded'):
+ self.limittwo = True
+ self.strictlimittwo = False
+ try:
+ self.needlimit = int(arg[12:])
+ except KeyError:
+ pass
+ except ValueError:
+ pass
+ elif arg.startswith('-skipfile:'):
+ skipfile = arg[10:]
+ skipPageGen = pagegenerators.TextfilePageGenerator(skipfile)
+ for page in skipPageGen:
+ self.skip.add(page)
+ del skipPageGen
+ elif arg == '-skipauto':
+ self.skipauto = True
+ elif arg.startswith('-neverlink:'):
+ self.neverlink += arg[11:].split(",")
+ elif arg.startswith('-ignore:'):
+ self.ignore += [pywikibot.Page(None,p) for p in arg[8:].split(",")]
+ elif arg.startswith('-ignorefile:'):
+ ignorefile = arg[12:]
+ ignorePageGen = pagegenerators.TextfilePageGenerator(ignorefile)
+ for page in ignorePageGen:
+ self.ignore.append(page)
+ del ignorePageGen
+ elif arg == '-showpage':
+ self.showtextlink += self.showtextlinkadd
+ elif arg == '-graph':
+ # override configuration
+ config.interwiki_graph = True
+ elif arg == '-bracket':
+ self.parenthesesonly = True
+ elif arg == '-localright':
+ self.followinterwiki = False
+ elif arg == '-hintsareright':
+ self.hintsareright = True
+ elif arg.startswith('-array:'):
+ self.minsubjects = int(arg[7:])
+ elif arg.startswith('-query:'):
+ self.maxquerysize = int(arg[7:])
+ elif arg == '-back':
+ self.nobackonly = True
+ elif arg == '-quiet':
+ self.quiet = True
+ elif arg == '-async':
+ self.async = True
+ elif arg.startswith('-summary'):
+ if len(arg) == 8:
+ self.summary = pywikibot.input(u'What summary do you want to use?')
+ else:
+ self.summary = arg[9:]
+ elif arg.startswith('-lack:'):
+ remainder = arg[6:].split(':')
+ self.lacklanguage = remainder[0]
+ if len(remainder) > 1:
+ self.minlinks = int(remainder[1])
+ else:
+ self.minlinks = 1
+ else:
+ return False
+ return True
+
+class StoredPage(pywikibot.Page):
+ """
+ Store the Page contents on disk to avoid sucking too much
+ memory when a large number of Page objects are loaded
+ at the same time.
+ """
+
+ # Please prefix the class members names by SP
+ # to avoid possible name clashes with pywikibot.Page
+
+ # path to the shelve
+ SPpath = None
+ # shelve
+ SPstore = None
+
+ # attributes created by pywikibot.Page.__init__
+ SPcopy = [ '_editrestriction',
+ '_site',
+ '_namespace',
+ '_section',
+ '_title',
+ 'editRestriction',
+ 'moveRestriction',
+ '_permalink',
+ '_userName',
+ '_ipedit',
+ '_editTime',
+ '_startTime',
+ '_revisionId',
+ '_deletedRevs' ]
+
+ def SPdeleteStore():
+ if StoredPage.SPpath:
+ del StoredPage.SPstore
+ os.unlink(StoredPage.SPpath)
+ SPdeleteStore = staticmethod(SPdeleteStore)
+
+ def __init__(self, page):
+ for attr in StoredPage.SPcopy:
+ setattr(self, attr, getattr(page, attr))
+
+ if not StoredPage.SPpath:
+ import shelve
+ index = 1
+ while True:
+ path = config.datafilepath('cache', 'pagestore' + str(index))
+ if not os.path.exists(path): break
+ index += 1
+ StoredPage.SPpath = path
+ StoredPage.SPstore = shelve.open(path)
+
+ self.SPkey = str(self)
+ self.SPcontentSet = False
+
+ def SPgetContents(self):
+ return StoredPage.SPstore[self.SPkey]
+
+ def SPsetContents(self, contents):
+ self.SPcontentSet = True
+ StoredPage.SPstore[self.SPkey] = contents
+
+ def SPdelContents(self):
+ if self.SPcontentSet:
+ del StoredPage.SPstore[self.SPkey]
+
+ _contents = property(SPgetContents, SPsetContents, SPdelContents)
+
+class PageTree(object):
+ """
+ Structure to manipulate a set of pages.
+ Allows filtering efficiently by Site.
+ """
+ def __init__(self):
+ # self.tree :
+ # Dictionary:
+ # keys: Site
+ # values: list of pages
+ # All pages found within Site are kept in
+ # self.tree[site]
+
+ # While using dict values would be faster for
+ # the remove() operation,
+ # keeping list values is important, because
+ # the order in which the pages were found matters:
+ # the earlier a page is found, the closer it is to the
+ # Subject.originPage. Chances are that pages found within
+ # 2 interwiki distance from the originPage are more related
+ # to the original topic than pages found later on, after
+ # 3, 4, 5 or more interwiki hops.
+
+ # Keeping this order is hence important to display an ordered
+ # list of pages to the user when he'll be asked to resolve
+ # conflicts.
+ self.tree = {}
+ self.size = 0
+
+ def filter(self, site):
+ """
+ Iterates over pages that are in Site site
+ """
+ try:
+ for page in self.tree[site]:
+ yield page
+ except KeyError:
+ pass
+
+ def __len__(self):
+ return self.size
+
+ def add(self, page):
+ site = page.site
+ if not site in self.tree:
+ self.tree[site] = []
+ self.tree[site].append(page)
+ self.size += 1
+
+ def remove(self, page):
+ try:
+ self.tree[page.site].remove(page)
+ self.size -= 1
+ except ValueError:
+ pass
+
+ def removeSite(self, site):
+ """
+ Removes all pages from Site site
+ """
+ try:
+ self.size -= len(self.tree[site])
+ del self.tree[site]
+ except KeyError:
+ pass
+
+ def siteCounts(self):
+ """
+ Yields (Site, number of pages in site) pairs
+ """
+ for site, d in self.tree.iteritems():
+ yield site, len(d)
+
+ def __iter__(self):
+ for site, plist in self.tree.iteritems():
+ for page in plist:
+ yield page
+
+class Subject(object):
+ """
+ Class to follow the progress of a single 'subject' (i.e. a page with
+ all its translations)
+
+
+ Subject is a transitive closure of the binary relation on Page:
+ "has_a_langlink_pointing_to".
+
+ A formal way to compute that closure would be:
+
+ With P a set of pages, NL ('NextLevel') a function on sets defined as:
+ NL(P) = { target | ∃ source ∈ P, target ∈ source.langlinks() }
+ pseudocode:
+ todo <- [originPage]
+ done <- []
+ while todo != []:
+ pending <- todo
+ todo <-NL(pending) / done
+ done <- NL(pending) U done
+ return done
+
+
+ There is, however, one limitation that is induced by implementation:
+ to compute efficiently NL(P), one has to load the page contents of
+ pages in P.
+ (Not only the langlinks have to be parsed from each Page, but we also want
+ to know if the Page is a redirect, a disambiguation, etc...)
+
+ Because of this, the pages in pending have to be preloaded.
+ However, because the pages in pending are likely to be in several sites
+ we cannot "just" preload them as a batch.
+
+ Instead of doing "pending <- todo" at each iteration, we have to elect a
+ Site, and we put in pending all the pages from todo that belong to that
+ Site:
+
+ Code becomes:
+ todo <- {originPage.site:[originPage]}
+ done <- []
+ while todo != {}:
+ site <- electSite()
+ pending <- todo[site]
+
+ preloadpages(site, pending)
+
+ todo[site] <- NL(pending) / done
+ done <- NL(pending) U done
+ return done
+
+
+ Subject objects only operate on pages that should have been preloaded before.
+ In fact, at any time:
+ * todo contains new Pages that have not been loaded yet
+ * done contains Pages that have been loaded, and that have been treated.
+ * If batch preloadings are successful, Page._get() is never called from
+ this Object.
+ """
+
+ def __init__(self, originPage=None, hints=None):
+ """Constructor. Takes as arguments the Page on the home wiki
+ plus optionally a list of hints for translation"""
+
+ if globalvar.contentsondisk:
+ if originPage:
+ originPage = StoredPage(originPage)
+
+ # Remember the "origin page"
+ self.originPage = originPage
+ # todo is a list of all pages that still need to be analyzed.
+ # Mark the origin page as todo.
+ self.todo = PageTree()
+ if originPage:
+ self.todo.add(originPage)
+
+ # done is a list of all pages that have been analyzed and that
+ # are known to belong to this subject.
+ self.done = PageTree()
+ # foundIn is a dictionary where pages are keys and lists of
+ # pages are values. It stores where we found each page.
+ # As we haven't yet found a page that links to the origin page, we
+ # start with an empty list for it.
+ if originPage:
+ self.foundIn = {self.originPage:[]}
+ else:
+ self.foundIn = {}
+ # This is a list of all pages that are currently scheduled for
+ # download.
+ self.pending = PageTree()
+ if globalvar.hintsareright:
+ # This is a set of sites that we got hints to
+ self.hintedsites = set()
+ self.translate(hints, globalvar.hintsareright)
+ self.confirm = globalvar.confirm
+ self.problemfound = False
+ self.untranslated = None
+ self.hintsAsked = False
+ self.forcedStop = False
+ self.workonme = True
+
+ def getFoundDisambig(self, site):
+ """
+ If we found a disambiguation on the given site while working on the
+ subject, this method returns it. If several ones have been found, the
+ first one will be returned.
+ Otherwise, None will be returned.
+ """
+ for tree in [self.done, self.pending]:
+ for page in tree.filter(site):
+ if page.exists() and page.isDisambig():
+ return page
+ return None
+
+ def getFoundNonDisambig(self, site):
+ """
+ If we found a non-disambiguation on the given site while working on the
+ subject, this method returns it. If several ones have been found, the
+ first one will be returned.
+ Otherwise, None will be returned.
+ """
+ for tree in [self.done, self.pending]:
+ for page in tree.filter(site):
+ if page.exists() and not page.isDisambig() \
+ and not page.isRedirectPage() and not page.isCategoryRedirect():
+ return page
+ return None
+
+ def getFoundInCorrectNamespace(self, site):
+ """
+ If we found a page that has the expected namespace on the given site
+ while working on the subject, this method returns it. If several ones
+ have been found, the first one will be returned.
+ Otherwise, None will be returned.
+ """
+ for tree in [self.done, self.pending, self.todo]:
+ for page in tree.filter(site):
+ # -hintsonly: before we have an origin page, any namespace will do.
+ if self.originPage and page.namespace() == self.originPage.namespace():
+ if page.exists() and not page.isRedirectPage() and not page.isCategoryRedirect():
+ return page
+ return None
+
+ def translate(self, hints = None, keephintedsites = False):
+ """Add the given translation hints to the todo list"""
+ if globalvar.same and self.originPage:
+ if hints:
+ pages = titletranslate.translate(self.originPage, hints = hints + ['all:'],
+ auto = globalvar.auto, removebrackets = globalvar.hintnobracket)
+ else:
+ pages = titletranslate.translate(self.originPage, hints = ['all:'],
+ auto = globalvar.auto, removebrackets = globalvar.hintnobracket)
+ else:
+ pages = titletranslate.translate(self.originPage, hints=hints,
+ auto=globalvar.auto, removebrackets=globalvar.hintnobracket,
+ site=pywikibot.getSite())
+ for page in pages:
+ if globalvar.contentsondisk:
+ page = StoredPage(page)
+ self.todo.add(page)
+ self.foundIn[page] = [None]
+ if keephintedsites:
+ self.hintedsites.add(page.site)
+
+ def openSites(self):
+ """
+ Iterator. Yields (site, count) pairs:
+ * site is a site where we still have work to do on
+ * count is the number of items in that Site that need work on
+ """
+ return self.todo.siteCounts()
+
+ def whatsNextPageBatch(self, site):
+ """
+ By calling this method, you 'promise' this instance that you will
+ preload all the 'site' Pages that are in the todo list.
+
+ This routine will return a list of pages that can be treated.
+ """
+ # Bug-check: Isn't there any work still in progress? We can't work on
+ # different sites at a time!
+ if len(self.pending) > 0:
+ raise 'BUG: Can\'t start to work on %s; still working on %s' % (site, self.pending)
+ # Prepare a list of suitable pages
+ result = []
+ for page in self.todo.filter(site):
+ self.pending.add(page)
+ result.append(page)
+
+ self.todo.removeSite(site)
+
+ # If there are any, return them. Otherwise, nothing is in progress.
+ return result
+
+ def makeForcedStop(self,counter):
+ """
+ Ends work on the page before the normal end.
+ """
+ for site, count in self.todo.siteCounts():
+ counter.minus(site, count)
+ self.todo = PageTree()
+ self.forcedStop = True
+
+ def addIfNew(self, page, counter, linkingPage):
+ """
+ Adds the pagelink given to the todo list, but only if we didn't know
+ it before. If it is added, update the counter accordingly.
+
+ Also remembers where we found the page, regardless of whether it had
+ already been found before or not.
+
+ Returns True if the page is new.
+ """
+ if self.forcedStop:
+ return False
+ # cannot check backlink before we have an origin page
+ if globalvar.nobackonly and self.originPage:
+ if page == self.originPage:
+ try:
+ pywikibot.output(u"%s has a backlink from %s."
+ % (page, linkingPage))
+ except UnicodeDecodeError:
+ pywikibot.output(u"Found a backlink for a page.")
+ self.makeForcedStop(counter)
+ return False
+
+ if page in self.foundIn:
+ # not new
+ self.foundIn[page].append(linkingPage)
+ return False
+ else:
+ if globalvar.contentsondisk:
+ page = StoredPage(page)
+ self.foundIn[page] = [linkingPage]
+ self.todo.add(page)
+ counter.plus(page.site)
+ return True
+
+ def skipPage(self, page, target, counter):
+ return self.isIgnored(target) or \
+ self.namespaceMismatch(page, target, counter) or \
+ self.wiktionaryMismatch(target)
+
+ def namespaceMismatch(self, linkingPage, linkedPage, counter):
+ """
+        Checks whether the given page is in a different namespace from
+        the origin page.
+
+        Returns True if the namespaces are different and the user
+        has chosen not to follow the linked page.
+ """
+ if linkedPage in self.foundIn:
+ # We have seen this page before, don't ask again.
+ return False
+ elif self.originPage and self.originPage.namespace() != linkedPage.namespace():
+ # Allow for a mapping between different namespaces
+ crossFrom = self.originPage.site.family.crossnamespace.get(self.originPage.namespace(), {})
+ crossTo = crossFrom.get(self.originPage.site.language(), crossFrom.get('_default', {}))
+ nsmatch = crossTo.get(linkedPage.site.language(), crossTo.get('_default', []))
+ if linkedPage.namespace() in nsmatch:
+ return False
+ if globalvar.autonomous:
+ pywikibot.output(u"NOTE: Ignoring link from page %s in namespace %i to page %s in namespace %i."
+ % (linkingPage, linkingPage.namespace(),
+ linkedPage, linkedPage.namespace()))
+ # Fill up foundIn, so that we will not write this notice
+ self.foundIn[linkedPage] = [linkingPage]
+ return True
+ else:
+ preferredPage = self.getFoundInCorrectNamespace(linkedPage.site)
+ if preferredPage:
+ pywikibot.output(u"NOTE: Ignoring link from page %s in namespace %i to page %s in namespace %i because page %s in the correct namespace has already been found."
+ % (linkingPage, linkingPage.namespace(), linkedPage,
+ linkedPage.namespace(), preferredPage))
+ return True
+ else:
+ choice = pywikibot.inputChoice(
+u'WARNING: %s is in namespace %i, but %s is in namespace %i. Follow it anyway?'
+ % (self.originPage, self.originPage.namespace(),
+ linkedPage, linkedPage.namespace()),
+ ['Yes', 'No', 'Add an alternative', 'give up'],
+ ['y', 'n', 'a', 'g'])
+ if choice != 'y':
+ # Fill up foundIn, so that we will not ask again
+ self.foundIn[linkedPage] = [linkingPage]
+ if choice == 'g':
+ self.makeForcedStop(counter)
+ elif choice == 'a':
+ newHint = pywikibot.input(u'Give the alternative for language %s, not using a language code:'
+ % linkedPage.site.language())
+ if newHint:
+ alternativePage = pywikibot.Page(linkedPage.site, newHint)
+ if alternativePage:
+ # add the page that was entered by the user
+ self.addIfNew(alternativePage, counter, None)
+ else:
+ pywikibot.output(
+ u"NOTE: ignoring %s and its interwiki links"
+ % linkedPage)
+ return True
+ else:
+ # same namespaces, no problem
+ # or no origin page yet, also no problem
+ return False
+
+ def wiktionaryMismatch(self, page):
+ if self.originPage and globalvar.same=='wiktionary':
+ if page.title().lower() != self.originPage.title().lower():
+ pywikibot.output(u"NOTE: Ignoring %s for %s in wiktionary mode" % (page, self.originPage))
+ return True
+ elif page.title() != self.originPage.title() and self.originPage.site.nocapitalize and page.site.nocapitalize:
+ pywikibot.output(u"NOTE: Ignoring %s for %s in wiktionary mode because both languages are uncapitalized."
+ % (page, self.originPage))
+ return True
+ return False
+
+ def disambigMismatch(self, page, counter):
+ """
+        Checks whether the given page has a different disambiguation
+        status from the origin page.
+
+ Returns a tuple (skip, alternativePage).
+
+ skip is True if the pages have mismatching statuses and the bot
+ is either in autonomous mode, or the user chose not to use the
+ given page.
+
+ alternativePage is either None, or a page that the user has
+ chosen to use instead of the given page.
+ """
+ if not self.originPage:
+ return (False, None) # any page matches until we have an origin page
+ if globalvar.autonomous:
+ if self.originPage.isDisambig() and not page.isDisambig():
+ pywikibot.output(u"NOTE: Ignoring link from disambiguation page %s to non-disambiguation %s"
+ % (self.originPage, page))
+ return (True, None)
+ elif not self.originPage.isDisambig() and page.isDisambig():
+ pywikibot.output(u"NOTE: Ignoring link from non-disambiguation page %s to disambiguation %s"
+ % (self.originPage, page))
+ return (True, None)
+ else:
+ choice = 'y'
+ if self.originPage.isDisambig() and not page.isDisambig():
+ disambig = self.getFoundDisambig(page.site)
+ if disambig:
+ pywikibot.output(
+ u"NOTE: Ignoring non-disambiguation page %s for %s because disambiguation page %s has already been found."
+ % (page, self.originPage, disambig))
+ return (True, None)
+ else:
+ choice = pywikibot.inputChoice(
+ u'WARNING: %s is a disambiguation page, but %s doesn\'t seem to be one. Follow it anyway?'
+ % (self.originPage, page),
+ ['Yes', 'No', 'Add an alternative', 'Give up'],
+ ['y', 'n', 'a', 'g'])
+ elif not self.originPage.isDisambig() and page.isDisambig():
+ nondisambig = self.getFoundNonDisambig(page.site)
+ if nondisambig:
+ pywikibot.output(u"NOTE: Ignoring disambiguation page %s for %s because non-disambiguation page %s has already been found."
+ % (page, self.originPage, nondisambig))
+ return (True, None)
+ else:
+ choice = pywikibot.inputChoice(
+ u'WARNING: %s doesn\'t seem to be a disambiguation page, but %s is one. Follow it anyway?'
+ % (self.originPage, page),
+ ['Yes', 'No', 'Add an alternative', 'Give up'],
+ ['y', 'n', 'a', 'g'])
+ if choice == 'n':
+ return (True, None)
+ elif choice == 'a':
+ newHint = pywikibot.input(u'Give the alternative for language %s, not using a language code:'
+ % page.site.language())
+ alternativePage = pywikibot.Page(page.site, newHint)
+ return (True, alternativePage)
+ elif choice == 'g':
+ self.makeForcedStop(counter)
+ return (True, None)
+ # We can follow the page.
+ return (False, None)
+
+ def isIgnored(self, page):
+ if page.site.language() in globalvar.neverlink:
+ pywikibot.output(u"Skipping link %s to an ignored language" % page)
+ return True
+ if page in globalvar.ignore:
+ pywikibot.output(u"Skipping link %s to an ignored page" % page)
+ return True
+ return False
+
+ def reportInterwikilessPage(self, page):
+ if not globalvar.quiet or pywikibot.verbose:
+ pywikibot.output(u"NOTE: %s does not have any interwiki links"
+ % self.originPage)
+ if config.without_interwiki:
+ f = codecs.open(
+ pywikibot.config.datafilepath('without_interwiki.txt'),
+ 'a', 'utf-8')
+ f.write(u"# %s \n" % page)
+ f.close()
+
+ def askForHints(self, counter):
+ if not self.workonme:
+ # Do not ask hints for pages that we don't work on anyway
+ return
+ if (self.untranslated or globalvar.askhints) and not self.hintsAsked \
+ and self.originPage and self.originPage.exists() \
+ and not self.originPage.isRedirectPage() and not self.originPage.isCategoryRedirect():
+ # Only once!
+ self.hintsAsked = True
+ if globalvar.untranslated:
+ newhint = None
+ t = globalvar.showtextlink
+ if t:
+ pywikibot.output(self.originPage.get()[:t])
+ # loop
+ while True:
+ newhint = pywikibot.input(u'Give a hint (? to see pagetext):')
+ if newhint == '?':
+ t += globalvar.showtextlinkadd
+ pywikibot.output(self.originPage.get()[:t])
+ elif newhint and not ':' in newhint:
+ pywikibot.output(u'Please enter a hint in the format language:pagename or type nothing if you do not have a hint.')
+ elif not newhint:
+ break
+ else:
+ pages = titletranslate.translate(self.originPage, hints=[newhint],
+ auto = globalvar.auto, removebrackets=globalvar.hintnobracket)
+ for page in pages:
+ self.addIfNew(page, counter, None)
+ if globalvar.hintsareright:
+ self.hintedsites.add(page.site)
+
+ def batchLoaded(self, counter):
+ """
+ This is called by a worker to tell us that the promised batch of
+ pages was loaded.
+ In other words, all the pages in self.pending have already
+ been preloaded.
+
+        The only argument is an instance
+        of a counter class that has methods minus() and plus() to keep
+        counts of the total work to do.
+ """
+ # Loop over all the pages that should have been taken care of
+ for page in self.pending:
+ # Mark the page as done
+ self.done.add(page)
+
+ # make sure that none of the linked items is an auto item
+ if globalvar.skipauto:
+ dictName, year = page.autoFormat()
+ if dictName is not None:
+ if self.originPage:
+ pywikibot.output(u'WARNING: %s:%s relates to %s:%s, which is an auto entry %s(%s)'
+ % (self.originPage.site.language(), self.originPage,
+ page.site.language(), page, dictName, year))
+
+ # Abort processing if the bot is running in autonomous mode.
+ if globalvar.autonomous:
+ self.makeForcedStop(counter)
+
+ # Register this fact at the todo-counter.
+ counter.minus(page.site)
+
+ # Now check whether any interwiki links should be added to the
+ # todo list.
+
+ if not page.exists():
+ globalvar.remove.append(unicode(page))
+ if not globalvar.quiet or pywikibot.verbose:
+ pywikibot.output(u"NOTE: %s does not exist. Skipping."
+ % page)
+ if page == self.originPage:
+ # The page we are working on is the page that does not exist.
+ # No use in doing any work on it in that case.
+ for site, count in self.todo.siteCounts():
+ counter.minus(site, count)
+ self.todo = PageTree()
+                    # In some rare cases we may already have checked some 'automatic' links
+ self.done = PageTree()
+ continue
+
+ elif page.isRedirectPage() or page.isCategoryRedirect():
+ if page.isRedirectPage():
+ redir = u''
+ else:
+ redir = u'category '
+ try:
+ if page.isRedirectPage():
+ redirectTargetPage = page.getRedirectTarget()
+ else:
+ redirectTargetPage = page.getCategoryRedirectTarget()
+ except pywikibot.InvalidTitle:
+ # MW considers #redirect [[en:#foo]] as a redirect page,
+ # but we can't do anything useful with such pages
+ if not globalvar.quiet or pywikibot.verbose:
+ pywikibot.output(
+ u"NOTE: %s redirects to an invalid title" % page)
+ continue
+ if not globalvar.quiet or pywikibot.verbose:
+ pywikibot.output(u"NOTE: %s is %sredirect to %s"
+ % (page, redir, redirectTargetPage))
+ if self.originPage is None or page == self.originPage:
+                    # the first existing page becomes the origin page, if none was supplied
+ if globalvar.initialredirect:
+ if globalvar.contentsondisk:
+ redirectTargetPage = StoredPage(redirectTargetPage)
+ # don't follow another redirect; it might be a self loop
+ if not redirectTargetPage.isRedirectPage() \
+ and not redirectTargetPage.isCategoryRedirect():
+ self.originPage = redirectTargetPage
+ self.todo.add(redirectTargetPage)
+ counter.plus(redirectTargetPage.site)
+ else:
+ # This is a redirect page to the origin. We don't need to
+ # follow the redirection.
+ # In this case we can also stop all hints!
+ for site, count in self.todo.siteCounts():
+ counter.minus(site, count)
+ self.todo = PageTree()
+ elif not globalvar.followredirect:
+ if not globalvar.quiet or pywikibot.verbose:
+ pywikibot.output(u"NOTE: not following %sredirects."
+ % redir)
+ elif page.isStaticRedirect():
+ if not globalvar.quiet or pywikibot.verbose:
+ pywikibot.output(
+ u"NOTE: not following static %sredirects." % redir)
+ elif page.site.family == redirectTargetPage.site.family \
+ and not self.skipPage(page, redirectTargetPage, counter):
+ if self.addIfNew(redirectTargetPage, counter, page):
+ if config.interwiki_shownew or pywikibot.verbose:
+ pywikibot.output(u"%s: %s gives new %sredirect %s"
+ % (self.originPage, page, redir,
+ redirectTargetPage))
+ continue
+
+            # must come after the page.isRedirectPage() check,
+            # otherwise a redirect error would be raised
+ elif page.isEmpty() and not page.isCategory():
+ globalvar.remove.append(unicode(page))
+ if not globalvar.quiet or pywikibot.verbose:
+ pywikibot.output(u"NOTE: %s is empty. Skipping." % page)
+ if page == self.originPage:
+ for site, count in self.todo.siteCounts():
+ counter.minus(site, count)
+ self.todo = PageTree()
+ self.done = PageTree()
+ self.originPage = None
+ continue
+
+ elif page.section():
+ if not globalvar.quiet or pywikibot.verbose:
+ pywikibot.output(u"NOTE: %s is a page section. Skipping."
+ % page)
+ continue
+
+            # Page exists, isn't a redirect, and is a plain link (no section)
+ if self.originPage is None:
+                # the first existing page becomes the origin page, if none was supplied
+ self.originPage = page
+ try:
+ iw = page.interwiki()
+ except pywikibot.NoSuchSite:
+ if not globalvar.quiet or pywikibot.verbose:
+ pywikibot.output(u"NOTE: site %s does not exist" % page.site())
+ continue
+
+ (skip, alternativePage) = self.disambigMismatch(page, counter)
+ if skip:
+ pywikibot.output(u"NOTE: ignoring %s and its interwiki links"
+ % page)
+ self.done.remove(page)
+ iw = ()
+ if alternativePage:
+ # add the page that was entered by the user
+ self.addIfNew(alternativePage, counter, None)
+
+ duplicate = None
+ for p in self.done.filter(page.site):
+ if p != page and p.exists() and not p.isRedirectPage() and not p.isCategoryRedirect():
+ duplicate = p
+ break
+
+ if self.originPage == page:
+ self.untranslated = (len(iw) == 0)
+ if globalvar.untranslatedonly:
+ # Ignore the interwiki links.
+ iw = ()
+ if globalvar.lacklanguage:
+ if globalvar.lacklanguage in [link.site.language() for link in iw]:
+ iw = ()
+ self.workonme = False
+ if len(iw) < globalvar.minlinks:
+ iw = ()
+ self.workonme = False
+
+ elif globalvar.autonomous and duplicate and not skip:
+ pywikibot.output(u"Stopping work on %s because duplicate pages"\
+ " %s and %s are found" % (self.originPage, duplicate, page))
+ self.makeForcedStop(counter)
+ try:
+ f = codecs.open(
+ pywikibot.config.datafilepath('autonomous_problems.dat'),
+ 'a', 'utf-8')
+ f.write(u"* %s {Found more than one link for %s}"
+ % (self.originPage, page.site))
+ if config.interwiki_graph and config.interwiki_graph_url:
+ filename = interwiki_graph.getFilename(self.originPage, extension = config.interwiki_graph_formats[0])
+ f.write(u" [%s%s graph]" % (config.interwiki_graph_url, filename))
+ f.write("\n")
+ f.close()
+ # FIXME: What errors are we catching here?
+ # except: should be avoided!!
+ except:
+ #raise
+ pywikibot.output(u'File autonomous_problems.dat open or corrupted! Try again with -restore.')
+ sys.exit()
+ iw = ()
+ elif page.isEmpty() and not page.isCategory():
+ globalvar.remove.append(unicode(page))
+ if not globalvar.quiet or pywikibot.verbose:
+ pywikibot.output(u"NOTE: %s is empty; ignoring it and its interwiki links"
+ % page)
+ # Ignore the interwiki links
+ self.done.remove(page)
+ iw = ()
+
+ for linkedPage in iw:
+ if globalvar.hintsareright:
+ if linkedPage.site in self.hintedsites:
+                        pywikibot.output(u"NOTE: %s: %s gives an extra interwiki on a hinted site; ignoring %s"
+ % (self.originPage, page, linkedPage))
+ break
+ if not self.skipPage(page, linkedPage, counter):
+ if globalvar.followinterwiki or page == self.originPage:
+ if self.addIfNew(linkedPage, counter, page):
+ # It is new. Also verify whether it is the second on the
+ # same site
+ lpsite=linkedPage.site
+ for prevPage in self.foundIn:
+ if prevPage != linkedPage and prevPage.site == lpsite:
+ # Still, this could be "no problem" as either may be a
+ # redirect to the other. No way to find out quickly!
+ pywikibot.output(u"NOTE: %s: %s gives duplicate interwiki on same site %s"
+ % (self.originPage, page,
+ linkedPage))
+ break
+ else:
+ if config.interwiki_shownew or pywikibot.verbose:
+ pywikibot.output(u"%s: %s gives new interwiki %s"
+ % (self.originPage,
+ page, linkedPage))
+ if self.forcedStop:
+ break
+ # These pages are no longer 'in progress'
+ self.pending = PageTree()
+ # Check whether we need hints and the user offered to give them
+ if self.untranslated and not self.hintsAsked:
+ self.reportInterwikilessPage(page)
+ self.askForHints(counter)
+
+ def isDone(self):
+ """Return True if all the work for this subject has completed."""
+ return len(self.todo) == 0
+
+ def problem(self, txt, createneed = True):
+ """Report a problem with the resolution of this subject."""
+ pywikibot.output(u"ERROR: %s" % txt)
+ self.confirm = True
+ if createneed:
+ self.problemfound = True
+
+ def whereReport(self, page, indent=4):
+ for page2 in sorted(self.foundIn[page]):
+ if page2 is None:
+ pywikibot.output(u" "*indent + "Given as a hint.")
+ else:
+ pywikibot.output(u" "*indent + unicode(page2))
+
+
+ def assemble(self):
+ # No errors have been seen so far, except....
+ errorCount = self.problemfound
+ mysite = pywikibot.getSite()
+ # Build up a dictionary of all pages found, with the site as key.
+ # Each value will be a list of pages.
+ new = {}
+ for page in self.done:
+ if page.exists() and not page.isRedirectPage() and not page.isCategoryRedirect():
+ site = page.site
+ if site.family.interwiki_forward:
+ #TODO: allow these cases to be propagated!
+ continue # inhibit the forwarding families pages to be updated.
+ if site == self.originPage.site:
+ if page != self.originPage:
+ self.problem(u"Found link to %s" % page)
+ self.whereReport(page)
+ errorCount += 1
+ else:
+ if site in new:
+ new[site].append(page)
+ else:
+ new[site] = [page]
+ # See if new{} contains any problematic values
+ result = {}
+ for site, pages in new.iteritems():
+ if len(pages) > 1:
+ errorCount += 1
+ self.problem(u"Found more than one link for %s" % site)
+
+ if not errorCount and not globalvar.select:
+ # no errors, so all lists have only one item
+ for site, pages in new.iteritems():
+ result[site] = pages[0]
+ return result
+
+        # There are errors.
+ if config.interwiki_graph:
+ graphDrawer = interwiki_graph.GraphDrawer(self)
+ graphDrawer.createGraph()
+
+ # We don't need to continue with the rest if we're in autonomous
+ # mode.
+ if globalvar.autonomous:
+ return None
+
+        # First loop over the ones that have more than one solution
+ for site, pages in new.iteritems():
+ if len(pages) > 1:
+ pywikibot.output(u"=" * 30)
+ pywikibot.output(u"Links to %s" % site)
+ i = 0
+ for page2 in pages:
+ i += 1
+ pywikibot.output(u" (%d) Found link to %s in:"
+ % (i, page2))
+ self.whereReport(page2, indent = 8)
+ while True:
+ #TODO: allow answer to repeat previous or go back after a mistake
+ answer = pywikibot.input(u"Which variant should be used? (<number>, [n]one, [g]ive up) ").lower()
+ if answer:
+ if answer == 'g':
+ return None
+ elif answer == 'n':
+ # None acceptable
+ break
+ elif answer.isdigit():
+ answer = int(answer)
+ try:
+ result[site] = pages[answer - 1]
+ except IndexError:
+ # user input is out of range
+ pass
+ else:
+ break
+ # Loop over the ones that have one solution, so are in principle
+ # not a problem.
+ acceptall = False
+ for site, pages in new.iteritems():
+ if len(pages) == 1:
+ if not acceptall:
+ pywikibot.output(u"=" * 30)
+ page2 = pages[0]
+ pywikibot.output(u"Found link to %s in:" % page2)
+ self.whereReport(page2, indent = 4)
+ while True:
+ if acceptall:
+ answer = 'a'
+ else:
+ #TODO: allow answer to repeat previous or go back after a mistake
+ answer = pywikibot.inputChoice(u'What should be done?', ['accept', 'reject', 'give up', 'accept all'], ['a', 'r', 'g', 'l'], 'a')
+ if answer == 'l': # accept all
+ acceptall = True
+ answer = 'a'
+ if answer == 'a': # accept this one
+ result[site] = pages[0]
+ break
+ elif answer == 'g': # give up
+ return None
+ elif answer == 'r': # reject
+ # None acceptable
+ break
+ return result
+
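+    # Illustrative note: assemble() returns either a dict mapping each Site to
+    # the single accepted Page for that site, or None when the user gives up
+    # or the bot hits a conflict in autonomous mode; finish() below relies on
+    # exactly that contract.
+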
+ def finish(self, bot = None):
+ """Round up the subject, making any necessary changes. This method
+ should be called exactly once after the todo list has gone empty.
+
+        This contains a shortcut: if a bot is passed in the bot argument,
+        just before submitting a page change to the live wiki it is checked
+        whether we will have to wait. If that is the case, the bot is told
+        to make another get request first."""
+
+ #from clean_sandbox
+ def minutesDiff(time1, time2):
+ if type(time1) is long:
+ time1 = str(time1)
+ if type(time2) is long:
+ time2 = str(time2)
+ t1 = (((int(time1[0:4]) * 12 + int(time1[4:6])) * 30 +
+ int(time1[6:8])) * 24 + int(time1[8:10])) * 60 + \
+ int(time1[10:12])
+ t2 = (((int(time2[0:4]) * 12 + int(time2[4:6])) * 30 +
+ int(time2[6:8])) * 24 + int(time2[8:10])) * 60 + \
+ int(time2[10:12])
+ return abs(t2-t1)
+
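+        # Worked example (illustrative): timestamps are treated as plain
+        # YYYYMMDDHHMMSS strings and flattened to approximate minutes,
+        # assuming 12 months per year and 30 days per month; seconds are
+        # ignored. For instance
+        #
+        #   minutesDiff('20120916170000', '20120916173000')  ->  30
+        #
+        # which is accurate enough for the "last edit is more than a month
+        # old" check further down.
+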
+ if not self.isDone():
+            raise RuntimeError("Bugcheck: finish called before done")
+ if not self.workonme:
+ return
+ if self.originPage:
+ if self.originPage.isRedirectPage():
+ return
+ if self.originPage.isCategoryRedirect():
+ return
+ else:
+ return
+ if not self.untranslated and globalvar.untranslatedonly:
+ return
+ if self.forcedStop: # autonomous with problem
+ pywikibot.output(u"======Aborted processing %s======"
+ % self.originPage)
+ return
+ # The following check is not always correct and thus disabled.
+ # self.done might contain no interwiki links because of the -neverlink
+ # argument or because of disambiguation conflicts.
+# if len(self.done) == 1:
+# # No interwiki at all
+# return
+ pywikibot.output(u"======Post-processing %s======" % self.originPage)
+ # Assemble list of accepted interwiki links
+ new = self.assemble()
+ if new is None: # User said give up
+ pywikibot.output(u"======Aborted processing %s======"
+ % self.originPage)
+ return
+
+ # Make sure new contains every page link, including the page we are processing
+        # TODO: should be moved to assemble()
+ # replaceLinks will skip the site it's working on.
+ if self.originPage.site not in new:
+ #TODO: make this possible as well.
+ if not self.originPage.site.family.interwiki_forward:
+ new[self.originPage.site] = self.originPage
+
+ #self.replaceLinks(self.originPage, new, True, bot)
+
+ updatedSites = []
+ notUpdatedSites = []
+ # Process all languages here
+ globalvar.always = False
+ if globalvar.limittwo:
+ lclSite = self.originPage.site
+ lclSiteDone = False
+ frgnSiteDone = False
+
+ for siteCode in lclSite.family.languages_by_size:
+ site = pywikibot.getSite(code = siteCode)
+ if (not lclSiteDone and site == lclSite) or \
+ (not frgnSiteDone and site != lclSite and site in new):
+ if site == lclSite:
+ lclSiteDone = True # even if we fail the update
+ if site.family.name in config.usernames and site.lang in config.usernames[site.family.name]:
+ try:
+ if self.replaceLinks(new[site], new, bot):
+ updatedSites.append(site)
+ if site != lclSite:
+ frgnSiteDone = True
+ except SaveError:
+ notUpdatedSites.append(site)
+ except GiveUpOnPage:
+ break
+ elif not globalvar.strictlimittwo and site in new \
+ and site != lclSite:
+ old={}
+ try:
+ for page in new[site].interwiki():
+ old[page.site] = page
+ except pywikibot.NoPage:
+ pywikibot.output(u"BUG>>> %s no longer exists?"
+ % new[site])
+ continue
+ mods, mcomment, adding, removing, modifying \
+ = compareLanguages(old, new, insite = lclSite)
+ if (len(removing) > 0 and not globalvar.autonomous) or \
+ (len(modifying) > 0 and self.problemfound) or \
+ len(old) == 0 or \
+ (globalvar.needlimit and \
+ len(adding) + len(modifying) >= globalvar.needlimit +1):
+ try:
+ if self.replaceLinks(new[site], new, bot):
+ updatedSites.append(site)
+ except SaveError:
+ notUpdatedSites.append(site)
+ except pywikibot.NoUsername:
+ pass
+ except GiveUpOnPage:
+ break
+ else:
+ for (site, page) in new.iteritems():
+ # edit restriction on is-wiki
+ # http://is.wikipedia.org/wiki/Wikipediaspjall:V%C3%A9lmenni
+                # allow edits under the same conditions as -whenneeded,
+                # or if the last edit wasn't made by a bot,
+                # or if the last edit is more than a month old
+ smallWikiAllowed = True
+ if globalvar.autonomous and page.site.sitename() == 'wikipedia:is':
+ old={}
+ try:
+ for mypage in new[page.site].interwiki():
+ old[mypage.site] = mypage
+ except pywikibot.NoPage:
+ pywikibot.output(u"BUG>>> %s no longer exists?"
+ % new[site])
+ continue
+ mods, mcomment, adding, removing, modifying \
+ = compareLanguages(old, new, insite=site)
+ #cannot create userlib.User with IP
+ smallWikiAllowed = page.isIpEdit() or \
+ len(removing) > 0 or len(old) == 0 or \
+ len(adding) + len(modifying) > 2 or \
+ len(removing) + len(modifying) == 0 and \
+ adding == [page.site]
+ if not smallWikiAllowed:
+ import userlib
+ user = userlib.User(page.site, page.userName())
+ if not 'bot' in user.groups() \
+                           and not 'bot' in page.userName().lower(): # for now, also skip user names containing 'bot'
+ smallWikiAllowed = True
+ else:
+ diff = minutesDiff(page.editTime(),
+ time.strftime("%Y%m%d%H%M%S",
+ time.gmtime()))
+ if diff > 30*24*60:
+ smallWikiAllowed = True
+ else:
+ pywikibot.output(
+u'NOTE: the number of edits is restricted at %s'
+ % page.site.sitename())
+
+ # if we have an account for this site
+ if site.family.name in config.usernames \
+ and site.lang in config.usernames[site.family.name] \
+ and smallWikiAllowed:
+ # Try to do the changes
+ try:
+ if self.replaceLinks(page, new, bot):
+ # Page was changed
+ updatedSites.append(site)
+ except SaveError:
+ notUpdatedSites.append(site)
+ except GiveUpOnPage:
+ break
+
+ # disabled graph drawing for minor problems: it just takes too long
+ #if notUpdatedSites != [] and config.interwiki_graph:
+ # # at least one site was not updated, save a conflict graph
+ # self.createGraph()
+
+ # don't report backlinks for pages we already changed
+ if config.interwiki_backlink:
+ self.reportBacklinks(new, updatedSites)
+
+ def clean(self):
+ """
+ Delete the contents that are stored on disk for this Subject.
+
+        We cannot afford to define this in a StoredPage destructor because
+        StoredPage instances can get referenced cyclically: that would stop
+        the garbage collector from destroying some of those objects.
+
+        It's also not necessary to do this in a Subject destructor:
+        deleting all stored content entry by entry when bailing out after a
+        KeyboardInterrupt, for example, is redundant, because the whole
+        storage file will eventually be removed anyway.
+ """
+ if globalvar.contentsondisk:
+ for page in self.foundIn:
+ # foundIn can contain either Page or StoredPage objects
+ # calling the destructor on _contents will delete the
+ # disk records if necessary
+ if hasattr(page, '_contents'):
+ del page._contents
+
+ def replaceLinks(self, page, newPages, bot):
+ """
+ Returns True if saving was successful.
+ """
+ if globalvar.localonly:
+ # In this case only continue on the Page we started with
+ if page != self.originPage:
+ raise SaveError(u'-localonly and page != originPage')
+ if page.section():
+ # This is not a page, but a subpage. Do not edit it.
+ pywikibot.output(u"Not editing %s: not doing interwiki on subpages"
+ % page)
+ raise SaveError(u'Link has a #section')
+ try:
+ pagetext = page.get()
+ except pywikibot.NoPage:
+ pywikibot.output(u"Not editing %s: page does not exist" % page)
+ raise SaveError(u'Page doesn\'t exist')
+ if page.isEmpty() and not page.isCategory():
+ pywikibot.output(u"Not editing %s: page is empty" % page)
+ raise SaveError
+
+ # clone original newPages dictionary, so that we can modify it to the
+ # local page's needs
+ new = dict(newPages)
+ interwikis = page.interwiki()
+
+ # remove interwiki links to ignore
+ for iw in re.finditer('<!-- *\[\[(.*?:.*?)\]\] *-->', pagetext):
+ try:
+ ignorepage = pywikibot.Page(page.site, iw.groups()[0])
+ except (pywikibot.NoSuchSite, pywikibot.InvalidTitle):
+ continue
+ try:
+ if (new[ignorepage.site] == ignorepage) and \
+ (ignorepage.site != page.site):
+ if (ignorepage not in interwikis):
+ pywikibot.output(
+ u"Ignoring link to %(to)s for %(from)s"
+ % {'to': ignorepage,
+ 'from': page})
+ new.pop(ignorepage.site)
+ else:
+ pywikibot.output(
+ u"NOTE: Not removing interwiki from %(from)s to %(to)s (exists both commented and non-commented)"
+ % {'to': ignorepage,
+ 'from': page})
+ except KeyError:
+ pass
+
+ # sanity check - the page we are fixing must be the only one for that
+ # site.
+ pltmp = new[page.site]
+ if pltmp != page:
+ s = u"None"
+ if pltmp is not None: s = pltmp
+ pywikibot.output(
+ u"BUG>>> %s is not in the list of new links! Found %s."
+ % (page, s))
+ raise SaveError(u'BUG: sanity check failed')
+
+ # Avoid adding an iw link back to itself
+ del new[page.site]
+ # Do not add interwiki links to foreign families that page.site() does not forward to
+ for stmp in new.keys():
+ if stmp.family != page.site.family:
+ if stmp.family.name != page.site.family.interwiki_forward:
+ del new[stmp]
+
+ # Put interwiki links into a map
+ old={}
+ for page2 in interwikis:
+ old[page2.site] = page2
+
+ # Check what needs to get done
+ mods, mcomment, adding, removing, modifying = compareLanguages(old,
+ new,
+ insite=page.site)
+
+ # When running in autonomous mode without -force switch, make sure we
+ # don't remove any items, but allow addition of the new ones
+ if globalvar.autonomous and (not globalvar.force or
+ pywikibot.unicode_error
+ ) and len(removing) > 0:
+ for rmsite in removing:
+                # Sometimes a site has an erroneous interwiki link to
+                # itself
+ if rmsite == page.site:
+ continue
+ rmPage = old[rmsite]
+                # putting it back into new means it won't be deleted
+ if not globalvar.cleanup and not globalvar.force or \
+ globalvar.cleanup and \
+ unicode(rmPage) not in globalvar.remove or \
+ rmPage.site.lang in ['hak', 'hi', 'cdo'] and \
+                   pywikibot.unicode_error: # work-around for bug #3081100 (do not remove affected pages)
+ new[rmsite] = rmPage
+ pywikibot.output(
+ u"WARNING: %s is either deleted or has a mismatching disambiguation state."
+ % rmPage)
+ # Re-Check what needs to get done
+ mods, mcomment, adding, removing, modifying = compareLanguages(old,
+ new,
+ insite=page.site)
+ if not mods:
+ if not globalvar.quiet or pywikibot.verbose:
+ pywikibot.output(u'No changes needed on page %s' % page)
+ return False
+
+ # Show a message in purple.
+ pywikibot.output(
+ u"\03{lightpurple}Updating links on page %s.\03{default}" % page)
+ pywikibot.output(u"Changes to be made: %s" % mods)
+ oldtext = page.get()
+ template = (page.namespace() == 10)
+ newtext = pywikibot.replaceLanguageLinks(oldtext, new,
+ site=page.site,
+ template=template)
+        # This is for now. Later there should be different functions for
+        # each kind.
+ if not botMayEdit(page):
+ if template:
+ pywikibot.output(
+ u'SKIPPING: %s should have interwiki links on subpage.'
+ % page)
+ else:
+ pywikibot.output(
+ u'SKIPPING: %s is under construction or to be deleted.'
+ % page)
+ return False
+ if newtext == oldtext:
+ return False
+ pywikibot.showDiff(oldtext, newtext)
+
+ # pywikibot.output(u"NOTE: Replace %s" % page)
+ # Determine whether we need permission to submit
+ ask = False
+
+ # Allow for special case of a self-pointing interwiki link
+ if removing and removing != [page.site]:
+ self.problem(u'Found incorrect link to %s in %s'
+ % (", ".join([x.lang for x in removing]), page),
+ createneed=False)
+ if pywikibot.unicode_error:
+ for x in removing:
+ if x.lang in ['hi', 'cdo']:
+ pywikibot.output(
+u'\03{lightred}WARNING: This may be false positive due to unicode bug #3081100\03{default}')
+ break
+ ask = True
+ if globalvar.force or globalvar.cleanup:
+ ask = False
+ if globalvar.confirm and not globalvar.always:
+ ask = True
+ # If we need to ask, do so
+ if ask:
+ if globalvar.autonomous:
+ # If we cannot ask, deny permission
+ answer = 'n'
+ else:
+ answer = pywikibot.inputChoice(u'Submit?',
+ ['Yes', 'No', 'open in Browser',
+ 'Give up', 'Always'],
+ ['y', 'n', 'b', 'g', 'a'])
+ if answer == 'b':
+ webbrowser.open("http://%s%s" % (
+ page.site.hostname(),
+ page.site.nice_get_address(page.title())
+ ))
+ pywikibot.input(u"Press Enter when finished in browser.")
+ return True
+ elif answer == 'a':
+ # don't ask for the rest of this subject
+ globalvar.always = True
+ answer = 'y'
+ else:
+ # If we do not need to ask, allow
+ answer = 'y'
+ # If we got permission to submit, do so
+ if answer == 'y':
+ # Check whether we will have to wait for pywikibot. If so, make
+ # another get-query first.
+ if bot:
+ while pywikibot.get_throttle.waittime() + 2.0 < pywikibot.put_throttle.waittime():
+ if not globalvar.quiet or pywikibot.verbose:
+ pywikibot.output(
+ u"NOTE: Performing a recursive query first to save time....")
+ qdone = bot.oneQuery()
+ if not qdone:
+ # Nothing more to do
+ break
+ if not globalvar.quiet or pywikibot.verbose:
+ pywikibot.output(u"NOTE: Updating live wiki...")
+ timeout=60
+ while True:
+ try:
+ if globalvar.async:
+ page.put_async(newtext, comment=mcomment)
+ status = 302
+ else:
+ status, reason, data = page.put(newtext, comment=mcomment)
+ except pywikibot.LockedPage:
+ pywikibot.output(u'Page %s is locked. Skipping.' % page)
+ raise SaveError(u'Locked')
+ except pywikibot.EditConflict:
+ pywikibot.output(
+ u'ERROR putting page: An edit conflict occurred. Giving up.')
+ raise SaveError(u'Edit conflict')
+ except (pywikibot.SpamfilterError), error:
+ pywikibot.output(
+ u'ERROR putting page: %s blacklisted by spamfilter. Giving up.'
+ % (error.url,))
+ raise SaveError(u'Spam filter')
+ except (pywikibot.PageNotSaved), error:
+ pywikibot.output(u'ERROR putting page: %s' % (error.args,))
+ raise SaveError(u'PageNotSaved')
+ except (socket.error, IOError), error:
+ if timeout>3600:
+ raise
+ pywikibot.output(u'ERROR putting page: %s' % (error.args,))
+ pywikibot.output(u'Sleeping %i seconds before trying again.'
+ % (timeout,))
+ timeout *= 2
+ time.sleep(timeout)
+ except pywikibot.ServerError:
+ if timeout > 3600:
+ raise
+ pywikibot.output(u'ERROR putting page: ServerError.')
+ pywikibot.output(u'Sleeping %i seconds before trying again.'
+ % (timeout,))
+ timeout *= 2
+ time.sleep(timeout)
+ else:
+ break
+ if str(status) == '302':
+ return True
+ else:
+ pywikibot.output(u'%s %s' % (status, reason))
+ return False
+ elif answer == 'g':
+ raise GiveUpOnPage(u'User asked us to give up')
+ else:
+ raise LinkMustBeRemoved(u'Found incorrect link to %s in %s'
+ % (", ".join([x.lang for x in removing]),
+ page))
+
+ def reportBacklinks(self, new, updatedSites):
+ """
+ Report missing back links. This will be called from finish() if needed.
+
+        updatedSites is a list that contains all sites we changed, so that
+        we do not report missing backlinks for pages we have already fixed.
+
+ """
+ # use sets because searching an element is faster than in lists
+ expectedPages = set(new.itervalues())
+ expectedSites = set(new)
+ try:
+ for site in expectedSites - set(updatedSites):
+ page = new[site]
+ if not page.section():
+ try:
+ linkedPages = set(page.interwiki())
+ except pywikibot.NoPage:
+                        pywikibot.output(u"WARNING: Page %s no longer exists?!" % page)
+ break
+ # To speed things up, create a dictionary which maps sites to pages.
+ # This assumes that there is only one interwiki link per language.
+ linkedPagesDict = {}
+ for linkedPage in linkedPages:
+ linkedPagesDict[linkedPage.site] = linkedPage
+ for expectedPage in expectedPages - linkedPages:
+ if expectedPage != page:
+ try:
+ linkedPage = linkedPagesDict[expectedPage.site]
+ pywikibot.output(
+ u"WARNING: %s: %s does not link to %s but to %s"
+ % (page.site.family.name,
+ page, expectedPage, linkedPage))
+ except KeyError:
+ pywikibot.output(
+ u"WARNING: %s: %s does not link to %s"
+ % (page.site.family.name,
+ page, expectedPage))
+ # Check for superfluous links
+ for linkedPage in linkedPages:
+ if linkedPage not in expectedPages:
+ # Check whether there is an alternative page on that language.
+ # In this case, it was already reported above.
+ if linkedPage.site not in expectedSites:
+ pywikibot.output(
+ u"WARNING: %s: %s links to incorrect %s"
+ % (page.site.family.name,
+ page, linkedPage))
+ except (socket.error, IOError):
+ pywikibot.output(u'ERROR: could not report backlinks')
+
+class InterwikiBot(object):
+ """A class keeping track of a list of subjects, controlling which pages
+ are queried from which languages when."""
+
+ def __init__(self):
+ """Constructor. We always start with empty lists."""
+ self.subjects = []
+ # We count how many pages still need to be loaded per site.
+ # This allows us to find out from which site to retrieve pages next
+ # in a way that saves bandwidth.
+ # sites are keys, integers are values.
+ # Modify this only via plus() and minus()!
+ self.counts = {}
+ self.pageGenerator = None
+ self.generated = 0
+
+ def add(self, page, hints = None):
+ """Add a single subject to the list"""
+ subj = Subject(page, hints = hints)
+ self.subjects.append(subj)
+ for site, count in subj.openSites():
+ # Keep correct counters
+ self.plus(site, count)
+
+ def setPageGenerator(self, pageGenerator, number = None, until = None):
+ """Add a generator of subjects. Once the list of subjects gets
+ too small, this generator is called to produce more Pages"""
+ self.pageGenerator = pageGenerator
+ self.generateNumber = number
+ self.generateUntil = until
+
+ def dump(self, append = True):
+ site = pywikibot.getSite()
+ dumpfn = pywikibot.config.datafilepath(
+ 'interwiki-dumps',
+ 'interwikidump-%s-%s.txt' % (site.family.name, site.lang))
+ if append: mode = 'appended'
+ else: mode = 'written'
+ f = codecs.open(dumpfn, mode[0], 'utf-8')
+ for subj in self.subjects:
+ if subj.originPage:
+ f.write(subj.originPage.title(asLink=True)+'\n')
+ f.close()
+ pywikibot.output(u'Dump %s (%s) %s.' % (site.lang, site.family.name, mode))
+ return dumpfn
+
+ def generateMore(self, number):
+ """Generate more subjects. This is called internally when the
+ list of subjects becomes too small, but only if there is a
+ PageGenerator"""
+ fs = self.firstSubject()
+ if fs and (not globalvar.quiet or pywikibot.verbose):
+ pywikibot.output(u"NOTE: The first unfinished subject is %s"
+ % fs.originPage)
+ pywikibot.output(u"NOTE: Number of pages queued is %d, trying to add %d more."
+ % (len(self.subjects), number))
+ for i in xrange(number):
+ try:
+ while True:
+ try:
+ page = self.pageGenerator.next()
+ except IOError:
+                        pywikibot.output(u'IOError occurred; skipping')
+ continue
+ if page in globalvar.skip:
+ pywikibot.output(u'Skipping: %s is in the skip list' % page)
+ continue
+ if globalvar.skipauto:
+ dictName, year = page.autoFormat()
+ if dictName is not None:
+ pywikibot.output(u'Skipping: %s is an auto entry %s(%s)' % (page, dictName, year))
+ continue
+ if globalvar.parenthesesonly:
+ # Only yield pages that have ( ) in titles
+ if "(" not in page.title():
+ continue
+ if page.isTalkPage():
+ pywikibot.output(u'Skipping: %s is a talk page' % page)
+ continue
+ #doesn't work: page must be preloaded for this test
+ #if page.isEmpty():
+                    #    pywikibot.output(u'Skipping: %s is an empty page' % page.title())
+ # continue
+ if page.namespace() == 10:
+ loc = None
+ try:
+ tmpl, loc = moved_links[page.site.lang]
+ del tmpl
+ except KeyError:
+ pass
+ if loc is not None and loc in page.title():
+                            pywikibot.output(u'Skipping: %s is a template subpage' % page.title())
+ continue
+ break
+
+ if self.generateUntil:
+ until = self.generateUntil
+ if page.site.lang not in page.site.family.nocapitalize:
+ until = until[0].upper()+until[1:]
+ if page.title(withNamespace=False) > until:
+ raise StopIteration
+ self.add(page, hints = globalvar.hints)
+ self.generated += 1
+ if self.generateNumber:
+ if self.generated >= self.generateNumber:
+ raise StopIteration
+ except StopIteration:
+ self.pageGenerator = None
+ break
+
+ def firstSubject(self):
+ """Return the first subject that is still being worked on"""
+ if self.subjects:
+ return self.subjects[0]
+
+ def maxOpenSite(self):
+ """Return the site that has the most
+        open queries. If there is nothing left to do, return None.
+        Normally only sites that are still todo for the first Subject
+        are considered."""
+ max = 0
+ maxlang = None
+ if not self.firstSubject():
+ return None
+ oc = dict(self.firstSubject().openSites())
+ if not oc:
+ # The first subject is done. This might be a recursive call made because we
+ # have to wait before submitting another modification to go live. Select
+ # any language from counts.
+ oc = self.counts
+ if pywikibot.getSite() in oc:
+ return pywikibot.getSite()
+ for lang in oc:
+ count = self.counts[lang]
+ if count > max:
+ max = count
+ maxlang = lang
+ return maxlang
+
+ def selectQuerySite(self):
+ """Select the site the next query should go out for."""
+        # How many home-language queries do we still have?
+ mycount = self.counts.get(pywikibot.getSite(), 0)
+ # Do we still have enough subjects to work on for which the
+ # home language has been retrieved? This is rough, because
+ # some subjects may need to retrieve a second home-language page!
+ if len(self.subjects) - mycount < globalvar.minsubjects:
+ # Can we make more home-language queries by adding subjects?
+ if self.pageGenerator and mycount < globalvar.maxquerysize:
+ timeout = 60
+ while timeout<3600:
+ try:
+ self.generateMore(globalvar.maxquerysize - mycount)
+ except pywikibot.ServerError:
+ # Could not extract allpages special page?
+ pywikibot.output(u'ERROR: could not retrieve more pages. Will try again in %d seconds'%timeout)
+ time.sleep(timeout)
+ timeout *= 2
+ else:
+ break
+ # If we have a few, getting the home language is a good thing.
+ if not globalvar.restoreAll:
+ try:
+ if self.counts[pywikibot.getSite()] > 4:
+ return pywikibot.getSite()
+ except KeyError:
+ pass
+ # If getting the home language doesn't make sense, see how many
+ # foreign page queries we can find.
+ return self.maxOpenSite()
+
+ def oneQuery(self):
+ """
+ Perform one step in the solution process.
+
+        Returns True if pages could be preloaded, or False
+        otherwise.
+ """
+ # First find the best language to work on
+ site = self.selectQuerySite()
+ if site is None:
+ pywikibot.output(u"NOTE: Nothing left to do")
+ return False
+ # Now assemble a reasonable list of pages to get
+ subjectGroup = []
+ pageGroup = []
+ for subject in self.subjects:
+ # Promise the subject that we will work on the site.
+ # We will get a list of pages we can do.
+ pages = subject.whatsNextPageBatch(site)
+ if pages:
+ pageGroup.extend(pages)
+ subjectGroup.append(subject)
+ if len(pageGroup) >= globalvar.maxquerysize:
+ # We have found enough pages to fill the bandwidth.
+ break
+ if len(pageGroup) == 0:
+ pywikibot.output(u"NOTE: Nothing left to do 2")
+ return False
+ # Get the content of the assembled list in one blow
+ gen = pagegenerators.PreloadingGenerator(iter(pageGroup))
+ for page in gen:
+ # we don't want to do anything with them now. The
+ # page contents will be read via the Subject class.
+ pass
+ # Tell all of the subjects that the promised work is done
+ for subject in subjectGroup:
+ subject.batchLoaded(self)
+ return True
+
+ def queryStep(self):
+ self.oneQuery()
+ # Delete the ones that are done now.
+ for i in xrange(len(self.subjects)-1, -1, -1):
+ subj = self.subjects[i]
+ if subj.isDone():
+ subj.finish(self)
+ subj.clean()
+ del self.subjects[i]
+
+ def isDone(self):
+ """Check whether there is still more work to do"""
+ return len(self) == 0 and self.pageGenerator is None
+
+ def plus(self, site, count=1):
+ """This is a routine that the Subject class expects in a counter"""
+ try:
+ self.counts[site] += count
+ except KeyError:
+ self.counts[site] = count
+
+ def minus(self, site, count=1):
+ """This is a routine that the Subject class expects in a counter"""
+ self.counts[site] -= count
+
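+    # Illustrative note: plus() and minus() form the minimal counter
+    # interface that Subject expects; for example, Subject.addIfNew() calls
+    # counter.plus(page.site) for every new page and batchLoaded() calls
+    # counter.minus(page.site) once a page has been processed, so
+    # self.counts always mirrors the outstanding work per site.
+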
+ def run(self):
+ """Start the process until finished"""
+ while not self.isDone():
+ self.queryStep()
+
+ def __len__(self):
+ return len(self.subjects)
+
+def compareLanguages(old, new, insite):
+
+ oldiw = set(old)
+ newiw = set(new)
+
+ # sort by language code
+ adding = sorted(newiw - oldiw)
+ removing = sorted(oldiw - newiw)
+ modifying = sorted(site for site in oldiw & newiw if old[site] != new[site])
+
+ if not globalvar.summary and \
+ len(adding) + len(removing) + len(modifying) <= 3:
+ # Use an extended format for the string linking to all added pages.
+ fmt = lambda d, site: unicode(d[site])
+ else:
+ # Use short format, just the language code
+ fmt = lambda d, site: site.lang
+
+ mods = mcomment = u''
+
+ commentname = 'interwiki'
+ if adding:
+ commentname += '-adding'
+ if removing:
+ commentname += '-removing'
+ if modifying:
+ commentname += '-modifying'
+
+ if adding or removing or modifying:
+ #Version info marks bots without unicode error
+ #This also prevents abuse filter blocking on de-wiki
+ if not pywikibot.unicode_error:
+ mcomment += u'r%s) (' % sys.version.split()[0]
+
+ mcomment += globalvar.summary
+
+ changes = {'adding': ', '.join([fmt(new, x) for x in adding]),
+ 'removing': ', '.join([fmt(old, x) for x in removing]),
+ 'modifying': ', '.join([fmt(new, x) for x in modifying])}
+
+ mcomment += i18n.twtranslate(insite.lang, commentname) % changes
+ mods = i18n.twtranslate('en', commentname) % changes
+
+ return mods, mcomment, adding, removing, modifying
+
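+# Illustrative sketch (names below are placeholder Site/Page objects, not
+# from the original source): given two interwiki maps keyed by Site,
+# compareLanguages() reports what would change, e.g.
+#
+#   old = {de_site: de_page, fr_site: old_fr_page}
+#   new = {de_site: de_page, fr_site: new_fr_page, nl_site: nl_page}
+#   mods, mcomment, adding, removing, modifying = \
+#       compareLanguages(old, new, insite=en_site)
+#   # adding == [nl_site], removing == [], modifying == [fr_site]
+#
+# mcomment is the localized edit summary and mods its English counterpart.
+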
+def botMayEdit (page):
+ tmpl = []
+ try:
+ tmpl, loc = moved_links[page.site.lang]
+ except KeyError:
+ pass
+ if type(tmpl) != list:
+ tmpl = [tmpl]
+ try:
+ tmpl += ignoreTemplates[page.site.lang]
+ except KeyError:
+ pass
+ tmpl += ignoreTemplates['_default']
+ if tmpl != []:
+        templates = page.templatesWithParams(get_redirect=True)
+ for template in templates:
+ if template[0].lower() in tmpl:
+ return False
+ return True
+
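+# Illustrative note: botMayEdit() collects the per-language moved_links
+# template plus the ignoreTemplates entries and refuses to edit any page that
+# transcludes one of them. A hypothetical caller might do
+#
+#   if not botMayEdit(page):
+#       pywikibot.output(u'Skipping %s: bots should not edit it' % page)
+#
+# which mirrors how replaceLinks() uses it above.
+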
+def readWarnfile(filename, bot):
+ import warnfile
+ reader = warnfile.WarnfileReader(filename)
+ # we won't use removeHints
+ (hints, removeHints) = reader.getHints()
+ for page, pagelist in hints.iteritems():
+        # The WarnfileReader gives us a list of pagelinks, but titletranslate.py
+        # expects a list of strings, so we convert it back.
+        # TODO: This is quite an ugly hack; in the future we should maybe make
+        # titletranslate expect a list of pagelinks.
+ hintStrings = ['%s:%s' % (hintedPage.site.language(), hintedPage.title()) for hintedPage in pagelist]
+ bot.add(page, hints = hintStrings)
+
+def main():
+ singlePageTitle = []
+ opthintsonly = False
+ start = None
+ # Which namespaces should be processed?
+ # default to [] which means all namespaces will be processed
+ namespaces = []
+ number = None
+ until = None
+ warnfile = None
+ # a normal PageGenerator (which doesn't give hints, only Pages)
+ hintlessPageGen = None
+ optContinue = False
+ optRestore = False
+ restoredFiles = []
+ File2Restore = []
+ dumpFileName = ''
+ append = True
+ newPages = None
+    # This factory is responsible for processing command line arguments
+    # that are also used by other scripts and that determine which pages
+    # to work on.
+ genFactory = pagegenerators.GeneratorFactory()
+
+ for arg in pywikibot.handleArgs():
+ if globalvar.readOptions(arg):
+ continue
+ elif arg.startswith('-warnfile:'):
+ warnfile = arg[10:]
+ elif arg.startswith('-years'):
+            # Check whether the user gave a specific year at which to start.
+            # Must be a natural number or negative integer.
+ if len(arg) > 7 and (arg[7:].isdigit() or (arg[7] == "-" and arg[8:].isdigit())):
+ startyear = int(arg[7:])
+ else:
+ startyear = 1
+ # avoid problems where year pages link to centuries etc.
+ globalvar.followredirect = False
+ hintlessPageGen = pagegenerators.YearPageGenerator(startyear)
+ elif arg.startswith('-days'):
+ if len(arg) > 6 and arg[5] == ':' and arg[6:].isdigit():
+ # Looks as if the user gave a specific month at which to start
+ # Must be a natural number.
+ startMonth = int(arg[6:])
+ else:
+ startMonth = 1
+ hintlessPageGen = pagegenerators.DayPageGenerator(startMonth)
+ elif arg.startswith('-new'):
+ if len(arg) > 5 and arg[4] == ':' and arg[5:].isdigit():
+ # Looks as if the user gave a specific number of pages
+ newPages = int(arg[5:])
+ else:
+ newPages = 100
+ elif arg.startswith('-restore'):
+ globalvar.restoreAll = arg[9:].lower() == 'all'
+ optRestore = not globalvar.restoreAll
+ elif arg == '-continue':
+ optContinue = True
+ elif arg == '-hintsonly':
+ opthintsonly = True
+ elif arg.startswith('-namespace:'):
+ try:
+ namespaces.append(int(arg[11:]))
+ except ValueError:
+ namespaces.append(arg[11:])
+ # deprecated for consistency with other scripts
+ elif arg.startswith('-number:'):
+ number = int(arg[8:])
+ elif arg.startswith('-until:'):
+ until = arg[7:]
+ else:
+ if not genFactory.handleArg(arg):
+ singlePageTitle.append(arg)
+
+ # Do not use additional summary with autonomous mode
+ if globalvar.autonomous:
+ globalvar.summary = u''
+ elif globalvar.summary:
+ globalvar.summary += u'; '
+
+ # ensure that we don't try to change main page
+ try:
+ site = pywikibot.getSite()
+ try:
+ mainpagename = site.siteinfo()['mainpage']
+ except TypeError: #pywikibot module handle
+ mainpagename = site.siteinfo['mainpage']
+ globalvar.skip.add(pywikibot.Page(site, mainpagename))
+ except pywikibot.Error:
+ pywikibot.output(u'Missing main page name')
+
+ if newPages is not None:
+ if len(namespaces) == 0:
+ ns = 0
+ elif len(namespaces) == 1:
+ ns = namespaces[0]
+ if ns != 'all':
+ if isinstance(ns, unicode) or isinstance(ns, str):
+ index = site.getNamespaceIndex(ns)
+ if index is None:
+ raise ValueError(u'Unknown namespace: %s' % ns)
+ ns = index
+ namespaces = []
+ else:
+ ns = 'all'
+ hintlessPageGen = pagegenerators.NewpagesPageGenerator(newPages, namespace=ns)
+
+ elif optRestore or optContinue or globalvar.restoreAll:
+ site = pywikibot.getSite()
+ if globalvar.restoreAll:
+ import glob
+ for FileName in glob.iglob('interwiki-dumps/interwikidump-*.txt'):
+ s = FileName.split('\\')[1].split('.')[0].split('-')
+ sitename = s[1]
+ for i in xrange(0,2):
+ s.remove(s[0])
+ sitelang = '-'.join(s)
+ if site.family.name == sitename:
+ File2Restore.append([sitename, sitelang])
+ else:
+ File2Restore.append([site.family.name, site.lang])
+ for sitename, sitelang in File2Restore:
+ dumpfn = pywikibot.config.datafilepath(
+ 'interwiki-dumps',
+ u'interwikidump-%s-%s.txt'
+ % (sitename, sitelang))
+ pywikibot.output(u'Reading interwikidump-%s-%s.txt' % (sitename, sitelang))
+ site = pywikibot.getSite(sitelang, sitename)
+ if not hintlessPageGen:
+ hintlessPageGen = pagegenerators.TextfilePageGenerator(dumpfn, site)
+ else:
+ hintlessPageGen = pagegenerators.CombinedPageGenerator([hintlessPageGen,pagegenerators.TextfilePageGenerator(dumpfn, site)])
+ restoredFiles.append(dumpfn)
+ if hintlessPageGen:
+ hintlessPageGen = pagegenerators.DuplicateFilterPageGenerator(hintlessPageGen)
+ if optContinue:
+ # We waste this generator to find out the last page's title
+ # This is an ugly workaround.
+ nextPage = "!"
+ namespace = 0
+ searchGen = pagegenerators.TextfilePageGenerator(dumpfn, site)
+ for page in searchGen:
+ lastPage = page.title(withNamespace=False)
+ if lastPage > nextPage:
+ nextPage = lastPage
+ namespace = page.namespace()
+ if nextPage == "!":
+ pywikibot.output(u"Dump file is empty?! Starting at the beginning.")
+ else:
+ nextPage += '!'
+ hintlessPageGen = pagegenerators.CombinedPageGenerator([hintlessPageGen, pagegenerators.AllpagesPageGenerator(nextPage, namespace, includeredirects = False)])
+ if not hintlessPageGen:
+ pywikibot.output(u'No Dumpfiles found.')
+ return
+
+ bot = InterwikiBot()
+
+ if not hintlessPageGen:
+ hintlessPageGen = genFactory.getCombinedGenerator()
+ if hintlessPageGen:
+ if len(namespaces) > 0:
+ hintlessPageGen = pagegenerators.NamespaceFilterPageGenerator(hintlessPageGen, namespaces)
+        # we'll use iter() to make a next() function available.
+ bot.setPageGenerator(iter(hintlessPageGen), number = number, until=until)
+ elif warnfile:
+ # TODO: filter namespaces if -namespace parameter was used
+ readWarnfile(warnfile, bot)
+ else:
+ singlePageTitle = ' '.join(singlePageTitle)
+ if not singlePageTitle and not opthintsonly:
+ singlePageTitle = pywikibot.input(u'Which page to check:')
+ if singlePageTitle:
+ singlePage = pywikibot.Page(pywikibot.getSite(), singlePageTitle)
+ else:
+ singlePage = None
+ bot.add(singlePage, hints = globalvar.hints)
+
+ try:
+ try:
+ append = not (optRestore or optContinue or globalvar.restoreAll)
+ bot.run()
+ except KeyboardInterrupt:
+ dumpFileName = bot.dump(append)
+ except:
+ dumpFileName = bot.dump(append)
+ raise
+ finally:
+ if globalvar.contentsondisk:
+ StoredPage.SPdeleteStore()
+ if dumpFileName:
+ try:
+ restoredFiles.remove(dumpFileName)
+ except ValueError:
+ pass
+ for dumpFileName in restoredFiles:
+ try:
+ os.remove(dumpFileName)
+ pywikibot.output(u'Dumpfile %s deleted' % dumpFileName.split('\\')[-1])
+ except WindowsError:
+ pass
+
+#===========
+globalvar=Global()
+
+if __name__ == "__main__":
+ try:
+ main()
+ finally:
+ pywikibot.stopme()
Copied: archive/old python 2.3 scripts/wikipedia.py (from rev 10463, trunk/pywikipedia/wikipedia.py)
===================================================================
--- archive/old python 2.3 scripts/wikipedia.py (rev 0)
+++ archive/old python 2.3 scripts/wikipedia.py 2012-09-16 13:48:36 UTC (rev 10528)
@@ -0,0 +1,8639 @@
+# -*- coding: utf-8 -*-
+"""
+Library to get and put pages on a MediaWiki.
+
+Contents of the library (objects and functions to be used outside)
+
+Classes:
+ Page(site, title): A page on a MediaWiki site
+ ImagePage(site, title): An image descriptor Page
+ Site(lang, fam): A MediaWiki site
+
+Factory functions:
+ Family(name): Import the named family
+ getSite(lang, fam): Return a Site instance
+
+Exceptions:
+ Error: Base class for all exceptions in this module
+ NoUsername: Username is not in user-config.py
+ NoPage: Page does not exist on the wiki
+ NoSuchSite: Site does not exist
+ IsRedirectPage: Page is a redirect page
+ IsNotRedirectPage: Page is not a redirect page
+ LockedPage: Page is locked
+ SectionError: The section specified in the Page title does not exist
+ PageNotSaved: Saving the page has failed
+ EditConflict: PageNotSaved due to edit conflict while uploading
+ SpamfilterError: PageNotSaved due to MediaWiki spam filter
+ LongPageError: PageNotSaved due to length limit
+ ServerError: Got unexpected response from wiki server
+ BadTitle: Server responded with BadTitle
+ UserBlocked: Client's username or IP has been blocked
+ PageNotFound: Page not found in list
+
+Objects:
+ get_throttle: Call to limit rate of read-access to wiki
+ put_throttle: Call to limit rate of write-access to wiki
+
+Other functions:
+ getall(): Load a group of pages
+ handleArgs(): Process all standard command line arguments (such as
+ -family, -lang, -log and others)
+ translate(xx, dict): dict is a dictionary, giving text depending on
+ language, xx is a language. Returns the text in the most applicable
+ language for the xx: wiki
+ setAction(text): Use 'text' instead of "Wikipedia python library" in
+ edit summaries
+ setUserAgent(text): Sets the string being passed to the HTTP server as
+ the User-agent: header. Defaults to 'Pywikipediabot/1.0'.
+
+ output(text): Prints the text 'text' in the encoding of the user's
+ console. **Use this instead of "print" statements**
+ input(text): Asks input from the user, printing the text 'text' first.
+ inputChoice: Shows user a list of choices and returns user's selection.
+
+ showDiff(oldtext, newtext): Prints the differences between oldtext and
+ newtext on the screen
+
+Wikitext manipulation functions: each of these takes a unicode string
+containing wiki text as its first argument, and returns a modified version
+of the text unless otherwise noted --
+
+ replaceExcept: replace all instances of 'old' by 'new', skipping any
+ instances of 'old' within comments and other special text blocks
+ removeDisabledParts: remove text portions exempt from wiki markup
+ isDisabled(text,index): return boolean indicating whether text[index] is
+ within a non-wiki-markup section of text
+ decodeEsperantoX: decode Esperanto text using the x convention.
+ encodeEsperantoX: convert wikitext to the Esperanto x-encoding.
+ findmarker(text, startwith, append): return a string which is not part
+ of text
+ expandmarker(text, marker, separator): return marker string expanded
+ backwards to include separator occurrences plus whitespace
+
+Wikitext manipulation functions for interlanguage links:
+
+ getLanguageLinks(text,xx): extract interlanguage links from text and
+ return in a dict
+ removeLanguageLinks(text): remove all interlanguage links from text
+ removeLanguageLinksAndSeparator(text, site, marker, separator = ''):
+        remove language links, whitespace, preceding separators from text
+ replaceLanguageLinks(oldtext, new): remove the language links and
+ replace them with links from a dict like the one returned by
+ getLanguageLinks
+ interwikiFormat(links): convert a dict of interlanguage links to text
+ (using same dict format as getLanguageLinks)
+ interwikiSort(sites, inSite): sorts a list of sites according to interwiki
+ sort preference of inSite.
+ url2link: Convert urlname of a wiki page into interwiki link format.
+
+Wikitext manipulation functions for category links:
+
+ getCategoryLinks(text): return list of Category objects corresponding
+ to links in text
+ removeCategoryLinks(text): remove all category links from text
+ replaceCategoryLinksAndSeparator(text, site, marker, separator = ''):
+ remove category links, whitespace and preceding separators from text
+ replaceCategoryLinks(oldtext,new): replace the category links in oldtext by
+ those in a list of Category objects
+ replaceCategoryInPlace(text,oldcat,newtitle): replace a single link to
+ oldcat with a link to category given by newtitle
+ categoryFormat(links): return a string containing links to all
+ Categories in a list.
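+
+ A rough sketch combining some of the link helpers above (the page title
+ is made up; argument forms follow the summaries given here):
+
+   site = getSite('en')
+   text = Page(site, u'Example').get()
+   langlinks = getLanguageLinks(text, site)
+   cats = getCategoryLinks(text)
+   text = replaceLanguageLinks(text, langlinks)
+   text = replaceCategoryLinks(text, cats)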
+
+Unicode utility functions:
+ UnicodeToAsciiHtml: Convert unicode to a bytestring using HTML entities.
+ url2unicode: Convert url-encoded text to unicode using a site's encoding.
+ unicode2html: Ensure unicode string is encodable; if not, convert it to
+ ASCII for HTML.
+ html2unicode: Replace HTML entities in text with unicode characters.
+
+stopme(): Call this when the bot is no longer communicating with the Wiki.
+ It removes the bot from the list of running processes, and thus no
+ longer slows down other bot threads.
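+
+A typical bot skeleton built from the pieces above might look roughly like
+this (the page title is only an illustration):
+
+   import wikipedia as pywikibot
+   try:
+       pywikibot.handleArgs()
+       site = pywikibot.getSite()
+       page = pywikibot.Page(site, u'Sandbox')
+       pywikibot.output(page.get())
+   finally:
+       pywikibot.stopme()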
+
+"""
+from __future__ import generators
+#
+# (C) Pywikipedia bot team, 2003-2012
+#
+# Distributed under the terms of the MIT license.
+#
+__version__ = '$Id$'
+
+import os, sys
+import httplib, socket, urllib, urllib2, cookielib
+import traceback
+import time, threading, Queue
+import math
+import re, codecs, difflib, locale
+try:
+ from hashlib import md5
+except ImportError: # Python 2.4 compatibility
+ from md5 import new as md5
+import xml.sax, xml.sax.handler
+import htmlentitydefs
+import warnings
+import unicodedata
+import xmlreader
+from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup, SoupStrainer
+import weakref
+# Splitting the bot into library parts
+from pywikibot import *
+
+# Set the locale to system default. This will ensure correct string
+# handling for non-latin characters on Python 2.3.x. For Python 2.4.x it's no
+# longer needed.
+locale.setlocale(locale.LC_ALL, '')
+
+import config, login, query, version
+
+try:
+ set # introduced in Python2.4: faster and future
+except NameError:
+ from sets import Set as set
+
+# Check Unicode support (is this a wide or narrow python build?)
+# See http://www.python.org/doc/peps/pep-0261/
+try:
+ unichr(66365) # a character in th: alphabet, uses 32 bit encoding
+ WIDEBUILD = True
+except ValueError:
+ WIDEBUILD = False
+
+
+SaxError = xml.sax._exceptions.SAXParseException
+
+# Pre-compile re expressions
+reNamespace = re.compile("^(.+?) *: *(.*)$")
+Rwatch = re.compile(
+ r"<input type='hidden' value=\"(.*?)\" name=\"wpEditToken\"")
+Rwatchlist = re.compile(r"<input tabindex='[\d]+' type='checkbox' "
+ r"name='wpWatchthis' checked='checked'")
+Rlink = re.compile(r'\[\[(?P<title>[^\]\|\[]*)(\|[^\]]*)?\]\]')
+
+
+# Page objects (defined here) represent the page itself, including its contents.
+class Page(object):
+ """Page: A MediaWiki page
+
+ Constructor has two required parameters:
+ 1) The wiki Site on which the page resides [note that, if the
+ title is in the form of an interwiki link, the Page object may
+ have a different Site than this]
+ 2) The title of the page as a unicode string
+
+ Optional parameters:
+ insite - the wiki Site where this link was found (to help decode
+ interwiki links)
+ defaultNamespace - A namespace to use if the link does not contain one
+
+ Methods available:
+
+ title : The name of the page, including namespace and
+ section if any
+ urlname : Title, in a form suitable for a URL
+ namespace : The namespace in which the page is found
+ section : The section of the page (the part of the title
+ after '#', if any)
+ sectionFreeTitle : Title, without the section part
+ site : The wiki this page is in
+ encoding : The encoding of the page
+ isAutoTitle : Title can be translated using the autoFormat method
+ autoFormat : Auto-format certain dates and other standard
+ format page titles
+ isCategory : True if the page is a category
+ isDisambig (*) : True if the page is a disambiguation page
+ isImage : True if the page is an image
+ isRedirectPage (*) : True if the page is a redirect, false otherwise
+ getRedirectTarget (*) : The page the page redirects to
+ isTalkPage : True if the page is in any "talk" namespace
+ toggleTalkPage : Return the talk page (if this is one, return the
+ non-talk page)
+ get (*) : The text of the page
+ getSections (*) : Retrieve the page's section headings and assign
+ them to byte offsets
+ latestRevision (*) : The page's current revision id
+ userName : Last user to edit page
+ userNameHuman : Last human (non-bot) user to edit page
+ isIpEdit : True if last editor was unregistered
+ editTime : Timestamp of the last revision to the page
+ previousRevision (*) : The revision id of the previous version
+ permalink (*) : The url of the permalink of the current version
+ getOldVersion(id) (*) : The text of a previous version of the page
+ getRestrictions : Returns a protection dictionary
+ getVersionHistory : Load the version history information from wiki
+ getVersionHistoryTable: Create a wiki table from the history data
+ fullVersionHistory : Return all past versions including wikitext
+ contributingUsers : Return set of users who have edited page
+ getCreator : Function to get the first editor of a page
+ getLatestEditors : Function to get the last editors of a page
+ exists (*) : True if the page actually exists, false otherwise
+ isEmpty (*) : True if the page has fewer than 4 characters of
+ content, not counting interwiki and category links
+ interwiki (*) : The interwiki links from the page (list of Pages)
+ categories (*) : The categories the page is in (list of Pages)
+ linkedPages (*) : The normal pages linked from the page (list of
+ Pages)
+ imagelinks (*) : The pictures on the page (list of ImagePages)
+ templates (*) : All templates referenced on the page (list of
+ Pages)
+ templatesWithParams(*): All templates on the page, with list of parameters
+ getReferences : List of pages linking to the page
+ canBeEdited (*) : True if page is unprotected or user has edit
+ privileges
+ protection(*) : The page's protection level
+ botMayEdit (*) : True if bot is allowed to edit page
+ put(newtext) : Saves the page
+ put_async(newtext) : Queues the page to be saved asynchronously
+ append(newtext) : Append to page section
+ watch : Add the page to the watchlist
+ unwatch : Remove the page from the watchlist
+ move : Move the page to another title
+ delete : Deletes the page (requires being logged in)
+ protect : Protect or unprotect a page (requires sysop status)
+ removeImage : Remove all instances of an image from this page
+ replaceImage : Replace all instances of an image with another
+ loadDeletedRevisions : Load all deleted versions of this page
+ getDeletedRevision : Return a particular deleted revision
+ markDeletedRevision : Mark a version to be undeleted, or not
+ undelete : Undelete past version(s) of the page
+ purgeCache : Purge page from server cache
+
+ (*) : This loads the page if it has not been loaded before; permalink might
+ even reload it if it has been loaded before
+
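+ A very small usage sketch (the title is made up; the exceptions
+ described for get() below are not handled here):
+
+   site = getSite('en')
+   page = Page(site, u'Sample page')
+   if page.exists() and not page.isRedirectPage():
+       text = page.get()
+       page.put(text, u'cosmetic change')
+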
+ """
+ def __init__(self, site, title, insite=None, defaultNamespace=0):
+ """Instantiate a Page object.
+
+ """
+ try:
+ # if _editrestriction is True, it means that the page has been found
+ # to have an edit restriction, but we do not know yet whether the
+ # restriction affects us or not
+ self._editrestriction = False
+
+ if site is None or isinstance(site, basestring):
+ site = getSite(site)
+ self._site = site
+
+ if not insite:
+ insite = site
+
+ # Clean up the name, it can come from anywhere.
+ # Convert HTML entities to unicode
+ t = html2unicode(title)
+
+ # Convert URL-encoded characters to unicode
+ # Sometimes users copy a link from one site to another.
+ # Try both the source site and the destination site to decode.
+ try:
+ t = url2unicode(t, site=insite, site2=site)
+ except UnicodeDecodeError:
+ raise InvalidTitle(u'Bad page title : %s' % t)
+
+ # Normalize unicode string to a NFC (composed) format to allow
+ # proper string comparisons. According to
+ # http://svn.wikimedia.org/viewvc/mediawiki/branches/REL1_6/phase3/includes/n…
+ # the mediawiki code normalizes everything to NFC, not NFKC
+ # (which might result in information loss).
+ t = unicodedata.normalize('NFC', t)
+
+ if u'\ufffd' in t:
+ raise InvalidTitle("Title contains illegal char (\\uFFFD)")
+
+ # Replace underscores by spaces
+ t = t.replace(u"_", u" ")
+ # replace multiple spaces with a single space
+ while u"  " in t: t = t.replace(u"  ", u" ")
+ # Strip spaces at both ends
+ t = t.strip()
+ # Remove left-to-right and right-to-left markers.
+ t = t.replace(u'\u200e', '').replace(u'\u200f', '')
+ # leading colon implies main namespace instead of the default
+ if t.startswith(':'):
+ t = t[1:]
+ self._namespace = 0
+ else:
+ self._namespace = defaultNamespace
+
+ if not t:
+ raise InvalidTitle(u"Invalid title '%s'" % title)
+
+ #
+ # This code was adapted from Title.php : secureAndSplit()
+ #
+ # Namespace or interwiki prefix
+ while True:
+ m = reNamespace.match(t)
+ if not m:
+ break
+ p = m.group(1)
+ lowerNs = p.lower()
+ ns = self._site.getNamespaceIndex(lowerNs)
+ if ns:
+ t = m.group(2)
+ self._namespace = ns
+ break
+
+ if lowerNs in self._site.family.langs.keys():
+ # Interwiki link
+ t = m.group(2)
+
+ # Redundant interwiki prefix to the local wiki
+ if lowerNs == self._site.lang:
+ if t == '':
+ raise Error("Can't have an empty self-link")
+ else:
+ self._site = getSite(lowerNs, self._site.family.name)
+ if t == '':
+ t = self._site.mediawiki_message('Mainpage')
+
+ # If there's an initial colon after the interwiki, that also
+ # resets the default namespace
+ if t != '' and t[0] == ':':
+ self._namespace = 0
+ t = t[1:]
+ elif lowerNs in self._site.family.get_known_families(site = self._site):
+ if self._site.family.get_known_families(site = self._site)[lowerNs] == self._site.family.name:
+ t = m.group(2)
+ else:
+ # This page is from a different family
+ if verbose:
+ output(u"Target link '%s' has different family '%s'" % (title, lowerNs))
+ if self._site.family.name in ['commons', 'meta']:
+ #When the source wiki is commons or meta,
+ #w:page redirects you to w:en:page
+ otherlang = 'en'
+ else:
+ otherlang = self._site.lang
+ familyName = self._site.family.get_known_families(site = self._site)[lowerNs]
+ if familyName in ['commons', 'meta']:
+ otherlang = familyName
+ try:
+ self._site = getSite(otherlang, familyName)
+ except ValueError:
+ raise NoPage("""\
+%s is not a local page on %s, and the %s family is
+not supported by PyWikipediaBot!"""
+ % (title, self._site, familyName))
+ t = m.group(2)
+ else:
+ # If there's no recognized interwiki or namespace,
+ # then let the colon expression be part of the title.
+ break
+
+ sectionStart = t.find(u'#')
+ # But maybe there are magic words like {{#time|}}
+ # TODO: recognize magic word and templates inside links
+ # see http://la.wikipedia.org/w/index.php?title=997_Priska&diff=prev&oldid=1038880
+ if sectionStart > 0:
+ # Categories do not have sections.
+ if self._namespace == 14:
+ raise InvalidTitle(u"Invalid section in category '%s'" % t)
+ else:
+ t, sec = t.split(u'#', 1)
+ self._section = sec.lstrip() or None
+ t = t.rstrip()
+ elif sectionStart == 0:
+ raise InvalidTitle(u"Invalid title starting with a #: '%s'" % t)
+ else:
+ self._section = None
+
+ if t:
+ if not self._site.nocapitalize:
+ t = t[:1].upper() + t[1:]
+
+ # reassemble the title from its parts
+ if self._namespace != 0:
+ t = u'%s:%s' % (self._site.namespace(self._namespace), t)
+ if self._section:
+ t += u'#' + self._section
+
+ self._title = t
+ self.editRestriction = None
+ self.moveRestriction = None
+ self._permalink = None
+ self._userName = None
+ self._ipedit = None
+ self._editTime = None
+ self._startTime = '0'
+ # For the Flagged Revisions MediaWiki extension
+ self._revisionId = None
+ self._deletedRevs = None
+ except NoSuchSite:
+ raise
+ except:
+ if verbose:
+ output(u"Exception in Page constructor")
+ output(
+ u"site=%s, title=%s, insite=%s, defaultNamespace=%i"
+ % (site, title, insite, defaultNamespace)
+ )
+ raise
+
+ @property
+ def site(self):
+ """Return the Site object for the wiki on which this Page resides."""
+ return self._site
+
+ def namespace(self):
+ """Return the number of the namespace of the page.
+
+ Only recognizes those namespaces defined in family.py.
+ If not defined, it will return 0 (the main namespace).
+
+ """
+ return self._namespace
+
+ def encoding(self):
+ """Return the character encoding used on this Page's wiki Site."""
+ return self._site.encoding()
+
+ @deprecate_arg("decode", None)
+ def title(self, underscore=False, savetitle=False, withNamespace=True,
+ withSection=True, asUrl=False, asLink=False,
+ allowInterwiki=True, forceInterwiki=False, textlink=False,
+ as_filename=False):
+ """Return the title of this Page, as a Unicode string.
+
+ @param underscore: if true, replace all ' ' characters with '_'
+ @param withNamespace: if false, omit the namespace prefix
+ @param withSection: if false, omit the section
+ @param asUrl: - not implemented yet -
+ @param asLink: if true, return the title in the form of a wikilink
+ @param allowInterwiki: (only used if asLink is true) if true, format
+ the link as an interwiki link if necessary
+ @param forceInterwiki: (only used if asLink is true) if true, always
+ format the link as an interwiki link
+ @param textlink: (only used if asLink is true) if true, place a ':'
+ before Category: and Image: links
+ @param as_filename: - not implemented yet -
+ @param savetitle: if True, encode any wiki syntax in the title.
+
+ """
+ title = self._title
+ if not withNamespace and self.namespace() != 0:
+ title = title.split(':', 1)[1]
+ if asLink:
+ iw_target_site = getSite()
+ iw_target_family = getSite().family
+ if iw_target_family.interwiki_forward:
+ iw_target_family = pywikibot.Family(iw_target_family.interwiki_forward)
+
+ if allowInterwiki and (forceInterwiki or self._site != iw_target_site):
+ colon = ""
+ if textlink:
+ colon = ":"
+ if self._site.family != iw_target_family \
+ and self._site.family.name != self._site.lang:
+ title = u'[[%s%s:%s:%s]]' % (colon, self._site.family.name,
+ self._site.lang, title)
+ else:
+ title = u'[[%s%s:%s]]' % (colon, self._site.lang, title)
+ elif textlink and (self.isImage() or self.isCategory()):
+ title = u'[[:%s]]' % title
+ else:
+ title = u'[[%s]]' % title
+ if savetitle or asLink:
+ # Ensure there's no wiki syntax in the title
+ title = title.replace(u"''", u'%27%27')
+ if underscore:
+ title = title.replace(' ', '_')
+ if not withSection:
+ sectionName = self.section(underscore=underscore)
+ if sectionName:
+ title = title[:-len(sectionName)-1]
+ return title
+
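+ # Illustrative results only (assuming a page u'Foo bar#Baz' and that the
+ # default site is the English Wikipedia): title() -> u'Foo bar#Baz',
+ # title(withSection=False) -> u'Foo bar',
+ # title(underscore=True) -> u'Foo_bar#Baz', and
+ # title(asLink=True, forceInterwiki=True) -> u'[[en:Foo bar#Baz]]'.
+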
+ #@deprecated("Page.title(withNamespace=False)")
+ def titleWithoutNamespace(self, underscore=False):
+ """Return title of Page without namespace and without section."""
+ return self.title(underscore=underscore, withNamespace=False,
+ withSection=False)
+
+ def titleForFilename(self):
+ """
+ Return the title of the page in a form suitable for a filename on
+ the user's file system.
+ """
+ result = self.title()
+ # Replace characters that are not possible in file names on some
+ # systems.
+ # Spaces are possible on most systems, but are bad for URLs.
+ for forbiddenChar in ':*?/\\ ':
+ result = result.replace(forbiddenChar, '_')
+ return result
+
+ @deprecate_arg("decode", None)
+ def section(self, underscore = False):
+ """Return the name of the section this Page refers to.
+
+ The section is the part of the title following a '#' character, if
+ any. If no section is present, return None.
+
+ """
+ section = self._section
+ if section and underscore:
+ section = section.replace(' ', '_')
+ return section
+
+ def sectionFreeTitle(self, underscore=False):
+ """Return the title of this Page, without the section (if any)."""
+ sectionName = self.section(underscore=underscore)
+ title = self.title(underscore=underscore)
+ if sectionName:
+ return title[:-len(sectionName)-1]
+ else:
+ return title
+
+ def urlname(self, withNamespace=True):
+ """Return the Page title encoded for use in an URL."""
+ title = self.title(withNamespace=withNamespace, underscore=True)
+ encodedTitle = title.encode(self.site().encoding())
+ return urllib.quote(encodedTitle)
+
+ def __str__(self):
+ """Return a console representation of the pagelink."""
+ return self.title(asLink=True, forceInterwiki=True
+ ).encode(config.console_encoding,
+ "xmlcharrefreplace")
+
+ def __unicode__(self):
+ return self.title(asLink=True, forceInterwiki=True)
+
+ def __repr__(self):
+ """Return a more complete string representation."""
+ return "%s{%s}" % (self.__class__.__name__,
+ self.title(asLink=True).encode(config.console_encoding))
+
+ def __cmp__(self, other):
+ """Test for equality and inequality of Page objects.
+
+ Page objects are "equal" if and only if they are on the same site
+ and have the same normalized title, including section if any.
+
+ Page objects are sortable by namespace first, then by title.
+
+ """
+ if not isinstance(other, Page):
+ # especially, return -1 if other is None
+ return -1
+ if self._site == other._site:
+ return cmp(self._title, other._title)
+ else:
+ return cmp(self._site, other._site)
+
+ def __hash__(self):
+ # Pseudo method that makes it possible to store Page objects as keys
+ # in hash-tables. This relies on the fact that the string
+ # representation of an instance can not change after the construction.
+ return hash(unicode(self))
+
+ @deprecated("Page.title(asLink=True)")
+ def aslink(self, forceInterwiki=False, textlink=False, noInterwiki=False):
+ """Return a string representation in the form of a wikilink.
+
+ If forceInterwiki is True, return an interwiki link even if it
+ points to the home wiki. If False, return an interwiki link only if
+ needed.
+
+ If textlink is True, always return a link in text form (that is,
+ interwiki links and internal links to the Category: and Image:
+ namespaces will be preceded by a : character).
+
+ DEPRECATED to merge to rewrite branch:
+ use self.title(asLink=True) instead.
+ """
+ return self.title(asLink=True, forceInterwiki=forceInterwiki,
+ allowInterwiki=not noInterwiki, textlink=textlink)
+
+ def autoFormat(self):
+ """Return (dictName, value) if title is in date.autoFormat dictionary.
+
+ Value can be a year, date, etc., and dictName is 'YearBC',
+ 'Year_December', or another dictionary name. Please note that two
+ entries may have exactly the same autoFormat, but be in two
+ different namespaces, as some sites have categories with the
+ same names. Regular titles return (None, None).
+
+ """
+ if not hasattr(self, '_autoFormat'):
+ import date
+ self._autoFormat = date.getAutoFormat(self.site().language(),
+ self.title(withNamespace=False))
+ return self._autoFormat
+
+ def isAutoTitle(self):
+ """Return True if title of this Page is in the autoFormat dictionary."""
+ return self.autoFormat()[0] is not None
+
+ def get(self, force=False, get_redirect=False, throttle=True,
+ sysop=False, change_edit_time=True, expandtemplates=False):
+ """Return the wiki-text of the page.
+
+ This will retrieve the page from the server if it has not been
+ retrieved yet, or if force is True. This can raise the following
+ exceptions that should be caught by the calling code:
+
+ @exception NoPage The page does not exist
+ @exception IsRedirectPage The page is a redirect. The argument of the
+ exception is the title of the page it
+ redirects to.
+ @exception SectionError The section does not exist on a page with
+ a # link
+
+ @param force reload all page attributes, including errors.
+ @param get_redirect return the redirect text, do not follow the
+ redirect, do not raise an exception.
+ @param sysop if the user has a sysop account, use it to
+ retrieve this page
+ @param change_edit_time if False, do not check this version for
+ changes before saving. This should be used only
+ if the page has been loaded previously.
+ @param expandtemplates all templates in the page content are fully
+ resolved too (if API is used).
+
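+ A rough calling pattern (the title is only an example):
+
+   page = Page(getSite(), u'Some article')
+   try:
+       text = page.get()
+   except NoPage:
+       text = u''
+   except IsRedirectPage:
+       text = page.get(get_redirect=True)
+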
+ """
+ # NOTE: The following few NoPage exceptions could already be thrown at
+ # the Page() constructor. They are raised here instead for convenience,
+ # because all scripts are prepared for NoPage exceptions raised by
+ # get(), but not for those raised by the constructor.
+ # \ufffd represents a badly encoded character, the other characters are
+ # disallowed by MediaWiki.
+ for illegalChar in u'#<>[]|{}\n\ufffd':
+ if illegalChar in self.sectionFreeTitle():
+ if verbose:
+ output(u'Illegal character in %s!'
+ % self.title(asLink=True))
+ raise NoPage('Illegal character in %s!'
+ % self.title(asLink=True))
+ if self.namespace() == -1:
+ raise NoPage('%s is in the Special namespace!'
+ % self.title(asLink=True))
+ if self.site().isInterwikiLink(self.title()):
+ raise NoPage('%s is not a local page on %s!'
+ % (self.title(asLink=True), self.site()))
+ if force:
+ # When forcing, we retry the page no matter what:
+ # * Old exceptions and contents do not apply any more
+ # * Deleting _contents and _expandcontents to force reload
+ for attr in ['_redirarg', '_getexception',
+ '_contents', '_expandcontents',
+ '_sections']:
+ if hasattr(self, attr):
+ delattr(self, attr)
+ else:
+ # Make sure we re-raise an exception we got on an earlier attempt
+ if hasattr(self, '_redirarg') and not get_redirect:
+ raise IsRedirectPage, self._redirarg
+ elif hasattr(self, '_getexception'):
+ if self._getexception == IsRedirectPage and get_redirect:
+ pass
+ else:
+ raise self._getexception
+ # Make sure we did try to get the contents once
+ if expandtemplates:
+ attr = '_expandcontents'
+ else:
+ attr = '_contents'
+ if not hasattr(self, attr):
+ try:
+ contents = self._getEditPage(get_redirect=get_redirect, throttle=throttle, sysop=sysop,
+ expandtemplates = expandtemplates)
+ if expandtemplates:
+ self._expandcontents = contents
+ else:
+ self._contents = contents
+ hn = self.section()
+ if hn:
+ m = re.search("=+[ ']*%s[ ']*=+" % re.escape(hn),
+ self._contents)
+ if verbose and not m:
+ output(u"WARNING: Section does not exist: %s" % self)
+ # Store any exceptions for later reference
+ except NoPage:
+ self._getexception = NoPage
+ raise
+ except IsRedirectPage, arg:
+ self._getexception = IsRedirectPage
+ self._redirarg = arg
+ if not get_redirect:
+ raise
+ except SectionError:
+ self._getexception = SectionError
+ raise
+ except UserBlocked:
+ if self.site().loggedInAs(sysop=sysop):
+ raise UserBlocked(self.site(), unicode(self))
+ else:
+ if verbose:
+ output("The IP address is blocked, retry by login.")
+ self.site().forceLogin(sysop=sysop)
+ return self.get(force, get_redirect, throttle, sysop, change_edit_time)
+ if expandtemplates:
+ return self._expandcontents
+ return self._contents
+
+ def _getEditPage(self, get_redirect=False, throttle=True, sysop=False,
+ oldid=None, change_edit_time=True, expandtemplates=False):
+ """Get the contents of the Page via API query
+
+ Do not use this directly, use get() instead.
+
+ Arguments:
+ oldid - Retrieve an old revision (by id), not the current one
+ get_redirect - Get the contents, even if it is a redirect page
+ expandtemplates - Fully resolve templates within page content
+ (if API is used)
+
+ This method returns the raw wiki text as a unicode string.
+ """
+ if not self.site().has_api() or self.site().versionnumber() < 12:
+ return self._getEditPageOld(get_redirect, throttle, sysop, oldid, change_edit_time)
+ params = {
+ 'action': 'query',
+ 'titles': self.title(),
+ 'prop': ['revisions', 'info'],
+ 'rvprop': ['content', 'ids', 'flags', 'timestamp', 'user', 'comment', 'size'],
+ 'rvlimit': 1,
+ #'talkid' valid for release > 1.12
+ #'url', 'readable' valid for release > 1.14
+ 'inprop': ['protection', 'subjectid'],
+ #'intoken': 'edit',
+ }
+ if oldid:
+ params['rvstartid'] = oldid
+ if expandtemplates:
+ params[u'rvexpandtemplates'] = u''
+
+ if throttle:
+ get_throttle()
+ textareaFound = False
+ # retrying loop is done by query.GetData
+ data = query.GetData(params, self.site(), sysop=sysop)
+ if 'error' in data:
+ raise RuntimeError("API query error: %s" % data)
+ if not 'pages' in data['query']:
+ raise RuntimeError("API query error, no pages found: %s" % data)
+ pageInfo = data['query']['pages'].values()[0]
+ if data['query']['pages'].keys()[0] == "-1":
+ if 'missing' in pageInfo:
+ raise NoPage(self.site(), unicode(self),
+"Page does not exist. In rare cases, if you are certain the page does exist, look into overriding family.RversionTab")
+ elif 'invalid' in pageInfo:
+ raise BadTitle('BadTitle: %s' % self)
+ elif 'revisions' in pageInfo: #valid Title
+ lastRev = pageInfo['revisions'][0]
+ if isinstance(lastRev['*'], basestring):
+ textareaFound = True
+ # We may get page data with 'revisions' in pageInfo but
+ # lastRev['*'] = False instead of the content. The page itself was
+ # deleted, but 'missing' was not in pageInfo as expected.
+ # A ServerError() is raised for now, but maybe it should be NoPage().
+ if not textareaFound:
+ if verbose:
+ print pageInfo
+ raise ServerError('ServerError: No textarea found in %s' % self)
+
+ self.editRestriction = ''
+ self.moveRestriction = ''
+
+ # Note: user may be hidden and mw returns 'userhidden' flag
+ if 'userhidden' in lastRev:
+ self._userName = None
+ else:
+ self._userName = lastRev['user']
+ self._ipedit = 'anon' in lastRev
+ for restr in pageInfo['protection']:
+ if restr['type'] == 'edit':
+ self.editRestriction = restr['level']
+ elif restr['type'] == 'move':
+ self.moveRestriction = restr['level']
+
+ self._revisionId = lastRev['revid']
+
+ if change_edit_time:
+ self._editTime = parsetime2stamp(lastRev['timestamp'])
+ if "starttimestamp" in pageInfo:
+ self._startTime = parsetime2stamp(pageInfo["starttimestamp"])
+
+ self._isWatched = False # cannot be determined via the API for now
+
+ pagetext = lastRev['*']
+ pagetext = pagetext.rstrip()
+ # pagetext must not decodeEsperantoX() if loaded via API
+ m = self.site().redirectRegex().match(pagetext)
+ if m:
+ # page text matches the redirect pattern
+ if self.section() and not "#" in m.group(1):
+ redirtarget = "%s#%s" % (m.group(1), self.section())
+ else:
+ redirtarget = m.group(1)
+ if get_redirect:
+ self._redirarg = redirtarget
+ else:
+ raise IsRedirectPage(redirtarget)
+
+ if self.section() and \
+ not textlib.does_text_contain_section(pagetext, self.section()):
+ try:
+ self._getexception
+ except AttributeError:
+ raise SectionError # Page has no section by this name
+ return pagetext
+
+ def _getEditPageOld(self, get_redirect=False, throttle=True, sysop=False,
+ oldid=None, change_edit_time=True):
+ """Get the contents of the Page via the edit page."""
+
+ if verbose:
+ output(u'Getting page %s' % self.title(asLink=True))
+ path = self.site().edit_address(self.urlname())
+ if oldid:
+ path += "&oldid="+oldid
+ # Make sure Brion doesn't get angry by waiting if the last time a page
+ # was retrieved was not long enough ago.
+ if throttle:
+ get_throttle()
+ textareaFound = False
+ retry_idle_time = 1
+ while not textareaFound:
+ text = self.site().getUrl(path, sysop = sysop)
+
+ if "<title>Wiki does not exist</title>" in text:
+ raise NoSuchSite(u'Wiki %s does not exist yet' % self.site())
+
+ # Extract the actual text from the textarea
+ m1 = re.search('<textarea([^>]*)>', text)
+ m2 = re.search('</textarea>', text)
+ if m1 and m2:
+ i1 = m1.end()
+ i2 = m2.start()
+ textareaFound = True
+ else:
+ # search for messages with no "view source" (not used in new versions)
+ if self.site().mediawiki_message('whitelistedittitle') in text:
+ raise NoPage(u'Page editing is forbidden for anonymous users.')
+ elif self.site().has_mediawiki_message('nocreatetitle') and self.site().mediawiki_message('nocreatetitle') in text:
+ raise NoPage(self.site(), unicode(self))
+ # Bad title
+ elif 'var wgPageName = "Special:Badtitle";' in text \
+ or self.site().mediawiki_message('badtitle') in text:
+ raise BadTitle('BadTitle: %s' % self)
+ # find out if the username or IP has been blocked
+ elif self.site().isBlocked():
+ raise UserBlocked(self.site(), unicode(self))
+ # If there is no text area and the heading is 'View Source'
+ # but user is not blocked, the page does not exist, and is
+ # locked
+ elif self.site().mediawiki_message('viewsource') in text:
+ raise NoPage(self.site(), unicode(self))
+ # Some of the newest versions don't have a "view source" tag for
+ # non-existent pages.
+ # Also check the div class, because if the language is not English
+ # the bot cannot see that the page is blocked.
+ elif self.site().mediawiki_message('badaccess') in text or \
+ "<div class=\"permissions-errors\">" in text:
+ raise NoPage(self.site(), unicode(self))
+ elif config.retry_on_fail:
+ if "<title>Wikimedia Error</title>" in text:
+ output( u"Wikimedia has technical problems; will retry in %i minutes." % retry_idle_time)
+ else:
+ output( unicode(text) )
+ # We assume that the server is down. Wait some time, then try again.
+ output( u"WARNING: No text area found on %s%s. Maybe the server is down. Retrying in %i minutes..." % (self.site().hostname(), path, retry_idle_time) )
+ time.sleep(retry_idle_time * 60)
+ # Next time wait longer, but not longer than half an hour
+ retry_idle_time *= 2
+ if retry_idle_time > 30:
+ retry_idle_time = 30
+ else:
+ output( u"Failed to access wiki")
+ sys.exit(1)
+ # Check for restrictions
+ m = re.search('var wgRestrictionEdit = \\["(\w+)"\\]', text)
+ if m:
+ if verbose:
+ output(u"DBG> page is locked for group %s" % m.group(1))
+ self.editRestriction = m.group(1);
+ else:
+ self.editRestriction = ''
+ m = re.search('var wgRestrictionMove = \\["(\w+)"\\]', text)
+ if m:
+ self.moveRestriction = m.group(1);
+ else:
+ self.moveRestriction = ''
+ m = re.search('name=["\']baseRevId["\'] type=["\']hidden["\'] value="(\d+)"', text)
+ if m:
+ self._revisionId = m.group(1)
+ if change_edit_time:
+ # Get timestamps
+ m = re.search('value="(\d+)" name=["\']wpEdittime["\']', text)
+ if m:
+ self._editTime = m.group(1)
+ else:
+ self._editTime = "0"
+ m = re.search('value="(\d+)" name=["\']wpStarttime["\']', text)
+ if m:
+ self._startTime = m.group(1)
+ else:
+ self._startTime = "0"
+ # Find out if page actually exists. Only existing pages have a
+ # version history tab.
+ if self.site().family.RversionTab(self.site().language()):
+ # In case a family does not have version history tabs, or has
+ # them in another form
+ RversionTab = re.compile(self.site().family.RversionTab(self.site().language()))
+ else:
+ RversionTab = re.compile(r'<li id="ca-history"><a href=".*?title=.*?&action=history".*?>.*?</a></li>', re.DOTALL)
+ matchVersionTab = RversionTab.search(text)
+ if not matchVersionTab and not self.site().family.name == 'wikitravel':
+ raise NoPage(self.site(), unicode(self),
+"Page does not exist. In rare cases, if you are certain the page does exist, look into overriding family.RversionTab" )
+ # Look if the page is on our watchlist
+ matchWatching = Rwatchlist.search(text)
+ if matchWatching:
+ self._isWatched = True
+ else:
+ self._isWatched = False
+ # Now process the contents of the textarea
+ # Unescape HTML characters, strip whitespace
+ pagetext = text[i1:i2]
+ pagetext = unescape(pagetext)
+ pagetext = pagetext.rstrip()
+ if self.site().lang == 'eo':
+ pagetext = decodeEsperantoX(pagetext)
+ m = self.site().redirectRegex().match(pagetext)
+ if m:
+ # page text matches the redirect pattern
+ if self.section() and not "#" in m.group(1):
+ redirtarget = "%s#%s" % (m.group(1), self.section())
+ else:
+ redirtarget = m.group(1)
+ if get_redirect:
+ self._redirarg = redirtarget
+ else:
+ raise IsRedirectPage(redirtarget)
+
+ if self.section() and \
+ not textlib.does_text_contain_section(text, self.section()):
+ try:
+ self._getexception
+ except AttributeError:
+ raise SectionError # Page has no section by this name
+
+ return pagetext
+
+ def getOldVersion(self, oldid, force=False, get_redirect=False,
+ throttle=True, sysop=False, change_edit_time=True):
+ """Return text of an old revision of this page; same options as get().
+
+ @param oldid: The revid of the revision desired.
+
+ """
+ # TODO: should probably check for bad pagename, NoPage, and other
+ # exceptions that would prevent retrieving text, as get() does
+
+ # TODO: should this default to change_edit_time = False? If we're not
+ # getting the current version, why change the timestamps?
+ return self._getEditPage(
+ get_redirect=get_redirect, throttle=throttle,
+ sysop=sysop, oldid=oldid,
+ change_edit_time=change_edit_time
+ )
+
+ ## @since r10309
+ # @remarks needed by various bots
+ def getSections(self, minLevel=2, sectionsonly=False, force=False):
+ """Parses the page with API and return section information.
+
+ @param minLevel: The minimal level of heading for section to be reported.
+ @type minLevel: int
+ @param sectionsonly: Report only the result from API call, do not assign
+ the headings to wiki text (for compression e.g.).
+ @type sectionsonly: bool
+ @param force: Use API for full section list resolution, works always but
+ is extremely slow, since each single section has to be retrieved.
+ @type force: bool
+
+ Returns a list with entries: (byteoffset, level, wikiline, line, anchor)
+ This list may be empty and if sections are embedded by template, the according
+ byteoffset and wikiline entries are None. The wikiline is the wiki text,
+ line is the parsed text and anchor ist the (unique) link label.
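+
+ Rough usage sketch for a Page object 'page' (section data as described
+ above):
+
+   for byteoffset, level, wikiline, line, anchor in page.getSections():
+       if byteoffset is not None:
+           output(u'%d: %s' % (level, line))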
+ """
+ # ALWAYS replace 'byteoffset' by a self-calculated one, since the parsed text does not match the wiki text
+ # bug fix; JIRA: DRTRIGON-82
+
+ # was there already a call? already some info available?
+ if hasattr(self, '_sections'):
+ return self._sections
+
+ # Old exceptions and contents do not apply any more.
+ for attr in ['_sections']:
+ if hasattr(self, attr):
+ delattr(self,attr)
+
+ # call the wiki to get info
+ params = {
+ u'action' : u'parse',
+ u'page' : self.title(),
+ u'prop' : u'sections',
+ }
+
+ pywikibot.get_throttle()
+ pywikibot.output(u"Reading section info from %s via API..." % self.title(asLink=True))
+
+ result = query.GetData(params, self.site())
+ # JIRA: DRTRIGON-90; catch and convert error (convert it such that the whole page gets processed later)
+ try:
+ r = result[u'parse'][u'sections']
+ except KeyError: # sequence of sometimes occurring "KeyError: u'parse'"
+ pywikibot.output(u'WARNING: Query result (gS): %r' % result)
+ raise pywikibot.Error('Problem occurred during data retrieval for sections in %s!' % self.title(asLink=True))
+ #debug_data = str(r) + '\n'
+ debug_data = str(result) + '\n'
+
+ if not sectionsonly:
+ # assign sections with wiki text and section byteoffset
+ #pywikibot.output(u" Reading wiki page text (if not already done).")
+
+ debug_data += str(len(self.__dict__.get('_contents',u''))) + '\n'
+ self.get()
+ debug_data += str(len(self._contents)) + '\n'
+ debug_data += self._contents + '\n'
+
+ # code debugging
+ if verbose:
+ debugDump( 'Page.getSections', self.site, 'Page.getSections', debug_data.encode(config.textfile_encoding) )
+
+ for setting in [(0.05,0.95), (0.4,0.8), (0.05,0.8), (0.0,0.8)]: # 0.6 is default upper border
+ try:
+ pos = 0
+ for i, item in enumerate(r):
+ item[u'level'] = int(item[u'level'])
+ # byteoffset may be 0; 'None' means template
+ #if (item[u'byteoffset'] != None) and item[u'line']:
+ # (empty index means also template - workaround for bug:
+ # https://bugzilla.wikimedia.org/show_bug.cgi?id=32753)
+ if (item[u'byteoffset'] != None) and item[u'line'] and item[u'index']:
+ # section on this page and index in format u"%i"
+ self._getSectionByteOffset(item, pos, force, cutoff=setting) # raises 'Error' if not successful!
+ pos = item[u'wikiline_bo'] + len(item[u'wikiline'])
+ item[u'byteoffset'] = item[u'wikiline_bo']
+ else:
+ # section embedded from a template (index in format u"T-%i") or the
+ # parser was not able to recognize the section correctly (e.g. html)
+ # at all (the byteoffset, index, ... may be correct or not)
+ item[u'wikiline'] = None
+ r[i] = item
+ break
+ except pywikibot.Error:
+ pos = None
+ if (pos == None):
+ raise # re-raise
+
+ # check min. level
+ data = []
+ for item in r:
+ if (item[u'level'] < minLevel): continue
+ data.append( item )
+ r = data
+
+ # prepare resulting data
+ self._sections = [ (item[u'byteoffset'], item[u'level'], item[u'wikiline'], item[u'line'], item[u'anchor']) for item in r ]
+
+ return self._sections
+
+ ## @since r10309
+ # @remarks needed by Page.getSections()
+ def _getSectionByteOffset(self, section, pos, force=False, cutoff=(0.05, 0.95)):
+ """determine the byteoffset of the given section (can be slow due another API call).
+ """
+ wikitextlines = self._contents[pos:].splitlines()
+ possible_headers = []
+ #print section[u'line']
+
+ if not force:
+ # how the heading should look like (re)
+ l = section[u'level']
+ headers = [ u'^(\s*)%(spacer)s(.*?)%(spacer)s(\s*)((<!--(.*?)-->)?)(\s*)$' % {'line': section[u'line'], 'spacer': u'=' * l},
+ u'^(\s*)<h%(level)i>(.*?)</h%(level)i>(.*?)$' % {'line': section[u'line'], 'level': l}, ]
+
+ # try to give exact match for heading (remove HTML comments)
+ for h in headers:
+ #ph = re.search(h, pywikibot.removeDisabledParts(self._contents[pos:]), re.M)
+ ph = re.search(h, self._contents[pos:], re.M)
+ if ph:
+ ph = ph.group(0).strip()
+ possible_headers += [ (ph, section[u'line']) ]
+
+ # what the heading could look like (difflib)
+ headers = [ u'%(spacer)s %(line)s %(spacer)s' % {'line': section[u'line'], 'spacer': u'=' * l},
+ u'<h%(level)i>%(line)s</h%(level)i>' % {'line': section[u'line'], 'level': l}, ]
+
+ # give possible match for heading
+ # http://stackoverflow.com/questions/2923420/fuzzy-string-matching-algorithm-…
+ # http://docs.python.org/library/difflib.html
+ # (http://mwh.geek.nz/2009/04/26/python-damerau-levenshtein-distance/)
+ for h in headers:
+ ph = difflib.get_close_matches(h, wikitextlines, cutoff=cutoff[1]) # cutoff=0.6 (default)
+ possible_headers += [ (p, section[u'line']) for p in ph ]
+ #print h, possible_headers
+
+ if not possible_headers and section[u'index']: # nothing found, try 'prop=revisions (rv)'
+ # call the wiki to get info
+ params = {
+ u'action' : u'query',
+ u'titles' : self.title(),
+ u'prop' : u'revisions',
+ u'rvprop' : u'content',
+ u'rvsection' : section[u'index'],
+ }
+
+ pywikibot.get_throttle()
+ pywikibot.output(u" Reading section %s from %s via API..." % (section[u'index'], self.title(asLink=True)))
+
+ result = query.GetData(params, self.site())
+ # JIRA: DRTRIGON-90; catch and convert error (convert it such that the whole page gets processed later)
+ try:
+ r = result[u'query'][u'pages'].values()[0]
+ pl = r[u'revisions'][0][u'*'].splitlines()
+ except KeyError: # sequence of sometimes occurring "KeyError: u'parse'"
+ pywikibot.output(u'WARNING: Query result (gSBO): %r' % result)
+ raise pywikibot.Error('Problem occurred during data retrieval for sections in %s!' % self.title(asLink=True))
+
+ if pl:
+ possible_headers = [ (pl[0], pl[0]) ]
+
+ # find the most probable match for heading
+ #print possible_headers
+ best_match = (0.0, None)
+ for i, (ph, header) in enumerate(possible_headers):
+ #print u' ', i, difflib.SequenceMatcher(None, header, ph).ratio(), header, ph
+ mr = difflib.SequenceMatcher(None, header, ph).ratio()
+ if mr >= best_match[0]: best_match = (mr, ph)
+ if (i in [0, 1]) and (mr >= cutoff[0]): break # use first (exact; re) match directly (if good enough)
+ #print u' ', best_match
+
+ # prepare resulting data
+ section[u'wikiline'] = best_match[1]
+ section[u'wikiline_mq'] = best_match[0] # match quality
+ section[u'wikiline_bo'] = -1 # byteoffset
+ if section[u'wikiline']:
+ section[u'wikiline_bo'] = self._contents.find(section[u'wikiline'], pos)
+ if section[u'wikiline_bo'] < 0: # nothing found, report/raise error !
+ #page._getexception = ...
+ raise pywikibot.Error('Problem occurred during attempt to retrieve and resolve sections in %s!' % self.title(asLink=True))
+ #pywikibot.output(...)
+ # (or create your own error, e.g. look into interwiki.py)
+
+ def permalink(self):
+ """Return the permalink URL for current revision of this page."""
+ return "%s://%s%s&oldid=%i" % (self.site().protocol(),
+ self.site().hostname(),
+ self.site().get_address(self.title()),
+ self.latestRevision())
+
+ def latestRevision(self):
+ """Return the current revision id for this page."""
+ if not self._permalink:
+ # When we get the page with getall, the permalink is received
+ # automatically
+ getall(self.site(),[self],force=True)
+ # Check for exceptions
+ if hasattr(self, '_getexception'):
+ raise self._getexception
+ return int(self._permalink)
+
+ def userName(self):
+ """Return name or IP address of last user to edit page.
+
+ Returns None unless page was retrieved with getAll().
+
+ """
+ return self._userName
+
+ ## @since r10310
+ # @remarks needed by various bots
+ def userNameHuman(self):
+ """Return name or IP address of last human/non-bot user to edit page.
+
+ Returns the most recent human editor out of the last revisions
+ (optimally used with getAll()). If no human editor can be found,
+ returns None.
+ """
+
+ # was there already a call? already some info available?
+ if hasattr(self, '_userNameHuman'):
+ return self._userNameHuman
+
+ # get history (use preloaded if available)
+ (revid, timestmp, username, comment) = self.getVersionHistory(revCount=1)[0][:4]
+
+ # is the last/current editor already a human?
+ import botlist # like watchlist
+ if not botlist.isBot(username):
+ self._userNameHuman = username
+ return username
+
+ # search the last human
+ self._userNameHuman = None
+ for vh in self.getVersionHistory()[1:]:
+ (revid, timestmp, username, comment) = vh[:4]
+
+ if username and (not botlist.isBot(username)):
+ # user is a human (not a bot)
+ self._userNameHuman = username
+ break
+
+ # store and return info
+ return self._userNameHuman
+
+ def isIpEdit(self):
+ """Return True if last editor was unregistered.
+
+ Returns None unless page was retrieved with getAll() or _getEditPage().
+
+ """
+ return self._ipedit
+
+ def editTime(self, datetime=False):
+ """Return timestamp (in MediaWiki format) of last revision to page.
+
+ Returns None unless page was retrieved with getAll() or _getEditPage().
+
+ """
+ if self._editTime and datetime:
+ import datetime
+ return datetime.datetime.strptime(str(self._editTime), '%Y%m%d%H%M%S')
+
+ return self._editTime
+
+ def previousRevision(self):
+ """Return the revision id for the previous revision of this Page."""
+ vh = self.getVersionHistory(revCount=2)
+ return vh[1][0]
+
+ def exists(self):
+ """Return True if page exists on the wiki, even if it's a redirect.
+
+ If the title includes a section, return False if this section isn't
+ found.
+
+ """
+ try:
+ self.get()
+ except NoPage:
+ return False
+ except IsRedirectPage:
+ return True
+ except SectionError:
+ return False
+ return True
+
+ def pageAPInfo(self):
+ """Return the last revid if page exists on the wiki,
+ Raise IsRedirectPage if it's a redirect
+ Raise NoPage if the page doesn't exist
+
+ Using the API should be a lot faster; this function exists in order
+ to improve the scripts' performance.
+
+ """
+ params = {
+ 'action' :'query',
+ 'prop' :'info',
+ 'titles' :self.title(),
+ }
+ data = query.GetData(params, self.site(), encodeTitle = False)['query']['pages'].values()[0]
+ if 'redirect' in data:
+ raise IsRedirectPage
+ elif 'missing' in data:
+ raise NoPage
+ elif 'lastrevid' in data:
+ return data['lastrevid'] # if ok, return the last revid
+ else:
+ # should not exist, or we have problems;
+ # better to double-check in this situation
+ x = self.get()
+ return True # if we reach this point, we had no problems.
+
+ def getTemplates(self, tllimit = 5000):
+ #action=query&prop=templates&titles=Main Page
+ """
+ Returns the templates that are used in the page, as given by the API.
+
+ If no templates are found, returns an empty list.
+
+ """
+ params = {
+ 'action': 'query',
+ 'prop': 'templates',
+ 'titles': self.title(),
+ 'tllimit': tllimit,
+ }
+ if tllimit > config.special_page_limit:
+ params['tllimit'] = config.special_page_limit
+ if tllimit > 5000 and self.site.isAllowed('apihighlimits'):
+ params['tllimit'] = 5000
+
+ tmpsFound = []
+ count = 0
+ while True:
+ data = query.GetData(params, self.site(), encodeTitle = False)['query']['pages'].values()[0]
+ if "templates" not in data:
+ return []
+
+ for tmp in data['templates']:
+ count += 1
+ tmpsFound.append(Page(self.site(), tmp['title'], defaultNamespace=tmp['ns']) )
+ if count >= tllimit:
+ break
+
+ if 'query-continue' in data and count < tllimit:
+ params["tlcontinue"] = data["query-continue"]["templates"]["tlcontinue"]
+ else:
+ break
+
+ return tmpsFound
+
+ def isRedirectPage(self):
+ """Return True if this is a redirect, False if not or not existing."""
+ try:
+ self.get()
+ except NoPage:
+ return False
+ except IsRedirectPage:
+ return True
+ except SectionError:
+ return False
+ return False
+
+ def isStaticRedirect(self, force=False):
+ """Return True if this is a redirect containing the magic word
+ __STATICREDIRECT__, False if not or not existing.
+
+ """
+ found = False
+ if self.isRedirectPage() and self.site().versionnumber() > 13:
+ staticKeys = self.site().getmagicwords('staticredirect')
+ text = self.get(get_redirect=True, force=force)
+ if staticKeys:
+ for key in staticKeys:
+ if key in text:
+ found = True
+ break
+ return found
+
+ def isCategoryRedirect(self, text=None):
+ """Return True if this is a category redirect page, False otherwise."""
+
+ if not self.isCategory():
+ return False
+ if not hasattr(self, "_catredirect"):
+ if not text:
+ try:
+ text = self.get(get_redirect=True)
+ except NoPage:
+ return False
+ catredirs = self.site().category_redirects()
+ for (t, args) in self.templatesWithParams(thistxt=text):
+ template = Page(self.site(), t, defaultNamespace=10
+ ).title(withNamespace=False) # normalize title
+ if template in catredirs:
+ # Get target (first template argument)
+ if not args:
+ pywikibot.output(u'Warning: redirect target for %s is missing'
+ % self.title(asLink=True))
+ self._catredirect = False
+ else:
+ self._catredirect = self.site().namespace(14) + ":" + args[0]
+ break
+ else:
+ self._catredirect = False
+ return bool(self._catredirect)
+
+ def getCategoryRedirectTarget(self):
+ """If this is a category redirect, return the target category title."""
+ if self.isCategoryRedirect():
+ import catlib
+ return catlib.Category(self.site(), self._catredirect)
+ raise IsNotRedirectPage
+
+ def isEmpty(self):
+ """Return True if the page text has less than 4 characters.
+
+ Character count ignores language links and category links.
+ Can raise the same exceptions as get().
+
+ """
+ txt = self.get()
+ txt = removeLanguageLinks(txt, site = self.site())
+ txt = removeCategoryLinks(txt, site = self.site())
+ if len(txt) < 4:
+ return True
+ else:
+ return False
+
+ def isTalkPage(self):
+ """Return True if this page is in any talk namespace."""
+ ns = self.namespace()
+ return ns >= 0 and ns % 2 == 1
+
+ def toggleTalkPage(self):
+ """Return other member of the article-talk page pair for this Page.
+
+ If self is a talk page, returns the associated content page;
+ otherwise, returns the associated talk page.
+ Returns None if self is a special page.
+
+ """
+ ns = self.namespace()
+ if ns < 0: # Special page
+ return None
+ if self.isTalkPage():
+ ns -= 1
+ else:
+ ns += 1
+
+ if ns == 6:
+ return ImagePage(self.site(), self.title(withNamespace=False))
+
+ return Page(self.site(), self.title(withNamespace=False),
+ defaultNamespace=ns)
+
+ def isCategory(self):
+ """Return True if the page is a Category, False otherwise."""
+ return self.namespace() == 14
+
+ def isImage(self):
+ """Return True if this is an image description page, False otherwise."""
+ return self.namespace() == 6
+
+ def isDisambig(self, get_Index=True):
+ """Return True if this is a disambiguation page, False otherwise.
+
+ Relies on the presence of specific templates, identified in
+ the Family file or on a wiki page, to identify disambiguation
+ pages.
+
+ By default, loads a list of template names from the Family file;
+ if the value in the Family file is None (no entry was made), looks for
+ the list on [[MediaWiki:Disambiguationspage]]. If this page does not
+ exist, takes the MediaWiki message.
+
+ If get_Index is True, also load the templates for index articles,
+ which are given on en-wiki.
+
+ Template:Disambig is always assumed to be default, and will be
+ appended regardless of its existence.
+
+ """
+ if not hasattr(self, "_isDisambig"):
+ if not hasattr(self._site, "_disambigtemplates"):
+ try:
+ default = set(self._site.family.disambig('_default'))
+ except KeyError:
+ default = set([u'Disambig'])
+ try:
+ distl = self._site.family.disambig(self._site.lang,
+ fallback=False)
+ except KeyError:
+ distl = None
+ if distl is None:
+ try:
+ disambigpages = Page(self._site,
+ "MediaWiki:Disambiguationspage")
+ disambigs = set(link.title(withNamespace=False)
+ for link in disambigpages.linkedPages()
+ if link.namespace() == 10)
+ # add index article templates
+ if get_Index and \
+ self._site.sitename() == 'wikipedia:en':
+ regex = re.compile('\(\((.+?)\)\)')
+ content = disambigpages.get()
+ for index in regex.findall(content):
+ disambigs.add(index[:1].upper() + index[1:])
+ except NoPage:
+ disambigs = set([self._site.mediawiki_message(
+ 'Disambiguationspage').split(':', 1)[1]])
+ # add the default template(s)
+ self._site._disambigtemplates = disambigs | default
+ else:
+ # Normalize template capitalization
+ self._site._disambigtemplates = set(
+ t[:1].upper() + t[1:] for t in distl
+ )
+ disambigInPage = self._site._disambigtemplates.intersection(
+ self.templates())
+ self._isDisambig = self.namespace() != 10 and \
+ len(disambigInPage) > 0
+ return self._isDisambig
+
+ def canBeEdited(self):
+ """Return bool indicating whether this page can be edited.
+
+ This returns True if and only if:
+ - page is unprotected, and bot has an account for this site, or
+ - page is protected, and bot has a sysop account for this site.
+
+ """
+ try:
+ self.get()
+ except:
+ pass
+ if self.editRestriction == 'sysop':
+ userdict = config.sysopnames
+ else:
+ userdict = config.usernames
+ try:
+ userdict[self.site().family.name][self.site().lang]
+ return True
+ except:
+ # We don't have a user account for that wiki, or the
+ # page is locked and we don't have a sysop account.
+ return False
+
+ def botMayEdit(self, username):
+ """Return True if this page allows bots to edit it.
+
+ This will be True if the page doesn't contain {{bots}} or
+ {{nobots}}, or it contains them and the active bot is allowed to
+ edit this page. (This method is only useful on those sites that
+ recognize the bot-exclusion protocol; on other sites, it will always
+ return True.)
+
+ The framework enforces this restriction by default. It is possible
+ to override this by setting ignore_bot_templates=True in
+ user-config.py, or using page.put(force=True).
+
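+ A sketch of the intended guard ('page', 'username' and 'newtext' stand
+ for the bot's own variables):
+
+   if page.botMayEdit(username):
+       page.put(newtext)
+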
+ """
+
+ if self.site().family.name == 'wikitravel': # Wikitravel's bot control.
+ self.site().family.bot_control(self.site())
+
+ if config.ignore_bot_templates: #Check the "master ignore switch"
+ return True
+
+ try:
+ templates = self.templatesWithParams(get_redirect=True);
+ except (NoPage, IsRedirectPage, SectionError):
+ return True
+
+ for template in templates:
+ if template[0].lower() == 'nobots':
+ return False
+ elif template[0].lower() == 'bots':
+ if len(template[1]) == 0:
+ return True
+ else:
+ (ttype, bots) = template[1][0].split('=', 1)
+ bots = bots.split(',')
+ if ttype == 'allow':
+ if 'all' in bots or username in bots:
+ return True
+ else:
+ return False
+ if ttype == 'deny':
+ if 'all' in bots or username in bots:
+ return False
+ else:
+ return True
+ # no restricting template found
+ return True
+
+ def getReferences(self, follow_redirects=True, withTemplateInclusion=True,
+ onlyTemplateInclusion=False, redirectsOnly=False, internal = False):
+ """Yield all pages that link to the page by API
+
+ If you need a full list of referring pages, use this:
+ pages = [page for page in s.getReferences()]
+ Parameters:
+ * follow_redirects - if True, also returns pages that link to a
+ redirect pointing to the page.
+ * withTemplateInclusion - if True, also returns pages where self is
+ used as a template.
+ * onlyTemplateInclusion - if True, only returns pages where self is
+ used as a template.
+ * redirectsOnly - if True, only returns redirects to self.
+
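+ For example, to iterate only over redirects to this page (sketch, for
+ a Page object 'page'):
+
+   for ref in page.getReferences(redirectsOnly=True):
+       output(ref.title())
+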
+ """
+ if not self.site().has_api():
+ for s in self.getReferencesOld(follow_redirects, withTemplateInclusion, onlyTemplateInclusion, redirectsOnly):
+ yield s
+ return
+
+ params = {
+ 'action': 'query',
+ 'list': [],
+ }
+ if not onlyTemplateInclusion:
+ params['list'].append('backlinks')
+ params['bltitle'] = self.title()
+ params['bllimit'] = config.special_page_limit
+ params['blfilterredir'] = 'all'
+ if follow_redirects:
+ params['blredirect'] = 1
+ if redirectsOnly:
+ params['blfilterredir'] = 'redirects'
+ if not self.site().isAllowed('apihighlimits') and config.special_page_limit > 500:
+ params['bllimit'] = 500
+
+ if withTemplateInclusion or onlyTemplateInclusion:
+ params['list'].append('embeddedin')
+ params['eititle'] = self.title()
+ params['eilimit'] = config.special_page_limit
+ params['eifilterredir'] = 'all'
+ if follow_redirects:
+ params['eiredirect'] = 1
+ if redirectsOnly:
+ params['eifilterredir'] = 'redirects'
+ if not self.site().isAllowed('apihighlimits') and config.special_page_limit > 500:
+ params['eilimit'] = 500
+
+ allDone = False
+
+ while not allDone:
+ if not internal:
+ output(u'Getting references to %s via API...'
+ % self.title(asLink=True))
+
+ datas = query.GetData(params, self.site())
+ data = datas['query'].values()
+ if len(data) == 2:
+ data = data[0] + data[1]
+ else:
+ data = data[0]
+
+ refPages = set()
+ for blp in data:
+ pg = Page(self.site(), blp['title'], defaultNamespace = blp['ns'])
+ if pg in refPages:
+ continue
+
+ yield pg
+ refPages.add(pg)
+ if follow_redirects and 'redirect' in blp and 'redirlinks' in blp:
+ for p in blp['redirlinks']:
+ plk = Page(self.site(), p['title'], defaultNamespace = p['ns'])
+ if plk in refPages:
+ continue
+
+ yield plk
+ refPages.add(plk)
+ if follow_redirects and 'redirect' in p and plk != self:
+ for zms in plk.getReferences(follow_redirects, withTemplateInclusion,
+ onlyTemplateInclusion, redirectsOnly, internal=True):
+ yield zms
+ else:
+ continue
+ else:
+ continue
+
+ if 'query-continue' in datas:
+ if 'backlinks' in datas['query-continue']:
+ params['blcontinue'] = datas['query-continue']['backlinks']['blcontinue']
+
+ if 'embeddedin' in datas['query-continue']:
+ params['eicontinue'] = datas['query-continue']['embeddedin']['eicontinue']
+ else:
+ allDone = True
+
+
+ def getReferencesOld(self,
+ follow_redirects=True, withTemplateInclusion=True,
+ onlyTemplateInclusion=False, redirectsOnly=False):
+ """Yield all pages that link to the page.
+ """
+ # Temporary bug-fix while researching more robust solution:
+ if config.special_page_limit > 999:
+ config.special_page_limit = 999
+ site = self.site()
+ path = self.site().references_address(self.urlname())
+ if withTemplateInclusion:
+ path+=u'&hidetrans=0'
+ if onlyTemplateInclusion:
+ path+=u'&hidetrans=0&hidelinks=1&hideredirs=1&hideimages=1'
+ if redirectsOnly:
+ path+=u'&hideredirs=0&hidetrans=1&hidelinks=1&hideimages=1'
+ content = SoupStrainer("div", id=self.site().family.content_id)
+ try:
+ next_msg = self.site().mediawiki_message('whatlinkshere-next')
+ except KeyError:
+ next_msg = "next %i" % config.special_page_limit
+ plural = (config.special_page_limit == 1) and "\\1" or "\\2"
+ next_msg = re.sub(r"{{PLURAL:\$1\|(.*?)\|(.*?)}}", plural, next_msg)
+ nextpattern = re.compile("^%s$" % next_msg.replace("$1", "[0-9]+"))
+ delay = 1
+ if self.site().has_mediawiki_message("Isredirect"):
+ self._isredirectmessage = self.site().mediawiki_message("Isredirect")
+ if self.site().has_mediawiki_message("Istemplate"):
+ self._istemplatemessage = self.site().mediawiki_message("Istemplate")
+ # to avoid duplicates:
+ refPages = set()
+ while path:
+ output(u'Getting references to %s' % self.title(asLink=True))
+ get_throttle()
+ txt = self.site().getUrl(path)
+ body = BeautifulSoup(txt,
+ convertEntities=BeautifulSoup.HTML_ENTITIES,
+ parseOnlyThese=content)
+ next_text = body.find(text=nextpattern)
+ if next_text is not None and next_text.parent.has_key('href'):
+ path = next_text.parent['href'].replace("&", "&")
+ else:
+ path = ""
+ reflist = body.find("ul")
+ if reflist is None:
+ return
+ for page in self._parse_reflist(reflist,
+ follow_redirects, withTemplateInclusion,
+ onlyTemplateInclusion, redirectsOnly):
+ if page not in refPages:
+ yield page
+ refPages.add(page)
+
+ def _parse_reflist(self, reflist,
+ follow_redirects=True, withTemplateInclusion=True,
+ onlyTemplateInclusion=False, redirectsOnly=False):
+ """For internal use only
+
+ Parse a "Special:Whatlinkshere" list of references and yield Page
+ objects that meet the criteria (used by getReferences)
+ """
+ for link in reflist("li", recursive=False):
+ title = link.a.string
+ if title is None:
+ output(u"DBG> invalid <li> item in Whatlinkshere: %s" % link)
+ continue
+ try:
+ p = Page(self.site(), title)
+ except InvalidTitle:
+ output(u"DBG> Whatlinkshere:%s contains invalid link to %s"
+ % (self.title(), title))
+ continue
+ isredirect, istemplate = False, False
+ textafter = link.a.findNextSibling(text=True)
+ if textafter is not None:
+ if self.site().has_mediawiki_message("Isredirect") \
+ and self._isredirectmessage in textafter:
+ # make sure this is really a redirect to this page
+ # (MediaWiki will mark as a redirect any link that follows
+ # a #REDIRECT marker, not just the first one).
+ if p.getRedirectTarget().sectionFreeTitle() == self.sectionFreeTitle():
+ isredirect = True
+ if self.site().has_mediawiki_message("Istemplate") \
+ and self._istemplatemessage in textafter:
+ istemplate = True
+ if (withTemplateInclusion or onlyTemplateInclusion or not istemplate
+ ) and (not redirectsOnly or isredirect
+ ) and (not onlyTemplateInclusion or istemplate
+ ):
+ yield p
+ continue
+
+ if isredirect and follow_redirects:
+ sublist = link.find("ul")
+ if sublist is not None:
+ for p in self._parse_reflist(sublist,
+ follow_redirects, withTemplateInclusion,
+ onlyTemplateInclusion, redirectsOnly):
+ yield p
+
+ def _getActionUser(self, action, restriction = '', sysop = False):
+ """
+ Get the user to do an action: sysop or not sysop, or raise an exception
+ if the user cannot do that.
+
+ Parameters:
+ * action - the action to be done, which is the name of the right
+ * restriction - the restriction level or an empty string for no restriction
+ * sysop - initially use sysop user?
+ """
+ # Login
+ self.site().forceLogin(sysop = sysop)
+
+ # Check permissions
+ if not self.site().isAllowed(action, sysop):
+ if sysop:
+ raise LockedPage(u'The sysop user is not allowed to %s on site %s' % (action, self.site()))
+ else:
+ try:
+ user = self._getActionUser(action, restriction, sysop = True)
+ output(u'The user is not allowed to %s on site %s. Using sysop account.' % (action, self.site()))
+ return user
+ except NoUsername:
+ raise LockedPage(u'The user is not allowed to %s on site %s, and no sysop account is defined.' % (action, self.site()))
+ except LockedPage:
+ raise
+
+ # Check restrictions
+ if not self.site().isAllowed(restriction, sysop):
+ if sysop:
+ raise LockedPage(u'Page on %s is locked in a way that the sysop user cannot %s it' % (self.site(), action))
+ else:
+ try:
+ user = self._getActionUser(action, restriction, sysop = True)
+ output(u'Page is locked on %s - cannot %s, using sysop account.' % (self.site(), action))
+ return user
+ except NoUsername:
+ raise LockedPage(u'Page is locked on %s - cannot %s, and no sysop account is defined.' % (self.site(), action))
+ except LockedPage:
+ raise
+
+ return sysop
+
+ def getRestrictions(self):
+ """
+ Get the protections on the page.
+ * Returns a restrictions dictionary. Keys are 'edit' and 'move',
+ Values are None (no restriction for that action) or [level, expiry] :
+ * level is the level of auth needed to perform that action
+ ('autoconfirmed' or 'sysop')
+ * expiry is the expiration time of the restriction
+ """
+ #, titles = None
+ #if titles:
+ # restrictions = {}
+ #else:
+ restrictions = { 'edit': None, 'move': None }
+ try:
+ api_url = self.site().api_address()
+ except NotImplementedError:
+ return restrictions
+
+ predata = {
+ 'action': 'query',
+ 'prop': 'info',
+ 'inprop': 'protection',
+ 'titles': self.title(),
+ }
+ #if titles:
+ # predata['titles'] = titles
+
+ text = query.GetData(predata, self.site())['query']['pages']
+
+ for pageid in text:
+ if 'missing' in text[pageid]:
+ self._getexception = NoPage
+ raise NoPage('Page %s does not exist' % self.title(asLink=True))
+ elif not 'pageid' in text[pageid]:
+ # Don't know what may happen here.
+ # We may want to have better error handling
+ raise Error("BUG> API problem.")
+ if text[pageid]['protection'] != []:
+ #if titles:
+ # restrictions = dict([ detail['type'], [ detail['level'], detail['expiry'] ] ]
+ # for detail in text[pageid]['protection'])
+ #else:
+ restrictions = dict([ detail['type'], [ detail['level'], detail['expiry'] ] ]
+ for detail in text[pageid]['protection'])
+
+ return restrictions
+
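+ # Illustrative usage sketch (not part of the original source): reading the
+ # dictionary returned by getRestrictions(). The Page object named page is a
+ # hypothetical example value.
+ #
+ #     page = Page(getSite(), u'Main Page')
+ #     restrictions = page.getRestrictions()
+ #     if restrictions['edit'] is not None:
+ #         level, expiry = restrictions['edit']
+ #         output(u'Edit-protected (%s) until %s' % (level, expiry))
+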
+ def put_async(self, newtext,
+ comment=None, watchArticle=None, minorEdit=True, force=False,
+ callback=None):
+ """Put page on queue to be saved to wiki asynchronously.
+
+ Asynchronous version of put (takes the same arguments), which places
+ pages on a queue to be saved by a daemon thread. All arguments are
+ the same as for .put(), except --
+
+ callback: a callable object that will be called after the page put
+ operation; this object must take two arguments:
+ (1) a Page object, and (2) an exception instance, which
+ will be None if the page was saved successfully.
+
+ The callback is intended to be used by bots that need to keep track
+ of which saves were successful.
+
+ """
+ try:
+ page_put_queue.mutex.acquire()
+ try:
+ _putthread.start()
+ except (AssertionError, RuntimeError):
+ pass
+ finally:
+ page_put_queue.mutex.release()
+ page_put_queue.put((self, newtext, comment, watchArticle, minorEdit,
+ force, callback))
+
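+ # Illustrative sketch (not part of the original source): a callback that
+ # reports failed asynchronous saves, following the signature described in the
+ # docstring above. The names report_save and page are hypothetical.
+ #
+ #     def report_save(page, error):
+ #         if error is not None:
+ #             output(u'Saving %s failed: %s' % (page.title(), error))
+ #
+ #     page.put_async(newtext, comment=u'Bot: routine update',
+ #                    callback=report_save)
+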
+ def put(self, newtext, comment=None, watchArticle=None, minorEdit=True,
+ force=False, sysop=False, botflag=True, maxTries=-1):
+ """Save the page with the contents of the first argument as the text.
+
+ Optional parameters:
+ comment: a unicode string that is to be used as the summary for
+ the modification.
+ watchArticle: a bool, add or remove this Page to/from bot user's
+ watchlist (if None, leave watchlist status unchanged)
+ minorEdit: mark this edit as minor if True
+ force: ignore botMayEdit() setting.
+ maxTries: the maximum amount of save attempts. -1 for infinite.
+ """
+ # Login
+ try:
+ self.get()
+ except:
+ pass
+ sysop = self._getActionUser(action = 'edit', restriction = self.editRestriction, sysop = sysop)
+ username = self.site().loggedInAs()
+
+ # Check blocks
+ self.site().checkBlocks(sysop = sysop)
+
+ # Determine if we are allowed to edit
+ if not force:
+ if not self.botMayEdit(username):
+ raise LockedPage(
+ u'Not allowed to edit %s because of a restricting template'
+ % self.title(asLink=True))
+ elif self.site().has_api() and self.namespace() in [2,3] \
+ and (self.title().endswith('.css') or \
+ self.title().endswith('.js')):
+ titleparts = self.title().split("/")
+ userpageowner = titleparts[0].split(":")[1]
+ if userpageowner != username:
+ # With the API enabled: if the title ends with .css or .js in ns 2 or 3,
+ # editing it requires permission to edit other users' pages
+ if self.title().endswith('css'):
+ permission = 'editusercss'
+ else:
+ permission = 'edituserjs'
+ sysop = self._getActionUser(action=permission,
+ restriction=self.editRestriction,
+ sysop=True)
+
+ # If there is an unchecked edit restriction, we need to load the page
+ if self._editrestriction:
+ output(
+u'Page %s is semi-protected. Getting edit page to find out if we are allowed to edit.'
+ % self.title(asLink=True))
+ oldtime = self.editTime()
+ # Note: change_edit_time=True is always True since
+ # self.get() calls self._getEditPage without this parameter
+ self.get(force=True, change_edit_time=True)
+ newtime = self.editTime()
+ ### TODO: we have different timestamp formats
+ if re.sub('\D', '', str(oldtime)) != re.sub('\D', '', str(newtime)): # page was changed
+ raise EditConflict(u'Page has been changed after first read.')
+ self._editrestriction = False
+ # If no comment is given for the change, use the default
+ comment = comment or action
+ if config.cosmetic_changes and not self.isTalkPage() and \
+ not calledModuleName() in ('cosmetic_changes', 'touch'):
+ if config.cosmetic_changes_mylang_only:
+ cc = (self.site().family.name == config.family and self.site().lang == config.mylang) or \
+ self.site().family.name in config.cosmetic_changes_enable.keys() and \
+ self.site().lang in config.cosmetic_changes_enable[self.site().family.name]
+ else:
+ cc = True
+ cc = cc and not \
+ (self.site().family.name in config.cosmetic_changes_disable.keys() and \
+ self.site().lang in config.cosmetic_changes_disable[self.site().family.name])
+ if cc:
+ old = newtext
+ if verbose:
+ output(u'Cosmetic Changes for %s-%s enabled.' % (self.site().family.name, self.site().lang))
+ import cosmetic_changes
+ from pywikibot import i18n
+ ccToolkit = cosmetic_changes.CosmeticChangesToolkit(self.site(), redirect=self.isRedirectPage(), namespace = self.namespace(), pageTitle=self.title())
+ newtext = ccToolkit.change(newtext)
+ if comment and old.strip().replace('\r\n', '\n') != newtext.strip().replace('\r\n', '\n'):
+ comment += i18n.twtranslate(self.site(), 'cosmetic_changes-append')
+
+ if watchArticle is None:
+ # if the page was loaded via get(), we know its status
+ if hasattr(self, '_isWatched'):
+ watchArticle = self._isWatched
+ else:
+ import watchlist
+ watchArticle = watchlist.isWatched(self.title(), site = self.site())
+ newPage = not self.exists()
+ # if posting to an Esperanto wiki, we must e.g. write Bordeauxx instead
+ # of Bordeaux
+ if self.site().lang == 'eo' and not self.site().has_api():
+ newtext = encodeEsperantoX(newtext)
+ comment = encodeEsperantoX(comment)
+
+ return self._putPage(newtext, comment, watchArticle, minorEdit,
+ newPage, self.site().getToken(sysop = sysop), sysop = sysop, botflag=botflag, maxTries=maxTries)
+
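+ # Illustrative usage sketch (assumption: an existing Page object named page):
+ #
+ #     text = page.get()
+ #     text = text.replace(u'colour', u'color')
+ #     page.put(text, comment=u'Bot: spelling change', minorEdit=True)
+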
+ def _encodeArg(self, arg, msgForError):
+ """Encode an ascii string/Unicode string to the site's encoding"""
+ try:
+ return arg.encode(self.site().encoding())
+ except UnicodeDecodeError, e:
+ # happens when arg is a non-ascii bytestring :
+ # when reencoding bytestrings, python decodes first to ascii
+ e.reason += ' (cannot convert input %s string to unicode)' % msgForError
+ raise e
+ except UnicodeEncodeError, e:
+ # happens when arg is unicode
+ e.reason += ' (cannot convert %s to wiki encoding %s)' % (msgForError, self.site().encoding())
+ raise e
+
+ def _putPage(self, text, comment=None, watchArticle=False, minorEdit=True,
+ newPage=False, token=None, newToken=False, sysop=False,
+ captcha=None, botflag=True, maxTries=-1):
+ """Upload 'text' as new content of Page by API
+
+ Don't use this directly, use put() instead.
+
+ """
+ if not self.site().has_api() or self.site().versionnumber() < 13:
+ # api not enabled or version not supported
+ return self._putPageOld(text, comment, watchArticle, minorEdit,
+ newPage, token, newToken, sysop, captcha, botflag, maxTries)
+
+ retry_attempt = 0
+ retry_delay = 1
+ dblagged = False
+ params = {
+ 'action': 'edit',
+ 'title': self.title(),
+ 'text': self._encodeArg(text, 'text'),
+ 'summary': self._encodeArg(comment, 'summary'),
+ }
+
+ if token:
+ params['token'] = token
+ else:
+ params['token'] = self.site().getToken(sysop = sysop)
+
+ # Add server lag parameter (see config.py for details)
+ if config.maxlag:
+ params['maxlag'] = str(config.maxlag)
+
+ if self._editTime:
+ params['basetimestamp'] = self._editTime
+ else:
+ params['basetimestamp'] = time.strftime('%Y%m%d%H%M%S', time.gmtime())
+
+ if self._startTime:
+ params['starttimestamp'] = self._startTime
+ else:
+ params['starttimestamp'] = time.strftime('%Y%m%d%H%M%S', time.gmtime())
+
+ if botflag:
+ params['bot'] = 1
+
+ if minorEdit:
+ params['minor'] = 1
+ else:
+ params['notminor'] = 1
+
+ if watchArticle:
+ params['watch'] = 1
+ #else:
+ # params['unwatch'] = 1
+
+ if captcha:
+ params['captchaid'] = captcha['id']
+ params['captchaword'] = captcha['answer']
+
+ while True:
+ if (maxTries == 0):
+ raise MaxTriesExceededError()
+ maxTries -= 1
+ # Check whether we are not editing too quickly after the previous
+ # putPage, and wait a bit until the interval is acceptable
+ if not dblagged:
+ put_throttle()
+ # Which web-site host are we submitting to?
+ if newPage:
+ output(u'Creating page %s via API' % self.title(asLink=True))
+ params['createonly'] = 1
+ else:
+ output(u'Updating page %s via API' % self.title(asLink=True))
+ params['nocreate'] = 1
+ # Submit the prepared information
+ try:
+ response, data = query.GetData(params, self.site(), sysop=sysop, back_response = True)
+ if isinstance(data,basestring):
+ raise KeyError
+ except httplib.BadStatusLine, line:
+ raise PageNotSaved('Bad status line: %s' % line.line)
+ except ServerError:
+ output(u''.join(traceback.format_exception(*sys.exc_info())))
+ retry_attempt += 1
+ if retry_attempt > config.maxretries:
+ raise
+ output(u'Got a server error when putting %s; will retry in %i minute%s.' % (self.title(asLink=True), retry_delay, retry_delay != 1 and "s" or ""))
+ time.sleep(60 * retry_delay)
+ retry_delay *= 2
+ if retry_delay > 30:
+ retry_delay = 30
+ continue
+ except ValueError: # API result cannot decode
+ output(u"Server error encountered; will retry in %i minute%s."
+ % (retry_delay, retry_delay != 1 and "s" or ""))
+ time.sleep(60 * retry_delay)
+ retry_delay *= 2
+ if retry_delay > 30:
+ retry_delay = 30
+ continue
+ # If it has gotten this far then we should reset dblagged
+ dblagged = False
+ # Check blocks
+ self.site().checkBlocks(sysop = sysop)
+ # An HTTP 500 response means a server error; wait a while and retry.
+ if response.code == 500:
+ output(u"Server error encountered; will retry in %i minute%s."
+ % (retry_delay, retry_delay != 1 and "s" or ""))
+ time.sleep(60 * retry_delay)
+ retry_delay *= 2
+ if retry_delay > 30:
+ retry_delay = 30
+ continue
+ if 'error' in data:
+ # All available error keys in edit mode (from ApiBase.php):
+ # 'noimageredirect-anon':"Anonymous users can't create image redirects",
+ # 'noimageredirect':"You don't have permission to create image redirects",
+ # 'filtered':"The filter callback function refused your edit",
+ # 'noedit-anon':"Anonymous users can't edit pages",
+ # 'noedit':"You don't have permission to edit pages",
+ # 'emptypage':"Creating new, empty pages is not allowed",
+ # 'badmd5':"The supplied MD5 hash was incorrect",
+ # 'notext':"One of the text, appendtext, prependtext and undo parameters must be set",
+ # 'emptynewsection':'Creating empty new sections is not possible.',
+ # 'revwrongpage':"r\$1 is not a revision of ``\$2''",
+ # 'undofailure':'Undo failed due to conflicting intermediate edits',
+
+ #for debug only
+ #------------------------
+ if verbose:
+ output("error occurred, code:%s\ninfo:%s\nstatus:%s\nresponse:%s" % (
+ data['error']['code'], data['error']['info'], response.code, response.msg))
+ faked = params
+ if 'text' in faked:
+ del faked['text']
+ output("OriginalData:%s" % faked)
+ del faked
+ #------------------------
+ errorCode = data['error']['code']
+ #cannot handle longpageerror and PageNoSave yet
+ if errorCode == 'maxlag' or response.code == 503:
+ # server lag; wait for the lag time and retry
+ lagpattern = re.compile(r"Waiting for [\d.]+: (?P<lag>\d+) seconds? lagged")
+ lag = lagpattern.search(data['error']['info'])
+ timelag = int(lag.group("lag"))
+ output(u"Pausing %d seconds due to database server lag." % min(timelag,300))
+ dblagged = True
+ time.sleep(min(timelag,300))
+ continue
+ elif errorCode == 'editconflict':
+ # 'editconflict':"Edit conflict detected",
+ raise EditConflict(u'An edit conflict has occurred.')
+ elif errorCode == 'spamdetected':
+ # 'spamdetected':"Your edit was refused because it contained a spam fragment: ``\$1''",
+ raise SpamfilterError(data['error']['info'][62:-2])
+ elif errorCode == 'pagedeleted':
+ # 'pagedeleted':"The page has been deleted since you fetched its timestamp",
+ # Make sure your system clock is correct if this error occurs
+ # without any reason!
+ # raise EditConflict(u'Someone deleted the page.')
+ # No raise, simply define these variables and retry:
+ params['recreate'] = 1
+ if self._editTime:
+ params['basetimestamp'] = self._editTime
+ else:
+ params['basetimestamp'] = time.strftime('%Y%m%d%H%M%S', time.gmtime())
+
+ if self._startTime:
+ params['starttimestamp'] = self._startTime
+ else:
+ params['starttimestamp'] = time.strftime('%Y%m%d%H%M%S', time.gmtime())
+ continue
+ elif errorCode == 'readonly':
+ # 'readonly':"The wiki is currently in read-only mode"
+ output(u"The database is currently locked for write access; will retry in %i minute%s."
+ % (retry_delay, retry_delay != 1 and "s" or ""))
+ time.sleep(60 * retry_delay)
+ retry_delay *= 2
+ if retry_delay > 30:
+ retry_delay = 30
+ continue
+ elif errorCode == 'contenttoobig':
+ # 'contenttoobig':"The content you supplied exceeds the article size limit of \$1 kilobytes",
+ raise LongPageError(len(params['text']), int(data['error']['info'][59:-10]))
+ elif errorCode in ['protectedpage', 'customcssjsprotected', 'cascadeprotected', 'protectednamespace', 'protectednamespace-interface']:
+ # 'protectedpage':"The ``\$1'' right is required to edit this page"
+ # 'cascadeprotected':"The page you're trying to edit is protected because it's included in a cascade-protected page"
+ # 'customcssjsprotected': "You're not allowed to edit custom CSS and JavaScript pages"
+ # 'protectednamespace': "You're not allowed to edit pages in the ``\$1'' namespace"
+ # 'protectednamespace-interface':"You're not allowed to edit interface messages"
+ #
+ # The page is locked. This should have already been
+ # detected when getting the page, but there are some
+ # reasons why this didn't work, e.g. the page might be
+ # locked via a cascade lock.
+ try:
+ # Page is locked - try using the sysop account, unless we're using one already
+ if sysop:# Unknown permissions error
+ raise LockedPage()
+ else:
+ self.site().forceLogin(sysop = True)
+ output(u'Page is locked, retrying using sysop account.')
+ return self._putPage(text, comment, watchArticle, minorEdit, newPage, token=self.site().getToken(sysop = True), sysop = True)
+ except NoUsername:
+ raise LockedPage()
+ elif errorCode == 'badtoken':
+ if newToken:
+ output(u"Edit token has failed. Giving up.")
+ else:
+ # We might have been using an outdated token
+ output(u"Edit token has failed. Retrying.")
+ return self._putPage(text, comment, watchArticle, minorEdit, newPage, token=self.site().getToken(sysop = sysop, getagain = True), newToken = True, sysop = sysop)
+ # I think the error message title was changed from "Wikimedia Error"
+ # to "Wikipedia has a problem", but I'm not sure. Maybe we could
+ # just check for HTTP Status 500 (Internal Server Error)?
+ else:
+ output("Unknown Error. API Error code:%s" % data['error']['code'] )
+ output("Information:%s" % data['error']['info'])
+ else:
+ if data['edit']['result'] == u"Success":
+ #
+ # The status code for a completed page update in ordinary (form) mode is
+ # 302 - Found, but the API always returns 200 - OK because it only sends
+ # "Success" back as a string. If the page update succeeded, we return
+ # code 302 so that scripts which rely on the status code keep working.
+ #
+ return 302, response.msg, data['edit']
+
+ solve = self.site().solveCaptcha(data)
+ if solve:
+ return self._putPage(text, comment, watchArticle, minorEdit, newPage, token, newToken, sysop, captcha=solve)
+
+ return response.code, response.msg, data
+
+
+ def _putPageOld(self, text, comment=None, watchArticle=False, minorEdit=True,
+ newPage=False, token=None, newToken=False, sysop=False,
+ captcha=None, botflag=True, maxTries=-1):
+ """Upload 'text' as new content of Page by filling out the edit form.
+
+ Don't use this directly, use put() instead.
+
+ """
+ host = self.site().hostname()
+ # Get the address of the page on that host.
+ address = self.site().put_address(self.urlname())
+ predata = {
+ 'wpSave': '1',
+ 'wpSummary': self._encodeArg(comment, 'edit summary'),
+ 'wpTextbox1': self._encodeArg(text, 'wikitext'),
+ # As of October 2008, MW HEAD requires wpSection to be set.
+ # We will need to fill this more smartly if we ever decide to edit by section
+ 'wpSection': '',
+ }
+ if not botflag:
+ predata['bot']='0'
+ if captcha:
+ predata["wpCaptchaId"] = captcha['id']
+ predata["wpCaptchaWord"] = captcha['answer']
+ # Add server lag parameter (see config.py for details)
+ if config.maxlag:
+ predata['maxlag'] = str(config.maxlag)
+ # <s>Except if the page is new, we need to supply the time of the
+ # previous version to the wiki to prevent edit collisions</s>
+ # As of Oct 2008, these must be filled also for new pages
+ if self._editTime:
+ predata['wpEdittime'] = self._editTime
+ else:
+ predata['wpEdittime'] = time.strftime('%Y%m%d%H%M%S', time.gmtime())
+ if self._startTime:
+ predata['wpStarttime'] = self._startTime
+ else:
+ predata['wpStarttime'] = time.strftime('%Y%m%d%H%M%S', time.gmtime())
+ if self._revisionId:
+ predata['baseRevId'] = self._revisionId
+ # Pass the minorEdit and watchArticle arguments to the Wiki.
+ if minorEdit:
+ predata['wpMinoredit'] = '1'
+ if watchArticle:
+ predata['wpWatchthis'] = '1'
+ # Give the token, but only if one is supplied.
+ if token:
+ predata['wpEditToken'] = token
+
+ # Sorry, single-site exception...
+ if self.site().fam().name == 'loveto' and self.site().language() == 'recipes':
+ predata['masteredit'] = '1'
+
+ retry_delay = 1
+ retry_attempt = 0
+ dblagged = False
+ wait = 5
+ while True:
+ if (maxTries == 0):
+ raise MaxTriesExceededError()
+ maxTries -= 1
+ # Check whether we are not editing too quickly after the previous
+ # putPage, and wait a bit until the interval is acceptable
+ if not dblagged:
+ put_throttle()
+ # Which web-site host are we submitting to?
+ if newPage:
+ output(u'Creating page %s' % self.title(asLink=True))
+ else:
+ output(u'Changing page %s' % self.title(asLink=True))
+ # Submit the prepared information
+ try:
+ response, data = self.site().postForm(address, predata, sysop)
+ if response.code == 503:
+ if 'x-database-lag' in response.msg.keys():
+ # server lag; Mediawiki recommends waiting 5 seconds
+ # and retrying
+ if verbose:
+ output(data, newline=False)
+ output(u"Pausing %d seconds due to database server lag." % wait)
+ dblagged = True
+ time.sleep(wait)
+ wait = min(wait*2, 300)
+ continue
+ # Squid error 503
+ raise ServerError(response.code)
+ except httplib.BadStatusLine, line:
+ raise PageNotSaved('Bad status line: %s' % line.line)
+ except ServerError:
+ output(u''.join(traceback.format_exception(*sys.exc_info())))
+ retry_attempt += 1
+ if retry_attempt > config.maxretries:
+ raise
+ output(
+ u'Got a server error when putting %s; will retry in %i minute%s.'
+ % (self.title(asLink=True), retry_delay, retry_delay != 1 and "s" or ""))
+ time.sleep(60 * retry_delay)
+ retry_delay *= 2
+ if retry_delay > 30:
+ retry_delay = 30
+ continue
+ # If it has gotten this far then we should reset dblagged
+ dblagged = False
+ # Check blocks
+ self.site().checkBlocks(sysop = sysop)
+ # A second text area means that an edit conflict has occurred.
+ editconflict1 = re.compile('id=["\']wpTextbox2[\'"] name="wpTextbox2"')
+ editconflict2 = re.compile('name="wpTextbox2" id="wpTextbox2"')
+ if editconflict1.search(data) or editconflict2.search(data):
+ raise EditConflict(u'An edit conflict has occurred.')
+
+ # remove the wpAntispam keyword before checking for Spamfilter
+ data = re.sub(u'(?s)<label for="wpAntispam">.*?</label>', '', data)
+ if self.site().has_mediawiki_message("spamprotectiontitle")\
+ and self.site().mediawiki_message('spamprotectiontitle') in data:
+ try:
+ reasonR = re.compile(re.escape(self.site().mediawiki_message('spamprotectionmatch')).replace('\$1', '(?P<url>[^<]*)'))
+ url = reasonR.search(data).group('url')
+ except:
+ # Some wikis have modified the spamprotectionmatch
+ # template in a way that the above regex doesn't work,
+ # e.g. on he.wikipedia the template includes a
+ # wikilink, and on fr.wikipedia there is bold text.
+ # This is a workaround for this: it takes the region
+ # which should contain the spamfilter report and the
+ # URL. It then searches for a plaintext URL.
+ relevant = data[data.find('<!-- start content -->')+22:data.find('<!-- end content -->')].strip()
+ # Throw away all the other links etc.
+ relevant = re.sub('<.*?>', '', relevant)
+ relevant = relevant.replace('&#58;', ':')
+ # MediaWiki only spam-checks HTTP links, and only the
+ # domain name part of the URL.
+ m = re.search('http://[\w\-\.]+', relevant)
+ if m:
+ url = m.group()
+ else:
+ # Can't extract the exact URL. Let the user search.
+ url = relevant
+ raise SpamfilterError(url)
+ if '<label for=\'wpRecreate\'' in data:
+ # Make sure your system clock is correct if this error occurs
+ # without any reason!
+ # raise EditConflict(u'Someone deleted the page.')
+ # No raise, simply define these variables and retry:
+ if self._editTime:
+ predata['wpEdittime'] = self._editTime
+ else:
+ predata['wpEdittime'] = time.strftime('%Y%m%d%H%M%S', time.gmtime())
+ if self._startTime:
+ predata['wpStarttime'] = self._startTime
+ else:
+ predata['wpStarttime'] = time.strftime('%Y%m%d%H%M%S', time.gmtime())
+ continue
+ if self.site().has_mediawiki_message("viewsource")\
+ and self.site().mediawiki_message('viewsource') in data:
+ # The page is locked. This should have already been
+ # detected when getting the page, but there are some
+ # reasons why this didn't work, e.g. the page might be
+ # locked via a cascade lock.
+ try:
+ # Page is locked - try using the sysop account, unless we're using one already
+ if sysop:
+ # Unknown permissions error
+ raise LockedPage()
+ else:
+ self.site().forceLogin(sysop = True)
+ output(u'Page is locked, retrying using sysop account.')
+ return self._putPageOld(text, comment, watchArticle, minorEdit, newPage, token=self.site().getToken(sysop = True), sysop = True)
+ except NoUsername:
+ raise LockedPage()
+ if not newToken and "<textarea" in data:
+ ##if "<textarea" in data: # for debug use only, if badtoken still happen
+ # We might have been using an outdated token
+ output(u"Changing page has failed. Retrying.")
+ return self._putPageOld(text, comment, watchArticle, minorEdit, newPage, token=self.site().getToken(sysop = sysop, getagain = True), newToken = True, sysop = sysop)
+ # I think the error message title was changed from "Wikimedia Error"
+ # to "Wikipedia has a problem", but I'm not sure. Maybe we could
+ # just check for HTTP Status 500 (Internal Server Error)?
+ if ("<title>Wikimedia Error</title>" in data or "has a problem</title>" in data) \
+ or response.code == 500:
+ output(u"Server error encountered; will retry in %i minute%s."
+ % (retry_delay, retry_delay != 1 and "s" or ""))
+ time.sleep(60 * retry_delay)
+ retry_delay *= 2
+ if retry_delay > 30:
+ retry_delay = 30
+ continue
+ if ("1213: Deadlock found when trying to get lock" in data):
+ output(u"Deadlock error encountered; will retry in %i minute%s."
+ % (retry_delay, retry_delay != 1 and "s" or ""))
+ time.sleep(60 * retry_delay)
+ retry_delay *= 2
+ if retry_delay > 30:
+ retry_delay = 30
+ continue
+ if self.site().mediawiki_message('readonly') in data or self.site().mediawiki_message('readonly_lag') in data:
+ output(u"The database is currently locked for write access; will retry in %i minute%s."
+ % (retry_delay, retry_delay != 1 and "s" or ""))
+ time.sleep(60 * retry_delay)
+ retry_delay *= 2
+ if retry_delay > 30:
+ retry_delay = 30
+ continue
+ if self.site().has_mediawiki_message('longpageerror'):
+ # FIXME: Long page error detection isn't working in Vietnamese Wikipedia.
+ long_page_errorR = re.compile(
+ # Some wikis (e.g. Lithuanian and Slovak Wikipedia) use {{plural}} in
+ # [[MediaWiki:longpageerror]]
+ re.sub(r'\\{\\{plural\\:.*?\\}\\}', '.*?',
+ re.escape(
+ html2unicode(
+ self.site().mediawiki_message('longpageerror')
+ )
+ )
+ ).replace("\$1", "(?P<length>[\d,.\s]+)", 1).replace("\$2", "(?P<limit>[\d,.\s]+)", 1),
+ re.UNICODE)
+
+ match = long_page_errorR.search(data)
+ if match:
+ # Some wikis (e.g. Lithuanian Wikipedia) don't use $2 parameter in
+ # [[MediaWiki:longpageerror]]
+ longpage_length = 0 ; longpage_limit = 0
+ if 'length' in match.groups():
+ longpage_length = match.group('length')
+ if 'limit' in match.groups():
+ longpage_limit = match.group('limit')
+ raise LongPageError(longpage_length, longpage_limit)
+
+ # We might have been prompted for a captcha if the
+ # account is not autoconfirmed, checking....
+ ## output('%s' % data) # WHY?
+ solve = self.site().solveCaptcha(data)
+ if solve:
+ return self._putPageOld(text, comment, watchArticle, minorEdit, newPage, token, newToken, sysop, captcha=solve)
+
+ # We are expecting a 302 to the action=view page. I'm not sure why this was removed in r5019
+ if response.code != 302 and data.strip() != u"":
+ # Something went wrong, and we don't know what. Show the
+ # HTML code that hopefully includes some error message.
+ output(u"ERROR: Unexpected response from wiki server.")
+ output(u" %s (%s) " % (response.code, response.msg))
+ output(data)
+ # Unexpected responses should raise an error and not pass,
+ # be it silently or loudly; this should raise an exception.
+
+ if 'name="wpTextbox1"' in data and 'var wgAction = "submit"' in data:
+ # We are on the preview page, so the page was not saved
+ raise PageNotSaved
+
+ return response.code, response.msg, data
+
+ ## @since r10311
+ # @remarks to support appending to single sections
+ def append(self, newtext, comment=None, minorEdit=True, section=0):
+ """Append the wiki-text to the page.
+
+ Appends the text to page section number 'section' and returns the result.
+ Use 0 for the top section, or 'new' for a new section (end of page).
+ """
+
+ # If no comment is given for the change, use the default
+ comment = comment or pywikibot.action
+
+ # send data by POST request
+ params = {
+ 'action' : 'edit',
+ 'title' : self.title(),
+ 'section' : '%s' % section,
+ 'appendtext' : self._encodeArg(newtext, 'text'),
+ 'token' : self.site().getToken(),
+ 'summary' : self._encodeArg(comment, 'summary'),
+ 'bot' : 1,
+ }
+
+ if minorEdit:
+ params['minor'] = 1
+ else:
+ params['notminor'] = 1
+
+ response, data = query.GetData(params, self.site(), back_response = True)
+
+ if not (data['edit']['result'] == u"Success"):
+ raise PageNotSaved('Bad result returned: %s' % data['edit']['result'])
+
+ return response.code, response.msg, data
+
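+ # Illustrative sketch (assumption: a talk Page object named talk): adding a
+ # new section at the end of the page with append(), per the docstring above.
+ #
+ #     talk.append(u'\n== Bot notice ==\nSome message. --~~~~',
+ #                 comment=u'Bot: new section', section='new')
+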
+ def protection(self):
+ """Return a list of dicts describing this page's protection levels, e.g.:
+ [{u'expiry': u'2010-05-26T14:41:51Z', u'type': u'edit', u'level': u'autoconfirmed'}, {u'expiry': u'2010-05-26T14:41:51Z', u'type': u'move', u'level': u'sysop'}]
+
+ If the page is not protected, return [].
+ """
+
+ params = {
+ 'action': 'query',
+ 'prop' : 'info',
+ 'inprop': 'protection',
+ 'titles' : self.title(),
+ }
+
+ datas = query.GetData(params, self.site())
+ data=datas['query']['pages'].values()[0]['protection']
+ return data
+
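+ # Illustrative sketch (assumption: a Page object named page): iterating the
+ # protection entries returned above.
+ #
+ #     for p in page.protection():
+ #         output(u'%s: %s (expires %s)'
+ #                % (p['type'], p['level'], p['expiry']))
+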
+ def interwiki(self):
+ """Return a list of interwiki links in the page text.
+
+ This will retrieve the page to do its work, so it can raise
+ the same exceptions that are raised by the get() method.
+
+ The return value is a list of Page objects for each of the
+ interwiki links in the page text.
+
+ """
+ if hasattr(self, "_interwikis"):
+ return self._interwikis
+
+ text = self.get()
+
+ # Replace {{PAGENAME}} by its value
+ for pagenametext in self.site().pagenamecodes(
+ self.site().language()):
+ text = text.replace(u"{{%s}}" % pagenametext, self.title())
+
+ ll = getLanguageLinks(text, insite=self.site(), pageLink=self.title(asLink=True))
+
+ result = ll.values()
+
+ self._interwikis = result
+ return result
+
+
+
+ def categories(self, get_redirect=False, api=False):
+ """Return a list of Category objects that the article is in.
+ Please be aware: the API call also returns categories which are included
+ by templates. This differs from the old non-API code. If you need only
+ the categories which appear in the page text, please use getCategoryLinks
+ (or set api=False, but this may be deprecated in the future).
+ """
+ if not (self.site().has_api() and api):
+ try:
+ category_links_to_return = getCategoryLinks(self.get(get_redirect=get_redirect), self.site())
+ except NoPage:
+ category_links_to_return = []
+ return category_links_to_return
+
+ else:
+ import catlib
+ params = {
+ 'action': 'query',
+ 'prop' : 'categories',
+ 'titles' : self.title(),
+ }
+ if not self.site().isAllowed('apihighlimits') and config.special_page_limit > 500:
+ params['cllimit'] = 500
+
+ output(u'Getting categories in %s via API...' % self.title(asLink=True))
+ allDone = False
+ cats=[]
+ while not allDone:
+ datas = query.GetData(params, self.site())
+ data=datas['query']['pages'].values()[0]
+ if "categories" in data:
+ for c in data['categories']:
+ if c['ns'] == 14:
+ cat = catlib.Category(self.site(), c['title'])
+ cats.append(cat)
+
+ if 'query-continue' in datas:
+ if 'categories' in datas['query-continue']:
+ params['clcontinue'] = datas['query-continue']['categories']['clcontinue']
+ else:
+ allDone = True
+ return cats
+
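+ # Illustrative sketch (assumption: a Page object named page): comparing the
+ # two modes described in the docstring above.
+ #
+ #     text_cats = page.categories()         # categories in the wikitext only
+ #     all_cats = page.categories(api=True)  # also template-included categories
+ #     for cat in all_cats:
+ #         output(cat.title())
+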
+ def linkedPages(self, withImageLinks = False):
+ """Return a list of Pages that this Page links to.
+
+ Excludes interwiki and category links, and also image links by default.
+ """
+ result = []
+ try:
+ thistxt = removeLanguageLinks(self.get(get_redirect=True),
+ self.site())
+ except NoPage:
+ raise
+ except IsRedirectPage:
+ raise
+ except SectionError:
+ return []
+ thistxt = removeCategoryLinks(thistxt, self.site())
+
+ # remove HTML comments, pre, nowiki, and includeonly sections
+ # from text before processing
+ thistxt = removeDisabledParts(thistxt)
+
+ # resolve {{ns:-1}} or {{ns:Help}}
+ thistxt = self.site().resolvemagicwords(thistxt)
+
+ for match in Rlink.finditer(thistxt):
+ title = match.group('title')
+ title = title.replace("_", " ").strip(" ")
+ if title.startswith("#"):
+ # this is an internal section link
+ continue
+ if not self.site().isInterwikiLink(title):
+ try:
+ page = Page(self.site(), title)
+ try:
+ hash(str(page))
+ except Exception:
+ raise Error(u"Page %s contains invalid link to [[%s]]."
+ % (self.title(), title))
+ except Error:
+ if verbose:
+ output(u"Page %s contains invalid link to [[%s]]."
+ % (self.title(), title))
+ continue
+ if not withImageLinks and page.isImage():
+ continue
+ if page.sectionFreeTitle() and page not in result:
+ result.append(page)
+ return result
+
+ def imagelinks(self, followRedirects=False, loose=False):
+ """Return a list of ImagePage objects for images displayed on this Page.
+
+ Includes images in galleries.
+ If loose is True, this will find anything that looks like it
+ could be an image. This is useful for finding, say, images that are
+ passed as parameters to templates.
+
+ """
+ results = []
+ # Find normal images
+ for page in self.linkedPages(withImageLinks = True):
+ if page.isImage():
+ # convert Page object to ImagePage object
+ results.append( ImagePage(page.site(), page.title()) )
+ # Find images in galleries
+ pageText = self.get(get_redirect=followRedirects)
+ galleryR = re.compile('<gallery>.*?</gallery>', re.DOTALL)
+ galleryEntryR = re.compile('(?P<title>(%s|%s):.+?)(\|.+)?\n' % (self.site().image_namespace(), self.site().family.image_namespace(code = '_default')))
+ for gallery in galleryR.findall(pageText):
+ for match in galleryEntryR.finditer(gallery):
+ results.append( ImagePage(self.site(), match.group('title')) )
+ if loose:
+ ns = getSite().image_namespace()
+ imageR = re.compile('\w\w\w+\.(?:gif|png|jpg|jpeg|svg|JPG|xcf|pdf|mid|ogg|djvu)', re.IGNORECASE)
+ for imageName in imageR.findall(pageText):
+ results.append( ImagePage(self.site(), imageName) )
+ return list(set(results))
+
+ def templates(self, get_redirect=False):
+ """Return a list of titles (unicode) of templates used on this Page.
+
+ Template parameters are ignored.
+ """
+ if not hasattr(self, "_templates"):
+ self._templates = list(set([template
+ for (template, param)
+ in self.templatesWithParams(
+ get_redirect=get_redirect)]))
+ return self._templates
+
+ def templatesWithParams(self, thistxt=None, get_redirect=False):
+ """Return a list of templates used on this Page.
+
+ Return value is a list of tuples. There is one tuple for each use of
+ a template in the page, with the template title as the first entry
+ and a list of parameters as the second entry.
+
+ If thistxt is set, it is used instead of current page content.
+ """
+ if not thistxt:
+ try:
+ thistxt = self.get(get_redirect=get_redirect)
+ except (IsRedirectPage, NoPage):
+ return []
+
+ # remove commented-out stuff etc.
+ thistxt = removeDisabledParts(thistxt)
+
+ # marker for inside templates or parameters
+ marker = findmarker(thistxt, u'@@', u'@')
+
+ # marker for links
+ marker2 = findmarker(thistxt, u'##', u'#')
+
+ # marker for math
+ marker3 = findmarker(thistxt, u'%%', u'%')
+
+ result = []
+ inside = {}
+ count = 0
+ Rtemplate = re.compile(
+ ur'{{(msg:)?(?P<name>[^{\|]+?)(\|(?P<params>[^{]*?))?}}')
+ Rlink = re.compile(ur'\[\[[^\]]+\]\]')
+ Rmath = re.compile(ur'<math>[^<]+</math>')
+ Rmarker = re.compile(ur'%s(\d+)%s' % (marker, marker))
+ Rmarker2 = re.compile(ur'%s(\d+)%s' % (marker2, marker2))
+ Rmarker3 = re.compile(ur'%s(\d+)%s' % (marker3, marker3))
+
+ # Replace math with markers
+ maths = {}
+ count = 0
+ for m in Rmath.finditer(thistxt):
+ count += 1
+ text = m.group()
+ thistxt = thistxt.replace(text, '%s%d%s' % (marker3, count, marker3))
+ maths[count] = text
+
+ while Rtemplate.search(thistxt) is not None:
+ for m in Rtemplate.finditer(thistxt):
+ # Make sure it is not detected again
+ count += 1
+ text = m.group()
+ thistxt = thistxt.replace(text,
+ '%s%d%s' % (marker, count, marker))
+ # Make sure stored templates don't contain markers
+ for m2 in Rmarker.finditer(text):
+ text = text.replace(m2.group(), inside[int(m2.group(1))])
+ for m2 in Rmarker3.finditer(text):
+ text = text.replace(m2.group(), maths[int(m2.group(1))])
+ inside[count] = text
+
+ # Name
+ name = m.group('name').strip()
+ m2 = Rmarker.search(name) or Rmath.search(name)
+ if m2 is not None:
+ # Doesn't detect templates whose name changes,
+ # or templates whose name contains math tags
+ continue
+ if self.site().isInterwikiLink(name):
+ continue
+
+ # {{#if: }}
+ if name.startswith('#'):
+ continue
+ # {{DEFAULTSORT:...}}
+ defaultKeys = self.site().versionnumber() > 13 and \
+ self.site().getmagicwords('defaultsort')
+ # It seems some wikis do not have this magic key
+ if defaultKeys:
+ found = False
+ for key in defaultKeys:
+ if name.startswith(key):
+ found = True
+ break
+ if found: continue
+
+ try:
+ name = Page(self.site(), name).title()
+ except InvalidTitle:
+ if name:
+ output(
+ u"Page %s contains invalid template name {{%s}}."
+ % (self.title(), name.strip()))
+ continue
+ # Parameters
+ paramString = m.group('params')
+ params = []
+ if paramString:
+ # Replace links to markers
+ links = {}
+ count2 = 0
+ for m2 in Rlink.finditer(paramString):
+ count2 += 1
+ text = m2.group()
+ paramString = paramString.replace(text,
+ '%s%d%s' % (marker2, count2, marker2))
+ links[count2] = text
+ # Parse string
+ markedParams = paramString.split('|')
+ # Replace markers
+ for param in markedParams:
+ for m2 in Rmarker.finditer(param):
+ param = param.replace(m2.group(),
+ inside[int(m2.group(1))])
+ for m2 in Rmarker2.finditer(param):
+ param = param.replace(m2.group(),
+ links[int(m2.group(1))])
+ for m2 in Rmarker3.finditer(param):
+ param = param.replace(m2.group(),
+ maths[int(m2.group(1))])
+ params.append(param)
+
+ # Add it to the result
+ result.append((name, params))
+ return result
+
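+ # Illustrative sketch of the return value (hypothetical page content): a page
+ # containing {{Infobox person|name=Foo|1914}} would typically yield
+ #     [(u'Infobox person', [u'name=Foo', u'1914'])]
+ # from templatesWithParams(), and templates() would then give
+ #     [u'Infobox person']
+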
+ def getRedirectTarget(self):
+ """Return a Page object for the target this Page redirects to.
+
+ If this page is not a redirect page, will raise an IsNotRedirectPage
+ exception. This method also can raise a NoPage exception.
+
+ """
+ try:
+ self.get()
+ except NoPage:
+ raise
+ except IsRedirectPage, err:
+ # otherwise it will return error pages with " inside.
+ target = err[0].replace('&quot;', '"')
+
+ if '|' in target:
+ warnings.warn("'%s' has a | character, this makes no sense"
+ % target, Warning)
+ return Page(self.site(), target)
+ else:
+ raise IsNotRedirectPage(self)
+
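+ # Illustrative sketch (assumption: a Page object named page):
+ #
+ #     if page.isRedirectPage():
+ #         target = page.getRedirectTarget()
+ #         output(u'%s redirects to %s' % (page.title(), target.title()))
+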
+ def getVersionHistory(self, forceReload=False, reverseOrder=False,
+ getAll=False, revCount=500):
+ """Load the version history page and return history information.
+
+ Return value is a list of tuples, where each tuple represents one
+ edit and is built of revision id, edit date/time, user name,
+ edit summary, size and tags. Starts with the most current revision,
+ unless reverseOrder is True.
+ Defaults to getting the first revCount edits, unless getAll is True.
+
+ @param revCount: iterate no more than this number of revisions in total
+ """
+
+ # regular expression matching one edit in the version history.
+ # results will have 4 groups: oldid, edit date/time, user name, and edit
+ # summary.
+ thisHistoryDone = False
+ skip = False # Used in determining whether we need to skip the first page
+ dataQuery = []
+ hasData = False
+
+
+ # Are we getting by Earliest first?
+ if reverseOrder:
+ # Check if _versionhistoryearliest exists
+ if not hasattr(self, '_versionhistoryearliest') or forceReload:
+ self._versionhistoryearliest = []
+ elif getAll and len(self._versionhistoryearliest) == revCount:
+ # Cause a reload, or at least make the loop run
+ thisHistoryDone = False
+ skip = True
+ dataQuery = self._versionhistoryearliest
+ else:
+ thisHistoryDone = True
+ elif not hasattr(self, '_versionhistory') or forceReload or \
+ len(self._versionhistory) < revCount:
+ self._versionhistory = []
+ # ?? does not reload if len(self._versionhistory) > revCount;
+ # shouldn't it?
+ elif getAll and len(self._versionhistory) == revCount:
+ # Cause a reload, or at least make the loop run
+ thisHistoryDone = False
+ skip = True
+ dataQuery = self._versionhistory
+ else:
+ thisHistoryDone = True
+
+ if not thisHistoryDone:
+ dataQuery.extend(self._getVersionHistory(getAll, skip, reverseOrder, revCount))
+
+ if reverseOrder:
+ # Return only revCount edits, even if the version history is extensive
+ if dataQuery != []:
+ self._versionhistoryearliest = dataQuery
+ del dataQuery
+ if len(self._versionhistoryearliest) > revCount and not getAll:
+ return self._versionhistoryearliest[:revCount]
+ return self._versionhistoryearliest
+
+ if dataQuery != []:
+ self._versionhistory = dataQuery
+ del dataQuery
+ # Return only revCount edits, even if the version history is extensive
+ if len(self._versionhistory) > revCount and not getAll:
+ return self._versionhistory[:revCount]
+ return self._versionhistory
+
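+ # Illustrative sketch (assumption: a Page object named page): each tuple
+ # holds revision id, timestamp, user, summary, size and tags, as noted above.
+ #
+ #     for oldid, timestamp, user, summary, size, tags in \
+ #             page.getVersionHistory(revCount=10):
+ #         output(u'%s %s %s' % (timestamp, user, summary))
+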
+ def _getVersionHistory(self, getAll=False, skipFirst=False, reverseOrder=False,
+ revCount=500):
+ """Load history information via an API query.
+ For internal use by self.getVersionHistory(); don't call this function directly.
+ """
+ if not self.site().has_api() or self.site().versionnumber() < 8:
+ return self._getVersionHistoryOld(getAll, skipFirst, reverseOrder, revCount)
+ dataQ = []
+ thisHistoryDone = False
+ params = {
+ 'action': 'query',
+ 'prop': 'revisions',
+ 'titles': self.title(),
+ 'rvprop': 'ids|timestamp|flags|comment|user|size|tags',
+ 'rvlimit': revCount,
+ }
+ while not thisHistoryDone:
+ if reverseOrder:
+ params['rvdir'] = 'newer'
+
+ result = query.GetData(params, self.site())
+ if 'error' in result:
+ raise RuntimeError("%s" % result['error'])
+ pageInfo = result['query']['pages'].values()[0]
+ if result['query']['pages'].keys()[0] == "-1":
+ if 'missing' in pageInfo:
+ raise NoPage(self.site(), unicode(self),
+ "Page does not exist.")
+ elif 'invalid' in pageInfo:
+ raise BadTitle('BadTitle: %s' % self)
+
+ if 'query-continue' in result and getAll:
+ params['rvstartid'] = result['query-continue']['revisions']['rvstartid']
+ else:
+ thisHistoryDone = True
+
+ if skipFirst:
+ skipFirst = False
+ else:
+ for r in pageInfo['revisions']:
+ c = ''
+ if 'comment' in r:
+ c = r['comment']
+ #revision id, edit date/time, user name, edit summary
+ (revidStrr, timestampStrr, userStrr) = (None, None, None)
+ if 'revid' in r:
+ revidStrr = r['revid']
+ if 'timestamp' in r:
+ timestampStrr = r['timestamp']
+ if 'user' in r:
+ userStrr = r['user']
+ s=-1 #Will return -1 if not found
+ if 'size' in r:
+ s = r['size']
+ tags=[]
+ if 'tags' in r:
+ tags = r['tags']
+ dataQ.append((revidStrr, timestampStrr, userStrr, c, s, tags))
+ if len(result['query']['pages'].values()[0]['revisions']) < revCount:
+ thisHistoryDone = True
+ return dataQ
+
+ def _getVersionHistoryOld(self, getAll = False, skipFirst = False,
+ reverseOrder = False, revCount=500):
+ """Load the version history page and return history information.
+ For internal use by self.getVersionHistory(); don't call this function directly.
+ """
+ dataQ = []
+ thisHistoryDone = False
+ startFromPage = None
+ if self.site().versionnumber() < 4:
+ editR = re.compile('<li>\(.*?\)\s+\(.*\).*?<a href=".*?oldid=([0-9]*)" title=".*?">([^<]*)</a> <span class=\'user\'><a href=".*?" title=".*?">([^<]*?)</a></span>.*?(?:<span class=\'comment\'>(.*?)</span>)?</li>')
+ elif self.site().versionnumber() < 15:
+ editR = re.compile('<li>\(.*?\)\s+\(.*\).*?<a href=".*?oldid=([0-9]*)" title=".*?">([^<]*)</a> (?:<span class=\'history-user\'>|)<a href=".*?" title=".*?">([^<]*?)</a>.*?(?:</span>|).*?(?:<span class=[\'"]comment[\'"]>(.*?)</span>)?</li>')
+ elif self.site().versionnumber() < 16:
+ editR = re.compile(r'<li class=".*?">\((?:\w*|<a[^<]*</a>)\)\s\((?:\w*|<a[^<]*</a>)\).*?<a href=".*?([0-9]*)" title=".*?">([^<]*)</a> <span class=\'history-user\'><a [^>]*?>([^<]*?)</a>.*?</span></span>(?: <span class="minor">.*?</span>|)(?: <span class="history-size">.*?</span>|)(?: <span class=[\'"]comment[\'"]>\((?:<span class="autocomment">|)(.*?)(?:</span>|)\)</span>)?(?: \(<span class="mw-history-undo">.*?</span>\)|)\s*</li>', re.UNICODE)
+ else:
+ editR = re.compile(r'<li(?: class="mw-tag[^>]+)?>\((?:\w+|<a[^<]*</a>)\)\s\((?:\w+|<a[^<]*</a>)\).*?<a href=".*?([0-9]*)" title=".*?">([^<]*)</a> <span class=\'history-user\'><a [^>]*?>([^<]*?)</a>.*?</span></span>(?: <abbr class="minor"[^>]*?>.*?</abbr>|)(?: <span class="history-size">.*?</span>|)(?: <span class="comment">\((?:<span class="autocomment">|)(.*?)(?:</span>|)\)</span>)?(?: \(<span class="mw-history-undo">.*?</span>\))?(?: <span class="mw-tag-markers">.*?</span>\)</span>)?\s*</li>', re.UNICODE)
+
+ RLinkToNextPage = re.compile('&offset=(.*?)&')
+
+ while not thisHistoryDone:
+ path = self.site().family.version_history_address(self.site().language(), self.urlname(), config.special_page_limit)
+
+ if reverseOrder:
+ path += '&dir=prev'
+
+ if startFromPage:
+ path += '&offset=' + startFromPage
+
+ # Try to retrieve the page until it is successfully loaded (just in
+ # case the server is down or overloaded); wait retry_idle_time minutes
+ # (growing!) between retries.
+ retry_idle_time = 1
+
+ if verbose:
+ if startFromPage:
+ output(u'Continuing to get version history of %s' % self)
+ else:
+ output(u'Getting version history of %s' % self)
+
+ txt = self.site().getUrl(path)
+
+ # save a copy of the text
+ self_txt = txt
+
+ # Find the next-page link; if it does not exist, this is the last history page
+ matchObj = RLinkToNextPage.search(self_txt)
+ if getAll and matchObj:
+ startFromPage = matchObj.group(1)
+ else:
+ thisHistoryDone = True
+
+ if not skipFirst:
+ edits = editR.findall(self_txt)
+
+ if skipFirst:
+ # Skip the first page only,
+ skipFirst = False
+ else:
+ if reverseOrder:
+ edits.reverse()
+ #for edit in edits:
+ dataQ.extend(edits)
+ if len(edits) < revCount:
+ thisHistoryDone = True
+ return dataQ
+
+ def getVersionHistoryTable(self, forceReload=False, reverseOrder=False,
+ getAll=False, revCount=500):
+ """Return the version history as a wiki table."""
+
+ result = '{| class="wikitable"\n'
+ result += '! oldid || date/time || size || username || edit summary\n'
+ for oldid, time, username, summary, size, tags \
+ in self.getVersionHistory(forceReload=forceReload,
+ reverseOrder=reverseOrder,
+ getAll=getAll, revCount=revCount):
+ result += '|----\n'
+ result += '| %s || %s || %d || %s || <nowiki>%s</nowiki>\n' \
+ % (oldid, time, size, username, summary)
+ result += '|}\n'
+ return result
+
+ def fullVersionHistory(self, getAll=False, skipFirst=False, reverseOrder=False,
+ revCount=500):
+ """Return previous versions including wikitext.
+
+ Gives a list of tuples consisting of revision ID, edit date/time, user name and
+ content
+
+ """
+ if not self.site().has_api() or self.site().versionnumber() < 8:
+ address = self.site().export_address()
+ predata = {
+ 'action': 'submit',
+ 'pages': self.title()
+ }
+ get_throttle(requestsize = 10)
+ now = time.time()
+ response, data = self.site().postForm(address, predata)
+ data = data.encode(self.site().encoding())
+# get_throttle.setDelay(time.time() - now)
+ output = []
+ # TODO: parse XML using an actual XML parser instead of regex!
+ r = re.compile("\<revision\>.*?\<id\>(?P<id>.*?)\<\/id\>.*?\<timestamp\>(?P<timestamp>.*?)\<\/timestamp\>.*?\<(?:ip|username)\>(?P<user>.*?)\</(?:ip|username)\>.*?\<text.*?\>(?P<content>.*?)\<\/text\>",re.DOTALL)
+ #r = re.compile("\<revision\>.*?\<timestamp\>(.*?)\<\/timestamp\>.*?\<(?:ip|username)\>(.*?)\<",re.DOTALL)
+ return [ (match.group('id'),
+ match.group('timestamp'),
+ unescape(match.group('user')),
+ unescape(match.group('content')))
+ for match in r.finditer(data) ]
+
+ # Load history information via an API query.
+
+ dataQ = []
+ thisHistoryDone = False
+ params = {
+ 'action': 'query',
+ 'prop': 'revisions',
+ 'titles': self.title(),
+ 'rvprop': 'ids|timestamp|user|content',
+ 'rvlimit': revCount,
+ }
+ while not thisHistoryDone:
+ if reverseOrder:
+ params['rvdir'] = 'newer'
+
+ result = query.GetData(params, self.site())
+ if 'error' in result:
+ raise RuntimeError("%s" % result['error'])
+ pageInfo = result['query']['pages'].values()[0]
+ if result['query']['pages'].keys()[0] == "-1":
+ if 'missing' in pageInfo:
+ raise NoPage(self.site(), unicode(self),
+ "Page does not exist.")
+ elif 'invalid' in pageInfo:
+ raise BadTitle('BadTitle: %s' % self)
+
+ if 'query-continue' in result and getAll:
+ params['rvstartid'] = result['query-continue']['revisions']['rvstartid']
+ else:
+ thisHistoryDone = True
+
+ if skipFirst:
+ skipFirst = False
+ else:
+ for r in pageInfo['revisions']:
+ c = ''
+ if 'comment' in r:
+ c = r['comment']
+ #revision id, edit date/time, user name, edit summary
+ (revidStrr, timestampStrr, userStrr) = (None, None, None)
+ if 'revid' in r:
+ revidStrr = r['revid']
+ if 'timestamp' in r:
+ timestampStrr = r['timestamp']
+ if 'user' in r:
+ userStrr = r['user']
+ s = '' # will stay empty if the revision text was not returned
+ if '*' in r:
+ s = r['*']
+ dataQ.append((revidStrr, timestampStrr, userStrr, s))
+ if len(result['query']['pages'].values()[0]['revisions']) < revCount:
+ thisHistoryDone = True
+ return dataQ
+
+ def contributingUsers(self, step=None, total=None):
+ """Return a set of usernames (or IPs) of users who edited this page.
+
+ @param step: limit each API call to this number of revisions
+ - not used yet, only in rewrite branch -
+ @param total: iterate no more than this number of revisions in total
+
+ """
+ if total is None:
+ total = 500 #set to default of getVersionHistory
+ edits = self.getVersionHistory(revCount=total)
+ users = set([edit[2] for edit in edits])
+ return users
+
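+ # Illustrative sketch (assumption: a Page object named page):
+ #
+ #     if page.site().loggedInAs() in page.contributingUsers(total=100):
+ #         output(u'The bot account edited one of the last 100 revisions.')
+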
+ def getCreator(self):
+ """ Get the first editor and timestamp of a page """
+ inf = self.getVersionHistory(reverseOrder=True, revCount=1)[0]
+ return inf[2], inf[1]
+
+ def getLatestEditors(self, limit=1):
+ """ Function to get the last editors of a page """
+ #action=query&prop=revisions&titles=API&rvprop=timestamp|user|comment
+ if hasattr(self, '_versionhistory'):
+ data = self.getVersionHistory(getAll=True, revCount=limit)
+ else:
+ data = self.getVersionHistory(revCount = limit)
+
+ result = []
+ for i in data:
+ result.append({'user':i[2], 'timestamp':i[1]})
+ return result
+
+ def watch(self, unwatch=False):
+ """Add this page to the watchlist (or remove it if unwatch is True)."""
+ if self.site().has_api():
+ params = {
+ 'action': 'watch',
+ 'title': self.title()
+ }
+ # watchtoken is needed for mw 1.18
+ # TODO: Find a better implementation for other actions
+ # which also need a token
+ if self.site().versionnumber() >= 18:
+ api = {
+ 'action': 'query',
+ 'prop': 'info',
+ 'titles' : self.title(),
+ 'intoken' : 'watch',
+ }
+ data = query.GetData(api, self.site())
+ params['token'] = data['query']['pages'].values()[0]['watchtoken']
+ if unwatch:
+ params['unwatch'] = ''
+
+ data = query.GetData(params, self.site())
+ if 'error' in data:
+ raise RuntimeError("API query error: %s" % data['error'])
+ else:
+ urlname = self.urlname()
+ if not unwatch:
+ address = self.site().watch_address(urlname)
+ else:
+ address = self.site().unwatch_address(urlname)
+ response = self.site().getUrl(address)
+ return response
+
+ def unwatch(self):
+ self.watch(unwatch=True)
+
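+ # Illustrative sketch (assumption: a Page object named page):
+ #
+ #     page.watch()              # add the page to the bot account's watchlist
+ #     page.watch(unwatch=True)  # remove it again; same as page.unwatch()
+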
+ def move(self, newtitle, reason=None, movetalkpage=True, movesubpages=False,
+ sysop=False, throttle=True, deleteAndMove=False, safe=True,
+ fixredirects=True, leaveRedirect=True):
+ """Move this page to new title.
+
+ * fixredirects has no effect in MW < 1.13
+
+ @param newtitle: The new page title.
+ @param reason: The edit summary for the move.
+ @param movetalkpage: If true, move this page's talk page (if it exists)
+ @param sysop: Try to move using sysop account, if available
+ @param deleteAndMove: if move succeeds, delete the old page
+ (usually requires sysop privileges, depending on wiki settings)
+ @param safe: If false, attempt to delete existing page at newtitle
+ (if there is one) and then move this page to that title
+
+ """
+ if not self.site().has_api() or self.site().versionnumber() < 12:
+ return self._moveOld(newtitle, reason, movetalkpage, sysop,
+ throttle, deleteAndMove, safe, fixredirects, leaveRedirect)
+ # Login
+ try:
+ self.get()
+ except:
+ pass
+ sysop = self._getActionUser(action = 'move', restriction = self.moveRestriction, sysop = False)
+ if deleteAndMove:
+ sysop = self._getActionUser(action = 'delete', restriction = '', sysop = True)
+ Page(self.site(), newtitle).delete(self.site().mediawiki_message('delete_and_move_reason'), False, False)
+
+ # Check blocks
+ self.site().checkBlocks(sysop = sysop)
+
+ if throttle:
+ put_throttle()
+ if reason is None:
+ pywikibot.output(u'Moving %s to [[%s]].'
+ % (self.title(asLink=True), newtitle))
+ reason = input(u'Please enter a reason for the move:')
+ if self.isTalkPage():
+ movetalkpage = False
+
+ params = {
+ 'action': 'move',
+ 'from': self.title(),
+ 'to': newtitle,
+ 'token': self.site().getToken(sysop=sysop),
+ 'reason': reason,
+ }
+ if movesubpages:
+ params['movesubpages'] = 1
+
+ if movetalkpage:
+ params['movetalk'] = 1
+
+ if not leaveRedirect:
+ params['noredirect'] = 1
+
+ result = query.GetData(params, self.site(), sysop=sysop)
+ if 'error' in result:
+ err = result['error']['code']
+ if err == 'articleexists':
+ if safe:
+ output(u'Page move failed: Target page [[%s]] already exists.' % newtitle)
+ else:
+ try:
+ # Try to delete and move
+ return self.move(newtitle, reason, movetalkpage, movesubpages, throttle = throttle, deleteAndMove = True)
+ except NoUsername:
+ # We don't have the user rights to delete
+ output(u'Page move failed: Target page [[%s]] already exists.' % newtitle)
+ #elif err == 'protectedpage':
+ #
+ else:
+ output("Unknown Error: %s" % result)
+ return False
+ elif 'move' in result:
+ if deleteAndMove:
+ output(u'Page %s moved to %s, deleting the existing page' % (self.title(), newtitle))
+ else:
+ output(u'Page %s moved to %s' % (self.title(), newtitle))
+
+ if hasattr(self, '_contents'):
+ #self.__init__(self.site(), newtitle, defaultNamespace = self._namespace)
+ try:
+ self.get(force=True, get_redirect=True, throttle=False)
+ except NoPage:
+ output(u'Page %s has been moved and no longer exists.' % self.title())
+ #delattr(self, '_contents')
+ return True
+
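+ # Illustrative sketch (hypothetical titles): renaming a page together with
+ # its talk page while leaving a redirect behind.
+ #
+ #     page = Page(getSite(), u'Old title')
+ #     page.move(u'New title', reason=u'Bot: renaming per discussion',
+ #               movetalkpage=True, leaveRedirect=True)
+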
+ def _moveOld(self, newtitle, reason=None, movetalkpage=True, movesubpages=False, sysop=False,
+ throttle=True, deleteAndMove=False, safe=True, fixredirects=True, leaveRedirect=True):
+
+ # Login
+ try:
+ self.get()
+ except:
+ pass
+ sysop = self._getActionUser(action = 'move', restriction = self.moveRestriction, sysop = False)
+ if deleteAndMove:
+ sysop = self._getActionUser(action = 'delete', restriction = '', sysop = True)
+
+ # Check blocks
+ self.site().checkBlocks(sysop = sysop)
+
+ if throttle:
+ put_throttle()
+ if reason is None:
+ reason = input(u'Please enter a reason for the move:')
+ if self.isTalkPage():
+ movetalkpage = False
+
+ host = self.site().hostname()
+ address = self.site().move_address()
+ token = self.site().getToken(sysop = sysop)
+ predata = {
+ 'wpOldTitle': self.title().encode(self.site().encoding()),
+ 'wpNewTitle': newtitle.encode(self.site().encoding()),
+ 'wpReason': reason.encode(self.site().encoding()),
+ }
+ if deleteAndMove:
+ predata['wpDeleteAndMove'] = self.site().mediawiki_message('delete_and_move_confirm')
+ predata['wpConfirm'] = '1'
+
+ if movetalkpage:
+ predata['wpMovetalk'] = '1'
+ else:
+ predata['wpMovetalk'] = '0'
+
+ if self.site().versionnumber() >= 13:
+ if fixredirects:
+ predata['wpFixRedirects'] = '1'
+ else:
+ predata['wpFixRedirects'] = '0'
+
+ if leaveRedirect:
+ predata['wpLeaveRedirect'] = '1'
+ else:
+ predata['wpLeaveRedirect'] = '0'
+
+ if movesubpages:
+ predata['wpMovesubpages'] = '1'
+ else:
+ predata['wpMovesubpages'] = '0'
+
+ if token:
+ predata['wpEditToken'] = token
+
+ response, data = self.site().postForm(address, predata, sysop = sysop)
+
+ if data == u'' or self.site().mediawiki_message('pagemovedsub') in data:
+ #Move Success
+ if deleteAndMove:
+ output(u'Page %s moved to %s, deleting the existing page' % (self.title(), newtitle))
+ else:
+ output(u'Page %s moved to %s' % (self.title(), newtitle))
+
+ if hasattr(self, '_contents'):
+ #self.__init__(self.site(), newtitle, defaultNamespace = self._namespace)
+ try:
+ self.get(force=True, get_redirect=True, throttle=False)
+ except NoPage:
+                    output(u'Page %s has been moved and no longer exists.' % self.title())
+ #delattr(self, '_contents')
+
+ return True
+ else:
+ #Move Failure
+ self.site().checkBlocks(sysop = sysop)
+ if self.site().mediawiki_message('articleexists') in data or self.site().mediawiki_message('delete_and_move') in data:
+ if safe:
+ output(u'Page move failed: Target page [[%s]] already exists.' % newtitle)
+ return False
+ else:
+ try:
+ # Try to delete and move
+ return self._moveOld(newtitle, reason, movetalkpage, movesubpages, throttle = throttle, deleteAndMove = True)
+                    except NoUsername:
+                        # We don't have the user rights to delete
+                        output(u'Page move failed: Target page [[%s]] already exists.' % newtitle)
+ return False
+ elif not self.exists():
+                raise NoPage(u'Page move failed: Source page [[%s]] does not exist.' % self.title())
+ elif Page(self.site(),newtitle).exists():
+                # XXX: This might be buggy: if the move was successful, the target page *has* been created
+ raise PageNotSaved(u'Page move failed: Target page [[%s]] already exists.' % newtitle)
+ else:
+ output(u'Page move failed for unknown reason.')
+ try:
+ ibegin = data.index('<!-- start content -->') + 22
+ iend = data.index('<!-- end content -->')
+ except ValueError:
+ # if begin/end markers weren't found, show entire HTML file
+ output(data)
+ else:
+ # otherwise, remove the irrelevant sections
+ data = data[ibegin:iend]
+ output(data)
+ return False
+
+ def delete(self, reason=None, prompt=True, throttle=True, mark=False):
+ """Deletes the page from the wiki. Requires administrator status.
+
+ @param reason: The edit summary for the deletion. If None, ask for it.
+ @param prompt: If true, prompt user for confirmation before deleting.
+ @param mark: if true, and user does not have sysop rights, place a
+ speedy-deletion request on the page instead.
+
+ """
+ # Login
+ try:
+ self._getActionUser(action = 'delete', sysop = True)
+ except NoUsername:
+ if mark and self.exists():
+ text = self.get(get_redirect = True)
+ output(u'Cannot delete page %s - marking the page for deletion instead:' % self.title(asLink=True))
+ # Note: Parameters to {{delete}}, and their meanings, vary from one Wikipedia to another.
+                # If you want or need to use them, be careful not to break other wikis; otherwise, leave them out.
+ self.put(u'{{delete|bot=yes}}\n%s --~~~~\n----\n\n%s' % (reason, text), comment = reason)
+ return
+ else:
+ raise
+
+ # Check blocks
+ self.site().checkBlocks(sysop = True)
+
+ if throttle:
+ put_throttle()
+ if reason is None:
+ output(u'Deleting %s.' % (self.title(asLink=True)))
+ reason = input(u'Please enter a reason for the deletion:')
+ answer = u'y'
+ if prompt and not hasattr(self.site(), '_noDeletePrompt'):
+ answer = inputChoice(u'Do you want to delete %s?' % self,
+ ['yes', 'no', 'all'], ['y', 'N', 'a'], 'N')
+ if answer == 'a':
+ answer = 'y'
+ self.site()._noDeletePrompt = True
+ if answer == 'y':
+
+ token = self.site().getToken(self, sysop = True)
+ reason = reason.encode(self.site().encoding())
+
+ if self.site().has_api() and self.site().versionnumber() >= 12:
+ #API Mode
+ params = {
+ 'action': 'delete',
+ 'title': self.title(),
+ 'token': token,
+ 'reason': reason,
+ }
+ datas = query.GetData(params, self.site(), sysop = True)
+ if 'delete' in datas:
+ output(u'Page %s deleted' % self)
+ return True
+ else:
+ if datas['error']['code'] == 'missingtitle':
+ output(u'Page %s could not be deleted - it doesn\'t exist'
+ % self)
+ else:
+ output(u'Deletion of %s failed for an unknown reason. The response text is:'
+ % self)
+ output('%s' % datas)
+
+ return False
+ else:
+ #Ordinary mode from webpage.
+ host = self.site().hostname()
+ address = self.site().delete_address(self.urlname())
+
+ predata = {
+ 'wpDeleteReasonList': 'other',
+ 'wpReason': reason,
+ #'wpComment': reason, <- which version?
+ 'wpConfirm': '1',
+ 'wpConfirmB': '1',
+ 'wpEditToken': token,
+ }
+ response, data = self.site().postForm(address, predata, sysop = True)
+ if data:
+ self.site().checkBlocks(sysop = True)
+ if self.site().mediawiki_message('actioncomplete') in data:
+ output(u'Page %s deleted' % self)
+ return True
+ elif self.site().mediawiki_message('cannotdelete') in data:
+ output(u'Page %s could not be deleted - it doesn\'t exist'
+ % self)
+ return False
+ else:
+ output(u'Deletion of %s failed for an unknown reason. The response text is:'
+ % self)
+ try:
+ ibegin = data.index('<!-- start content -->') + 22
+ iend = data.index('<!-- end content -->')
+ except ValueError:
+ # if begin/end markers weren't found, show entire HTML file
+ output(data)
+ else:
+ # otherwise, remove the irrelevant sections
+ data = data[ibegin:iend]
+ output(data)
+ return False
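+
+    # A minimal usage sketch for delete() (illustrative only; assumes a sysop
+    # account, or pass mark=True to tag the page for speedy deletion instead):
+    #
+    #   >>> page = Page(getSite('en', 'wikipedia'), u'User:ExampleBot/Old draft')
+    #   >>> page.delete(reason=u'housekeeping', prompt=False, mark=True)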
+
+ def loadDeletedRevisions(self, step=None, total=None):
+        """Retrieve all deleted revisions for this Page from Special:Undelete.
+
+ Stores all revisions' timestamps, dates, editors and comments in
+ self._deletedRevs attribute.
+
+ @return: list of timestamps (which can be used to retrieve
+ revisions later on).
+
+ """
+ # Login
+ self._getActionUser(action = 'deletedhistory', sysop = True)
+
+ #TODO: Handle image file revisions too.
+ output(u'Loading list of deleted revisions for [[%s]]...' % self.title())
+
+ self._deletedRevs = {}
+
+ if self.site().has_api() and self.site().versionnumber() >= 12:
+ params = {
+ 'action': 'query',
+ 'list': 'deletedrevs',
+ 'drfrom': self.title(withNamespace=False),
+ 'drnamespace': self.namespace(),
+ 'drprop': ['revid','user','comment','content'],#','minor','len','token'],
+ 'drlimit': 100,
+ 'drdir': 'older',
+ #'': '',
+ }
+ count = 0
+ while True:
+ data = query.GetData(params, self.site(), sysop=True)
+ for x in data['query']['deletedrevs']:
+ if x['title'] != self.title():
+ continue
+
+ for y in x['revisions']:
+ count += 1
+ self._deletedRevs[parsetime2stamp(y['timestamp'])] = [y['timestamp'], y['user'], y['comment'] , y['*'], False]
+
+ if 'query-continue' in data:
+ # get the continue key for backward compatibility
+ # with pre 1.20wmf8
+ contKey = data['query-continue']['deletedrevs'].keys()[0]
+ if data['query-continue']['deletedrevs'][contKey].split(
+ '|')[1] == self.title(withNamespace=False):
+ params[contKey] = data['query-continue']['deletedrevs'][contKey]
+ else: break
+ else:
+ break
+ self._deletedRevsModified = False
+
+ else:
+ address = self.site().undelete_view_address(self.urlname())
+ text = self.site().getUrl(address, sysop = True)
+ #TODO: Handle non-existent pages etc
+
+ rxRevs = re.compile(r'<input name="(?P<ts>(?:ts|fileid)\d+)".*?title=".*?">(?P<date>.*?)</a>.*?title=".*?">(?P<editor>.*?)</a>.*?<span class="comment">\((?P<comment>.*?)\)</span>',re.DOTALL)
+ for rev in rxRevs.finditer(text):
+ self._deletedRevs[rev.group('ts')] = [
+ rev.group('date'),
+ rev.group('editor'),
+ rev.group('comment'),
+ None, #Revision text
+ False, #Restoration marker
+ ]
+
+ self._deletedRevsModified = False
+
+ return self._deletedRevs.keys()
+
+ def getDeletedRevision(self, timestamp, retrieveText=False):
+ """Return a particular deleted revision by timestamp.
+
+ @return: a list of [date, editor, comment, text, restoration
+ marker]. text will be None, unless retrieveText is True (or has
+ been retrieved earlier). If timestamp is not found, returns
+ None.
+
+ """
+ if self._deletedRevs is None:
+ self.loadDeletedRevisions()
+ if timestamp not in self._deletedRevs:
+ #TODO: Throw an exception instead?
+ return None
+
+ if retrieveText and not self._deletedRevs[timestamp][3] and timestamp[:2]=='ts':
+ # Login
+ self._getActionUser(action = 'delete', sysop = True)
+
+ output(u'Retrieving text of deleted revision...')
+ address = self.site().undelete_view_address(self.urlname(),timestamp)
+ text = self.site().getUrl(address, sysop = True)
+ und = re.search('<textarea readonly="1" cols="80" rows="25">(.*?)</textarea><div><form method="post"',text,re.DOTALL)
+ if und:
+ self._deletedRevs[timestamp][3] = und.group(1)
+
+ return self._deletedRevs[timestamp]
+
+ def markDeletedRevision(self, timestamp, undelete=True):
+ """Mark the revision identified by timestamp for undeletion.
+
+ @param undelete: if False, mark the revision to remain deleted.
+
+ """
+ if self._deletedRevs is None:
+ self.loadDeletedRevisions()
+ if timestamp not in self._deletedRevs:
+ #TODO: Throw an exception?
+ return None
+ self._deletedRevs[timestamp][4] = undelete
+ self._deletedRevsModified = True
+
+ def undelete(self, comment=None, throttle=True):
+ """Undelete page based on the undeletion markers set by previous calls.
+
+ If no calls have been made since loadDeletedRevisions(), everything
+ will be restored.
+
+ Simplest case:
+ Page(...).undelete('This will restore all revisions')
+
+ More complex:
+ pg = Page(...)
+            revs = pg.loadDeletedRevisions()
+ for rev in revs:
+ if ... #decide whether to undelete a revision
+ pg.markDeletedRevision(rev) #mark for undeletion
+ pg.undelete('This will restore only selected revisions.')
+
+ @param comment: The undeletion edit summary.
+
+ """
+ # Login
+ self._getActionUser(action = 'undelete', sysop = True)
+
+ # Check blocks
+ self.site().checkBlocks(sysop = True)
+
+ token = self.site().getToken(self, sysop=True)
+ if comment is None:
+ output(u'Preparing to undelete %s.'
+ % (self.title(asLink=True)))
+ comment = input(u'Please enter a reason for the undeletion:')
+
+ if throttle:
+ put_throttle()
+
+ if self.site().has_api() and self.site().versionnumber() >= 12:
+ params = {
+ 'action': 'undelete',
+ 'title': self.title(),
+ 'reason': comment,
+ 'token': token,
+ }
+ if self._deletedRevs and self._deletedRevsModified:
+ selected = []
+
+                for ts in self._deletedRevs:
+                    if self._deletedRevs[ts][4]:
+                        selected.append(ts)
+                params['timestamps'] = selected
+
+ result = query.GetData(params, self.site(), sysop=True)
+ if 'error' in result:
+ raise RuntimeError("%s" % result['error'])
+ elif 'undelete' in result:
+ output(u'Page %s undeleted' % self.title(asLink=True))
+
+ return result
+
+ else:
+ address = self.site().undelete_address()
+
+ formdata = {
+ 'target': self.title(),
+ 'wpComment': comment,
+ 'wpEditToken': token,
+ 'restore': self.site().mediawiki_message('undeletebtn')
+ }
+
+ if self._deletedRevs and self._deletedRevsModified:
+ for ts in self._deletedRevs:
+ if self._deletedRevs[ts][4]:
+ formdata['ts'+ts] = '1'
+
+ self._deletedRevs = None
+ #TODO: Check for errors below (have we succeeded? etc):
+ result = self.site().postForm(address,formdata,sysop=True)
+ output(u'Page %s undeleted' % self.title(asLink=True))
+
+ return result
+
+ def protect(self, editcreate='sysop', move='sysop', unprotect=False,
+ reason=None, editcreate_duration='infinite',
+ move_duration = 'infinite', cascading = False, prompt = True, throttle = True):
+ """(Un)protect a wiki page. Requires administrator status.
+
+        If the page does not exist, only the create protection (the
+        editcreate parameter) is applied.
+        If reason is None, asks for a reason. If prompt is True, asks the
+        user whether to protect the page. Valid values for editcreate and
+        move are:
+ * '' (equivalent to 'none')
+ * 'autoconfirmed'
+ * 'sysop'
+
+ """
+ # Login
+ self._getActionUser(action = 'protect', sysop = True)
+
+ # Check blocks
+ self.site().checkBlocks(sysop = True)
+
+ #if self.exists() and editcreate != move: # check protect level if edit/move not same
+ # if editcreate == 'sysop' and move != 'sysop':
+ # raise Error("The level configuration is not safe")
+
+ if unprotect:
+ editcreate = move = ''
+ else:
+ editcreate, move = editcreate.lower(), move.lower()
+ if throttle:
+ put_throttle()
+ if reason is None:
+ reason = input(
+ u'Please enter a reason for the change of the protection level:')
+ reason = reason.encode(self.site().encoding())
+ answer = 'y'
+ if prompt and not hasattr(self.site(), '_noProtectPrompt'):
+ answer = inputChoice(
+ u'Do you want to change the protection level of %s?' % self,
+ ['Yes', 'No', 'All'], ['Y', 'N', 'A'], 'N')
+ if answer == 'a':
+ answer = 'y'
+ self.site()._noProtectPrompt = True
+ if answer == 'y':
+ if not self.site().has_api() or self.site().versionnumber() < 12:
+ return self._oldProtect(editcreate, move, unprotect, reason,
+ editcreate_duration, move_duration,
+ cascading, prompt, throttle)
+
+ token = self.site().getToken(self, sysop = True)
+
+            # Translate 'none' to 'all' (no restriction) for the API
+ protections = []
+ expiry = []
+ if editcreate == 'none':
+ editcreate = 'all'
+ if move == 'none':
+ move = 'all'
+
+ if editcreate_duration == 'none' or not editcreate_duration:
+ editcreate_duration = 'infinite'
+ if move_duration == 'none' or not move_duration:
+ move_duration = 'infinite'
+
+ if self.exists():
+                protections.append("edit=%s" % editcreate)
+                expiry.append(editcreate_duration)
+                protections.append("move=%s" % move)
+                expiry.append(move_duration)
+ else:
+ protections.append("create=%s" % editcreate)
+
+ expiry.append(editcreate_duration)
+
+ params = {
+ 'action': 'protect',
+ 'title': self.title(),
+ 'token': token,
+ 'protections': protections,
+ 'expiry': expiry,
+ #'': '',
+ }
+ if reason:
+ params['reason'] = reason
+
+ if cascading:
+ if editcreate != 'sysop' or move != 'sysop' or not self.exists():
+                    # Cascading protection requires both levels to be sysop-only,
+                    # and is only available for existing pages (not for create
+                    # protection), so prevent the API error here.
+                    output(u"NOTE: Cascading protection needs an existing page and sysop-only levels; switching cascading off.")
+ else:
+ params['cascade'] = 1
+
+ result = query.GetData(params, self.site(), sysop=True)
+
+            if 'error' in result:  # error occurred
+ err = result['error']['code']
+ output('%s' % result)
+ #if err == '':
+ #
+ #elif err == '':
+ #
+ else:
+ if result['protect']:
+ output(u'Changed protection level of page %s.' % self.title(asLink=True))
+ return True
+
+ return False
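+
+    # A minimal usage sketch for protect() (illustrative only; assumes a sysop
+    # account; the level names follow the docstring above):
+    #
+    #   >>> page = Page(getSite('en', 'wikipedia'), u'User:ExampleBot/Notice')
+    #   >>> page.protect(editcreate='sysop', move='sysop',
+    #   ...              reason=u'high-visibility page', prompt=False)
+    #
+    #   # and to remove the protection again:
+    #   >>> page.protect(unprotect=True, reason=u'no longer needed', prompt=False)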
+
+ def _oldProtect(self, editcreate = 'sysop', move = 'sysop', unprotect = False, reason = None, editcreate_duration = 'infinite',
+ move_duration = 'infinite', cascading = False, prompt = True, throttle = True):
+        """Protect the page via the ordinary web form (non-API); internal use."""
+ host = self.site().hostname()
+ token = self.site().getToken(sysop = True)
+
+ # Translate 'none' to ''
+ if editcreate == 'none': editcreate = ''
+ if move == 'none': move = ''
+
+ # Translate no duration to infinite
+ if editcreate_duration == 'none' or not editcreate_duration: editcreate_duration = 'infinite'
+ if move_duration == 'none' or not move_duration: move_duration = 'infinite'
+
+ # Get cascading
+ if cascading == False:
+ cascading = '0'
+ else:
+ if editcreate != 'sysop' or move != 'sysop' or not self.exists():
+                # Cascading protection requires both levels to be sysop-only,
+                # and is only available for existing pages (not for create
+                # protection), so prevent the error here.
+                cascading = '0'
+                output(u"NOTE: Cascading protection needs an existing page and sysop-only levels; switching cascading off.")
+ else:
+ cascading = '1'
+
+ if unprotect:
+ address = self.site().unprotect_address(self.urlname())
+ else:
+ address = self.site().protect_address(self.urlname())
+
+ predata = {}
+        if self.site().versionnumber() >= 10:
+ predata['mwProtect-cascade'] = cascading
+
+ predata['mwProtect-reason'] = reason
+
+ if not self.exists(): #and self.site().versionnumber() >= :
+ #create protect
+ predata['mwProtect-level-create'] = editcreate
+ predata['wpProtectExpirySelection-create'] = editcreate_duration
+ else:
+ #edit/move Protect
+ predata['mwProtect-level-edit'] = editcreate
+ predata['mwProtect-level-move'] = move
+
+ if self.site().versionnumber() >= 14:
+ predata['wpProtectExpirySelection-edit'] = editcreate_duration
+ predata['wpProtectExpirySelection-move'] = move_duration
+ else:
+ predata['mwProtect-expiry'] = editcreate_duration
+
+ if token:
+ predata['wpEditToken'] = token
+
+ response, data = self.site().postForm(address, predata, sysop=True)
+
+ if response.code == 302 and not data:
+ output(u'Changed protection level of page %s.' % self.title(asLink=True))
+ return True
+ else:
+ #Normally, we expect a 302 with no data, so this means an error
+ self.site().checkBlocks(sysop = True)
+ output(u'Failed to change protection level of page %s:'
+ % self.title(asLink=True))
+ output(u"HTTP response code %s" % response.code)
+ output(data)
+ return False
+
+ def removeImage(self, image, put=False, summary=None, safe=True):
+ """Remove all occurrences of an image from this Page."""
+ # TODO: this should be grouped with other functions that operate on
+ # wiki-text rather than the Page object
+ return self.replaceImage(image, None, put, summary, safe)
+
+ def replaceImage(self, image, replacement=None, put=False, summary=None,
+ safe=True):
+        """Replace all occurrences of an image by another image.
+
+ Giving None as argument for replacement will delink instead of
+ replace.
+
+ The argument image must be without namespace and all spaces replaced
+ by underscores.
+
+ If put is False, the new text will be returned. If put is True, the
+        edits will be saved to the wiki and True will be returned on success,
+        otherwise False. Edit errors propagate.
+
+ """
+ # TODO: this should be grouped with other functions that operate on
+ # wiki-text rather than the Page object
+
+ # Copyright (c) Orgullomoore, Bryan
+
+ # TODO: document and simplify the code
+ site = self.site()
+
+ text = self.get()
+ new_text = text
+
+ def capitalizationPattern(s):
+ """
+ Given a string, creates a pattern that matches the string, with
+            Given a string, create a pattern that matches it, with the
+            first letter matched case-insensitively unless the site is
+            configured as case-sensitive (nocapitalize).
+ if self.site().nocapitalize:
+ return re.escape(s)
+ else:
+ return ur'(?:[%s%s]%s)' % (re.escape(s[0].upper()), re.escape(s[0].lower()), re.escape(s[1:]))
+
+ namespaces = set(site.namespace(6, all = True) + site.namespace(-2, all = True))
+ # note that the colon is already included here
+ namespacePattern = ur'\s*(?:%s)\s*\:\s*' % u'|'.join(namespaces)
+
+ imagePattern = u'(%s)' % capitalizationPattern(image).replace(r'\_', '[ _]')
+
+ def filename_replacer(match):
+ if replacement is None:
+ return u''
+ else:
+ old = match.group()
+ return old[:match.start('filename')] + replacement + old[match.end('filename'):]
+
+ # The group params contains parameters such as thumb and 200px, as well
+ # as the image caption. The caption can contain wiki links, but each
+ # link has to be closed properly.
+ paramPattern = r'(?:\|(?:(?!\[\[).|\[\[.*?\]\])*?)'
+ rImage = re.compile(ur'\[\[(?P<namespace>%s)(?P<filename>%s)(?P<params>%s*?)\]\]' % (namespacePattern, imagePattern, paramPattern))
+ if replacement is None:
+ new_text = rImage.sub('', new_text)
+ else:
+ new_text = rImage.sub('[[\g<namespace>%s\g<params>]]' % replacement, new_text)
+
+ # Remove the image from galleries
+ galleryR = re.compile(r'(?is)<gallery>(?P<items>.*?)</gallery>')
+ galleryItemR = re.compile(r'(?m)^%s?(?P<filename>%s)\s*(?P<label>\|.*?)?\s*$' % (namespacePattern, imagePattern))
+
+ def gallery_replacer(match):
+ return ur'<gallery>%s</gallery>' % galleryItemR.sub(filename_replacer, match.group('items'))
+
+ new_text = galleryR.sub(gallery_replacer, new_text)
+
+ if (text == new_text) or (not safe):
+            # None of the previous steps worked, so the image is
+ # likely embedded in a complicated template.
+ # Note: this regular expression can't handle nested templates.
+ templateR = re.compile(ur'(?s)\{\{(?P<contents>.*?)\}\}')
+ fileReferenceR = re.compile(u'%s(?P<filename>(?:%s)?)' % (namespacePattern, imagePattern))
+
+ def template_replacer(match):
+ return fileReferenceR.sub(filename_replacer, match.group(0))
+
+ new_text = templateR.sub(template_replacer, new_text)
+
+ if put:
+ if text != new_text:
+ # Save to the wiki
+ self.put(new_text, summary)
+ return True
+ return False
+ else:
+ return new_text
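+
+    # A minimal usage sketch for replaceImage()/removeImage() (illustrative
+    # only; the file names are made up and must be given without a namespace
+    # prefix, with spaces replaced by underscores, as the docstring says):
+    #
+    #   >>> page = Page(getSite('en', 'wikipedia'), u'User:ExampleBot/Gallery')
+    #   >>> page.replaceImage(u'Old_name.jpg', u'New_name.jpg', put=True,
+    #   ...                   summary=u'updating renamed file')
+    #   >>> new_text = page.removeImage(u'Unwanted_file.jpg')  # returns text only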
+
+ ## @since 10310
+ # @remarks needed by various bots
+ def purgeCache(self):
+        """Purge the server cache of this page via the API.
+        (A non-API purge can be done with Page.purge_address().)
+ """
+
+ # Make sure we re-raise an exception we got on an earlier attempt
+ if hasattr(self, '_getexception'):
+ return self._getexception
+
+ # call the wiki to execute the request
+ params = {
+ u'action' : u'purge',
+ u'titles' : self.title(),
+ }
+
+ pywikibot.get_throttle()
+ pywikibot.output(u"Purging page cache for %s." % self.title(asLink=True))
+
+ result = query.GetData(params, self.site())
+ r = result[u'purge'][0]
+
+ # store and return info
+ if (u'missing' in r):
+ self._getexception = pywikibot.NoPage
+ raise pywikibot.NoPage(self.site(), self.title(asLink=True),"Page does not exist. Was not able to purge cache!" )
+
+ return (u'purged' in r)
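+
+    # A minimal usage sketch for purgeCache() (illustrative only):
+    #
+    #   >>> page = Page(getSite('en', 'wikipedia'), u'Main Page')
+    #   >>> page.purgeCache()   # True if the server reports the page as purged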
+
+
+class ImagePage(Page):
+ """A subclass of Page representing an image descriptor wiki page.
+
+ Supports the same interface as Page, with the following added methods:
+
+ getImagePageHtml : Download image page and return raw HTML text.
+    fileUrl                   : Return the URL for the image described on this
+ page.
+ fileIsOnCommons : Return True if image stored on Wikimedia
+ Commons.
+ fileIsShared : Return True if image stored on Wikitravel
+ shared repository.
+ getFileMd5Sum : Return image file's MD5 checksum.
+ getFileVersionHistory : Return the image file's version history.
+ getFileVersionHistoryTable: Return the version history in the form of a
+ wiki table.
+ usingPages : Yield Pages on which the image is displayed.
+ globalUsage : Yield Pages on which the image is used globally
+
+ """
+ def __init__(self, site, title, insite = None):
+ Page.__init__(self, site, title, insite, defaultNamespace=6)
+ if self.namespace() != 6:
+ raise ValueError(u'BUG: %s is not in the image namespace!' % title)
+ self._imagePageHtml = None
+ self._local = None
+ self._latestInfo = {}
+ self._infoLoaded = False
+
+ def getImagePageHtml(self):
+ """
+ Download the image page, and return the HTML, as a unicode string.
+
+ Caches the HTML code, so that if you run this method twice on the
+ same ImagePage object, the page will only be downloaded once.
+ """
+ if not self._imagePageHtml:
+ path = self.site().get_address(self.urlname())
+ self._imagePageHtml = self.site().getUrl(path)
+ return self._imagePageHtml
+
+ def _loadInfo(self, limit=1):
+ params = {
+ 'action': 'query',
+ 'prop': 'imageinfo',
+ 'titles': self.title(),
+ 'iiprop': ['timestamp', 'user', 'comment', 'url', 'size',
+ 'dimensions', 'sha1', 'mime', 'metadata', 'archivename',
+ 'bitdepth'],
+ 'iilimit': limit,
+ }
+ try:
+ data = query.GetData(params, self.site())
+ except NotImplementedError:
+            output("API not available; loading page HTML instead.")
+ self.getImagePageHtml()
+ return
+
+ if 'error' in data:
+ raise RuntimeError("%s" %data['error'])
+ count = 0
+ pageInfo = data['query']['pages'].values()[0]
+ self._local = pageInfo["imagerepository"] != "shared"
+ if data['query']['pages'].keys()[0] == "-1":
+ if 'missing' in pageInfo and self._local:
+ raise NoPage(self.site(), unicode(self),
+ "Page does not exist.")
+ elif 'invalid' in pageInfo:
+ raise BadTitle('BadTitle: %s' % self)
+ infos = []
+
+ try:
+ while True:
+ for info in pageInfo['imageinfo']:
+ count += 1
+ if count == 1 and 'iistart' not in params:
+                        # count == 1 with no iistart means the first entry is the latest revision.
+ self._latestInfo = info
+ infos.append(info)
+ if limit == 1:
+ break
+
+ if 'query-continue' in data and limit != 1:
+ params['iistart'] = data['query-continue']['imageinfo']['iistart']
+ else:
+ break
+ except KeyError:
+            output(u"No image revision information found on this image page.")
+ self._infoLoaded = True
+ if limit > 1:
+ return infos
+
+ def fileUrl(self):
+ """Return the URL for the image described on this page."""
+ # There are three types of image pages:
+ # * normal, small images with links like: filename.png (10KB, MIME type: image/png)
+ # * normal, large images with links like: Download high resolution version (1024x768, 200 KB)
+ # * SVG images with links like: filename.svg (1KB, MIME type: image/svg)
+ # This regular expression seems to work with all of them.
+ # The part after the | is required for copying .ogg files from en:, as they do not
+ # have a "full image link" div. This might change in the future; on commons, there
+ # is a full image link for .ogg and .mid files.
+ #***********************
+ #change to API query: action=query&titles=File:wiki.jpg&prop=imageinfo&iiprop=url
+ if not self._infoLoaded:
+ self._loadInfo()
+
+ if self._infoLoaded:
+ return self._latestInfo['url']
+
+ urlR = re.compile(r'<div class="fullImageLink" id="file">.*?<a href="(?P<url>[^ ]+?)"(?! class="image")|<span class="dangerousLink"><a href="(?P<url2>.+?)"', re.DOTALL)
+ m = urlR.search(self.getImagePageHtml())
+
+ url = m.group('url') or m.group('url2')
+ return url
+
+ def fileIsOnCommons(self):
+ """Return True if the image is stored on Wikimedia Commons"""
+ if not self._infoLoaded:
+ self._loadInfo()
+
+ if self._infoLoaded:
+ return not self._local
+
+ return self.fileUrl().startswith(u'http://upload.wikimedia.org/wikipedia/commons/')
+
+ def fileIsShared(self):
+ """Return True if image is stored on Wikitravel shared repository."""
+ if 'wikitravel_shared' in self.site().shared_image_repository():
+ return self.fileUrl().startswith(u'http://wikitravel.org/upload/shared/')
+ return self.fileIsOnCommons()
+
+    # FIXME: MD5 might be computed on an incomplete file due to server disconnection
+ # (see bug #1795683).
+ def getFileMd5Sum(self):
+ """Return image file's MD5 checksum."""
+ f = MyURLopener.open(self.fileUrl())
+ return md5(f.read()).hexdigest()
+
+ def getFileVersionHistory(self):
+ """Return the image file's version history.
+
+ Return value is a list of tuples containing (timestamp, username,
+ resolution, filesize, comment).
+
+ """
+ result = []
+ infos = self._loadInfo(500)
+ #API query
+ if infos:
+ for i in infos:
+ result.append((i['timestamp'], i['user'], u"%s×%s" % (i['width'], i['height']), i['size'], i['comment']))
+
+ return result
+
+ #from ImagePage HTML
+ history = re.search('(?s)<table class="wikitable filehistory">.+?</table>', self.getImagePageHtml())
+ if history:
+ lineR = re.compile(r'<tr>(?:<td>.*?</td>){1,2}<td.*?><a href=".+?">(?P<datetime>.+?)</a></td><td>.*?(?P<resolution>\d+\xd7\d+) <span.*?>\((?P<filesize>.+?)\)</span></td><td><a href=".+?"(?: class="new"|) title=".+?">(?P<username>.+?)</a>.*?</td><td>(?:.*?<span class="comment">\((?P<comment>.*?)\)</span>)?</td></tr>')
+ if not lineR.search(history.group()):
+ # b/c code
+ lineR = re.compile(r'<tr>(?:<td>.*?</td>){1,2}<td><a href=".+?">(?P<datetime>.+?)</a></td><td><a href=".+?"(?: class="new"|) title=".+?">(?P<username>.+?)</a>.*?</td><td>(?P<resolution>.*?)</td><td class=".+?">(?P<filesize>.+?)</td><td>(?P<comment>.*?)</td></tr>')
+ else:
+ # backward compatible code
+ history = re.search('(?s)<ul class="special">.+?</ul>', self.getImagePageHtml())
+ if history:
+ lineR = re.compile('<li> \(.+?\) \(.+?\) <a href=".+?" title=".+?">(?P<datetime>.+?)</a> . . <a href=".+?" title=".+?">(?P<username>.+?)</a> \(.+?\) . . (?P<resolution>\d+.+?\d+) \((?P<filesize>[\d,\.]+) .+?\)( <span class="comment">(?P<comment>.*?)</span>)?</li>')
+
+ if history:
+ for match in lineR.finditer(history.group()):
+ datetime = match.group('datetime')
+ username = match.group('username')
+ resolution = match.group('resolution')
+ size = match.group('filesize')
+ comment = match.group('comment') or ''
+ result.append((datetime, username, resolution, size, comment))
+ return result
+
+ def getFirstUploader(self):
+        """Return the first uploader of the image as [username, timestamp]."""
+ inf = self.getFileVersionHistory()[-1]
+ return [inf[1], inf[0]]
+
+ def getLatestUploader(self):
+        """Return the latest uploader of the image as [username, timestamp]."""
+ if not self._infoLoaded:
+ self._loadInfo()
+ if self._infoLoaded:
+ return [self._latestInfo['user'], self._latestInfo['timestamp']]
+
+ inf = self.getFileVersionHistory()[0]
+ return [inf[1], inf[0]]
+
+ def getHash(self):
+        """Return the SHA1 hash of the file, which can be used to check
+        whether two files are identical.
+ """
+ if self.exists():
+ if not self._infoLoaded:
+ self._loadInfo()
+ try:
+ return self._latestInfo['sha1']
+ except (KeyError, IndexError, TypeError):
+ try:
+ self.get()
+ except NoPage:
+ output(u'%s has been deleted before getting the Hash. Skipping...' % self.title())
+ return None
+ except IsRedirectPage:
+ output("Skipping %s because it's a redirect." % self.title())
+ return None
+ else:
+                raise NoHash('No hash found via the API! Maybe the API structure has changed.')
+ else:
+ output(u'File deleted before getting the Hash. Skipping...')
+ return None
+
+ def getFileVersionHistoryTable(self):
+ """Return the version history in the form of a wiki table."""
+ lines = []
+ for (datetime, username, resolution, size, comment) in self.getFileVersionHistory():
+ lines.append(u'| %s || %s || %s || %s || <nowiki>%s</nowiki>' % (datetime, username, resolution, size, comment))
+ return u'{| border="1"\n! date/time || username || resolution || size || edit summary\n|----\n' + u'\n|----\n'.join(lines) + '\n|}'
+
+ def usingPages(self):
+ if not self.site().has_api() or self.site().versionnumber() < 11:
+ for a in self._usingPagesOld():
+ yield a
+ return
+
+ params = {
+ 'action': 'query',
+ 'list': 'imageusage',
+ 'iutitle': self.title(),
+ 'iulimit': config.special_page_limit,
+ #'': '',
+ }
+
+ while True:
+ data = query.GetData(params, self.site())
+ if 'error' in data:
+ raise RuntimeError("%s" % data['error'])
+
+ for iu in data['query']["imageusage"]:
+ yield Page(self.site(), iu['title'], defaultNamespace=iu['ns'])
+
+ if 'query-continue' in data:
+ params['iucontinue'] = data['query-continue']['imageusage']['iucontinue']
+ else:
+ break
+
+ def _usingPagesOld(self):
+ """Yield Pages on which the image is displayed."""
+ titleList = re.search('(?s)<h2 id="filelinks">.+?<!-- end content -->',
+ self.getImagePageHtml()).group()
+ lineR = re.compile(
+ '<li><a href="[^\"]+" title=".+?">(?P<title>.+?)</a></li>')
+
+ for match in lineR.finditer(titleList):
+ try:
+ yield Page(self.site(), match.group('title'))
+ except InvalidTitle:
+ output(
+ u"Image description page %s contains invalid reference to [[%s]]."
+ % (self.title(), match.group('title')))
+
+ def globalUsage(self):
+ '''
+ Yield Pages on which the image is used globally.
+        Currently this probably only works on Wikimedia Commons.
+ '''
+
+ if not self.site().has_api() or self.site().versionnumber() < 11:
+ # Not supported, just return none
+ return
+
+ params = {
+ 'action': 'query',
+ 'prop': 'globalusage',
+ 'titles': self.title(),
+ 'gulimit': config.special_page_limit,
+ #'': '',
+ }
+
+ while True:
+ data = query.GetData(params, self.site())
+ if 'error' in data:
+ raise RuntimeError("%s" % data['error'])
+
+ for (page, globalusage) in data['query']['pages'].items():
+ for gu in globalusage['globalusage']:
+ #FIXME : Should have a cleaner way to get the wiki where the image is used
+ siteparts = gu['wiki'].split('.')
+ if len(siteparts)==3:
+ if siteparts[0] in self.site().fam().alphabetic and siteparts[1] in ['wikipedia', 'wiktionary', 'wikibooks', 'wikiquote','wikisource']:
+ code = siteparts[0]
+ fam = siteparts[1]
+ elif siteparts[0] in ['meta', 'incubator'] and siteparts[1]==u'wikimedia':
+                        code = siteparts[0]
+                        fam = siteparts[0]
+ else:
+ code = None
+ fam = None
+ if code and fam:
+ site = getSite(code=code, fam=fam)
+ yield Page(site, gu['title'])
+
+ if 'query-continue' in data:
+ params['gucontinue'] = data['query-continue']['globalusage']['gucontinue']
+ else:
+ break
+
+
+class _GetAll(object):
+ """For internal use only - supports getall() function"""
+ def __init__(self, site, pages, throttle, force):
+ self.site = site
+ self.pages = []
+ self.throttle = throttle
+ self.force = force
+ self.sleeptime = 15
+
+ for page in pages:
+ if (not hasattr(page, '_contents') and not hasattr(page, '_getexception')) or force:
+ self.pages.append(page)
+ elif verbose:
+ output(u"BUGWARNING: %s already done!" % page.title(asLink=True))
+
+ def sleep(self):
+ time.sleep(self.sleeptime)
+ if self.sleeptime <= 60:
+ self.sleeptime += 15
+ elif self.sleeptime < 360:
+ self.sleeptime += 60
+
+ def run(self):
+ if self.pages:
+            # Sometimes the query does not contain revisions
+ if self.site.has_api() and debug:
+ while True:
+ try:
+ data = self.getDataApi()
+ except (socket.error, httplib.BadStatusLine, ServerError):
+ # Print the traceback of the caught exception
+ s = ''.join(traceback.format_exception(*sys.exc_info()))
+ if not isinstance(s, unicode):
+ s = s.decode('utf-8')
+ output(u'%s\nDBG> got network error in _GetAll.run. ' \
+ 'Sleeping for %d seconds...' % (s, self.sleeptime))
+ self.sleep()
+ else:
+ if 'error' in data:
+ raise RuntimeError(data['error'])
+ else:
+ break
+
+ self.headerDoneApi(data['query'])
+ if 'normalized' in data['query']:
+ self._norm = dict([(x['from'],x['to']) for x in data['query']['normalized']])
+ for vals in data['query']['pages'].values():
+ self.oneDoneApi(vals)
+ else: #read pages via Special:Export
+ while True:
+ try:
+ data = self.getData()
+ except (socket.error, httplib.BadStatusLine, ServerError):
+ # Print the traceback of the caught exception
+ s = ''.join(traceback.format_exception(*sys.exc_info()))
+ if not isinstance(s, unicode):
+ s = s.decode('utf-8')
+ output(u'%s\nDBG> got network error in _GetAll.run. ' \
+ 'Sleeping for %d seconds...' % (s, self.sleeptime))
+ self.sleep()
+ else:
+ if "<title>Wiki does not exist</title>" in data:
+ raise NoSuchSite(u'Wiki %s does not exist yet' % self.site)
+ elif "</mediawiki>" not in data[-20:]:
+                        # An HTML error page was returned because of an
+                        # internal error when fetching a revision.
+ output(u'Received incomplete XML data. ' \
+ 'Sleeping for %d seconds...' % self.sleeptime)
+ self.sleep()
+                    elif "<siteinfo>" not in data: # This probably means we got a 'temporarily unavailable' page
+ output(u'Got incorrect export page. ' \
+ 'Sleeping for %d seconds...' % self.sleeptime)
+ self.sleep()
+ else:
+ break
+ R = re.compile(r"\s*<\?xml([^>]*)\?>(.*)",re.DOTALL)
+ m = R.match(data)
+ if m:
+ data = m.group(2)
+ handler = xmlreader.MediaWikiXmlHandler()
+ handler.setCallback(self.oneDone)
+ handler.setHeaderCallback(self.headerDone)
+ #f = open("backup.txt", "w")
+ #f.write(data)
+ #f.close()
+ try:
+ xml.sax.parseString(data, handler)
+ except (xml.sax._exceptions.SAXParseException, ValueError), err:
+ debugDump( 'SaxParseBug', self.site, err, data )
+ raise
+ except PageNotFound:
+ return
+ # All of the ones that have not been found apparently do not exist
+
+ for pl in self.pages:
+ if not hasattr(pl,'_contents') and not hasattr(pl,'_getexception'):
+ pl._getexception = NoPage
+
+ def oneDone(self, entry):
+ title = entry.title
+ username = entry.username
+ ipedit = entry.ipedit
+ timestamp = entry.timestamp
+ text = entry.text
+ editRestriction = entry.editRestriction
+ moveRestriction = entry.moveRestriction
+ revisionId = entry.revisionid
+
+ page = Page(self.site, title)
+ successful = False
+ for page2 in self.pages:
+ if page2.sectionFreeTitle() == page.sectionFreeTitle():
+ if not (hasattr(page2,'_contents') or \
+ hasattr(page2, '_getexception')) or self.force:
+ page2.editRestriction = entry.editRestriction
+ page2.moveRestriction = entry.moveRestriction
+ if editRestriction == 'autoconfirmed':
+ page2._editrestriction = True
+ page2._permalink = entry.revisionid
+ page2._userName = username
+ page2._ipedit = ipedit
+ page2._revisionId = revisionId
+ page2._editTime = timestamp
+ page2._versionhistory = [
+ (revisionId,
+ time.strftime("%Y-%m-%dT%H:%M:%SZ",
+ time.strptime(str(timestamp),
+ "%Y%m%d%H%M%S")),
+ username, entry.comment)]
+ section = page2.section()
+ # Store the content
+ page2._contents = text
+ m = self.site.redirectRegex().match(text)
+ if m:
+ ## output(u"%s is a redirect" % page2.title(asLink=True))
+ redirectto = m.group(1)
+ if section and not "#" in redirectto:
+ redirectto += "#" + section
+ page2._getexception = IsRedirectPage
+ page2._redirarg = redirectto
+
+ # This is used for checking deletion conflict.
+ # Use the data loading time.
+ page2._startTime = time.strftime('%Y%m%d%H%M%S',
+ time.gmtime())
+ if section:
+ m = re.search("=+[ ']*%s[ ']*=+" % re.escape(section), text)
+ if not m:
+ try:
+ page2._getexception
+ output(u"WARNING: Section not found: %s" % page2)
+ except AttributeError:
+ # There is no exception yet
+ page2._getexception = SectionError
+ successful = True
+ # Note that there is no break here. The reason is that there
+ # might be duplicates in the pages list.
+ if not successful:
+ output(u"BUG>> title %s (%s) not found in list" % (title, page))
+ output(u'Expected one of: %s'
+ % u','.join([unicode(page2) for page2 in self.pages]))
+ raise PageNotFound
+
+ def headerDone(self, header):
+ # Verify version
+ version = header.generator
+ p = re.compile('^MediaWiki (.+)$')
+ m = p.match(version)
+ if m:
+ version = m.group(1)
+ # only warn operator when versionnumber has been changed
+ versionnumber = self.site.family.versionnumber
+ if version != self.site.version() and \
+ versionnumber(self.site.lang,
+ version=version) != versionnumber(self.site.lang):
+ output(u'WARNING: Family file %s contains version number %s, but it should be %s'
+ % (self.site.family.name, self.site.version(), version))
+
+ # Verify case
+ if self.site.nocapitalize:
+ case = 'case-sensitive'
+ else:
+ case = 'first-letter'
+ if case != header.case.strip():
+ output(u'WARNING: Family file %s contains case %s, but it should be %s' % (self.site.family.name, case, header.case.strip()))
+
+ # Verify namespaces
+ lang = self.site.lang
+ ids = header.namespaces.keys()
+ ids.sort()
+ for id in ids:
+ nshdr = header.namespaces[id]
+ if self.site.family.isDefinedNSLanguage(id, lang):
+ ns = self.site.namespace(id) or u''
+ if ns != nshdr:
+ try:
+ dflt = self.site.family.namespace('_default', id)
+ except KeyError:
+ dflt = u''
+ if not ns and not dflt:
+ flag = u"is not set, but should be '%s'" % nshdr
+ elif dflt == ns:
+ flag = u"is set to default ('%s'), but should be '%s'" % (ns, nshdr)
+ elif dflt == nshdr:
+ flag = u"is '%s', but should be removed (default value '%s')" % (ns, nshdr)
+ else:
+ flag = u"is '%s', but should be '%s'" % (ns, nshdr)
+ output(u"WARNING: Outdated family file %s: namespace['%s'][%i] %s" % (self.site.family.name, lang, id, flag))
+ #self.site.family.namespaces[id][lang] = nshdr
+ else:
+ output(u"WARNING: Missing namespace in family file %s: namespace['%s'][%i] (it is set to '%s')" % (self.site.family.name, lang, id, nshdr))
+ for id in self.site.family.namespaces:
+ if self.site.family.isDefinedNSLanguage(id, lang) and id not in header.namespaces:
+ output(u"WARNING: Family file %s includes namespace['%s'][%i], but it should be removed (namespace doesn't exist in the site)" % (self.site.family.name, lang, id))
+
+ def getData(self):
+ address = self.site.export_address()
+ pagenames = [page.sectionFreeTitle() for page in self.pages]
+ # We need to use X convention for requested page titles.
+ if self.site.lang == 'eo':
+ pagenames = [encodeEsperantoX(pagetitle) for pagetitle in pagenames]
+ pagenames = u'\r\n'.join(pagenames)
+ if type(pagenames) is not unicode:
+ output(u'Warning: xmlreader.WikipediaXMLHandler.getData() got non-unicode page names. Please report this.')
+ print pagenames
+ # convert Unicode string to the encoding used on that wiki
+ pagenames = pagenames.encode(self.site.encoding())
+ predata = {
+ 'action': 'submit',
+ 'pages': pagenames,
+ 'curonly': 'True',
+ }
+ # Slow ourselves down
+ get_throttle(requestsize = len(self.pages))
+ # Now make the actual request to the server
+ now = time.time()
+ response, data = self.site.postForm(address, predata)
+ # The XML parser doesn't expect a Unicode string, but an encoded one,
+ # so we'll encode it back.
+ data = data.encode(self.site.encoding())
+ #get_throttle.setDelay(time.time() - now)
+ return data
+
+ def oneDoneApi(self, data):
+ title = data['title']
+ if not ('missing' in data or 'invalid' in data):
+ revisionId = data['lastrevid']
+ rev = None
+ try:
+ rev = data['revisions']
+ except KeyError:
+ raise KeyError(
+ u'NOTE: Last revision of [[%s]] not found' % title)
+ else:
+ username = rev[0]['user']
+ ipedit = 'anon' in rev[0]
+ timestamp = rev[0]['timestamp']
+ text = rev[0]['*']
+ editRestriction = ''
+ moveRestriction = ''
+ for revs in data['protection']:
+ if revs['type'] == 'edit':
+ editRestriction = revs['level']
+ elif revs['type'] == 'move':
+ moveRestriction = revs['level']
+
+ page = Page(self.site, title)
+ successful = False
+ for page2 in self.pages:
+ if hasattr(self, '_norm') and page2.sectionFreeTitle() in self._norm:
+ page2._title = self._norm[page2.sectionFreeTitle()]
+
+ if page2.sectionFreeTitle() == page.sectionFreeTitle():
+ if 'missing' in data:
+ page2._getexception = NoPage
+ successful = True
+ break
+
+ if 'invalid' in data:
+ page2._getexception = BadTitle
+ successful = True
+ break
+
+ if not (hasattr(page2,'_contents') or hasattr(page2,'_getexception')) or self.force:
+ page2.editRestriction = editRestriction
+ page2.moveRestriction = moveRestriction
+ if editRestriction == 'autoconfirmed':
+ page2._editrestriction = True
+ page2._permalink = revisionId
+ if rev:
+ page2._userName = username
+ page2._ipedit = ipedit
+ page2._editTime = timestamp
+ page2._contents = text
+ else:
+ raise KeyError(
+ u'BUG?>>: Last revision of [[%s]] not found'
+ % title)
+ page2._revisionId = revisionId
+ section = page2.section()
+ if 'redirect' in data:
+ ## output(u"%s is a redirect" % page2.title(asLink=True))
+ m = self.site.redirectRegex().match(text)
+ redirectto = m.group(1)
+ if section and not "#" in redirectto:
+ redirectto += "#" + section
+ page2._getexception = IsRedirectPage
+ page2._redirarg = redirectto
+
+ # This is used for checking deletion conflict.
+ # Use the data loading time.
+ page2._startTime = time.strftime('%Y%m%d%H%M%S', time.gmtime())
+ if section:
+ m = re.search("=+[ ']*%s[ ']*=+" % re.escape(section), text)
+ if not m:
+ try:
+ page2._getexception
+ output(u"WARNING: Section not found: %s"
+ % page2)
+ except AttributeError:
+ # There is no exception yet
+ page2._getexception = SectionError
+ successful = True
+ # Note that there is no break here. The reason is that there
+ # might be duplicates in the pages list.
+ if not successful:
+ output(u"BUG>> title %s (%s) not found in list" % (title, page))
+ output(u'Expected one of: %s'
+ % u','.join([unicode(page2) for page2 in self.pages]))
+ raise PageNotFound
+
+ def headerDoneApi(self, header):
+ p = re.compile('^MediaWiki (.+)$')
+ m = p.match(header['general']['generator'])
+ if m:
+ version = m.group(1)
+ # only warn operator when versionnumber has been changed
+ versionnumber = self.site.family.versionnumber
+ if version != self.site.version() and \
+ versionnumber(self.site.lang,
+ version=version) != versionnumber(self.site.lang):
+ output(u'WARNING: Family file %s contains version number %s, but it should be %s'
+ % (self.site.family.name, self.site.version(), version))
+
+ # Verify case
+ if self.site.nocapitalize:
+ case = 'case-sensitive'
+ else:
+ case = 'first-letter'
+ if case != header['general']['case'].strip():
+ output(u'WARNING: Family file %s contains case %s, but it should be %s' % (self.site.family.name, case, header.case.strip()))
+
+ # Verify namespaces
+ lang = self.site.lang
+ ids = header['namespaces'].keys()
+ ids.sort()
+ for id in ids:
+ nshdr = header['namespaces'][id]['*']
+ id = header['namespaces'][id]['id']
+ if self.site.family.isDefinedNSLanguage(id, lang):
+ ns = self.site.namespace(id) or u''
+ if ns != nshdr:
+ try:
+ dflt = self.site.family.namespace('_default', id)
+ except KeyError:
+ dflt = u''
+ if not ns and not dflt:
+ flag = u"is not set, but should be '%s'" % nshdr
+ elif dflt == ns:
+ flag = u"is set to default ('%s'), but should be '%s'" % (ns, nshdr)
+ elif dflt == nshdr:
+ flag = u"is '%s', but should be removed (default value '%s')" % (ns, nshdr)
+ else:
+ flag = u"is '%s', but should be '%s'" % (ns, nshdr)
+ output(u"WARNING: Outdated family file %s: namespace['%s'][%i] %s" % (self.site.family.name, lang, id, flag))
+ #self.site.family.namespaces[id][lang] = nshdr
+ else:
+ output(u"WARNING: Missing namespace in family file %s: namespace['%s'][%i] (it is set to '%s')" % (self.site.family.name, lang, id, nshdr))
+ for id in self.site.family.namespaces:
+ if self.site.family.isDefinedNSLanguage(id, lang) and u'%i' % id not in header['namespaces']:
+ output(u"WARNING: Family file %s includes namespace['%s'][%i], but it should be removed (namespace doesn't exist in the site)" % (self.site.family.name, lang, id ) )
+
+ def getDataApi(self):
+ pagenames = [page.sectionFreeTitle() for page in self.pages]
+ params = {
+ 'action': 'query',
+ 'meta':'siteinfo',
+ 'prop': ['info', 'revisions'],
+ 'titles': pagenames,
+ 'siprop': ['general', 'namespaces'],
+ 'rvprop': ['content', 'timestamp', 'user', 'comment', 'size'],#'ids',
+ 'inprop': ['protection', 'subjectid'], #, 'talkid', 'url', 'readable'
+ }
+
+ # Slow ourselves down
+ get_throttle(requestsize = len(self.pages))
+ # Now make the actual request to the server
+ now = time.time()
+
+ #get_throttle.setDelay(time.time() - now)
+ return query.GetData(params, self.site)
+
+def getall(site, pages, throttle=True, force=False):
+ """Bulk-retrieve a group of pages from site
+
+ Arguments: site = Site object
+ pages = iterable that yields Page objects
+
+ """
+ # TODO: why isn't this a Site method?
+ pages = list(pages) # if pages is an iterator, we need to make it a list
+ output(u'Getting %d page%s %sfrom %s...'
+ % (len(pages), (u'', u's')[len(pages) != 1],
+ (u'', u'via API ')[site.has_api() and debug], site))
+    limit = config.special_page_limit / 4  # default is 500/4; smaller batches are easier on the server.
+ if len(pages) > limit:
+ # separate export pages for bulk-retrieve
+
+ for pagg in range(0, len(pages), limit):
+        if pagg == range(0, len(pages), limit)[-1]:  # last batch
+ k = pages[pagg:]
+ output(u'Getting pages %d - %d of %d...' % (pagg + 1, len(pages), len(pages)))
+ _GetAll(site, k, throttle, force).run()
+ pages[pagg:] = k
+ else:
+ k = pages[pagg:pagg + limit]
+ output(u'Getting pages %d - %d of %d...' % (pagg + 1, pagg + limit, len(pages)))
+ _GetAll(site, k, throttle, force).run()
+ pages[pagg:pagg + limit] = k
+            get_throttle(requestsize = len(pages) / 10)  # a single batch retrieval takes roughly 7.7 sec.
+ else:
+ _GetAll(site, pages, throttle, force).run()
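+
+# A minimal usage sketch for getall() (illustrative only; the titles are made up):
+#
+#   >>> site = getSite('en', 'wikipedia')
+#   >>> pages = [Page(site, t) for t in (u'Foo', u'Bar', u'Baz')]
+#   >>> getall(site, pages)
+#   >>> for p in pages:
+#   ...     if not hasattr(p, '_getexception'):
+#   ...         output(p.get())   # served from the pre-filled page cache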
+
+
+# Library functions
+
+def setAction(s):
+ """Set a summary to use for changed page submissions"""
+ global action
+ action = s
+
+# Default action
+setAction('Wikipedia python library')
+
+def setUserAgent(s):
+ """Set a User-agent: header passed to the HTTP server"""
+ global useragent
+ useragent = s
+
+# Default User-agent
+setUserAgent('PythonWikipediaBot/1.0')
+
+def url2link(percentname, insite, site):
+ """Convert urlname of a wiki page into interwiki link format.
+
+ 'percentname' is the page title as given by Page.urlname();
+ 'insite' specifies the target Site;
+ 'site' is the Site on which the page is found.
+
+ """
+ # Note: this is only needed if linking between wikis that use different
+ # encodings, so it is now largely obsolete. [CONFIRM]
+ percentname = percentname.replace('_', ' ')
+ x = url2unicode(percentname, site = site)
+ return unicode2html(x, insite.encoding())
+
+def decodeEsperantoX(text):
+ """
+ Decode Esperanto text encoded using the x convention.
+
+ E.g., Cxefpagxo and CXefpagXo will both be converted to Ĉefpaĝo.
+ Note that to encode non-Esperanto words like Bordeaux, one uses a
+ double x, i.e. Bordeauxx or BordeauxX.
+
+ """
+ chars = {
+ u'c': u'ĉ',
+ u'C': u'Ĉ',
+ u'g': u'ĝ',
+ u'G': u'Ĝ',
+ u'h': u'ĥ',
+ u'H': u'Ĥ',
+ u'j': u'ĵ',
+ u'J': u'Ĵ',
+ u's': u'ŝ',
+ u'S': u'Ŝ',
+ u'u': u'ŭ',
+ u'U': u'Ŭ',
+ }
+ for latin, esperanto in chars.iteritems():
+ # A regular expression that matches a letter combination which IS
+ # encoded using x-convention.
+ xConvR = re.compile(latin + '[xX]+')
+ pos = 0
+ result = ''
+ # Each matching substring will be regarded exactly once.
+ while True:
+ match = xConvR.search(text[pos:])
+ if match:
+ old = match.group()
+ if len(old) % 2 == 0:
+ # The first two chars represent an Esperanto letter.
+ # Following x's are doubled.
+ new = esperanto + ''.join([old[2 * i]
+ for i in xrange(1, len(old)/2)])
+ else:
+ # The first character stays latin; only the x's are doubled.
+ new = latin + ''.join([old[2 * i + 1]
+ for i in xrange(0, len(old)/2)])
+ result += text[pos : match.start() + pos] + new
+ pos += match.start() + len(old)
+ else:
+ result += text[pos:]
+ text = result
+ break
+ return text
+
+def encodeEsperantoX(text):
+ """
+ Convert standard wikitext to the Esperanto x-encoding.
+
+ Double X-es where necessary so that we can submit a page to an Esperanto
+ wiki. Again, we have to keep stupid stuff like cXxXxxX in mind. Maybe
+ someone wants to write about the Sony Cyber-shot DSC-Uxx camera series on
+ eo: ;)
+ """
+ # A regular expression that matches a letter combination which is NOT
+ # encoded in x-convention.
+ notXConvR = re.compile('[cghjsuCGHJSU][xX]+')
+ pos = 0
+ result = ''
+ while True:
+ match = notXConvR.search(text[pos:])
+ if match:
+ old = match.group()
+ # the first letter stays; add an x after each X or x.
+ new = old[0] + ''.join([old[i] + 'x' for i in xrange(1, len(old))])
+ result += text[pos : match.start() + pos] + new
+ pos += match.start() + len(old)
+ else:
+ result += text[pos:]
+ text = result
+ break
+ return text
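+
+# A few round-trip examples for the x-convention helpers (the values follow
+# the docstrings above):
+#
+#   >>> decodeEsperantoX(u'Cxefpagxo')
+#   u'\u0108efpa\u011do'          # Ĉefpaĝo
+#   >>> decodeEsperantoX(u'Bordeauxx')
+#   u'Bordeaux'
+#   >>> encodeEsperantoX(u'Bordeaux')
+#   u'Bordeauxx'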
+
+######## Unicode library functions ########
+
+def UnicodeToAsciiHtml(s):
+ """Convert unicode to a bytestring using HTML entities."""
+ html = []
+ for c in s:
+ cord = ord(c)
+ if 31 < cord < 128:
+ html.append(c)
+ else:
+ html.append('&#%d;'%cord)
+ return ''.join(html)
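+
+# Example for UnicodeToAsciiHtml(): non-ASCII characters become decimal HTML
+# entities, while plain ASCII passes through unchanged:
+#
+#   >>> print UnicodeToAsciiHtml(u'caf\xe9 \u2013 50\u20ac')
+#   caf&#233; &#8211; 50&#8364;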
+
+def url2unicode(title, site, site2 = None):
+ """Convert url-encoded text to unicode using site's encoding.
+
+ If site2 is provided, try its encodings as well. Uses the first encoding
+ that doesn't cause an error.
+
+ """
+ # create a list of all possible encodings for both hint sites
+ encList = [site.encoding()] + list(site.encodings())
+    if site2 and site2 != site:
+ encList.append(site2.encoding())
+ encList += list(site2.encodings())
+ firstException = None
+ # try to handle all encodings (will probably retry utf-8)
+ for enc in encList:
+ try:
+ t = title.encode(enc)
+ t = urllib.unquote(t)
+ return unicode(t, enc)
+ except UnicodeError, ex:
+ if not firstException:
+ firstException = ex
+ pass
+ # Couldn't convert, raise the original exception
+ raise firstException
+
+def unicode2html(x, encoding):
+ """
+ Ensure unicode string is encodable, or else convert to ASCII for HTML.
+
+ Arguments are a unicode string and an encoding. Attempt to encode the
+ string into the desired format; if that doesn't work, encode the unicode
+ into html &#; entities. If it does work, return it unchanged.
+
+ """
+ try:
+ x.encode(encoding)
+ except UnicodeError:
+ x = UnicodeToAsciiHtml(x)
+ return x
+
+def html2unicode(text, ignore = []):
+ """Return text, replacing HTML entities by equivalent unicode characters."""
+ # This regular expression will match any decimal and hexadecimal entity and
+ # also entities that might be named entities.
+ entityR = re.compile(
+ r'&(?:amp;)?(#(?P<decimal>\d+)|#x(?P<hex>[0-9a-fA-F]+)|(?P<name>[A-Za-z]+));')
+ # These characters are Html-illegal, but sadly you *can* find some of
+ # these and converting them to unichr(decimal) is unsuitable
+ convertIllegalHtmlEntities = {
+ 128 : 8364, # €
+ 130 : 8218, # ‚
+ 131 : 402, # ƒ
+ 132 : 8222, # „
+ 133 : 8230, # …
+ 134 : 8224, # †
+ 135 : 8225, # ‡
+ 136 : 710, # ˆ
+ 137 : 8240, # ‰
+ 138 : 352, # Š
+ 139 : 8249, # ‹
+ 140 : 338, # Œ
+ 142 : 381, # Ž
+ 145 : 8216, # ‘
+ 146 : 8217, # ’
+ 147 : 8220, # “
+ 148 : 8221, # ”
+ 149 : 8226, # •
+ 150 : 8211, # –
+ 151 : 8212, # —
+ 152 : 732, # ˜
+ 153 : 8482, # ™
+ 154 : 353, # š
+ 155 : 8250, # ›
+ 156 : 339, # œ
+ 158 : 382, # ž
+ 159 : 376 # Ÿ
+ }
+    #ensuring that illegal &#129;, &#141; and &#157;, which have no known values,
+ #don't get converted to unichr(129), unichr(141) or unichr(157)
+ ignore = set(ignore) | set([129, 141, 157])
+ result = u''
+ i = 0
+ found = True
+ while found:
+ text = text[i:]
+ match = entityR.search(text)
+ if match:
+ unicodeCodepoint = None
+ if match.group('decimal'):
+ unicodeCodepoint = int(match.group('decimal'))
+ elif match.group('hex'):
+ unicodeCodepoint = int(match.group('hex'), 16)
+ elif match.group('name'):
+ name = match.group('name')
+ if name in htmlentitydefs.name2codepoint:
+ # We found a known HTML entity.
+ unicodeCodepoint = htmlentitydefs.name2codepoint[name]
+ result += text[:match.start()]
+ try:
+ unicodeCodepoint = convertIllegalHtmlEntities[unicodeCodepoint]
+ except KeyError:
+ pass
+ if unicodeCodepoint and unicodeCodepoint not in ignore and (WIDEBUILD or unicodeCodepoint < 65534):
+ result += unichr(unicodeCodepoint)
+ else:
+ # Leave the entity unchanged
+ result += text[match.start():match.end()]
+ i = match.end()
+ else:
+ result += text
+ found = False
+ return result
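+
+# Example for html2unicode(): named, decimal and double-escaped ('&amp;#...;')
+# entities are all resolved to the corresponding characters:
+#
+#   >>> print html2unicode(u'caf&eacute; &#8211; &amp;#233;')
+#   café – é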
+
+# Warning! _familyCache does not necessarily have to be consistent between
+# two statements. Always ensure that a local reference is created when
+# accessing Family objects
+_familyCache = weakref.WeakValueDictionary()
+def Family(fam=None, fatal=True, force=False):
+ """Import the named family.
+
+ @param fam: family name (if omitted, uses the configured default)
+ @type fam: str
+ @param fatal: if True, the bot will stop running if the given family is
+ unknown. If False, it will only raise a ValueError exception.
+    @type fatal: bool
+ @return: a Family instance configured for the named family.
+
+ """
+ if fam is None:
+ fam = config.family
+
+ family = _familyCache.get(fam)
+ if family and not force:
+ return family
+
+ try:
+ # search for family module in the 'families' subdirectory
+ sys.path.append(config.datafilepath('families'))
+ myfamily = __import__('%s_family' % fam)
+ except ImportError:
+ if fatal:
+ output(u"""\
+Error importing the %s family. This probably means the family
+does not exist. Also check your configuration file."""
+ % fam)
+ import traceback
+ traceback.print_stack()
+ sys.exit(1)
+ else:
+ raise ValueError("Family %s does not exist" % repr(fam))
+
+ family = myfamily.Family()
+ _familyCache[fam] = family
+ return family
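+
+# A minimal usage sketch for Family() (illustrative only; assumes the standard
+# 'wikipedia' family file is present in the families directory):
+#
+#   >>> fam = Family('wikipedia')
+#   >>> fam.name
+#   'wikipedia'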
+
+
+class Site(object):
+ """A MediaWiki site. Do not instantiate directly; use getSite() function.
+
+    Constructor takes three arguments; only the code is mandatory
+    (see __init__() for parameter details).
+
+ Methods:
+
+ language: This Site's language code.
+ family: This Site's Family object.
+ sitename: A string representing this Site.
+ languages: A list of all languages contained in this site's Family.
+ validLanguageLinks: A list of language codes that can be used in interwiki
+ links.
+
+ loggedInAs: return current username, or None if not logged in.
+ forceLogin: require the user to log in to the site
+ messages: return True if there are new messages on the site
+ cookies: return user's cookies as a string
+
+ getUrl: retrieve an URL from the site
+ urlEncode: Encode a query to be sent using an http POST request.
+ postForm: Post form data to an address at this site.
+ postData: Post encoded form data to an http address at this site.
+
+ namespace(num): Return local name of namespace 'num'.
+ normalizeNamespace(value): Return preferred name for namespace 'value' in
+ this Site's language.
+ namespaces: Return list of canonical namespace names for this Site.
+ getNamespaceIndex(name): Return the int index of namespace 'name', or None
+ if invalid.
+
+ redirect: Return the localized redirect tag for the site.
+ redirectRegex: Return compiled regular expression matching on redirect
+ pages.
+ mediawiki_message: Retrieve the text of a specified MediaWiki message
+ has_mediawiki_message: True if this site defines specified MediaWiki
+ message
+ has_api: True if this site's family provides api interface
+
+ shared_image_repository: Return tuple of image repositories used by this
+ site.
+ category_on_one_line: Return True if this site wants all category links
+ on one line.
+ interwiki_putfirst: Return list of language codes for ordering of
+ interwiki links.
+ linkto(title): Return string in the form of a wikilink to 'title'
+ isInterwikiLink(s): Return True if 's' is in the form of an interwiki
+ link.
+ getSite(lang): Return Site object for wiki in same family, language
+ 'lang'.
+ version: Return MediaWiki version string from Family file.
+ versionnumber: Return int identifying the MediaWiki version.
+ live_version: Return version number read from Special:Version.
+ checkCharset(charset): Warn if charset doesn't match family file.
+ server_time: returns the server time (currently depends on the user's clock)
+
+ getParsedString: Parses the string with API and returns html content.
+ getExpandedString: Expands the string with API and returns wiki content.
+
+ linktrail: Return regex for trailing chars displayed as part of a link.
+ disambcategory: Category in which disambiguation pages are listed.
+
+ Methods that yield Page objects derived from a wiki's Special: pages
+ (note, some methods yield other information in a tuple along with the
+ Pages; see method docs for details) --
+
+ search(query): query results from Special:Search
+ allpages(): Special:Allpages
+ prefixindex(): Special:Prefixindex
+ protectedpages(): Special:ProtectedPages
+ newpages(): Special:Newpages
+ newimages(): Special:Log&type=upload
+ longpages(): Special:Longpages
+ shortpages(): Special:Shortpages
+ categories(): Special:Categories (yields Category objects)
+ deadendpages(): Special:Deadendpages
+ ancientpages(): Special:Ancientpages
+ lonelypages(): Special:Lonelypages
+ recentchanges(): Special:Recentchanges
+ unwatchedpages(): Special:Unwatchedpages (sysop accounts only)
+ uncategorizedcategories(): Special:Uncategorizedcategories (yields
+ Category objects)
+ uncategorizedpages(): Special:Uncategorizedpages
+ uncategorizedimages(): Special:Uncategorizedimages (yields
+ ImagePage objects)
+ uncategorizedtemplates(): Special:UncategorizedTemplates
+ unusedcategories(): Special:Unusedcategories (yields Category)
+ unusedfiles(): Special:Unusedimages (yields ImagePage)
+ randompage: Special:Random
+ randomredirectpage: Special:RandomRedirect
+ withoutinterwiki: Special:Withoutinterwiki
+ linksearch: Special:Linksearch
+
+ Convenience methods that provide access to properties of the wiki Family
+ object; all of these are read-only and return a unicode string unless
+ noted --
+
+ encoding: The current encoding for this site.
+ encodings: List of all historical encodings for this site.
+ category_namespace: Canonical name of the Category namespace on this
+ site.
+ category_namespaces: List of all valid names for the Category
+ namespace.
+ image_namespace: Canonical name of the Image namespace on this site.
+ template_namespace: Canonical name of the Template namespace on this
+ site.
+ protocol: Protocol ('http' or 'https') for access to this site.
+ hostname: Host portion of site URL.
+ path: URL path for index.php on this Site.
+ dbName: MySQL database name.
+
+ Methods that return addresses to pages on this site (usually in
+ Special: namespace); these methods only return URL paths, they do not
+ interact with the wiki --
+
+ export_address: Special:Export.
+ query_address: URL path + '?' for query.php
+ api_address: URL path + '?' for api.php
+ apipath: URL path for api.php
+ move_address: Special:Movepage.
+ delete_address(s): Delete title 's'.
+ undelete_view_address(s): Special:Undelete for title 's'
+ undelete_address: Special:Undelete.
+ protect_address(s): Protect title 's'.
+ unprotect_address(s): Unprotect title 's'.
+ put_address(s): Submit revision to page titled 's'.
+ get_address(s): Retrieve page titled 's'.
+ nice_get_address(s): Short URL path to retrieve page titled 's'.
+ edit_address(s): Edit form for page titled 's'.
+ purge_address(s): Purge cache and retrieve page 's'.
+ block_address: Block an IP address.
+ unblock_address: Unblock an IP address.
+ blocksearch_address(s): Search for blocks on IP address 's'.
+ linksearch_address(s): Special:Linksearch for target 's'.
+ search_address(q): Special:Search for query 'q'.
+ allpages_address(s): Special:Allpages.
+ newpages_address: Special:Newpages.
+ longpages_address: Special:Longpages.
+ shortpages_address: Special:Shortpages.
+ unusedfiles_address: Special:Unusedimages.
+ categories_address: Special:Categories.
+ deadendpages_address: Special:Deadendpages.
+ ancientpages_address: Special:Ancientpages.
+ lonelypages_address: Special:Lonelypages.
+ protectedpages_address: Special:ProtectedPages
+ unwatchedpages_address: Special:Unwatchedpages.
+ uncategorizedcategories_address: Special:Uncategorizedcategories.
+ uncategorizedimages_address: Special:Uncategorizedimages.
+ uncategorizedpages_address: Special:Uncategorizedpages.
+ uncategorizedtemplates_address: Special:UncategorizedTemplates.
+ unusedcategories_address: Special:Unusedcategories.
+ withoutinterwiki_address: Special:Withoutinterwiki.
+ references_address(s): Special:Whatlinkshere for page 's'.
+ allmessages_address: Special:Allmessages.
+ upload_address: Special:Upload.
+ double_redirects_address: Special:Doubleredirects.
+ broken_redirects_address: Special:Brokenredirects.
+ random_address: Special:Random.
+ randomredirect_address: Special:RandomRedirect.
+ login_address: Special:Userlogin.
+ captcha_image_address(id): Special:Captcha for image 'id'.
+ watchlist_address: Special:Watchlist editor.
+ contribs_address(target): Special:Contributions for user 'target'.
+
+ """
+
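+ # A minimal usage sketch of the Site interface described above
+ # (illustrative only; getSite() and a configured 'wikipedia' family
+ # are assumed):
+ #
+ #   site = getSite('en', 'wikipedia')
+ #   site.forceLogin()            # log in if not already
+ #   print site.namespace(14)     # local name of the Category namespace
+ #   if site.messages():
+ #       print 'You have new talk page messages'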
+ @deprecate_arg("persistent_http", None)
+ def __init__(self, code, fam=None, user=None):
+ """
+ @param code: the site's language code
+ @type code: str
+ @param fam: wiki family name (optional)
+ @type fam: str or Family
+ @param user: bot user name (optional)
+ @type user: str
+
+ """
+ self.__code = code.lower()
+ if isinstance(fam, basestring) or fam is None:
+ self.__family = Family(fam, fatal = False)
+ else:
+ self.__family = fam
+
+ # if we got an outdated language code, use the new one instead.
+ if self.__code in self.__family.obsolete:
+ if self.__family.obsolete[self.__code] is not None:
+ self.__code = self.__family.obsolete[self.__code]
+ else:
+ # no such language anymore
+ raise NoSuchSite("Language %s in family %s is obsolete"
+ % (self.__code, self.__family.name))
+ if self.__code not in self.languages():
+ if self.__code == 'zh-classic' \
+ and 'zh-classical' in self.languages():
+ self.__code = 'zh-classical'
+ # database hack (database is varchar[10], so zh-classical
+ # is cut to zh-classic)
+ elif self.__family.name in self.__family.langs.keys() \
+ or len(self.__family.langs) == 1:
+ self.__code = self.__family.name
+ else:
+ raise NoSuchSite("Language %s does not exist in family %s"
+ % (self.__code, self.__family.name))
+
+ self._mediawiki_messages = {}
+ self._info = {}
+ self._userName = [None, None]
+ self.nocapitalize = self.code in self.family.nocapitalize
+ self.user = user
+ self._userData = [False, False]
+ self._isLoggedIn = [None, None]
+ self._isBlocked = [None, None]
+ self._messages = [None, None]
+ self._rights = [None, None]
+ self._token = [None, None]
+ self._patrolToken = [None, None]
+ self._cookies = [None, None]
+ # Calculating valid languages took quite long, so we calculate it once
+ # in initialization instead of each time it is used.
+ self._validlanguages = []
+ for language in self.languages():
+ if not language[0].upper() + language[1:] in self.namespaces():
+ self._validlanguages.append(language)
+
+ def __call__(self):
+ """Return self, so the deprecated Page.site() call keeps working.
+
+ Page.site is now a property; calling it as Page.site() still returns
+ the Site object for backwards compatibility, but Page.site without
+ parentheses is the recommended form.
+
+ """
+## # DEPRECATED warning. Should be uncommented if scripts are actualized
+## pywikibot.output('Page.site() method is DEPRECATED, '
+## 'use Page.site instead.')
+ return self
+
+ @property
+ def family(self):
+ """The Family object for this Site's wiki family."""
+
+ return self.__family
+
+ @property
+ def code(self):
+ """The identifying code for this Site.
+
+ By convention, this is usually an ISO language code, but it does
+ not have to be.
+
+ """
+ return self.__code
+
+ @property
+ def lang(self):
+ """The ISO language code for this Site.
+
+ Presumed to be equal to the wiki prefix, but this can be overridden.
+
+ """
+ return self.__code
+
+ def __cmp__(self, other):
+ """Perform equality and inequality tests on Site objects."""
+
+ if not isinstance(other, Site):
+ return 1
+ if self.family.name == other.family.name:
+ return cmp(self.code, other.code)
+ return cmp(self.family.name, other.family.name)
+
+ def _userIndex(self, sysop = False):
+ """Returns the internal index of the user."""
+ if sysop:
+ return 1
+ else:
+ return 0
+
+ def username(self, sysop = False):
+ return self._userName[self._userIndex(sysop = sysop)]
+
+ def sitename(self):
+ """Return string representing this Site's name and code."""
+
+ return self.family.name+':'+self.code
+
+ def __repr__(self):
+ return '%s:%s' % (self.family.name, self.code)
+
+ def __hash__(self):
+ return hash(repr(self))
+
+ def linktrail(self):
+ """Return regex for trailing chars displayed as part of a link.
+
+ Returns a string, not a compiled regular expression object.
+
+ This reads from the family file, and ''not'' from
+ [[MediaWiki:Linktrail]], because the MW software currently uses a
+ built-in linktrail from its message files and ignores the wiki
+ value.
+
+ """
+ return self.family.linktrail(self.code)
+
+ def languages(self):
+ """Return list of all valid language codes for this site's Family."""
+
+ return self.family.iwkeys
+
+ def validLanguageLinks(self):
+ """Return list of language codes that can be used in interwiki links."""
+ return self._validlanguages
+
+ def namespaces(self):
+ """Return list of canonical namespace names for this Site."""
+
+ # n.b.: this does not return namespace numbers; to determine which
+ # numeric namespaces the framework recognizes for this Site (which
+ # may or may not actually exist on the wiki), use
+ # self.family.namespaces.keys()
+
+ if self in _namespaceCache:
+ return _namespaceCache[self]
+ else:
+ nslist = []
+ for n in self.family.namespaces:
+ try:
+ ns = self.family.namespace(self.lang, n)
+ except KeyError:
+ # No default namespace defined
+ continue
+ if ns is not None:
+ nslist.append(self.family.namespace(self.lang, n))
+ _namespaceCache[self] = nslist
+ return nslist
+
+ def redirect(self, default=False):
+ """Return the localized redirect tag for the site.
+
+ """
+ # return the magic word without the preceding '#' character
+ if default or self.versionnumber() <= 13:
+ return u'REDIRECT'
+ else:
+ return self.getmagicwords('redirect')[0].lstrip("#")
+
+ def loggedInAs(self, sysop = False):
+ """Return the current username if logged in, otherwise return None.
+
+ Checks if we're logged in by loading a page and looking for the login
+ link. We assume that we're not being logged out during a bot run, so
+ loading the test page is only required once.
+
+ """
+ index = self._userIndex(sysop)
+ if self._isLoggedIn[index] is None:
+ # Load the details only if you don't know the login status.
+ # Don't load them just because the other details aren't known.
+ self._load(sysop = sysop)
+ if self._isLoggedIn[index]:
+ return self._userName[index]
+ else:
+ return None
+
+ def forceLogin(self, sysop = False):
+ """Log the user in if not already logged in."""
+ if not self.loggedInAs(sysop = sysop):
+ loginMan = login.LoginManager(site = self, sysop = sysop)
+ #loginMan.logout()
+ if loginMan.login(retry = True):
+ index = self._userIndex(sysop)
+ self._isLoggedIn[index] = True
+ self._userName[index] = loginMan.username
+ # We know nothing about the new user except the name;
+ # the old info refers to the anonymous user
+ self._userData[index] = False
+
+ def checkBlocks(self, sysop = False):
+ """Check if the user is blocked, and raise an exception if so."""
+ self._load(sysop = sysop)
+ index = self._userIndex(sysop)
+ if self._isBlocked[index]:
+ # User blocked
+ raise UserBlocked('User is blocked in site %s' % self)
+
+ def isBlocked(self, sysop = False):
+ """Check if the user is blocked."""
+ self._load(sysop = sysop)
+ index = self._userIndex(sysop)
+ if self._isBlocked[index]:
+ # User blocked
+ return True
+ else:
+ return False
+
+ def _getBlock(self, sysop = False):
+ """Get user block data from the API."""
+ try:
+ params = {
+ 'action': 'query',
+ 'meta': 'userinfo',
+ 'uiprop': 'blockinfo',
+ }
+ data = query.GetData(params, self)
+ if not data or 'error' in data:
+ return False
+ if self.versionnumber() == 11: # fix for version 1.11 API.
+ data = data['userinfo']
+ else:
+ data = data['query']['userinfo']
+ return 'blockedby' in data
+ except NotImplementedError:
+ return False
+
+ def isAllowed(self, right, sysop = False):
+ """Check if the user has a specific right.
+ Among possible rights:
+ * Actions: edit, move, delete, protect, upload
+ * User levels: autoconfirmed, sysop, bot, empty string (always true)
+ """
+ if right == '' or right is None:
+ return True
+ else:
+ self._load(sysop = sysop)
+ index = self._userIndex(sysop)
+ # Handle obsolete editusercssjs permission
+ if right in ['editusercss', 'edituserjs'] \
+ and right not in self._rights[index]:
+ return 'editusercssjs' in self._rights[index]
+ return right in self._rights[index]
+
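+ # Minimal usage sketch for isAllowed() (illustrative only):
+ #   site.isAllowed('move')    # True if the account may move pages
+ #   site.isAllowed('sysop')   # True if the account has the sysop level
+ #   site.isAllowed('')        # always True, as documented above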
+ def server_time(self):
+ """Return a datetime object representing the server time."""
+ # It currently depends on the user's local clock.
+ return self.family.server_time()
+
+ def messages(self, sysop = False):
+ """Returns true if the user has new messages, and false otherwise."""
+ self._load(sysop = sysop)
+ index = self._userIndex(sysop)
+ return self._messages[index]
+
+ def cookies(self, sysop = False):
+ """Return a string containing the user's current cookies."""
+ self._loadCookies(sysop = sysop)
+ index = self._userIndex(sysop)
+ if self._cookies[index]:
+ #convert cookies dictionary data to string.
+ outputDatas = ""
+ for k, v in self._cookies[index].iteritems():
+ if v:
+ outputDatas += "%s=%s; " % (k,v)
+ else:
+ # protection for value ''
+ outputDatas += "%s=none; " % k
+ return outputDatas
+ else:
+ return None
+
+ def _loadCookies(self, sysop = False):
+ """
+ Retrieve session cookies for login.
+
+ If the family defines cross projects, this method also looks for the
+ central login file written by this project or by any of the available
+ cross projects, and reads whichever cookie data exists.
+ """
+ index = self._userIndex(sysop)
+ if self._cookies[index] is not None:
+ return
+ try:
+ if sysop:
+ try:
+ username = config.sysopnames[self.family.name][self.lang]
+ except KeyError:
+ raise NoUsername("""\
+You tried to perform an action that requires admin privileges, but you haven't
+entered your sysop name in your user-config.py. Please add
+sysopnames['%s']['%s']='name' to your user-config.py"""
+ % (self.family.name, self.lang))
+ else:
+ username = config.usernames[self.family.name][self.lang]
+ except KeyError:
+ self._cookies[index] = None
+ self._isLoggedIn[index] = False
+ else:
+ # check central login data if cross_projects is available.
+ localFn = '%s-%s-%s-login.data' % (self.family.name, self.lang, username)
+ localPa = config.datafilepath('login-data', localFn)
+ if self.family.cross_projects:
+ for proj in [self.family.name] + self.family.cross_projects:
+ #find all central data in all cross_projects
+ centralFn = '%s-%s-central-login.data' % (proj, username)
+ centralPa = config.datafilepath('login-data', centralFn)
+ if os.path.exists(centralPa):
+ self._cookies[index] = self._readCookies(centralFn)
+ break
+
+ if os.path.exists(localPa):
+ # read local login data and merge it into self._cookies[index];
+ # if self._cookies[index] is not available yet, read the local data
+ # and set the dictionary from it.
+ if type(self._cookies[index]) == dict:
+ for k, v in self._readCookies(localFn).iteritems():
+ if k not in self._cookies[index]:
+ self._cookies[index][k] = v
+ else:
+ self._cookies[index] = dict([(k,v) for k,v in self._readCookies(localFn).iteritems()])
+ #self._cookies[index] = query.CombineParams(self._cookies[index], self._readCookies(localFn))
+ elif not os.path.exists(localPa) and not self.family.cross_projects:
+ # stay in anonymous mode if not logged in and centralauth is not enabled
+ self._cookies[index] = None
+ self._isLoggedIn[index] = False
+
+ def _readCookies(self, filename):
+ """read login cookie file and return a dictionary."""
+ try:
+ f = open( config.datafilepath('login-data', filename), 'r')
+ ck = re.compile("(.*?)=(.*?)\r?\n")
+ data = dict([(x[0],x[1]) for x in ck.findall(f.read())])
+ #data = dict(ck.findall(f.read()))
+ f.close()
+ return data
+ except IOError:
+ return None
+
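+ # Illustrative note on the login-data format read above: each file is a
+ # set of plain "key=value" lines, e.g. a hypothetical file containing
+ #   enwikiUserName=ExampleBot
+ #   enwikiUserID=12345
+ # is returned as {'enwikiUserName': 'ExampleBot', 'enwikiUserID': '12345'}.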
+ def _setupCookies(self, datas, sysop = False):
+ """Save the cookie dictionary to file(s).
+
+ If cross_projects is enabled, the data is saved in two separate files:
+ one for the central (shared) cookies and one for the local cookies.
+ """
+ index = self._userIndex(sysop)
+ if not self._cookies[index]:
+ self._cookies[index] = datas
+ cache = {0:"",1:""} #0 is central auth, 1 is local.
+ if not self.username(sysop):
+ if not self._cookies[index]:
+ return
+ elif self.family.cross_projects_cookie_username in self._cookies[index]:
+ # With centralauth sharing login data across projects, forceLogin is
+ # not necessary, but Site() does not know that yet, so we add the
+ # centralauth username to the site attributes.
+ self._userName[index] = self._cookies[index][self.family.cross_projects_cookie_username]
+
+
+ for k, v in datas.iteritems():
+ # put keys and values into the save cache
+ if self.family.cross_projects and k in self.family.cross_projects_cookies:
+ cache[0] += "%s=%s\n" % (k,v)
+ else:
+ cache[1] += "%s=%s\n" % (k,v)
+
+ # write the data.
+ if self.family.cross_projects and cache[0]:
+ filename = '%s-%s-central-login.data' % (self.family.name, self.username(sysop))
+ f = open(config.datafilepath('login-data', filename), 'w')
+ f.write(cache[0])
+ f.close()
+
+ filename = '%s-%s-%s-login.data' % (self.family.name, self.lang, self.username(sysop))
+ f = open(config.datafilepath('login-data', filename), 'w')
+ f.write(cache[1])
+ f.close()
+
+ def _removeCookies(self, name):
+ # remove cookies.
+ # TODO: remove all local data files if cross_projects is enabled.
+ #
+ if self.family.cross_projects:
+ file = config.datafilepath('login-data', '%s-%s-central-login.data' % (self.family.name, name))
+ if os.path.exists(file):
+ os.remove( file )
+ file = config.datafilepath('login-data', '%s-%s-%s-login.data' % (self.family.name, self.lang, name))
+ if os.path.exists(file):
+ os.remove(file)
+
+ def updateCookies(self, datas, sysop = False):
+ """Check and update the current cookie data and save it back to the files."""
+ index = self._userIndex(sysop)
+ if not self._cookies[index]:
+ self._setupCookies(datas, sysop)
+
+ for k, v in datas.iteritems():
+ if k in self._cookies[index]:
+ if v != self._cookies[index][k]:
+ self._cookies[index][k] = v
+ else:
+ self._cookies[index][k] = v
+
+ self._setupCookies(self._cookies[index], sysop)
+
+ def urlEncode(self, query):
+ """Encode a query so that it can be sent using an http POST request."""
+ if not query:
+ return None
+ if hasattr(query, 'iteritems'):
+ iterator = query.iteritems()
+ else:
+ iterator = iter(query)
+ l = []
+ wpEditToken = None
+ for key, value in iterator:
+ if isinstance(key, unicode):
+ key = key.encode('utf-8')
+ if isinstance(value, unicode):
+ value = value.encode('utf-8')
+ key = urllib.quote(key)
+ value = urllib.quote(value)
+ if key == 'wpEditToken':
+ wpEditToken = value
+ continue
+ l.append(key + '=' + value)
+
+ # wpEditToken is explicitly added as last value.
+ # If a premature connection abort occurs while putting, the server will
+ # not have received an edit token and thus refuse saving the page
+ if wpEditToken is not None:
+ l.append('wpEditToken=' + wpEditToken)
+ return '&'.join(l)
+
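+ # Illustrative example of urlEncode() (hypothetical values): the query
+ # {'title': u'Foo bar', 'wpEditToken': u'abc+'} is encoded as
+ # 'title=Foo%20bar&wpEditToken=abc%2B', with wpEditToken always moved
+ # to the end so a truncated POST cannot carry a valid edit token.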
+ def solveCaptcha(self, data):
+ if type(data) == dict: # API Mode result
+ if 'edit' in data and data['edit']['result'] != u"Success":
+ data = data['edit']
+ if "captcha" in data:
+ data = data['captcha']
+ captype = data['type']
+ id = data['id']
+ if captype in ['simple', 'math', 'question']:
+ answer = input('What is the answer to the captcha "%s" ?' % data['question'])
+ elif captype == 'image':
+ url = self.protocol() + '://' + self.hostname() + self.captcha_image_address(id)
+ answer = ui.askForCaptcha(url)
+ else: #no captcha id result, maybe ReCaptcha.
+ raise CaptchaError('We have been prompted for a ReCaptcha, but pywikipedia does not yet support ReCaptchas')
+ return {'id':id, 'answer':answer}
+ return None
+ else:
+ captchaW = re.compile('<label for="wpCaptchaWord">(?P<question>[^<]*)</label>')
+ captchaR = re.compile('<input type="hidden" name="wpCaptchaId" id="wpCaptchaId" value="(?P<id>\d+)" />')
+ match = captchaR.search(data)
+ if match:
+ id = match.group('id')
+ match = captchaW.search(data)
+ if match:
+ answer = input('What is the answer to the captcha "%s" ?' % match.group('question'))
+ else:
+ if not config.solve_captcha:
+ raise CaptchaError(id)
+ url = self.protocol() + '://' + self.hostname() + self.captcha_image_address(id)
+ answer = ui.askForCaptcha(url)
+ return {'id':id, 'answer':answer}
+ Recaptcha = re.compile('<script type="text/javascript" src="http://api\.recaptcha\.net/[^"]*"></script>')
+ if Recaptcha.search(data):
+ raise CaptchaError('We have been prompted for a ReCaptcha, but pywikipedia does not yet support ReCaptchas')
+ return None
+
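+ # Illustrative note: solveCaptcha() returns None when no captcha is
+ # present, otherwise a dict such as {'id': '12345', 'answer': u'7'}
+ # (values hypothetical) which the caller can send back with the request.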
+ def postForm(self, address, predata, sysop = False, cookies = None):
+ """Post http form data to the given address at this site.
+
+ address - the absolute path without hostname.
+ predata - a dict or any iterable that can be converted to a dict,
+ containing keys and values for the http form.
+ cookies - the cookies to send with the form. If None, send self.cookies
+
+ Return a (response, data) tuple, where response is the HTTP
+ response object and data is a Unicode string containing the
+ body of the response.
+
+ """
+ if ('action' in predata) and pywikibot.simulate and \
+ (predata['action'] in pywikibot.config.actions_to_block) and \
+ (address not in [self.export_address()]):
+ pywikibot.output(u'\03{lightyellow}SIMULATION: %s action blocked.\03{default}'%\
+ predata['action'])
+ import StringIO
+ f_dummy = StringIO.StringIO()
+ f_dummy.__dict__.update({u'code': 0, u'msg': u''})
+ return f_dummy, u''
+
+ data = self.urlEncode(predata)
+ try:
+ if cookies:
+ return self.postData(address, data, sysop=sysop,
+ cookies=cookies)
+ else:
+ return self.postData(address, data, sysop=sysop,
+ cookies=self.cookies(sysop = sysop))
+ except socket.error, e:
+ raise ServerError(e)
+
+ def postData(self, address, data,
+ contentType = 'application/x-www-form-urlencoded',
+ sysop = False, compress = True, cookies = None):
+ """Post encoded data to the given http address at this site.
+
+ address is the absolute path without hostname.
+ data is an ASCII string that has been URL-encoded.
+
+ Returns a (response, data) tuple where response is the HTTP
+ response object and data is a Unicode string containing the
+ body of the response.
+ """
+
+ if address[-1] == "?":
+ address = address[:-1]
+
+ headers = {
+ 'User-agent': useragent,
+ 'Content-Length': str(len(data)),
+ 'Content-type':contentType,
+ }
+ if cookies:
+ headers['Cookie'] = cookies
+
+ if compress:
+ headers['Accept-encoding'] = 'gzip'
+ #print '%s' % headers
+
+ url = '%s://%s%s' % (self.protocol(), self.hostname(), address)
+ # Try to retrieve the page until it was successfully loaded (just in
+ # case the server is down or overloaded).
+ # Wait for retry_idle_time minutes (growing!) between retries.
+ retry_idle_time = 1
+ retry_attempt = 0
+ while True:
+ try:
+ request = urllib2.Request(url, data, headers)
+ f = MyURLopener.open(request)
+
+ # read & info can raise socket.error
+ text = f.read()
+ headers = f.info()
+ break
+ except KeyboardInterrupt:
+ raise
+ except urllib2.HTTPError, e:
+ if e.code in [401, 404]:
+ raise PageNotFound(u'Page %s could not be retrieved. Check your family file ?' % url)
+ # just check for HTTP Status 500 (Internal Server Error)?
+ elif e.code in [500, 502, 504]:
+ output(u'HTTPError: %s %s' % (e.code, e.msg))
+ if config.retry_on_fail:
+ retry_attempt += 1
+ if retry_attempt > config.maxretries:
+ raise MaxTriesExceededError()
+ output(u"WARNING: Could not open '%s'.\nMaybe the server is down. Retrying in %i minutes..."
+ % (url, retry_idle_time))
+ time.sleep(retry_idle_time * 60)
+ # Next time wait longer, but not longer than half an hour
+ retry_idle_time *= 2
+ if retry_idle_time > 30:
+ retry_idle_time = 30
+ continue
+ raise
+ else:
+ output(u"Result: %s %s" % (e.code, e.msg))
+ raise
+ except Exception, e:
+ output(u'%s' %e)
+ if pywikibot.verbose:
+ import traceback
+ traceback.print_exc()
+
+ if config.retry_on_fail:
+ retry_attempt += 1
+ if retry_attempt > config.maxretries:
+ raise MaxTriesExceededError()
+ output(u"WARNING: Could not open '%s'. Maybe the server or\n your connection is down. Retrying in %i minutes..."
+ % (url, retry_idle_time))
+ time.sleep(retry_idle_time * 60)
+ retry_idle_time *= 2
+ if retry_idle_time > 30:
+ retry_idle_time = 30
+ continue
+ raise
+
+ # check whether cookies were returned; if so, pass them on to be updated.
+ if hasattr(f, 'sheaders'):
+ ck = f.sheaders
+ else:
+ ck = f.info().getallmatchingheaders('set-cookie')
+ if ck:
+ Reat=re.compile(': (.*?)=(.*?);')
+ tmpc = {}
+ for d in ck:
+ m = Reat.search(d)
+ if m: tmpc[m.group(1)] = m.group(2)
+ if self.cookies(sysop):
+ self.updateCookies(tmpc, sysop)
+
+ resContentType = headers.get('content-type', '')
+ contentEncoding = headers.get('content-encoding', '')
+
+ # Ensure that all sent data is received
+ # In rare cases we found a double Content-Length in the header,
+ # so we split it to get a single value.
+ content_length = int(headers.get('content-length', '0').split(',')[0])
+ if content_length != len(text) and 'content-length' in headers:
+ output(
+ u'Warning! len(text) does not match content-length: %s != %s'
+ % (len(text), content_length))
+ return self.postData(address, data, contentType, sysop, compress,
+ cookies)
+
+ if compress and contentEncoding == 'gzip':
+ text = decompress_gzip(text)
+
+ R = re.compile('charset=([^\'\";]+)')
+ m = R.search(resContentType)
+ if m:
+ charset = m.group(1)
+ else:
+ if verbose:
+ output(u"WARNING: No character set found.")
+ # UTF-8 as default
+ charset = 'utf-8'
+ # Check if this is the charset we expected
+ self.checkCharset(charset)
+ # Convert HTML to Unicode
+ try:
+ text = unicode(text, charset, errors = 'strict')
+ except UnicodeDecodeError, e:
+ print e
+ output(u'ERROR: Invalid characters found on %s://%s%s, replaced by \\ufffd.'
+ % (self.protocol(), self.hostname(), address))
+ # We use error='replace' in case of bad encoding.
+ text = unicode(text, charset, errors = 'replace')
+
+ # If a wiki page, get user data
+ self._getUserDataOld(text, sysop = sysop)
+
+ return f, text
+
+ #@deprecated("pywikibot.comms.http.request") # in 'trunk' not yet...
+ def getUrl(self, path, retry = None, sysop = False, data = None, compress = True,
+ no_hostname = False, cookie_only=False, refer=None, back_response=False):
+ """
+ Low-level routine to get a URL from the wiki. Tries to log in if it is
+ another wiki.
+
+ Parameters:
+ path - The absolute path, without the hostname.
+ retry - If True, retries loading the page when a network error
+ occurs.
+ sysop - If True, the sysop account's cookie will be used.
+ data - An optional dict providing extra post request parameters.
+ cookie_only - Only return the cookie the server sent us back
+
+ Returns the HTML text of the page converted to unicode.
+ """
+ from pywikibot.comms import http
+
+ f, text = http.request(self, path, retry, sysop, data, compress,
+ no_hostname, cookie_only, refer, back_response = True)
+
+ # If a wiki page, get user data
+ self._getUserDataOld(text, sysop = sysop)
+
+ if back_response:
+ return f, text
+
+ return text
+
+ def _getUserData(self, text, sysop = False, force = True):
+ """
+ Get the user data from an API query dict.
+
+ Parameters:
+ * text - the page text
+ * sysop - is the user a sysop?
+ """
+
+ index = self._userIndex(sysop)
+ # Check for blocks
+
+ if 'blockedby' in text and not self._isBlocked[index]:
+ # Write a warning if not shown earlier
+ if sysop:
+ account = 'Your sysop account'
+ else:
+ account = 'Your account'
+ output(u'\nWARNING: %s on %s is blocked by %s.\nReason: %s\nEditing using this account will stop the run.\n'
+ % (account, self, text['blockedby'], text['blockreason']))
+ self._isBlocked[index] = 'blockedby' in text
+
+ # Check for new messages; the data dict must contain the key 'messages'.
+ if 'messages' in text:
+ if not self._messages[index]:
+ # User has *new* messages
+ if sysop:
+ output(u'NOTE: You have new messages in your sysop account on %s' % self)
+ else:
+ output(u'NOTE: You have new messages on %s' % self)
+ self._messages[index] = True
+ else:
+ self._messages[index] = False
+
+ # Don't perform other checks if the data was already loaded
+ if self._userData[index] and not force:
+ return
+
+ # Get username.
+ # In anonymous mode the data contains the key 'anon'.
+ # If 'anon' is present the username is an IP address, which we do not
+ # collect here.
+ if not 'anon' in text:
+ self._isLoggedIn[index] = True
+ self._userName[index] = text['name']
+ else:
+ self._isLoggedIn[index] = False
+ self._userName[index] = None
+
+ # Get user groups and rights
+ if 'groups' in text:
+ self._rights[index] = []
+ for group in text['groups']:
+ # Convert dictionaries to list items (bug 3311663)
+ if isinstance(group, dict):
+ self._rights[index].extend(group.keys())
+ else:
+ self._rights[index].append(group)
+ self._rights[index].extend(text['rights'])
+ # Warnings
+ # Don't show warnings for not logged in users, they will just fail to
+ # do any action
+ if self._isLoggedIn[index]:
+ if 'bot' not in self._rights[index] and config.notify_unflagged_bot:
+ # Sysop + bot flag = Sysop flag in MediaWiki < 1.7.1?
+ if sysop:
+ output(u'Note: Your sysop account on %s does not have a bot flag. Its edits will be visible in the recent changes.' % self)
+ else:
+ output(u'WARNING: Your account on %s does not have a bot flag. Its edits will be visible in the recent changes and it may get blocked.' % self)
+ if sysop and 'sysop' not in self._rights[index]:
+ output(u'WARNING: Your sysop account on %s does not seem to have sysop rights. You may not be able to perform any sysop-restricted actions using it.' % self)
+ else:
+ # 'groups' does not exist, so set default rights
+ self._rights[index] = []
+ if self._isLoggedIn[index]:
+ # Logged in user
+ self._rights[index].append('user')
+ # Assume bot, and thus autoconfirmed
+ self._rights[index].extend(['bot', 'autoconfirmed'])
+ if sysop:
+ # Assume user reported as a sysop indeed has the sysop rights
+ self._rights[index].append('sysop')
+ # Assume the user has the default rights if the API did not report them
+ self._rights[index].extend(['read', 'createaccount', 'edit', 'upload', 'createpage', 'createtalk', 'move', 'upload'])
+ # remove duplicate rights
+ self._rights[index] = list(set(self._rights[index]))
+
+ # Get token
+ if 'preferencestoken' in text:
+ self._token[index] = text['preferencestoken']
+ if self._rights[index] is not None:
+ # Token and rights are loaded - user data is now loaded
+ self._userData[index] = True
+ elif self.versionnumber() < 14:
+ # uiprop 'preferencestoken' is available from 1.14; for 1.8-1.13 we need another way to get a token
+ params = {
+ 'action': 'query',
+ 'prop': 'info',
+ 'titles':'Non-existing page',
+ 'intoken': 'edit',
+ }
+ data = query.GetData(params, self, sysop=sysop)['query']['pages'].values()[0]
+ if 'edittoken' in data:
+ self._token[index] = data['edittoken']
+ self._userData[index] = True
+ else:
+ output(u'WARNING: Token not found on %s. You will not be able to edit any page.' % self)
+ else:
+ if not self._isBlocked[index]:
+ output(u'WARNING: Token not found on %s. You will not be able to edit any page.' % self)
+
+ def _getUserDataOld(self, text, sysop = False, force = True):
+ """
+ Get the user data from a wiki page data.
+
+ Parameters:
+ * text - the page text
+ * sysop - is the user a sysop?
+ """
+
+ index = self._userIndex(sysop)
+
+ if '<div id="globalWrapper">' not in text:
+ # Not a wiki page
+ return
+ # Check for blocks - but only if version is 1.11 (userinfo is available)
+ # and the user data was not yet loaded
+ if self.versionnumber() >= 11 and (not self._userData[index] or force):
+ blocked = self._getBlock(sysop = sysop)
+ if blocked and not self._isBlocked[index]:
+ # Write a warning if not shown earlier
+ if sysop:
+ account = 'Your sysop account'
+ else:
+ account = 'Your account'
+ output(u'WARNING: %s on %s is blocked. Editing using this account will stop the run.' % (account, self))
+ self._isBlocked[index] = blocked
+
+ # Check for new messages
+ if '<div class="usermessage">' in text:
+ if not self._messages[index]:
+ # User has *new* messages
+ if sysop:
+ output(u'NOTE: You have new messages in your sysop account on %s' % self)
+ else:
+ output(u'NOTE: You have new messages on %s' % self)
+ self._messages[index] = True
+ else:
+ self._messages[index] = False
+ # Don't perform other checks if the data was already loaded
+ if self._userData[index] and not force:
+ return
+
+ # Search for the user page link at the top.
+ # Note that the link of anonymous users (which doesn't exist at all
+ # in Wikimedia sites) has the ID pt-anonuserpage, and thus won't be
+ # found here.
+ userpageR = re.compile('<li id="pt-userpage".*?><a href=".+?".*?>(?P<username>.+?)</a></li>')
+ m = userpageR.search(text)
+ if m:
+ self._isLoggedIn[index] = True
+ self._userName[index] = m.group('username')
+ else:
+ self._isLoggedIn[index] = False
+ # We do not know the user name, and it is not important here
+ self._userName[index] = None
+
+ if self.family.name == 'wikitravel': # fix for Wikitravel's user page link.
+ self = self.family.user_page_link(self,index)
+
+ # Check user groups, if possible (introduced in 1.10)
+ groupsR = re.compile(r'var wgUserGroups = \[\"(.+)\"\];')
+ m = groupsR.search(text)
+ checkLocal = True
+ if default_code in self.family.cross_allowed: # if the current language is in the cross-allowed list, check the global bot flag.
+ globalgroupsR = re.compile(r'var wgGlobalGroups = \[\"(.+)\"\];')
+ mg = globalgroupsR.search(text)
+ if mg: # the account had global permission
+ globalRights = mg.group(1)
+ globalRights = globalRights.split('","')
+ self._rights[index] = globalRights
+ if self._isLoggedIn[index]:
+ if 'Global_bot' in globalRights: # This account has the global bot flag, no need to check local flags.
+ checkLocal = False
+ else:
+ output(u'Your bot account does not have the global bot flag, checking local flag.')
+ else:
+ if verbose: output(u'Note: this language does not allow global bots.')
+ if m and checkLocal:
+ rights = m.group(1)
+ rights = rights.split('", "')
+ if '*' in rights:
+ rights.remove('*')
+ self._rights[index] = rights
+ # Warnings
+ # Don't show warnings for not logged in users, they will just fail to
+ # do any action
+ if self._isLoggedIn[index]:
+ if 'bot' not in self._rights[index] and config.notify_unflagged_bot:
+ # Sysop + bot flag = Sysop flag in MediaWiki < 1.7.1?
+ if sysop:
+ output(u'Note: Your sysop account on %s does not have a bot flag. Its edits will be visible in the recent changes.' % self)
+ else:
+ output(u'WARNING: Your account on %s does not have a bot flag. Its edits will be visible in the recent changes and it may get blocked.' % self)
+ if sysop and 'sysop' not in self._rights[index]:
+ output(u'WARNING: Your sysop account on %s does not seem to have sysop rights. You may not be able to perform any sysop-restricted actions using it.' % self)
+ else:
+ # We don't have wgUserGroups, and can't check the rights
+ self._rights[index] = []
+ if self._isLoggedIn[index]:
+ # Logged in user
+ self._rights[index].append('user')
+ # Assume bot, and thus autoconfirmed
+ self._rights[index].extend(['bot', 'autoconfirmed'])
+ if sysop:
+ # Assume user reported as a sysop indeed has the sysop rights
+ self._rights[index].append('sysop')
+ # Assume the user has the default rights
+ self._rights[index].extend(['read', 'createaccount', 'edit', 'upload', 'createpage', 'createtalk', 'move', 'upload'])
+ if 'bot' in self._rights[index] or 'sysop' in self._rights[index]:
+ self._rights[index].append('apihighlimits')
+ if 'sysop' in self._rights[index]:
+ self._rights[index].extend(['delete', 'undelete', 'block', 'protect', 'import', 'deletedhistory', 'unwatchedpages'])
+
+ # Search for a token
+ tokenR = re.compile(r"\<input type='hidden' value=\"(.*?)\" name=\"wpEditToken\"")
+ tokenloc = tokenR.search(text)
+ if tokenloc:
+ self._token[index] = tokenloc.group(1)
+ if self._rights[index] is not None:
+ # In this case, token and rights are loaded - user data is now loaded
+ self._userData[index] = True
+ else:
+ # Token not found
+ # Possible reason for this is the user is blocked, don't show a
+ # warning in this case, otherwise do show a warning
+ # Another possible reason is that the page cannot be edited - ensure
+ # there is a textarea and the tab "view source" is not shown
+ if u'<textarea' in text and u'<li id="ca-viewsource"' not in text and not self._isBlocked[index]:
+ # Token not found
+ output(u'WARNING: Token not found on %s. You will not be able to edit any page.' % self)
+
+ def siteinfo(self, key = 'general', force = False, dump = False):
+ """Get MediaWiki site information via the API.
+
+ dump - if True, return all cached siteinfo data instead of one key.
+
+ Some siprop parameters return huge amounts of data and are slow for
+ MediaWiki to produce; in testing, such keys could only be fetched
+ one at a time.
+
+ """
+ # protection for key in other datatype
+ if type(key) not in [str, unicode]:
+ key = 'general'
+
+ if self._info and key in self._info and not force:
+ if dump:
+ return self._info
+ else:
+ return self._info[key]
+
+ params = {
+ 'action':'query',
+ 'meta':'siteinfo',
+ 'siprop':['general', 'namespaces', ],
+ }
+ #ver 1.10 handle
+ if self.versionnumber() > 10:
+ params['siprop'].extend(['statistics', ])
+ if key in ['specialpagealiases', 'interwikimap', 'namespacealiases', 'usergroups', ]:
+ if verbose: print 'getting huge siprop %s...' % key
+ params['siprop'] = [key]
+
+ #ver 1.13 handle
+ if self.versionnumber() > 13:
+ if key not in ['specialpagealiases', 'interwikimap', 'namespacealiases', 'usergroups', ]:
+ params['siprop'].extend(['fileextensions', 'rightsinfo', ])
+ if key in ['magicwords', 'extensions', ]:
+ if verbose: print 'getting huge siprop %s...' % key
+ params['siprop'] = [key]
+ try:
+ data = query.GetData(params, self)['query']
+ except NotImplementedError:
+ return None
+
+ if not hasattr(self, '_info'):
+ self._info = data
+ else:
+ if key == 'magicwords':
+ if self.versionnumber() <= 13:
+ return None #Not implemented
+ self._info[key]={}
+ for entry in data[key]:
+ self._info[key][entry['name']] = entry['aliases']
+ else:
+ for k, v in data.iteritems():
+ self._info[k] = v
+ #data pre-process
+ if dump:
+ return self._info
+ else:
+ return self._info.get(key)
+
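+ # Minimal usage sketch for siteinfo() (illustrative only):
+ #   general = site.siteinfo()                 # the cached 'general' block
+ #   namespaces = site.siteinfo('namespaces')  # another cached siprop key
+ #   everything = site.siteinfo(dump=True)     # the whole cached dict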
+ def mediawiki_message(self, key, forceReload = False):
+ """Return the MediaWiki message text for key "key" """
+ # Special:Allmessages is retrieved only once per created Site object
+ if (not self._mediawiki_messages) or forceReload:
+ api = self.has_api()
+ if verbose:
+ output(u"Retrieving mediawiki messages from Special:Allmessages")
+ # Only MediaWiki r27393/1.12 and higher support XML output for Special:Allmessages
+ if self.versionnumber() < 12:
+ usePHP = True
+ else:
+ usePHP = False
+ elementtree = True
+ try:
+ try:
+ from xml.etree.cElementTree import XML # 2.5
+ except ImportError:
+ try:
+ from cElementTree import XML
+ except ImportError:
+ from elementtree.ElementTree import XML
+ except ImportError:
+ if verbose:
+ output(u'Elementtree was not found, using BeautifulSoup instead')
+ elementtree = False
+
+ if config.use_diskcache and not api:
+ import diskcache
+ _dict = lambda x : diskcache.CachedReadOnlyDictI(x, prefix = "msg-%s-%s-" % (self.family.name, self.lang))
+ else:
+ _dict = dict
+
+ retry_idle_time = 1
+ retry_attempt = 0
+ while True:
+ if api and self.versionnumber() >= 12 or self.versionnumber() >= 16:
+ params = {
+ 'action': 'query',
+ 'meta': 'allmessages',
+ 'ammessages': key,
+ }
+ datas = query.GetData(params, self)['query']['allmessages'][0]
+ if "missing" in datas:
+ raise KeyError("message does not exist.")
+ elif datas['name'] not in self._mediawiki_messages:
+ self._mediawiki_messages[datas['name']] = datas['*']
+ #self._mediawiki_messages = _dict([(tag['name'].lower(), tag['*'])
+ # for tag in datas if not 'missing' in tag])
+ elif usePHP:
+ phppage = self.getUrl(self.get_address("Special:Allmessages") + "&ot=php")
+ Rphpvals = re.compile(r"(?ms)'([^']*)' => '(.*?[^\\])',")
+ # The previous regexp doesn't match empty messages. Fast workaround...
+ phppage = re.sub("(?m)^('.*?' =>) '',", r"\1 ' ',", phppage)
+ self._mediawiki_messages = _dict([(name.strip().lower(),
+ html2unicode(message.replace("\\'", "'")))
+ for (name, message) in Rphpvals.findall(phppage)])
+ else:
+ xml = self.getUrl(self.get_address("Special:Allmessages") + "&ot=xml")
+ # xml structure is :
+ # <messages lang="fr">
+ # <message name="about">À propos</message>
+ # ...
+ # </messages>
+ if elementtree:
+ decode = xml.encode(self.encoding())
+
+ # Skip extraneous data such as PHP warning or extra
+ # whitespaces added from some MediaWiki extensions
+ xml_dcl_pos = decode.find('<?xml')
+ if xml_dcl_pos > 0:
+ decode = decode[xml_dcl_pos:]
+
+ tree = XML(decode)
+ self._mediawiki_messages = _dict([(tag.get('name').lower(), tag.text)
+ for tag in tree.getiterator('message')])
+ else:
+ tree = BeautifulStoneSoup(xml)
+ self._mediawiki_messages = _dict([(tag.get('name').lower(), html2unicode(tag.string))
+ for tag in tree.findAll('message') if tag.string])
+
+ if not self._mediawiki_messages:
+ # No messages could be added.
+ # We assume that the server is down.
+ # Wait some time, then try again.
+ output(u'WARNING: No messages found in Special:Allmessages. Maybe the server is down. Retrying in %i minutes...' % retry_idle_time)
+ time.sleep(retry_idle_time * 60)
+ # Next time wait longer, but not longer than half an hour
+ retry_attempt += 1
+ if retry_attempt > config.maxretries:
+ raise ServerError()
+ retry_idle_time *= 2
+ if retry_idle_time > 30:
+ retry_idle_time = 30
+ continue
+ break
+
+ if self.family.name == 'wikitravel': # fix for Wikitravel's mediawiki message setting
+ self = self.family.mediawiki_message(self)
+
+ key = key.lower()
+ try:
+ return self._mediawiki_messages[key]
+ except KeyError:
+ if not forceReload:
+ return self.mediawiki_message(key, True)
+ else:
+ raise KeyError("MediaWiki key '%s' does not exist on %s" % (key, self))
+
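+ # Minimal usage sketch (illustrative; 'about' is a standard MediaWiki
+ # message key, the returned text depends on the wiki's language):
+ #   text = site.mediawiki_message('about')
+ #   site.mediawiki_message('no-such-key')   # raises KeyError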
+ def has_mediawiki_message(self, key):
+ """Return True if this site defines a MediaWiki message for 'key'."""
+ #return key in self._mediawiki_messages
+ try:
+ v = self.mediawiki_message(key)
+ return True
+ except KeyError:
+ return False
+
+ def has_api(self):
+ """Return True if this site's family provides an API interface."""
+ try:
+ if config.use_api:
+ x = self.apipath()
+ del x
+ return True
+ except NotImplementedError:
+ pass
+ return False
+
+ def _load(self, sysop = False, force = False):
+ """
+ Load user data.
+ This is only done when the information is requested and no page has
+ been fetched yet; otherwise we should already have this data.
+
+ Parameters:
+ * sysop - Get sysop user data?
+ """
+ index = self._userIndex(sysop)
+ if self._userData[index] and not force:
+ return
+ if verbose:
+ output(u'Getting information for site %s' % self)
+
+ # Get data
+ # API Userinfo is available from version 1.11
+ # preferencetoken available from 1.14
+ if self.has_api() and self.versionnumber() >= 11:
+ #Query userinfo
+ params = {
+ 'action': 'query',
+ 'meta': 'userinfo',
+ 'uiprop': ['blockinfo','groups','rights','hasmsg'],
+ }
+ if self.versionnumber() >= 12:
+ params['uiprop'].append('ratelimits')
+ if self.versionnumber() >= 14:
+ params['uiprop'].append('preferencestoken')
+
+ data = query.GetData(params, self, sysop=sysop)
+
+ # Show the API error code instead of raising an index error
+ if 'error' in data:
+ raise RuntimeError('%s' % data['error'])
+
+ if self.versionnumber() == 11:
+ text = data['userinfo']
+ else:
+ text = data['query']['userinfo']
+
+ self._getUserData(text, sysop = sysop, force = force)
+ else:
+ url = self.edit_address('Non-existing_page')
+ text = self.getUrl(url, sysop = sysop)
+
+ self._getUserDataOld(text, sysop = sysop, force = force)
+
+ def search(self, key, number=10, namespaces=None):
+ """
+ Yield search results for the given query.
+ Uses the API when use_api is enabled and the MediaWiki version is at
+ least 1.11; otherwise falls back to Special:Search.
+ """
+ if self.has_api() and self.versionnumber() >= 11:
+ #Yield search results (using api) for query.
+ params = {
+ 'action': 'query',
+ 'list': 'search',
+ 'srsearch': key,
+ }
+ if number:
+ params['srlimit'] = number
+ if namespaces:
+ params['srnamespace'] = namespaces
+
+ offset = 0
+ while offset < number or not number:
+ params['sroffset'] = offset
+ data = query.GetData(params, self)
+ if 'error'in data:
+ raise NotImplementedError('%s' % data['error']['info'])
+ data = data['query']
+ if 'error' in data:
+ raise RuntimeError('%s' % data['error'])
+ if not data['search']:
+ break
+ for s in data['search']:
+ offset += 1
+ page = Page(self, s['title'])
+ if self.versionnumber() >= 16:
+ yield page, s['snippet'], '', s['size'], s['wordcount'], s['timestamp']
+ else:
+ yield page, '', '', '', '', ''
+ else:
+ #Yield search results (using Special:Search page) for query.
+ throttle = True
+ path = self.search_address(urllib.quote_plus(key.encode('utf-8')),
+ n=number, ns=namespaces)
+ get_throttle()
+ html = self.getUrl(path)
+ entryR = re.compile(ur'<li><a href=".+?" title="(?P<title>.+?)">.+?</a>',
+ re.DOTALL)
+ for m in entryR.finditer(html):
+ page = Page(self, m.group('title'))
+ yield page, '', '', '', '', ''
+
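+ # Minimal usage sketch for the search() generator (illustrative only;
+ # on wikis older than 1.16 the extra fields are empty strings):
+ #   for page, snippet, _, size, wordcount, ts in site.search(u'example', number=5):
+ #       print page.title()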
+ # TODO: avoid code duplication for the following methods
+
+ def logpages(self, number = 50, mode = '', title = None, user = None, repeat = False,
+ namespace = [], start = None, end = None, tag = None, newer = False, dump = False):
+
+ if not self.has_api() or self.versionnumber() < 11 or \
+ mode not in ('block', 'protect', 'rights', 'delete', 'upload',
+ 'move', 'import', 'patrol', 'merge', 'suppress',
+ 'review', 'stable', 'gblblock', 'renameuser',
+ 'globalauth', 'gblrights', 'abusefilter', 'newusers'):
+ raise NotImplementedError, mode
+ params = {
+ 'action' : 'query',
+ 'list' : 'logevents',
+ 'letype' : mode,
+ 'lelimit' : int(number),
+ 'ledir' : 'older',
+ 'leprop' : ['ids', 'title', 'type', 'user', 'timestamp', 'comment', 'details',],
+ }
+
+ if number > config.special_page_limit:
+ params['lelimit'] = config.special_page_limit
+ if number > 5000 and self.isAllowed('apihighlimits'):
+ params['lelimit'] = 5000
+ if newer:
+ params['ledir'] = 'newer'
+ if user:
+ params['leuser'] = user
+ if title:
+ params['letitle'] = title
+ if start:
+ params['lestart'] = start
+ if end:
+ params['leend'] = end
+ if tag and self.versionnumber() >= 16: # tag support from mw:r58399
+ params['letag'] = tag
+
+ nbresults = 0
+ while True:
+ result = query.GetData(params, self)
+ if 'error' in result or 'warnings' in result:
+ output('%s' % result)
+ raise Error
+ for c in result['query']['logevents']:
+ if (not namespace or c['ns'] in namespace) and \
+ not 'actionhidden' in c.keys():
+ if dump:
+ # dump result only.
+ yield c
+ else:
+ if c['ns'] == 6:
+ p_ret = ImagePage(self, c['title'])
+ else:
+ p_ret = Page(self, c['title'], defaultNamespace=c['ns'])
+
+ yield (p_ret, c['user'],
+ parsetime2stamp(c['timestamp']),
+ c['comment'], )
+
+ nbresults += 1
+ if nbresults >= number:
+ break
+ if 'query-continue' in result and nbresults < number:
+ params['lestart'] = result['query-continue']['logevents']['lestart']
+ elif repeat:
+ nbresults = 0
+ try:
+ params.pop('lestart')
+ except KeyError:
+ pass
+ else:
+ break
+ return
+
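+ # Minimal usage sketch for logpages() (illustrative only; mode must be
+ # one of the log types listed above):
+ #   for page, user, timestamp, comment in site.logpages(number=20, mode='delete'):
+ #       print user, page.title()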
+ def newpages(self, number = 10, get_redirect = False, repeat = False, namespace = 0, rcshow = ['!bot','!redirect'], user = None, returndict = False):
+ """Yield new articles (as Page objects) from Special:Newpages.
+
+ Starts with the newest article and fetches the number of articles
+ specified in the first argument. If repeat is True, it fetches
+ Newpages again. If there is no new page, it blocks until there is
+ one, sleeping between subsequent fetches of Newpages.
+
+ The objects yielded depend on the parameter returndict.
+ When true, it yields a tuple composed of a Page object and a dict of attributes.
+ When false, it yields a tuple composed of the Page object,
+ timestamp (unicode), length (int), an empty unicode string, username
+ or IP address (str), comment (unicode).
+
+ """
+ # TODO: in recent MW versions Special:Newpages takes a namespace parameter,
+ # and defaults to 0 if not specified.
+ # TODO: Detection of unregistered users is broken
+ # TODO: Repeat mechanism doesn't make much sense as implemented;
+ # should use both offset and limit parameters, and have an
+ # option to fetch older rather than newer pages
+ seen = set()
+ while True:
+ if self.has_api() and self.versionnumber() >= 10:
+ params = {
+ 'action': 'query',
+ 'list': 'recentchanges',
+ 'rctype': 'new',
+ 'rcnamespace': namespace,
+ 'rclimit': int(number),
+ 'rcprop': ['ids','title','timestamp','sizes','user','comment'],
+ 'rcshow': rcshow,
+ }
+ if user: params['rcuser'] = user
+ data = query.GetData(params, self)['query']['recentchanges']
+
+ for np in data:
+ if np['pageid'] not in seen:
+ seen.add(np['pageid'])
+ page = Page(self, np['title'], defaultNamespace=np['ns'])
+ if returndict:
+ yield page, np
+ else:
+ yield page, np['timestamp'], np['newlen'], u'', np['user'], np['comment']
+ else:
+ path = self.newpages_address(n=number, namespace=namespace)
+ # The throttling is important here, so always enabled.
+ get_throttle()
+ html = self.getUrl(path)
+
+ entryR = re.compile('<li[^>]*>(?P<date>.+?) \S*?<a href=".+?"'
+ ' title="(?P<title>.+?)">.+?</a>.+?[\(\[](?P<length>[\d,.]+)[^\)\]]*[\)\]]'
+ ' .?<a href=".+?" title=".+?:(?P<username>.+?)">')
+ for m in entryR.finditer(html):
+ date = m.group('date')
+ title = m.group('title')
+ title = title.replace('"', '"')
+ length = int(re.sub("[,.]", "", m.group('length')))
+ loggedIn = u''
+ username = m.group('username')
+ comment = u''
+
+ if title not in seen:
+ seen.add(title)
+ page = Page(self, title)
+ yield page, date, length, loggedIn, username, comment
+ if not repeat:
+ break
+
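+ # Minimal usage sketch for newpages() with the default tuple form
+ # (returndict=False), illustrative only:
+ #   for page, timestamp, length, _, username, comment in site.newpages(number=5):
+ #       print page.title(), length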
+ def longpages(self, number = 10, repeat = False):
+ """Yield Pages from Special:Longpages.
+
+ Return values are a tuple of Page object, length(int).
+
+ """
+ #TODO: should use offset and limit parameters; 'repeat' as now
+ # implemented is fairly useless
+ # this comment applies to all the XXXXpages methods following, as well
+ seen = set()
+ path = self.longpages_address(n=number)
+ entryR = re.compile(ur'<li>\(<a href=".+?" title=".+?">.+?</a>\) .<a href=".+?" title="(?P<title>.+?)">.+?</a> .\[(?P<length>[\d.,]+).*?\]</li>', re.UNICODE)
+
+ while True:
+ get_throttle()
+ html = self.getUrl(path)
+ for m in entryR.finditer(html):
+ title = m.group('title')
+ length = int(re.sub('[.,]', '', m.group('length')))
+ if title not in seen:
+ seen.add(title)
+ page = Page(self, title)
+ yield page, length
+ if not repeat:
+ break
+
+ def shortpages(self, number = 10, repeat = False):
+ """Yield Pages and lengths from Special:Shortpages."""
+ throttle = True
+ seen = set()
+ path = self.shortpages_address(n = number)
+ entryR = re.compile(ur'<li>\(<a href=".+?" title=".+?">.+?</a>\) .<a href=".+?" title="(?P<title>.+?)">.+?</a> .\[(?P<length>[\d.,]+).*?\]</li>', re.UNICODE)
+
+ while True:
+ get_throttle()
+ html = self.getUrl(path)
+
+ for m in entryR.finditer(html):
+ title = m.group('title')
+ length = int(re.sub('[., ]', '', m.group('length')))
+
+ if title not in seen:
+ seen.add(title)
+ page = Page(self, title)
+ yield page, length
+ if not repeat:
+ break
+
+ def categories(self, number=10, repeat=False):
+ """Yield Category objects from Special:Categories"""
+ import catlib
+ seen = set()
+ while True:
+ path = self.categories_address(n=number)
+ get_throttle()
+ html = self.getUrl(path)
+ entryR = re.compile(
+ '<li><a href=".+?" title="(?P<title>.+?)">.+?</a>.*?</li>')
+ for m in entryR.finditer(html):
+ title = m.group('title')
+ if title not in seen:
+ seen.add(title)
+ page = catlib.Category(self, title)
+ yield page
+ if not repeat:
+ break
+
+ def deadendpages(self, number = 10, repeat = False):
+ """Yield Page objects retrieved from Special:Deadendpages."""
+ seen = set()
+ while True:
+ path = self.deadendpages_address(n=number)
+ get_throttle()
+ html = self.getUrl(path)
+ entryR = re.compile(
+ '<li><a href=".+?" title="(?P<title>.+?)">.+?</a></li>')
+ for m in entryR.finditer(html):
+ title = m.group('title')
+
+ if title not in seen:
+ seen.add(title)
+ page = Page(self, title)
+ yield page
+ if not repeat:
+ break
+
+ def ancientpages(self, number = 10, repeat = False):
+ """Yield Pages, datestamps from Special:Ancientpages."""
+ seen = set()
+ while True:
+ path = self.ancientpages_address(n=number)
+ get_throttle()
+ html = self.getUrl(path)
+ entryR = re.compile('<li><a href=".+?" title="(?P<title>.+?)">.+?</a> (?P<date>.+?)</li>')
+ for m in entryR.finditer(html):
+ title = m.group('title')
+ date = m.group('date')
+ if title not in seen:
+ seen.add(title)
+ page = Page(self, title)
+ yield page, date
+ if not repeat:
+ break
+
+ def lonelypages(self, number = 10, repeat = False):
+ """Yield Pages retrieved from Special:Lonelypages."""
+ throttle = True
+ seen = set()
+ while True:
+ path = self.lonelypages_address(n=number)
+ get_throttle()
+ html = self.getUrl(path)
+ entryR = re.compile(
+ '<li><a href=".+?" title="(?P<title>.+?)">.+?</a></li>')
+ for m in entryR.finditer(html):
+ title = m.group('title')
+
+ if title not in seen:
+ seen.add(title)
+ page = Page(self, title)
+ yield page
+ if not repeat:
+ break
+
+ def unwatchedpages(self, number = 10, repeat = False):
+ """Yield Pages from Special:Unwatchedpages (requires Admin privileges)."""
+ seen = set()
+ while True:
+ path = self.unwatchedpages_address(n=number)
+ get_throttle()
+ html = self.getUrl(path, sysop = True)
+ entryR = re.compile(
+ '<li><a href=".+?" title="(?P<title>.+?)">.+?</a>.+?</li>')
+ for m in entryR.finditer(html):
+ title = m.group('title')
+ if title not in seen:
+ seen.add(title)
+ page = Page(self, title)
+ yield page
+ if not repeat:
+ break
+
+ def uncategorizedcategories(self, number = 10, repeat = False):
+ """Yield Categories from Special:Uncategorizedcategories."""
+ import catlib
+ seen = set()
+ while True:
+ path = self.uncategorizedcategories_address(n=number)
+ get_throttle()
+ html = self.getUrl(path)
+ entryR = re.compile(
+ '<li><a href=".+?" title="(?P<title>.+?)">.+?</a></li>')
+ for m in entryR.finditer(html):
+ title = m.group('title')
+ if title not in seen:
+ seen.add(title)
+ page = catlib.Category(self, title)
+ yield page
+ if not repeat:
+ break
+
+ def newimages(self, number = 100, lestart = None, leend = None, leuser = None, letitle = None, repeat = False):
+ """
+ Yield ImagePages from APIs, call: action=query&list=logevents&letype=upload&lelimit=500
+
+ Options directly from APIs:
+ ---
+ Parameters:
+ leprop - Which properties to get.
+ Default: ids|title|type|user|timestamp|comment|details
+ lestart - The timestamp to start enumerating from.
+ leend - The timestamp to end enumerating.
+ ledir - In which direction to enumerate.
+ One value: newer, older
+ Default: older
+ leuser - Filter entries to those made by the given user.
+ letitle - Filter entries to those related to a page.
+ lelimit - How many total event entries to return.
+ No more than 500 (5000 for bots) allowed.
+ Default: 10
+ """
+
+ for o, u, t, c in self.logpages(number = number, mode = 'upload', title = letitle, user = leuser,
+ repeat = repeat, start = lestart, end = leend):
+ yield o, t, u, c
+ return
+
+ def recentchanges(self, number=100, rcstart=None, rcend=None, rcshow=None,
+ rcdir='older', rctype='edit|new', namespace=None,
+ includeredirects=True, repeat=False, user=None,
+ returndict=False):
+ """
+ Yield recent changes as Page objects
+ uses API call: action=query&list=recentchanges&rctype=edit|new&rclimit=500
+
+ Starts with the newest change and fetches the number of changes
+ specified in the first argument. If repeat is True, it fetches
+ again.
+
+ Options directly from APIs:
+ ---
+ Parameters:
+ rcstart - The timestamp to start enumerating from.
+ rcend - The timestamp to end enumerating.
+ rcdir - In which direction to enumerate.
+ One value: newer, older
+ Default: older
+ rcnamespace - Filter log entries to only this namespace(s)
+ Values (separate with '|'):
+ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
+ rcprop - Include additional pieces of information
+ Values (separate with '|'):
+ user, comment, flags, timestamp, title, ids, sizes,
+ redirect, patrolled, loginfo
+ Default: title|timestamp|ids
+ rcshow - Show only items that meet this criteria.
+ For example, to see only minor edits done by
+ logged-in users, set show=minor|!anon
+ Values (separate with '|'):
+ minor, !minor, bot, !bot, anon, !anon,
+ redirect, !redirect, patrolled, !patrolled
+ rclimit - How many total changes to return.
+ No more than 500 (5000 for bots) allowed.
+ Default: 10
+ rctype - Which types of changes to show.
+ Values (separate with '|'): edit, new, log
+
+ The objects yielded depend on the parameter returndict.
+ When true, it yields a tuple composed of a Page object and a dict of attributes.
+ When false, it yields a tuple composed of the Page object,
+ timestamp (unicode), length (int), a placeholder boolean (always True),
+ username or IP address (str), comment (unicode).
+
+ # TODO: Detection of unregistered users is broken
+ """
+ if rctype is None:
+ rctype = 'edit|new'
+ params = {
+ 'action' : 'query',
+ 'list' : 'recentchanges',
+ 'rcdir' : rcdir,
+ 'rctype' : rctype,
+ 'rcprop' : ['user', 'comment', 'timestamp', 'title', 'ids',
+ 'loginfo', 'sizes'], #', 'flags', 'redirect', 'patrolled'],
+ 'rcnamespace' : namespace,
+ 'rclimit' : int(number),
+ }
+ if user: params['rcuser'] = user
+ if rcstart: params['rcstart'] = rcstart
+ if rcend: params['rcend'] = rcend
+ if rcshow: params['rcshow'] = rcshow
+ if rctype: params['rctype'] = rctype
+
+ while True:
+ data = query.GetData(params, self, encodeTitle = False)
+ if 'error' in data:
+ raise RuntimeError('%s' % data['error'])
+ try:
+ rcData = data['query']['recentchanges']
+ except KeyError:
+ raise ServerError("The APIs don't return data, the site may be down")
+
+ for i in rcData:
+ page = Page(self, i['title'], defaultNamespace=i['ns'])
+ if returndict:
+ yield page, i
+ else:
+ comment = ''
+ if 'comment' in i:
+ comment = i['comment']
+ yield page, i['timestamp'], i['newlen'], True, i['user'], comment
+ if not repeat:
+ break
+
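+ # Illustrative usage sketch ('mysite' stands for an already-constructed
+ # Site object; the tuple layout follows the docstring above):
+ # for page, timestamp, length, flag, user, comment in mysite.recentchanges(number=10):
+ # output(u'%s edited %s at %s' % (user, page.title(), timestamp))
+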
+ def patrol(self, rcid, token = None):
+ if not self.has_api() or self.versionnumber() < 12:
+ raise Exception('patrol: no API: not implemented')
+
+ if not token:
+ token = self.getPatrolToken()
+
+ params = {
+ 'action': 'patrol',
+ 'rcid': rcid,
+ 'token': token,
+ }
+
+ result = query.GetData(params, self)
+ if 'error' in result:
+ raise RuntimeError("%s" % result['error'])
+
+ return True
+
+ def uncategorizedimages(self, number = 10, repeat = False):
+ """Yield ImagePages from Special:Uncategorizedimages."""
+ seen = set()
+ ns = self.image_namespace()
+ entryR = re.compile(
+ '<a href=".+?" title="(?P<title>%s:.+?)">.+?</a>' % ns)
+ while True:
+ path = self.uncategorizedimages_address(n=number)
+ get_throttle()
+ html = self.getUrl(path)
+ for m in entryR.finditer(html):
+ title = m.group('title')
+ if title not in seen:
+ seen.add(title)
+ page = ImagePage(self, title)
+ yield page
+ if not repeat:
+ break
+
+ def uncategorizedpages(self, number = 10, repeat = False):
+ """Yield Pages from Special:Uncategorizedpages."""
+ seen = set()
+ while True:
+ path = self.uncategorizedpages_address(n=number)
+ get_throttle()
+ html = self.getUrl(path)
+ entryR = re.compile(
+ '<li><a href=".+?" title="(?P<title>.+?)">.+?</a></li>')
+ for m in entryR.finditer(html):
+ title = m.group('title')
+
+ if title not in seen:
+ seen.add(title)
+ page = Page(self, title)
+ yield page
+ if not repeat:
+ break
+
+ def uncategorizedtemplates(self, number = 10, repeat = False):
+ """Yield Pages from Special:UncategorizedTemplates."""
+ seen = set()
+ while True:
+ path = self.uncategorizedtemplates_address(n=number)
+ get_throttle()
+ html = self.getUrl(path)
+ entryR = re.compile(
+ '<li><a href=".+?" title="(?P<title>.+?)">.+?</a></li>')
+ for m in entryR.finditer(html):
+ title = m.group('title')
+
+ if title not in seen:
+ seen.add(title)
+ page = Page(self, title)
+ yield page
+ if not repeat:
+ break
+
+ def unusedcategories(self, number = 10, repeat = False):
+ """Yield Category objects from Special:Unusedcategories."""
+ import catlib
+ seen = set()
+ while True:
+ path = self.unusedcategories_address(n=number)
+ get_throttle()
+ html = self.getUrl(path)
+ entryR = re.compile('<li><a href=".+?" title="(?P<title>.+?)">.+?</a></li>')
+ for m in entryR.finditer(html):
+ title = m.group('title')
+
+ if title not in seen:
+ seen.add(title)
+ page = catlib.Category(self, title)
+ yield page
+ if not repeat:
+ break
+
+ def wantedcategories(self, number=10, repeat=False):
+ """Yield Category objects from Special:wantedcategories."""
+ import catlib
+ seen = set()
+ while True:
+ path = self.wantedcategories_address(n=number)
+ get_throttle()
+ html = self.getUrl(path)
+ entryR = re.compile(
+ '<li><a href=".+?" class="new" title="(?P<title>.+?) \(page does not exist\)">.+?</a> .+?\)</li>')
+ for m in entryR.finditer(html):
+ title = m.group('title')
+
+ if title not in seen:
+ seen.add(title)
+ page = catlib.Category(self, title)
+ yield page
+ if not repeat:
+ break
+
+ def unusedfiles(self, number = 10, repeat = False, extension = None):
+ """Yield ImagePage objects from Special:Unusedimages."""
+ seen = set()
+ ns = self.image_namespace()
+ entryR = re.compile(
+ '<a href=".+?" title="(?P<title>%s:.+?)">.+?</a>' % ns)
+ while True:
+ path = self.unusedfiles_address(n=number)
+ get_throttle()
+ html = self.getUrl(path)
+ for m in entryR.finditer(html):
+ fileext = None
+ title = m.group('title')
+ if extension:
+ fileext = title[len(title)-3:]
+ if title not in seen and fileext == extension:
+ ## Check whether the media is used in a Proofread page
+ # code disabled because it slows this method down, and
+ # because it is unclear what it's supposed to do.
+ #basename = title[6:]
+ #page = Page(self, 'Page:' + basename)
+
+ #if not page.exists():
+ seen.add(title)
+ image = ImagePage(self, title)
+ yield image
+ if not repeat:
+ break
+
+ def withoutinterwiki(self, number=10, repeat=False):
+ """Yield Pages without language links from Special:Withoutinterwiki."""
+ seen = set()
+ while True:
+ path = self.withoutinterwiki_address(n=number)
+ get_throttle()
+ html = self.getUrl(path)
+ entryR = re.compile('<li><a href=".+?" title="(?P<title>.+?)">.+?</a></li>')
+ for m in entryR.finditer(html):
+ title = m.group('title')
+ if title not in seen:
+ seen.add(title)
+ page = Page(self, title)
+ yield page
+ if not repeat:
+ break
+
+ def randompage(self, redirect = False):
+ if self.has_api() and self.versionnumber() >= 12:
+ params = {
+ 'action': 'query',
+ 'list': 'random',
+ #'rnnamespace': '0',
+ 'rnlimit': '1',
+ #'': '',
+ }
+ if redirect:
+ params['rnredirect'] = 1
+
+ data = query.GetData(params, self)
+ return Page(self, data['query']['random'][0]['title'])
+ else:
+ if redirect:
+ """Yield random redirect page via Special:RandomRedirect."""
+ html = self.getUrl(self.randomredirect_address())
+ else:
+ """Yield random page via Special:Random"""
+ html = self.getUrl(self.random_address())
+ m = re.search('var wgPageName = "(?P<title>.+?)";', html)
+ if m is not None:
+ return Page(self, m.group('title'))
+
+ def randomredirectpage(self):
+ return self.randompage(redirect = True)
+
+ def allpages(self, start='!', namespace=None, includeredirects=True,
+ throttle=True):
+ """
+ Yield all Pages in alphabetical order.
+
+ Parameters:
+ start Start at this page. By default, it starts at '!', and yields
+ all pages.
+ namespace Yield all pages in this namespace; defaults to the
+ namespace of 'start' (0 unless 'start' carries a prefix).
+ MediaWiki software will only return pages in one namespace
+ at a time.
+
+ If includeredirects is False, redirects will not be found.
+
+ It is advised not to use this directly, but to use the
+ AllpagesPageGenerator from pagegenerators.py instead.
+
+ """
+ if namespace is None:
+ page = Page(self, start)
+ namespace = page.namespace()
+ start = page.title(withNamespace=False)
+
+ if not self.has_api():
+ for page in self._allpagesOld(start, namespace, includeredirects, throttle):
+ yield page
+ return
+
+ params = {
+ 'action' : 'query',
+ 'list' : 'allpages',
+ 'aplimit' : config.special_page_limit,
+ 'apnamespace': namespace,
+ 'apfrom' : start
+ }
+
+ if not includeredirects:
+ params['apfilterredir'] = 'nonredirects'
+ elif includeredirects == 'only':
+ params['apfilterredir'] = 'redirects'
+
+ while True:
+ if throttle:
+ get_throttle()
+ data = query.GetData(params, self)
+ if verbose:
+ print 'DEBUG allpages>>> data.keys()', data.keys()
+ if 'warnings' in data:
+ warning = data['warnings']['allpages']['*']
+ raise RuntimeError("API query warning: %s" % warning)
+ if 'error' in data:
+ raise RuntimeError("API query error: %s" % data)
+ if not 'allpages' in data['query']:
+ raise RuntimeError("API query error, no pages found: %s" % data)
+ count = 0
+ for p in data['query']['allpages']:
+ count += 1
+ yield Page(self, p['title'])
+ if count >= config.special_page_limit:
+ break
+ if 'query-continue' in data and count < params['aplimit']:
+ # get the continue key for backward compatibility with pre 1.20wmf8
+ contKey = data['query-continue']['allpages'].keys()[0]
+ params[contKey] = data['query-continue']['allpages'][contKey]
+ else:
+ break
+
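+ # Illustrative usage sketch ('mysite' stands for a Site object; real
+ # scripts should prefer pagegenerators.AllpagesPageGenerator, as advised
+ # in the docstring above):
+ # for page in mysite.allpages(start='!', namespace=0, includeredirects=False):
+ # output(page.title())
+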
+ def _allpagesOld(self, start='!', namespace=0, includeredirects=True,
+ throttle=True):
+ """
+ Yield all Pages from Special:Allpages.
+
+ This method doesn't work with MediaWiki 1.14 because of a change to
+ Special:Allpages. It is only left here for compatibility with older
+ MediaWiki versions, which don't support the API.
+
+ Parameters:
+ start Start at this page. By default, it starts at '!', and yields
+ all pages.
+ namespace Yield all pages in this namespace; defaults to 0.
+ MediaWiki software will only return pages in one namespace
+ at a time.
+
+ If includeredirects is False, redirects will not be found.
+ If includeredirects equals the string 'only', only redirects
+ will be found. Note that this has not been tested on older
+ versions of the MediaWiki code.
+
+ It is advised not to use this directly, but to use the
+ AllpagesPageGenerator from pagegenerators.py instead.
+
+ """
+ monobook_error = True
+ if start == '':
+ start='!'
+
+ while True:
+ # encode Non-ASCII characters in hexadecimal format (e.g. %F6)
+ start = start.encode(self.encoding())
+ start = urllib.quote(start)
+ # load a list which contains a series of article names (always 480)
+ path = self.allpages_address(start, namespace)
+ output(u'Retrieving Allpages special page for %s from %s, namespace %i' % (repr(self), start, namespace))
+ returned_html = self.getUrl(path)
+ # Try to find begin and end markers
+ try:
+ # In 1.4, another table was added above the navigational links
+ if self.versionnumber() >= 4:
+ begin_s = '</table><hr /><table'
+ end_s = '</table'
+ else:
+ begin_s = '<table'
+ end_s = '</table'
+ ibegin = returned_html.index(begin_s)
+ iend = returned_html.index(end_s,ibegin + 3)
+ except ValueError:
+ if monobook_error:
+ raise ServerError("Couldn't extract allpages special page. Make sure you're using MonoBook skin.")
+ else:
+ # No list of wikilinks
+ break
+ monobook_error = False
+ # remove the irrelevant sections
+ returned_html = returned_html[ibegin:iend]
+ if self.versionnumber()==2:
+ R = re.compile('/wiki/(.*?)\" *class=[\'\"]printable')
+ elif self.versionnumber()<5:
+ # Apparently the special code for redirects was added in 1.5
+ R = re.compile('title ?=\"(.*?)\"')
+ elif not includeredirects:
+ R = re.compile('\<td(?: width="33%")?\>\<a href=\"\S*\" +title ?="(.*?)"')
+ elif includeredirects == 'only':
+ R = re.compile('\<td(?: width="33%")?>\<[^\<\>]*allpagesredirect\"\>\<a href=\"\S*\" +title ?="(.*?)"')
+ else:
+ R = re.compile('title ?=\"(.*?)\"')
+ # Count the number of useful links on this page
+ n = 0
+ for hit in R.findall(returned_html):
+ # count how many articles we found on the current page
+ n = n + 1
+ if self.versionnumber()==2:
+ yield Page(self, url2link(hit, site = self, insite = self))
+ else:
+ yield Page(self, hit)
+ # save the last hit, so that we know where to continue when we
+ # finished all articles on the current page. Append a '!' so that
+ # we don't yield a page twice.
+ start = Page(self, hit).title(withNamespace=False) + '!'
+ # A small shortcut: if there are less than 100 pages listed on this
+ # page, there is certainly no next. Probably 480 would do as well,
+ # but better be safe than sorry.
+ if n < 100:
+ if (not includeredirects) or includeredirects == 'only':
+ # Maybe there were only so few because the rest is or is not a redirect
+ R = re.compile('title ?=\"(.*?)\"')
+ allLinks = R.findall(returned_html)
+ if len(allLinks) < 100:
+ break
+ elif n == 0:
+ # In this special case, no pages of the requested type
+ # were found, and "start" will remain and be double-encoded.
+ # Use the last page as the start of the next page.
+ start = Page(self,
+ allLinks[-1]).title(
+ withNamespace=False) + '!'
+ else:
+ break
+ #else:
+ # # Don't send a new request if "Next page (pagename)" isn't present
+ # Rnonext = re.compile(r'title="(Special|%s):.+?">%s</a></td></tr></table>' % (
+ # self.mediawiki_message('nstab-special'),
+ # re.escape(self.mediawiki_message('nextpage')).replace('\$1', '.*?')))
+ # if not Rnonext.search(full_returned_html):
+ # break
+
+ def prefixindex(self, prefix, namespace=0, includeredirects=True):
+ """Yield all pages with a given prefix.
+
+ Parameters:
+ prefix The prefix of the pages.
+ namespace Namespace number; defaults to 0.
+ MediaWiki software will only return pages in one namespace
+ at a time.
+
+ If includeredirects is False, redirects will not be found.
+ If includeredirects equals the string 'only', only redirects
+ will be found. Note that this has not been tested on older
+ versions of the MediaWiki code.
+
+ It is advised not to use this directly, but to use the
+ PrefixingPageGenerator from pagegenerators.py instead.
+ """
+ for page in self.allpages(start = prefix, namespace = namespace, includeredirects = includeredirects):
+ if page.title(withNamespace=False).startswith(prefix):
+ yield page
+ else:
+ break
+
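+ # Illustrative usage sketch ('mysite' stands for a Site object and the
+ # prefix is an arbitrary example):
+ # for page in mysite.prefixindex(u'List of', namespace=0):
+ # output(page.title())
+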
+ def protectedpages(self, namespace = None, type = 'edit', lvl = 0):
+ """ Yield all the protected pages, using Special:ProtectedPages
+ * namespace is a namespace number
+ * type can be 'edit' or 'move'
+ * lvl : protection level, can be 0, 'autoconfirmed', or 'sysop'
+ """
+ # Build the URL in separate steps to avoid encoding problems.
+ url = self.protectedpages_address()
+ url += '&type=%s&level=%s' % (type, lvl)
+ if namespace is not None: # note: "if namespace:" looks simpler, but is False when namespace == 0
+ url += '&namespace=%s' % namespace
+ parser_text = self.getUrl(url)
+ while 1:
+ #<li><a href="/wiki/Pagina_principale" title="Pagina principale">Pagina principale</a> <small>(6.522 byte)</small> (protetta)</li>
+ m = re.findall(r'<li><a href=".*?" title=".*?">(.*?)</a>.*?<small>\((.*?)\)</small>.*?\((.*?)\)</li>', parser_text)
+ for data in m:
+ title = data[0]
+ size = data[1]
+ status = data[2]
+ yield Page(self, title)
+ nextpage = re.findall(r'<.ul>\(.*?\).*?\(.*?\).*?\(<a href="(.*?)".*?</a>\) +?\(<a href=', parser_text)
+ if nextpage != []:
+ parser_text = self.getUrl(nextpage[0].replace('&amp;', '&'))
+ continue
+ else:
+ break
+
+ def linksearch(self, siteurl, limit=500):
+ """Yield Pages from results of Special:Linksearch for 'siteurl'."""
+ cache = []
+ R = re.compile('title ?=\"([^<>]*?)\">[^<>]*</a></li>')
+ urlsToRetrieve = [siteurl]
+ if not siteurl.startswith('*.'):
+ urlsToRetrieve.append('*.' + siteurl)
+
+ if self.has_api() and self.versionnumber() >= 11:
+ output(u'Querying API exturlusage...')
+ for url in urlsToRetrieve:
+ params = {
+ 'action': 'query',
+ 'list' : 'exturlusage',
+ 'eulimit': limit,
+ 'euquery': url,
+ }
+ count = 0
+ while True:
+ data = query.GetData(params, self)
+ if data['query']['exturlusage'] == []:
+ break
+ for pages in data['query']['exturlusage']:
+ count += 1
+ if not siteurl in pages['title']:
+ # the links themselves have similar form
+ if pages['pageid'] not in cache:
+ cache.append(pages['pageid'])
+ yield Page(self, pages['title'], defaultNamespace=pages['ns'])
+ if count >= limit:
+ break
+
+ if 'query-continue' in data and count < limit:
+ params['euoffset'] = data[u'query-continue'][u'exturlusage'][u'euoffset']
+ else:
+ break
+ else:
+ output(u'Querying [[Special:Linksearch]]...')
+ for url in urlsToRetrieve:
+ offset = 0
+ while True:
+ path = self.linksearch_address(url, limit=limit, offset=offset)
+ get_throttle()
+ html = self.getUrl(path)
+ #restricting the HTML source :
+ #when in the source, this div marks the beginning of the input
+ loc = html.find('<div class="mw-spcontent">')
+ if loc > -1:
+ html = html[loc:]
+ #when in the source, marks the end of the linklist
+ loc = html.find('<div class="printfooter">')
+ if loc > -1:
+ html = html[:loc]
+
+ #our regex fetches internal page links and the link they contain
+ links = R.findall(html)
+ if not links:
+ #no more page to be fetched for that link
+ break
+ for title in links:
+ if not siteurl in title:
+ # the links themselves have similar form
+ if title in cache:
+ continue
+ else:
+ cache.append(title)
+ yield Page(self, title)
+ offset += limit
+
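+ # Illustrative usage sketch ('mysite' and the target domain are
+ # placeholders):
+ # for page in mysite.linksearch('example.org', limit=100):
+ # output(u'%s links to example.org' % page.title())
+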
+ def linkto(self, title, othersite = None):
+ """Return unicode string in the form of a wikilink to 'title'
+
+ Use optional Site argument 'othersite' to generate an interwiki link
+ from the other site to the current site.
+
+ """
+ if othersite and othersite.lang != self.lang:
+ return u'[[%s:%s]]' % (self.lang, title)
+ else:
+ return u'[[%s]]' % title
+
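+ # Illustrative examples ('en_site' and 'de_site' stand for Site objects
+ # of an English and a German wiki):
+ # en_site.linkto(u'Foo') -> u'[[Foo]]'
+ # en_site.linkto(u'Foo', othersite=de_site) -> u'[[en:Foo]]'
+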
+ def isInterwikiLink(self, s):
+ """Return True if s is in the form of an interwiki link.
+
+ Interwiki links have the form "foo:bar" or ":foo:bar" where foo is a
+ known language code or family. Called recursively if the first part
+ of the link refers to this site's own family and/or language.
+
+ """
+ s = s.replace("_", " ").strip(" ").lstrip(":")
+ if not ':' in s:
+ return False
+ first, rest = s.split(':',1)
+ # interwiki codes are case-insensitive
+ first = first.lower().strip(" ")
+ # commons: forwards interlanguage links to wikipedia:, etc.
+ if self.family.interwiki_forward:
+ interlangTargetFamily = Family(self.family.interwiki_forward)
+ else:
+ interlangTargetFamily = self.family
+ if self.getNamespaceIndex(first):
+ return False
+ if first in interlangTargetFamily.langs:
+ if first == self.lang:
+ return self.isInterwikiLink(rest)
+ else:
+ return True
+ if first in self.family.get_known_families(site = self):
+ if first == self.family.name:
+ return self.isInterwikiLink(rest)
+ else:
+ return True
+ return False
+
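+ # Illustrative examples ('en_site' stands for an English-language Site
+ # object on a family without interwiki forwarding):
+ # en_site.isInterwikiLink(u'de:Foo') -> True (known language code)
+ # en_site.isInterwikiLink(u'en:Foo') -> False ('Foo' alone is no interwiki link)
+ # en_site.isInterwikiLink(u'Foo') -> False (no colon at all)
+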
+ def getmagicwords(self, word):
+ """Return list of localized "word" magic words for the site."""
+ if self.versionnumber() <= 13:
+ raise NotImplementedError
+ return self.siteinfo('magicwords').get(word)
+
+ def redirectRegex(self):
+ """Return a compiled regular expression matching on redirect pages.
+
+ Group 1 in the regex match object will be the target title.
+
+ """
+ #NOTE: this is needed, since the API can give false positives!
+ default = 'REDIRECT'
+ keywords = self.versionnumber() > 13 and self.getmagicwords('redirect')
+ if keywords:
+ pattern = r'(?:' + '|'.join(keywords) + ')'
+ else:
+ # no localized keyword for redirects
+ pattern = r'#%s' % default
+ if self.versionnumber() > 12:
+ # in MW 1.13 (at least) a redirect directive can follow whitespace
+ prefix = r'\s*'
+ else:
+ prefix = r'[\r\n]*'
+ # A redirect starts with hash (#), followed by a keyword, then
+ # arbitrary stuff, then a wikilink. The wikilink may contain
+ # a label, although this is not useful.
+ return re.compile(prefix + pattern
+ + '\s*:?\s*\[\[(.+?)(?:\|.*?)?\]\]',
+ re.IGNORECASE | re.UNICODE | re.DOTALL)
+
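+ # Illustrative usage sketch ('mysite' stands for a Site object):
+ # m = mysite.redirectRegex().match(u'#REDIRECT [[Target page]]')
+ # if m:
+ # target = m.group(1) # u'Target page'
+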
+ def pagenamecodes(self, default=True):
+ """Return list of localized PAGENAME tags for the site."""
+ return self.versionnumber() > 13 and self.getmagicwords('pagename') \
+ or u'PAGENAME'
+
+ def pagename2codes(self, default=True):
+ """Return list of localized PAGENAMEE tags for the site."""
+ return self.versionnumber() > 13 and self.getmagicwords('pagenamee') \
+ or u'PAGENAMEE'
+
+ def resolvemagicwords(self, wikitext):
+ """Replace the {{ns:xx}} marks in a wikitext with the namespace names"""
+
+ defaults = []
+ for namespace in self.family.namespaces.itervalues():
+ value = namespace.get('_default', None)
+ if value:
+ if isinstance(value, list):
+ defaults.append(value[0])
+ else:
+ defaults.append(value)
+
+ named = re.compile(u'{{ns:(' + '|'.join(defaults) + ')}}', re.I)
+
+ def replacenamed(match):
+ return self.normalizeNamespace(match.group(1))
+
+ wikitext = named.sub(replacenamed, wikitext)
+
+ numbered = re.compile('{{ns:(-?\d{1,2})}}', re.I)
+
+ def replacenumbered(match):
+ return self.namespace(int(match.group(1)))
+
+ return numbered.sub(replacenumbered, wikitext)
+
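+ # Illustrative usage sketch ('mysite' stands for a Site object):
+ # mysite.resolvemagicwords(u'{{ns:10}}:Infobox') replaces the numbered mark
+ # with the local Template namespace name, e.g. u'Template:Infobox' on an
+ # English-language wiki.
+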
+ # The following methods are for convenience, so that you can access
+ # methods of the Family class easier.
+ def encoding(self):
+ """Return the current encoding for this site."""
+ return self.family.code2encoding(self.lang)
+
+ def encodings(self):
+ """Return a list of all historical encodings for this site."""
+ return self.family.code2encodings(self.lang)
+
+ def category_namespace(self):
+ """Return the canonical name of the Category namespace on this site."""
+ # equivalent to self.namespace(14)?
+ return self.family.category_namespace(self.lang)
+
+ def category_namespaces(self):
+ """Return a list of all valid names for the Category namespace."""
+ return self.family.category_namespaces(self.lang)
+
+ def category_redirects(self):
+ return self.family.category_redirects(self.lang)
+
+ def image_namespace(self, fallback = '_default'):
+ """Return the canonical name of the Image namespace on this site."""
+ # equivalent to self.namespace(6)?
+ return self.family.image_namespace(self.lang, fallback)
+
+ def template_namespace(self, fallback = '_default'):
+ """Return the canonical name of the Template namespace on this site."""
+ # equivalent to self.namespace(10)?
+ return self.family.template_namespace(self.lang, fallback)
+
+ def export_address(self):
+ """Return URL path for Special:Export."""
+ return self.family.export_address(self.lang)
+
+ def query_address(self):
+ """Return URL path + '?' for query.php (if enabled on this Site)."""
+ return self.family.query_address(self.lang)
+
+ def api_address(self):
+ """Return URL path + '?' for api.php (if enabled on this Site)."""
+ return self.family.api_address(self.lang)
+
+ def apipath(self):
+ """Return URL path for api.php (if enabled on this Site)."""
+ return self.family.apipath(self.lang)
+
+ def scriptpath(self):
+ """Return URL prefix for scripts on this site ({{SCRIPTPATH}} value)"""
+ return self.family.scriptpath(self.lang)
+
+ def protocol(self):
+ """Return protocol ('http' or 'https') for access to this site."""
+ return self.family.protocol(self.lang)
+
+ def hostname(self):
+ """Return host portion of site URL."""
+ return self.family.hostname(self.lang)
+
+ def path(self):
+ """Return URL path for index.php on this Site."""
+ return self.family.path(self.lang)
+
+ def dbName(self):
+ """Return MySQL database name."""
+ return self.family.dbName(self.lang)
+
+ def move_address(self):
+ """Return URL path for Special:Movepage."""
+ return self.family.move_address(self.lang)
+
+ def delete_address(self, s):
+ """Return URL path to delete title 's'."""
+ return self.family.delete_address(self.lang, s)
+
+ def undelete_view_address(self, s, ts=''):
+ """Return URL path to view Special:Undelete for title 's'
+
+ Optional argument 'ts' returns path to view specific deleted version.
+
+ """
+ return self.family.undelete_view_address(self.lang, s, ts)
+
+ def undelete_address(self):
+ """Return URL path to Special:Undelete."""
+ return self.family.undelete_address(self.lang)
+
+ def protect_address(self, s):
+ """Return URL path to protect title 's'."""
+ return self.family.protect_address(self.lang, s)
+
+ def unprotect_address(self, s):
+ """Return URL path to unprotect title 's'."""
+ return self.family.unprotect_address(self.lang, s)
+
+ def put_address(self, s):
+ """Return URL path to submit revision to page titled 's'."""
+ return self.family.put_address(self.lang, s)
+
+ def get_address(self, s):
+ """Return URL path to retrieve page titled 's'."""
+ title = s.replace(' ', '_')
+ return self.family.get_address(self.lang, title)
+
+ def nice_get_address(self, s):
+ """Return shorter URL path to retrieve page titled 's'."""
+ return self.family.nice_get_address(self.lang, s)
+
+ def edit_address(self, s):
+ """Return URL path for edit form for page titled 's'."""
+ return self.family.edit_address(self.lang, s)
+
+ def watch_address(self, s):
+ """Return URL path for watching the titled 's'."""
+ return self.family.watch_address(self.lang, s)
+
+ def unwatch_address(self, s):
+ """Return URL path for unwatching the titled 's'."""
+ return self.family.unwatch_address(self.lang, s)
+
+ def purge_address(self, s):
+ """Return URL path to purge cache and retrieve page 's'."""
+ return self.family.purge_address(self.lang, s)
+
+ def block_address(self):
+ """Return path to block an IP address."""
+ return self.family.block_address(self.lang)
+
+ def unblock_address(self):
+ """Return path to unblock an IP address."""
+ return self.family.unblock_address(self.lang)
+
+ def blocksearch_address(self, s):
+ """Return path to search for blocks on IP address 's'."""
+ return self.family.blocksearch_address(self.lang, s)
+
+ def linksearch_address(self, s, limit=500, offset=0):
+ """Return path to Special:Linksearch for target 's'."""
+ return self.family.linksearch_address(self.lang, s, limit=limit, offset=offset)
+
+ def search_address(self, q, n=50, ns=0):
+ """Return path to Special:Search for query 'q'."""
+ return self.family.search_address(self.lang, q, n, ns)
+
+ def allpages_address(self, s, ns = 0):
+ """Return path to Special:Allpages."""
+ return self.family.allpages_address(self.lang, start=s, namespace = ns)
+
+ def log_address(self, n=50, mode = '', user = ''):
+ """Return path to Special:Log."""
+ return self.family.log_address(self.lang, n, mode, user)
+
+ def newpages_address(self, n=50, namespace=0):
+ """Return path to Special:Newpages."""
+ return self.family.newpages_address(self.lang, n, namespace)
+
+ def longpages_address(self, n=500):
+ """Return path to Special:Longpages."""
+ return self.family.longpages_address(self.lang, n)
+
+ def shortpages_address(self, n=500):
+ """Return path to Special:Shortpages."""
+ return self.family.shortpages_address(self.lang, n)
+
+ def unusedfiles_address(self, n=500):
+ """Return path to Special:Unusedimages."""
+ return self.family.unusedfiles_address(self.lang, n)
+
+ def categories_address(self, n=500):
+ """Return path to Special:Categories."""
+ return self.family.categories_address(self.lang, n)
+
+ def deadendpages_address(self, n=500):
+ """Return path to Special:Deadendpages."""
+ return self.family.deadendpages_address(self.lang, n)
+
+ def ancientpages_address(self, n=500):
+ """Return path to Special:Ancientpages."""
+ return self.family.ancientpages_address(self.lang, n)
+
+ def lonelypages_address(self, n=500):
+ """Return path to Special:Lonelypages."""
+ return self.family.lonelypages_address(self.lang, n)
+
+ def protectedpages_address(self, n=500):
+ """Return path to Special:ProtectedPages"""
+ return self.family.protectedpages_address(self.lang, n)
+
+ def unwatchedpages_address(self, n=500):
+ """Return path to Special:Unwatchedpages."""
+ return self.family.unwatchedpages_address(self.lang, n)
+
+ def uncategorizedcategories_address(self, n=500):
+ """Return path to Special:Uncategorizedcategories."""
+ return self.family.uncategorizedcategories_address(self.lang, n)
+
+ def uncategorizedimages_address(self, n=500):
+ """Return path to Special:Uncategorizedimages."""
+ return self.family.uncategorizedimages_address(self.lang, n)
+
+ def uncategorizedpages_address(self, n=500):
+ """Return path to Special:Uncategorizedpages."""
+ return self.family.uncategorizedpages_address(self.lang, n)
+
+ def uncategorizedtemplates_address(self, n=500):
+ """Return path to Special:Uncategorizedpages."""
+ return self.family.uncategorizedtemplates_address(self.lang, n)
+
+ def unusedcategories_address(self, n=500):
+ """Return path to Special:Unusedcategories."""
+ return self.family.unusedcategories_address(self.lang, n)
+
+ def wantedcategories_address(self, n=500):
+ """Return path to Special:Wantedcategories."""
+ return self.family.wantedcategories_address(self.lang, n)
+
+ def withoutinterwiki_address(self, n=500):
+ """Return path to Special:Withoutinterwiki."""
+ return self.family.withoutinterwiki_address(self.lang, n)
+
+ def references_address(self, s):
+ """Return path to Special:Whatlinksere for page 's'."""
+ return self.family.references_address(self.lang, s)
+
+ def allmessages_address(self):
+ """Return path to Special:Allmessages."""
+ return self.family.allmessages_address(self.lang)
+
+ def upload_address(self):
+ """Return path to Special:Upload."""
+ return self.family.upload_address(self.lang)
+
+ def double_redirects_address(self, default_limit = True):
+ """Return path to Special:Doubleredirects."""
+ return self.family.double_redirects_address(self.lang, default_limit)
+
+ def broken_redirects_address(self, default_limit = True):
+ """Return path to Special:Brokenredirects."""
+ return self.family.broken_redirects_address(self.lang, default_limit)
+
+ def random_address(self):
+ """Return path to Special:Random."""
+ return self.family.random_address(self.lang)
+
+ def randomredirect_address(self):
+ """Return path to Special:RandomRedirect."""
+ return self.family.randomredirect_address(self.lang)
+
+ def login_address(self):
+ """Return path to Special:Userlogin."""
+ return self.family.login_address(self.lang)
+
+ def captcha_image_address(self, id):
+ """Return path to Special:Captcha for image 'id'."""
+ return self.family.captcha_image_address(self.lang, id)
+
+ def watchlist_address(self):
+ """Return path to Special:Watchlist editor."""
+ return self.family.watchlist_address(self.lang)
+
+ def contribs_address(self, target, limit=500, offset=''):
+ """Return path to Special:Contributions for user 'target'."""
+ return self.family.contribs_address(self.lang,target,limit,offset)
+
+ def globalusers_address(self, target='', limit=500, offset='', group=''):
+ """Return path to Special:GlobalUsers for user 'target' and/or group 'group'."""
+ return self.family.globalusers_address(self.lang, target, limit, offset, group)
+
+ def version(self):
+ """Return MediaWiki version number as a string."""
+ return self.family.version(self.lang)
+
+ def versionnumber(self):
+ """Return an int identifying MediaWiki version.
+
+ Currently this is implemented as returning the minor version
+ number; i.e., 'X' in version '1.X.Y'
+
+ """
+ return self.family.versionnumber(self.lang)
+
+ def live_version(self):
+ """Return the 'real' version number found on [[Special:Version]]
+
+ Return value is a tuple (int, int, str) of the major and minor
+ version numbers and any other text contained in the version.
+
+ """
+ global htmldata
+ if not hasattr(self, "_mw_version"):
+ PATTERN = r"^(?:: )?([0-9]+)\.([0-9]+)(.*)$"
+ versionpage = self.getUrl(self.get_address("Special:Version"))
+ htmldata = BeautifulSoup(versionpage, convertEntities="html")
+ # try to find the live version
+ versionlist = []
+ # 1st try is for mw < 1.17wmf1
+ versionlist.append(lambda: htmldata.findAll(
+ text="MediaWiki")[1].parent.nextSibling )
+ # 2nd try is for mw >=1.17wmf1
+ versionlist.append(lambda: htmldata.body.table.findAll(
+ 'td')[1].contents[0] )
+ # 3rd uses family file which is not live
+ versionlist.append(lambda: self.family.version(self.lang) )
+ for versionfunc in versionlist:
+ try:
+ versionstring = versionfunc()
+ except:
+ continue
+ m = re.match(PATTERN, str(versionstring).strip())
+ if m:
+ break
+ else:
+ raise Error(u'Cannot find any live version!')
+ self._mw_version = (int(m.group(1)), int(m.group(2)), m.group(3))
+ return self._mw_version
+
+ def checkCharset(self, charset):
+ """Warn if charset returned by wiki doesn't match family file."""
+ fromFamily = self.encoding()
+ assert fromFamily.lower() == charset.lower(), \
+ "charset for %s changed from %s to %s" \
+ % (repr(self), fromFamily, charset)
+ if fromFamily.lower() != charset.lower():
+ raise ValueError(
+"code2encodings has wrong charset for %s. It should be %s, but is %s"
+ % (repr(self), charset, self.encoding()))
+
+ def shared_image_repository(self):
+ """Return a tuple of image repositories used by this site."""
+ return self.family.shared_image_repository(self.lang)
+
+ def category_on_one_line(self):
+ """Return True if this site wants all category links on one line."""
+ return self.lang in self.family.category_on_one_line
+
+ def interwiki_putfirst(self):
+ """Return list of language codes for ordering of interwiki links."""
+ return self.family.interwiki_putfirst.get(self.lang, None)
+
+ def interwiki_putfirst_doubled(self, list_of_links):
+ # TODO: is this even needed? No family in the framework has this
+ # dictionary defined!
+ if self.lang in self.family.interwiki_putfirst_doubled:
+ if len(list_of_links) >= self.family.interwiki_putfirst_doubled[self.lang][0]:
+ list_of_links2 = []
+ for lang in list_of_links:
+ list_of_links2.append(lang.language())
+ result = [] # renamed from 'list' to avoid shadowing the built-in
+ for lang in self.family.interwiki_putfirst_doubled[self.lang][1]:
+ try:
+ result.append(list_of_links[list_of_links2.index(lang)])
+ except ValueError:
+ pass
+ return result
+ else:
+ return False
+ else:
+ return False
+
+ def getSite(self, code):
+ """Return Site object for language 'code' in this Family."""
+ return getSite(code = code, fam = self.family, user=self.user)
+
+ def namespace(self, num, all = False):
+ """Return string containing local name of namespace 'num'.
+
+ If optional argument 'all' is true, return a tuple of all recognized
+ values for this namespace.
+
+ """
+ return self.family.namespace(self.lang, num, all = all)
+
+ def normalizeNamespace(self, value):
+ """Return canonical name for namespace 'value' in this Site's language.
+
+ 'Value' should be a string or unicode.
+ If no match, return 'value' unmodified.
+
+ """
+ if not self.nocapitalize:
+ # make sure first letter gets normalized; there is at least
+ # one case ("İ") in which s.lower().upper() != s
+ value = value[0].lower().upper() + value[1:]
+ return self.family.normalizeNamespace(self.lang, value)
+
+ def getNamespaceIndex(self, namespace):
+ """Given a namespace name, return its int index, or None if invalid."""
+ return self.family.getNamespaceIndex(self.lang, namespace)
+
+ def language(self):
+ """Return Site's language code."""
+ return self.lang
+
+ def fam(self):
+ """Return Family object for this Site."""
+ return self.family
+
+ def disambcategory(self):
+ """Return Category in which disambig pages are listed."""
+ import catlib
+ try:
+ return catlib.Category(self,
+ self.namespace(14)+':'+self.family.disambcatname[self.lang])
+ except KeyError:
+ raise NoPage
+
+ def getToken(self, getalways = True, getagain = False, sysop = False):
+ index = self._userIndex(sysop)
+ if getagain or (getalways and self._token[index] is None):
+ output(u'Getting a token.')
+ self._load(sysop = sysop, force = True)
+ if self._token[index] is not None:
+ return self._token[index]
+ else:
+ return False
+
+ def getPatrolToken(self, sysop = False):
+ index = self._userIndex(sysop)
+
+ if self._patrolToken[index] is None:
+ output(u'Getting a patrol token.')
+ params = {
+ 'action' : 'query',
+ 'list' : 'recentchanges',
+ 'rcshow' : '!patrolled',
+ 'rctoken' : 'patrol',
+ 'rclimit' : 1,
+ }
+ data = query.GetData(params, self, encodeTitle = False)
+ if 'error' in data:
+ raise RuntimeError('%s' % data['error'])
+ try:
+ rcData = data['query']['recentchanges']
+ except KeyError:
+ raise ServerError("The APIs don't return data, the site may be down")
+
+ self._patrolToken[index] = rcData[0]['patroltoken']
+
+ return self._patrolToken[index]
+
+ def getFilesFromAnHash(self, hash_found = None):
+ """ Function that uses APIs to give the images that has the same hash. Useful
+ to find duplicates or nowcommons.
+
+ NOTE: it returns also the image itself, if you don't want it, just
+ filter the list returned.
+
+ NOTE 2: it returns the image WITHOUT the image namespace.
+ """
+ if self.versionnumber() < 12:
+ return None
+
+ if hash_found is None: # no hash given, nothing to look up
+ return None
+ # Now get all the images with the same hash
+ #action=query&format=xml&list=allimages&aisha1=%s
+ image_namespace = "%s:" % self.image_namespace() # Image:
+ params = {
+ 'action' :'query',
+ 'list' :'allimages',
+ 'aisha1' :hash_found,
+ }
+ allimages = query.GetData(params, self, encodeTitle = False)['query']['allimages']
+ files = list()
+ for imagedata in allimages:
+ image = imagedata[u'name']
+ files.append(image)
+ return files
+
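+ # Illustrative usage sketch ('mysite' stands for a Site object and the
+ # SHA1 value is a placeholder):
+ # duplicates = mysite.getFilesFromAnHash(u'<sha1 of the file>')
+ # 'duplicates' is a list of file names without the image namespace prefix,
+ # or None if no hash was given or the wiki is too old.
+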
+ def getParsedString(self, string, keeptags = [u'*']):
+ """Parses the string with API and returns html content.
+
+ @param string: String that should be parsed.
+ @type string: string
+ @param keeptags: Defines which tags (wiki, HTML) should NOT be removed.
+ @type keeptags: list
+
+ Returns the string given, parsed through the wiki parser.
+ """
+
+ if not self.has_api():
+ raise Exception('parse: no API: not implemented')
+
+ # call the wiki to get info
+ params = {
+ u'action' : u'parse',
+ u'text' : string,
+ }
+
+ pywikibot.get_throttle()
+ pywikibot.output(u"Parsing string through the wiki parser via API.")
+
+ result = query.GetData(params, self)
+ r = result[u'parse'][u'text'][u'*']
+
+ # disable/remove comments
+ r = pywikibot.removeDisabledParts(r, tags = ['comments']).strip()
+
+ # disable/remove ALL tags
+ if not (keeptags == [u'*']):
+ r = removeHTMLParts(r, keeptags = keeptags).strip()
+
+ return r
+
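+ # Illustrative usage sketch ('mysite' stands for a Site object):
+ # html = mysite.getParsedString(u"'''bold''' text", keeptags=[u'b'])
+ # parses the wikitext on the live wiki and strips every tag except <b>
+ # from the returned HTML.
+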
+ def getExpandedString(self, string):
+ """Expands the string with API and returns wiki content.
+
+ @param string: String that should be expanded.
+ @type string: string
+
+ Returns the string given, expanded through the wiki parser.
+ """
+
+ if not self.has_api():
+ raise Exception('expandtemplates: no API: not implemented')
+
+ # call the wiki to get info
+ params = {
+ u'action' : u'expandtemplates',
+ u'text' : string,
+ }
+
+ pywikibot.get_throttle()
+ pywikibot.output(u"Expanding string through the wiki parser via API.")
+
+ result = query.GetData(params, self)
+ r = result[u'expandtemplates'][u'*']
+
+ return r
+
+# Caches to provide faster access
+_sites = {}
+_namespaceCache = {}
+
+@deprecate_arg("persistent_http", None)
+def getSite(code=None, fam=None, user=None, noLogin=False):
+ if code is None:
+ code = default_code
+ if fam is None:
+ fam = default_family
+ key = '%s:%s:%s' % (fam, code, user)
+ if key not in _sites:
+ _sites[key] = Site(code=code, fam=fam, user=user)
+ ret = _sites[key]
+ if not ret.family.isPublic(code) and not noLogin:
+ ret.forceLogin()
+ return ret
+
+def setSite(site):
+ global default_code, default_family
+ default_code = site.language()
+ default_family = site.family
+
+# Command line parsing and help
+
+def calledModuleName():
+ """Return the name of the module calling this function.
+
+ This is required because the -help option loads the module's docstring
+ and because the module name will be used for the filename of the log.
+
+ """
+ # get commandline arguments
+ called = sys.argv[0].strip()
+ if ".py" in called: # could end with .pyc, .pyw, etc. on some platforms
+ # clip off the '.py?' filename extension
+ called = called[:called.rindex('.py')]
+ return os.path.basename(called)
+
+def _decodeArg(arg):
+ # We may pass a Unicode string to a script upon importing and calling
+ # main() from another script.
+ if isinstance(arg,unicode):
+ return arg
+ if sys.platform == 'win32':
+ if config.console_encoding in ('cp437', 'cp850'):
+ # Western Windows versions give parameters encoded as windows-1252
+ # even though the console encoding is cp850 or cp437.
+ return unicode(arg, 'windows-1252')
+ elif config.console_encoding == 'cp852':
+ # Central/Eastern European Windows versions give parameters encoded
+ # as windows-1250 even though the console encoding is cp852.
+ return unicode(arg, 'windows-1250')
+ else:
+ return unicode(arg, config.console_encoding)
+ else:
+ # Linux uses the same encoding for both.
+ # I don't know how non-Western Windows versions behave.
+ return unicode(arg, config.console_encoding)
+
+def handleArgs(*args):
+ """Handle standard command line arguments, return the rest as a list.
+
+ Takes the commandline arguments, converts them to Unicode, processes all
+ global parameters such as -lang or -log. Returns a list of all arguments
+ that are not global. This makes sure that global arguments are applied
+ first, regardless of the order in which the arguments were given.
+
+ args may be passed as an argument, thereby overriding sys.argv
+
+ """
+ global default_code, default_family, verbose, debug, simulate
+ # get commandline arguments if necessary
+ if not args:
+ args = sys.argv[1:]
+ # get the name of the module calling this function. This is
+ # required because the -help option loads the module's docstring and because
+ # the module name will be used for the filename of the log.
+ moduleName = calledModuleName()
+ nonGlobalArgs = []
+ username = None
+ do_help = False
+ for arg in args:
+ arg = _decodeArg(arg)
+ if arg == '-help':
+ do_help = True
+ elif arg.startswith('-dir:'):
+ pass # config_dir = arg[5:] // currently handled in wikipediatools.py - possibly before this routine is called.
+ elif arg.startswith('-family:'):
+ default_family = arg[8:]
+ elif arg.startswith('-lang:'):
+ default_code = arg[6:]
+ elif arg.startswith("-user:"):
+ username = arg[len("-user:") : ]
+ elif arg.startswith('-putthrottle:'):
+ config.put_throttle = int(arg[len("-putthrottle:") : ])
+ put_throttle.setDelay()
+ elif arg.startswith('-pt:'):
+ config.put_throttle = int(arg[len("-pt:") : ])
+ put_throttle.setDelay()
+ elif arg.startswith("-maxlag:"):
+ config.maxlag = int(arg[len("-maxlag:") : ])
+ elif arg == '-log':
+ setLogfileStatus(True)
+ elif arg.startswith('-log:'):
+ setLogfileStatus(True, arg[5:])
+ elif arg == '-nolog':
+ setLogfileStatus(False)
+ elif arg in ['-verbose', '-v']:
+ verbose += 1
+ elif arg == '-daemonize':
+ import daemonize
+ daemonize.daemonize()
+ elif arg.startswith('-daemonize:'):
+ import daemonize
+ daemonize.daemonize(redirect_std = arg[11:])
+ elif arg in ['-cosmeticchanges', '-cc']:
+ config.cosmetic_changes = not config.cosmetic_changes
+ output(u'NOTE: option cosmetic_changes is %s\n' % config.cosmetic_changes)
+ elif arg == '-simulate':
+ simulate = True
+ elif arg == '-dry':
+ output(u"Usage of -dry is deprecated; use -simulate instead.")
+ simulate = True
+ # global debug option for development purposes. Normally does nothing.
+ elif arg == '-debug':
+ debug = True
+ config.special_page_limit = 500
+ else:
+ # the argument is not global. Let the specific bot script care
+ # about it.
+ nonGlobalArgs.append(arg)
+
+ if username:
+ config.usernames[default_family][default_code] = username
+
+ # TEST for bug #3081100
+ if unicode_error:
+ output("""
+
+================================================================================
+\03{lightyellow}WARNING:\03{lightred} your python version might trigger issue #3081100\03{default}
+More information: See https://sourceforge.net/support/tracker.php?aid=3081100
+\03{lightyellow}Please update python to 2.7.2+ if you are running on wikimedia sites!\03{default}
+================================================================================
+
+""")
+ if verbose:
+ output(u'Pywikipediabot %s' % (version.getversion()))
+ output(u'Python %s' % sys.version)
+
+ if do_help:
+ showHelp()
+ sys.exit(0)
+ return nonGlobalArgs
+
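+# Illustrative usage sketch: a bot script typically calls
+# for arg in handleArgs():
+# ... # handle the script's own, non-global options
+# or passes explicit arguments, e.g.
+# handleArgs('-lang:de', '-family:wikipedia', '-v', '-myoption')
+# which applies the global options and returns [u'-myoption'].
+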
+def showHelp(moduleName=None):
+ # the parameter moduleName is deprecated and should be left out.
+ moduleName = moduleName or calledModuleName()
+ try:
+ moduleName = moduleName[moduleName.rindex("\\")+1:]
+ except ValueError: # There was no \ in the module name, so presumably no problem
+ pass
+
+ globalHelp = u'''
+Global arguments available for all bots:
+
+-dir:PATH Read the bot's configuration data from directory given by
+ PATH, instead of from the default directory.
+
+-lang:xx Set the language of the wiki you want to work on, overriding
+ the configuration in user-config.py. xx should be the
+ language code.
+
+-family:xyz Set the family of the wiki you want to work on, e.g.
+ wikipedia, wiktionary, wikitravel, ...
+ This will override the configuration in user-config.py.
+
+-user:xyz Log in as user 'xyz' instead of the default username.
+
+-daemonize:xyz Immediately return control to the terminal and redirect
+ stdout and stderr to xyz (only use for bots that require
+ no input from stdin).
+
+-help Show this help text.
+
+-log Enable the logfile, using the default filename
+ "%s.log"
+ Logs will be stored in the logs subdirectory.
+
+-log:xyz Enable the logfile, using 'xyz' as the filename.
+
+-maxlag Sets a new maxlag parameter to a number of seconds. Defer bot
+ edits during periods of database server lag. Default is set by
+ config.py
+
+-nolog Disable the logfile (if it is enabled by default).
+
+-putthrottle:n Set the minimum time (in seconds) the bot will wait between
+-pt:n saving pages.
+
+-verbose Have the bot provide additional output that may be
+-v useful in debugging.
+
+-cosmeticchanges Toggles the cosmetic_changes setting made in config.py or
+-cc user_config.py to its inverse and overrules it. All other
+ settings and restrictions are untouched.
+
+-simulate Disables writing to the server. Useful for testing and
+(-dry) debugging of new code (if given, doesn't do any real
+ changes, but only shows what would have been changed).
+ DEPRECATED: please use -simulate instead of -dry
+''' % moduleName
+ output(globalHelp, toStdout=True)
+ try:
+ exec('import %s as module' % moduleName)
+ helpText = module.__doc__.decode('utf-8')
+ if hasattr(module, 'docuReplacements'):
+ for key, value in module.docuReplacements.iteritems():
+ helpText = helpText.replace(key, value.strip('\n\r'))
+ output(helpText, toStdout=True)
+ except:
+ output(u'Sorry, no help available for %s' % moduleName)
+
+#########################
+# Interpret configuration
+#########################
+
+# search for user interface module in the 'userinterfaces' subdirectory
+sys.path.append(config.datafilepath('userinterfaces'))
+exec "import %s_interface as uiModule" % config.userinterface
+ui = uiModule.UI()
+verbose = 0
+debug = False
+simulate = False
+
+# TEST for bug #3081100
+unicode_error = __import__('unicodedata').normalize(
+ 'NFC',
+ u'\u092e\u093e\u0930\u094d\u0915 \u091c\u093c\u0941\u0915\u0947\u0930\u092c\u0930\u094d\u0917'
+ ) != u'\u092e\u093e\u0930\u094d\u0915 \u091c\u093c\u0941\u0915\u0947\u0930\u092c\u0930\u094d\u0917'
+if unicode_error:
+ print u'unicode test: triggers problem #3081100'
+
+default_family = config.family
+default_code = config.mylang
+logfile = None
+# Check
+
+# if the default family+wiki is a non-public one,
+# getSite will try login in. We don't want that, the module
+# is not yet loaded.
+getSite(noLogin=True)
+
+# Set socket timeout
+socket.setdefaulttimeout(config.socket_timeout)
+
+def writeToCommandLogFile():
+ """
+ Save the name of the called module along with all parameters to
+ logs/commands.log so that the user can look it up later to track errors
+ or report bugs.
+ """
+ modname = os.path.basename(sys.argv[0])
+ # put quotation marks around all parameters
+ args = [_decodeArg(modname)] + [_decodeArg('"%s"' % s) for s in sys.argv[1:]]
+ commandLogFilename = config.datafilepath('logs', 'commands.log')
+ try:
+ commandLogFile = codecs.open(commandLogFilename, 'a', 'utf-8')
+ except IOError:
+ commandLogFile = codecs.open(commandLogFilename, 'w', 'utf-8')
+ # add a timestamp in ISO 8601 format
+ isoDate = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
+ commandLogFile.write("%s r%s Python %s "
+ % (isoDate, version.getversiondict()['rev'],
+ sys.version.split()[0]))
+ s = u' '.join(args)
+ commandLogFile.write(s + os.linesep)
+ commandLogFile.close()
+
+def setLogfileStatus(enabled, logname = None):
+ global logfile
+ if enabled:
+ if not logname:
+ logname = '%s.log' % calledModuleName()
+ logfn = config.datafilepath('logs', logname)
+ try:
+ logfile = codecs.open(logfn, 'a', 'utf-8')
+ except IOError:
+ logfile = codecs.open(logfn, 'w', 'utf-8')
+ else:
+ # disable the log file
+ logfile = None
+
+if '*' in config.log or calledModuleName() in config.log:
+ setLogfileStatus(True)
+
+writeToCommandLogFile()
+
+colorTagR = re.compile('\03{.*?}', re.UNICODE)
+
+def log(text):
+ """Write the given text to the logfile."""
+ if logfile:
+ # remove all color markup
+ plaintext = colorTagR.sub('', text)
+ # save the text in a logfile (will be written in utf-8)
+ logfile.write(plaintext)
+ logfile.flush()
+
+output_lock = threading.Lock()
+input_lock = threading.Lock()
+output_cache = []
+
+def output(text, decoder=None, newline=True, toStdout=False, **kwargs):
+ """Output a message to the user via the userinterface.
+
+ Works like print, but uses the encoding used by the user's console
+ (console_encoding in the configuration file) instead of ASCII.
+ If decoder is None, text should be a unicode string. Otherwise it
+ should be encoded in the given encoding.
+
+ If newline is True, a linebreak will be added after printing the text.
+
+ If toStdout is True, the text will be sent to standard output,
+ so that it can be piped to another process. All other text will
+ be sent to stderr. See: http://en.wikipedia.org/wiki/Pipeline_%28Unix%29
+
+ text can contain special sequences to create colored output. These
+ consist of the escape character \03 and the color name in curly braces,
+ e. g. \03{lightpurple}. \03{default} resets the color.
+
+ """
+ output_lock.acquire()
+ try:
+ if decoder:
+ text = unicode(text, decoder)
+ elif type(text) is not unicode:
+ if verbose and sys.platform != 'win32':
+ print "DBG> BUG: Non-unicode (%s) passed to wikipedia.output without decoder!" % type(text)
+ print traceback.print_stack()
+ print "DBG> Attempting to recover, but please report this problem"
+ try:
+ text = unicode(text, 'utf-8')
+ except UnicodeDecodeError:
+ text = unicode(text, 'iso8859-1')
+ if newline:
+ text += u'\n'
+ log(text)
+ if input_lock.locked():
+ cache_output(text, toStdout = toStdout)
+ else:
+ ui.output(text, toStdout = toStdout)
+ finally:
+ output_lock.release()
+
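+# Illustrative usage sketch:
+# output(u'\03{lightyellow}Warning:\03{default} page not found')
+# prints "Warning:" in light yellow on capable consoles and then resets the
+# colour to the default.
+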
+def cache_output(*args, **kwargs):
+ output_cache.append((args, kwargs))
+
+def flush_output_cache():
+ while(output_cache):
+ (args, kwargs) = output_cache.pop(0)
+ ui.output(*args, **kwargs)
+
+# User input functions
+
+def input(question, password = False):
+ """Ask the user a question, return the user's answer.
+
+ Parameters:
+ * question - a unicode string that will be shown to the user. Don't add a
+ space after the question mark/colon, this method will do this
+ for you.
+ * password - if True, hides the user's input (for password entry).
+
+ Returns a unicode string.
+
+ """
+ input_lock.acquire()
+ try:
+ data = ui.input(question, password)
+ finally:
+ flush_output_cache()
+ input_lock.release()
+
+ return data
+
+def inputChoice(question, answers, hotkeys, default = None):
+ """Ask the user a question with several options, return the user's choice.
+
+ The user's input will be case-insensitive, so the hotkeys should be
+ distinctive case-insensitively.
+
+ Parameters:
+ * question - a unicode string that will be shown to the user. Don't add a
+ space after the question mark, this method will do this
+ for you.
+ * answers - a list of strings that represent the options.
+ * hotkeys - a list of one-letter strings, one for each answer.
+ * default - an element of hotkeys, or None. The default choice that will
+ be returned when the user just presses Enter.
+
+ Returns a one-letter string in lowercase.
+
+ """
+ input_lock.acquire()
+ try:
+ data = ui.inputChoice(question, answers, hotkeys, default).lower()
+ finally:
+ flush_output_cache()
+ input_lock.release()
+
+ return data
+
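+# Illustrative usage sketch:
+# choice = inputChoice(u'Do you want to save the page?',
+# ['Yes', 'No'], ['y', 'n'], default='n')
+# returns 'y' or 'n' in lowercase; pressing Enter returns the default 'n'.
+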
+
+page_put_queue = Queue.Queue(config.max_queue_size)
+def async_put():
+ """Daemon; take pages from the queue and try to save them on the wiki."""
+ while True:
+ (page, newtext, comment, watchArticle,
+ minorEdit, force, callback) = page_put_queue.get()
+ if page is None:
+ # an explicit end-of-Queue marker is needed for compatibility
+ # with Python 2.4; in 2.5, we could use the Queue's task_done()
+ # and join() methods
+ return
+ try:
+ page.put(newtext, comment, watchArticle, minorEdit, force)
+ error = None
+ except Exception, error:
+ pass
+ if callback is not None:
+ callback(page, error)
+ # if callback is provided, it is responsible for exception handling
+ continue
+ if isinstance(error, SpamfilterError):
+ output(u"Saving page %s prevented by spam filter: %s"
+ % (page, error.url))
+ elif isinstance(error, PageNotSaved):
+ output(u"Saving page %s failed: %s" % (page, error))
+ elif isinstance(error, LockedPage):
+ output(u"Page %s is locked; not saved." % page)
+ elif isinstance(error, NoUsername):
+ output(u"Page %s not saved; sysop privileges required." % page)
+ elif error is not None:
+ tb = traceback.format_exception(*sys.exc_info())
+ output(u"Saving page %s failed:\n%s" % (page, "".join(tb)))
+
+_putthread = threading.Thread(target=async_put)
+# identification for debugging purposes
+_putthread.setName('Put-Thread')
+_putthread.setDaemon(True)
+## Don't start the queue if it is not necessary.
+#_putthread.start()
+
+def stopme():
+ """This should be run when a bot does not interact with the Wiki, or
+ when it has stopped doing so. After a bot has run stopme() it will
+ not slow down other bots any more.
+ """
+ get_throttle.drop()
+
+def _flush():
+ """Wait for the page-putter to flush its queue.
+
+ Called automatically upon exiting from Python.
+
+ """
+ def remaining():
+ import datetime
+ remainingPages = page_put_queue.qsize() - 1
+ # -1 because we added a None element to stop the queue
+ remainingSeconds = datetime.timedelta(
+ seconds=(remainingPages * put_throttle.getDelay(True)))
+ return (remainingPages, remainingSeconds)
+
+ page_put_queue.put((None, None, None, None, None, None, None))
+
+ if page_put_queue.qsize() > 1:
+ output(u'Waiting for %i pages to be put. Estimated time remaining: %s'
+ % remaining())
+
+ while(_putthread.isAlive()):
+ try:
+ _putthread.join(1)
+ except KeyboardInterrupt:
+ answer = inputChoice(u"""\
+There are %i pages remaining in the queue. Estimated time remaining: %s
+Really exit?"""
+ % remaining(),
+ ['yes', 'no'], ['y', 'N'], 'N')
+ if answer == 'y':
+ return
+ try:
+ get_throttle.drop()
+ except NameError:
+ pass
+ if config.use_diskcache and not config.use_api:
+ for site in _sites.itervalues():
+ if site._mediawiki_messages:
+ try:
+ site._mediawiki_messages.delete()
+ except OSError:
+ pass
+
+import atexit
+atexit.register(_flush)
+
+def debugDump(name, site, error, data):
+ import time
+ name = unicode(name)
+ error = unicode(error)
+ site = unicode(repr(site).replace(u':',u'_'))
+ filename = '%s_%s__%s.dump' % (name, site, time.asctime())
+ filename = filename.replace(' ','_').replace(':','-')
+ f = file(filename, 'wb') #trying to write it in binary
+ #f = codecs.open(filename, 'w', 'utf-8')
+ f.write(u'Error reported: %s\n\n' % error)
+ try:
+ f.write(data.encode("utf8"))
+ except UnicodeDecodeError:
+ f.write(data)
+ f.close()
+ output( u'ERROR: %s caused error %s. Dump %s created.' % (name,error,filename) )
+
+get_throttle = Throttle()
+put_throttle = Throttle(write=True)
+
+def decompress_gzip(data):
+ # Use cStringIO if available
+ # TODO: rewrite gzip.py such that it supports unseekable fileobjects.
+ if data:
+ try:
+ from cStringIO import StringIO
+ except ImportError:
+ from StringIO import StringIO
+ import gzip
+ try:
+ data = gzip.GzipFile(fileobj = StringIO(data)).read()
+ except IOError:
+ raise
+ return data
+
+def parsetime2stamp(tz):
+ s = time.strptime(tz, "%Y-%m-%dT%H:%M:%SZ")
+ return int(time.strftime("%Y%m%d%H%M%S", s))
+
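For example, parsetime2stamp() turns a MediaWiki ISO 8601 timestamp into the compact numeric form used elsewhere:

    parsetime2stamp("2012-09-16T13:36:31Z")   # -> 20120916133631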
+
+ # Redirect handler for urllib2
+class U2RedirectHandler(urllib2.HTTPRedirectHandler):
+
+ def redirect_request(self, req, fp, code, msg, headers, newurl):
+ newreq = urllib2.HTTPRedirectHandler.redirect_request(self, req, fp, code, msg, headers, newurl)
+ if (newreq.get_method() == "GET"):
+ for cl in "Content-Length", "Content-length", "content-length", "CONTENT-LENGTH":
+ if newreq.has_header(cl):
+ del newreq.headers[cl]
+ return newreq
+
+ def http_error_301(self, req, fp, code, msg, headers):
+ result = urllib2.HTTPRedirectHandler.http_error_301(
+ self, req, fp, code, msg, headers)
+ result.code = code
+ result.sheaders = [v for v in str(headers).split('\n') if v.startswith('Set-Cookie:')]
+ return result
+
+ def http_error_302(self, req, fp, code, msg, headers):
+ result = urllib2.HTTPRedirectHandler.http_error_302(
+ self, req, fp, code, msg, headers)
+ result.code = code
+ result.sheaders = [v for v in str(headers).split('\n') if v.startswith('Set-Cookie:')]
+ return result
+
+# Site Cookies handler
+COOKIEFILE = config.datafilepath('login-data', 'cookies.lwp')
+cj = cookielib.LWPCookieJar()
+if os.path.isfile(COOKIEFILE):
+ cj.load(COOKIEFILE)
+
+cookieProcessor = urllib2.HTTPCookieProcessor(cj)
+
+
+MyURLopener = urllib2.build_opener(U2RedirectHandler)
+
+if config.proxy['host']:
+ proxyHandler = urllib2.ProxyHandler({'http':'http://%s/' % config.proxy['host'] })
+
+ MyURLopener.add_handler(proxyHandler)
+ if config.proxy['auth']:
+ proxyAuth = urllib2.HTTPPasswordMgrWithDefaultRealm()
+ proxyAuth.add_password(None, config.proxy['host'], config.proxy['auth'][0], config.proxy['auth'][1])
+ proxyAuthHandler = urllib2.ProxyBasicAuthHandler(proxyAuth)
+
+ MyURLopener.add_handler(proxyAuthHandler)
+
+if config.authenticate:
+ passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
+ for site in config.authenticate:
+ passman.add_password(None, site, config.authenticate[site][0], config.authenticate[site][1])
+ authhandler = urllib2.HTTPBasicAuthHandler(passman)
+
+ MyURLopener.add_handler(authhandler)
+
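The proxy and HTTP-authentication handlers above expect configuration entries shaped roughly as follows; the option names are taken from the code, while the hosts and credentials shown are placeholders:

    # in config.py / user-config.py (hypothetical values)
    proxy = {'host': 'proxy.example.org:8080',        # a falsy host disables the handler
             'auth': ('proxyuser', 'proxypassword')}  # or None if no proxy login is needed
    authenticate = {'intranet.example.org': ('httpuser', 'httppassword')}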
+MyURLopener.addheaders = [('User-agent', useragent)]
+
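A short sketch of a request going through the opener assembled above; the URL is arbitrary and nothing here is specific to this commit:

    req = urllib2.Request('http://www.mediawiki.org/wiki/Special:Version')
    response = MyURLopener.open(req)
    text = response.read()
    # after a redirect, U2RedirectHandler keeps the original 301/302 status in
    # response.code and any Set-Cookie headers it saw in response.sheaders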
+ # This is a temporary section for the 2012 version survey
+ # http://thread.gmane.org/gmane.comp.python.pywikipediabot.general/12473
+ # Upon removing this, the connected lines in config.py should be removed, too.
+if not config.suppresssurvey:
+ output(
+"""
+\03{lightyellow}Dear Pywikipedia user!\03{default}
+Pywikibot has detected that you are using an outdated version of Python:
+%s.
+We would like to hear your voice before ceasing support of this version.
+Please update to \03{lightyellow}Python 2.7.2\03{default} or higher if possible, or visit
+http://www.mediawiki.org/wiki/Pywikipediabot/Survey2012 to tell us why we
+should support your version and to learn how to hide this message.
+After collecting opinions for a while, we will decide on and announce the
+deadline for dropping support for old Python versions in Pywikipedia.
+""" % sys.version)
+
+if __name__ == '__main__':
+ import doctest
+ print 'Pywikipediabot %s' % version.getversion()
+ print 'Python %s' % sys.version
+ doctest.testmod()
+
http://www.mediawiki.org/wiki/Special:Code/pywikipedia/10527
Revision: 10527
Author: xqt
Date: 2012-09-16 13:36:31 +0000 (Sun, 16 Sep 2012)
Log Message:
-----------
update (c) date
Modified Paths:
--------------
trunk/pywikipedia/LICENSE
Modified: trunk/pywikipedia/LICENSE
===================================================================
--- trunk/pywikipedia/LICENSE 2012-09-16 13:35:33 UTC (rev 10526)
+++ trunk/pywikipedia/LICENSE 2012-09-16 13:36:31 UTC (rev 10527)
@@ -1,4 +1,4 @@
-Copyright (c) 2005-2011 The PyWikipediaBot team
+Copyright (c) 2005-2012 The PyWikipediaBot team
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal