Feature Requests item #1993062, was opened at 2008-06-13 16:47
Message generated for change (Comment added) made by multichill
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603141&aid=1993062&group_…
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: interwiki
Group: None
Status: Open
Priority: 7
Private: No
Submitted By: Melancholie (melancholie)
Assigned to: Nobody/Anonymous (nobody)
Summary: Use API module 'parse' for retrieving interwiki links
Initial Comment:
Currently, pages are retrieved in a batch using Special:Export.
Although this is fast (only one request is made), the method carries a huge data overhead!
Why not use the API with its 'parse' module? With it, only the interwiki links need to be fetched, reducing traffic (overhead) a lot!
See:
http://de.wikipedia.org/w/api.php?action=parse&format=xml&page=Test&prop=la…
Outputs could be downloaded in parallel to simulate a batch (faster).
----
At least make this method optional (via config.py), so data traffic can be reduced if wanted. The API is simply more efficient.
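A minimal sketch of the proposed approach (Python; the function name and lack of error handling are mine, not part of the framework):

    import urllib
    import xml.etree.ElementTree as ET

    def fetch_langlinks(page, host="de.wikipedia.org"):
        # Hypothetical sketch: fetch only the language links of one page via
        # the API 'parse' module instead of a full Special:Export dump.
        params = urllib.urlencode({'action': 'parse', 'format': 'xml',
                                   'page': page.encode('utf-8'),
                                   'prop': 'langlinks'})
        data = urllib.urlopen("http://%s/w/api.php?%s" % (host, params)).read()
        root = ET.fromstring(data)
        # each <ll lang="xx">Title</ll> element is one interlanguage link
        return [(ll.get('lang'), ll.text) for ll in root.findall('.//ll')]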
----------------------------------------------------------------------
>Comment By: Multichill (multichill)
Date: 2008-11-13 12:46
Message:
We are working on a rewrite. The rewrite uses the API as much as possible.
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-06-15 01:27
Message:
Logged In: YES
user_id=2089773
Originator: YES
See http://meta.wikimedia.org/wiki/Interwiki_bot_access_protocol
concerning disambiguations and redirects:
http://de.wikipedia.org/w/api.php?action=parse&format=xml&text={{:Main_Page…
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-06-14 16:38
Message:
Logged In: YES
user_id=2089773
Originator: YES
Backwards compatibility?
That's no reason not to make the software more efficient where possible
;-)
That's also why I wrote something about making it "optional".
Since current MediaWiki wikis offer a much more efficient way of
retrieving (only) certain contents (langlinks, categories), there should
be a way to take advantage of it! It would reduce load (the bot owner's
and the server's)...
----------------------------------------------------------------------
Comment By: Bryan (btongminh)
Date: 2008-06-13 20:44
Message:
Logged In: YES
user_id=1806226
Originator: NO
Backwards compatibility with non-Wikimedia wikis?
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-06-13 17:20
Message:
Logged In: YES
user_id=2089773
Originator: YES
To keep this scheme from being misused to confuse bots, the yet-to-be-created
MediaWiki message could contain [[foreigncode:{{CURRENTTIMESTAMP}}]] (cache issue?).
(Sorry for spamming this request ;-)
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-06-13 17:08
Message:
Logged In: YES
user_id=2089773
Originator: YES
Important note for getting pages' interwikis in a batch:
http://de.wikipedia.org/w/api.php?action=parse&text={{:Test}}{{:Bot}}{{:Hau…
Either the bot could then figure out which interwikis belong together, or
maybe a marker could be placed in between:
http://de.wikipedia.org/w/api.php?action=parse&text={{:Test}}{{MediaWiki:Iw…
[[MediaWiki:Iwmarker]] (or 'Llmarker'?) would have to be set up by the
MediaWiki developers with [[en:/de:Abuse-save-mark]] as content (but this
is potentially open to misuse).
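To illustrate the idea, post-processing could split the combined langlinks of one batched 'parse' call on the (hypothetical) marker discussed above:

    MARKER = ('en', 'Abuse-save-mark')  # assumed content of MediaWiki:Iwmarker

    def split_on_marker(langlinks):
        # Hypothetical sketch: each group between markers belongs to one of
        # the pages transcluded into the batched 'parse' request.
        groups, current = [], []
        for lang, title in langlinks:
            if (lang, title) == MARKER:
                groups.append(current)
                current = []
            else:
                current.append((lang, title))
        groups.append(current)
        return groups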
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-06-13 16:51
Message:
Logged In: YES
user_id=2089773
Originator: YES
Note: Maybe combine it with 'generator'.
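For illustration, the 'query' module already combines prop=langlinks with a generator (example URL, not from this request):
http://de.wikipedia.org/w/api.php?action=query&format=xml&generator=allpages&gaplimit=5&prop=langlinks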
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603141&aid=1993062&group_…
Feature Requests item #2255146, was opened at 2008-11-10 12:44
Message generated for change (Comment added) made by multichill
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603141&aid=2255146&group_…
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
>Status: Closed
Priority: 5
Private: No
Submitted By: Nobody/Anonymous (nobody)
>Assigned to: Multichill (multichill)
Summary: Page moves should not suppress redirects by default
Initial Comment:
When moving pages with Page.move(), a redirect should be created by default. Users with the 'suppressredirect' right (e.g. bots on Wikimedia projects) should be able to suppress redirects. I propose adding a parameter leaveRedirect=True (or possibly suppressRedirect=False) to the function.
----------------------------------------------------------------------
>Comment By: Multichill (multichill)
Date: 2008-11-13 12:44
Message:
In svn revision 6084 I implemented leaveRedirect; by default it is true.
     def move(self, newtitle, reason=None, movetalkpage=True, sysop=False,
-             throttle=True, deleteAndMove=False, safe=True,
-             fixredirects=True):
+             throttle=True, deleteAndMove=False, safe=True,
+             fixredirects=True, leaveRedirect=True):
         """Move this page to new title given by newtitle. If safe, don't try
         to move and delete if not directly requested.
@@ -2226,6 +2226,10 @@
             predata['wpFixRedirects'] = '1'
         else:
             predata['wpFixRedirects'] = '0'
+        if leaveRedirect:
+            predata['wpLeaveRedirect'] = '1'
+        else:
+            predata['wpLeaveRedirect'] = '0'
         if token:
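For illustration, a hypothetical call using the new parameter (titles and reason are made up):

    page = wikipedia.Page(wikipedia.getSite(), u'Old title')
    page.move(u'New title', reason=u'example')                       # keeps a redirect (default)
    page.move(u'New title', reason=u'example', leaveRedirect=False)  # suppresses the redirect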
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603141&aid=2255146&group_…
Feature Requests item #2269013, was opened at 2008-11-12 12:42
Message generated for change (Comment added) made by multichill
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603141&aid=2269013&group_…
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
>Status: Closed
Priority: 5
Private: No
Submitted By: Yann Forget (yannforget)
Assigned to: Nobody/Anonymous (nobody)
Summary: Add options -cat and -file to fixing_redirects.py
Initial Comment:
Hello,
Please add the options -cat (changing all pages in a category) and -file (changing all pages listed in a file) to fixing_redirects.py
Thanks, Yann
----------------------------------------------------------------------
>Comment By: Multichill (multichill)
Date: 2008-11-13 12:39
Message:
As discussed on IRC: this is already implemented. fixing_redirects.py uses
the pagegenerators module, so all page generator options (including -cat
and -file) are available.
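For example (standard pagegenerators syntax; the category and file names are placeholders):

    python fixing_redirects.py -cat:Example_category
    python fixing_redirects.py -file:titles.txt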
----------------------------------------------------------------------
Comment By: Yann Forget (yannforget)
Date: 2008-11-12 12:44
Message:
and also to movepages.py
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603141&aid=2269013&group_…
Bugs item #2011362, was opened at 2008-07-05 20:07
Message generated for change (Comment added) made by melancholie
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=2011362&group_…
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Melancholie (melancholie)
Assigned to: Nobody/Anonymous (nobody)
Summary: Update featured.py (patch)
Initial Comment:
Both hiwiki and yiwiki use Template:Link_FA; the pages hi:Template:Lien AdQ and yi:Template:רא are only redirects.
By re-adding 'Link FA' to the langs arrays, the bot will use Template:Link_FA (instead of those redirects).
----------------------------------------------------------------------
>Comment By: Melancholie (melancholie)
Date: 2008-11-13 12:02
Message:
update
File Added: featured.diff
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-09-07 09:22
Message:
Logged In: YES
user_id=2089773
Originator: YES
syncing
File Added: featured.diff
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-08-28 09:20
Message:
Logged In: YES
user_id=2089773
Originator: YES
Added 'cy', see
http://als.wikipedia.org/wiki/Benutzer_Diskussion:Melancholie#via_roboto_.2…
File Added: featured.diff
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-08-18 23:45
Message:
Logged In: YES
user_id=2089773
Originator: YES
File Added: featured.diff
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-08-18 23:44
Message:
Logged In: YES
user_id=2089773
Originator: YES
Added szl category
File Added: featured.diff
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-08-05 13:50
Message:
Logged In: YES
user_id=2089773
Originator: YES
added patch (working current)
File Added: featured.diff
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-07-11 07:51
Message:
Logged In: YES
user_id=2089773
Originator: YES
Maybe I was a little bit too imprecise ;-)
Here is what I mean:

    templatelist = template['_default']
    try:
        templatelist += template[tosite.lang]
    ...
    + (u" {{%s|%s}}" % (templatelist[0], fromsite.lang))

templatelist[0] is 'Link FA', but the localized templates used are in
templatelist[1]. If there is a localized template (with Link_FA being only
a redirect), it should be used for the edit. So it might be good to change
the order (put the 'try:' first, then add _default with +=).
For hi and yi the easiest way might be to just comment out the localized
template names, as they are only redirects to Link_FA.
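A sketch of that reordering (variable names as quoted above; the except clause is assumed):

    try:
        templatelist = template[tosite.lang][:]   # localized template names first
    except KeyError:
        templatelist = []
    templatelist += template['_default']          # 'Link FA' as fallback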
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-07-10 11:59
Message:
Logged In: YES
user_id=2089773
Originator: YES
So {{Link FA}} is always the first choice?
Or does _default actually follow redirects?
----------------------------------------------------------------------
Comment By: NicDumZ — Nicolas Dumazet (nicdumz)
Date: 2008-07-10 10:41
Message:
Logged In: YES
user_id=1963242
Originator: NO
I'm not sure, melancholie :)
I added {{Leam VdC}} in r5707. However, since r5669 (
https://fisheye.toolserver.org/browse/pywikipedia/trunk/pywikipedia/feature…
) it uses the '_default' entry (Link FA) AND the locale entry. The list for
hi: is, for example, ['Link Fa', 'Lien AdQ'].
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-07-10 10:28
Message:
Logged In: YES
user_id=2089773
Originator: YES
Furthermore, add http://fur.wikipedia.org/wiki/Model:Leam_VdC
[I wish I were able to do these things myself, but *still* no account yet]
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=2011362&group_…
Revision: 6093
Author: russblau
Date: 2008-11-12 20:32:26 +0000 (Wed, 12 Nov 2008)
Log Message:
-----------
one more generator bug
Modified Paths:
--------------
branches/rewrite/pywikibot/data/api.py
Modified: branches/rewrite/pywikibot/data/api.py
===================================================================
--- branches/rewrite/pywikibot/data/api.py 2008-11-12 19:21:49 UTC (rev 6092)
+++ branches/rewrite/pywikibot/data/api.py 2008-11-12 20:32:26 UTC (rev 6093)
@@ -317,10 +317,10 @@
         self.request = Request(**kwargs)
         self.limit = None
         if "generator" in kwargs:
-            self.resultkey = "pages"          # name of the "query"
-        else:                                 #  subelement key
-            self.resultkey = self.module      #  to look for when iterating
-        self.continuekey = self.resultkey     # usually the query-continue key
+            self.resultkey = "pages"          # name of the "query" subelement key
+        else:                                 #  to look for when iterating
+            self.resultkey = self.module
+        self.continuekey = self.module        # usually the query-continue key
                                               # is the same as the querymodule,
                                               # but not always
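For context, a sketch of why the two keys can differ (module name and continuation fields are illustrative only):

    # With generator=backlinks, results arrive under the fixed "pages" key,
    # but continuation is still keyed by the module name:
    #   {"query": {"pages": {...}},
    #    "query-continue": {"backlinks": {"gblcontinue": "..."}}}
    # Without a generator, both keys use the module name:
    #   {"query": {"backlinks": [...]},
    #    "query-continue": {"backlinks": {"blcontinue": "..."}}}
    resultkey = "pages" if "generator" in kwargs else module
    continuekey = module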
Revision: 6092
Author: russblau
Date: 2008-11-12 19:21:49 +0000 (Wed, 12 Nov 2008)
Log Message:
-----------
bugfixes in loading revision text and isEmpty()
Modified Paths:
--------------
branches/rewrite/pywikibot/__init__.py
branches/rewrite/pywikibot/page.py
branches/rewrite/pywikibot/site.py
Modified: branches/rewrite/pywikibot/__init__.py
===================================================================
--- branches/rewrite/pywikibot/__init__.py 2008-11-12 19:21:18 UTC (rev 6091)
+++ branches/rewrite/pywikibot/__init__.py 2008-11-12 19:21:49 UTC (rev 6092)
@@ -14,6 +14,7 @@
 
 from exceptions import *
 import config2
+import textlib
 
 def deprecate_arg(old_arg, new_arg):
Modified: branches/rewrite/pywikibot/page.py
===================================================================
--- branches/rewrite/pywikibot/page.py 2008-11-12 19:21:18 UTC (rev 6091)
+++ branches/rewrite/pywikibot/page.py 2008-11-12 19:21:49 UTC (rev 6092)
@@ -279,8 +279,10 @@
         """Return True if title of this Page is in the autoFormat dictionary."""
         return self.autoFormat()[0] is not None
 
-    def get(self, force=False, get_redirect=False, throttle=None,
-            sysop=False, nofollow_redirects=None, change_edit_time=None):
+    @deprecate_arg("throttle", None)
+    @deprecate_arg("nofollow_redirects", None)
+    @deprecate_arg("change_edit_time", None)
+    def get(self, force=False, get_redirect=False, sysop=False):
         """Return the wiki-text of the page.
 
         This will retrieve the page from the server if it has not been
@@ -298,17 +300,8 @@
             redirect, do not raise an exception.
         @param sysop: if the user has a sysop account, use it to retrieve
             this page
-        @param throttle: DEPRECATED and unused
-        @param nofollow_redirects: DEPRECATED and unused
-        @param change_edit_time: DEPRECATED and unused
 
         """
-        if throttle is not None:
-            logger.debug("Page.get(throttle) option is deprecated.")
-        if nofollow_redirects is not None:
-            logger.debug("Page.get(nofollow_redirects) option is deprecated.")
-        if change_edit_time is not None:
-            logger.debug("Page.get(change_edit_time) option is deprecated.")
         if force:
             # When forcing, we retry the page no matter what. Old exceptions
             # do not apply any more.
@@ -322,32 +315,27 @@
         elif hasattr(self, '_getexception'):
             raise self._getexception
         if force or not hasattr(self, "_revid") \
-                or not self._revid in self._revisions:
+                or not self._revid in self._revisions \
+                or self._revisions[self._revid].text is None:
             self.site().loadrevisions(self, getText=True, sysop=sysop)
             # TODO: Exception handling for no-page, redirects, etc.
         return self._revisions[self._revid].text
 
+    @deprecate_arg("throttle", None)
+    @deprecate_arg("nofollow_redirects", None)
+    @deprecate_arg("change_edit_time", None)
     def getOldVersion(self, oldid, force=False, get_redirect=False,
-                      throttle=None, sysop=False, nofollow_redirects=None,
-                      change_edit_time=None):
+                      sysop=False):
         """Return text of an old revision of this page; same options as get().
 
         @param oldid: The revid of the revision desired.
 
         """
-        if throttle is not None:
-            logger.debug(
-                "Page.getOldVersion(throttle) option is deprecated.")
-        if nofollow_redirects is not None:
-            logger.debug(
-                "Page.getOldVersion(nofollow_redirects) option is deprecated.")
-        if change_edit_time is not None:
-            logger.debug(
-                "Page.getOldVersion(change_edit_time) option is deprecated.")
-        if force or not oldid in self._revisions:
-            self.site().loadrevisions(self, getText=True, ids=oldid,
-                                      sysop=sysop)
+        if force or not oldid in self._revisions \
+                or self._revisions[oldid].text is None:
+            self.site().loadrevisions(self, getText=True, revids=oldid,
+                                      sysop=sysop)
         # TODO: what about redirects, errors?
         return self._revisions[oldid].text
 
@@ -368,7 +356,7 @@
 
     def _textgetter(self):
         """Return the current (edited) wikitext, loading it if necessary."""
-        if not hasattr(self, '_text'):
+        if not hasattr(self, '_text') or self._text is None:
             try:
                 self._text = self.get()
             except pywikibot.NoPage:
@@ -427,8 +415,8 @@
         """
         txt = self.get()
-        txt = pywikibot.removeLanguageLinks(txt, site = self.site())
-        txt = pywikibot.removeCategoryLinks(txt, site = self.site())
+        txt = pywikibot.textlib.removeLanguageLinks(txt, site = self.site())
+        txt = pywikibot.textlib.removeCategoryLinks(txt, site = self.site())
         if len(txt) < 4:
             return True
         else:
@@ -443,7 +431,9 @@
         """Return other member of the article-talk page pair for this Page.
 
         If self is a talk page, returns the associated content page;
-        otherwise, returns the associated talk page.
+        otherwise, returns the associated talk page. The returned page need
+        not actually exist on the wiki.
+
         Returns None if self is a special page.
 
         """
@@ -824,12 +814,13 @@
                                      limit=limit)
         if getAll:
             revCount = len(self._revisions)
-        return [(self._revisions[rev].id,
-                 self._revisions[rev].timestamp,
-                 self._revisions[rev].user,
-                 self._revisions[rev].comment)
-                for rev in sorted(self._revisions.keys(),
-                                  reverse=not reverseOrder)[ : revCount]
+        return [ ( self._revisions[rev].revid,
+                   self._revisions[rev].timestamp,
+                   self._revisions[rev].user,
+                   self._revisions[rev].comment
+                 ) for rev in sorted(self._revisions.keys(),
+                                     reverse=not reverseOrder)[ : revCount]
+               ]
 
     def getVersionHistoryTable(self, forceReload=False, reverseOrder=False,
                                getAll=False, revCount=500):
Modified: branches/rewrite/pywikibot/site.py
===================================================================
--- branches/rewrite/pywikibot/site.py 2008-11-12 19:21:18 UTC (rev 6091)
+++ branches/rewrite/pywikibot/site.py 2008-11-12 19:21:49 UTC (rev 6092)
@@ -1293,7 +1293,10 @@
             rvgen = api.PropertyGenerator(u"info|revisions", titles=rvtitle,
                                           site=self)
         else:
-            ids = u"|".join(unicode(r) for r in revids)
+            if isinstance(revids, (int, basestring)):
+                ids = unicode(revids)
+            else:
+                ids = u"|".join(unicode(r) for r in revids)
             rvgen = api.PropertyGenerator(u"info|revisions", revids=ids,
                                           site=self)
         if getText:
@@ -1301,7 +1304,7 @@
                 u"ids|flags|timestamp|user|comment|content"
             if section is not None:
                 rvgen.request[u"rvsection"] = unicode(section)
-        if latest:
+        if latest or "revids" in rvgen.request:
             rvgen.limit = -1  # suppress use of rvlimit parameter
         elif isinstance(limit, int):
             rvgen.limit = limit
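Illustrative calls accepted by loadrevisions after this change (the page and site objects are assumed to exist):

    site.loadrevisions(page, getText=True, revids=6084)          # single int
    site.loadrevisions(page, getText=True, revids=u'6084')       # single string
    site.loadrevisions(page, getText=True, revids=[6084, 6085])  # sequence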
Revision: 6091
Author: russblau
Date: 2008-11-12 19:21:18 +0000 (Wed, 12 Nov 2008)
Log Message:
-----------
textlib.py contains functions for altering wiki-text (all functions copied-and-pasted from wikipedia.py)
Added Paths:
-----------
branches/rewrite/pywikibot/textlib.py
Added: branches/rewrite/pywikibot/textlib.py
===================================================================
--- branches/rewrite/pywikibot/textlib.py (rev 0)
+++ branches/rewrite/pywikibot/textlib.py 2008-11-12 19:21:18 UTC (rev 6091)
@@ -0,0 +1,566 @@
+# -*- coding: utf-8 -*-
+"""
+Functions for manipulating wiki-text.
+
+Unless otherwise noted, all functions take a unicode string as the argument
+and return a unicode string.
+
+"""
+#
+# (C) Pywikipedia bot team, 2008
+#
+# Distributed under the terms of the MIT license.
+#
+__version__ = '$Id: $'
+
+
+import pywikibot
+import re
+
+
+def unescape(s):
+    """Replace escaped HTML-special characters by their originals"""
+    if '&' not in s:
+        return s
+    s = s.replace("&lt;", "<")
+    s = s.replace("&gt;", ">")
+    s = s.replace("&apos;", "'")
+    s = s.replace("&quot;", '"')
+    s = s.replace("&amp;", "&") # Must be last
+    return s
+
+
+def replaceExcept(text, old, new, exceptions, caseInsensitive=False,
+                  allowoverlap=False, marker = '', site = None):
+    """
+    Return text with 'old' replaced by 'new', ignoring specified types of text.
+
+    Skips occurrences of 'old' within exceptions; e.g., within nowiki tags or
+    HTML comments. If caseInsensitive is true, then use case-insensitive
+    regex matching. If allowoverlap is true, overlapping occurrences are all
+    replaced (watch out when using this, it might lead to infinite loops!).
+
+    Parameters:
+        text            - a unicode string
+        old             - a compiled regular expression
+        new             - a unicode string (which can contain regular
+                          expression references), or a function which takes
+                          a match object as parameter. See parameter repl of
+                          re.sub().
+        exceptions      - a list of strings which signal what to leave out,
+                          e.g. ['math', 'table', 'template']
+        caseInsensitive - a boolean
+        marker          - a string that will be added to the last replacement;
+                          if nothing is changed, it is added at the end
+
+    """
+    if site is None:
+        site = pywikibot.getSite()
+
+    exceptionRegexes = {
+        'comment':     re.compile(r'(?s)<!--.*?-->'),
+        # section headers
+        'header':      re.compile(r'\r\n=+.+=+ *\r\n'),
+        'includeonly': re.compile(r'(?is)<includeonly>.*?</includeonly>'),
+        'math':        re.compile(r'(?is)<math>.*?</math>'),
+        'noinclude':   re.compile(r'(?is)<noinclude>.*?</noinclude>'),
+        # wiki tags are ignored inside nowiki tags.
+        'nowiki':      re.compile(r'(?is)<nowiki>.*?</nowiki>'),
+        # preformatted text
+        'pre':         re.compile(r'(?ism)<pre>.*?</pre>'),
+        'source':      re.compile(r'(?is)<source .*?</source>'),
+        # inline references
+        'ref':         re.compile(r'(?ism)<ref[ >].*?</ref>'),
+        'timeline':    re.compile(r'(?is)<timeline>.*?</timeline>'),
+        # lines that start with a space are shown in a monospace font and
+        # have whitespace preserved.
+        'startspace':  re.compile(r'(?m)^ (.*?)$'),
+        # tables often have whitespace that is used to improve wiki
+        # source code readability.
+        # TODO: handle nested tables.
+        'table':       re.compile(r'(?ims)^{\|.*?^\|}|<table>.*?</table>'),
+        # templates with parameters often have whitespace that is used to
+        # improve wiki source code readability.
+        # 'template':    re.compile(r'(?s){{.*?}}'),
+        # The regex above fails on nested templates. This regex can handle
+        # templates cascaded up to level 3, but no deeper. For arbitrary
+        # depth, we'd need recursion, which can't be done in Python's re.
+        # After all, the language of correct parenthesis words is not regular.
+        'template':    re.compile(r'(?s){{(({{(({{.*?}})|.)*}})|.)*}}'),
+        'hyperlink':   compileLinkR(),
+        'gallery':     re.compile(r'(?is)<gallery.*?>.*?</gallery>'),
+        # this matches internal wikilinks, but also interwiki, categories,
+        # and images.
+        'link':        re.compile(r'\[\[[^\]\|]*(\|[^\]]*)?\]\]'),
+        'interwiki':   re.compile(r'(?i)\[\[(%s)\s?:[^\]]*\]\][\s]*'
+                                  % '|'.join(site.validLanguageLinks()
+                                             + site.family.obsolete.keys())),
+    }
+
+    # if we got a string, compile it as a regular expression
+    if type(old) is str or type(old) is unicode:
+        if caseInsensitive:
+            old = re.compile(old, re.IGNORECASE | re.UNICODE)
+        else:
+            old = re.compile(old)
+
+    dontTouchRegexes = []
+    for exc in exceptions:
+        if isinstance(exc, str) or isinstance(exc, unicode):
+            # assume it's a reference to the exceptionRegexes dictionary
+            # defined above.
+            if not exceptionRegexes.has_key(exc):
+                raise ValueError("Unknown tag type: " + exc)
+            dontTouchRegexes.append(exceptionRegexes[exc])
+        else:
+            # assume it's a regular expression
+            dontTouchRegexes.append(exc)
+    index = 0
+    markerpos = len(text)
+    while True:
+        match = old.search(text, index)
+        if not match:
+            # nothing left to replace
+            break
+
+        # check which exception will occur next.
+        nextExceptionMatch = None
+        for dontTouchR in dontTouchRegexes:
+            excMatch = dontTouchR.search(text, index)
+            if excMatch and (nextExceptionMatch is None or
+                             excMatch.start() < nextExceptionMatch.start()):
+                nextExceptionMatch = excMatch
+
+        if nextExceptionMatch is not None \
+                and nextExceptionMatch.start() <= match.start():
+            # an HTML comment or text in nowiki tags stands before the next
+            # valid match. Skip.
+            index = nextExceptionMatch.end()
+        else:
+            # We found a valid match. Replace it.
+            if callable(new):
+                # the parameter new can be a function which takes the match
+                # as a parameter.
+                replacement = new(match)
+            else:
+                # it is not a function, but a string.
+
+                # it is a little hack to make \n work. It would be better
+                # to fix it previously, but better than nothing.
+                new = new.replace('\\n', '\n')
+
+                # We cannot just insert the new string, as it may contain
+                # regex group references such as \2 or \g<name>.
+                # On the other hand, this approach does not work because it
+                # can't handle lookahead or lookbehind (see bug #1731008):
+                #replacement = old.sub(new, text[match.start():match.end()])
+                #text = text[:match.start()] + replacement + text[match.end():]
+
+                # So we have to process the group references manually.
+                replacement = new
+
+                groupR = re.compile(r'\\(?P<number>\d+)|\\g<(?P<name>.+?)>')
+                while True:
+                    groupMatch = groupR.search(replacement)
+                    if not groupMatch:
+                        break
+                    groupID = groupMatch.group('name') \
+                              or int(groupMatch.group('number'))
+                    replacement = replacement[:groupMatch.start()] \
+                                  + match.group(groupID) \
+                                  + replacement[groupMatch.end():]
+            text = text[:match.start()] + replacement + text[match.end():]
+
+            # continue the search on the remaining text
+            if allowoverlap:
+                index = match.start() + 1
+            else:
+                index = match.start() + len(replacement)
+            markerpos = match.start() + len(replacement)
+    text = text[:markerpos] + marker + text[markerpos:]
+    return text
+
+
+def removeDisabledParts(text, tags = ['*']):
+    """
+    Return text without portions where wiki markup is disabled
+
+    Parts that can/will be removed are --
+    * HTML comments
+    * nowiki tags
+    * pre tags
+    * includeonly tags
+
+    The exact set of parts which should be removed can be passed as the
+    'parts' parameter, which defaults to all.
+    """
+    regexes = {
+            'comments' :   r'<!--.*?-->',
+            'includeonly': r'<includeonly>.*?</includeonly>',
+            'nowiki':      r'<nowiki>.*?</nowiki>',
+            'pre':         r'<pre>.*?</pre>',
+            'source':      r'<source .*?</source>',
+    }
+    if '*' in tags:
+        tags = regexes.keys()
+    toRemoveR = re.compile('|'.join([regexes[tag] for tag in tags]),
+                           re.IGNORECASE | re.DOTALL)
+    return toRemoveR.sub('', text)
+
+
+def isDisabled(text, index, tags = ['*']):
+    """
+    Return True if text[index] is disabled, e.g. by a comment or nowiki tags.
+
+    For the tags parameter, see removeDisabledParts() above.
+    """
+    # Find a marker that is not already in the text.
+    marker = '@@'
+    while marker in text:
+        marker += '@'
+    text = text[:index] + marker + text[index:]
+    text = removeDisabledParts(text, tags)
+    return (marker not in text)
+
+
+# Functions dealing with interwiki language links
+
+# Note - MediaWiki supports two kinds of interwiki links, interlanguage and
+#        interproject. These functions only deal with links to a
+#        corresponding page in another language on the same project (e.g.,
+#        Wikipedia, Wiktionary, etc.). They do not find or change links to
+#        a different project, or any that are formatted as in-line interwiki
+#        links (e.g., "[[:es:Articulo]]"). (CONFIRM)
+
+def getLanguageLinks(text, insite = None, pageLink = "[[]]"):
+    """
+    Return a dict of interlanguage links found in text.
+
+    Dict uses language codes as keys and Page objects as values.
+    Do not call this routine directly, use Page.interwiki() method
+    instead.
+
+    """
+    if insite == None:
+        insite = pywikibot.getSite()
+    result = {}
+    # Ignore interwiki links within nowiki tags, includeonly tags, pre tags,
+    # and HTML comments
+    text = removeDisabledParts(text)
+
+    # This regular expression will find every link that is possibly an
+    # interwiki link.
+    # NOTE: language codes are case-insensitive and only consist of basic
+    # latin letters and hyphens.
+    interwikiR = re.compile(r'\[\[([a-zA-Z\-]+)\s?:([^\[\]\n]*)\]\]')
+    for lang, pagetitle in interwikiR.findall(text):
+        lang = lang.lower()
+        # Check if it really is in fact an interwiki link to a known
+        # language, or if it's e.g. a category tag or an internal link
+        if lang in insite.family.obsolete:
+            lang = insite.family.obsolete[lang]
+        if lang in insite.validLanguageLinks():
+            if '|' in pagetitle:
+                # ignore text after the pipe
+                pagetitle = pagetitle[:pagetitle.index('|')]
+            # we want the actual page objects rather than the titles
+            site = insite.getSite(code = lang)
+            try:
+                result[site] = pywikibot.Page(site, pagetitle, insite = insite)
+            except InvalidTitle:
+                output(
+                    u"[getLanguageLinks] Text contains invalid interwiki link [[%s:%s]]."
+                    % (lang, pagetitle))
+                continue
+    return result
+
+
+def removeLanguageLinks(text, site = None, marker = ''):
+    """Return text with all interlanguage links removed.
+
+    If a link to an unknown language is encountered, a warning is printed.
+    If a marker is defined, that string is placed at the location of the
+    last occurrence of an interwiki link (at the end if there are no
+    interwiki links).
+
+    """
+    if site == None:
+        site = pywikibot.getSite()
+    if not site.validLanguageLinks():
+        return text
+    # This regular expression will find every interwiki link, plus trailing
+    # whitespace.
+    languages = '|'.join(site.validLanguageLinks()
+                         + site.family.obsolete.keys())
+    interwikiR = re.compile(r'\[\[(%s)\s?:[^\]]*\]\][\s]*'
+                            % languages, re.IGNORECASE)
+    text = replaceExcept(text, interwikiR, '',
+                         ['nowiki', 'comment', 'math', 'pre', 'source'],
+                         marker=marker)
+    return text.strip()
+
+
+def replaceLanguageLinks(oldtext, new, site = None):
+    """Replace interlanguage links in the text with a new set of links.
+
+    'new' should be a dict with the Site objects as keys, and Page objects
+    as values (i.e., just like the dict returned by getLanguageLinks
+    function).
+
+    """
+    # Find a marker that is not already in the text.
+    marker = '@@'
+    while marker in oldtext:
+        marker += '@'
+    if site == None:
+        site = pywikibot.getSite()
+    s = interwikiFormat(new, insite = site)
+    s2 = removeLanguageLinks(oldtext, site = site, marker = marker)
+    if s:
+        if site.language() in site.family.interwiki_attop:
+            newtext = s + site.family.interwiki_text_separator \
+                      + s2.replace(marker, '').strip()
+        else:
+            # calculate what was after the language links on the page
+            firstafter = s2.find(marker) + len(marker)
+            # Is there any text in the 'after' part that means we should
+            # keep it after?
+            if "</noinclude>" in s2[firstafter:]:
+                newtext = s2[:firstafter] + s + s2[firstafter:]
+            elif site.language() in site.family.categories_last:
+                cats = getCategoryLinks(s2, site = site)
+                s2 = removeCategoryLinks(s2.replace(marker, '').strip(),
+                                         site) \
+                     + site.family.interwiki_text_separator + s
+                newtext = replaceCategoryLinks(s2, cats, site=site)
+            else:
+                newtext = s2.replace(marker, '').strip() \
+                          + site.family.interwiki_text_separator + s
+        newtext = newtext.replace(marker, '')
+    else:
+        newtext = s2.replace(marker, '')
+    return newtext
+
+
+def interwikiFormat(links, insite = None):
+    """Convert interwiki link dict into a wikitext string.
+
+    'links' should be a dict with the Site objects as keys, and Page
+    objects as values.
+
+    Return a unicode string that is formatted for inclusion in insite
+    (defaulting to the current site).
+
+    """
+    if insite is None:
+        insite = pywikibot.getSite()
+    if not links:
+        return ''
+
+    ar = interwikiSort(links.keys(), insite)
+    s = []
+    for site in ar:
+        try:
+            link = links[site].aslink(forceInterwiki=True)
+            s.append(link)
+        except AttributeError:
+            s.append(pywikibot.getSite(site).linkto(links[site],
+                                                    othersite=insite))
+    if insite.lang in insite.family.interwiki_on_one_line:
+        sep = u' '
+    else:
+        sep = u'\r\n'
+    s = sep.join(s) + u'\r\n'
+    return s
+
+
+# Sort sites according to local interwiki sort logic
+def interwikiSort(sites, insite = None):
+    if insite is None:
+        insite = pywikibot.getSite()
+    if not sites:
+        return []
+
+    sites.sort()
+    putfirst = insite.interwiki_putfirst()
+    if putfirst:
+        #In this case I might have to change the order
+        firstsites = []
+        for code in putfirst:
+            # The code may not exist in this family?
+            if code in insite.family.obsolete:
+                code = insite.family.obsolete[code]
+            if code in insite.validLanguageLinks():
+                site = insite.getSite(code = code)
+                if site in sites:
+                    del sites[sites.index(site)]
+                    firstsites = firstsites + [site]
+        sites = firstsites + sites
+    if insite.interwiki_putfirst_doubled(sites):
+        # some implementations return False
+        sites = insite.interwiki_putfirst_doubled(sites) + sites
+    return sites
+
+
+# Functions dealing with category links
+
+def getCategoryLinks(text, site):
+    """Return a list of category links found in text.
+
+    List contains Category objects.
+    Do not call this routine directly, use Page.categories() instead.
+
+    """
+    result = []
+    # Ignore category links within nowiki tags, pre tags, includeonly tags,
+    # and HTML comments
+    text = removeDisabledParts(text)
+    catNamespace = '|'.join(site.category_namespaces())
+    R = re.compile(r'\[\[\s*(?P<namespace>%s)\s*:\s*(?P<catName>.+?)'
+                   r'(?:\|(?P<sortKey>.+?))?\s*\]\]'
+                   % catNamespace, re.I)
+    for match in R.finditer(text):
+        cat = pywikibot.Category(site,
+                                 '%s:%s' % (match.group('namespace'),
+                                            match.group('catName')),
+                                 sortKey = match.group('sortKey'))
+        result.append(cat)
+    return result
+
+
+def removeCategoryLinks(text, site, marker = ''):
+    """Return text with all category links removed.
+
+    Put the string marker after the last replacement (at the end of the text
+    if there is no replacement).
+
+    """
+    # This regular expression will find every link that is possibly an
+    # interwiki link, plus trailing whitespace. The language code is grouped.
+    # NOTE: This assumes that language codes only consist of non-capital
+    # ASCII letters and hyphens.
+    catNamespace = '|'.join(site.category_namespaces())
+    categoryR = re.compile(r'\[\[\s*(%s)\s*:.*?\]\]\s*' % catNamespace, re.I)
+    text = replaceExcept(text, categoryR, '',
+                         ['nowiki', 'comment', 'math', 'pre', 'source'],
+                         marker = marker)
+    if marker:
+        # avoid having multiple linefeeds at the end of the text
+        text = re.sub('\s*%s' % re.escape(marker), '\r\n' + marker,
+                      text.strip())
+    return text.strip()
+
+
+def replaceCategoryInPlace(oldtext, oldcat, newcat, site=None):
+    """Replace the category oldcat with the category newcat and return
+    the modified text.
+
+    """
+    if site is None:
+        site = pywikibot.getSite()
+
+    catNamespace = '|'.join(site.category_namespaces())
+    title = oldcat.titleWithoutNamespace()
+    if not title:
+        return
+    # title might contain regex special characters
+    title = re.escape(title)
+    # title might not be capitalized correctly on the wiki
+    if title[0].isalpha() and not site.nocapitalize:
+        title = "[%s%s]" % (title[0].upper(), title[0].lower()) + title[1:]
+    # spaces and underscores in page titles are interchangeable and
+    # collapsible
+    title = title.replace(r"\ ", "[ _]+").replace(r"\_", "[ _]+")
+    categoryR = re.compile(r'\[\[\s*(%s)\s*:\s*%s\s*((?:\|[^]]+)?\]\])'
+                           % (catNamespace, title), re.I)
+    if newcat is None:
+        text = replaceExcept(oldtext, categoryR, '',
+                             ['nowiki', 'comment', 'math', 'pre', 'source'])
+    else:
+        text = replaceExcept(oldtext, categoryR,
+                             '[[%s:%s\\2' % (site.namespace(14),
+                                             newcat.titleWithoutNamespace()),
+                             ['nowiki', 'comment', 'math', 'pre', 'source'])
+    return text
+
+
+def replaceCategoryLinks(oldtext, new, site = None, addOnly = False):
+    """Replace the category links given in the wikitext given
+    in oldtext by the new links given in new.
+
+    'new' should be a list of Category objects.
+
+    If addOnly is True, the old category won't be deleted and the
+    category(s) given will be added (and so they won't replace anything).
+
+    """
+    # Find a marker that is not already in the text.
+    marker = '@@'
+    while marker in oldtext:
+        marker += '@'
+
+    if site is None:
+        site = pywikibot.getSite()
+    if site.sitename() == 'wikipedia:de' and "{{Personendaten" in oldtext:
+        raise Error('The PyWikipediaBot is no longer allowed to touch categories on the German Wikipedia on pages that contain the person data template because of the non-standard placement of that template. See http://de.wikipedia.org/wiki/Hilfe_Diskussion:Personendaten/Archiv/bis_2006…')
+
+    s = categoryFormat(new, insite = site)
+    if addOnly:
+        s2 = oldtext
+    else:
+        s2 = removeCategoryLinks(oldtext, site = site, marker = marker)
+
+    if s:
+        if site.language() in site.family.category_attop:
+            newtext = s + site.family.category_text_separator + s2
+        else:
+            # calculate what was after the categories links on the page
+            firstafter = s2.find(marker)
+            # Is there any text in the 'after' part that means we should
+            # keep it after?
+            if "</noinclude>" in s2[firstafter:]:
+                newtext = s2[:firstafter] + s + s2[firstafter:]
+            elif site.language() in site.family.categories_last:
+                newtext = s2.replace(marker, '').strip() \
+                          + site.family.category_text_separator + s
+            else:
+                interwiki = getLanguageLinks(s2)
+                s2 = removeLanguageLinks(s2.replace(marker, ''), site) \
+                     + site.family.category_text_separator + s
+                newtext = replaceLanguageLinks(s2, interwiki, site)
+        newtext = newtext.replace(marker, '')
+    else:
+        s2 = s2.replace(marker, '')
+        return s2
+    return newtext.strip()
+
+
+def categoryFormat(categories, insite = None):
+    """Return a string containing links to all categories in a list.
+
+    'categories' should be a list of Category objects.
+
+    The string is formatted for inclusion in insite.
+
+    """
+    if not categories:
+        return ''
+    if insite is None:
+        insite = pywikibot.getSite()
+    catLinks = [category.aslink(noInterwiki = True) for category in categories]
+    if insite.category_on_one_line():
+        sep = ' '
+    else:
+        sep = '\r\n'
+    # Some people don't like the categories sorted
+    #catLinks.sort()
+    return sep.join(catLinks) + '\r\n'
+
+
+def compileLinkR(withoutBracketed=False, onlyBracketed=False):
+    """Return a regex that matches external links."""
+    # RFC 2396 says that URLs may only contain certain characters.
+    # For this regex we also accept non-allowed characters, so that the bot
+    # will later show these links as broken ('Non-ASCII Characters in URL').
+    # Note: While allowing parentheses inside URLs, MediaWiki will regard
+    # a right parenthesis at the end of the URL as not part of that URL.
+    # The same applies to dot, comma, colon and some other characters.
+    notAtEnd = '\]\s\)\.:;,<>"'
+    # So characters inside the URL can be anything except whitespace,
+    # closing squared brackets, quotation marks, greater than and less
+    # than, and the last character also can't be parenthesis or another
+    # character disallowed by MediaWiki.
+    notInside = '\]\s<>"'
+    # The first half of this regular expression is required because '' is
+    # not allowed inside links. For example, in this wiki text:
+    #       ''Please see http://www.example.org.''
+    # .'' shouldn't be considered as part of the link.
+    regex = r'(?P<url>http[s]?://[^' + notInside + ']*?[^' + notAtEnd \
+            + '](?=[' + notAtEnd + ']*\'\')|http[s]?://[^' + notInside \
+            + ']*[^' + notAtEnd + '])'
+
+    if withoutBracketed:
+        regex = r'(?<!\[)' + regex
+    elif onlyBracketed:
+        regex = r'\[' + regex
+    linkR = re.compile(regex)
+    return linkR
+
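For orientation, a hypothetical usage sketch of the new module (the text values are made up):

    import textlib
    # replace 'colour' with 'color' everywhere except inside nowiki tags,
    # comments and templates
    newtext = textlib.replaceExcept(oldtext, u'colour', u'color',
                                    ['nowiki', 'comment', 'template'])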
Revision: 6090
Author: erwin85
Date: 2008-11-12 18:51:10 +0000 (Wed, 12 Nov 2008)
Log Message:
-----------
editTime and startTime aren't defined through _getAll if the page doesn't exist. If so, set them to the current time.
Modified Paths:
--------------
trunk/pywikipedia/wikipedia.py
Property Changed:
----------------
trunk/pywikipedia/commons_category_redirect.py
Property changes on: trunk/pywikipedia/commons_category_redirect.py
___________________________________________________________________
Added: svn:mergeinfo
+
Modified: trunk/pywikipedia/wikipedia.py
===================================================================
--- trunk/pywikipedia/wikipedia.py 2008-11-12 17:39:18 UTC (rev 6089)
+++ trunk/pywikipedia/wikipedia.py 2008-11-12 18:51:10 UTC (rev 6090)
@@ -461,7 +461,8 @@
         self._permalink = None
         self._userName = None
         self._ipedit = None
-        self._editTime = None
+        self._editTime = '0'
+        self._startTime = '0'
         # For the Flagged Revisions MediaWiki extension
         self._revisionId = None
         self._deletedRevs = None
@@ -1416,8 +1417,14 @@
         # <s>Except if the page is new, we need to supply the time of the
         # previous version to the wiki to prevent edit collisions</s>
         # As of Oct 2008, these must be filled also for new pages
-        predata['wpEdittime'] = self._editTime
-        predata['wpStarttime'] = self._startTime
+        if self._editTime:
+            predata['wpEdittime'] = self._editTime
+        else:
+            predata['wpEdittime'] = time.strftime('%Y%m%d%H%M%S', time.gmtime())
+        if self._startTime:
+            predata['wpStarttime'] = self._startTime
+        else:
+            predata['wpStarttime'] = time.strftime('%Y%m%d%H%M%S', time.gmtime())
         if self._revisionId:
             predata['baseRevId'] = self._revisionId
         # Pass the minorEdit and watchArticle arguments to the Wiki.
@@ -1527,8 +1534,14 @@
                 # without any reason!
                 # raise EditConflict(u'Someone deleted the page.')
                 # No raise, simply define these variables and retry:
-                predata['wpEdittime'] = self._editTime
-                predata['wpStarttime'] = self._startTime
+                if self._editTime:
+                    predata['wpEdittime'] = self._editTime
+                else:
+                    predata['wpEdittime'] = time.strftime('%Y%m%d%H%M%S', time.gmtime())
+                if self._startTime:
+                    predata['wpStarttime'] = self._startTime
+                else:
+                    predata['wpStarttime'] = time.strftime('%Y%m%d%H%M%S', time.gmtime())
                 continue
             if self.site().has_mediawiki_message("viewsource")\
                and self.site().mediawiki_message('viewsource') in data:
Bugs item #2269688, was opened at 2008-11-12 15:03
Message generated for change (Settings changed) made by nicdumz
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=2269688&group_…
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
>Status: Pending
>Resolution: Fixed
Priority: 5
Private: No
Submitted By: Yann Forget (yannforget)
Assigned to: Nobody/Anonymous (nobody)
Summary: Unicode error with djvutext.py
Initial Comment:
on fr.wikisource:
python djvutext.py -index:Livre:Le_Th%C3%A9%C3%A2tre_de_la_R%C3%A9volution._Le_Quatorze_Juillet._Danton._Les_Loups.djvu -djvu:Le_quatorze_juillet_Danton_Les_loups.djvu -pages:375
Checked for running processes. 1 processes currently running, including the current process.
Traceback (most recent call last):
File "djvutext.py", line 249, in <module>
main()
File "djvutext.py", line 236, in main
wikipedia.output("uploading text from %s to %s" % (djvu, index_page) )
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13: ordinal not in range(128)
python version.py
Pywikipedia [http] trunk/pywikipedia (r6084, Nov 11 2008, 21:51:31)
Python 2.5.2 (r252, Sep 13 2008, 22:55:01)
[GCC 4.1.2 (Gentoo 4.1.2 p1.1)]
----------------------------------------------------------------------
Comment By: NicDumZ — Nicolas Dumazet (nicdumz)
Date: 2008-11-12 18:41
Message:
Wolf!!
This display error should be fixed by r6089. Please comment again on this
bug if that is not the case :)
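A sketch of the kind of fix involved (r6089 itself may differ): decode byte strings before interpolating them into a unicode format string:

    if isinstance(djvu, str):
        djvu = djvu.decode('utf-8')
    wikipedia.output(u"uploading text from %s to %s" % (djvu, index_page))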
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=2269688&group_…