Feature Requests item #1993062, was opened at 2008-06-13 16:47
Message generated for change (Comment added) made by multichill
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603141&aid=1993062&group_…
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: interwiki
Group: None
Status: Open
Priority: 7
Private: No
Submitted By: Melancholie (melancholie)
Assigned to: Nobody/Anonymous (nobody)
Summary: Use API module 'parse' for retrieving interwiki links
Initial Comment:
Currently, pages are retrieved in a batch using Special:Export.
Although this is fast (only one request is made), the method carries a huge data overhead!
Why not use the API with its 'parse' module? With it, only the interwiki links need to be fetched, reducing traffic (overhead) a lot!
See:
http://de.wikipedia.org/w/api.php?action=parse&format=xml&page=Test&prop=la…
Outputs could be downloaded in parallel to simulate a batch (faster).
----
At least make this method optional (via config.py), so data traffic can be reduced if wanted. The API is simply more efficient.
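A minimal sketch of the proposed approach (Python; the function name and lack of error handling are mine, not part of the framework):

    import urllib
    import xml.etree.ElementTree as ET

    def fetch_langlinks(page, host="de.wikipedia.org"):
        # Hypothetical sketch: fetch only the language links of one page via
        # the API 'parse' module instead of a full Special:Export dump.
        params = urllib.urlencode({'action': 'parse', 'format': 'xml',
                                   'page': page.encode('utf-8'),
                                   'prop': 'langlinks'})
        data = urllib.urlopen("http://%s/w/api.php?%s" % (host, params)).read()
        root = ET.fromstring(data)
        # each <ll lang="xx">Title</ll> element is one interlanguage link
        return [(ll.get('lang'), ll.text) for ll in root.findall('.//ll')]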
----------------------------------------------------------------------
>Comment By: Multichill (multichill)
Date: 2008-11-13 12:46
Message:
We are working on a rewrite. The rewrite uses the API as much as possible.
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-06-15 01:27
Message:
Logged In: YES
user_id=2089773
Originator: YES
See http://meta.wikimedia.org/wiki/Interwiki_bot_access_protocol
concerning disambiguations and redirects:
http://de.wikipedia.org/w/api.php?action=parse&format=xml&text={{:Main_Page…
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-06-14 16:38
Message:
Logged In: YES
user_id=2089773
Originator: YES
Backwards compatibility?
That's no reason not to make the software more efficient where possible
;-)
That's also why I wrote something about making it "optional".
Since current MediaWiki wikis offer a much more efficient way of
retrieving (only) certain contents (langlinks, categories), there should
be a way to take advantage of it! It would reduce load (the bot owner's
and the server's)...
----------------------------------------------------------------------
Comment By: Bryan (btongminh)
Date: 2008-06-13 20:44
Message:
Logged In: YES
user_id=1806226
Originator: NO
Backwards compatibility with non-Wikimedia wikis?
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-06-13 17:20
Message:
Logged In: YES
user_id=2089773
Originator: YES
To keep this scheme from being misused to confuse bots, the yet-to-be-created
MediaWiki message could contain [[foreigncode:{{CURRENTTIMESTAMP}}]] (cache issue?).
(Sorry for spamming this request ;-)
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-06-13 17:08
Message:
Logged In: YES
user_id=2089773
Originator: YES
Important note for getting pages' interwikis in a batch:
http://de.wikipedia.org/w/api.php?action=parse&text={{:Test}}{{:Bot}}{{:Hau…
Either the bot could then figure out which interwikis belong together, or
maybe a marker could be placed in between:
http://de.wikipedia.org/w/api.php?action=parse&text={{:Test}}{{MediaWiki:Iw…
[[MediaWiki:Iwmarker]] (or 'Llmarker'?) would have to be set up by the
MediaWiki developers with [[en:/de:Abuse-save-mark]] as content (but this
is potentially open to misuse).
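To illustrate the idea, post-processing could split the combined langlinks of one batched 'parse' call on the (hypothetical) marker discussed above:

    MARKER = ('en', 'Abuse-save-mark')  # assumed content of MediaWiki:Iwmarker

    def split_on_marker(langlinks):
        # Hypothetical sketch: each group between markers belongs to one of
        # the pages transcluded into the batched 'parse' request.
        groups, current = [], []
        for lang, title in langlinks:
            if (lang, title) == MARKER:
                groups.append(current)
                current = []
            else:
                current.append((lang, title))
        groups.append(current)
        return groups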
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-06-13 16:51
Message:
Logged In: YES
user_id=2089773
Originator: YES
Note: Maybe combine it with 'generator'.
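For illustration, the 'query' module already combines prop=langlinks with a generator (example URL, not from this request):
http://de.wikipedia.org/w/api.php?action=query&format=xml&generator=allpages&gaplimit=5&prop=langlinks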
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603141&aid=1993062&group_…
Feature Requests item #2255146, was opened at 2008-11-10 12:44
Message generated for change (Comment added) made by multichill
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603141&aid=2255146&group_…
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
>Status: Closed
Priority: 5
Private: No
Submitted By: Nobody/Anonymous (nobody)
>Assigned to: Multichill (multichill)
Summary: Page moves should not suppress redirects by default
Initial Comment:
When moving pages with Page.move(), a redirect should be created by default. Users with the 'suppressredirect' right (e.g. bots on Wikimedia projects) should be able to suppress redirects. I propose adding a parameter leaveRedirect=True (or possibly suppressRedirect=False) to the function.
----------------------------------------------------------------------
>Comment By: Multichill (multichill)
Date: 2008-11-13 12:44
Message:
In svn revision 6084 I implemented leaveRedirect; by default it is true.
     def move(self, newtitle, reason=None, movetalkpage=True, sysop=False,
-             throttle=True, deleteAndMove=False, safe=True,
-             fixredirects=True):
+             throttle=True, deleteAndMove=False, safe=True,
+             fixredirects=True, leaveRedirect=True):
         """Move this page to new title given by newtitle. If safe, don't try
         to move and delete if not directly requested.
@@ -2226,6 +2226,10 @@
             predata['wpFixRedirects'] = '1'
         else:
             predata['wpFixRedirects'] = '0'
+        if leaveRedirect:
+            predata['wpLeaveRedirect'] = '1'
+        else:
+            predata['wpLeaveRedirect'] = '0'
         if token:
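For illustration, a hypothetical call using the new parameter (titles and reason are made up):

    page = wikipedia.Page(wikipedia.getSite(), u'Old title')
    page.move(u'New title', reason=u'example')                       # keeps a redirect (default)
    page.move(u'New title', reason=u'example', leaveRedirect=False)  # suppresses the redirect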
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603141&aid=2255146&group_…
Feature Requests item #2269013, was opened at 2008-11-12 12:42
Message generated for change (Comment added) made by multichill
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603141&aid=2269013&group_…
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
>Status: Closed
Priority: 5
Private: No
Submitted By: Yann Forget (yannforget)
Assigned to: Nobody/Anonymous (nobody)
Summary: Add options -cat and -file to fixing_redirects.py
Initial Comment:
Hello,
Please add the options -cat (changing all pages in a category) and -file (changing all pages listed in a file) to fixing_redirects.py
Thanks, Yann
----------------------------------------------------------------------
>Comment By: Multichill (multichill)
Date: 2008-11-13 12:39
Message:
As discussed on IRC: this is already implemented. fixing_redirects.py uses
the pagegenerators module, so all page generator options (including -cat
and -file) are available.
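For example (standard pagegenerators syntax; the category and file names are placeholders):

    python fixing_redirects.py -cat:Example_category
    python fixing_redirects.py -file:titles.txt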
----------------------------------------------------------------------
Comment By: Yann Forget (yannforget)
Date: 2008-11-12 12:44
Message:
and also to movepages.py
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603141&aid=2269013&group_…
Bugs item #2011362, was opened at 2008-07-05 20:07
Message generated for change (Comment added) made by melancholie
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=2011362&group_…
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Melancholie (melancholie)
Assigned to: Nobody/Anonymous (nobody)
Summary: Update featured.py (patch)
Initial Comment:
Both hiwiki and yiwiki use Template:Link_FA; the pages hi:Template:Lien AdQ and yi:Template:רא are only redirects.
By re-adding 'Link FA' to the langs arrays, the bot will use Template:Link_FA (instead of those redirects).
----------------------------------------------------------------------
>Comment By: Melancholie (melancholie)
Date: 2008-11-13 12:02
Message:
update
File Added: featured.diff
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-09-07 09:22
Message:
Logged In: YES
user_id=2089773
Originator: YES
syncing
File Added: featured.diff
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-08-28 09:20
Message:
Logged In: YES
user_id=2089773
Originator: YES
Added 'cy', see
http://als.wikipedia.org/wiki/Benutzer_Diskussion:Melancholie#via_roboto_.2…
File Added: featured.diff
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-08-18 23:45
Message:
Logged In: YES
user_id=2089773
Originator: YES
File Added: featured.diff
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-08-18 23:44
Message:
Logged In: YES
user_id=2089773
Originator: YES
Added szl category
File Added: featured.diff
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-08-05 13:50
Message:
Logged In: YES
user_id=2089773
Originator: YES
added patch (working current)
File Added: featured.diff
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-07-11 07:51
Message:
Logged In: YES
user_id=2089773
Originator: YES
Maybe I was a little bit too imprecise ;-)
Here is what I mean:

    templatelist = template['_default']
    try:
        templatelist += template[tosite.lang]
    ...
    + (u" {{%s|%s}}" % (templatelist[0], fromsite.lang))

templatelist[0] is 'Link FA', but the localized templates used are in
templatelist[1]. If there is a localized template (with Link_FA being only
a redirect), it should be used for the edit. So it might be good to change
the order (put the 'try:' first, then add _default with +=).
For hi and yi the easiest way might be to just comment out the localized
template names, as they are only redirects to Link_FA.
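A sketch of that reordering (variable names as quoted above; the except clause is assumed):

    try:
        templatelist = template[tosite.lang][:]   # localized template names first
    except KeyError:
        templatelist = []
    templatelist += template['_default']          # 'Link FA' as fallback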
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-07-10 11:59
Message:
Logged In: YES
user_id=2089773
Originator: YES
So {{Link FA}} is always the first choice?
Or does _default actually follow redirects?
----------------------------------------------------------------------
Comment By: NicDumZ — Nicolas Dumazet (nicdumz)
Date: 2008-07-10 10:41
Message:
Logged In: YES
user_id=1963242
Originator: NO
I'm not sure, melancholie :)
I added {{Leam VdC}} in r5707. However, since r5669 (
https://fisheye.toolserver.org/browse/pywikipedia/trunk/pywikipedia/feature…
) it uses the '_default' entry (Link FA) AND the locale entry. The list for
hi: is, for example, ['Link Fa', 'Lien AdQ'].
----------------------------------------------------------------------
Comment By: Melancholie (melancholie)
Date: 2008-07-10 10:28
Message:
Logged In: YES
user_id=2089773
Originator: YES
Furthermore, add http://fur.wikipedia.org/wiki/Model:Leam_VdC
[I wish I were able to do these things myself, but *still* no account yet]
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=2011362&group_…
Revision: 6093
Author: russblau
Date: 2008-11-12 20:32:26 +0000 (Wed, 12 Nov 2008)
Log Message:
-----------
one more generator bug
Modified Paths:
--------------
branches/rewrite/pywikibot/data/api.py
Modified: branches/rewrite/pywikibot/data/api.py
===================================================================
--- branches/rewrite/pywikibot/data/api.py 2008-11-12 19:21:49 UTC (rev 6092)
+++ branches/rewrite/pywikibot/data/api.py 2008-11-12 20:32:26 UTC (rev 6093)
@@ -317,10 +317,10 @@
         self.request = Request(**kwargs)
         self.limit = None
         if "generator" in kwargs:
-            self.resultkey = "pages"          # name of the "query"
-        else:                                 #  subelement key
-            self.resultkey = self.module      #  to look for when iterating
-        self.continuekey = self.resultkey     # usually the query-continue key
+            self.resultkey = "pages"          # name of the "query" subelement key
+        else:                                 #  to look for when iterating
+            self.resultkey = self.module
+        self.continuekey = self.module        # usually the query-continue key
                                               # is the same as the querymodule,
                                               # but not always
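For context, a sketch of why the two keys can differ (module name and continuation fields are illustrative only):

    # With generator=backlinks, results arrive under the fixed "pages" key,
    # but continuation is still keyed by the module name:
    #   {"query": {"pages": {...}},
    #    "query-continue": {"backlinks": {"gblcontinue": "..."}}}
    # Without a generator, both keys use the module name:
    #   {"query": {"backlinks": [...]},
    #    "query-continue": {"backlinks": {"blcontinue": "..."}}}
    resultkey = "pages" if "generator" in kwargs else module
    continuekey = module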
Revision: 6092
Author: russblau
Date: 2008-11-12 19:21:49 +0000 (Wed, 12 Nov 2008)
Log Message:
-----------
bugfixes in loading revision text and isEmpty()
Modified Paths:
--------------
branches/rewrite/pywikibot/__init__.py
branches/rewrite/pywikibot/page.py
branches/rewrite/pywikibot/site.py
Modified: branches/rewrite/pywikibot/__init__.py
===================================================================
--- branches/rewrite/pywikibot/__init__.py 2008-11-12 19:21:18 UTC (rev 6091)
+++ branches/rewrite/pywikibot/__init__.py 2008-11-12 19:21:49 UTC (rev 6092)
@@ -14,6 +14,7 @@
 
 from exceptions import *
 import config2
+import textlib
 
 def deprecate_arg(old_arg, new_arg):
Modified: branches/rewrite/pywikibot/page.py
===================================================================
--- branches/rewrite/pywikibot/page.py 2008-11-12 19:21:18 UTC (rev 6091)
+++ branches/rewrite/pywikibot/page.py 2008-11-12 19:21:49 UTC (rev 6092)
@@ -279,8 +279,10 @@
         """Return True if title of this Page is in the autoFormat dictionary."""
         return self.autoFormat()[0] is not None
 
-    def get(self, force=False, get_redirect=False, throttle=None,
-            sysop=False, nofollow_redirects=None, change_edit_time=None):
+    @deprecate_arg("throttle", None)
+    @deprecate_arg("nofollow_redirects", None)
+    @deprecate_arg("change_edit_time", None)
+    def get(self, force=False, get_redirect=False, sysop=False):
         """Return the wiki-text of the page.
 
         This will retrieve the page from the server if it has not been
@@ -298,17 +300,8 @@
             redirect, do not raise an exception.
         @param sysop: if the user has a sysop account, use it to retrieve
             this page
-        @param throttle: DEPRECATED and unused
-        @param nofollow_redirects: DEPRECATED and unused
-        @param change_edit_time: DEPRECATED and unused
 
         """
-        if throttle is not None:
-            logger.debug("Page.get(throttle) option is deprecated.")
-        if nofollow_redirects is not None:
-            logger.debug("Page.get(nofollow_redirects) option is deprecated.")
-        if change_edit_time is not None:
-            logger.debug("Page.get(change_edit_time) option is deprecated.")
         if force:
             # When forcing, we retry the page no matter what. Old exceptions
             # do not apply any more.
@@ -322,32 +315,27 @@
         elif hasattr(self, '_getexception'):
             raise self._getexception
         if force or not hasattr(self, "_revid") \
-                or not self._revid in self._revisions:
+                or not self._revid in self._revisions \
+                or self._revisions[self._revid].text is None:
             self.site().loadrevisions(self, getText=True, sysop=sysop)
             # TODO: Exception handling for no-page, redirects, etc.
         return self._revisions[self._revid].text
 
+    @deprecate_arg("throttle", None)
+    @deprecate_arg("nofollow_redirects", None)
+    @deprecate_arg("change_edit_time", None)
     def getOldVersion(self, oldid, force=False, get_redirect=False,
-                      throttle=None, sysop=False, nofollow_redirects=None,
-                      change_edit_time=None):
+                      sysop=False):
         """Return text of an old revision of this page; same options as get().
 
         @param oldid: The revid of the revision desired.
 
         """
-        if throttle is not None:
-            logger.debug(
-                "Page.getOldVersion(throttle) option is deprecated.")
-        if nofollow_redirects is not None:
-            logger.debug(
-                "Page.getOldVersion(nofollow_redirects) option is deprecated.")
-        if change_edit_time is not None:
-            logger.debug(
-                "Page.getOldVersion(change_edit_time) option is deprecated.")
-        if force or not oldid in self._revisions:
-            self.site().loadrevisions(self, getText=True, ids=oldid,
-                                      sysop=sysop)
+        if force or not oldid in self._revisions \
+                or self._revisions[oldid].text is None:
+            self.site().loadrevisions(self, getText=True, revids=oldid,
+                                      sysop=sysop)
         # TODO: what about redirects, errors?
         return self._revisions[oldid].text
 
@@ -368,7 +356,7 @@
 
     def _textgetter(self):
         """Return the current (edited) wikitext, loading it if necessary."""
-        if not hasattr(self, '_text'):
+        if not hasattr(self, '_text') or self._text is None:
             try:
                 self._text = self.get()
             except pywikibot.NoPage:
@@ -427,8 +415,8 @@
         """
         txt = self.get()
-        txt = pywikibot.removeLanguageLinks(txt, site = self.site())
-        txt = pywikibot.removeCategoryLinks(txt, site = self.site())
+        txt = pywikibot.textlib.removeLanguageLinks(txt, site = self.site())
+        txt = pywikibot.textlib.removeCategoryLinks(txt, site = self.site())
         if len(txt) < 4:
             return True
         else:
@@ -443,7 +431,9 @@
         """Return other member of the article-talk page pair for this Page.
 
         If self is a talk page, returns the associated content page;
-        otherwise, returns the associated talk page.
+        otherwise, returns the associated talk page. The returned page need
+        not actually exist on the wiki.
+
         Returns None if self is a special page.
 
         """
@@ -824,12 +814,13 @@
                                      limit=limit)
         if getAll:
             revCount = len(self._revisions)
-        return [(self._revisions[rev].id,
-                 self._revisions[rev].timestamp,
-                 self._revisions[rev].user,
-                 self._revisions[rev].comment)
-                for rev in sorted(self._revisions.keys(),
-                                  reverse=not reverseOrder)[ : revCount]
+        return [ ( self._revisions[rev].revid,
+                   self._revisions[rev].timestamp,
+                   self._revisions[rev].user,
+                   self._revisions[rev].comment
+                 ) for rev in sorted(self._revisions.keys(),
+                                     reverse=not reverseOrder)[ : revCount]
+               ]
 
     def getVersionHistoryTable(self, forceReload=False, reverseOrder=False,
                                getAll=False, revCount=500):
Modified: branches/rewrite/pywikibot/site.py
===================================================================
--- branches/rewrite/pywikibot/site.py 2008-11-12 19:21:18 UTC (rev 6091)
+++ branches/rewrite/pywikibot/site.py 2008-11-12 19:21:49 UTC (rev 6092)
@@ -1293,7 +1293,10 @@
             rvgen = api.PropertyGenerator(u"info|revisions", titles=rvtitle,
                                           site=self)
         else:
-            ids = u"|".join(unicode(r) for r in revids)
+            if isinstance(revids, (int, basestring)):
+                ids = unicode(revids)
+            else:
+                ids = u"|".join(unicode(r) for r in revids)
             rvgen = api.PropertyGenerator(u"info|revisions", revids=ids,
                                           site=self)
         if getText:
@@ -1301,7 +1304,7 @@
                 u"ids|flags|timestamp|user|comment|content"
             if section is not None:
                 rvgen.request[u"rvsection"] = unicode(section)
-        if latest:
+        if latest or "revids" in rvgen.request:
             rvgen.limit = -1  # suppress use of rvlimit parameter
         elif isinstance(limit, int):
             rvgen.limit = limit
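Illustrative calls accepted by loadrevisions after this change (the page and site objects are assumed to exist):

    site.loadrevisions(page, getText=True, revids=6084)          # single int
    site.loadrevisions(page, getText=True, revids=u'6084')       # single string
    site.loadrevisions(page, getText=True, revids=[6084, 6085])  # sequence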
Revision: 6091
Author: russblau
Date: 2008-11-12 19:21:18 +0000 (Wed, 12 Nov 2008)
Log Message:
-----------
textlib.py contains functions for altering wiki-text (all functions copied-and-pasted from wikipedia.py)
Added Paths:
-----------
branches/rewrite/pywikibot/textlib.py
Added: branches/rewrite/pywikibot/textlib.py
===================================================================
--- branches/rewrite/pywikibot/textlib.py (rev 0)
+++ branches/rewrite/pywikibot/textlib.py 2008-11-12 19:21:18 UTC (rev 6091)
@@ -0,0 +1,566 @@
+# -*- coding: utf-8 -*-
+"""
+Functions for manipulating wiki-text.
+
+Unless otherwise noted, all functions take a unicode string as the argument
+and return a unicode string.
+
+"""
+#
+# (C) Pywikipedia bot team, 2008
+#
+# Distributed under the terms of the MIT license.
+#
+__version__ = '$Id: $'
+
+
+import pywikibot
+import re
+
+
+def unescape(s):
+    """Replace escaped HTML-special characters by their originals"""
+    if '&' not in s:
+        return s
+    s = s.replace("&lt;", "<")
+    s = s.replace("&gt;", ">")
+    s = s.replace("&apos;", "'")
+    s = s.replace("&quot;", '"')
+    s = s.replace("&amp;", "&") # Must be last
+    return s
+
+
+def replaceExcept(text, old, new, exceptions, caseInsensitive=False,
+                  allowoverlap=False, marker = '', site = None):
+    """
+    Return text with 'old' replaced by 'new', ignoring specified types of text.
+
+    Skips occurrences of 'old' within exceptions; e.g., within nowiki tags or
+    HTML comments. If caseInsensitive is true, then use case-insensitive
+    regex matching. If allowoverlap is true, overlapping occurrences are all
+    replaced (watch out when using this, it might lead to infinite loops!).
+
+    Parameters:
+        text            - a unicode string
+        old             - a compiled regular expression
+        new             - a unicode string (which can contain regular
+                          expression references), or a function which takes
+                          a match object as parameter. See parameter repl of
+                          re.sub().
+        exceptions      - a list of strings which signal what to leave out,
+                          e.g. ['math', 'table', 'template']
+        caseInsensitive - a boolean
+        marker          - a string that will be added to the last replacement;
+                          if nothing is changed, it is added at the end
+
+    """
+    if site is None:
+        site = pywikibot.getSite()
+
+    exceptionRegexes = {
+        'comment':     re.compile(r'(?s)<!--.*?-->'),
+        # section headers
+        'header':      re.compile(r'\r\n=+.+=+ *\r\n'),
+        'includeonly': re.compile(r'(?is)<includeonly>.*?</includeonly>'),
+        'math':        re.compile(r'(?is)<math>.*?</math>'),
+        'noinclude':   re.compile(r'(?is)<noinclude>.*?</noinclude>'),
+        # wiki tags are ignored inside nowiki tags.
+        'nowiki':      re.compile(r'(?is)<nowiki>.*?</nowiki>'),
+        # preformatted text
+        'pre':         re.compile(r'(?ism)<pre>.*?</pre>'),
+        'source':      re.compile(r'(?is)<source .*?</source>'),
+        # inline references
+        'ref':         re.compile(r'(?ism)<ref[ >].*?</ref>'),
+        'timeline':    re.compile(r'(?is)<timeline>.*?</timeline>'),
+        # lines that start with a space are shown in a monospace font and
+        # have whitespace preserved.
+        'startspace':  re.compile(r'(?m)^ (.*?)$'),
+        # tables often have whitespace that is used to improve wiki
+        # source code readability.
+        # TODO: handle nested tables.
+        'table':       re.compile(r'(?ims)^{\|.*?^\|}|<table>.*?</table>'),
+        # templates with parameters often have whitespace that is used to
+        # improve wiki source code readability.
+        # 'template':    re.compile(r'(?s){{.*?}}'),
+        # The regex above fails on nested templates. This regex can handle
+        # templates cascaded up to level 3, but no deeper. For arbitrary
+        # depth, we'd need recursion, which can't be done in Python's re.
+        # After all, the language of correct parenthesis words is not regular.
+        'template':    re.compile(r'(?s){{(({{(({{.*?}})|.)*}})|.)*}}'),
+        'hyperlink':   compileLinkR(),
+        'gallery':     re.compile(r'(?is)<gallery.*?>.*?</gallery>'),
+        # this matches internal wikilinks, but also interwiki, categories,
+        # and images.
+        'link':        re.compile(r'\[\[[^\]\|]*(\|[^\]]*)?\]\]'),
+        'interwiki':   re.compile(r'(?i)\[\[(%s)\s?:[^\]]*\]\][\s]*'
+                                  % '|'.join(site.validLanguageLinks()
+                                             + site.family.obsolete.keys())),
+    }
+
+    # if we got a string, compile it as a regular expression
+    if type(old) is str or type(old) is unicode:
+        if caseInsensitive:
+            old = re.compile(old, re.IGNORECASE | re.UNICODE)
+        else:
+            old = re.compile(old)
+
+    dontTouchRegexes = []
+    for exc in exceptions:
+        if isinstance(exc, str) or isinstance(exc, unicode):
+            # assume it's a reference to the exceptionRegexes dictionary
+            # defined above.
+            if not exceptionRegexes.has_key(exc):
+                raise ValueError("Unknown tag type: " + exc)
+            dontTouchRegexes.append(exceptionRegexes[exc])
+        else:
+            # assume it's a regular expression
+            dontTouchRegexes.append(exc)
+    index = 0
+    markerpos = len(text)
+    while True:
+        match = old.search(text, index)
+        if not match:
+            # nothing left to replace
+            break
+
+        # check which exception will occur next.
+        nextExceptionMatch = None
+        for dontTouchR in dontTouchRegexes:
+            excMatch = dontTouchR.search(text, index)
+            if excMatch and (nextExceptionMatch is None or
+                             excMatch.start() < nextExceptionMatch.start()):
+                nextExceptionMatch = excMatch
+
+        if nextExceptionMatch is not None \
+                and nextExceptionMatch.start() <= match.start():
+            # an HTML comment or text in nowiki tags stands before the next
+            # valid match. Skip.
+            index = nextExceptionMatch.end()
+        else:
+            # We found a valid match. Replace it.
+            if callable(new):
+                # the parameter new can be a function which takes the match
+                # as a parameter.
+                replacement = new(match)
+            else:
+                # it is not a function, but a string.
+
+                # it is a little hack to make \n work. It would be better
+                # to fix it previously, but better than nothing.
+                new = new.replace('\\n', '\n')
+
+                # We cannot just insert the new string, as it may contain
+                # regex group references such as \2 or \g<name>.
+                # On the other hand, this approach does not work because it
+                # can't handle lookahead or lookbehind (see bug #1731008):
+                #replacement = old.sub(new, text[match.start():match.end()])
+                #text = text[:match.start()] + replacement + text[match.end():]
+
+                # So we have to process the group references manually.
+                replacement = new
+
+                groupR = re.compile(r'\\(?P<number>\d+)|\\g<(?P<name>.+?)>')
+                while True:
+                    groupMatch = groupR.search(replacement)
+                    if not groupMatch:
+                        break
+                    groupID = groupMatch.group('name') \
+                              or int(groupMatch.group('number'))
+                    replacement = replacement[:groupMatch.start()] \
+                                  + match.group(groupID) \
+                                  + replacement[groupMatch.end():]
+            text = text[:match.start()] + replacement + text[match.end():]
+
+            # continue the search on the remaining text
+            if allowoverlap:
+                index = match.start() + 1
+            else:
+                index = match.start() + len(replacement)
+            markerpos = match.start() + len(replacement)
+    text = text[:markerpos] + marker + text[markerpos:]
+    return text
+
+
+def removeDisabledParts(text, tags = ['*']):
+    """
+    Return text without portions where wiki markup is disabled
+
+    Parts that can/will be removed are --
+    * HTML comments
+    * nowiki tags
+    * pre tags
+    * includeonly tags
+
+    The exact set of parts which should be removed can be passed as the
+    'parts' parameter, which defaults to all.
+    """
+    regexes = {
+            'comments' :   r'<!--.*?-->',
+            'includeonly': r'<includeonly>.*?</includeonly>',
+            'nowiki':      r'<nowiki>.*?</nowiki>',
+            'pre':         r'<pre>.*?</pre>',
+            'source':      r'<source .*?</source>',
+    }
+    if '*' in tags:
+        tags = regexes.keys()
+    toRemoveR = re.compile('|'.join([regexes[tag] for tag in tags]),
+                           re.IGNORECASE | re.DOTALL)
+    return toRemoveR.sub('', text)
+
+
+def isDisabled(text, index, tags = ['*']):
+    """
+    Return True if text[index] is disabled, e.g. by a comment or nowiki tags.
+
+    For the tags parameter, see removeDisabledParts() above.
+    """
+    # Find a marker that is not already in the text.
+    marker = '@@'
+    while marker in text:
+        marker += '@'
+    text = text[:index] + marker + text[index:]
+    text = removeDisabledParts(text, tags)
+    return (marker not in text)
+
+
+# Functions dealing with interwiki language links
+
+# Note - MediaWiki supports two kinds of interwiki links, interlanguage and
+#        interproject. These functions only deal with links to a
+#        corresponding page in another language on the same project (e.g.,
+#        Wikipedia, Wiktionary, etc.). They do not find or change links to
+#        a different project, or any that are formatted as in-line interwiki
+#        links (e.g., "[[:es:Articulo]]"). (CONFIRM)
+
+def getLanguageLinks(text, insite = None, pageLink = "[[]]"):
+    """
+    Return a dict of interlanguage links found in text.
+
+    Dict uses language codes as keys and Page objects as values.
+    Do not call this routine directly, use Page.interwiki() method
+    instead.
+
+    """
+    if insite == None:
+        insite = pywikibot.getSite()
+    result = {}
+    # Ignore interwiki links within nowiki tags, includeonly tags, pre tags,
+    # and HTML comments
+    text = removeDisabledParts(text)
+
+    # This regular expression will find every link that is possibly an
+    # interwiki link.
+    # NOTE: language codes are case-insensitive and only consist of basic
+    # latin letters and hyphens.
+    interwikiR = re.compile(r'\[\[([a-zA-Z\-]+)\s?:([^\[\]\n]*)\]\]')
+    for lang, pagetitle in interwikiR.findall(text):
+        lang = lang.lower()
+        # Check if it really is in fact an interwiki link to a known
+        # language, or if it's e.g. a category tag or an internal link
+        if lang in insite.family.obsolete:
+            lang = insite.family.obsolete[lang]
+        if lang in insite.validLanguageLinks():
+            if '|' in pagetitle:
+                # ignore text after the pipe
+                pagetitle = pagetitle[:pagetitle.index('|')]
+            # we want the actual page objects rather than the titles
+            site = insite.getSite(code = lang)
+            try:
+                result[site] = pywikibot.Page(site, pagetitle, insite = insite)
+            except InvalidTitle:
+                output(
+                    u"[getLanguageLinks] Text contains invalid interwiki link [[%s:%s]]."
+                    % (lang, pagetitle))
+                continue
+    return result
+
+
+def removeLanguageLinks(text, site = None, marker = ''):
+    """Return text with all interlanguage links removed.
+
+    If a link to an unknown language is encountered, a warning is printed.
+    If a marker is defined, that string is placed at the location of the
+    last occurrence of an interwiki link (at the end if there are no
+    interwiki links).
+
+    """
+    if site == None:
+        site = pywikibot.getSite()
+    if not site.validLanguageLinks():
+        return text
+    # This regular expression will find every interwiki link, plus trailing
+    # whitespace.
+    languages = '|'.join(site.validLanguageLinks()
+                         + site.family.obsolete.keys())
+    interwikiR = re.compile(r'\[\[(%s)\s?:[^\]]*\]\][\s]*'
+                            % languages, re.IGNORECASE)
+    text = replaceExcept(text, interwikiR, '',
+                         ['nowiki', 'comment', 'math', 'pre', 'source'],
+                         marker=marker)
+    return text.strip()
+
+
+def replaceLanguageLinks(oldtext, new, site = None):
+    """Replace interlanguage links in the text with a new set of links.
+
+    'new' should be a dict with the Site objects as keys, and Page objects
+    as values (i.e., just like the dict returned by getLanguageLinks
+    function).
+
+    """
+    # Find a marker that is not already in the text.
+    marker = '@@'
+    while marker in oldtext:
+        marker += '@'
+    if site == None:
+        site = pywikibot.getSite()
+    s = interwikiFormat(new, insite = site)
+    s2 = removeLanguageLinks(oldtext, site = site, marker = marker)
+    if s:
+        if site.language() in site.family.interwiki_attop:
+            newtext = s + site.family.interwiki_text_separator \
+                      + s2.replace(marker, '').strip()
+        else:
+            # calculate what was after the language links on the page
+            firstafter = s2.find(marker) + len(marker)
+            # Is there any text in the 'after' part that means we should
+            # keep it after?
+            if "</noinclude>" in s2[firstafter:]:
+                newtext = s2[:firstafter] + s + s2[firstafter:]
+            elif site.language() in site.family.categories_last:
+                cats = getCategoryLinks(s2, site = site)
+                s2 = removeCategoryLinks(s2.replace(marker, '').strip(),
+                                         site) \
+                     + site.family.interwiki_text_separator + s
+                newtext = replaceCategoryLinks(s2, cats, site=site)
+            else:
+                newtext = s2.replace(marker, '').strip() \
+                          + site.family.interwiki_text_separator + s
+        newtext = newtext.replace(marker, '')
+    else:
+        newtext = s2.replace(marker, '')
+    return newtext
+
+
+def interwikiFormat(links, insite = None):
+    """Convert interwiki link dict into a wikitext string.
+
+    'links' should be a dict with the Site objects as keys, and Page
+    objects as values.
+
+    Return a unicode string that is formatted for inclusion in insite
+    (defaulting to the current site).
+
+    """
+    if insite is None:
+        insite = pywikibot.getSite()
+    if not links:
+        return ''
+
+    ar = interwikiSort(links.keys(), insite)
+    s = []
+    for site in ar:
+        try:
+            link = links[site].aslink(forceInterwiki=True)
+            s.append(link)
+        except AttributeError:
+            s.append(pywikibot.getSite(site).linkto(links[site],
+                                                    othersite=insite))
+    if insite.lang in insite.family.interwiki_on_one_line:
+        sep = u' '
+    else:
+        sep = u'\r\n'
+    s = sep.join(s) + u'\r\n'
+    return s
+
+
+# Sort sites according to local interwiki sort logic
+def interwikiSort(sites, insite = None):
+    if insite is None:
+        insite = pywikibot.getSite()
+    if not sites:
+        return []
+
+    sites.sort()
+    putfirst = insite.interwiki_putfirst()
+    if putfirst:
+        #In this case I might have to change the order
+        firstsites = []
+        for code in putfirst:
+            # The code may not exist in this family?
+            if code in insite.family.obsolete:
+                code = insite.family.obsolete[code]
+            if code in insite.validLanguageLinks():
+                site = insite.getSite(code = code)
+                if site in sites:
+                    del sites[sites.index(site)]
+                    firstsites = firstsites + [site]
+        sites = firstsites + sites
+    if insite.interwiki_putfirst_doubled(sites):
+        # some implementations return False
+        sites = insite.interwiki_putfirst_doubled(sites) + sites
+    return sites
+
+
+# Functions dealing with category links
+
+def getCategoryLinks(text, site):
+    """Return a list of category links found in text.
+
+    List contains Category objects.
+    Do not call this routine directly, use Page.categories() instead.
+
+    """
+    result = []
+    # Ignore category links within nowiki tags, pre tags, includeonly tags,
+    # and HTML comments
+    text = removeDisabledParts(text)
+    catNamespace = '|'.join(site.category_namespaces())
+    R = re.compile(r'\[\[\s*(?P<namespace>%s)\s*:\s*(?P<catName>.+?)'
+                   r'(?:\|(?P<sortKey>.+?))?\s*\]\]'
+                   % catNamespace, re.I)
+    for match in R.finditer(text):
+        cat = pywikibot.Category(site,
+                                 '%s:%s' % (match.group('namespace'),
+                                            match.group('catName')),
+                                 sortKey = match.group('sortKey'))
+        result.append(cat)
+    return result
+
+
+def removeCategoryLinks(text, site, marker = ''):
+    """Return text with all category links removed.
+
+    Put the string marker after the last replacement (at the end of the text
+    if there is no replacement).
+
+    """
+    # This regular expression will find every link that is possibly an
+    # interwiki link, plus trailing whitespace. The language code is grouped.
+    # NOTE: This assumes that language codes only consist of non-capital
+    # ASCII letters and hyphens.
+    catNamespace = '|'.join(site.category_namespaces())
+    categoryR = re.compile(r'\[\[\s*(%s)\s*:.*?\]\]\s*' % catNamespace, re.I)
+    text = replaceExcept(text, categoryR, '',
+                         ['nowiki', 'comment', 'math', 'pre', 'source'],
+                         marker = marker)
+    if marker:
+        # avoid having multiple linefeeds at the end of the text
+        text = re.sub('\s*%s' % re.escape(marker), '\r\n' + marker,
+                      text.strip())
+    return text.strip()
+
+
+def replaceCategoryInPlace(oldtext, oldcat, newcat, site=None):
+    """Replace the category oldcat with the category newcat and return
+    the modified text.
+
+    """
+    if site is None:
+        site = pywikibot.getSite()
+
+    catNamespace = '|'.join(site.category_namespaces())
+    title = oldcat.titleWithoutNamespace()
+    if not title:
+        return
+    # title might contain regex special characters
+    title = re.escape(title)
+    # title might not be capitalized correctly on the wiki
+    if title[0].isalpha() and not site.nocapitalize:
+        title = "[%s%s]" % (title[0].upper(), title[0].lower()) + title[1:]
+    # spaces and underscores in page titles are interchangeable and
+    # collapsible
+    title = title.replace(r"\ ", "[ _]+").replace(r"\_", "[ _]+")
+    categoryR = re.compile(r'\[\[\s*(%s)\s*:\s*%s\s*((?:\|[^]]+)?\]\])'
+                           % (catNamespace, title), re.I)
+    if newcat is None:
+        text = replaceExcept(oldtext, categoryR, '',
+                             ['nowiki', 'comment', 'math', 'pre', 'source'])
+    else:
+        text = replaceExcept(oldtext, categoryR,
+                             '[[%s:%s\\2' % (site.namespace(14),
+                                             newcat.titleWithoutNamespace()),
+                             ['nowiki', 'comment', 'math', 'pre', 'source'])
+    return text
+
+
+def replaceCategoryLinks(oldtext, new, site = None, addOnly = False):
+    """Replace the category links given in the wikitext given
+    in oldtext by the new links given in new.
+
+    'new' should be a list of Category objects.
+
+    If addOnly is True, the old category won't be deleted and the
+    category(s) given will be added (and so they won't replace anything).
+
+    """
+    # Find a marker that is not already in the text.
+    marker = '@@'
+    while marker in oldtext:
+        marker += '@'
+
+    if site is None:
+        site = pywikibot.getSite()
+    if site.sitename() == 'wikipedia:de' and "{{Personendaten" in oldtext:
+        raise Error('The PyWikipediaBot is no longer allowed to touch categories on the German Wikipedia on pages that contain the person data template because of the non-standard placement of that template. See http://de.wikipedia.org/wiki/Hilfe_Diskussion:Personendaten/Archiv/bis_2006…')
+
+    s = categoryFormat(new, insite = site)
+    if addOnly:
+        s2 = oldtext
+    else:
+        s2 = removeCategoryLinks(oldtext, site = site, marker = marker)
+
+    if s:
+        if site.language() in site.family.category_attop:
+            newtext = s + site.family.category_text_separator + s2
+        else:
+            # calculate what was after the categories links on the page
+            firstafter = s2.find(marker)
+            # Is there any text in the 'after' part that means we should
+            # keep it after?
+            if "</noinclude>" in s2[firstafter:]:
+                newtext = s2[:firstafter] + s + s2[firstafter:]
+            elif site.language() in site.family.categories_last:
+                newtext = s2.replace(marker, '').strip() \
+                          + site.family.category_text_separator + s
+            else:
+                interwiki = getLanguageLinks(s2)
+                s2 = removeLanguageLinks(s2.replace(marker, ''), site) \
+                     + site.family.category_text_separator + s
+                newtext = replaceLanguageLinks(s2, interwiki, site)
+        newtext = newtext.replace(marker, '')
+    else:
+        s2 = s2.replace(marker, '')
+        return s2
+    return newtext.strip()
+
+
+def categoryFormat(categories, insite = None):
+    """Return a string containing links to all categories in a list.
+
+    'categories' should be a list of Category objects.
+
+    The string is formatted for inclusion in insite.
+
+    """
+    if not categories:
+        return ''
+    if insite is None:
+        insite = pywikibot.getSite()
+    catLinks = [category.aslink(noInterwiki = True) for category in categories]
+    if insite.category_on_one_line():
+        sep = ' '
+    else:
+        sep = '\r\n'
+    # Some people don't like the categories sorted
+    #catLinks.sort()
+    return sep.join(catLinks) + '\r\n'
+
+
+def compileLinkR(withoutBracketed=False, onlyBracketed=False):
+    """Return a regex that matches external links."""
+    # RFC 2396 says that URLs may only contain certain characters.
+    # For this regex we also accept non-allowed characters, so that the bot
+    # will later show these links as broken ('Non-ASCII Characters in URL').
+    # Note: While allowing parentheses inside URLs, MediaWiki will regard
+    # a right parenthesis at the end of the URL as not part of that URL.
+    # The same applies to dot, comma, colon and some other characters.
+    notAtEnd = '\]\s\)\.:;,<>"'
+    # So characters inside the URL can be anything except whitespace,
+    # closing squared brackets, quotation marks, greater than and less
+    # than, and the last character also can't be parenthesis or another
+    # character disallowed by MediaWiki.
+    notInside = '\]\s<>"'
+    # The first half of this regular expression is required because '' is
+    # not allowed inside links. For example, in this wiki text:
+    #       ''Please see http://www.example.org.''
+    # .'' shouldn't be considered as part of the link.
+    regex = r'(?P<url>http[s]?://[^' + notInside + ']*?[^' + notAtEnd \
+            + '](?=[' + notAtEnd + ']*\'\')|http[s]?://[^' + notInside \
+            + ']*[^' + notAtEnd + '])'
+
+    if withoutBracketed:
+        regex = r'(?<!\[)' + regex
+    elif onlyBracketed:
+        regex = r'\[' + regex
+    linkR = re.compile(regex)
+    return linkR
+
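For orientation, a hypothetical usage sketch of the new module (the text values are made up):

    import textlib
    # replace 'colour' with 'color' everywhere except inside nowiki tags,
    # comments and templates
    newtext = textlib.replaceExcept(oldtext, u'colour', u'color',
                                    ['nowiki', 'comment', 'template'])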
Revision: 6090
Author: erwin85
Date: 2008-11-12 18:51:10 +0000 (Wed, 12 Nov 2008)
Log Message:
-----------
editTime and startTime aren't defined through _getAll if the page doesn't exist. If so, set them to the current time.
Modified Paths:
--------------
trunk/pywikipedia/wikipedia.py
Property Changed:
----------------
trunk/pywikipedia/commons_category_redirect.py
Property changes on: trunk/pywikipedia/commons_category_redirect.py
___________________________________________________________________
Added: svn:mergeinfo
+
Modified: trunk/pywikipedia/wikipedia.py
===================================================================
--- trunk/pywikipedia/wikipedia.py 2008-11-12 17:39:18 UTC (rev 6089)
+++ trunk/pywikipedia/wikipedia.py 2008-11-12 18:51:10 UTC (rev 6090)
@@ -461,7 +461,8 @@
         self._permalink = None
         self._userName = None
         self._ipedit = None
-        self._editTime = None
+        self._editTime = '0'
+        self._startTime = '0'
         # For the Flagged Revisions MediaWiki extension
         self._revisionId = None
         self._deletedRevs = None
@@ -1416,8 +1417,14 @@
         # <s>Except if the page is new, we need to supply the time of the
         # previous version to the wiki to prevent edit collisions</s>
         # As of Oct 2008, these must be filled also for new pages
-        predata['wpEdittime'] = self._editTime
-        predata['wpStarttime'] = self._startTime
+        if self._editTime:
+            predata['wpEdittime'] = self._editTime
+        else:
+            predata['wpEdittime'] = time.strftime('%Y%m%d%H%M%S', time.gmtime())
+        if self._startTime:
+            predata['wpStarttime'] = self._startTime
+        else:
+            predata['wpStarttime'] = time.strftime('%Y%m%d%H%M%S', time.gmtime())
         if self._revisionId:
             predata['baseRevId'] = self._revisionId
         # Pass the minorEdit and watchArticle arguments to the Wiki.
@@ -1527,8 +1534,14 @@
                 # without any reason!
                 # raise EditConflict(u'Someone deleted the page.')
                 # No raise, simply define these variables and retry:
-                predata['wpEdittime'] = self._editTime
-                predata['wpStarttime'] = self._startTime
+                if self._editTime:
+                    predata['wpEdittime'] = self._editTime
+                else:
+                    predata['wpEdittime'] = time.strftime('%Y%m%d%H%M%S', time.gmtime())
+                if self._startTime:
+                    predata['wpStarttime'] = self._startTime
+                else:
+                    predata['wpStarttime'] = time.strftime('%Y%m%d%H%M%S', time.gmtime())
                 continue
             if self.site().has_mediawiki_message("viewsource")\
                and self.site().mediawiki_message('viewsource') in data:
Bugs item #2269688, was opened at 2008-11-12 15:03
Message generated for change (Settings changed) made by nicdumz
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=2269688&group_…
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
>Status: Pending
>Resolution: Fixed
Priority: 5
Private: No
Submitted By: Yann Forget (yannforget)
Assigned to: Nobody/Anonymous (nobody)
Summary: Unicode error with djvutext.py
Initial Comment:
on fr.wikisource:
python djvutext.py -index:Livre:Le_Th%C3%A9%C3%A2tre_de_la_R%C3%A9volution._Le_Quatorze_Juillet._Danton._Les_Loups.djvu -djvu:Le_quatorze_juillet_Danton_Les_loups.djvu -pages:375
Checked for running processes. 1 processes currently running, including the current process.
Traceback (most recent call last):
File "djvutext.py", line 249, in <module>
main()
File "djvutext.py", line 236, in main
wikipedia.output("uploading text from %s to %s" % (djvu, index_page) )
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13: ordinal not in range(128)
python version.py
Pywikipedia [http] trunk/pywikipedia (r6084, Nov 11 2008, 21:51:31)
Python 2.5.2 (r252, Sep 13 2008, 22:55:01)
[GCC 4.1.2 (Gentoo 4.1.2 p1.1)]
----------------------------------------------------------------------
Comment By: NicDumZ — Nicolas Dumazet (nicdumz)
Date: 2008-11-12 18:41
Message:
Wolf!!
This display error should be fixed by r6089. Please comment again on this
bug if that is not the case :)
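A sketch of the kind of fix involved (r6089 itself may differ): decode byte strings before interpolating them into a unicode format string:

    if isinstance(djvu, str):
        djvu = djvu.decode('utf-8')
    wikipedia.output(u"uploading text from %s to %s" % (djvu, index_page))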
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=2269688&group_…