Patches item #1814580, was opened at 2007-10-16 18:36
Message generated for change (Comment added) made by wikipedian
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603140&aid=1814580&group_…
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Vandenberg (zeroj)
Assigned to: Nobody/Anonymous (nobody)
Summary: save page generator output
Initial Comment:
When calling pagegenerators.py directly from the command line, it is handy to save the output for post-processing, or for use as input to replace.py's -file page generator. This patch provides a -output argument.
As this patch does not handle unicode page names, it needs improvement before it can be committed.
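For illustration only (this is not the submitted patch, and the helper name is hypothetical), a unicode-safe -output option could write the titles through an encoding wrapper, e.g.:

import codecs

def write_generator_output(generator, filename):
    # codecs.open() encodes on write, so non-ASCII page names survive
    f = codecs.open(filename, 'w', 'utf-8')
    try:
        for page in generator:
            # one [[title]] per line, which a -file page generator can read back
            f.write(u'[[%s]]\n' % page.title())
    finally:
        f.close()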
----------------------------------------------------------------------
>Comment By: Daniel Herding (wikipedian)
Date: 2007-10-17 11:10
Message:
Logged In: YES
user_id=880694
Originator: NO
Is this really needed? I mean, you can do this:
daniel@localhost:~/projekte/pywikipedia> python pagegenerators.py -ref:Wikipedia:Pywikipediabot > output.txt
Checked for running processes. 2 processes currently running, including the current process.
Getting references to [[Wikipedia:Pywikipediabot]]
daniel@localhost:~/projekte/pywikipedia> cat output.txt
Benutzer:Head
Wikipedia:Selbstlinks
Benutzer:Zwobot
Benutzer Diskussion:Waugsberg/Archiv/2006-8
Wikipedia:Bots
[etc.]
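One caveat with plain redirection, which is also why the unicode handling mentioned in the patch matters: under Python 2, printing unicode page titles to a redirected stdout can raise UnicodeEncodeError because the pipe has no encoding set. A possible workaround (a sketch, not part of the framework) is to wrap stdout before printing:

import sys, codecs
# force UTF-8 output even when stdout is a pipe or a file,
# so non-ASCII page titles do not crash the script
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)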
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603140&aid=1814580&group_…
Patches item #1807596, was opened at 2007-10-04 17:49
Message generated for change (Comment added) made by wikipedian
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603140&aid=1807596&group_…
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
>Status: Closed
>Resolution: Accepted
Priority: 5
Private: No
Submitted By: Alex S.H. Lin (lin4h)
Assigned to: Nobody/Anonymous (nobody)
Summary: interwiki.py: Add Chinese Translation
Initial Comment:
interwiki.py Chinese Message Translation
----------------------------------------------------------------------
>Comment By: Daniel Herding (wikipedian)
Date: 2007-10-17 11:00
Message:
Logged In: YES
user_id=880694
Originator: NO
Patch applied, thank you.
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603140&aid=1807596&group_…
Bugs item #1809802, was opened at 2007-10-08 23:34
Message generated for change (Comment added) made by wikipedian
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=1809802&group_…
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 5
Private: No
Submitted By: Nobody/Anonymous (nobody)
Assigned to: Nobody/Anonymous (nobody)
Summary: weblinkchecker.py inefficiently respects max_external_links
Initial Comment:
I noticed while testing my new system that setting max_external_links to anything above 250 seems to be pointless, because the number of page names to preload is hardcoded:
gen = pagegenerators.PreloadingGenerator(gen, pageNumber = 250)
So if more than 250 threads were created, the extra threads would have nothing to do, because a fresh batch of page names (one per thread) is apparently only fetched once all of the previous 250 threads have finished (I could be wrong here). In that case, it would be better to have a statement like
gen = pagegenerators.PreloadingGenerator(gen, pageNumber = config.max_external_links)
so that at least as many page names are fetched as the current batch of threads needs (I figure the more page names are stored, the less often the bot has to wait for downloads), i.e. something like:
--- weblinkchecker.py.bak 2007-10-08 17:14:58.000000000 -0400
+++ weblinkchecker.py 2007-10-08 17:15:09.000000000 -0400
@@ -729,7 +729,7 @@
     if gen:
         if namespaces != []:
             gen = pagegenerators.NamespaceFilterPageGenerator(gen, namespaces)
-        gen = pagegenerators.PreloadingGenerator(gen, pageNumber = 260)
+        gen = pagegenerators.PreloadingGenerator(gen, pageNumber = (config.max_external_links * 2))
         gen = pagegenerators.RedirectFilterPageGenerator(gen)
         bot = WeblinkCheckerRobot(gen)
         try:
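To illustrate the trade-off (a simplified model, not the actual PreloadingGenerator; download() is a stand-in for one bulk HTTP request):

def download(titles):
    # placeholder for a single bulk request fetching many pages at once
    return ['<page %s>' % t for t in titles]

def preloading_generator(titles, batch_size):
    # consumers beyond batch_size starve until the next bulk fetch,
    # which is why batch_size should scale with the thread count
    batch = []
    for title in titles:
        batch.append(title)
        if len(batch) >= batch_size:
            for page in download(batch):
                yield page
            batch = []
    if batch:
        for page in download(batch):
            yield page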
----------------------------------------------------------------------
>Comment By: Daniel Herding (wikipedian)
Date: 2007-10-17 10:51
Message:
Logged In: YES
user_id=880694
Originator: NO
I changed it to this:
# fetch at least 240 pages simultaneously from the wiki, but more if
# a high thread number is set.
pageNumber = max(240, config.max_external_links * 2)
gen = pagegenerators.PreloadingGenerator(gen, pageNumber = pageNumber)
I think that should be OK for everyone.
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=1809802&group_…
Bugs item #1811843, was opened at 2007-10-11 22:54
Message generated for change (Comment added) made by wikipedian
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=1811843&group_…
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 5
Private: No
Submitted By: AnMaster (anmaster)
Assigned to: Nobody/Anonymous (nobody)
Summary: cosmetic_changes.py should not edit <nowiki>
Initial Comment:
<AnMaster> with cosmetic cleanup the bot sometimes edits too much: http://gentoo-wiki.com/index.php?title=HOWTO_AutoLiveCD&diff=117937&oldid=9…
<AnMaster> - if [[ "$IF_UP" == "$IF_ALL" ]]; then
<AnMaster> + if [["$IF UP" == "$IF ALL"]] ; then
<AnMaster> inside a <nowiki>, that is a clear error
... later ...
<AnMaster> where should I report this problem of cosmetic_cleanup?
<Hojjat> You can report it on the bug tracker
<Hojjat> Here is the link:
<Hojjat> http://sourceforge.net/tracker/?group_id=93107 go to the bugs section
So here it is. This bug is very irritating.
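The general technique for such a fix (a sketch under assumed names, not pywikipedia's actual replaceExcept implementation) is to locate the protected regions first and refuse any replacement whose match starts inside one:

import re

NOWIKI = re.compile(r'<nowiki>.*?</nowiki>', re.IGNORECASE | re.DOTALL)

def replace_outside_nowiki(text, pattern, repl):
    # spans of text that must not be modified
    protected = [m.span() for m in NOWIKI.finditer(text)]
    def guarded(match):
        for start, end in protected:
            if start <= match.start() < end:
                return match.group()  # inside <nowiki>: keep unchanged
        return match.expand(repl)     # normal substitution elsewhere
    return pattern.sub(guarded, text)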
----------------------------------------------------------------------
>Comment By: Daniel Herding (wikipedian)
Date: 2007-10-17 10:45
Message:
Logged In: YES
user_id=880694
Originator: NO
Fixed in SVN, thanks for your bug report.
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=1811843&group_…
Revision: 4455
Author: wikipedian
Date: 2007-10-17 08:44:54 +0000 (Wed, 17 Oct 2007)
Log Message:
-----------
fixed bug [ 1811843 ] cosmetic_changes.py should not edit <nowiki>
Modified Paths:
--------------
trunk/pywikipedia/cosmetic_changes.py
Modified: trunk/pywikipedia/cosmetic_changes.py
===================================================================
--- trunk/pywikipedia/cosmetic_changes.py 2007-10-17 08:42:41 UTC (rev 4454)
+++ trunk/pywikipedia/cosmetic_changes.py 2007-10-17 08:44:54 UTC (rev 4455)
@@ -158,25 +158,12 @@
         return text
 
     def cleanUpLinks(self, text):
-        trailR = re.compile(self.site.linktrail())
-        # The regular expression which finds links. Results consist of four groups:
-        # group title is the target page title, that is, everything before | or ].
-        # group section is the page section. It'll include the # to make life easier for us.
-        # group label is the alternative link title, that's everything between | and ].
-        # group linktrail is the link trail, that's letters after ]] which are part of the word.
-        # note that the definition of 'letter' varies from language to language.
-        self.linkR = re.compile(r'\[\[(?P<titleWithSection>[^\]\|]+)(\|(?P<label>[^\]\|]*))?\]\](?P<linktrail>' + self.site.linktrail() + ')')
-        curpos = 0
-        # This loop will run until we have finished the current page
-        while True:
-            m = self.linkR.search(text, pos = curpos)
-            if not m:
-                break
-            # Make sure that next time around we will not find this same hit.
-            curpos = m.start() + 1
-            titleWithSection = m.group('titleWithSection')
-            label = m.group('label')
-            trailingChars = m.group('linktrail')
+        # helper function which works on one link and either returns it
+        # unmodified, or returns a replacement.
+        def handleOneLink(match):
+            titleWithSection = match.group('titleWithSection')
+            label = match.group('label')
+            trailingChars = match.group('linktrail')
 
             if not self.site.isInterwikiLink(titleWithSection):
                 # The link looks like this:
@@ -210,7 +197,7 @@
 
                 if titleWithSection == '':
                     # just skip empty links.
-                    continue
+                    return match.group()
 
                 # Remove unnecessary initial and final spaces from label.
                 # Please note that some editors prefer spaces around pipes. (See [[en:Wikipedia:Semi-bots]]). We remove them anyway.
@@ -256,7 +243,20 @@
                     newLink = ' ' + newLink
                 if hadTrailingSpaces:
                     newLink = newLink + ' '
-            text = text[:m.start()] + newLink + text[m.end():]
+                return newLink
+            # don't change anything
+            return match.group()
+
+        trailR = re.compile(self.site.linktrail())
+        # The regular expression which finds links. Results consist of four groups:
+        # group title is the target page title, that is, everything before | or ].
+        # group section is the page section. It'll include the # to make life easier for us.
+        # group label is the alternative link title, that's everything between | and ].
+        # group linktrail is the link trail, that's letters after ]] which are part of the word.
+        # note that the definition of 'letter' varies from language to language.
+        linkR = re.compile(r'\[\[(?P<titleWithSection>[^\]\|]+)(\|(?P<label>[^\]\|]*))?\]\](?P<linktrail>' + self.site.linktrail() + ')')
+
+        text = wikipedia.replaceExcept(text, linkR, handleOneLink, ['comment', 'math', 'nowiki', 'pre', 'startspace'])
         return text
 
     def resolveHtmlEntities(self, text):
@@ -273,7 +273,7 @@
         return text
 
     def validXhtml(self, text):
-        text = wikipedia.replaceExcept(text, r'<br>', r'<br />', ['comment', 'nowiki', 'pre'])
+        text = wikipedia.replaceExcept(text, r'<br>', r'<br />', ['comment', 'math', 'nowiki', 'pre'])
         return text
 
     def removeUselessSpaces(self, text):
Revision: 4454
Author: wikipedian
Date: 2007-10-17 08:42:41 +0000 (Wed, 17 Oct 2007)
Log Message:
-----------
replaceExcept() now accepts a function as replacement parameter, like in
re.sub().
Modified Paths:
--------------
trunk/pywikipedia/wikipedia.py
Modified: trunk/pywikipedia/wikipedia.py
===================================================================
--- trunk/pywikipedia/wikipedia.py 2007-10-16 12:44:22 UTC (rev 4453)
+++ trunk/pywikipedia/wikipedia.py 2007-10-17 08:42:41 UTC (rev 4454)
@@ -2714,7 +2714,10 @@
     Parameters:
         text            - a unicode string
         old             - a compiled regular expression
-        new             - a unicode string
+        new             - a unicode string (which can contain regular
+                          expression references), or a function which takes
+                          a match object as parameter. See parameter repl of
+                          re.sub().
         exceptions      - a list of strings which signal what to leave out,
                           e.g. ['math', 'table', 'template']
         caseInsensitive - a boolean
@@ -2805,23 +2808,29 @@
             if sys.platform=='win32':
                 new = new.replace('\\n', '\n')
 
-            # We cannot just insert the new string, as it may contain regex
-            # group references such as \2 or \g<name>.
-            # On the other hand, this approach does not work because it can't
-            # handle lookahead or lookbehind (see bug #1731008):
-            #replacement = old.sub(new, text[match.start():match.end()])
-            #text = text[:match.start()] + replacement + text[match.end():]
+            try:
+                # the parameter new can be a function which takes the match as a parameter.
+                replacement = new(match)
+            except TypeError:
+                # it is not a function, but a string.
 
-            # So we have to process the group references manually.
-            replacement = new
+                # We cannot just insert the new string, as it may contain regex
+                # group references such as \2 or \g<name>.
+                # On the other hand, this approach does not work because it can't
+                # handle lookahead or lookbehind (see bug #1731008):
+                #replacement = old.sub(new, text[match.start():match.end()])
+                #text = text[:match.start()] + replacement + text[match.end():]
 
-            groupR = re.compile(r'\\(?P<number>\d+)|\\g<(?P<name>.+?)>')
-            while True:
-                groupMatch = groupR.search(replacement)
-                if not groupMatch:
-                    break
-                groupID = groupMatch.group('name') or int(groupMatch.group('number'))
-                replacement = replacement[:groupMatch.start()] + match.group(groupID) + replacement[groupMatch.end():]
+                # So we have to process the group references manually.
+                replacement = new
+
+                groupR = re.compile(r'\\(?P<number>\d+)|\\g<(?P<name>.+?)>')
+                while True:
+                    groupMatch = groupR.search(replacement)
+                    if not groupMatch:
+                        break
+                    groupID = groupMatch.group('name') or int(groupMatch.group('number'))
+                    replacement = replacement[:groupMatch.start()] + match.group(groupID) + replacement[groupMatch.end():]
 
             text = text[:match.start()] + replacement + text[match.end():]
 
             # continue the search on the remaining text
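With r4454 applied, a caller can pass a function instead of a replacement string, mirroring re.sub(); for example (illustrative only, assuming the framework's wikipedia module is on the path):

import re
import wikipedia

def fixBr(match):
    # a computed replacement; plain replacement strings cannot branch
    return '<br />'

sample = u'line one<BR>line two <nowiki><BR></nowiki>'
brR = re.compile(r'<br\s*>', re.IGNORECASE)
# the <BR> inside <nowiki> is left untouched thanks to the exceptions list
fixed = wikipedia.replaceExcept(sample, brR, fixBr,
                                ['comment', 'math', 'nowiki', 'pre'])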