I'm using replace.py to create wikilinks. Usually I want to select only the
first occurrence of the search string, and my command works fine for this.
But sometimes, the first hit is not suitable (e.g. it's part of a book or
course title, so I don't want to add the wikilink). If I choose n for no,
the bot goes to the next page.
Is there a way I can skip to the next occurrence on the same page? I'm
guessing it will need a modified version of replace.py, so that it gives an
extra option besides ([y]es, [N]o, [e]dit, open in [b]rowser, [a]ll,
[q]uit).
The actual command I'm using is:
python replace.py -regex "(?si)\b((?:FOO1|FOO2))\b(.*$)" "[[\\1]]\\2" -exceptinsidetag:link -exceptinsidetag:hyperlink -exceptinsidetag:header -exceptinsidetag:nowiki -exceptinsidetag:ref -excepttext:"(?si)\[\[((?:FOO1|FOO2)[\|\]])" -namespace:0 -namespace:102 -namespace:4 -summary:"[[Appropedia:Wikilink bot]] adding double square brackets to: FOO1|FOO2." -log -xml:currentdump.xml
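To illustrate, the per-occurrence behaviour I'm after might look like this in plain Python (a standalone sketch, not replace.py code; the should_replace callback stands in for the interactive prompt, and FOO1 and the wikilink replacement are just examples):

```python
import re

def replace_selected(text, pattern, repl, should_replace):
    """Replace only the occurrences for which should_replace(match) is True.

    Instead of one re.sub over the whole page, iterate the matches and
    decide about each one individually."""
    out = []
    last = 0
    for m in re.finditer(pattern, text):
        out.append(text[last:m.start()])
        out.append(m.expand(repl) if should_replace(m) else m.group(0))
        last = m.end()
    out.append(text[last:])
    return ''.join(out)

# Skip the first occurrence (e.g. inside a book title), link the rest:
seen = []
def skip_first(match):
    seen.append(match)
    return len(seen) > 1

text = "The FOO1 Handbook explains FOO1 in detail."
print(replace_selected(text, r'\bFOO1\b', r'[[\g<0>]]', skip_first))
# -> The FOO1 Handbook explains [[FOO1]] in detail.
```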
Many thanks!
--
Chris Watkins
Appropedia.org - Sharing knowledge to build rich, sustainable lives.
blogs.appropedia.org
identi.ca/appropedia
twitter.com/appropedia
Hi!
Do you have any idea why, using replace.py on some large dumps, I get
this error message:
C:\pywikipedia>replace.py -xml:enwiki-20091128-pages-articles.xml
Please enter the text that should be replaced: impossibletofindword
Please enter the new text: found
Please enter another text that should be replaced, or press Enter to start:
The summary message will default to: Robot: Automated text
replacement (-impossibletofindword +found
)
Press Enter to use this default message, or enter a description of the
changes your bot will make: test
Reading XML dump...
Traceback (most recent call last):
File "C:\pywikipedia\pagegenerators.py", line 847, in __iter__
for page in self.wrapped_gen:
File "C:\pywikipedia\pagegenerators.py", line 779, in
DuplicateFilterPageGenerator
for page in generator:
File "C:\pywikipedia\replace.py", line 218, in __iter__
for entry in self.parser:
File "C:\pywikipedia\xmlreader.py", line 295, in new_parse
for rev in self._parse(event, elem):
File "C:\pywikipedia\xmlreader.py", line 304, in _parse_only_latest
yield self._create_revision(revision)
File "C:\pywikipedia\xmlreader.py", line 341, in _create_revision
redirect=self.isredirect
File "C:\pywikipedia\xmlreader.py", line 64, in __init__
self.username = username.strip()
AttributeError: 'NoneType' object has no attribute 'strip'
'NoneType' object has no attribute 'strip'
I updated pywikipedia to the latest revision, but it made no difference.
As you can see, the problem does not seem to be user-fixes.py- or regex-related.
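My guess at the cause: the dump contains at least one revision whose contributor is deleted or suppressed, so the <username> element is absent and the parser passes None down to the line shown in the traceback. A minimal guard for that call (a sketch of a local workaround, not an official fix):

```python
def safe_strip(username):
    # xmlreader.py line 64 does `username.strip()`; dumps can contain
    # revisions whose contributor was deleted, leaving username as None.
    # Treating a missing <username> as an empty string avoids the crash.
    return username.strip() if username is not None else u''
```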
Thanks in advance!
Davide Bolsi
I am at a point where it would be helpful to have some feedback from other
Pywikipedia users about the future of the rewrite branch. As those who
watch the SVN commits know, I have not had as much time to work on this
lately, and have to prioritize what time I do spend on it.
For those who have used the rewrite branch, what (if anything) needs to be
done to it to get you to use it exclusively and retire the old wikipedia.py
system? What is missing? What is broken? What is present but could be
improved?
For those who have chosen not to use the rewrite branch, why not? What
might lead you to take another look?
And then, I'm sure there are many whose reaction to this post has been,
"What's the rewrite branch?" I don't know what to ask you, so feel free to
move on to the next message.
Most critically, is there any reason to continue development of the trunk
once the rewrite branch is at a point where most users are ready to switch
to it?
-- Russ
Dear all,
In response to Nicolas' e-mail:
> The original idea was to abandon trunk/ to use the rewrite, but we
> lack manpower and (at least for me) time to actually do the conversion
> work of all existing scripts. But I know for a fact that code is
> working, and cleaner. Please give it a try :)
I decided to clean up the nightlies page: I removed all the clutter
(spelling, threadedhttp, pywikiparser) and added the rewrite (in other
words: only the 'pywikipedia' and 'rewrite' packages remain).
Nightlies page: http://toolserver.org/~valhallasw/pywiki/
Best regards,
Merlijn van Deen / valhallasw
Hello! I've recently noticed that noreferences does not work with
articles from pl.wiki :/
I used the command "python noreferences.pyc -file:logs/refdx
-always", and the script ignores all pages which have <ref> tags
but no <references/>. Even when a change was necessary, it only
printed this message: "No changes necessary: references template
found."
An example of ignored pages: http://pl.wikipedia.org/wiki/Budgie_%28album%29
Regards,
patrol
I want to copy (not move) a few hundred pages from one namespace to another.
Any ideas how?
I think I can do it in two stages - if I can create a page. I'd create each
page with only the page name, then convert that to a transclusion from the
mainspace article:
python replace.py -regex ".*" "{{subst:PAGENAME}}" -file:pagelist
python replace.py -regex ".*" "{{subst::\\1}}" -file:pagelist
But I get the error "Page [[blah blah]] not found" when I run the first
command. I also tried with "" as the search string in the first line, but
it's the same. And ideally I'd like it to exclude cases where the page
already exists.
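Alternatively, a tiny custom script may be simpler than bending replace.py to this. The decision logic could be sketched like this, with the exists callback standing in for pywikipedia's Page.exists(), and the real script calling Page.put() with the generated wikitext (namespace name and page list are just examples):

```python
def transclusion_text(title):
    # Build the wikitext that transcludes the mainspace article
    # into the copy, e.g. "{{:Foo}}" for page "Foo".
    return '{{:%s}}' % title

def pages_to_create(titles, target_namespace, exists):
    # Yield (new_title, wikitext) for each page-list entry whose copy
    # does not exist yet; `exists` stands in for Page.exists().
    for title in titles:
        new_title = '%s:%s' % (target_namespace, title)
        if not exists(new_title):
            yield new_title, transclusion_text(title)
```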
Any solution? Thanks!
--
Chris Watkins
Appropedia.org - Sharing knowledge to build rich, sustainable lives.
blogs.appropedia.org
community.livejournal.com/appropedia
identi.ca/appropedia
twitter.com/appropedia
Hi
I use Python frequently, and today I started working with pywikipediabot,
which is a very good library, by the way.
But I think the workflow is very out-of-the-Python-way. Let me explain my
point:
To make a script that uses this environment, you need to put the code in the
main directory of pywikipediabot, or create some links to that directory. But
usually, when you use a third-party module in Python, you should have the
chance to "install" the module and load it with a simple
import pywikipediabot
or
from pywikipediabot import wikipedia
and do this from any directory on your system, without any extra
configuration or extra files. I think this would be a nice feature, because
it respects the Python way, and makes it much easier to distribute the
module using 'distutils'[1] or even Debian packages.
[1] http://docs.python.org/distutils/index.html
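For illustration, the distutils route could be as small as this (a hypothetical setup.py; the package name, version, and layout are assumptions, since the current code base is not structured as a package yet):

```python
# setup.py -- hypothetical; assumes the code were reorganized into a
# 'pywikipediabot' package directory with an __init__.py.
from distutils.core import setup

setup(
    name='pywikipediabot',
    version='0.1',
    description='MediaWiki bot framework',
    packages=['pywikipediabot'],
)
```

Then "python setup.py install" would make the imports above work from any directory.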
Regards,
Pablo Recio
Hi there,
I'm currently using the rewrite branch for a project. This project is
not a bot, but a tool for vandalism analysis.
Here I'll explain how I used it and what changes I made, so it may be
useful for the new design of the rewrite. Also, I'd like to get
recommendations about my approaches so I can make them suitable for
integration with pywikipedia.
First of all, my main unit of information is Edit. An Edit is an
object composed of a Page and two consecutive revision IDs of that
page. Edit supports operations such as getting the edit comment,
user, timestamp, and the old and new text.
I had to implement a method similar to BaseSite.loadrevisions():
Given a list of edits, which have associated their revision IDs but
NOT their Page, fetch them and associate them with their Page object.
This method retrieves all the revisions, creates Page objects for them
and Revision objects which are assigned to the corresponding
Page._revisions dict.
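In case it clarifies the design, here is roughly what the Edit class looks like (simplified; Page and Revision stand for pywikipedia's objects, and _revisions is the dict that the loadrevisions()-style method fills):

```python
class Edit(object):
    """One edit: a Page plus two consecutive revision IDs of that page.

    All accessors read from the page's _revisions cache, so the
    revisions must have been fetched beforehand."""
    def __init__(self, page, old_revid, new_revid):
        self.page = page
        self.old_revid = old_revid
        self.new_revid = new_revid

    def _rev(self, revid):
        return self.page._revisions[revid]

    def comment(self):
        return self._rev(self.new_revid).comment

    def user(self):
        return self._rev(self.new_revid).user

    def old_text(self):
        return self._rev(self.old_revid).text

    def new_text(self):
        return self._rev(self.new_revid).text
```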
Then, I have to store all this info on disk for later use. So I wrote
a function for exporting my list of edits to XML, using MediaWiki's
Export 0.4 format. To ease this process, I added a to_element() method
to the Page and Revision objects. to_element() returns an Element object
(from the ElementTree API) representing the object. So, exporting is
as easy as iterating over all Pages, calling their to_element()
method and appending the result to a common root. What do you think about
this? Should it be included in pywikipedia? Do you prefer a different
approach for exporting to XML?
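To make the idea concrete, here is a simplified version of the export step, using ElementTree on plain data rather than as methods on Page/Revision, and with the element order of the real Export 0.4 schema simplified:

```python
import xml.etree.ElementTree as ET

def revision_to_element(revid, timestamp, user, comment, text):
    # Sketch of Revision.to_element(): one <revision> element.
    rev = ET.Element('revision')
    for tag, value in (('id', str(revid)), ('timestamp', timestamp),
                       ('comment', comment), ('text', text)):
        ET.SubElement(rev, tag).text = value
    contributor = ET.SubElement(rev, 'contributor')
    ET.SubElement(contributor, 'username').text = user
    return rev

def page_to_element(title, revisions):
    # Sketch of Page.to_element(): a <page> wrapping its revisions.
    page = ET.Element('page')
    ET.SubElement(page, 'title').text = title
    for rev in revisions:
        page.append(revision_to_element(*rev))
    return page

root = ET.Element('mediawiki')  # the "common root" mentioned above
root.append(page_to_element('Example', [
    (1, '2009-01-01T00:00:00Z', 'Alice', 'start', 'Hello'),
]))
```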
For importing again from XML, I adapted the old XmlDump. My version
yields Page objects instead of revisions. Of course this might be a
performance nightmare when working with XML dumps with full history,
so it can be modified to yield Revision objects.
I think the Revision class should include a page attribute, containing
the Page object that the Revision belongs to. That would be of use,
for example, when writing an XmlDump yielding Revisions and, in
general, for more applications that are Revision oriented.
And last but not least, currently it's easy to end up with multiple
Page objects representing the same page, but with different object
state. Do you think that BaseSite should implement a Page factory or
some way to "create a Page object for this title if it doesn't exist
or give me the one that already exists"?
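The factory could be as simple as an identity map on the site object; a sketch (PageStub and SiteStub stand in for the real Page and BaseSite classes):

```python
class PageStub(object):
    # Stands in for pywikipedia's Page; only the title matters here.
    def __init__(self, title):
        self.title = title
        self.text = None  # state lives on the one canonical object

class SiteStub(object):
    """Sketch of a Page factory on the site: one canonical Page object
    per title, so every lookup of the same title shares state."""
    def __init__(self):
        self._pages = {}

    def page(self, title):
        if title not in self._pages:
            self._pages[title] = PageStub(title)
        return self._pages[title]
```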
Well, that's all at the moment.
Best regards,
--
Santiago M. Mola
Jabber ID: cooldwind(a)gmail.com
Hi,
Currently, I have to work with old revisions of articles. I found
that most methods in Page are focused on working with the latest
revision. So it'd be very convenient to add a new Revision class where
methods like section(), isDisambig(), userName(), editTime(), etc.,
live. Then Page could properly get (and cache) any revision. These
methods in Page could be shortcuts to the actual methods in Revision
of the latest revision.
What do you think?
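A sketch of the split I have in mind (method names copied from the current Page API; the real classes would of course fetch and cache revisions through the site rather than hold plain dicts):

```python
class Revision(object):
    """Revision-level methods move here."""
    def __init__(self, revid, user, timestamp, text):
        self.revid = revid
        self._user = user
        self._timestamp = timestamp
        self.text = text

    def userName(self):
        return self._user

    def editTime(self):
        return self._timestamp

class Page(object):
    """Page caches revisions and delegates to the latest one."""
    def __init__(self, title):
        self.title = title
        self._revisions = {}  # revid -> Revision

    def latestRevision(self):
        return self._revisions[max(self._revisions)]

    def getRevision(self, revid):
        return self._revisions[revid]

    # Shortcuts to the latest revision:
    def userName(self):
        return self.latestRevision().userName()

    def editTime(self):
        return self.latestRevision().editTime()
```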
Also, am I missing something, and is there actually a good way of
working with arbitrary revisions that I've overlooked?
Thanks,
--
Santiago M. Mola
Jabber ID: cooldwind(a)gmail.com