Hi!
My old problem is that replace.py can't write the pages to work on into a file on my disk. For years I have used a modified version that makes no changes but writes the titles of the involved pages to a subpage on Wikipedia in automated mode, and then I can make the replacements from that page much more quickly than directly from a dump or from the live Wikipedia. This is slow and generates plenty of dummy edits.
In other words, replace.py has a tool to get the titles from a file (-file) or from a wikipage (-links), but has no tool to generate this file.
Now I am ready to rewrite it. This way we can start it, and the bot will find all the possible articles to work on and save the titles without editing Wikipedia (and without artificial delay); meanwhile we can have lunch, run a marathon or sleep. Then we make the replacements from this list with -file.
My idea is that replace.py should have two new parameters:
-save writes the results into a new file instead of editing articles. It overwrites an existing file without notice.
-saveappend writes into a file or appends to an existing one.
OR:
-save writes and appends (primary mode)
-savenew writes and overwrites
The help is here: http://docs.python.org/howto/unicode.html#reading-and-writing-unicode-data So we have to import codecs. My script is:

articles = codecs.open('cikkek.txt', 'a', encoding='utf-8')
...
tutuzuzu = u'# %s\n' % page.aslink()    <-- needs rewriting to the new syntax
articles.write(unicode(tutuzuzu))       <-- needs further testing whether unicode() is really needed
articles.flush()
It works fine, except that '\n' is a Unix-style newline that has to be converted by lfcr.py to make the file readable with notepad.exe. This version uses a constant filename; it should be extended to take the filename from the command line.
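Something along these lines is what I have in mind (a rough sketch only, not final code; the function name and the command-line handling are still to be written, and writing '\r\n' explicitly would avoid the lfcr.py step):

import codecs

def open_title_file(filename, append=True):
    # 'a' appends to an existing file, 'w' overwrites it without notice
    if append:
        mode = 'a'
    else:
        mode = 'w'
    return codecs.open(filename, mode, encoding='utf-8')

# inside the bot loop, for every page that would be changed:
# articles = open_title_file('cikkek.txt', append=True)
# articles.write(u'# %s\r\n' % page.aslink())   # \r\n so Notepad can read it directly
# articles.flush()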
Your opinions before I begin?
Hi!
First, I am not an expert here; second, here are my thoughts:
- sounds good! :)
- the two cases (new, append) are not really needed if you just use append and delete the list yourself in the file browser (but this is a philosophical issue)
- for your save/append code, have a look at [1] and maybe [2] as well. [1] contains code quite similar to your proposal, and that code is already in use and works. As you can see from [2], a '.decode('latin-1')' is also needed for me; this may differ for you, since unicode is quite mysterious... ;))
Hope this helps a bit! Greetings
[1] https://fisheye.toolserver.org/browse/drtrigon/pywikipedia/dtbext/dtbext_bas... [2] https://fisheye.toolserver.org/browse/drtrigon/pywikipedia/sum_disc.py?r=HEA...
I am ready! I will upload it today, after a few more tests.
-save writes and appends (primary mode)
-savenew writes and overwrites
This is the final version. We can write the titles into a file, and this will make the work much faster and more comfortable! No more thousands of dummy edits to my subpage to prepare replacements, no more waiting for the server, no more spamming the bot's contribution list. This was my dream.
You can upload the finished list to a wikipage as a numbered list in one edit if you want to use it online with -links, or you can work from it directly with -file.
The script also writes a separator line after every 100 titles. This is useful if, e.g., you have 3000 titles to work on and, after processing 1000 items, you want to clean up your file and need to find the current title. I have done this many times; that's why I know it is important.
For that, I had to implement an editcounter. So why not use it when making the real replacements? Replace.py now writes the number of pages edited upon finishing its task or quitting with "q". This does not apply when quitting with Ctrl-C. There is a little bug here: the counter cannot check whether put_async _will_ be successful.
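Roughly, the title-saving part looks like this (only an illustrative sketch, not the exact code that was committed; 'generator' and 'articles' are the page generator and the codecs file object from the earlier mail, and the separator format is made up here):

titles = 0
for page in generator:
    articles.write(u'# %s\r\n' % page.aslink())
    titles += 1
    if titles % 100 == 0:
        # the separator line, so you can find your place again later
        articles.write(u'# ---------- %d ----------\r\n' % titles)
articles.flush()
print u'%d titles were saved.' % titles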
Enjoy. :-)
Bináris
On 23.10.2010 12:13, Bináris wrote:
For that, I had to implement an editcounter. So why not use it when making the real replacements? Replace.py now writes the number of pages edited upon finishing its task or quitting with "q". This does not apply when quitting with Ctrl-C. There is a little bug here: the counter cannot check whether put_async _will_ be successful.
you could try doing so by using threading.Event or similar (e.g. signal)
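A minimal sketch of that threading.Event idea (untested, and the names are only illustrative; the callback signature is the one documented for put_async):

import threading

saved = threading.Event()

def notify(page, error):
    # put_async calls this with (page, error) once the save attempt has finished
    if error is None:
        print u'%s saved' % page.title()
    saved.set()              # wake up whoever is waiting

# page.put_async(newtext, comment, callback=notify)
# saved.wait()               # blocks until the callback has fired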
Greetings
On 23 October 2010 12:13, Bináris wikiposta@gmail.com wrote:
quitting with "q". This does not apply when quitting with Ctrl-C. There is a little bug here: the counter cannot check whether put_async _will_ be successful.
It can.
def put_async(self, newtext, comment=None, watchArticle=None,
              minorEdit=True, force=False, callback=None):
    (...)
    callback: a callable object that will be called after the page put
        operation; this object must take two arguments: (1) a Page object,
        and (2) an exception instance, which will be None if the page was
        saved successfully.
Just make sure you know how to do multithreading, or this will come back to bite you. (No, incrementing a global variable is not the right way.)
Best regards, Merlijn
Thank you, I have read this, but here I reach the current limits of my knowledge. One day I will know how. I would appreciate any help. My version is at https://sourceforge.net/tracker/?func=detail&aid=3093682&group_id=93...
2010/10/23 Merlijn van Deen valhallasw@arctus.nl
Just make sure you know how to do multithreading, or this will come back to bite you. (No, incrementing a global variable is not the right way.)
Is there any reason or advantage to putting the page in asynchronous mode while I am using replace.py manually? I guess this should be changed to a normal put, and the put operation in -always mode could perhaps be done asynchronously, to save time for searching and processing the next pages, like interwiki.py does with its -async option.
Greetings
xqt
Thank you, thank you! That's a nice day. :-)))
2010/11/6 info@gno.de
I've submitted it (with some minor changes) in r8700. Thanks a lot.
2010/11/6 info@gno.de
Is there any reason or advantage to putting the page in asynchronous mode while I am using replace.py manually?
Yes, there is, definitely. It makes the useful work faster: I answer the yes/no/edit... question, say, 100 or 300 times very quickly with just the up arrow and Enter, then I go to have lunch or a bath or read a book while it saves. When the corrections are easy (for example spelling with a good fix, or changing some text manually), it will often keep saving for 15-30 minutes after I have finished. This was a great invention. With page.put() we would always have to wait for the save before getting the next article to edit.
I guess this should be changed to a normal put, and the put operation in -always mode could perhaps be done asynchronously, to save time for searching and processing the next pages, like interwiki.py does with its -async option.
I think the speed of automatic processing depends primarily on the put_throttle time, not on the process itself. Searching may be a factor: with -cat, -links and -file it is quick; with -xml it stops to search after every 20th page; but with a direct search of Wikipedia using -start it may take a lot of time to get the next 60 articles and no time at all to go to the next one within a 60-page batch.
2010/10/23 Merlijn van Deen valhallasw@arctus.nl
On 23 October 2010 12:13, Bináris wikiposta@gmail.com wrote:
quitting with "q". This does not apply when quitting with Ctrl-C. There is a little bug here: the counter cannot check whether put_async _will_ be successful.
It can.
def put_async(self, newtext, comment=None, watchArticle=None,
              minorEdit=True, force=False, callback=None):
    (...)
    callback: a callable object that will be called after the page put
        operation; this object must take two arguments: (1) a Page object,
        and (2) an exception instance, which will be None if the page was
        saved successfully.
Just make sure you know how to do multithreading, or this will come back to bite you. (No, incrementing a global variable is not the right way.)
I tried to learn and read up on this threading stuff in order to have a correct editcounter. (It was quite a while ago, so some of you may have forgotten the topic: I created an editcounter in the ReplaceRobot class of replace.py and was disappointed that my counter did not notice the failure of put_async.)
Now I created a callback function:

def lacika(self, page, err):
    if err is None:
        print u'%s saved successfully' % page.title()  # for debug
    else:
        print 9999999
        self.editcounter -= 1
        print self.editcounter
Now I ran four tests:
1. Successful save, it's OK.
2. After loading the page, while replace.py waits for my choice, I protected the page. Result:
1 page was changed.   <-- message from main process
Updating page [[Szerkesztő:BinBot/semmi]] via API
9999999   <-- just a test message from callback
0   <-- new self.editcounter, correct, but too late
Here comes my first problem. As far as I understood, I should use *join* to make the main process wait for put_async. But put_async is taken from wikipedia.py. How can I use *join* here, and where should I write it? Could someone write a simple example to make the bot wait for put_async?
Third test: I created an edit conflict, with the same result. (I made the bot write err instead of 9999999; it's OK.)
Fourth test: after loading the page, while the bot is waiting for my choice, I deleted the page. This seems to be a bug:
1 page was changed.
Updating page [[Szerkesztő:BinBot/semmi]] via API
Unknown Error. API Error code:missingtitle Information:The article you tried to edit doesn't exist
Szerkeszt<:BinBot/semmi saved successfully
First, the error is not handled; second, the callback got None as the error and reported a false success.
I made a big effort to understand this stuff with put_async and threading, but here is a point I can't get over.
I read a lot and understood that the means of waiting for a thread is join(), and that in wikipedia.py join must be called on _putthread, which is the Thread object. Now, wherever I write the line _putthread.join() (I tried it inside put_async, inside async_put and even in replace.py, which I know is not a good solution), it freezes my command window as if the thread never terminated. _putthread.join(time) waits for the given time, but that is not appropriate, only good for testing. Does any script really use this callback at all? Which one? Line 8054 in wikipedia.py says "an explicit end-of-Queue marker is needed", and this is supposed to be a call to async_put with None as the value of page. But I don't see this dummy call to async_put anywhere. Is that a bug, or do I just misunderstand? (Btw, Python 2.4 should be forgotten.)
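For reference, the usual shape of such an end-of-queue marker is a sentinel that is pushed onto the queue when the program shuts down; only then does the worker thread return, and only then does join() stop blocking. A generic sketch of the pattern (not the actual wikipedia.py code):

import Queue
import threading

put_queue = Queue.Queue()

def worker():
    while True:
        item = put_queue.get()
        if item is None:      # the explicit end-of-queue marker
            return            # the thread ends, so join() can return
        # ... here the real code would save the page ...

putthread = threading.Thread(target=worker)
putthread.start()
# ... put_queue.put(page) for every page to save ...
put_queue.put(None)           # without this, join() blocks forever
putthread.join()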
Please, I need your help.
Hi Binaris
I did not write any of the threaded stuff in wikipedia.py, but I have used it a couple of times. I think what you should do is provide a callable _object_ rather than a callback function. You can then iterate through the list of callback objects and look at the errors, if there are any. Here is a sample program I wrote to illustrate the concept:
import wikipedia as pywikibot
from time import sleep

pages = [
    'User:HRoestBot/CallbackTest1',
    'User:HRoestBot/CallbackTest2',
]

class CallbackObject(object):
    def __init__(self):
        self.done = False

    def __call__(self, page, error):
        self.page = page
        self.error = error
        self.done = True

Callbacks = []
for mypage in pages:
    print(mypage)
    callb = CallbackObject()
    page = pywikibot.Page(pywikibot.getSite(), mypage)
    Callbacks.append(callb)
    page.put_async('some text', callback=callb)

# Waiting until all pages are saved on Wikipedia
while True:
    if all([c.done for c in Callbacks]):
        break
    print "Still Waiting"
    sleep(5)

# Now we can look at the errors
for obj in Callbacks:
    print obj.page, obj.error
    if obj.error is not None:
        pass  # do something to handle errors
The output of such a program may then be
$ python test.py
unicode test: triggers problem #3081100
HRoestBot/CallbackTest1
HRoestBot/CallbackTest2
Sleeping for 4.0 seconds, 2012-02-24 09:32:57
Still Waiting
Still Waiting
Updating page [[HRoestBot/CallbackTest1]] via API
Still Waiting
Sleeping for 19.3 seconds, 2012-02-24 09:33:18
Still Waiting
Updating page [[HRoestBot/CallbackTest2]] via API
Still Waiting
[[de:HRoestBot/CallbackTest1]] An edit conflict has occured.
[[de:HRoestBot/CallbackTest2]] An edit conflict has occured.
hr@hr:~/projects/private/pywikipedia_gitsvn$
At least that is how I do it; I hope it helps to understand. You can also use pywikibot.page_put_queue.qsize() and pywikibot.page_put_queue.empty() to check whether the queue is empty or not, but this might still lead to problems, because the page is fetched from the queue and *then* page.put() is called on it. So until page.put() finishes, the queue will be empty even though the bot is still putting the page. See the function async_put(); it seems much safer to me to rely on the callback objects to be sure that all the put calls are done.
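A sketch of that queue-polling variant, with the caveat noted as a comment (page_put_queue, qsize() and empty() are the ones mentioned above; the rest is only illustrative):

import wikipedia as pywikibot
from time import sleep

def wait_for_put_queue():
    while not pywikibot.page_put_queue.empty():
        print 'Still %d page(s) queued...' % pywikibot.page_put_queue.qsize()
        sleep(5)
    # Caveat: an empty queue does not guarantee the last page has been saved,
    # because async_put() takes the page off the queue before calling put() on
    # it - the callback objects above remain the safer check.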
You can also look at the _flush() method in wikipedia.py to see how it determines whether all pages have been put and it is safe to exit.
Hannes
I created an editcounter in the ReplaceRobot class of replace.py and was disappointed that my counter did not notice the failure of put_async.
Now I created a callback function:

def lacika(self, page, err):
    if err is None:
        print u'%s saved successfully' % page.title()  # for debug
    else:
        print 9999999
        self.editcounter -= 1
        print self.editcounter
Note that this is not thread-safe (or maybe it is, thanks to the GIL, but self.editcounter could in principle change between reading it, subtracting one, and writing it back).
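A sketch of one thread-safe variant, assuming the counter lives on the bot object and is only updated from the callback, where the real outcome of the save is known (the names are only illustrative, this is not framework code):

import threading

class EditCounterSketch(object):
    def __init__(self):
        self.editcounter = 0
        self._lock = threading.Lock()

    def count_edit(self, page, err):
        # callback for put_async: runs after the save attempt has finished
        self._lock.acquire()
        try:
            if err is None:
                self.editcounter += 1
        finally:
            self._lock.release()

# usage: page.put_async(newtext, comment, callback=bot.count_edit)
# and print bot.editcounter once all saves are done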
Now I ran four tests:
1. Successful save, it's OK.
2. After loading the page, while replace.py waits for my choice, I protected the page. Result:
1 page was changed.   <-- message from main process
Updating page [[Szerkesztő:BinBot/semmi]] via API
9999999   <-- just a test message from callback
0   <-- new self.editcounter, correct, but too late
What exactly is the problem here? This is exactly the expected behaviour: it runs the callback /after/ the page has been saved. It's not a time machine. If you need an exact edit count during the main loop, use .put(). Alternatively, move your logic into the callback function - that is the only place where the correct information is available.
Here comes my first problem. As far as I understood, I should use *join* to make the main process wait for put_async. But put_async is taken from wikipedia.py. How can I use *join* here, and where should I write it? Could someone write a simple example to make the bot wait for put_async?
Why would you want to join() the thread? The bot always waits for all save operations to finish.
4th test: After loading the page, while the bot is waiting for my choice, I delete the page. This seems to be a bug:
1 page was changed.
Updating page [[Szerkesztő:BinBot/semmi]] via API
Unknown Error. API Error code:missingtitle Information:The article you tried to edit doesn't exist
Szerkeszt<:BinBot/semmi saved successfully
So did it re-create the page? or...?
Best, Merlijn
I've submitted it (with some minor changes) in r8700. Thanks a lot.
xqt