Archivebot and header1


Bináris

26 Sep 2010

12:32 p.m.

A user on huwiki regularly runs this script to archive many talk pages and community pages: http://hu.wikipedia.org/wiki/Szerkeszt%C5%91:Cherybot/archivebot_hu.py This is a modified version of archivebot.py. We have a community page: http://hu.wikipedia.org/wiki/Wikip%C3%A9dia:B%C3%BCrokrat%C3%A1k_%C3%BCzen%C5%91fala It has 5 first-level headers (=title=), which is unusual. When the bot archives a section above a =title= line, the =title= line goes to the archive too. I was asked to help correct this behavior, but I am not familiar with the script and have never run archivebot.py.

My question is: has any other wiki seen a problem like this, is there a bugfix for it in the current version, or is it only our problem?

-- Bináris
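
The behavior asked for above can be sketched as follows. This is a minimal illustration, not the logic of archivebot.py or archivebot_hu.py: the function name `split_threads` is hypothetical, and it assumes that archivable threads start at level-2 headers (==title==) while level-1 header lines must always stay on the page.

```python
import re

# Matches a heading line such as "= Title =" or "== Thread ==".
# The backreference \1 requires the same run of "=" on both sides.
HEADING_RE = re.compile(r"^(={1,6})[ \t]*(.+?)[ \t]*\1[ \t]*$")


def split_threads(wikitext):
    """Split a page into (kept_lines, threads).

    A thread starts at a level-2 heading and ends at the next heading
    of level 1 or 2. Level-1 heading lines are kept on the page and
    never become part of a thread. (Hypothetical helper, see lead-in.)
    """
    kept = []       # lines that stay on the page
    threads = []    # archivable threads, each as one string
    current = None  # lines of the thread being collected, if any

    for line in wikitext.splitlines():
        match = HEADING_RE.match(line)
        level = len(match.group(1)) if match else None
        if level == 1:
            # A level-1 heading closes the running thread but stays put.
            if current is not None:
                threads.append("\n".join(current))
                current = None
            kept.append(line)
        elif level == 2:
            # A level-2 heading closes the running thread and opens a new one.
            if current is not None:
                threads.append("\n".join(current))
            current = [line]
        elif current is not None:
            # Body text or a deeper heading belongs to the running thread.
            current.append(line)
        else:
            kept.append(line)

    if current is not None:
        threads.append("\n".join(current))
    return kept, threads
```

With this split, a thread that sits directly above a =title= line ends before that line, so archiving the thread leaves the first-level header on the page.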


Dr. Trigon

10 Oct 2010

3:25 p.m.

To be honest, I do not really know the archive bot; as far as I know, it uses various regexes to determine the headings.

My bot does the same, but I had several serious issues with regexes, so I wrote my own, more sophisticated method to retrieve a page's headings. I don't know if this is of any help to you, but if you are interested, please have a look at:

https://fisheye.toolserver.org/browse/drtrigon/pywikipedia/dtbext/dtbext_wikipedia.py?r=HEAD

or

https://fisheye.toolserver.org/browse/~raw,r=44/drtrigon/pywikipedia/dtbext/dtbext_wikipedia.py

and look for the 'getSections' method.

By the way, this is something that should be committed to the framework anyway... ;))

Greetings
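
The message above does not say which regex issues occurred. One well-known limitation of line-based heading regexes is that MediaWiki does not treat heading markup inside HTML comments, `<nowiki>` or `<pre>` as headings, while a naive regex does. The sketch below illustrates only that one case; it is not the 'getSections' method, and the names `find_headings`, `IGNORED_RE` and `_blank` are hypothetical.

```python
import re

# Heading line with the same number of "=" on both sides.
HEADING_RE = re.compile(r"^(={1,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)

# Regions in which heading markup is not interpreted. Simplified on
# purpose: tags with attributes (e.g. <pre style=...>) are not handled.
IGNORED_RE = re.compile(
    r"<!--.*?-->|<nowiki>.*?</nowiki>|<pre>.*?</pre>",
    re.DOTALL | re.IGNORECASE,
)


def _blank(match):
    # Overwrite everything except newlines with spaces, so that line
    # breaks and character offsets stay identical to the original text.
    return re.sub(r"[^\n]", " ", match.group(0))


def find_headings(wikitext):
    """Return a list of (level, title, offset) for real headings only."""
    masked = IGNORED_RE.sub(_blank, wikitext)
    return [
        (len(m.group(1)), m.group(2), m.start())
        for m in HEADING_RE.finditer(masked)
    ]
```

Because the masked text has the same length as the original, the returned offsets can be used directly on the unmodified wikitext.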


Pywikipedia-l mailing list Pywikipedia-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l

Bináris

11 Oct 2010

3:41 a.m.

Thank you, I will forward it (though it seems the bot owner only runs the script, and nobody maintains it).

info＠gno.de

7:57 a.m.

give me that stuff ;)

xqt


Dr. Trigon

12:26 p.m.

Feel free to take any code from my bot at any time (I would be glad!):

https://fisheye.toolserver.org/browse/drtrigon/

In this case the code is in:

https://fisheye.toolserver.org/browse/~raw,r=44/drtrigon/pywikipedia/dtbext/...

From this module you need to copy the following into wikipedia.py (just ignore addAttributes()):

Page.getSections()
Page._getSectionByteOffset()
Page._findSection()

The small modification made to 'get' in

Page.get()

must also be added.

Because the MediaWiki software occasionally (and only on some pages) has trouble returning all of that data correctly, the function may return an empty list [] (but only if MediaWiki has problems). This should happen very rarely.

Greetings (and enjoy)
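
A caller may want to guard against the empty-list case mentioned above. The following is a minimal sketch under stated assumptions: `sections_with_retry` is a hypothetical helper, and `fetch_sections` stands in for any zero-argument callable such as a bound `getSections` method, whose exact signature is not shown in this thread.

```python
import time


def sections_with_retry(fetch_sections, retries=3, delay=0.0):
    """Call fetch_sections() until it returns a non-empty list.

    fetch_sections is a hypothetical zero-argument callable. If every
    attempt returns an empty list, [] is returned so that the caller
    can skip the page instead of acting on incomplete data.
    """
    for attempt in range(retries):
        sections = fetch_sections()
        if sections:
            return sections
        # Wait before the next attempt, but not after the last one.
        if attempt < retries - 1:
            time.sleep(delay)
    return []
```

For an archiving bot, skipping a page when the result is still empty is probably the safer choice, since moving text with unknown section boundaries could misplace headers.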

