A user on huwiki regularly runs this script to archive many talk pages and community pages: http://hu.wikipedia.org/wiki/Szerkeszt%C5%91:Cherybot/archivebot_hu.py This is a modified version of archivebot.py. We have a community page: http://hu.wikipedia.org/wiki/Wikip%C3%A9dia:B%C3%BCrokrat%C3%A1k_%C3%BCzen%C5%91fala This page has 5 first-level headers (=title=), which is unusual. When the bot archives a section above a =title= line, the =title= line goes to the archive, too. I was asked to help correct this behavior, but I am not familiar with the whole thing; I have never run archivebot.py.
The question is: was there any problem like this in another wiki, is there a bugfix for this in the fresh version, or is it only our problem?
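To make the reported behavior concrete, here is a minimal sketch. It is not the code of archivebot.py or archivebot_hu.py, and the names `heading_level` and `split_page` are made up for illustration. The assumption is that threads are level-2 sections (== thread ==): if a thread only ends at the next level-2 heading, a =title= line that follows it is swept into the thread. Treating level-1 headings as boundaries too keeps them on the page.

```python
import re

# A heading line: the same number of "=" on both sides of the title.
HEADING_RE = re.compile(r"^(=+)\s*(.+?)\s*\1\s*$")


def heading_level(line):
    """Return the heading level of a wikitext line, or 0 if it is no heading."""
    match = HEADING_RE.match(line)
    return len(match.group(1)) if match else 0


def split_page(text):
    """Split wikitext into a list of (kind, text) blocks.

    kind is "thread" for a level-2 section (a candidate for archiving)
    and "keep" for everything else, including level-1 heading lines.
    """
    blocks = []
    kind, current = "keep", []
    for line in text.splitlines(keepends=True):
        level = heading_level(line)
        if level in (1, 2):
            # Level-1 and level-2 headings both close the block before them,
            # so a =title= line never becomes part of the preceding thread.
            if current:
                blocks.append((kind, "".join(current)))
            kind = "thread" if level == 2 else "keep"
            current = [line]
        else:
            # Body text and deeper headings (level 3+) stay in the open block.
            current.append(line)
    if current:
        blocks.append((kind, "".join(current)))
    return blocks
```

Only blocks tagged "thread" would be candidates for archiving; "keep" blocks, including the =title= lines, stay where they are.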
To be honest, I do not really know the archive bot; as far as I know, it uses various regexes to determine the headings.
My bot does the same, but I had several serious issues using regex, so I wrote my own, more sophisticated method to retrieve a page's headings. I don't know if this is of any help to you, but if you are interested, please have a look at:
https://fisheye.toolserver.org/browse/drtrigon/pywikipedia/dtbext/dtbext_wikipedia.py?r=HEAD
or
https://fisheye.toolserver.org/browse/~raw,r=44/drtrigon/pywikipedia/dtbext/dtbext_wikipedia.py
and look for the 'getSections' method.
By the way, this is something that should be committed to the framework anyway... ;))
Greetings
On 26.09.2010 19:32, Bináris wrote:
Pywikipedia-l mailing list Pywikipedia-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
Thank you, I will forward it (though it seems the bot owner only runs the script and nobody maintains it).
give me that stuff ;)
xqt
----- Original Message -----
From: "Dr. Trigon" dr.trigon@surfeu.ch
To: Pywikipedia discussion list pywikipedia-l@lists.wikimedia.org
Date: 10.10.2010 22:25
Subject: Re: [Pywikipedia-l] Archivebot and header1
Feel free to take any part of code out of my bot (I would be glad!) at any time:
https://fisheye.toolserver.org/browse/drtrigon/
in this case the code is given in:
https://fisheye.toolserver.org/browse/~raw,r=44/drtrigon/pywikipedia/dtbext/...
From this module you need to copy
Page.getSections()
Page._getSectionByteOffset()
Page._findSection()
into wikipedia.py (just ignore addAttributes()). The small modification made to 'get' in
Page.get()
also has to be added.
Since the MediaWiki software occasionally (and depending on the page) has trouble returning all of that data correctly, the function may return an empty list [] (but only if MediaWiki has problems). This should happen very rarely.
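A caller may want to guard against that empty-list case. The following is only a sketch of one possible approach: `fetch_sections` is a hypothetical stand-in for a call such as Page.getSections() (its real signature and return format are not shown in this thread), and retrying before giving up is my assumption, not something the bot is known to do.

```python
import time


def sections_or_none(fetch_sections, retries=2, delay=0.0):
    """Return the first non-empty result of fetch_sections(), or None.

    fetch_sections is called at most retries + 1 times, sleeping `delay`
    seconds between attempts. None tells the caller to skip the page
    rather than treat it as having no sections.
    """
    for attempt in range(retries + 1):
        sections = fetch_sections()
        if sections:
            return sections
        if attempt < retries:
            time.sleep(delay)
    return None
```

A bot using this would skip archiving a page when the result is None and try again on its next run.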
Greetings (and enjoy)
On 11.10.2010 14:57, info@gno.de wrote: