I have been using "-exceptinsidetag:header" with replace.py. This was added by Daniel Herding in response to my request:
On Mon, Jun 30, 2008 at 23:11, Daniel Herding DHerding@gmx.de wrote:
This will exclude wikilinks and URLs. There are some more things that can be excluded; see the source code of the method replaceExcept() in wikipedia.py (look at the exceptionRegexes dictionary). I have just added a regular expression for section headers for you, so if you're running the SVN version, you can use this parameter:
-exceptinsidetag:header
I seem to recall this working in a nightly version a couple of years ago, but it's not working now - I'm not sure when it stopped. Is it possible to put it back in?
Thanks!
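The idea behind the exceptions Daniel describes can be sketched roughly as follows. This is only a minimal illustration, not the actual replaceExcept() code from wikipedia.py: the function name replace_except, the sample page text and the header pattern are all assumptions made for the example.

```python
import re


def replace_except(text, old, new, exceptions):
    """Replace regex `old` with `new`, leaving alone any match that
    overlaps a region matched by one of the `exceptions` regexes.

    Hypothetical helper for illustration; not pywikipedia's API.
    """
    # Spans of text that must not be touched (e.g. section headers).
    protected = [
        m.span()
        for pattern in exceptions
        for m in re.finditer(pattern, text, re.MULTILINE)
    ]

    def substitute(match):
        start, end = match.span()
        for p_start, p_end in protected:
            if start < p_end and p_start < end:  # overlaps a protected span
                return match.group(0)            # keep the original text
        return match.expand(new)

    return re.sub(old, substitute, text)


page = "== Water pump ==\nA pump moves water.\n"
header = r"^=+[^\n]*=+[ \t]*$"  # assumed header pattern, for the example only
print(replace_except(page, r"pump", "PUMP", [header]))
```

Only the "pump" in the body line is replaced; the one inside the header line is left as it was.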
I just worked it out, mostly... instead of: -exceptinsidetag:header
I used: -exceptinside:'=[^\n\r]*=[ \t]*'
And it worked!
There might be a small risk of false positives, so I tried various tweaks, e.g.
-exceptinside:'^=[^\n\r]*=[ \t]*$'
-exceptinside:'[\n\r]=[^\n\r]*=[ \t]*[\n\r]'
-exceptinside:'[\n\r]=[^\n\r]*='
But none worked... any suggestions?
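One possible explanation for the anchored variant, offered as an untested guess rather than a diagnosis of replace.py: in Python's re module, ^ and $ only match at the very start and end of the whole string unless multiline mode is on. The sketch below shows the difference on a made-up page; the sample text and the idea of putting the inline flag (?m) in front of the pattern are assumptions, and whether replace.py compiles -exceptinside patterns in a way that honours it has not been checked here.

```python
import re

# Made-up page text: the first line contains '=' characters that are
# not part of a header, the second line is a real section header.
page = "Intro text a=b and c=d.\n== Section title ==\nBody text.\n"


def matches(pattern, text):
    """Return every substring of `text` matched by `pattern`."""
    return [m.group(0) for m in re.finditer(pattern, text)]


# The pattern that "worked": it also matches '=...=' inside ordinary text.
unanchored = matches(r"=[^\n\r]*=[ \t]*", page)

# Anchored, default mode: ^ only matches at the start of the whole page.
anchored = matches(r"^=[^\n\r]*=[ \t]*$", page)

# Anchored, with the inline multiline flag: ^ and $ match at each line.
anchored_multiline = matches(r"(?m)^=[^\n\r]*=[ \t]*$", page)

print(unanchored)
print(anchored)
print(anchored_multiline)
```

On this sample, the unanchored pattern picks up "=b and c=" as a false positive, the plain anchored pattern matches nothing at all, and the (?m) version matches only the header line.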
On Thu, Dec 22, 2011 at 18:21, Chris Watkins chriswaterguy@appropedia.org wrote:
I have been using "-exceptinsidetag:header" with replace.py. [...]
-- Chris Watkins
Appropedia.org - Sharing knowledge to build rich, sustainable lives.
I do know that heading or section recognition inside the framework was mostly done using regexes (e.g. in the archive bot)... I have always felt that this is not reliable, since there are a lot of odd possible situations. Thus I wrote a 'getSections' method for DrTrigonBot, but I am not sure whether it could be of any use to you...
Anyway feel free to have a look at it and use it if you like... ;)
https://fisheye.toolserver.org/browse/drtrigon/pywikipedia/dtbext/dtbext_wik...
Greetings
On 22.12.2011 09:18, Chris Watkins wrote:
I just worked it out, mostly... [...]
_______________________________________________ Pywikipedia-l mailing list Pywikipedia-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
On Sat, Dec 24, 2011 at 20:41, Dr. Trigon dr.trigon@surfeu.ch wrote:
I do know that heading or section recognition inside the framework was mostly done using regexes (e.g. in the archive bot)... I have always felt that this is not reliable, since there are a lot of odd possible situations.
That's true - the regex solution that I gave works sometimes, but sometimes it still matches inside headers. Don't know why - haven't debugged it yet.
Thus I wrote a 'getSections' method for DrTrigonBot, but I am not sure whether it could be of any use to you...
Anyway feel free to have a look at it and use it if you like... ;)
https://fisheye.toolserver.org/browse/drtrigon/pywikipedia/dtbext/dtbext_wik...
Hmm... it's above me, as I don't speak Python. Not sure how to use it. :-(
Thanks anyway! Chris
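For anyone who wants to check which lines of a page would count as headers before running a replacement, a line-by-line scan is one simple alternative to a single page-wide regex. The sketch below is not DrTrigonBot's getSections method and is not part of pywikipedia; the name find_headings is made up, and it assumes the simplified rule that a header is a line with the same number of '=' signs (one to six) on each side.

```python
import re

# Simplifying assumption: a header has 1-6 '=' signs, the same number on
# both sides, and nothing else on the line except trailing spaces or tabs.
HEADING = re.compile(r"^(={1,6})\s*(.+?)\s*\1[ \t]*$")


def find_headings(text):
    """Return (line_number, level, title) for each header line in `text`."""
    headings = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = HEADING.match(line)
        if match:
            headings.append((number, len(match.group(1)), match.group(2)))
    return headings


sample = "Intro a=b.\n== First ==\nBody\n=== Sub ===  \nx = y\n"
for number, level, title in find_headings(sample):
    print(number, level, title)
```

Because each line is tested on its own, lines such as "Intro a=b." or "x = y" are not treated as headers, and the printed line numbers make it easier to see where a replacement would be skipped.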