If this is also an issue with section detection within pages, you could (if you like) also consider using the code in 'getSections' [1]...
[1] https://fisheye.toolserver.org/browse/drtrigon/pywikipedia/dtbext/dtbext_wik...
Greetings DrTrigon
On 03.09.2011 13:58, xqt@svn.wikimedia.org wrote:
http://www.mediawiki.org/wiki/Special:Code/pywikipedia/9494
Revision: 9494
Author:   xqt
Date:     2011-09-03 11:58:48 +0000 (Sat, 03 Sep 2011)
Log Message:
-----------
reverrevert r3147 due to bug #2989218; check for italic code in headings.
TODO: use a better regex to find it.

Modified Paths:
--------------
    trunk/pywikipedia/wikipedia.py

Modified: trunk/pywikipedia/wikipedia.py
--- trunk/pywikipedia/wikipedia.py 2011-09-03 11:17:47 UTC (rev 9493)
+++ trunk/pywikipedia/wikipedia.py 2011-09-03 11:58:48 UTC (rev 9494)
@@ -66,7 +66,6 @@
         within a non-wiki-markup section of text
     decodeEsperantoX: decode Esperanto text using the x convention.
     encodeEsperantoX: convert wikitext to the Esperanto x-encoding.
-    sectionencode: encode text for use as a section title in wiki-links.
     findmarker(text, startwith, append): return a string which is not part of text
     expandmarker(text, marker, separator): return marker string expanded
@@ -654,7 +653,7 @@
         self._contents = contents
         hn = self.section()
         if hn:
-            m = re.search("=+ *%s *=+" % hn, self._contents)
+            m = re.search("=+[ ']*%s[ ']*=+" % hn, self._contents)
             if verbose and not m:
                 output(u"WARNING: Section does not exist: %s" % self.aslink(forceInterwiki = True))
         # Store any exceptions for later reference
@@ -779,8 +778,8 @@
             else:
                 raise IsRedirectPage(redirtarget)
         if self.section():
-            # TODO: What the hell is this? Docu please.
-            m = re.search(".3D_*(.27.27+)?(.5B.5B)?_*%s_*(.5B.5B)?(.27.27+)?_*.3D" % re.escape(self.section()), sectionencode(pageInfo['revisions'][0]['*'],self.site().encoding()))
+            m = re.search("=+[ ']*%s[ ']*=+" % re.escape(self.section()),
+                          pageInfo['revisions'][0]['*'])
             if not m:
                 try:
                     self._getexception
@@ -920,8 +919,8 @@
             else:
                 raise IsRedirectPage(redirtarget)
         if self.section():
-            # TODO: What the hell is this? Docu please.
-            m = re.search(".3D_*(.27.27+)?(.5B.5B)?_*%s_*(.5B.5B)?(.27.27+)?_*.3D" % re.escape(self.section()), sectionencode(text,self.site().encoding()))
+            m = re.search("=+[ ']*%s[ ']*=+" % re.escape(self.section()),
+                          text)
             if not m:
                 try:
                     self._getexception
@@ -4140,8 +4139,7 @@
             page2._startTime = time.strftime('%Y%m%d%H%M%S', time.gmtime())
             if section:
-                m = re.search(".3D_*(.27.27+)?(.5B.5B)?_*%s_*(.5B.5B)?(.27.27+)?_*.3D"
-                              % re.escape(section), sectionencode(text,page2.site().encoding()))
+                m = re.search("=+[ ']*%s[ ']*=+" % re.escape(section), text)
                 if not m:
                     try:
                         page2._getexception
@@ -4302,7 +4300,7 @@
             # Use the data loading time.
             page2._startTime = time.strftime('%Y%m%d%H%M%S', time.gmtime())
             if section:
-                m = re.search(".3D_*(.27.27+)?(.5B.5B)?_*%s_*(.5B.5B)?(.27.27+)?_*.3D" % re.escape(section), sectionencode(text,page2.site().encoding()))
+                m = re.search("=+[ ']*%s[ ']*=+" % re.escape(section), text)
                 if not m:
                     try:
                         page2._getexception
@@ -4531,10 +4529,6 @@
             break
     return text
 
-def sectionencode(text, encoding):
-    """Encode text so that it can be used as a section title in wiki-links."""
-    return urllib.quote(text.replace(" ","_").encode(encoding)).replace("%",".")
-
 ######## Unicode library functions ########
 
 def UnicodeToAsciiHtml(s):
_______________________________________________ Pywikipedia-svn mailing list Pywikipedia-svn@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/pywikipedia-svn
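For readers puzzling over the removed pattern (the one marked "What the hell is this?"): it was searched in the page text after sectionencode had turned that text into a dot-encoded form, so .3D stands for "=", .27 for "'" and .5B for "[". The following is only a rough Python 3 sketch for comparison, not the code in wikipedia.py: the original is Python 2 and calls urllib.quote, and the names find_section_old, find_section_new and the sample text are made up for illustration.

```python
import re
from urllib.parse import quote


def sectionencode(text, encoding="utf-8"):
    """Python 3 port of the removed helper (assumption: utf-8 site encoding)."""
    return quote(text.replace(" ", "_").encode(encoding)).replace("%", ".")


# Both patterns are copied from the diff above.
OLD_PATTERN = ".3D_*(.27.27+)?(.5B.5B)?_*%s_*(.5B.5B)?(.27.27+)?_*.3D"
NEW_PATTERN = "=+[ ']*%s[ ']*=+"


def find_section_old(section, wikitext):
    # Old approach: encode the whole text, then search the encoded form.
    return re.search(OLD_PATTERN % re.escape(section), sectionencode(wikitext))


def find_section_new(section, wikitext):
    # New approach: search the raw wikitext, allowing quotes around the title.
    return re.search(NEW_PATTERN % re.escape(section), wikitext)


sample = "Intro\n== ''History'' ==\nBody\n== [[Links]] ==\n"

print(sectionencode("== ''History'' =="))  # .3D.3D_.27.27History.27.27_.3D.3D
for name in ("History", "Links", "Missing"):
    print(name,
          bool(find_section_old(name, sample)),
          bool(find_section_new(name, sample)))
```

In this sketch both patterns find the italic heading, and neither finds the heading whose title is wrapped in a link: the new pattern only allows spaces and apostrophes around the title, and the old one expects .5B after the title although closing brackets encode as .5D. That may be one case the "better regex" from the log message would have to cover.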
I guess this needs a second api call? The rewrite branch uses it.
BTW, I found your isRedirect() method. Since this API call is not reliable and gives false positives, I prefer checking the text itself. Sometimes the database contains invalid redirect flags.
Greetings xqt
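Checking the text itself, as described above, could look roughly like the sketch below. It is not the framework's implementation: redirect_target is a hypothetical helper, and it only knows the English #REDIRECT keyword, whereas real wikis also accept localized keywords.

```python
import re

# Assumption: only the English keyword is handled; localized redirect
# keywords used on non-English wikis are ignored in this sketch.
REDIRECT_RE = re.compile(r"\s*#REDIRECT\s*:?\s*\[\[([^\]|]+)", re.IGNORECASE)


def redirect_target(wikitext):
    """Return the redirect target named at the start of the text, or None."""
    m = REDIRECT_RE.match(wikitext)
    return m.group(1).strip() if m else None


print(redirect_target("#REDIRECT [[Main Page]]"))
print(redirect_target("Plain article text"))
```

Because the pattern is anchored at the start of the text, a page that merely mentions #REDIRECT further down is not treated as a redirect, whatever flag the database carries.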
----- Original Message ----
From: "Dr. Trigon" dr.trigon@surfeu.ch
To: pywikipedia-l@lists.wikimedia.org
Date: 03.09.2011 22:25
Subject: Re: [Pywikipedia-l] [Pywikipedia-svn] SVN: [9494] trunk/pywikipedia/wikipedia.py
If this is also an issue with section detection within pages, you could (if you like) also consider using the code in 'getSections' [1]...
[1] https://fisheye.toolserver.org/browse/drtrigon/pywikipedia/dtbext/dtbext_wikipedia.py?hb=true
Greetings DrTrigon
Pywikipedia-l mailing list Pywikipedia-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
> I guess this needs a second api call? The rewrite branch uses it.
Yes, but not only that! It does a lot more, since it matches the wiki text (and the headings in it) against the 'anchors' (and other data) returned by the API call. Additionally, it determines the byte offset within the wiki text, instead of the byte offset given by 'parse', which is relative to the parsed text (and quite useless in my opinion ;). But since the wiki text can also contain wrong syntax or some notorious edge cases, the function may fail to resolve a section. (In fact this is true for a lot of functions within the framework, e.g. if the network connection is down.)
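To make the byte-offset point concrete, here is a small sketch that lists the headings of a wiki text together with their character and byte offsets in that text. It is not the getSections code and makes no API call at all; list_headings, the heading pattern and the utf-8 default are assumptions chosen for illustration.

```python
import re

# Assumption: a heading is a line of the form "== Title ==" with the same
# number of "=" signs (1 to 6) on both sides.
HEADING_RE = re.compile(r"^(={1,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def list_headings(wikitext, encoding="utf-8"):
    """Return (level, title, char_offset, byte_offset) for each heading."""
    headings = []
    for m in HEADING_RE.finditer(wikitext):
        # The byte offset is the encoded length of everything before the heading.
        byte_offset = len(wikitext[:m.start()].encode(encoding))
        headings.append((len(m.group(1)), m.group(2), m.start(), byte_offset))
    return headings


text = "Z\u00fcrich\n== Geschichte ==\nText\n=== Details ===\n"
for heading in list_headings(text):
    print(heading)
```

The two offsets differ as soon as a non-ASCII character precedes the heading, which is why an offset reported relative to some other text (such as the parsed output) cannot simply be reused on the wiki text.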
> BTW, I found your isRedirect() method. Since this API call is not reliable and gives false positives, I prefer checking the text itself. Sometimes the database contains invalid redirect flags.
Thanks for the hint. I have never had problems with this, but I will have to consider it!
Greetings
pywikipedia-l@lists.wikimedia.org