Hi,
\b in a regex treats the letter "é" (which is a valid Hungarian letter) as a word boundary. Can I prevent this behaviour with some kind of setting?
On Jul 28, 2013 4:13 AM, "Bináris" wikiposta@gmail.com wrote:
> \b in a regex treats the letter "é" (which is a valid Hungarian letter) as
> a word boundary.
> Can I prevent this behaviour with some kind of setting?
I don't think I've ever had to worry about that issue myself, so I don't know the solution offhand. These links should help (I just googled "python regex locale"):
http://docs.python.org/2/library/re.html#re.LOCALE
http://docs.python.org/2/howto/regex.html#compilation-flags
http://stackoverflow.com/questions/12240260/umlauts-in-regexp-matching-via-l...
-Jeremy
On 28 July 2013 10:12, Bináris wikiposta@gmail.com wrote:
> Hi,
> \b in a regex treats the letter "é" (which is a valid Hungarian letter) as a word boundary. Can I prevent this behaviour with some kind of setting?
Simple ASCII:
>>> re.findall(r".+?\b", "bla bla bla")
['bla', ' ', 'bla', ' ', 'bla']

Incorrect:
- no re.UNICODE flag, bytestring
>>> re.findall(r".+?\b", "bléa bléa bléa")
['bl', '\xc3\xa9', 'a', ' ', 'bl', '\xc3\xa9', 'a', ' ', 'bl', '\xc3\xa9', 'a']
- no re.UNICODE flag, unicode string
>>> re.findall(r".+?\b", u"bléa bléa bléa")
[u'bl', u'\xe9', u'a', u' ', u'bl', u'\xe9', u'a', u' ', u'bl', u'\xe9', u'a']
- re.UNICODE flag, bytestring
>>> re.findall(r".+?\b", "bléa bléa bléa", re.UNICODE)
['bl\xc3', '\xa9', 'a', ' ', 'bl\xc3', '\xa9', 'a', ' ', 'bl\xc3', '\xa9', 'a']

Correct:
- both the re.UNICODE flag and a unicode string
>>> re.findall(r".+?\b", u"bléa bléa bléa", re.UNICODE)
[u'bl\xe9a', u' ', u'bl\xe9a', u' ', u'bl\xe9a']
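
If the text might arrive as a bytestring, the working combination can be wrapped in a small helper that decodes first and always passes re.UNICODE. This is only a sketch: the name find_chunks and the UTF-8 default are assumptions for illustration, not part of pywikipedia or the re module.

```python
import re

# Same pattern as in the examples: shortest run of characters up to a word boundary.
CHUNK_RE = re.compile(r".+?\b", re.UNICODE)


def find_chunks(text, encoding="utf-8"):
    """Hypothetical helper: split text at word boundaries, Unicode-aware.

    Bytestrings are decoded first (UTF-8 is an assumed default), so the
    pattern always runs on a unicode string with re.UNICODE set.
    """
    if isinstance(text, bytes):
        text = text.decode(encoding)
    return CHUNK_RE.findall(text)


# u"bl\xe9a" is "bléa"; the escape keeps the example independent of file encoding.
chunks = find_chunks(u"bl\xe9a bl\xe9a bl\xe9a")
print(chunks == [u"bl\xe9a", u" ", u"bl\xe9a", u" ", u"bl\xe9a"])  # True

# The UTF-8 encoded bytestring gives the same result once it is decoded.
print(find_chunks(b"bl\xc3\xa9a bl\xc3\xa9a") == [u"bl\xe9a", u" ", u"bl\xe9a"])  # True
```

Decoding at the boundary keeps the regex itself unchanged; only the input type is normalised before matching.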
Hope this helps!
Merlijn
pywikipedia-l@lists.wikimedia.org