Regex question - Pywikipedia-l - lists.wikimedia.org


Regex question

Bináris

28 Jul 2013

4:12 p.m.

Hi,

\b in a regex treats the letter "é" (which is a valid Hungarian letter) as a word boundary. Can I prevent this behaviour with some kind of setting?

-- Bináris

Jeremy Baron

28 Jul 2013

4:38 p.m.

On Jul 28, 2013 4:13 AM, "Bináris" wikiposta@gmail.com wrote:

\b in a regex treats the letter "é" (which is a valid Hungarian letter) as a word boundary. Can I prevent this behaviour with some kind of setting?

I don't think I've ever had to worry about that issue myself, so I don't know the solution offhand. These links should help (I just googled "python regex locale"):

http://docs.python.org/2/library/re.html#re.LOCALE

http://docs.python.org/2/howto/regex.html#compilation-flags

http://stackoverflow.com/questions/12240260/umlauts-in-regexp-matching-via-l...

-Jeremy

Merlijn van Deen

10 Aug 2013

9:22 p.m.

On 28 July 2013 10:12, Bináris wikiposta@gmail.com wrote:

Hi,

\b in a regex treats the letter "é" (which is a valid Hungarian letter) as a word boundary. Can I prevent this behaviour with some kind of setting?

Simple ASCII:

>>> import re
>>> re.findall(r".+?\b", "bla bla bla")
['bla', ' ', 'bla', ' ', 'bla']

Incorrect:

- no re.UNICODE flag, bytestring

>>> re.findall(r".+?\b", "bléa bléa bléa")
['bl', '\xc3\xa9', 'a', ' ', 'bl', '\xc3\xa9', 'a', ' ', 'bl', '\xc3\xa9', 'a']

- no re.UNICODE flag, unicode string

>>> re.findall(r".+?\b", u"bléa bléa bléa")
[u'bl', u'\xe9', u'a', u' ', u'bl', u'\xe9', u'a', u' ', u'bl', u'\xe9', u'a']

- re.UNICODE flag, bytestring

>>> re.findall(r".+?\b", "bléa bléa bléa", re.UNICODE)
['bl\xc3', '\xa9', 'a', ' ', 'bl\xc3', '\xa9', 'a', ' ', 'bl\xc3', '\xa9', 'a']

Correct:

- both the re.UNICODE flag and a unicode string

>>> re.findall(r".+?\b", u"bléa bléa bléa", re.UNICODE)
[u'bl\xe9a', u' ', u'bl\xe9a', u' ', u'bl\xe9a']
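
For anyone reading this on Python 3 (an assumption; the session above is Python 2), the same cases can be sketched roughly as follows. There, str patterns are Unicode-aware by default, and the re.ASCII flag brings back the ASCII-only behaviour. The names PATTERN, TEXT and the *_tokens variables are only illustrative, not part of pywikipedia.

```python
import re

PATTERN = r".+?\b"
TEXT = "bl\u00e9a bl\u00e9a bl\u00e9a"  # "bléa bléa bléa"

# Python 3 str patterns are Unicode-aware by default,
# so "é" counts as a word character and \b does not split on it.
unicode_tokens = re.findall(PATTERN, TEXT)

# re.ASCII limits \w and \b to ASCII letters, digits and underscore,
# which reproduces the splitting described in the question.
ascii_tokens = re.findall(PATTERN, TEXT, re.ASCII)

# A bytes pattern only treats ASCII bytes as word characters,
# so the two UTF-8 bytes of "é" are seen as non-word bytes.
raw = TEXT.encode("utf-8")
byte_tokens = re.findall(PATTERN.encode("ascii"), raw)

# Decoding to str before matching restores the Unicode-aware result.
decoded_tokens = re.findall(PATTERN, raw.decode("utf-8"))

print(unicode_tokens)
print(ascii_tokens)
print(byte_tokens)
print(decoded_tokens)
```

So the rule stays the same in both versions: decode the text first, then match with a Unicode-aware pattern.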

Hope this helps!

Merlijn
