PCRE unicode mode performances? - Wikitech-l

2 Jul 2013


Hello,
AbuseFilter does not match word boundaries in Devanagari script. This is
logged at https://bugzilla.wikimedia.org/46773 (with some unit test
results attached).
The root cause is that the regex patterns are not compiled in Unicode mode
(the 'u' regexp flag), so \b only recognizes ASCII word characters.
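
To illustrate the mechanism, here is a minimal sketch. It uses Python's re module as a stand-in, not PCRE or the actual AbuseFilter code, on the assumption that a byte-oriented match behaves much like preg_match without the 'u' flag. The helper name matches_whole_word and the sample strings are made up for the example.

```python
import re


def matches_whole_word(word, text, unicode_mode):
    """Return True if `word` occurs in `text` delimited by \\b on both sides.

    unicode_mode=True  -> match on str, so \\b knows about Unicode letters.
    unicode_mode=False -> match on UTF-8 bytes, so \\b only knows ASCII
                          word characters (roughly what happens without
                          a Unicode flag; this is an assumption about the
                          analogy, not a statement about PCRE internals).
    """
    pattern = r"\b" + re.escape(word) + r"\b"
    if unicode_mode:
        return re.search(pattern, text) is not None
    # Byte mode: every byte of a Devanagari character is >= 0x80 and is
    # therefore a non-word byte, so no boundary exists next to a space.
    return re.search(pattern.encode("utf-8"), text.encode("utf-8")) is not None


if __name__ == "__main__":
    sample = "जय भारत महान"
    print(matches_whole_word("भारत", sample, unicode_mode=True))
    print(matches_whole_word("भारत", sample, unicode_mode=False))
    print(matches_whole_word("spam", "buy spam now", unicode_mode=False))
```

In byte mode the ASCII word is still found, but the Devanagari word is not, because both sides of each would-be boundary are non-word bytes.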
The fix would be to set the preg_match calls in AbuseFilter to Unicode mode,
but I am worried about the performance implications. I once wrote a
patch that used Unicode properties, and it made the parser
significantly slower.
Maybe the AbuseFilter code path is not that performance-critical :)
Any thoughts?
-- 
Antoine "hashar" Musso