Hello,
AbuseFilter does not match word boundaries in Devanagari script. This is logged at https://bugzilla.wikimedia.org/46773 (which has some unit test results attached).
The root cause is that the regex patterns are not compiled in Unicode mode (the 'u' regexp flag), so \b is being dumb: it does not treat Devanagari characters as word characters.
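
To illustrate the difference, here is a minimal sketch in Python rather than PHP, so it is only an analogy for what preg_match does with and without the 'u' flag. The sample sentence and the variable names are made up for the example. The sample word uses only base consonants, since regex engines may differ in how they treat combining vowel signs.

```python
import re

# Hypothetical sample text; "कमल" appears as a standalone word.
text = "यह कमल है"
pattern = r"\bकमल\b"

# Python str patterns are Unicode-aware by default, so Devanagari
# letters count as word characters and \b finds the word edges.
unicode_match = re.search(pattern, text)

# re.ASCII restricts \w (and therefore \b) to [a-zA-Z0-9_]. Both the
# space and the Devanagari letter are then "non-word" characters, so
# no boundary exists between them and the search fails.
ascii_match = re.search(pattern, text, re.ASCII)

print(bool(unicode_match), bool(ascii_match))  # True False
```

The ASCII-only case is roughly what the bug report describes: the word is present, but the pattern never matches because no boundary is detected around it.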
The fix would be to set the preg_match calls in AbuseFilter to Unicode mode, but I am worried about the performance implications. I once wrote a patch that used Unicode properties, and it made the parser significantly slower.
Maybe the AbuseFilter code path is not that critical for performance :) Any thoughts?
wikitech-l@lists.wikimedia.org