[Mediawiki-l] Extension:FileIndexer has issue with accented characters

Francois Piette francois.piette at overbyte.be
Tue Feb 24 13:25:18 UTC 2009


Hi!

I have installed the new variant of Extension:FileIndexer
 (http://www.mediawiki.org/wiki/Extension_talk:FileIndexer#New_Variant) by
Ramon Dohle (raZe) on my version 1.12 wiki, and it works well for English text.
When I upload a PDF file containing French accented characters such as
e-acute ("é"), they are wrongly indexed and wrongly shown on the file upload page.

I've looked inside the wiki database (table wikiprefix_searchindex, column
si_text) and found that an e-acute is stored as the string "u8c3a9" for
any standard page, while it is stored as "u8efbfbd" for the uploaded PDF
entry. In fact, every accented character is stored as "u8efbfbd"! Of
course, searching doesn't work with such a character substitution.

"u8c3a9" is "u8" followed by C3 A9, the UTF-8 encoding of e-acute. I'm not
sure about "u8efbfbd", but it seems to be a kind of placeholder.
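
To make the two strings easier to compare, here is a small Python sketch that
decodes them. It assumes the si_text tokens are simply "u8" followed by the
hex of the UTF-8 bytes, which matches what I see for "u8c3a9". The helper
names decode_token and encode_char are made up for this example. The last
part is only a guess at the cause: it assumes the text extracted from the PDF
is Latin-1 and gets treated as UTF-8 somewhere along the way, which I have
not verified.

```python
def decode_token(token: str) -> str:
    """Turn a 'u8' + hex token (as seen in si_text) back into text."""
    if not token.startswith("u8"):
        raise ValueError("expected a token starting with 'u8'")
    return bytes.fromhex(token[2:]).decode("utf-8")


def encode_char(ch: str) -> str:
    """Inverse of decode_token: 'u8' followed by the UTF-8 bytes in hex."""
    return "u8" + ch.encode("utf-8").hex()


normal = decode_token("u8c3a9")    # token from a standard page
broken = decode_token("u8efbfbd")  # token from the uploaded PDF entry

print(f"U+{ord(normal):04X}")  # U+00E9, e-acute
print(f"U+{ord(broken):04X}")  # U+FFFD, the Unicode replacement character

# Assumption (not verified): the PDF text arrives as Latin-1 bytes.
latin1_bytes = "\u00e9".encode("latin-1")  # b'\xe9'
# 0xE9 alone is not valid UTF-8, so a lenient decoder substitutes U+FFFD.
as_utf8 = latin1_bytes.decode("utf-8", errors="replace")
print(encode_char(as_utf8))  # u8efbfbd
```

So EF BF BD is the UTF-8 encoding of U+FFFD, the character a decoder puts in
place of bytes it cannot interpret. If the Latin-1 guess is right, converting
the extracted text to UTF-8 before it is indexed should give "u8c3a9" again,
but that remains to be checked against what the extension actually does.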

Any advice appreciated.
--
francois.piette at overbyte.be
Author of ICS (Internet Component Suite, freeware)
Author of MidWare (Multi-tier framework, freeware)
http://www.overbyte.be



