Hi!
I have installed the new variant of Extension:FileIndexer (http://www.mediawiki.org/wiki/Extension_talk:FileIndexer#New_Variant) by Ramon Dohle (raZe) on my wiki (version 1.12), and it works well for English text. When I upload a PDF file containing French accented characters such as e-acute ("é"), they are wrongly indexed, and the wrong characters show on the file upload page.
I've looked inside the wiki database (table wikiprefix_searchindex, column si_text) and found that an e-acute is represented as the string "u8c3a9" for any standard page, while it is represented by "u8efbfbd" for the uploaded PDF entry. In fact, every accented character is represented by "u8efbfbd"! Of course, searching doesn't work with such a character substitution.
"u8c3a9" is actually the UTF-8 encoding of "é" (bytes C3 A9). I'm not sure about "u8efbfbd", but it seems to be a kind of placeholder.
Any advice appreciated.
--
francois.piette@overbyte.be
Author of ICS (Internet Component Suite, freeware)
Author of MidWare (Multi-tier framework, freeware)
http://www.overbyte.be
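For readers following the thread, the two strings can be reproduced with a few lines of Python. This is only a sketch: it assumes the index stores each non-ASCII character as "u8" followed by the hex of its UTF-8 bytes (which matches the "u8c3a9" observed above), and `index_token` is a hypothetical helper, not a MediaWiki function. Under that assumption, "u8efbfbd" is what you get for U+FFFD, the Unicode replacement character, which typically appears when bytes in a single-byte encoding such as ISO-8859-1 are decoded as UTF-8. The thread itself does not confirm that this is the cause.

```python
# Assumption: non-ASCII characters are stored as "u8" + hex of their UTF-8 bytes.
def index_token(ch: str) -> str:
    """Return 'u8' followed by the hex of the character's UTF-8 bytes."""
    return "u8" + ch.encode("utf-8").hex()

# "é" (U+00E9) correctly encoded as UTF-8 is the two bytes C3 A9.
good = index_token("\u00e9")

# The same letter in ISO-8859-1 is the single byte E9, which is not valid UTF-8.
latin1_bytes = "\u00e9".encode("latin-1")

# Decoding that byte as UTF-8 with errors="replace" yields U+FFFD.
mangled = latin1_bytes.decode("utf-8", errors="replace")
bad = index_token(mangled)

print(good)  # u8c3a9
print(bad)   # u8efbfbd
```

If this is what happens, every accented character collapses to the same token, which would explain why all of them appear as "u8efbfbd".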
As the notes on the page you link to say, the pdftotext package doesn't have a "-" sign, so this may be a similar issue.
regards
mark
On Tue, Feb 24, 2009 at 1:25 PM, Francois Piette <francois.piette@overbyte.be> wrote:
MediaWiki-l mailing list MediaWiki-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
Mark,
Thanks for your prompt reply.
> as the notes on the page you link to says, pdftotext package doesnt have a "-" sign so this may be an issue similar to this.
This is not related to the accented characters issue. FYI, the "-" sign is a problem when a word is split across two lines with a hyphen. This doesn't affect accented words more than others.
Combined, pdftotext and iconv correctly convert a PDF document containing French characters to UTF-8. The issue is elsewhere.
Any other ideas?
--
francois.piette@overbyte.be
Author of ICS (Internet Component Suite, freeware)
Author of MidWare (Multi-tier framework, freeware)
http://www.overbyte.be
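One way to narrow down where the bytes go wrong is to check whether the text handed to the indexer is still valid UTF-8 at each step. The following is a minimal Python sketch of such a check. The helper name `to_utf8` and the ISO-8859-1 fallback are assumptions made for illustration; they are not part of FileIndexer, pdftotext, or iconv.

```python
def to_utf8(raw: bytes, fallback: str = "iso-8859-1") -> bytes:
    """Return raw unchanged if it is valid UTF-8, else transcode it from fallback."""
    try:
        raw.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
        return raw
    except UnicodeDecodeError:
        # Assumption: text that is not valid UTF-8 is in the fallback encoding.
        return raw.decode(fallback).encode("utf-8")


# "café" as ISO-8859-1 bytes (E9 for "é") is transcoded to UTF-8 (C3 A9).
from_latin1 = to_utf8(b"caf\xe9")

# Bytes that are already valid UTF-8 pass through untouched.
already_utf8 = to_utf8(b"caf\xc3\xa9")

print(from_latin1.hex())   # 636166c3a9
print(already_utf8.hex())  # 636166c3a9
```

Running the same check on the output of the conversion step and on the text that reaches the search index would show which stage introduces the invalid bytes.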
----- Original Message -----
From: "Mark (Markie)" <newsmarkie@googlemail.com>
To: "MediaWiki announcements and site admin list" <mediawiki-l@lists.wikimedia.org>
Sent: Tuesday, February 24, 2009 2:35 PM
Subject: Re: [Mediawiki-l] Extension:FileIndexer has issue with accented characters
Hi there!
I'm still looking for a solution to the problem explained in my original message. Any advice is really welcome.
--
francois.piette@overbyte.be
The author of the freeware multi-tier middleware MidWare
The author of the freeware Internet Component Suite (ICS)
http://www.overbyte.be
----- Original Message -----
From: "Francois Piette" <francois.piette@overbyte.be>
To: <mediawiki-l@lists.wikimedia.org>
Sent: Tuesday, February 24, 2009 2:25 PM
Subject: [Mediawiki-l] Extension:FileIndexer has issue with accented characters