[Mediawiki-l] Extension:FileIndexer has issue with accented characters

Francois Piette francois.piette at overbyte.be
Tue Feb 24 13:58:21 UTC 2009


Mark,

Thanks for your prompt reply.

> as the notes on the page you link to say, the pdftotext
> package doesn't have a "-" sign, so this may be a
> similar issue.

This is not related to the accented characters issue. FYI, the "-" sign is a
problem when a word is split across two lines with a hyphen. This doesn't
affect accented words more than others.

pdftotext and iconv combined correctly convert a PDF document containing
French characters to UTF-8. The issue is elsewhere.
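
To make the two index tokens from my original message (quoted below)
easier to compare, here is a minimal Python sketch. It assumes that a
token is simply "u8" followed by the hexadecimal UTF-8 bytes of the
character; the helper name decode_index_token is hypothetical and is not
part of MediaWiki or FileIndexer. It also shows one possible way, not a
confirmed cause, for an e-acute to end up as the second token: a Latin-1
byte read as UTF-8 with invalid sequences replaced.

```python
def decode_index_token(token: str) -> str:
    """Decode a token such as 'u8c3a9' into the character it stands for.

    Assumption: the token is 'u8' followed by the hex of the UTF-8 bytes.
    """
    if not token.startswith("u8"):
        raise ValueError("expected a token starting with 'u8'")
    return bytes.fromhex(token[2:]).decode("utf-8")


# Token seen for standard wiki pages: bytes C3 A9, i.e. U+00E9 (e-acute).
e_acute = decode_index_token("u8c3a9")

# Token seen for the uploaded PDF: bytes EF BF BD, i.e. U+FFFD,
# the Unicode replacement character.
placeholder = decode_index_token("u8efbfbd")

# In Latin-1, e-acute is the single byte E9, which is not valid UTF-8 on
# its own. Decoding it as UTF-8 with replacement yields U+FFFD.
latin1_bytes = "\u00e9".encode("latin-1")
misread = latin1_bytes.decode("utf-8", errors="replace")

print(f"u8c3a9   -> U+{ord(e_acute):04X}")
print(f"u8efbfbd -> U+{ord(placeholder):04X}")
print(f"Latin-1 E9 read as UTF-8 -> U+{ord(misread):04X}")
```

If that assumption holds, the text reaching the indexer would already
contain replacement characters, which would mean the accented bytes were
lost before the search index was written.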

Any other ideas?
--
francois.piette at overbyte.be
Author of ICS (Internet Component Suite, freeware)
Author of MidWare (Multi-tier framework, freeware)
http://www.overbyte.be

----- Original Message ----- 
From: "Mark (Markie)" <newsmarkie at googlemail.com>
To: "MediaWiki announcements and site admin list"
<mediawiki-l at lists.wikimedia.org>
Sent: Tuesday, February 24, 2009 2:35 PM
Subject: Re: [Mediawiki-l] Extension:FileIndexer has issue with
accented characters


as the notes on the page you link to say, the pdftotext package doesn't have a
"-" sign, so this may be a similar issue.

regards

mark


On Tue, Feb 24, 2009 at 1:25 PM, Francois Piette <
francois.piette at overbyte.be> wrote:

> Hi !
>
> I have installed the new variant of Extension:FileIndexer
> (http://www.mediawiki.org/wiki/Extension_talk:FileIndexer#New_Variant)
> from Ramon Dohle (raZe) on my version 1.12 wiki, and it works well for
> English text. When I upload a PDF file containing French accented
> characters such as e-acute ("é"), those are wrongly indexed and show on
> the file upload page.
>
> I've looked inside the wiki database (table wikiprefix_searchindex, column
> si_text) and found that an e-acute is represented as the string "u8c3a9"
> for any standard page, while it is represented by "u8efbfbd" for the
> uploaded PDF entry. Actually, any accented character is represented by
> "u8efbfbd"! Of course, searching doesn't work with such character
> substitution.
>
> "u8c3a9" is actually the UTF-8 code for e-acute (bytes C3 A9). I'm not
> sure about "u8efbfbd", but it seems to be a kind of placeholder.
>
> Any advice appreciated.
> --
> francois.piette at overbyte.be
> Author of ICS (Internet Component Suite, freeware)
> Author of MidWare (Multi-tier framework, freeware)
> http://www.overbyte.be
>
>
> _______________________________________________
> MediaWiki-l mailing list
> MediaWiki-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
>
