David and I had a discussion about moving ascii-folding to come before stemming on English Wikipedia. It seemed like a good idea, but we decided we should run some tests before implementing it, just to be sure.
Turns out it is a good idea!
Much more detail: https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Re-Ordering_Stemming_and_Ascii-Folding_on_English_Wikipedia
We won't deploy it until we deploy BM25 later in the year, since it requires a full re-index of English Wikipedia, as does BM25. That's something we should only do once.
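To illustrate why the order of the two filters matters, here is a minimal Python sketch. It is not the analysis chain used on English Wikipedia: `toy_stem` is a hypothetical two-rule stemmer invented for this example, and `fold` approximates ASCII folding by stripping combining marks. It only shows how a stemmer that does not recognise accented letters can produce a different term than it would for the unaccented spelling.

```python
import unicodedata


def fold(token: str) -> str:
    """Approximate ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(token: str) -> str:
    """Hypothetical toy stemmer: drop a final 's', then turn a final 'y' into 'i'.

    It only knows plain ASCII letters, so a final 'ÿ' is not treated as 'y'.
    """
    if token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]
    if token.endswith("y") and len(token) > 2:
        token = token[:-1] + "i"
    return token


def stem_then_fold(token: str) -> str:
    """Stemming first, folding second."""
    return fold(toy_stem(token.lower()))


def fold_then_stem(token: str) -> str:
    """Folding first, stemming second (the proposed order)."""
    return toy_stem(fold(token.lower()))


title = "Lou\u00ffs"  # "Louÿs", written with an escape to avoid encoding issues
query = "louys"

print(stem_then_fold(title), stem_then_fold(query))  # louy loui (no match)
print(fold_then_stem(title), fold_then_stem(query))  # loui loui (match)
```

With stemming first, the title and the unaccented query end up as different terms; with folding first, both reduce to the same term.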
—Trey
Trey Jones Software Engineer, Discovery Wikimedia Foundation
Thanks Trey!
This will certainly improve the intitle keyword a great deal, since it filters on the field with stems, and it should find pages that are currently missed because of this filter ordering (e.g. intitle:louys can't find User:Louÿs currently).
I think I'll do the same for French, which suffers from the same problem. IMO we should continue to work on this for other languages while we try to switch from asciifolding (Latin letters only) to ICU folding.
We may need some guidance for languages where removing diacritics can be counterproductive, and may need to blacklist some letters (e.g. for Finnish: is it appropriate to fold Ä or Ö?).
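The idea of blacklisting some letters from folding can be sketched as follows. This is a plain-Python illustration, not a Cirrus or Elasticsearch configuration; the function name `fold_except` and the set of protected letters (Ä and Ö, taken from the Finnish question above) are assumptions for the example, and which letters should really be protected for a given language is exactly the open question.

```python
import unicodedata

# Assumption for illustration: letters that must keep their diacritics.
PROTECTED = frozenset(unicodedata.normalize("NFC", "äöÄÖ"))


def fold_except(token: str, keep: frozenset = PROTECTED) -> str:
    """Strip diacritics from every character except those listed in `keep`."""
    out = []
    # NFC first, so a decomposed "a" + combining diaeresis is seen as "ä".
    for ch in unicodedata.normalize("NFC", token):
        if ch in keep:
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFKD", ch)
            out.append(
                "".join(c for c in decomposed if not unicodedata.combining(c))
            )
    return "".join(out)


print(fold_except("p\u00e4\u00e4 caf\u00e9"))  # pää cafe
```

Here é is folded to e while ä is left alone, so words that differ only in a protected letter stay distinct in the index.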
Note on accent folding: Cirrus always tries to prefer exact matches. A search for élément should always rank élément above element. Users who want only exact matches can force Cirrus to discard stems by wrapping the word in double quotes, e.g. "élément".
On 10/08/2016 at 16:07, Trey Jones wrote:
David and I had a discussion about moving ascii-folding to come before stemming on English Wikipedia. [...]
discovery mailing list discovery@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/discovery
I'm less sure that re-ordering would do the right thing for French. Presumably the French stemmer knows about accented characters and uses them. We should test and make sure. Maybe we need custom folding only for "unusual" accents in any given language (they are all unusual for English).
We can test French the same way we tested English, though, and be sure.
—Trey
Trey Jones Software Engineer, Discovery Wikimedia Foundation
On Wed, Aug 10, 2016 at 10:48 AM, David Causse <dcausse@wikimedia.org> wrote:
Thanks Trey! [...]
The problem with French is slightly different but leads to much the same symptoms: there is no ASCII folding configured currently.
This causes the same intitle problem: words with diacritics can't be found when the diacritics are omitted in the query.
For French, putting ASCII folding before the stemmer is certainly a bad idea, and IMO we should do ASCII folding after the stemmer (possibly using preserve original).
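A rough Python sketch of "fold after the stemmer, preserving the original" is shown below. It is an illustration under stated assumptions, not the French analysis chain: `toy_plural_stem` is a hypothetical stand-in for a real French stemmer, and `stem_then_fold` is an invented helper that emits both the stemmed token and its folded form, which is the general effect of a preserve-original option on a folding filter.

```python
import unicodedata
from typing import Callable, List


def fold(token: str) -> str:
    """Approximate ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_plural_stem(token: str) -> str:
    """Hypothetical stand-in for a real stemmer: drop a final plural 's'."""
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


def stem_then_fold(
    token: str,
    stem: Callable[[str], str],
    preserve_original: bool = True,
) -> List[str]:
    """Stem first (the stemmer still sees the accents), then fold.

    With preserve_original, the unfolded stem is kept next to the folded one,
    so exact (accented) matches remain possible.
    """
    stemmed = stem(token.lower())
    folded = fold(stemmed)
    if preserve_original and folded != stemmed:
        return [stemmed, folded]
    return [folded]


indexed = stem_then_fold("\u00c9l\u00e9ments", toy_plural_stem)  # "Éléments"
query = stem_then_fold("element", toy_plural_stem)

print(indexed)              # ['élément', 'element']
print(query[0] in indexed)  # True
```

A query typed without diacritics now matches through the folded term, while the accented stem is still indexed and can be used to prefer exact matches.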
On 10/08/2016 at 16:52, Trey Jones wrote:
I'm less sure that re-ordering would do the right thing for French. [...]
Thanks, all. I've moved both tickets to the sprint board; hopefully the test (https://phabricator.wikimedia.org/T142620) will give us a lot of answers for this ticket (https://phabricator.wikimedia.org/T141216) going forward.
Please let me know if I've missed anything! :)
Cheers,
Deb
-- Deb Tankersley Product Manager, Discovery IRC: debt Wikimedia Foundation
On Wed, Aug 10, 2016 at 9:04 AM, David Causse <dcausse@wikimedia.org> wrote:
The problem with french is slightly different but leads to somewhat the same problems [...]