Hi,
http://en-suggesty.wmflabs.org/suggest.html is updated with a score that integrates pageviews.
Pageviews solve most of the problems we encountered with the previous formula. Unfortunately, we now see some porn-related suggestions:
- x will suggest xxx
- po will suggest pornhub in 2nd position, just below poland. It is also ranked #6 for the query 'p'.
I just wanted to let you know about this and would like to know if it's something we should address.
Thanks for your feedback.
David.
Yeah, there are some weird (but apparently popular) suggestions there.
- coc -> cocaine first, and then coca cola
- peg -> pegging (sexual practice) is third, higher than anyone named peggy
- ob -> Osama bin Laden first, and Obama nowhere on the list
- oba -> gets Obama to 3rd place, but even searching for obama still has Osama bin Laden as the first result
- mur -> Murray-Darling steamboat people (what the heck???)
One other oddity: Searching for a digit works great for 1 through 9. But searching for 0 doesn't bring up ANY results, either for the suggester, or for prefix search. Doing an actual search on enwiki for 0 brings up a disambig page with lots of reasonable candidate results.
Kevin Smith Agile Coach, Wikimedia Foundation
discovery mailing list discovery@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/discovery
Hi!
Yeah, there are some weird (but apparently popular) suggestions there.
It's an interesting question whether we should or can do anything about it. On one hand, the potential for hilarity is obvious. On the other hand, if that's what people look for... On the third hand, obviously different people look for different things, so here one-size-fits-all may lead to weird results. Also, visits and searches are not exactly the same... Maybe we can get search stats or search click stats or referer stats or something like that instead? Another option proposed would be to assign negative category boosts, but then we would have to manually curate the "bad" categories.
One other oddity: Searching for a digit works great for 1 through 9. But searching for 0 doesn't bring up ANY results, either for the suggester, or for prefix search. Doing an actual search on enwiki for 0 brings up a disambig page with lots of reasonable candidate results.
That may be a bug. I wonder if we have an if ($search) check somewhere that causes it: since "0" is falsy in PHP, the query would be treated the same as "". A comparison like $a != "" would work, though.
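A minimal sketch of the suspected bug, modelled in Python because Python itself treats "0" as truthy. The helper php_truthy imitates PHP's string-to-boolean rule (both "" and "0" convert to false); buggy_should_search and fixed_should_search are hypothetical names, not CirrusSearch functions, and the real cause of the missing results for "0" has not been confirmed.

```python
def php_truthy(value: str) -> bool:
    """Model PHP's string-to-bool conversion: only "" and "0" are false."""
    return value not in ("", "0")


def buggy_should_search(query: str) -> bool:
    # Models `if ($search)`: the query "0" is rejected like an empty query.
    return php_truthy(query)


def fixed_should_search(query: str) -> bool:
    # Models `if ($search != "")`: only the empty query is rejected.
    return query != ""


for q in ["", "0", "1", "00"]:
    print(repr(q), buggy_should_search(q), fixed_should_search(q))
```

Under this model only the query "0" behaves differently between the two guards, which would match the symptom that 1 through 9 work and 0 does not.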
Hey David,
Thanks for starting this discussion!
On 22 January 2016 at 13:53, David Causse dcausse@wikimedia.org wrote:
http://en-suggesty.wmflabs.org/suggest.html is updated with a score that integrates pageviews.
Pageviews solve most of the problems we encountered in the previous formula unfortunately we now see some porn related suggestions.
- x will suggest xxx
- po will suggest pornhub just below poland in 2nd position. And is ranked #6 for the query 'p'
As of right now, neither of these queries does this any more. "x" now suggests "Xinjiang" as the top result, and "po" now suggests "Pope Francis" after "Poland"... which may or may not be more palatable than Pornhub, depending on your viewpoints and ideals! Generally, Wikipedians like to point out that Wikipedia is not censored (https://en.wikipedia.org/wiki/Wikipedia:What_Wikipedia_is_not#Wikipedia_is_not_censored). That said, it's still worth considering whether this is appropriate or not. I personally don't have much of a problem with the fact that certain search results might be a little offensive... but I do think that they're probably also not really that useful.
Given how volatile this has made our search results, my sense is that we're giving page view data too much weight in the ranking. Is it as simple as tweaking a coefficient so that page views are still taken into consideration but with lower weight, or do we need to do something more involved? I created T124722 (https://phabricator.wikimedia.org/T124722) to track this work, and added it to our list of blockers for a wider rollout of the suggester (https://phabricator.wikimedia.org/T121616).
Thanks!
Dan
For the purpose of this exercise I think that it is completely reasonable for staff/developers to play with the factors and make sure that this development does not cause offence. We want the focus to be on the tool, and what it can do; not start a bunfight and detract from the goal.
For full production, I do NOT think that it is reasonable that either staff or developers make the determination of what is or what is not offensive, and whether a term should or should not be displayed. That determination sits clearly with the community, and is part of a discussion to be had when the tool approaches full production and is handed over to the community. It is part of what the community can or will need to do.
All that said, page views as a raw number should not be the determinant of a suggestion. I will add a fuller comment to the phabricator ticket.
Regards, Billinghurst
On Mon, Jan 25, 2016 at 11:16 PM, billinghurst billinghurstwiki@gmail.com wrote:
All that said, page views as a raw number should not be the determinant of a suggestion. I will add a fuller comment to the phabricator ticket.
They aren't, and I hope no one was led to believe this was ever the intention. Page views are one factor. Currently the number of incoming wikilinks, outgoing wikilinks, external links, redirects, headings and the size of the article all have different weights. Page views are being added as another factor; the current WIP patch uses page views as ~23% of the final score (if my math is right).
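A rough sketch of what a weighted multi-factor score of this kind could look like. Only the factor list and the ~23% share for page views come from this thread; every other weight below is made up so that the total is 1.0, and the assumption that each factor is already normalized to [0, 1] is mine. This is not the formula from the WIP patch.

```python
from typing import Mapping

# Hypothetical weights: only the ~0.23 share for page views comes from the
# thread. The other values are invented so that the weights sum to 1.0.
WEIGHTS = {
    "incoming_links": 0.30,
    "outgoing_links": 0.10,
    "external_links": 0.10,
    "redirects": 0.10,
    "headings": 0.07,
    "size": 0.10,
    "page_views": 0.23,
}


def score(features: Mapping[str, float]) -> float:
    """Weighted sum of factors, each assumed to be normalized to [0, 1]."""
    return sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())


# Two made-up pages: one well linked but rarely viewed, one the opposite.
well_linked = {"incoming_links": 0.9, "outgoing_links": 0.6, "external_links": 0.5,
               "redirects": 0.7, "headings": 0.5, "size": 0.6, "page_views": 0.2}
popular = {"incoming_links": 0.3, "outgoing_links": 0.2, "external_links": 0.2,
           "redirects": 0.2, "headings": 0.3, "size": 0.3, "page_views": 1.0}

print(score(well_linked), score(popular))
```

With a linear combination like this, lowering the page view weight is a one-line change, but a page with very high views still gains up to the full page view share regardless of its other factors.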
Hello, happy to join the discussion.
I also think that phonetic search is a really good improvement; currently you often have to search on Google and then copy and paste.
I am also experimenting with Elasticsearch, and thanks to this thread I discovered that Wikipedia also uses it, through CirrusSearch. Could phonetic search be applied only to *link names* (not the text) for languages that currently have no phonetic support, with the results then mapped onto ES?
e.g. for Chinese: https://pypi.python.org/pypi/dragonmapper
Maybe ES also has its own support?
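A toy sketch of the idea: index only the titles under a romanized key and match the typed prefix against that key. The ROMANIZED table is hand-written for illustration and stands in for a real transliteration library; build_phonetic_index and suggest are hypothetical names, and nothing here reflects how CirrusSearch or Elasticsearch handle this.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hand-written toy table standing in for a real transliteration step.
ROMANIZED = {
    "北京": "beijing",
    "北海": "beihai",
    "上海": "shanghai",
    "广州": "guangzhou",
}


def build_phonetic_index(titles: Iterable[str]) -> Dict[str, List[str]]:
    """Map each romanized key to the titles that share it (homophones collide)."""
    index: Dict[str, List[str]] = defaultdict(list)
    for title in titles:
        key = ROMANIZED.get(title)
        if key is not None:
            index[key].append(title)
    return dict(index)


def suggest(index: Dict[str, List[str]], typed: str) -> List[str]:
    """Return every title whose romanized key starts with the typed prefix."""
    prefix = typed.lower()
    return sorted(t for key, titles in index.items()
                  if key.startswith(prefix) for t in titles)


index = build_phonetic_index(ROMANIZED)
print(suggest(index, "bei"))
```

Only titles are indexed here, which keeps the phonetic index small; ranking among the matches would still have to come from the regular score.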
Maybe it was only there as nuance; however, I was trying to say that ***raw*** pageview numbers themselves should not be the factor (whatever % of the total you apply). Some calculation based on pageviews together with other factors would be better, e.g. the order of magnitude of the pageviews, so that all pages in that range share a smoothing factor.
If you are saying that pageviews make up approximately a quarter of the score, that seems to be a very large share. Two typed letters, "po...", have many possible completions, and that pornhub comes up early due to the pageview factor is ... ummm... thought provoking. I would think that 1/4 of searches for "po..." are not for pornhub, though I am not aware that such data is available.
Regards, Billinghurst
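One way to read the order-of-magnitude suggestion above, as a small sketch. The function names and the cap max_magnitude = 7 are assumptions made for illustration; the thread does not specify how the smoothing would be done.

```python
def pageview_magnitude(views: int) -> int:
    """Order of magnitude of a view count: 0 for 0-9, 1 for 10-99, and so on."""
    if views < 0:
        raise ValueError("views must be non-negative")
    return len(str(views)) - 1


def smoothed_pageview_factor(views: int, max_magnitude: int = 7) -> float:
    """Scale the magnitude to [0, 1]; max_magnitude is an assumed cap."""
    return min(pageview_magnitude(views), max_magnitude) / max_magnitude


for views in (950, 12_000, 120_000, 950_000):
    print(views, pageview_magnitude(views), smoothed_pageview_factor(views))
```

Pages with 120,000 and 950,000 views fall into the same bucket here, so a page cannot outrank a neighbour on page views alone unless it is roughly ten times more viewed.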
On Tue, Jan 26, 2016 at 6:30 PM, Erik Bernhardson ebernhardson@wikimedia.org wrote:
They aren't, and I hope no one was led to believe this was ever the intention. Page views are one factor. Currently the number of incoming wikilinks, outgoing wikilinks, external links, redirects, headings and the size of the article all have different weights. Page views are being added as another factor; the current WIP patch uses page views as ~23% of the final score (if my math is right).
On 26/01/2016 11:20, billinghurst wrote:
I would think that 1/4 of searches for "po..." are not for pornhub, though I am not aware that such data is available.
Yes, that's the main problem we have today: the score is computed from document metadata (size, templates, headings, incoming_links... and now pageviews). Search usage is not part of the score: we suggest pages, not search queries.
Another problem I have today is that I don't have any good method to evaluate the quality of the formula. I've added a small page on wikitech that describes the formula[1]. It is the R script I use to briefly evaluate the score distribution before testing on en-suggesty. Note that this page is not necessarily updated with the latest params; gerrit[2] may contain more up-to-date params that match what you can see on en-suggesty. Another piece of data I failed to use is term statistics from the prefixsearch index[3]; it helps to see the level of ambiguity of a prefix according to its length.
Any suggestions to improve the method and/or the formula are very welcome.
Thanks!
[1] https://wikitech.wikimedia.org/wiki/User:DCausse/Completion_Suggester_And_Pa...
[2] https://gerrit.wikimedia.org/r/#/c/265771/
[3] https://wikitech.wikimedia.org/wiki/User:DCausse/Term_Stats_With_Cirrus_Dump
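As an illustration of prefix ambiguity by length, here is a small sketch that counts how many titles share each prefix. The function prefix_ambiguity and the five sample titles are hypothetical; the term statistics on the wikitech page come from the actual index and are not reproduced here.

```python
from collections import Counter
from typing import Dict, Iterable


def prefix_ambiguity(titles: Iterable[str], max_len: int = 3) -> Dict[int, float]:
    """For each prefix length, return the average number of titles per prefix."""
    lowered = [t.lower() for t in titles]
    result: Dict[int, float] = {}
    for n in range(1, max_len + 1):
        # Titles shorter than n characters are skipped for that length.
        counts = Counter(t[:n] for t in lowered if len(t) >= n)
        if counts:
            result[n] = sum(counts.values()) / len(counts)
    return result


sample = ["Poland", "Pope Francis", "Pornhub", "Portugal", "Xinjiang"]
print(prefix_ambiguity(sample))
```

A high average at short lengths means the prefix alone says little about intent, which is where the score has to do most of the work.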