New subject: [Analytics] An example of pageviews usage

16 Nov 2015

This is fascinating. We started experimenting with the completion 
suggester a few months ago.

The first goal was to increase recall and to use fuzzy lookups to 
handle small typos.
It quickly became clear that scoring is a critical part of this feature.
Some prefixes are already very ambiguous (mar, cha, list ...) and 
enabling fuzziness does not help there.
We tried to implement a score based on the data currently available 
(size, incoming_links, templates...) but this score amounts to "bigger 
is better".
This is why we were interested in pageviews as a way to add "popularity" 
to the score. Thanks for sharing this tool; it is very helpful for 
getting a quick look at what the results would be like.

I still don't know whether pageviews can be the only score component or 
whether we should combine them with other factors like "quality" and 
"authority".
My concerns with pageviews are:
- we certainly have outliers (caused by 3rd party tools/bots ...)
- what's the coverage of pageviews: i.e. in one month how many pages get 
0 pageviews?
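
Both concerns can be checked on a pageview dump before deciding anything.
Below is a minimal sketch, assuming a hypothetical dict of monthly counts
keyed by page title and a separate list of all page titles; the function
names and the percentile cap are illustrative choices, not something
Cirrus does today.

```python
import math

def zero_view_share(all_pages, monthly_views):
    """Fraction of pages with no recorded pageview in the period."""
    if not all_pages:
        return 0.0
    missing = sum(1 for page in all_pages if monthly_views.get(page, 0) == 0)
    return missing / len(all_pages)

def cap_outliers(monthly_views, percentile=99.0):
    """Clamp every count to the nearest-rank percentile of all counts.

    This only limits the damage of undetected bots; it does not detect them.
    """
    counts = sorted(monthly_views.values())
    if not counts:
        return {}
    rank = max(1, math.ceil(percentile / 100 * len(counts)))
    cap = counts[rank - 1]
    return {page: min(views, cap) for page, views in monthly_views.items()}
```

The coverage number says how long the tail of unseen pages is, and the cap
keeps a single bot-inflated page from dominating the ranking.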

Quality: we have a set of templates that are already used to flag 
good/featured articles. Cirrus uses them on enwiki only, and I'd really 
like to extend this to other wikis. I'm also very interested in the tool 
behind http://ores.wmflabs.org/scores/enwiki/wp10/?revids=686575075 .

Authority/Centrality: Erik ran an experiment with a PageRank-like 
algorithm and it shows very interesting results.

I'm wondering whether this approach can work. I tend to think that by 
using only one factor (pageviews) we can end up with both a very long 
tail of pages with 1 or 0 pageviews and big outliers caused by new 
bots/tools we failed to detect. Using other factors not related to 
pageviews might help to mitigate these problems.
So the question about normalization also matters for computing a 
composite score from 3 different components.
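
One possible shape for such a composite score is sketched below. It
assumes that quality and authority have already been mapped to [0, 1] by
some other process, that pageviews are log-scaled against the largest
count on the wiki, and that the three weights are placeholders to be
tuned; none of this is an existing implementation.

```python
import math

def log_scale(value, max_value):
    """Map a non-negative count to [0, 1] on a log scale to tame long tails."""
    if max_value <= 0:
        return 0.0
    return math.log1p(min(value, max_value)) / math.log1p(max_value)

def composite_score(pageviews, quality, authority, max_pageviews,
                    weights=(0.5, 0.25, 0.25)):
    """Weighted sum of popularity, quality and authority, each in [0, 1].

    The weights are hypothetical and would need tuning against real queries.
    """
    w_pop, w_qual, w_auth = weights
    popularity = log_scale(pageviews, max_pageviews)
    return w_pop * popularity + w_qual * quality + w_auth * authority
```

With a weighted sum, a page with 0 pageviews but good quality and authority
still gets a non-zero score, and an outlier can contribute at most the
popularity weight.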

For the question about weighting over time, I think you detailed the 
problem very well.
It really depends on what we want to do here: a near-real-time window 
(12h or 24h) can lead to weird behaviors and will only work for very 
popular wikis.
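
The exponential decay mentioned in the quoted message is one way to trade
freshness against noise. A minimal sketch, assuming a hypothetical list of
daily counts with the most recent day first and a half-life chosen per
wiki:

```python
def decayed_pageviews(daily_counts, half_life_days=7.0):
    """Sum daily counts, halving a day's weight every half_life_days.

    daily_counts[0] is the most recent day, daily_counts[1] the day before,
    and so on. A longer half-life smooths noise on small wikis but reacts
    more slowly to current events.
    """
    total = 0.0
    for age_in_days, count in enumerate(daily_counts):
        total += count * 0.5 ** (age_in_days / half_life_days)
    return total
```

A less-used wiki would probably need a longer half-life than enwiki to
avoid magnifying noise.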

Concerning your experiment, do you plan to activate fuzzy search?
On our side it was a bit difficult, because the completion suggester is 
still incomplete. Fuzzy results are not discounted, so we had to work 
around this problem with client-side rescoring.
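
To illustrate the idea (this is not our actual code), client-side
rescoring could look like the sketch below. It assumes suggestions arrive
as hypothetical (title, score) pairs and discounts each score once per
edit between the query and the title prefix of the same length, which is
a simplification when the typo is an insertion or a deletion.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def rescore(query, suggestions, fuzzy_discount=0.5):
    """Multiply each score by fuzzy_discount per edit, then sort descending.

    Exact prefix matches have 0 edits and keep their original score.
    """
    q = query.lower()
    rescored = []
    for title, score in suggestions:
        prefix = title.lower()[:len(q)]
        edits = edit_distance(q, prefix)
        rescored.append((title, score * fuzzy_discount ** edits))
    return sorted(rescored, key=lambda item: item[1], reverse=True)
```

For the query "mars", a fuzzy hit on "Marx" with score 100 drops to 50 and
ends up below an exact hit on "Mars" with score 80.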

Thank you!

On 14/11/2015 00:10, Greg Lindahl wrote:
...
  On Fri, Nov 13, 2015 at 01:45:57PM -0800, Erik Bernhardson wrote:

 Have you put any thought into normalizing page view data? I haven't
 studied it, but I think you've got a good start: normalizing them by
 the # of pageviews of the community. So if someone types an
 entire French phrase into the English wikipedia, and you wanted to
 show both En and Fr options in the autocomplete, a simple
 normalization would be a good start for having something to sort
 by. Ditto for search.

 Your next question, about weighting over time, is really a question
 about how much data you have. It's nice to be able to push up current
 events, so that someone searching for Paris today could see (alas) the
 brand new article about today's attacks. But it's the amount of
 pageview data that really dictates how well you can do that. For the
 English wikipedia, there are so many pageviews that you probably have
 enough data over 24 hours to produce good, not-noisy counts. And for
 less than 24 hours, you'll probably end up magnifying Europe's
 favorites as America wakes up, and America's favorites as Asia wakes
 up. Probably not a good thing!

 For a less-used wiki, only 24 hours might produce pretty sparse and
 noisy counts. So you will need to look back farther, which reduces
 your ability to react to current events.

 If you'd like to experiment with exponential decay, you can look at
 the statistics to try to figure out whether you're just magnifying
 noise, or whether Europe's preferences become popular when Americans
 wake up.

 (And if you're really interested in geography, you could divide the
 data up so that Europe, America, ANZ, Asia, etc have separate
 autocompletes... if you have enough pageview data.)

 -- greg


Re: [discovery] [Analytics] An example of pageviews usage