I've been working on book search at the Internet Archive, and I've been using Wikipedia article titles and redirects as entities and synonyms. I wanted to build autocomplete for this gizmo, so I downloaded 7 days of pageviews for the en Wikipedia, and wrote a tiny script to sum them up. It worked great!
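The summing script is essentially the sketch below. It assumes the standard space-separated line format of the hourly pageviews dumps ("project page_title view_count byte_count"); the file glob is hypothetical.

#!/usr/bin/env python
# Sum 7 days of hourly pageview dump files for the en Wikipedia.
# Assumes the standard dump line format: project page_title views bytes
# The file glob is a hypothetical placeholder.
import glob
import gzip
from collections import Counter

totals = Counter()
for path in glob.glob('pageviews-201511*.gz'):
    with gzip.open(path, 'rt', encoding='utf-8', errors='replace') as f:
        for line in f:
            parts = line.split(' ')
            if len(parts) != 4 or parts[0] != 'en':  # en Wikipedia only
                continue
            totals[parts[1]] += int(parts[2])

for title, count in totals.most_common(20):
    print(count, title)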
Here's the demo (currently live, will disappear eventually). "number" is the pageviews count.
curl http://researcher3.fnf.archive.org:8080/autocomplete?q=Que | json_pp
{
   "autocomplete" : [
      { "number" : 68310, "label" : "Queen Victoria" },
      { "number" : 53283, "label" : "Quentin Tarantino" },
      { "number" : 29192, "label" : "Quebec" },
      { "number" : 23717, "label" : "Queen Elizabeth The Queen Mother" },
      { "number" : 20500, "label" : "Quetiapine" }
   ]
}
It was great to meet you at IA yesterday; thanks for following up with this link to your work. Very interesting, and it coincides with our own work on using the completion suggester to replace the current prefix search used on-wiki.
Have you put any thought into normalizing page view data? One thing we have been trying to figure out (but on the back-burner as we focus on current quarterly goals) is how best to integrate page views (https://phabricator.wikimedia.org/T112681). Because we have to do this across many wikis with a wide variance in page views, and we want to use the data not only for the completion suggester but also within our full-text search results, we are thinking about normalizing the data down to a % of page views for that wiki over a time period. Possibly taking in a larger time period of page views and weighting newer page views as more important than older page views. Additionally we are looking into whether we should be batch loading page view information on a weekly basis, or if we can load page view data only when pages are edited (or some combination of the two). I've pinged David and Trey with this and they might have some questions for you :)
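As a rough sketch, the per-wiki normalization we have in mind looks something like this; the data shapes and names here are hypothetical, nothing is implemented yet:

# Sketch: normalize each page's views to a fraction of its wiki's total,
# so scores are comparable across wikis of very different sizes.
# Input shape and names are hypothetical.

def normalize_per_wiki(counts_by_wiki):
    """counts_by_wiki: {wiki: {title: raw_views}} -> {wiki: {title: share}}"""
    normalized = {}
    for wiki, counts in counts_by_wiki.items():
        total = sum(counts.values()) or 1
        normalized[wiki] = {t: v / total for t, v in counts.items()}
    return normalized

counts = {
    'enwiki': {'Queen Victoria': 68310, 'Quebec': 29192},
    'frwiki': {'Québec': 4100, 'Victoria (reine)': 2300},
}
shares = normalize_per_wiki(counts)
# A page with 1% of enwiki's views and one with 1% of frwiki's views
# now get the same score, despite enwiki's much larger raw traffic.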
For comparison, here is similar data but with a different scoring algorithm David worked up that reuses the same data we use for rescoring full-text searches: https://en.wikipedia.org/w/api.php?action=cirrus-suggest&text=Que
We haven't yet put this into production because we wanted to integrate page view data into the scoring before running more tests. It looks quite promising based on your initial results.
On Fri, Nov 13, 2015 at 01:45:57PM -0800, Erik Bernhardson wrote:
> Have you put any thought into normalizing page view data?
I haven't studied it, but I think you've got a good start: normalizing them by the total number of pageviews for that wiki's community. So if someone types an entire French phrase into the English Wikipedia, and you wanted to show both En and Fr options in the autocomplete, a simple normalization would give you something reasonable to sort by. Ditto for search.
Your next question, about weighting over time, is really a question about how much data you have. It's nice to be able to push up current events, so that someone searching for Paris today could see (alas) the brand new article about today's attacks. But it's the amount of pageview data that really dictates how well you can do that. For the English wikipedia, there are so many pageviews that you probably have enough data over 24 hours to produce good, not-noisy counts. And for less than 24 hours, you'll probably end up magnifying Europe's favorites as America wakes up, and America's favorites as Asia wakes up. Probably not a good thing!
For a less-used wiki, only 24 hours might produce pretty sparse and noisy counts. So you will need to look back farther, which reduces your ability to react to current events.
If you'd like to experiment with exponential decay, you can look at the statistics to try to figure out whether you're just magnifying noise, or making Europe's preferences popular when Americans wake up.
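If you do go the decay route, the score is just a decayed sum over daily buckets. Here's a minimal sketch; the half-life is an arbitrary illustration, not a recommendation:

import math

# Sketch: exponentially decayed pageview score over daily buckets.
# daily_counts[0] is today, daily_counts[1] is yesterday, etc.
# The half-life is an arbitrary choice for illustration.

def decayed_score(daily_counts, half_life_days=7.0):
    decay = math.log(2) / half_life_days
    return sum(c * math.exp(-decay * age) for age, c in enumerate(daily_counts))

# A week of counts for one article, newest first: a spike two days ago
# still dominates the score, but fades as the days pass.
print(decayed_score([1200, 1500, 90000, 1100, 1000, 950, 900]))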
(And if you're really interested in geography, you could divide the data up so that Europe, America, ANZ, Asia, etc have separate autocompletes... if you have enough pageview data.)
-- greg
This is fascinating; we started to experiment with the completion suggester a few months ago.
The first goal was to increase recall and use the ability to activate fuzzy lookups to handle small typos. It became clear that scoring was a critical part of this feature. Some prefixes are already very ambiguous (mar, cha, list ...) and enabling fuzziness does not help here. We tried to implement a score based on the data currently available (size, incoming_links, templates...) but this score is kind of "bigger is better". This is why we were interested in pageviews to add "popularity" to the score. Thanks for sharing this tool; it is very helpful to get a quick look at what it would look like.
I still don't know if pageviews can be the only score component or if we should compose it with other factors like "quality" and "authority". My concerns with pageviews are:
- we certainly have outliers (caused by 3rd-party tools/bots ...)
- what's the coverage of pageviews, i.e. in one month how many pages get 0 pageviews?
Quality: we have a set of templates that are already used to flag good/featured articles. Cirrus uses them on enwiki only; I'd really like to extend this to other wikis. I'm also very interested in the tool behind http://ores.wmflabs.org/scores/enwiki/wp10/?revids=686575075 .
Authority/Centrality: Erik ran an experiment with a PageRank-like algorithm, and it shows very interesting results.
I'm wondering if this approach can work; I tend to think that by using only one factor (pageviews) we get both a very long tail of pages with one or zero pageviews and big outliers caused by new bots/tools we failed to detect. Using other factors not related to pageviews might help mitigate these problems, so the question of normalization is also interesting for computing a composite score from three different components.
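For illustration, such a composite might look like the sketch below; the weights and the log damping of pageviews are arbitrary assumptions, not a settled design:

import math

# Sketch: composite score from three normalized components.
# Weights and the log damping of pageviews are arbitrary assumptions.

def composite_score(pageview_share, quality, authority,
                    w_views=0.5, w_quality=0.25, w_authority=0.25):
    """All inputs assumed to be in [0, 1]; returns a score in [0, 1]."""
    # Log-damp pageviews so outliers (e.g. bot spikes) can't dominate.
    damped_views = math.log1p(1000 * pageview_share) / math.log1p(1000)
    return (w_views * damped_views
            + w_quality * quality
            + w_authority * authority)

# An article with few views but high quality/authority still surfaces.
print(composite_score(pageview_share=1e-6, quality=0.9, authority=0.8))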
For the question about weighting over time, I think you detailed the problem very well. It really depends on what we want to do here; near-real-time (12h or 24h) can lead to weird behaviors and will only work for very popular wikis.
Concerning your experiment, do you plan to activate fuzzy search? On our side it was a bit difficult; the completion suggester is still incomplete, and fuzzy results are not discounted, so we had to work around this problem with client-side rescoring.
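Our client-side rescoring amounts to something like this sketch; the discount factor and the exact-prefix test are simplified illustrations, not our production code:

# Sketch of client-side rescoring: discount suggestions whose matched
# prefix differs from what the user typed (i.e. fuzzy matches).
# The discount factor is a hypothetical illustration.

def rescore(query, suggestions, fuzzy_discount=0.5):
    """suggestions: list of (label, score) from the completion suggester."""
    rescored = []
    for label, score in suggestions:
        exact = label.lower().startswith(query.lower())
        rescored.append((label, score if exact else score * fuzzy_discount))
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored

print(rescore('Que', [('Queen Victoria', 68310), ('Qin dynasty', 70000)]))
# The fuzzy match 'Qin dynasty' drops below the exact prefix match.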
Thank you!
> This is why we were interested in pageviews to add "popularity" to the score. Thanks for sharing this tool; it is very helpful to get a quick look at what it would look like.
Indeed, very interesting! Here's another tool that calculates trending articles in a variety of ways and was interesting for me to peruse as I was thinking about this same topic: https://www.vitribyte.com/ (free sign-up, worth going through)
> I still don't know if pageviews can be the only score component or if we should compose it with other factors like "quality" and "authority". My concerns with pageviews are:
> - we certainly have outliers (caused by 3rd-party tools/bots ...)
We are doing a better and better job of filtering that. The data behind the just-released pageview API [1] and the latest dumps dataset [2] uses that filtering, and we'll keep improving the criteria over time, hopefully finding and labeling most automata properly.
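For example, the API's per-article endpoint lets you request user-only (automata-filtered) traffic directly; the article and date range below are just an illustration:

# Sketch: fetch user-only (automata-filtered) daily views for one article
# from the new pageview API. Article and dates are just an illustration.
import json
from urllib.request import urlopen

url = ('https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/'
       'en.wikipedia/all-access/user/Queen_Victoria/daily/20151101/20151107')
with urlopen(url) as resp:
    data = json.load(resp)

for item in data['items']:
    print(item['timestamp'], item['views'])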
> - what's the coverage of pageviews, i.e. in one month how many pages get 0 pageviews?
select count(distinct page_title)
  from wmf.pageview_hourly
 where agent_type = 'user'
   and year = 2015 and month = 10 and day = 15;
Result: 23,110,732
We have about 35 million total articles [3], so something like 66% of all our articles get viewed on any given day. This is probably quite inaccurate: the 23 million number above includes views to redirects and doesn't use the exact same definition of "article" as the dataset behind that graph, and it's daily rather than monthly, but it hopefully still informs this a bit.
> Quality: we have a set of templates that are already used to flag good/featured articles. Cirrus uses them on enwiki only; I'd really like to extend this to other wikis. I'm also very interested in the tool behind http://ores.wmflabs.org/scores/enwiki/wp10/?revids=686575075 .
We're very interested in starting to link this type of data with pageview data and making different combinations of this accessible via other endpoints on the API (the pageview API is just a set of endpoints served by what we hope to be a more generic Analytics Query Service).
> I'm wondering if this approach can work; I tend to think that by using only one factor (pageviews) we get both a very long tail of pages with one or zero pageviews and big outliers caused by new bots/tools we failed to detect. Using other factors not related to pageviews might help mitigate these problems, so the question of normalization is also interesting for computing a composite score from three different components.
+1 for using multiple factors. We've been looking at Druid and we think it can be very useful for this type of big data question where the answer has to consider lots of dimensions.