As we reach the last month of the quarter, it's a good opportunity to
reflect on where we want to go with the time we have left. On the one
hand, we're in quite a good place: we're just wrapping up our work on our
Q2 goal for search, which is excellent! On the other hand, the test showed
minimal impact, so our users still aren't seeing the benefit of our work.
Since we can continue running A/B tests for improving language support
relatively cheaply in terms of required engineering time, let's take a
look back at what we've done previously and see if we can choose something
high-impact to work on!
The completion suggester is a very promising avenue for us to invest in. As
noted in our analysis of the initial test
<https://phabricator.wikimedia.org/T111858>, using the completion suggester
instead of prefixsearch significantly reduced the zero results rate. We've
not had an impact on this through other efforts, so this is interesting! In
order to more thoroughly test the suggester, we can make it a Beta Feature
<https://phabricator.wikimedia.org/T119535>. This will allow editors to
opt-in to testing it, and will gather us valuable qualitative feedback
about what use cases the completion suggester could support better. The
caveat, of course, is that the feedback will be from a specific segment of
our user base (users who test beta features) which is more specialised than
the intended audience (everyone). That said, the feedback will still be
very helpful. There's quite a bit of work to do here; our initial test of
the suggester was very hacky, but now that it's proven itself, we can
invest in building it out properly.
The other avenue is using page views to influence result ranking. This is
at an earlier stage than the completion suggester, in that it's a
relatively unproven approach for us, but it's something that's logical and
that we've been interested in for a while; we've just repeatedly had to
deprioritise it for other work. If something is popular, it makes sense to
rank it higher in search results. Obviously, we don't want to be *too*
aggressive with this in case we create feedback loops (popular pages rank
higher, get clicked more, and so become more popular still), but I think
the potential benefits are quite clear if done correctly.
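To make "not *too* aggressive" concrete, here's a rough sketch of the
kind of dampened popularity boost I have in mind; it's Python for
illustration only, and the function, weight, and numbers are all made
up rather than a worked-out proposal:

    import math

    def boosted_score(relevance, monthly_pageviews, weight=0.2):
        """Blend text relevance with a dampened popularity signal.

        log1p keeps a page with a million views from drowning out a
        page with a thousand, which is one way to limit feedback loops.
        """
        popularity = math.log1p(monthly_pageviews)
        return relevance * (1.0 + weight * popularity)

    # A popular page gets a nudge, not a guaranteed top slot.
    print(boosted_score(relevance=2.0, monthly_pageviews=1000000))  # ~7.5
    print(boosted_score(relevance=3.0, monthly_pageviews=1000))     # ~7.1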
I covered much of this briefly in our last standup, but hopefully this
gives you all some guidance on where we're going.
Thanks, and as always, if there are any questions then please let me know.
Lead Product Manager, Discovery
So I was looking up information on peripheral neuritis and I
accidentally mistyped it as "peripheral neuriti". The good news: the
autocorrector worked out I'd done it wrong, corrected it, and sent me
automatically to the right results. Yay!
But looking at the results I see a really obvious improvement we could
make that would definitely improve the user experience in this
scenario. See, if you look at the first article on the list you'll see
it's "Peripheral neuropathy". Why? Because peripheral neuritis
redirects to that. But the article header appears in the search
results as "Peripheral neuropathy", since that's the real title.
But it's not what I searched for. What I searched for was neuritis. Is
neuritis the same as neuropathy? I dunno, I'm a random reader. Is this
a good search result to click on? No idea.
What I'd love for us to do is run an A/B test with two conditions:
1. Users who search for a term which redirects to an article get the
current experience (control)
2. Users who search for a term which redirects to an article see the
result labelled with the redirect title they actually matched (test)
I bet this would really improve the clickthrough rate for this class
of searches. It would definitely improve the UX.
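For what it's worth, here's a minimal sketch of how the bucketing could
work, with a hash of a session ID making assignment sticky; the names
(and the idea of keying on session IDs) are just illustrative:

    import hashlib

    def ab_bucket(session_id, experiment="redirect-title-display"):
        """Deterministically assign a session to control or test."""
        key = (experiment + ":" + session_id).encode()
        digest = hashlib.md5(key).hexdigest()
        return "test" if int(digest, 16) % 2 == 0 else "control"

    def result_title(article_title, redirect_title, bucket):
        # Control: show the article's real title (current experience).
        # Test: show the redirect the user actually searched for.
        if bucket == "test" and redirect_title:
            return redirect_title
        return article_title

    bucket = ab_bucket("session-123")
    print(result_title("Peripheral neuropathy", "Peripheral neuritis",
                       bucket))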
 I'm researching thalidomide. Long story.
The Discovery Analysis team is pleased to report we have released a
new dashboard, providing basic data about usage of the Wikipedia
portal (https://www.wikipedia.org). It can be found at
Oliver, on behalf of the Discovery Analysis team
Last weekend I attended an amazing Open GIS <http://gisconf.ru/> conference
in Moscow. Many good topics, great energy, lots of people wanting to help
us build the best maps on the planet. I gave two presentations, one about
the overall state of our maps initiative, and one on the tech we have built.
As part of the discussion, the GeoHack for the Russian Wikipedia was
updated to use our maps, so we had a sixfold increase
<http://searchdata.wmflabs.org/maps/> in the number of maps users! We'll
see how it changes during the week. As part of the KPIs, we should add
graphs <https://phabricator.wikimedia.org/T119448> (top 10 only).
Other results might take time: people learnt of our technology, and I
learnt of some projects we may benefit from, for example
GeoJSON+CSS <http://wiki.openstreetmap.org/wiki/Geojson_CSS>, which would
allow our editors to style custom objects they overlay on top of the map.
In short, it was a fun weekend )
After talking with Fundraising, we have agreed a code freeze for the week
commencing Monday 30th November to minimise disruption of the fundraiser.
As a reminder, there will be no train deployment that week, so basically
what this code freeze amounts to is "do not manually deploy things that
week". The train deployment will resume the following week as normal.
This code freeze replaces our previously documented two day code freeze on
Tuesday 1st and Wednesday 2nd December. This should not affect our previous
agreement with Fundraising/RelEng that it's okay to continue doing deploys
on the portal in that time.
Lead Product Manager, Discovery
This is a study (in French) I found in the list of papers that should be
reviewed for the next research newsletter: http://scoms.hypotheses.org/498
The purpose of the study is to model the social network of movie actors
of the 1920s and 1930s with Wikidata.
In a few words, it uses WDQS to export the dataset, applies some
conversion with R, and imports the graph into Gephi.
Oliver & Mikhail,
Could you guys review why the user satisfaction KPI continues to be
affected even after the recent changes?
CC'ing discovery@ so that others are aware of the issue.
This is fascinating; we started experimenting with the completion
suggester a few months ago.
The first goal was to increase recall by activating fuzzy lookups to
handle small typos.
It became clear that scoring was a critical part of this feature.
Some prefixes are already very ambiguous (mar, cha, list ...) and
enabling fuzziness does not help here.
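For context, here's roughly what such a fuzzy completion-suggest request
looks like; "titlesuggest" and "suggest" are placeholder index/field
names for illustration, not our real mapping:

    import requests

    def suggest(prefix, host="http://localhost:9200"):
        body = {
            "titles": {
                "text": prefix,
                "completion": {
                    "field": "suggest",
                    "size": 10,
                    # Tolerate a typo or two in the typed prefix.
                    "fuzzy": {"fuzziness": 1},
                },
            }
        }
        r = requests.post(host + "/titlesuggest/_suggest", json=body)
        r.raise_for_status()
        return [o["text"] for o in r.json()["titles"][0]["options"]]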
We tried to implement a score based on the data currently available
(size, incoming_links, templates...), but this score is kind of "bigger
is better".
This is why we were interested in pageviews, to add "popularity" to the
score. Thanks for sharing this tool; it is very helpful for getting a
quick look at how it would behave.
I still don't know if pageviews can be the only score component or if we
should combine them with other factors like "quality" and "authority".
My concerns with pageviews are:
- we certainly have outliers (caused by 3rd-party tools, bots, ...)
- what's the coverage of pageviews, i.e. in one month, how many pages get
at least one view?
Quality: we have a set of templates that are already used to flag
good/featured articles. Cirrus uses this on enwiki only; I'd really like
to extend it to other wikis. I'm also very interested in the tool
behind http://ores.wmflabs.org/scores/enwiki/wp10/?revids=686575075.
Authority/Centrality: Erik ran an experiment with a PageRank-like
algorithm and it showed very interesting results.
I'm wondering if this approach can work. I tend to think that by using
only one factor (pageviews) we can have both very long tails with 1 or 0
pageviews and big outliers caused by new bots/tools we failed to detect.
Using other factors not related to pageviews might help to mitigate
these problems.
So the question about normalization is also interesting when computing a
composite score from 3 different components.
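As a strawman, such a composite could look like this; the components are
the ones mentioned above, but the normalization and weights are made up:

    import math

    def normalize_log(value, max_value):
        """Log-scale then map to [0, 1]; tames pageview outliers a bit."""
        if max_value <= 0:
            return 0.0
        return math.log1p(value) / math.log1p(max_value)

    def composite_score(pageviews, quality, pagerank,
                        max_pageviews, max_pagerank,
                        weights=(0.5, 0.25, 0.25)):
        # quality is assumed to already be in [0, 1], e.g. an ORES
        # wp10 probability; the other two are normalized here.
        w_pop, w_quality, w_auth = weights
        return (w_pop * normalize_log(pageviews, max_pageviews)
                + w_quality * quality
                + w_auth * normalize_log(pagerank, max_pagerank))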
For the question about weighting over time, I think you detailed the
problem very well.
It really depends on what we want to do here; near-real-time windows (12h
or 24h) can lead to weird behaviors and will only work for very popular
wikis.
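If we look back further, one sketch would be an exponentially decayed
counter, so recent views count more without a hard cutoff; the half-life
is an arbitrary illustration:

    import math

    def decayed_count(daily_views, half_life_days=7.0):
        """daily_views[0] is today, daily_views[1] is yesterday, etc."""
        decay = math.log(2) / half_life_days
        return sum(views * math.exp(-decay * age)
                   for age, views in enumerate(daily_views))

    # Yesterday's spike still dominates, but fades over the week.
    print(decayed_count([100, 5000, 120, 110, 90, 80, 100]))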
Concerning your experiment, do you plan to activate fuzzy search?
On our side it was a bit difficult; the completion suggester is still
incomplete. Fuzzy results are not discounted, so we had to work around
this problem with client-side rescoring.
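Roughly, the workaround is to re-score the returned options ourselves,
penalising any option whose text doesn't start with what the user typed
(i.e. a fuzzy match); the penalty factor here is arbitrary:

    def rescore(prefix, options, fuzzy_penalty=0.5):
        """options: list of (text, score) pairs from the suggester."""
        rescored = []
        for text, score in options:
            if not text.lower().startswith(prefix.lower()):
                score *= fuzzy_penalty  # a fuzzy match: push it down
            rescored.append((text, score))
        return sorted(rescored, key=lambda p: p[1], reverse=True)

    print(rescore("mar", [("Mars", 10.0), ("Mary", 9.0), ("Car", 9.5)]))
    # Exact-prefix "Mars" and "Mary" now outrank the fuzzy "Car".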
On 14/11/2015 00:10, Greg Lindahl wrote:
> On Fri, Nov 13, 2015 at 01:45:57PM -0800, Erik Bernhardson wrote:
>> Have you put any thought into normalizing page view data?
> I haven't studied it, but I think you've got a good start: normalizing
> them by the # of pageviews of the community. So if someone types an
> entire French phrase into the English wikipedia, and you wanted to
> show both En and Fr options in the autocomplete, a simple
> normalization would be a good start for having something to sort
> by. Ditto for search.
> Your next question, about weighting over time, is really a question
> about how much data you have. It's nice to be able to push up current
> events, so that someone searching for Paris today could see (alas) the
> brand new article about today's attacks. But it's the amount of
> pageview data that really dictates how well you can do that. For the
> English wikipedia, there are so many pageviews that you probably have
> enough data over 24 hours to produce good, not-noisy counts. And for
> less than 24 hours, you'll probably end up magnifying Europe's
> favorites as America wakes up, and America's favorites as Asia wakes
> up. Probably not a good thing!
> For a less-used wiki, only 24 hours might produce pretty sparse and
> noisy counts. So you will need to look back farther, which reduces
> your ability to react to current events.
> If you'd like to experiment with exponential decay, you can look at the
> statistics to try to figure out if you're just magnifying noise, or
> whether Europe's preferences become popular when Americans wake up.
> (And if you're really interested in geography, you could divide the
> data up so that Europe, America, ANZ, Asia, etc have separate
> autocompletes... if you have enough pageview data.)
> -- greg