Hi all,
In response to [0] I am considering volunteering to develop the tabbed
search interface [2] [3]. To me it looks more logical and more familiar
to users than the other interfaces.
I'm Gryllida at Wikimedia sites. I have prior Perl and JavaScript
experience interacting with the MediaWiki API [1], but none in PHP. The
JavaScript things I wrote are rather scattered; I have only a minimal
understanding of objects and modules, as I have only written
subroutine-style scripts before. At home, I use a Debian GNU/Linux desktop.
So this week I came to IRC and asked several questions to get an idea of
what the Discovery team is doing. Thanks Deborah for sharing the current
state of things! :-) I gather that the tabbed interface is in the plans
and nobody is working on it yet, so it's a good task to take on.
We left some questions unanswered. In particular, is the Labs
instance at [4] expected to be used for all ideas at once or only for
one at a time, and is it shared between several people? Is it a good
idea for me to use a Labs instance at initial development stages or only
when the code is nearing completion? Or is it better to use a Vagrant
instance locally? Or both?
What documentation and code do you recommend I read? May I develop
it as an extension rather than a gadget wherever possible, so that people
don't have to wait for page JavaScript to finish loading before they see
the new sister wiki tabs?
May I please ask someone to volunteer mentoring me throughout the
project? (I am in the UTC+11 timezone at present; 'gry' nickname at
chat.freenode.net.)
Regards,
Svetlana.
[0]:
https://lists.wikimedia.org/pipermail/wikitech-ambassadors/2016-November/00…
[1]: http://svetlana.nfshost.com/fs/
[2]:
https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements/Design…
[3]: https://wikitech.wikimedia.org/wiki/User:Gryllida/sandbox
[4]: https://phabricator.wikimedia.org/T151344
Hello!
I've been working on a presentation
<https://docs.google.com/presentation/d/1ctlqdLA__0OxDuO7mJEIDLP-xt9a7E4jv4I…>
that gives a summary of who Discovery is, what our mission is, and
what's coming up for the rest of the year. I'd like to share it all with
you!
This presentation is a living document. The content and style can and will
change over time, perhaps even drastically. This is especially true for the
roadmap slide. I made this clear in the presentation, but it's worth
pointing out again. :-)
If there are any questions, I'd be happy to answer them!
Thanks,
Dan
--
Dan Garry
Lead Product Manager, Discovery
Wikimedia Foundation
Recently I've been doing some investigation into how we can collect enough
data to plausibly train an ML model for search re-ranking. As with all ML
training, the labeled dataset to train against is an important piece. Many
approaches seem to use human labeled relevance, and we have a platform for
collecting this data which has proven to have decent predictive
capabilities for offline tests of changes to our search. But the amount of
data necessary for training ML models is simply not there.
In my research I've come across a paper, "A Dynamic Bayesian Network
Click Model for Web Search Ranking" [1], and a related implementation [2]
that seem to have some promise. Machine generation of relevance labels
seems promising because I can collect a reasonable amount of information
about clickthroughs and the search results that were provided to users.
For one week of enwiki traffic I have ~20k queries that were issued by more
than 10 identities (~distinct search session). This has around 135k
distinct (query, identity) pairs, 140k distinct (query, identity, click
page id) pairs, 414k distinct (query, result page id) pairs, and covers ~3M
results (~20 per page) that were shown to users and could be converted into
relevance judgements. I'm not sure which set to train the final model on,
though: the 414k distinct (query, result_page_id) pairs, or the 3M
impressions, which duplicate entries from the 414k wherever the same
(query, result_page_id) pair was shown multiple times.
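For context, the naive baseline that click models like the DBN improve on is a plain clickthrough rate per (query, result) pair. A minimal sketch of that aggregation, assuming a hypothetical log format of (query, session, page_id, clicked) impression tuples:

```python
from collections import defaultdict

def ctr_labels(click_log):
    """Aggregate an impression log into naive clickthrough-rate labels.

    click_log: iterable of (query, session_id, result_page_id, clicked)
    tuples, one per shown result. Returns {(query, page_id): ctr}.
    This baseline is position-biased; the DBN model in [1] corrects
    for lower-ranked results being examined less often.
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for query, _session, page_id, was_clicked in click_log:
        key = (query, page_id)
        shown[key] += 1
        if was_clicked:
            clicked[key] += 1
    return {key: clicked[key] / shown[key] for key in shown}

log = [
    ("foo", "s1", 10, True),
    ("foo", "s1", 11, False),
    ("foo", "s2", 10, True),
    ("foo", "s2", 11, False),
]
labels = ctr_labels(log)
# labels[("foo", 10)] == 1.0, labels[("foo", 11)] == 0.0
```

The duplication question above corresponds to whether each impression enters the training set once, or each distinct pair enters once with a weight.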
I was also curious about a part in the appendix of the paper, labeled
Confidence. It states:
> Remember that the latent variables a_u and s_u will later be used as
> targets for learning a ranking function. It is thus important to know
> the confidence associated with these values.
Why is it important to know the confidence, and how does that play into
training a model? This is probably basic ML stuff but I'm new to all of
this.
And finally, are there better ways of generating relevance labels from
clickthrough data, ideally with open source implementations? This is just
something I happened to stumble upon in my research and certainly not the
only thing out there.
[1] http://www2009.eprints.org/1/1/p1.pdf
[2] https://github.com/varepsilon/clickmodels
(cc'ing the discovery mailing list, as that team owns both the
implementation and operation of search.)
I can partially answer this as one of the people responsible for search,
but I have to defer to others about API, bots, and such.
For reference, this would be a noticeable portion of our traffic:
action=opensearch (and generator variants): 1.5k RPS
action=query&list=search (and generator variants): 600 RPS
all api: 8k RPS (might be a bit higher, this is averaged over an hour)
opensearch is relatively cheap: the p95 latency to our search servers is
~30ms, with the p50 at 7ms. So 600 RPS of opensearch traffic wouldn't be
too hard on our search cluster. Using action=query is going to be too
heavy, as full text searches are computationally more expensive to serve.
Might I ask, which wiki(s) would you be querying against? opensearch
traffic is spread across our search cluster, but individual wikis only hit
portions of it. For example opensearch on en.wikipedia.org is served by
~40% of the cluster, but zh.wikipedia.org (chinese) is only served by ~13%.
If you are going to send heavy traffic to zh I might need to adjust those
numbers to spread the load to more servers (easy enough, just need to know).
Additionally, you mentioned descriptions and keywords. These would not be
provided directly by the opensearch API, so you might be thinking of using
its generator version (action=query&generator=prefixsearch) to get the
results augmented
(ex: /w/api.php?action=query&format=json&prop=extracts&generator=prefixsearch&exlimit=5&exintro=1&explaintext=1&gpssearch=yah&gpslimit=5).
I'm not personally sure how expensive that is, someone else would have to
chime in.
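For clarity, here is a sketch of how that augmented request can be built programmatically. The parameters mirror the example URL above; the en.wikipedia.org endpoint and the "yah" sample prefix are taken from that example:

```python
import urllib.parse

def prefixsearch_with_extracts_url(prefix, limit=5,
                                   endpoint="https://en.wikipedia.org/w/api.php"):
    """Build the augmented prefix-search request described above:
    generator=prefixsearch for the results, prop=extracts for
    plain-text intro snippets to use as descriptions."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "generator": "prefixsearch",
        "exlimit": limit,
        "exintro": 1,
        "explaintext": 1,
        "gpssearch": prefix,
        "gpslimit": limit,
    }
    return endpoint + "?" + urllib.parse.urlencode(params)

url = prefixsearch_with_extracts_url("yah")
# Fetching this URL (e.g. with urllib.request.urlopen) returns JSON whose
# query.pages values each carry a page title and a plain-text "extract".
```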
So, from a computational point of view and only with respect to the search
portion of our cluster, this seems plausible as long as we coordinate so
that we know the traffic is coming. Others will have to chime in about the
wider picture.
Erik B.
On Mon, Nov 14, 2016 at 4:40 PM, Eric Kuo <erickuo(a)yahoo-inc.com> wrote:
> Hi,
>
> This is Eric from Yahoo. My team develops mobile apps for Taiwan and Hong
> Kong users. We want to provide wiki description on keywords in our
> contents, and we consider using MediaWiki API:OpenSearch and/or API:Query
> to achieve this. Our estimated RPS is 900, and we will cache the query
> result on our side. We would like to know if there is any concern with
> respect to our RPS, and if so, what is the best practice.
>
> Any comments and suggestions are welcome. Thank you for your time.
>
> Best regards,
> Eric
>
> _______________________________________________
> Mediawiki-api mailing list
> Mediawiki-api(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/mediawiki-api
>
>
Thanks, Huji. To answer your question: this will be a short series of
tests on Wikipedia that will display additional relevant search results
across wikis in the same language to a selected number of users who fall
into our bucketing schema. Not everyone who lands on a search results
page will see the new results every time.
Cheers,
Deb
--
deb tankersley
Product Manager, Discovery
irc: debt
Wikimedia Foundation
On Thu, Nov 10, 2016 at 6:11 PM, Huji Lee <huji.huji(a)gmail.com> wrote:
> I think it would be best if we test it in at least one RTL wiki. I will
> mention this in the VP of Persian Wikipedia (FA WP).
>
> If it is meant to only be shown for select users and has no impact for
> others, I am willing to volunteer myself as a tester.
>
> Huji
>
> On Thu, Nov 10, 2016 at 6:03 PM, Deborah Tankersley <
> dtankersley(a)wikimedia.org> wrote:
>
>> Hello,
>>
>> The Discovery Search team is looking for a few language specific
>> Wikipedia sites that would be interested in helping with A/B testing for
>> cross-wiki search results. These tests would evaluate whether adding
>> search results across wiki projects in the same language would be
>> useful, relevant, and of interest to users.
>>
>> We've written up the details
>> <https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements> [1],
>> came up with a multitude of designs
>> <https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements/Design>
>> [2], and had many conversations on both talk pages and with our own
>> internal Design team. We have also outlined the initial tests
>> <https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements/Testing> [3]
>> that we'd like to run.
>>
>> These planned A/B tests would run for about a week and would only be
>> shown to a small subsection of users that visit the Wikipedia(s) that the
>> tests are running on. The analyzed results of these tests will be posted on
>> wiki so that everyone can see how they did in terms of usage and adoption
>> of the test group.
>>
>> We would like to know if there are any particular Wikipedias that would
>> want to help us test these new search results across projects in their
>> language. Interested community members might want to post something to
>> their project's Village Pump to build consensus. Wikipedias that are
>> related culturally or linguistically would also be of interest.
>>
>> Please post on our testing talk page
>> <https://www.mediawiki.org/wiki/Talk:Cross-wiki_Search_Result_Improvements/T…>
>> [4] if there are any questions, concerns, or volunteers!
>>
>> Thanks!
>>
>>
>> [1] https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements
>> [2] https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements/Design
>> [3] https://www.mediawiki.org/wiki/Cross-wiki_Search_Result_Improvements/Testing
>> [4] https://www.mediawiki.org/wiki/Talk:Cross-wiki_Search_Result_Improvements/Testing
>>
>> --
>> deb tankersley
>> Product Manager, Discovery
>> irc: debt
>> Wikimedia Foundation
>>
>> _______________________________________________
>> Wikitech-ambassadors mailing list
>> Wikitech-ambassadors(a)lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-ambassadors
>>
>>
>