On Fri, Oct 19, 2018 at 11:13 AM Trey Jones <tjones@wikimedia.org> wrote:
Sounds interesting to me! Presumably their re-ranked results would be returned to users, and clicks and maybe other data would be logged for an A/B test.

Right, I think the idea would be to treat this as an AB test with their ranker as an option.
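
To make the "ranker as an option in an AB test" idea concrete, here is a minimal sketch of how a small, stable slice of sessions could be routed to the external ranker. Everything here is an assumption for illustration: the function name `assign_bucket`, the bucket labels, the salt, and the 1% default are hypothetical and are not taken from our existing AB testing setup.

```python
import hashlib


def assign_bucket(session_id: str,
                  test_fraction: float = 0.01,
                  salt: str = "rerank-ab-test") -> str:
    """Deterministically map a session to 'control' or 'external_reranker'.

    All names and defaults are hypothetical; this only illustrates
    hash-based bucketing, not any existing implementation.
    """
    digest = hashlib.sha256(f"{salt}:{session_id}".encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer in [0, 2**64).
    point = int.from_bytes(digest[:8], "big")
    # Sessions whose hash falls below the threshold go to the test bucket.
    threshold = int(test_fraction * 2**64)
    return "external_reranker" if point < threshold else "control"
```

Hashing the session identifier with a salt keeps a given session in the same bucket for the whole test, and changing the salt reshuffles sessions for a later test.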
 
The questions that immediately come to mind are:
- How much hand-holding would be needed to get their container set up? Given good specs on our side and decent tech capabilities on their side, the hand-holding could be very minimal.

We're hoping that wrapping things up into a container will simplify things from our end. Of course, it can't be as simple as running their black box in our production environment. Even if Kubernetes can generally wall off the container so it has no access except to receive TCP connections on a predetermined port, I think we will need more visibility, and possibly the ability to build the container ourselves (from their git repo with an appropriate Dockerfile or whatever). Mostly this would have to be informed by SRE.

- What do they get out of it? What kinds of data are they going to expect to collect, and if it is covered by NDA or other PII protections, does it limit their ability to publish related research?

Ideally they should not need any NDA; instead they would receive AB testing results from our standard report generator. Part of the professor's interest in live AB testing, rather than access to historical data, was that they wouldn't individually need access to private data.

On Fri, Oct 19, 2018 at 12:24 PM Stas Malyshev <smalyshev@wikimedia.org> wrote:
> While at the conference this week I met up with a professor and he had
> an interesting proposal. Essentially the idea would be they could build
> containers (following our OSS requirements) that we could deploy, and
> for some small percent of search traffic pass our top 100 results to
> their container for reranking. 

Sounds interesting. I wonder though how to organize it - would it be on
VPS or production cluster? VPS allows for a lot of freedom in installing
custom stuff but routing traffic there from production may be
problematic (especially if it has PII). And production clusters have
pretty strict rules about what can be run there and who can access, so
it may be a bit of an issue for researchers to work within those bounds.
Also, if they want to do something with the data - e.g. publish
something about it, do independent analysis, keep logs, etc. - we will
have to figure out how to do it so it won't have impact on privacy and
how to ensure our privacy guidelines are followed.

It would have to live on the production cluster, most likely inside the Kubernetes cluster. This would certainly limit what they can do; the question, I guess, is whether it is still a useful thing after we apply those limits. This project is slightly different from most WMF services, though, in that it will have extremely limited connectivity. The only thing exposed to the container should be a single port it can listen on for connections from MediaWiki; it shouldn't have its own public network interface.
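
As a rough illustration of how small the researcher-side service could be under those limits, here is a sketch of a process that does nothing but listen on one port and reorder the results it is handed. The choice of HTTP with JSON bodies, the field names (`query`, `results`, `id`, `score`, `ids`), the port number, and the placeholder `rerank` function are all assumptions for the example; nothing about the wire format has been decided.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def rerank(query, results):
    """Placeholder ranker: order candidates by the score they arrived with.

    A real submission would replace this with the researchers' model.
    """
    return sorted(results, key=lambda r: r["score"], reverse=True)


class RerankHandler(BaseHTTPRequestHandler):
    """Accepts a POST with a JSON body and replies with reordered ids."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        ranked = rerank(payload["query"], payload["results"])
        body = json.dumps({"ids": [r["id"] for r in ranked]}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=8080, host="0.0.0.0"):
    """Bind the single listening port; the service opens no outbound connections."""
    return HTTPServer((host, port), RerankHandler)
```

Calling `make_server().serve_forever()` would start the listener. The network isolation itself would still have to come from the cluster configuration, not from this code.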


On Mon, Oct 22, 2018 at 10:17 AM David Causse <dcausse@wikimedia.org> wrote:
Hi,

I'm all for it (assuming production and/or Cloud VPS requirements are all met).
The real question to me is how this will work: should we think about a protocol we could present to researchers, or work with a researcher to design what needs to be transferred between this ranking machine and our search system?
In short, I love the idea, but there seem to be a lot of details to figure out.

From our side we would certainly need to define a stable interface; it's a good requirement to keep in mind.
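
One way to think about what "stable" might mean in practice is a versioned message pair plus a check on our side that the reply is usable. This is only a sketch under assumptions: the type names, the fields, the version number, and the fall-back-to-original-order policy are all hypothetical, not an agreed design.

```python
from dataclasses import dataclass
from typing import List

INTERFACE_VERSION = 1  # hypothetical; bumped on any incompatible change


@dataclass(frozen=True)
class Candidate:
    doc_id: str
    score: float


@dataclass(frozen=True)
class RerankRequest:
    version: int
    query: str
    candidates: List[Candidate]


@dataclass(frozen=True)
class RerankResponse:
    version: int
    ordered_ids: List[str]


def apply_response(request: RerankRequest, response: RerankResponse) -> List[str]:
    """Return the order to show users, falling back to our own ranking.

    The reply is accepted only if it speaks the same interface version and
    is a pure reordering of the ids we sent (nothing added or dropped).
    """
    original = [c.doc_id for c in request.candidates]
    if response.version != request.version:
        return original
    if sorted(response.ordered_ids) != sorted(original):
        return original
    return list(response.ordered_ids)
```

Validating that the reply is a permutation of what was sent means a misbehaving container can at worst produce a poor ordering; it cannot inject or remove results.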