On Fri, Oct 19, 2018 at 11:13 AM Trey Jones <tjones(a)wikimedia.org> wrote:
Sounds interesting to me! Presumably their re-ranked results would be
returned to users and clicks and maybe other data would be logged for an
Right, I think the idea would be to treat this as an AB test with their
The questions that immediately come to mind are:
- How much hand-holding would be needed to get their container set up?
Given good specs on our side and decent tech capabilities on their side,
the hand-holding could be very minimal.
We're hoping that wrapping things up into a container will simplify things
from our end. Of course it can't be as simple as us running their black box
in our production environment. Even if Kubernetes can generally wall off the
container so it has no access except to receive TCP connections on a
predetermined port, I think we will need more visibility and possibly the
ability to build the container ourselves (from their git repo with an
appropriate Dockerfile or whatever). Mostly this would have to be informed
by SRE.
- What do they get out of it? What kinds of data are they going to expect
to collect, and if it is covered by NDA or other PII protections, does it
limit their ability to publish related research?
Ideally they should not need any NDA; instead they would receive AB testing
results from our standard report generator. Part of the reason the professor
was interested in live AB testing, rather than access to historical data, is
that they wouldn't individually need access to private data.
On Fri, Oct 19, 2018 at 12:24 PM Stas Malyshev <smalyshev(a)wikimedia.org> wrote:
While at the conference this week I met up with a professor and he had
an interesting proposal. Essentially the idea would be they could build
containers (following our OSS requirements) that we could deploy, and
for some small percent of search traffic pass our top 100 results to
their container for reranking.
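
As a rough illustration of the "small percent of search traffic" and "top 100
results" parts of the proposal, here is a minimal Python sketch. Everything in
it is an assumption made for illustration: the 1% rate, the use of a session
identifier as the bucketing key, and the function names are hypothetical and
not taken from any existing code.

```python
import hashlib
from typing import List

SAMPLE_RATE = 0.01  # hypothetical: send 1% of sessions to the reranker
TOP_N = 100         # from the proposal: pass the top 100 results


def in_rerank_bucket(session_id: str, sample_rate: float = SAMPLE_RATE) -> bool:
    """Deterministically decide whether a session is in the test bucket.

    Hashing the identifier keeps a given session in the same bucket for
    every query, which an AB test needs.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Compare as integers over a 64-bit range to avoid float rounding issues.
    value = int.from_bytes(digest[:8], "big")
    threshold = int(sample_rate * 2**64)
    return value < threshold


def candidates_for_reranker(results: List[str], top_n: int = TOP_N) -> List[str]:
    """Return only the top-N results that would be handed to the container."""
    return results[:top_n]
```

Sessions outside the bucket would simply keep the normal ranking, so the
container only ever sees the sampled slice of traffic.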
Sounds interesting. I wonder though how to organize it - would it be on
VPS or production cluster? VPS allows for a lot of freedom in installing
custom stuff but routing traffic there from production may be
problematic (especially if it has PII). And production clusters have
pretty strict rules about what can be run there and who can access, so
it may be a bit of an issue for researchers to work within those bounds.
Also, if they want to do something with the data - e.g. publish
something about it, do independent analysis, keep logs, etc. - we will
have to figure out how to do it so it won't have impact on privacy and
how to ensure our privacy guidelines are followed.
It would have to live on the production cluster, most likely inside the
Kubernetes cluster. This would certainly limit what they can do; the
question, I guess, is whether it is still a useful thing once we apply
those limits. This project is slightly different from most WMF services,
though, in that it will have extremely limited connectivity. The only thing
exposed to the container should be a single port it can listen on for
connections from MediaWiki; it shouldn't have its own public network
interface.
On Mon, Oct 22, 2018 at 10:17 AM David Causse <dcausse(a)wikimedia.org> wrote:
I'm all for it (assuming production and/or Cloud VPS requirements are all
The real question to me is how this will work: should we think about a
protocol we could present to researchers or work with a researcher to
design what needs to be transferred between this ranking machine and our
In short I love the idea but there seems to be a lot of details to figure
From our side we would certainly need to define a stable interface; it's a
good requirement to keep in mind.
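
To make the "stable interface" idea a bit more concrete, here is a minimal
Python sketch of what a versioned exchange could look like. All of it is
hypothetical: the field names ("version", "query", "candidates", "ranked"),
the use of JSON, and the use of page IDs are assumptions for illustration,
and which fields may actually be sent is exactly the open privacy question
in this thread. The one design point it tries to show is that the caller
only accepts a response that is a reordering of what it sent, and otherwise
keeps the original ranking.

```python
import json
from typing import List

SCHEMA_VERSION = 1  # hypothetical: bumped whenever the interface changes


def build_request(query: str, page_ids: List[int]) -> str:
    """Serialize the candidates handed to the reranking container."""
    return json.dumps(
        {"version": SCHEMA_VERSION, "query": query, "candidates": page_ids}
    )


def apply_response(page_ids: List[int], response_body: str) -> List[int]:
    """Return the reranked order, or the original order if the response is invalid."""
    original = list(page_ids)
    try:
        payload = json.loads(response_body)
        ranked = payload["ranked"]
    except (ValueError, KeyError, TypeError):
        # Not JSON, not an object, or missing the expected field.
        return original
    if payload.get("version") != SCHEMA_VERSION:
        return original
    if not isinstance(ranked, list) or not all(type(x) is int for x in ranked):
        return original
    # The reranker may only reorder: no additions, removals, or duplicates.
    if sorted(ranked) != sorted(original):
        return original
    return ranked
```

For example, with candidates [10, 20, 30] a response of
{"version": 1, "ranked": [30, 10, 20]} would be accepted, while a response
that drops or adds an ID would be ignored in favor of the original order.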