While at the conference this week I met up with a professor and he had an interesting proposal. Essentially the idea would be they could build containers (following our OSS requirements) that we could deploy, and for some small percent of search traffic pass our top 100 results to their container for reranking.
Certainly more details would have to be worked out, but does this seem reasonable? Overall I still think working directly with academics can be beneficial to our own work, although I'm not sure what our overhead would be.
Sounds interesting to me! Presumably their re-ranked results would be returned to users and clicks and maybe other data would be logged for an A/B test.
The questions that immediately come to mind are:
- How much hand-holding would be needed to get their container set up? Given good specs on our side and decent tech capabilities on their side, the hand-holding could be very minimal.
- What do they get out of it? What kinds of data are they going to expect to collect, and if it is covered by NDA or other PII protections, does it limit their ability to publish related research?
- If it goes well, how do we integrate it? Are they going to be willing to make their core code open source?
I think these obstacles can all be readily overcome, but they are a few of the things we would have to think about as we enter such a collaboration.
The best outcome, though, would be great—new ideas and new techniques for us and better results for our users.
Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
On Fri, Oct 19, 2018 at 1:54 PM, Erik Bernhardson <ebernhardson@wikimedia.org> wrote:
While at the conference this week I met up with a professor and he had an interesting proposal. Essentially the idea would be they could build containers (following our OSS requirements) that we could deploy, and for some small percent of search traffic pass our top 100 results to their container for reranking.
Certainly more details would have to be worked out, but does this seem reasonable? Overall I still think working directly with academics can be beneficial to our own work, although I'm not sure what our overhead would be.
Discovery mailing list Discovery@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/discovery
On Fri, Oct 19, 2018 at 11:13 AM Trey Jones tjones@wikimedia.org wrote:
Sounds interesting to me! Presumably their re-ranked results would be returned to users and clicks and maybe other data would be logged for an A/B test.
Right, I think the idea would be to treat this as an A/B test with their ranker as an option.
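To make the "ranker as an option" idea concrete, here is a minimal sketch of how a small share of traffic might be assigned to the external ranker. Everything in it is an assumption for illustration: the bucket names, the 1% share, and the use of a session token as the bucketing key are hypothetical and not taken from our existing A/B test tooling.

```python
import hashlib

# Hypothetical values; the thread only says "some small percent of search traffic".
EXTERNAL_RANKER_SHARE = 0.01


def assign_bucket(session_token: str, share: float = EXTERNAL_RANKER_SHARE) -> str:
    """Deterministically map a session token to 'control' or 'external_ranker'."""
    digest = hashlib.sha256(session_token.encode("utf-8")).digest()
    # Treat the first 8 bytes of the hash as an integer in [0, 2**64).
    value = int.from_bytes(digest[:8], "big")
    # Tokens whose hash falls below the threshold go to the external ranker.
    threshold = int(share * 2**64)
    return "external_ranker" if value < threshold else "control"
```

Hashing the token rather than drawing a random number keeps a given session in the same bucket for every search, which is usually what an A/B report needs.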
The questions that immediately come to mind are:
- How much hand-holding would be needed to get their container set up?
Given good specs on our side and decent tech capabilities on their side, the hand-holding could be very minimal.
We're hoping that wrapping things up into a container will simplify things on our end. Of course, it can't be as simple as running their black box in our production environment. Even if Kubernetes can generally wall off the container so it has no access except to receive TCP connections on a predetermined port, I think we will need more visibility, and possibly the ability to build the container ourselves (from their git repo with an appropriate Dockerfile or whatever). Mostly this would have to be informed by SRE.
- What do they get out of it? What kinds of data are they going to expect
to collect, and if it is covered by NDA or other PII protections, does it limit their ability to publish related research?
Ideally they should not need any NDA; instead they would receive A/B testing results from our standard report generator. The professor was partially interested in live A/B testing, instead of access to historical data, as they wouldn't individually need access to private data.
On Fri, Oct 19, 2018 at 12:24 PM Stas Malyshev smalyshev@wikimedia.org wrote:
While at the conference this week I met up with a professor and he had an interesting proposal. Essentially the idea would be they could build containers (following our OSS requirements) that we could deploy, and for some small percent of search traffic pass our top 100 results to their container for reranking.
Sounds interesting. I wonder though how to organize it - would it be on VPS or production cluster? VPS allows for a lot of freedom in installing custom stuff but routing traffic there from production may be problematic (especially if it has PII). And production clusters have pretty strict rules about what can be run there and who can access, so it may be a bit of an issue for researchers to work within those bounds. Also, if they want to do something with the data - e.g. publish something about it, do independent analysis, keep logs, etc. - we will have to figure out how to do it so it won't have impact on privacy and how to ensure our privacy guidelines are followed.
It would have to live on the production cluster, most likely inside the Kubernetes cluster. This would certainly limit what they can do; the question, I guess, is whether it is still a useful thing after we apply those limits. This project is slightly different from most WMF services, though, in that it will have extremely limited connectivity. The only thing exposed to the container should be a single port it can listen on for connections from MediaWiki; it shouldn't have its own public network interface.
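As a rough illustration of how little the container would need to expose, here is a sketch of a service that listens on one port and reranks whatever it is sent. The port number, the JSON shape, the "score" field, and the placeholder ranker are all hypothetical; the real researcher code would replace `rerank`, and nothing here reflects an agreed design.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

LISTEN_PORT = 8080  # hypothetical; the thread only says "a single port"


def rerank(results):
    """Placeholder ranker: order by a hypothetical 'score' field, highest first."""
    return sorted(results, key=lambda r: r.get("score", 0.0), reverse=True)


class RerankHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            results = json.loads(self.rfile.read(length))["results"]
            body = json.dumps({"results": rerank(results)}).encode("utf-8")
            status = 200
        except (ValueError, KeyError, TypeError, AttributeError):
            # Malformed JSON or an unexpected payload shape.
            body = b'{"error": "bad request"}'
            status = 400
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = LISTEN_PORT) -> None:
    """Accept inbound requests on one port; this code makes no outbound calls."""
    HTTPServer(("0.0.0.0", port), RerankHandler).serve_forever()
```

Calling `serve()` would be the container's entry point. The network isolation itself would still have to come from the cluster configuration, not from this code.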
On Mon, Oct 22, 2018 at 10:17 AM David Causse dcausse@wikimedia.org wrote:
Hi,
I'm all for it (assuming production and/or Cloud VPS requirements are all met). The real question to me is how this will work: should we think about a protocol we could present to researchers, or work with a researcher to design what needs to be transferred between this ranking machine and our search system? In short, I love the idea, but there seem to be a lot of details to figure out.
From our side we would certainly need to define a stable interface; it's a good requirement to keep in mind.
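One way to picture a stable interface is a versioned request/response pair plus a check on our side that the container only reordered what it was given. This is a sketch under assumptions: the field names, the version tag, and the decision to send the query string along with page IDs are all hypothetical and would need to be worked out with the researchers.

```python
from dataclasses import dataclass
from typing import List

INTERFACE_VERSION = 1  # hypothetical version tag for the contract


@dataclass(frozen=True)
class RerankRequest:
    version: int
    query: str
    page_ids: List[int]  # top results, in the search system's original order


@dataclass(frozen=True)
class RerankResponse:
    version: int
    page_ids: List[int]  # the same IDs, in the container's preferred order


def validate_response(request: RerankRequest, response: RerankResponse) -> bool:
    """Accept a response only if it is a pure reordering of the request's IDs."""
    if response.version != request.version:
        return False
    if len(response.page_ids) != len(request.page_ids):
        return False
    # Comparing sorted lists rejects added, dropped, or duplicated IDs.
    return sorted(response.page_ids) == sorted(request.page_ids)
```

If validation fails, the caller could simply serve the original ordering, so a misbehaving container would never change which pages a user sees, only (at most) their order.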
Hi!
While at the conference this week I met up with a professor and he had an interesting proposal. Essentially the idea would be they could build containers (following our OSS requirements) that we could deploy, and for some small percent of search traffic pass our top 100 results to their container for reranking.
Sounds interesting. I wonder though how to organize it - would it be on VPS or production cluster? VPS allows for a lot of freedom in installing custom stuff but routing traffic there from production may be problematic (especially if it has PII). And production clusters have pretty strict rules about what can be run there and who can access, so it may be a bit of an issue for researchers to work within those bounds. Also, if they want to do something with the data - e.g. publish something about it, do independent analysis, keep logs, etc. - we will have to figure out how to do it so it won't have impact on privacy and how to ensure our privacy guidelines are followed.
- If it goes well, how do we integrate it? Are they going to be
willing to make their core code open source?
Good point. I think we should require at least some open result, i.e. either open source code (with reusable license, i.e. no patents banning reuse etc.) or open publication with freely accessible algorithms and outcomes (or both?) I don't think it would make sense for us to cooperate if we'd be unable to benefit from the results.
Hi,
On 10/19/18 12:24 PM, Stas Malyshev wrote:
- If it goes well, how do we integrate it? Are they going to be
willing to make their core code open source?
Good point. I think we should require at least some open result, i.e. either open source code (with reusable license, i.e. no patents banning reuse etc.) or open publication with freely accessible algorithms and outcomes (or both?) I don't think it would make sense for us to cooperate if we'd be unable to benefit from the results.
No need to re-invent the wheel here, there's already an Open Access Policy[1] that covers these kinds of things for research projects.
[1] https://foundation.wikimedia.org/wiki/Open_access_policy
-- Legoktm
Hi,
I'm all for it (assuming production and/or Cloud VPS requirements are all met). The real question to me is how this will work: should we think about a protocol we could present to researchers, or work with a researcher to design what needs to be transferred between this ranking machine and our search system? In short, I love the idea, but there seem to be a lot of details to figure out.