Hi!

> While at the conference this week I met up with a professor and he had
> an interesting proposal. Essentially the idea would be that they could
> build containers (following our OSS requirements) that we could deploy,
> and for some small percent of search traffic we would pass our top 100
> results to their container for reranking.

Sounds interesting. I wonder though how to organize it - would it run on
a VPS or on a production cluster? A VPS allows a lot of freedom in
installing custom stuff, but routing traffic there from production may be
problematic (especially if it contains PII). Production clusters, on the
other hand, have pretty strict rules about what can run there and who
has access, so it may be difficult for researchers to work within those
bounds.
Also, if they want to do something with the data - e.g. publish
something about it, do independent analysis, keep logs, etc. - we will
have to figure out how to do that without affecting user privacy and
how to ensure our privacy guidelines are followed.
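
To make the routing question a bit more concrete, here is a rough Python
sketch of how the sampling could work, wherever the container ends up
running. Everything in it is an assumption for illustration only: the
names (in_sample, rerank_if_sampled), the 1% figure, and the reranker
callable standing in for the researchers' container. Only the query and
page IDs are handed over, no user identifiers, though the query text
itself can still contain PII and would need its own review.

```python
import hashlib
from typing import Callable, List, Sequence

SAMPLE_PERCENT = 1  # hypothetical share of searches sent for reranking
TOP_N = 100         # "top 100 results" from the proposal


def in_sample(request_id: str, percent: int = SAMPLE_PERCENT) -> bool:
    """Deterministically map a request to one of 100 buckets."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def rerank_if_sampled(
    request_id: str,
    query: str,
    results: Sequence[int],
    reranker: Callable[[str, List[int]], Sequence[int]],
    percent: int = SAMPLE_PERCENT,
) -> List[int]:
    """Return results, with the top TOP_N reranked for sampled requests.

    `results` holds page IDs only; `reranker` is a stand-in for a call
    to the external container.
    """
    original = list(results)
    if not in_sample(request_id, percent):
        return original
    head, tail = original[:TOP_N], original[TOP_N:]
    try:
        reranked = list(reranker(query, list(head)))
    except Exception:
        # Any failure in the external code falls back to our own ranking.
        return original
    # Reject output that adds, drops, or duplicates results.
    if sorted(reranked) != sorted(head):
        return original
    return reranked + tail
```

The fallback paths are the point of the sketch: a broken or misbehaving
container should never make search worse than what we serve today.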

> - If it goes well, how do we integrate it? Are they going to be
> willing to make their core code open source?

Good point. I think we should require at least some open result, i.e.
either open source code (under a license that permits reuse, with no
patents blocking it, etc.) or an open publication with freely accessible
algorithms and outcomes (or both?). I don't think it would make sense for
us to cooperate if we were unable to benefit from the results.
--
Stas Malyshev
smalyshev(a)wikimedia.org