Erik, Trey, David, Kevin, and I met this morning to discuss how we're going
to handle data collection for the upcoming TextCat test [1]. A big problem
in this particular case is that the system wasn't designed in a way that's
conducive to cross-wiki logging or session tracking. We also recently lost
the ability to use the referrer to see which page a user came from when
they move between wikis. (I was told this was done for user privacy
reasons.)
Erik said he recently implemented a click event in the
TestSearchSatisfaction2 schema that we might be able to hook into to
measure the clickthrough rate for users who are eligible for TextCat
language detection and are shown results in the language their non-English
query is probably written in. Whether we use this, and how much we rely on
it to measure whether TextCat is successful (beyond just measuring how it
impacts the zero results rate), depends on the validation [2] of the click
events and how they compare to page visit events (which cannot be fired in
an interwiki context).
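
To make the clickthrough measurement above concrete, here is a rough sketch of how it could be computed from a list of events. The keys `session_id` and `action`, and the action values `serp` and `click`, are placeholders I made up for illustration; they are not the actual TestSearchSatisfaction2 field names.

```python
from collections import defaultdict


def clickthrough_rate(events):
    """Return the fraction of search sessions with at least one click.

    Each event is a dict with the hypothetical keys 'session_id' and
    'action', where 'action' is 'serp' (results were shown) or 'click'.
    Returns None when no session was shown a results page.
    """
    actions_by_session = defaultdict(set)
    for event in events:
        actions_by_session[event["session_id"]].add(event["action"])

    # Only sessions that were shown a results page can click through.
    searched = [a for a in actions_by_session.values() if "serp" in a]
    if not searched:
        return None

    clicked = sum(1 for a in searched if "click" in a)
    return clicked / len(searched)


events = [
    {"session_id": "a", "action": "serp"},
    {"session_id": "a", "action": "click"},
    {"session_id": "b", "action": "serp"},
    {"session_id": "c", "action": "serp"},
    {"session_id": "c", "action": "click"},
    {"session_id": "c", "action": "click"},
    {"session_id": "d", "action": "serp"},
]
print(clickthrough_rate(events))  # 2 of 4 sessions clicked -> 0.5
```

Counting per session rather than per event keeps a single user who clicks many results from inflating the rate.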
We also discussed an alternative approach that uses web requests, with the
caveat that if a user is selected for the test once, they'll be selected
every time. So if a particular IP+UA combination is part of the test and
performs 2 million searches (as is sometimes the case), then we'll have to
do some very careful filtering, which will also exclude some completely
valid use cases (a computer lab in a school, or a country with only 2
public IP addresses). But we're shooting for being able to use
TestSearchSatisfaction2 :)
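
For the web request approach, the filtering mentioned above could look roughly like the sketch below. The keys `ip` and `user_agent` and the `max_searches` cutoff are assumptions for illustration only; we have not agreed on a threshold or settled how the request data would be shaped.

```python
from collections import Counter


def filter_heavy_clients(requests, max_searches):
    """Drop requests from IP+UA combinations above a search-count cutoff.

    Each request is a dict with the hypothetical keys 'ip' and
    'user_agent'. Returns the kept requests and the set of excluded
    (ip, user_agent) pairs so the exclusions can be reviewed by hand.
    """
    counts = Counter((r["ip"], r["user_agent"]) for r in requests)
    kept = [
        r for r in requests
        if counts[(r["ip"], r["user_agent"])] <= max_searches
    ]
    excluded = {client for client, n in counts.items() if n > max_searches}
    return kept, excluded


requests = (
    [{"ip": "10.0.0.1", "user_agent": "bot"}] * 5
    + [{"ip": "10.0.0.2", "user_agent": "Firefox"}] * 2
)
kept, excluded = filter_heavy_clients(requests, max_searches=3)
print(len(kept), excluded)
```

A plain count cutoff like this cannot tell an automated client from a shared IP address, which is exactly why it would also drop the valid cases mentioned above.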
[1] https://phabricator.wikimedia.org/T121542
[2] https://phabricator.wikimedia.org/T132706
--
*Mikhail Popov* // Count Logula, Discovery
<https://www.mediawiki.org/wiki/Wikimedia_Discovery>