On Thu, Jan 14, 2010 at 12:22 PM, Conrad Irwin
<conrad.irwin(a)googlemail.com> wrote:
Wiktionary is case-sensitive and so case-folding there may not be
appropriate; I personally would be interested in seeing these logs
before even the NFC normalizers get to them (given a lack of any other
source to find out how people type fun characters in the wild) though I
can appreciate this is somewhat sadistic, and probably the logs are
taken too late for this.
It would not be too much work to publish a set of post-processing
scripts that could perform those normalisations that people are
interested in; I don't think any two people will agree exactly on what
You've missed the point of the normalization here. It's not to be
helpful to users: As you observe, it's easy for the recipient of the
list to perform their own. The reason to normalize is to push more
queries above the reporting threshold. For example, 5 people might
search for "john f. kinndey" (a misspelling of "John F. Kennedy"?) but
all capitalize it differently. A redirect on this misspelling would be
useful regardless of the case.
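
To make the threshold effect concrete, here is a minimal sketch. It
assumes the log is a list of (ip, query) pairs and that the threshold
counts distinct IPs; the names reportable, THRESHOLD and sample_log are
hypothetical and not taken from any existing logging code.

```python
from collections import defaultdict

THRESHOLD = 5  # assumed reporting threshold (distinct IPs per query)

def reportable(log, threshold=THRESHOLD, fold_case=True):
    """Return {query: distinct IP count} for queries at or above the threshold.

    log is an iterable of (ip, query) pairs. With fold_case=True the
    count is taken over case-folded queries, so differently capitalized
    spellings of the same search are pooled together.
    """
    ips_by_query = defaultdict(set)
    for ip, query in log:
        key = query.casefold() if fold_case else query
        ips_by_query[key].add(ip)
    return {q: len(ips) for q, ips in ips_by_query.items()
            if len(ips) >= threshold}

# Made-up sample: five people, five capitalizations, one unrelated query.
sample_log = [
    ("10.0.0.1", "john f. kinndey"),
    ("10.0.0.2", "John F. Kinndey"),
    ("10.0.0.3", "JOHN F. KINNDEY"),
    ("10.0.0.4", "John f. Kinndey"),
    ("10.0.0.5", "john F. kinndey"),
    ("10.0.0.9", "some rare query"),
]

folded = reportable(sample_log)                # misspelling reaches 5
raw = reportable(sample_log, fold_case=False)  # nothing reaches 5
```

Without folding, each spelling is seen from a single IP and none of
them is reported; with folding, the misspelling crosses the threshold.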
All things being equal, I'd rather *not* normalize the data... it's just more
stuff that may have surprising behaviour. But I think this is
something which may need to be balanced against the disclosure
threshold.
It would also be possible to do the disclosure calculation against
normalized data while releasing the raw values... but I must admit a
little bit of uneasiness that the normalization might be ignoring some
piece of information relevant to privacy.
For example, if we were to go that route we might employ some fairly
aggressive normalization... removing all whitespace and punctuation.
If we went as far as also removing all *numbers* from the check we'd
run into things like "Greg Maxwell (555)-555-1212" getting published
because enough distinct people searched for "greg maxwell". Obviously
the answer to that one is "don't remove numbers" from the check, but I
worry about the cases I haven't thought of.
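
A rough sketch of that "check against normalized data, release the raw
values" idea follows. The normalization here (NFC, case-fold, drop
everything that is not a letter or digit) is only my assumption of what
"fairly aggressive" might mean, and check_key, releasable_raw_queries
and phone_log are hypothetical names. Digits are deliberately kept in
the key.

```python
import unicodedata
from collections import defaultdict

def check_key(query):
    """Key used only for the disclosure check, never published.

    Case-folds and drops whitespace and punctuation, but keeps digits,
    so a query carrying a phone number gets a different key from the
    bare name.
    """
    folded = unicodedata.normalize("NFC", query).casefold()
    return "".join(ch for ch in folded if ch.isalnum())

def releasable_raw_queries(log, threshold=5):
    """Return the raw query strings whose normalized key was searched
    from at least `threshold` distinct IPs."""
    ips_by_key = defaultdict(set)
    raw_by_key = defaultdict(set)
    for ip, query in log:
        key = check_key(query)
        ips_by_key[key].add(ip)
        raw_by_key[key].add(query)
    released = set()
    for key, ips in ips_by_key.items():
        if len(ips) >= threshold:
            released |= raw_by_key[key]
    return released

# Made-up sample: five people search the name, one adds a phone number.
phone_log = [
    ("10.0.0.1", "greg maxwell"),
    ("10.0.0.2", "Greg Maxwell"),
    ("10.0.0.3", "GREG MAXWELL"),
    ("10.0.0.4", "greg  maxwell"),
    ("10.0.0.5", "Greg.Maxwell"),
    ("10.0.0.6", "Greg Maxwell (555)-555-1212"),
]

released = releasable_raw_queries(phone_log)
```

Here the five name-only spellings are released and the phone-number
query is not, because its key still contains the digits. This only
covers the one failure I thought of; it says nothing about the ones I
haven't.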
On Thu, Jan 14, 2010 at 12:51 PM, Aryeh Gregor
<Simetrical+wikilist(a)gmail.com> wrote:
Some people might search for their own name more than five times in a
week, possibly together with other embarrassing or incriminating
search terms.
Yes, it's possible that someone may search 5 times, from 5 IPs (which
*might* be from one machine due to proxy round-robin), an identical
string ... "MyFullName seen on friday night with a woman other than
his wife" ... but what to do?
Any information which is disclosed has some risk of disclosing
something that someone would rather keep private. This risk can be made
arbitrarily small, but it can't be eliminated.
I think the benefit to the readers of having this information
available easily outweighs some sufficiently fringe confidentiality
concern. At some point your frequently repeated search is a
statistic, which no reasonable privacy policy would frown on
disclosing.
This is important to our operations, disclosing it is in the public
interest, and failing to do work in this area puts us at a
disadvantage compared to other parties who might be far less
scrupulous. (e.g. If WMF's search performs poorly, you might feel
compelled to use Search Engine X — which happens to secretly sell your
data to the highest bidder.)
Is there some sufficiently high number which *no one* paying attention
here has a concern about? We could simply start with that.... and
possibly lower the threshold over time as the lowest-hanging fruit is
dealt with, tracking our disclosure comfort.
I think we all have an interest and obligation to take every
reasonable means, but no one can ask for more than that.
Would anyone feel more comfortable if this ignored queries made via
the secure server? Non-HTTPS traffic can be watched by anyone on the
path between you and Wikimedia... any illusion of absolute privacy on
the insecure traffic is patently false already.