On Thu, Jan 14, 2010 at 12:22 PM, Conrad Irwin conrad.irwin@googlemail.com wrote:
Wiktionary is case-sensitive and so case-folding there may not be appropriate; I personally would be interested in seeing these logs before even the NFC normalizers get to them (given a lack of any other source to find out how people type fun characters in the wild) though I can appreciate this is somewhat sadistic, and probably the logs are taken too late for this.
It would not be too much work to publish a set of post-processing scripts that could perform those normalisations that people are interested in; I don't think any two people will agree exactly on what
You've missed the point of the normalization here. It's not to be helpful to users: As you observe, it's easy for the recipient of the list to perform their own. The reason to normalize is to push more queries above the reporting threshold. For example, 5 people might search for "john f. kinndey" (a misspelling of "John F. Kennedy"?) but all capitalize it differently. A redirect on this misspelling would be useful regardless of the case.
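To make the threshold argument concrete, here is a minimal sketch. It assumes the log is a list of (searcher, raw query) pairs, that the threshold is 5 distinct searchers, and that case folding plus whitespace collapsing is the normalization applied. None of those details are settled in this thread, and `normalize`, `reportable` and `THRESHOLD` are made-up names.

```python
import unicodedata
from collections import defaultdict

THRESHOLD = 5  # hypothetical reporting threshold (distinct searchers)

def normalize(query):
    # Assumed normalization: NFC, case folding, collapsed whitespace.
    return " ".join(unicodedata.normalize("NFC", query).casefold().split())

def reportable(log, key=normalize, threshold=THRESHOLD):
    """Return {key(query): distinct searcher count} for keys at or above threshold."""
    searchers = defaultdict(set)
    for searcher, raw in log:
        searchers[key(raw)].add(searcher)
    return {q: len(who) for q, who in searchers.items() if len(who) >= threshold}

# Five people, same misspelling, five different capitalizations.
log = [
    ("ip1", "john f. kinndey"),
    ("ip2", "John F. Kinndey"),
    ("ip3", "JOHN F. KINNDEY"),
    ("ip4", "John f. Kinndey"),
    ("ip5", "john F. kinndey"),
]

print(reportable(log))                   # normalized: one entry, 5 searchers
print(reportable(log, key=lambda q: q))  # raw: nothing reaches the threshold
```

With the raw strings every spelling has exactly one searcher, so nothing is reported; after folding they pool into a single entry that clears the threshold.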
All things equal I'd rather *not* normalize the data... it's just more stuff that may have surprising behaviour. But I think this is something which may need to be balanced against the disclosure threshold.
It would also be possible to do the disclosure calculation against normalized data while releasing the raw values... but I must admit a little bit of uneasiness that the normalization might be ignoring some piece of information relevant to privacy.
For example, if we were to go that route we might employ some fairly aggressive normalization... removing all whitespace and punctuation. If we went as far as also removing all *numbers* from the check we'd run into things like "Greg Maxwell (555)-555-1212" getting published because enough distinct people searched for "greg maxwell". Obviously the answer to that one is "don't remove numbers" from the check, but I worry about the cases I haven't thought of.
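A rough sketch of that failure mode, assuming the same (searcher, raw query) log shape and a threshold of 5 distinct searchers. The function names are invented for illustration, and only ASCII punctuation is stripped here.

```python
import string
from collections import defaultdict

_PUNCT_WS = str.maketrans("", "", string.punctuation + string.whitespace)
_DIGITS = str.maketrans("", "", string.digits)

def key_keep_digits(query):
    # Aggressive check key: fold case, drop ASCII punctuation and whitespace.
    return query.casefold().translate(_PUNCT_WS)

def key_drop_digits(query):
    # Same, but also drops digits (the variant that leaks).
    return key_keep_digits(query).translate(_DIGITS)

def release_raw(log, key, threshold=5):
    """Check the threshold on key(query), but release the raw strings."""
    searchers = defaultdict(set)
    raws = defaultdict(set)
    for searcher, raw in log:
        k = key(raw)
        searchers[k].add(searcher)
        raws[k].add(raw)
    released = set()
    for k, who in searchers.items():
        if len(who) >= threshold:
            released |= raws[k]
    return released

log = [
    ("ip1", "Greg Maxwell"),
    ("ip2", "greg maxwell"),
    ("ip3", "GREG MAXWELL"),
    ("ip4", "Greg  Maxwell"),
    ("ip5", "greg-maxwell"),
    ("ip6", "Greg Maxwell (555)-555-1212"),  # searched by one person only
]

leaky = release_raw(log, key_drop_digits)
safer = release_raw(log, key_keep_digits)
print("Greg Maxwell (555)-555-1212" in leaky)  # True: phone number published
print("Greg Maxwell (555)-555-1212" in safer)  # False
```

Keeping digits in the check key fixes this particular case, but the general problem stands: whatever the key throws away is information the threshold check can no longer see.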
On Thu, Jan 14, 2010 at 12:51 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
Some people might search for their own name more than five times in a week, possibly together with other embarrassing or incriminating search terms.
Yes, it's possible that someone may search 5 times, from 5 IPs (which *might* be from one machine due to proxy round-robins), for an identical string ... "MyFullName seen on friday night with a woman other than his wife" ... but what to do?
Any information which is disclosed carries some risk of revealing something that someone would rather keep private. This risk can be made arbitrarily small, but it can't be eliminated.
I think the benefit to the readers of having this information available easily outweighs some sufficiently fringe confidentiality concern. At some point your frequently repeated search is a statistic, which no reasonable privacy policy would frown on disclosing.
This is important to our operations, disclosing it is in the public interest, and failing to do work in this area puts us at a disadvantage compared to other parties who might be far less scrupulous. (E.g. if WMF's search performs poorly, you might feel compelled to use Search Engine X, which happens to secretly sell your data to the highest bidder.)
Is there some sufficiently high number which *no one* paying attention here has a concern about? We could simply start with that... and possibly lower the threshold over time as the lowest-hanging fruit is dealt with, tracking our disclosure comfort.
I think we all have an interest and obligation to take every reasonable means, but no one can ask for more than that.
Would anyone feel more comfortable if this ignored queries made via the secure server? Non-HTTPS traffic can be watched by anyone on the path between you and Wikimedia... any illusion of absolute privacy on the insecure traffic is patently false already.