On Thu, Jan 14, 2010 at 12:22 PM, Conrad Irwin <conrad.irwin@googlemail.com> wrote:
Wiktionary is case-sensitive, so case-folding may not be appropriate there. Personally, I would be interested in seeing these logs before even the NFC normalizers get to them, since there is no other source for finding out how people type fun characters in the wild. I appreciate that this is somewhat sadistic, and the logs are probably taken too late for this anyway.
The logs are taken from the Squids, long before MediaWiki touches them, so they shouldn't be normalized at all.
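For anyone who wants to check what "not normalized" means in practice, here is a minimal Python sketch of the distinction between NFC normalization and case-folding discussed above. It uses only the standard `unicodedata` module; the helper names (`to_nfc`, `was_typed_as_nfc`, `same_after_casefold`) and the sample strings are made up for illustration and say nothing about how the Squid logs are actually processed.

```python
import unicodedata

# Two spellings of the same visible word: precomposed vs. combining accent.
COMPOSED = "caf\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
DECOMPOSED = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT


def to_nfc(query: str) -> str:
    """Return the NFC form of a query string."""
    return unicodedata.normalize("NFC", query)


def was_typed_as_nfc(raw_query: str) -> bool:
    """True if the raw query was already in NFC when it was logged."""
    return to_nfc(raw_query) == raw_query


def same_after_casefold(a: str, b: str) -> bool:
    """True if two queries become identical once case is folded away."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    # The raw strings differ, but NFC maps both to the composed form.
    print(COMPOSED == DECOMPOSED)           # raw comparison
    print(to_nfc(DECOMPOSED) == COMPOSED)   # comparison after NFC
    # Case-folding merges entries that a case-sensitive wiki keeps apart.
    print(same_after_casefold("Polish", "polish"))
```

Raw logs would let you count how often `was_typed_as_nfc` is false, which is exactly the information that disappears once a normalizer has run.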
I don't think the IP addresses should come into the analysis at all, though a cut-off of 5 or 10 searches might be useful to prevent a huge tail-end of probably useless information. It might also exclude cases where people have typed things into the search box by accident (maybe they got distracted while logging in).
Some people might search for their own name more than five times in a week, possibly together with other embarrassing or incriminating search terms.
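To make the proposed cut-off concrete, here is a rough Python sketch of what such a filter could look like. The function name `frequent_queries`, the `min_count` parameter, and the sample data are hypothetical; the real log format and pipeline are not specified in this thread. Note that it counts raw occurrences without looking at IP addresses, so one person repeating a search six times passes the threshold just as easily as six different people would.

```python
from collections import Counter
from typing import Dict, Iterable


def frequent_queries(queries: Iterable[str], min_count: int = 5) -> Dict[str, int]:
    """Keep only search terms that occur at least `min_count` times.

    Counts are taken over all logged searches; no IP address or other
    per-user information is consulted (an assumption of this sketch).
    """
    counts = Counter(queries)
    return {term: n for term, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    # Hypothetical week of search terms, already extracted from the logs.
    sample = ["foo"] * 6 + ["bar"] * 2 + ["baz"] * 5
    print(frequent_queries(sample, min_count=5))
```

A threshold like this trims the long tail, but by itself it does not guarantee that personal or identifying terms are removed.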