On 01/14/2010 05:51 PM, Aryeh Gregor wrote:
On Thu, Jan 14, 2010 at 12:22 PM, Conrad Irwin <conrad.irwin@googlemail.com> wrote:
Wiktionary is case-sensitive, so case-folding may not be appropriate there. Personally, I would be interested in seeing these logs before even the NFC normalizers get to them, given the lack of any other source for finding out how people type fun characters in the wild. I can appreciate that this is somewhat sadistic, though, and the logs are probably taken too late for it.
The logs are taken from the Squids, long before MediaWiki touches them, so they shouldn't be normalized at all.
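To make the distinction above concrete, here is a minimal Python sketch of what NFC normalization and case-folding each do to a search string. The sample strings are made up for illustration and are not taken from any log; the only library calls used are the standard `unicodedata.normalize` and `str.casefold`.

```python
import unicodedata

# "é" typed as a base letter plus a combining accent (two code points)
decomposed = "cafe\u0301"
# "é" typed as a single precomposed code point
precomposed = "caf\u00e9"

# NFC composes the sequence, so both spellings become the same string.
nfc = unicodedata.normalize("NFC", decomposed)
same_after_nfc = (nfc == precomposed)

# Case-folding merges strings that differ only in case. On a
# case-sensitive wiki these can be two distinct titles.
folded_equal = ("Polish".casefold() == "polish".casefold())
raw_equal = ("Polish" == "polish")
```

Raw logs would keep `decomposed` and `precomposed` apart, which is exactly the information about how people type that is lost once NFC has been applied.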
I don't think the IP addresses should come into the analysis at all, though a cut-off at 5 or 10 searches might be useful to prevent a huge tail-end of probably useless information. It might also exclude cases where people have typed things into the search box by accident (maybe they got distracted while logging in).
Some people might search for their own name more than five times in a week, possibly together with other embarrassing or incriminating search terms.
Such people would be able to deny searching for such terms; I don't see this as posing any more problems than the history dumps. Thinking further, though, it would be possible to tie a search to an IP address or user when a page is created with the search term: if there was only one search, it is highly likely that the user who created the page was the one who made it.
It thus seems likely that a cut-off point is needed, and that it can only be chosen arbitrarily, or by someone with the relevant permissions scanning the logs to find out this information. Looking at "prior art", it seems that 25 is at least high enough: http://wikistics.falsikon.de/2008/wiktionary/fr/wanted/ but obviously, the higher the number, the less complete the lists are.
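
For concreteness, a cut-off of this kind could be sketched roughly as follows in Python. The function name `frequent_terms`, the sample terms and the default threshold of 25 are all illustrative assumptions, not anything the existing log tooling provides; the input is assumed to be a plain list of search strings with no IP addresses attached.

```python
from collections import Counter

def frequent_terms(terms, min_count=25):
    """Return {term: count} for terms searched at least min_count times.

    Hypothetical helper: terms below the cut-off are dropped entirely,
    so one-off (and potentially identifying) searches never appear.
    """
    counts = Counter(terms)
    return {term: n for term, n in counts.items() if n >= min_count}

# Made-up sample: two common terms and one rare, possibly personal one.
sample = ["chat"] * 30 + ["chien"] * 25 + ["jean dupont"] * 3

kept = frequent_terms(sample)                   # cut-off of 25
kept_all = frequent_terms(sample, min_count=1)  # no effective cut-off
```

With the cut-off at 25 the rare term is excluded; lowering it lets more of the tail through, which is the completeness trade-off mentioned above.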
Conrad