- search queries are logged in a standardized fashion (for grouping),
e.g. lowercase, single spaces, no leading/trailing spaces, special chars converted to spaces, etc.
Wiktionary is case-sensitive, so case-folding may not be appropriate there. Personally, I would be interested in seeing these logs before even the NFC normalizers get to them, given the lack of any other source for finding out how people type fun characters in the wild. I can appreciate that this is somewhat sadistic, though, and the logs are probably taken too late for it.
It would not be too much work to publish a set of post-processing scripts that perform whichever normalisations people are interested in. I don't think any two people will agree exactly on what information is useful, and removing information unnecessarily is just draconian.
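As a rough idea of what such a post-processing script might look like, here is a minimal Python sketch. The function names (normalise_query, regroup) and the fold_case switch are made up for illustration, and the exact rules (what counts as a "special char", say) are assumptions that each consumer of the logs would want to adjust.

```python
import re
import unicodedata
from collections import Counter


def normalise_query(query, fold_case=False):
    """Normalise one raw search string (hypothetical rules, adjust to taste)."""
    # Unicode NFC normalisation, so equivalent character sequences compare equal.
    text = unicodedata.normalize("NFC", query)
    # Treat anything that is not a letter, digit, underscore or whitespace
    # as a "special char" and turn it into a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace and drop leading/trailing spaces.
    text = " ".join(text.split())
    # Case-folding is optional, since it may not suit a case-sensitive wiki.
    if fold_case:
        text = text.lower()
    return text


def regroup(raw_counts, fold_case=False):
    """Re-aggregate {raw query: hits} under the normalised form of each query."""
    grouped = Counter()
    for query, hits in raw_counts.items():
        grouped[normalise_query(query, fold_case)] += hits
    return grouped
```

Because the raw strings are kept in the published file, anyone who wants the lowercase grouping can run regroup(counts, fold_case=True), and anyone who does not can skip it.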
- display searches per week (?) that have been searched for at least
10 times from at least 5 different IP hashes (to avoid people searching their own name 100 times...)
I don't think the IP addresses should come into the analysis at all, though a cut-off at 5 or 10 searches might be useful to prevent a huge tail end of probably useless information. It might also exclude cases where people have typed things into the search box by accident (maybe they got distracted while logging in).
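The cut-off on its own needs nothing but the hit counts. A minimal Python sketch, where apply_cutoff and its default threshold are illustrative assumptions rather than anything the logging code is known to do:

```python
def apply_cutoff(counts, min_hits=5):
    """Keep only queries searched at least min_hits times.

    counts maps each search string to its raw number of hits; no IP
    information is involved.
    """
    return {query: hits for query, hits in counts.items() if hits >= min_hits}
```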
The logs are probably combined across wikis, so I'd change that to
#start_datetime end_datetime projectcode hits search_string
If these files were to be provided regularly, it would make sense to have the time period and the wiki defined in the file name, covering either a month or a week at a time. This would leave the file contents very simple: just the raw number of hits, followed by a space, followed by what was typed into the Search box (or as close to that as is available).
$ cat enwiktionary-2010-01-failedsearches.lis
123919 MLIF
....
12873 mlif
...
103 MILF definition
...
1 what does M.I.L.F meen????
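Reading such a file back would be trivial. A minimal Python sketch, assuming one entry per line and the monthly file name pattern used in the example above; the function names and the regular expression are mine, and a weekly naming scheme would need its own pattern:

```python
import re

# Matches names like "enwiktionary-2010-01-failedsearches.lis".
# The shape of the project code is an assumption based on that one example.
FILENAME_RE = re.compile(
    r"^(?P<project>[a-z_]+)-(?P<year>\d{4})-(?P<month>\d{2})-failedsearches\.lis$"
)


def parse_filename(name):
    """Return (project code, year, month) taken from the file name."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError("unexpected file name: %r" % name)
    return match.group("project"), int(match.group("year")), int(match.group("month"))


def parse_line(line):
    """Split one entry into (hits, search string).

    Only the first space separates the two fields, so whatever was typed
    into the search box is kept intact, inner spaces included.
    """
    hits, _, query = line.rstrip("\n").partition(" ")
    return int(hits), query
```

Splitting on the first space only is the point of putting the count first: the search string can then contain anything at all without needing to be escaped.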
Conrad
( http://en.wiktionary.org/w/index.php?oldid=4055082 for MILF explanation)