* search queries are logged in a standardized fashion (for grouping),
e.g. lowercase, single spaces, no leading/trailing spaces, special
chars converted to spaces, etc.
Wiktionary is case-sensitive, so case-folding may not be appropriate
there. I personally would be interested in seeing these logs before
even the NFC normalizers get to them (there is no other source for
finding out how people type fun characters in the wild), though I can
appreciate this is somewhat sadistic, and the logs are probably taken
too late for this.
It would not be too much work to publish a set of post-processing
scripts that could perform those normalisations that people are
interested in; I don't think any two people will agree exactly on what
information is useful, and removing information unnecessarily is just
draconian.
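
As a rough idea of what such a post-processing script could look like,
here is a minimal Python sketch. The function names (normalise, regroup)
and the option flags are my own invention for illustration, and the
exact rules (what counts as a "special char", say) are assumptions that
each consumer of the logs would want to adjust:

```python
import re
import unicodedata
from collections import Counter


def normalise(query, fold_case=True, nfc=True, strip_special=True):
    """Apply the optional normalisations to one raw search string."""
    if nfc:
        # Unicode canonical composition (NFC).
        query = unicodedata.normalize("NFC", query)
    if fold_case:
        # Probably not wanted for a case-sensitive wiki such as Wiktionary.
        query = query.casefold()
    if strip_special:
        # Assumption: "special chars" means anything that is not a
        # Unicode letter, digit or underscore.
        query = re.sub(r"[^\w]+", " ", query)
    # Collapse runs of whitespace and drop leading/trailing spaces.
    return " ".join(query.split())


def regroup(counts, **options):
    """Merge hit counts of raw strings that normalise to the same key."""
    merged = Counter()
    for query, hits in counts.items():
        merged[normalise(query, **options)] += hits
    return merged
```

With the defaults, regroup() would fold "MLIF" and "mlif" into one
entry; passing fold_case=False keeps them apart, which is the point of
leaving the published logs raw and letting each reader choose.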
* display searches per week (?) that have been searched for at least
10 times from at least 5 different IP hashes (to avoid people
searching their own name 100 times...)
I don't think the IP addresses should come into the analysis at all,
though a cut-off at 5 or 10 searches might be useful to trim a huge
tail of probably useless information. It might also exclude cases where
people typed things into the search box by accident, for example
because they got distracted while logging in.
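
A cut-off of that kind needs nothing more than the hit counts. A small
Python sketch, where apply_cutoff and the default of 5 are purely
illustrative choices on my part:

```python
def apply_cutoff(counts, min_hits=5):
    """Keep only search strings seen at least min_hits times.

    counts maps a search string to its number of hits; no IP data
    is involved.
    """
    return {query: hits for query, hits in counts.items() if hits >= min_hits}
```

Whether the threshold should be 5, 10 or something else is exactly the
sort of thing that is easier to decide after looking at the tail.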
The logs are probably combined across wikis, so I'd change that to
#start_datetime end_datetime projectcode hits search_string
If these files were to be provided regularly, it would make sense to
define the time period (a month or a week at a time) and the wiki in
the file name. This would leave the file contents very simple: just the
raw number of hits, followed by a space, followed by what was typed
into the Search box (or as close to that as is available).
$ cat enwiktionary-2010-01-failedsearches.lis
123919 MLIF
....
12873 mlif
...
103 MILF definition
...
1 what does M.I.L.F meen????
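
Reading such a file back would be trivial. The Python sketch below
assumes the monthly file-name pattern shown in the example above
(wiki, then YYYY-MM, then "failedsearches.lis"); a weekly naming scheme
would need a different pattern, and parse_filename / parse_line are
just names I made up:

```python
import re

# Assumed name layout: <wiki>-<YYYY>-<MM>-failedsearches.lis
FILENAME_RE = re.compile(
    r"(?P<wiki>[a-z_]+)-(?P<period>[0-9]{4}-[0-9]{2})-failedsearches\.lis"
)
# One record per line: hit count, a single space, then the raw search string.
LINE_RE = re.compile(r"([0-9]+) (.*)")


def parse_filename(name):
    """Return (wiki, period) taken from the file name."""
    match = FILENAME_RE.fullmatch(name)
    if match is None:
        raise ValueError("unexpected file name: %r" % name)
    return match.group("wiki"), match.group("period")


def parse_line(line):
    """Return (hits, search_string), or None if the line is not a record."""
    match = LINE_RE.fullmatch(line.rstrip("\n"))
    if match is None:
        return None
    # Everything after the first space is kept verbatim, including any
    # further spaces or punctuation the user typed.
    return int(match.group(1)), match.group(2)
```

Splitting on the first space only means the search string needs no
quoting or escaping, whatever people typed.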
Conrad
(http://en.wiktionary.org/w/index.php?oldid=4055082 for MILF explanation)