Log of failed searches

Apoc 2400

14 Jan 2010

9:37 a.m.

Would it be possible to generate a log or statistics of searches on Wikipedia using the "Go" button that did not immediately reach an article? Properly anonymized of course. I think it would be useful for finding missing articles and redirects to create. There would be a lot of crap of course, but probably also very useful information on what people have trouble finding.

Magnus Manske

14 Jan

3:09 p.m.

On Thu, Jan 14, 2010 at 9:37 AM, Apoc 2400 <apoc2400(a)gmail.com> wrote:

...

We used to have that. I don't remember why it was turned off - probably too many results. Magnus

Robert Stojnic

3:22 p.m.

Magnus Manske wrote:

...

On Thu, Jan 14, 2010 at 9:37 AM, Apoc 2400 <apoc2400(a)gmail.com> wrote:

We used to have that. I don't remember why it was turned off - probably too many results.

We used to do it, and the plan was to make it public; however, there are apparently privacy issues, and no-one knows whether we can publish them, or in what format, etc. So since it was filling up the disk and was not used, I have disabled it until a solution and storage space are found.

Cheers, r.

Nikola Smolenski

3:27 p.m.

Robert Stojnic wrote:

...

Magnus Manske wrote:
> On Thu, Jan 14, 2010 at 9:37 AM, Apoc 2400 <apoc2400(a)gmail.com> wrote:
>> Would it be possible to generate a log or statistics of searches on
>> Wikipedia using the "Go" button that did not immediately reach an article?

Also, searches made using either button that did not have any results. There are smaller Wikipedias out there, you know :)

...

Properly anonymized of course. I think it would be useful for finding missing articles and redirects to create. There would be a lot of crap of course, but probably also very useful information on what people have trouble finding.

We used to have that. I don't remember why it was turned off - probably too many results.

We used to do it, and the plan was to make it public, however, there are privacy issues apparently and no-one knows if we can or cannot publish

What would be privacy issues if only the statistics are displayed?

...

them, and in what format etc.. So since it was filling up the disk and

I suggest HTML :)

Magnus Manske

3:47 p.m.

On Thu, Jan 14, 2010 at 3:27 PM, Nikola Smolenski <smolensk(a)eunet.rs> wrote:

...

Robert Stojnic wrote:

Also, searches made using either button that did not have any results. There are smaller Wikipedias out there, you know :)

We used to have that. I don't remember why it was turned off - probably too many results.

We used to do it, and the plan was to make it public, however, there are privacy issues apparently and no-one knows if we can or cannot publish

What would be privacy issues if only the statistics are displayed?

I guess people searching for their own name, or the like.

Suggestion:
* log search and SHA1 IP hash (anonymous!)
* search queries are logged in a standardized fashion (for grouping), e.g. lowercase, single spaces, no leading/trailing spaces, special chars converted to spaces, etc.
* display searches per week (?) that have been searched for at least 10 times from at least 5 different IP hashes (to avoid people searching their own name 100 times...)

Magnus
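The three steps in Magnus's suggestion can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `normalize` and `reportable` are hypothetical, the input is assumed to be (ip, raw_query) pairs for one reporting week, and the sketch counts plain IPs in memory rather than SHA1 hashes, since the hash only matters for what gets stored.

```python
import re
from collections import defaultdict


def normalize(query: str) -> str:
    """Lowercase, turn special characters into spaces, collapse whitespace."""
    lowered = query.lower()
    # \w is Unicode-aware in Python 3, so letters such as "é" survive.
    no_specials = re.sub(r"[^\w\s]|_", " ", lowered)
    # split()/join removes leading, trailing and repeated whitespace.
    return " ".join(no_specials.split())


def reportable(log, min_hits=10, min_ips=5):
    """Return {normalized query: hits} for queries passing both cut-offs.

    `log` is an iterable of (ip, raw_query) pairs for one reporting period.
    """
    hits = defaultdict(int)
    ips = defaultdict(set)
    for ip, raw in log:
        key = normalize(raw)
        if not key:
            continue  # nothing left after normalization
        hits[key] += 1
        ips[key].add(ip)
    return {q: n for q, n in hits.items()
            if n >= min_hits and len(ips[q]) >= min_ips}
```

With these cut-offs, a query typed 100 times from a single address is dropped, while ten searches spread over five addresses are kept.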

Robert Stojnic

3:56 p.m.

This sounds like a good idea, although we could probably argue about cut-offs. However, since this needs to be done in-house (and not on toolserver etc., because I imagine we cannot distribute raw logs), I imagine it is going to go very slowly, as there is no-one from core staff working on it or planning to work on it... r.

...

Bryan Tong Minh

3:58 p.m.

On Thu, Jan 14, 2010 at 4:47 PM, Magnus Manske <magnusmanske(a)googlemail.com> wrote:

...

* log search and SHA1 IP hash (anonymous!)

There are only 2 billion unique addresses and they can all be found in half an hour probably. Bryan
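Bryan's objection can be illustrated with a small sketch. The IPv4 space holds at most 2^32 (about 4.3 billion) addresses, so an unsalted hash of an address can be reversed by trying every candidate. The helper names `sha1_ip` and `reverse_hash` are hypothetical, the hash input is assumed to be the dotted-quad string, and the search is restricted to a single documentation subnet so that it finishes instantly; no timing claim is made for the full address space.

```python
import hashlib
import ipaddress


def sha1_ip(ip: str) -> str:
    """Unsalted SHA-1 of the dotted-quad form (assumed hashing scheme)."""
    return hashlib.sha1(ip.encode("ascii")).hexdigest()


def reverse_hash(target: str, network: str):
    """Try every address in `network`; return the one whose SHA-1 matches."""
    for addr in ipaddress.ip_network(network):
        candidate = str(addr)
        if sha1_ip(candidate) == target:
            return candidate
    return None  # the hashed address is not inside this network


if __name__ == "__main__":
    logged = sha1_ip("192.0.2.77")
    print(reverse_hash(logged, "192.0.2.0/24"))
```

Precomputing the same loop once over all expected addresses yields a lookup table, which is why a bare hash offers little protection.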

David Gerard

4:01 p.m.

2010/1/14 Bryan Tong Minh <bryan.tongminh(a)gmail.com>:

...

On Thu, Jan 14, 2010 at 4:47 PM, Magnus Manske <magnusmanske(a)googlemail.com> wrote:

...

> * log search and SHA1 IP hash (anonymous!)

...

There are only 2 billion unique addresses and they can all be found in half an hour probably.

A count of search terms, with no IP info at all? Would be more useful than nothing. (modulo the issue Michael Snow raised re: searches on suppressable names) - d.

Gregory Maxwell

4:15 p.m.

On Thu, Jan 14, 2010 at 11:01 AM, David Gerard <dgerard(a)gmail.com> wrote:

...

2010/1/14 Bryan Tong Minh <bryan.tongminh(a)gmail.com>:

On Thu, Jan 14, 2010 at 4:47 PM, Magnus Manske <magnusmanske(a)googlemail.com> wrote:

> * log search and SHA1 IP hash (anonymous!)

There are only 2 billion unique addresses and they can all be found in half an hour probably.

A count of search terms, with no IP info at all? Would be more useful than nothing. (modulo the issue Michael Snow raised re: searches on suppressable names)

Magnus was not suggesting disclosing the IP hash, as far as I can tell. He was demonstrating an abundance of caution in suggesting only logging that. (er, well, yea, if he was suggesting disclosing that... we shouldn't do that. Even if we add a secret to the hash, it's risky and allows interesting correlation attacks)

Here is what I would suggest disclosing:

#start_datetime end_datetime hits search_string
2010-01-01-0:0:4 2010-01-13-23-59-50 39284 naked people
2010-01-01-0:0:4 2010-01-13-23-59-50 23950 hot grits
...
2010-01-01-0:0:4 2010-01-13-23-59-50 5 autoerotic quantum chromodynamics

Which has first been filtered by:
* Canonicalization of strings (at least ASCII case folding)
* Excluding strings over some length
* Excluding searches which did not come from at least 5 distinct IPs during the reporting interval

There will be useful information excluded by this process, e.g. gads of misspellings which came from only two to four unique IPs... but the output would still be *far* more useful than no information at all.

Gregory Maxwell

4:21 p.m.

On Thu, Jan 14, 2010 at 11:15 AM, Gregory Maxwell <gmaxwell(a)gmail.com> wrote:

...

Here is what I would suggest disclosing:

#start_datetime end_datetime hits search_string
2010-01-01-0:0:4 2010-01-13-23-59-50 39284 naked people
2010-01-01-0:0:4 2010-01-13-23-59-50 23950 hot grits
...
2010-01-01-0:0:4 2010-01-13-23-59-50 5 autoerotic quantum chromodynamics

The logs are probably combined across wikis, so I'd change that to

#start_datetime end_datetime projectcode hits search_string
2010-01-01-0:0:4 2010-01-13-23-59-50 en.wikipedia 39284 naked people
2010-01-01-0:0:4 2010-01-13-23-59-50 en.wikipedia 23950 hot grits
...
2010-01-01-0:0:4 2010-01-13-23-59-50 en.wikipedia 5 autoerotic quantum chromodynamics
2010-01-01-0:0:4 2010-01-13-23-59-50 de.wikipedia 25093 Bondage & Disziplin Pokémon
...
...
2010-01-01-0:0:4 2010-01-13-23-59-50 en.wikinews 5 ethics in journalism
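A report in roughly this shape could be produced from pre-aggregated counts along the following lines. This is a sketch, not an existing tool: the `Row` record, the `format_report` function, the tab separator, the ISO-style timestamps and the 64-character length limit are all assumptions made for illustration; only the column order and the "at least 5 distinct IPs" rule come from the messages above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Row:
    project: str       # e.g. "en.wikipedia"
    query: str         # already canonicalized search string
    hits: int          # total searches in the interval
    distinct_ips: int  # used for filtering only, never printed


def format_report(rows, start, end, min_ips=5, max_len=64):
    """Yield a header plus one tab-separated line per disclosable row."""
    stamp = "%Y-%m-%dT%H:%M:%S"  # assumed timestamp format
    yield "#start_datetime\tend_datetime\tprojectcode\thits\tsearch_string"
    kept = [r for r in rows
            if r.distinct_ips >= min_ips and len(r.query) <= max_len]
    # Group by project, most frequent query first within each project.
    for r in sorted(kept, key=lambda r: (r.project, -r.hits)):
        yield "\t".join([start.strftime(stamp), end.strftime(stamp),
                         r.project, str(r.hits), r.query])
```

A tab separator is used here because search strings themselves contain spaces; the distinct-IP count is deliberately left out of the output.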

Tei

4:41 p.m.

2010/1/14 Gregory Maxwell <gmaxwell(a)gmail.com>:

...

On Thu, Jan 14, 2010 at 11:15 AM, Gregory Maxwell <gmaxwell(a)gmail.com> wrote:

my $0.02

I expect some fun here, since encoding errors will hit things like &, ñ, ó.

2010-01-01-0:0:4 2010-01-13-23-59-50 de.wikipedia 25093 Bondage amp; Disziplin Pokémon
2010-01-01-0:0:4 2010-01-13-23-59-50 de.wikipedia 25093 Bondage %33amp; Disziplin Pokémon
....

On the other hand, all these errors will be browser/proxy bugs, and not MediaWiki bugs. I think. Anyway, if the "special characters" are replaced by spaces, there will be less weird shit, and more mysterious space holes.

--
End of message.

Conrad Irwin

5:22 p.m.

...

* search queries are logged in a standardized fashion (for grouping), e.g. lowercase, single spaces, no leading/trailing spaces, special chars converted to spaces, etc.

Wiktionary is case-sensitive and so case-folding there may not be appropriate; I personally would be interested in seeing these logs before even the NFC normalizers get to them (given a lack of any other source to find out how people type fun characters in the wild) though I can appreciate this is somewhat sadistic, and probably the logs are taken too late for this. It would not be too much work to publish a set of post-processing scripts that could perform those normalisations that people are interested in; I don't think any two people will agree exactly on what information is useful, and removing information unnecessarily is just draconian.

...

* display searches per week (?) that have been searched for at least 10 times from at least 5 different IP hashes (to avoid people searching their own name 100 times...)

I don't think the IP addresses should come into the analysis at all, though possibly a cut-off at 5 or 10 searches might be useful to prevent a huge tail-end of probably useless information (it also might exclude cases where people have typed things into the search box by accident - maybe they got distracted while logging in)

...

The logs are probably combined across wikis, so I'd change that to #start_datetime end_datetime projectcode hits search_string

If these files were to be provided regularly, it would make sense to have the time period and the wiki defined in the file name, either a month or a week at a time. This would leave the file contents very simple: just the raw number of hits, followed by a space, followed by what was typed into the Search box (or as close to it as is available).

$ cat enwiktionary-2010-01-failedsearches.lis
123919 MLIF
....
12873 mlif
...
103 MILF definition
...
1 what does M.I.L.F meen????

Conrad
( http://en.wiktionary.org/w/index.php?oldid=4055082 for MILF explanation)
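Conrad's layout is simple enough to write and read back in a few lines. The sketch below assumes the file name pattern shown in his example and one "hits, space, query" record per line; the function names are hypothetical.

```python
def file_name(wiki: str, year: int, month: int) -> str:
    """Encode the wiki and the reporting month in the file name."""
    return f"{wiki}-{year:04d}-{month:02d}-failedsearches.lis"


def dump_lines(counts):
    """Render {query: hits} as lines, most frequent first.

    Ties are ordered by the query text so the output is deterministic.
    """
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{hits} {query}" for query, hits in ordered]


def parse_line(line: str):
    """Split at the first space only, so spaces inside the query survive."""
    hits, query = line.rstrip("\n").split(" ", 1)
    return int(hits), query
```

Because the count comes first and is always a plain integer, the query needs no quoting or escaping as long as it contains no line breaks.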

Aryeh Gregor

5:51 p.m.

On Thu, Jan 14, 2010 at 12:22 PM, Conrad Irwin <conrad.irwin(a)googlemail.com> wrote:

...

The logs are taken from the Squids, long before MediaWiki touches them, so they shouldn't be normalized at all.

...

Some people might search for their own name more than five times in a week, possibly together with other embarrassing or incriminating search terms.

Conrad Irwin

6:15 p.m.

On 01/14/2010 05:51 PM, Aryeh Gregor wrote:

...

On Thu, Jan 14, 2010 at 12:22 PM, Conrad Irwin <conrad.irwin(a)googlemail.com> wrote:

The logs are taken from the Squids, long before MediaWiki touches them, so they shouldn't be normalized at all.

Some people might search for their own name more than five times in a week, possibly together with other embarrassing or incriminating search terms.

Such people would be able to deny searching for such terms; I don't see this as posing any more problems than the history dumps.

Thinking further though, it would be possible to tie a search to an IP address or User when a page is created with the search term (if there was only one search, it is highly likely that this user made it). It thus seems likely that a cut-off point is needed, and that it can only be chosen arbitrarily or by someone with relevant permission scanning logs to find out this information.

Looking at "prior art", it seems that 25 is high enough or more than: http://wikistics.falsikon.de/2008/wiktionary/fr/wanted/ but obviously, the higher the number, the less complete the lists are.

Conrad

Conrad Irwin

6:26 p.m.

...

scanning logs to find out this information. Looking at "prior art", it seems that 25 is high enough or more than: http://wikistics.falsikon.de/2008/wiktionary/fr/wanted/ but obviously, the higher the number, the less complete the lists are. Conrad

Whoops, that should have been 14, can't do maths any more; sorry for the spam. Conrad

Robert Stojnic

8:03 p.m.

...

I think the biggest issue is that people expect their search queries to be private. When the word got out that the search log might become available, we got a couple of angry remarks like "we didn't sign up for this" and "even google doesn't do it". Some form of statistics would probably be fine with our users, but the cut-off numbers would need to be high enough that people no longer feel it is "their query". r.

Platonides

11:32 p.m.

Aryeh Gregor wrote:

...

The logs are taken from the Squids, long before MediaWiki touches them, so they shouldn't be normalized at all.

Search isn't cached, so it may be easier to just log it at the backend.

I expect many people are using things like "please tell me how many people live in China", as revealed by such titles being created. My conclusion is that some people (10%?) don't know how to search in an encyclopedia. I mean, we have an article called [[China]] with a proper Population section...

While reading this thread I have deleted a page called "Why do ghosts manifest themselves?" with content "fogcpijkñldjlkcmvlkmc.,vmblcjgmlkjglkjmf,.mfdgfdolfgdjk" [1].

I'm thinking of an extension fed with regexes that extract the actual title they may be looking for. Sampled search logs are unlikely to reveal them though, since what they are repeating are the non-keywords, not the full query.

[1] http://es.wikipedia.org/w/index.php?title=Special:Log&page=Por_que_se…

Gregory Maxwell

15 Jan

2:16 a.m.

On Thu, Jan 14, 2010 at 6:32 PM, Platonides <Platonides(a)gmail.com> wrote:

...

Sampled search logs are unlikely to reveal them though, since what they are repeating are the non-keywords, not the full query.

Sampling is fine, but aggregated logs aren't likely to… that's the primary reason for reporting things other than the topmost queries.

Gregory Maxwell

14 Jan

7:30 p.m.

On Thu, Jan 14, 2010 at 12:22 PM, Conrad Irwin <conrad.irwin(a)googlemail.com> wrote:

...

You've missed the point of the normalization here. It's not to be helpful to users: As you observe, it's easy for the recipient of the list to perform their own. The reason to normalize is to push more queries above the reporting threshold. For example, 5 people might search for "john f. kinndey" (a misspelling of "John F. Kennedy"?) but all capitalize it differently. A redirect on this misspelling would be useful regardless of the case.

All things equal I'd rather *not* normalize the data... it's just more stuff that may have surprising behaviour. But I think this is something which may need to be balanced against the disclosure threshold.

It would also be possible to do the disclosure calculation against normalized data while releasing the raw values... but I must admit a little bit of uneasiness that the normalization might be ignoring some piece of information relevant to privacy. For example, if we were to go that route we might employ some fairly aggressive normalization... removing all whitespace and punctuation. If we went as far as also removing all *numbers* from the check we'd run into things like "Greg Maxwell (555)-555-1212" getting published because enough distinct people searched for "greg maxwell". Obviously the answer to that one is "don't remove numbers" from the check, but I worry about the cases I haven't thought of.

On Thu, Jan 14, 2010 at 12:51 PM, Aryeh Gregor <Simetrical+wikilist(a)gmail.com> wrote:

...

Some people might search for their own name more than five times in a week, possibly together with other embarrassing or incriminating search terms.

Yes, it's possible that someone may search 5 times, from 5 IPs (which *might* be from one machine due to proxy round-robins), an identical string ... "MyFullName seen on friday night with a woman other than his wife" ... but what to do? Any information which is disclosed has some risk of disclosing something that someone would rather not have disclosed. This risk can be made arbitrarily small, but it can't be eliminated.

I think the benefit to the readers of having this information available easily outweighs some sufficiently fringe confidentiality concern. At some point your frequently repeated search is a statistic, which no reasonable privacy policy would frown on disclosing. This is important to our operations, disclosing it is in the public interest, and failing to do work in this area puts us at a disadvantage compared to other parties who might be far less scrupulous. (e.g. If WMF's search performs poorly, you might feel compelled to use Search Engine X, which happens to secretly sell your data to the highest bidder.)

Is there some sufficiently high number which *no one* paying attention here has a concern about? We could simply start with that.... and possibly lower the threshold over time as the lowest hanging fruit are solved, tracking our disclosure comfort. I think we all have an interest and obligation to take every reasonable means, but no one can ask for more than that.

Would anyone feel more comfortable if this ignored queries made via the secure server? Non-HTTPS traffic can be watched by anyone on the path between you and Wikimedia... any illusion of absolute privacy on the insecure traffic is patently false already.
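The idea raised earlier in this message, running the disclosure check against aggressively normalized data while releasing the raw strings, and the phone-number pitfall that comes with it, can be made concrete with a small sketch. The names `check_key` and `disclosable` and the `strip_digits` switch are hypothetical, introduced only to contrast the two behaviours.

```python
import re
from collections import defaultdict


def check_key(raw: str, strip_digits: bool = False) -> str:
    """Aggressive key used only for the threshold check, never published."""
    key = raw.casefold()
    if strip_digits:
        key = re.sub(r"\d", "", key)  # the risky variant discussed above
    # Drop whitespace, punctuation and underscores.
    return re.sub(r"[\W_]", "", key)


def disclosable(log, min_ips=5, strip_digits=False):
    """Return the raw strings whose normalized key clears the IP threshold.

    `log` is an iterable of (ip, raw_query) pairs.
    """
    ips = defaultdict(set)
    raws = defaultdict(set)
    for ip, raw in log:
        key = check_key(raw, strip_digits)
        ips[key].add(ip)
        raws[key].add(raw)
    out = set()
    for key, seen in ips.items():
        if len(seen) >= min_ips:
            out |= raws[key]  # every raw variant under this key is released
    return out
```

With `strip_digits=True`, a single search for "Greg Maxwell (555)-555-1212" shares a key with the popular "greg maxwell" and is released along with it; keeping digits in the key prevents that particular leak, though not necessarily others.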

Gregory Maxwell

4 p.m.

On Thu, Jan 14, 2010 at 10:47 AM, Magnus Manske <magnusmanske(a)googlemail.com> wrote:

...

Suggestion : * log search and SHA1 IP hash (anonymous!)

*Any* mapping of the IP is not anonymous. Please see the AOL search results where unique IDs were connected between searches to disclose information. (Moreover, a straight simple hash of an IP can be reversed simply by making a table of all expected IPs.)

However: Since this is just for internal logging there is no need to hash the IP. Just log it directly, and thus avoid the risk that someone later will think the hash is something which can be disclosed.

...

* search queries are logged in a standardized fashion (for grouping), e.g. lowercase, single spaces, no leading/trailing spaces, special chars converted to spaces, etc.

Excellent.

...

* display searches per week (?) that have been searched for at least 10 times from at least 5 different IP hashes (to avoid people searching their own name 100 times...)

What I've suggested elsewhere was at least 4 different IPs; 5 sounds fine to me too. I don't know that the minimum of 10 queries matters once the 5 IP check is in place. Per week would be okay. No shorter though.

If someone gives me a log format, I'll gladly write a fast tool for producing this output. (I did something like that before where I gave Brion a tool to produce stats from access logs.) I think I have C code for a parser for Wikimedia's squid logs... so if it's just that, I already have a good chunk of it done.

wikitech-l@lists.wikimedia.org

19 comments

participants (11)

Apoc 2400
Aryeh Gregor
Bryan Tong Minh
Conrad Irwin
David Gerard
Gregory Maxwell
Magnus Manske
Nikola Smolenski
Platonides
Robert Stojnic
Tei