I don't know what the 14###'s are. I googled them, thinking they were IDs of some sort, but found nothing.

For those and the buildup queries, and others, I'd love to get a source (referrer for web, or app for API) and tell them to do something different.

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation

On Fri, Jul 24, 2015 at 5:20 PM, Oliver Keyes <okeyes@wikimedia.org> wrote:

(Don't worry, Erik and I have been brainstorming on ways to get more information in)

This is awesome work! Deleting underscores and using language detection sound like great approaches to take :). What's 14####, etc?

One of the worries I have here is the fact that the cirrus logs very deliberately don't (yet) give us enough information to identify readers. Are those buildup queries or the most common queries automata? We simply don't know :/.

On 24 July 2015 at 17:16, Trey Jones <tjones@wikimedia.org> wrote:

Hey everyone,

I got access to some logs and I've been slogging through the data. In particular, I've partially analyzed a sample of 100K zero-result full_text searches against enwiki, over the course of about an hour (2015-07-23 07:51:29 to 2015-07-23 08:55:42). My results and opinions are below.

TL;DR Summary: If these patterns hold for another sample (and across languages), we should be able to get some decent mileage out of these simple approaches:
- find sources of weird patterns and either ignore them, or contact the source and redirect them to a more appropriate destination
- use language or character set detection to redirect queries to other wikis
- filter the term "quot" from queries
- filter 14###########: from the front of queries
- replace _ with space in queries

All of this is somewhat rough, and exact numbers aren't guaranteed. Also the categories may overlap. I also intend to look for these same patterns from another sample from a different day and make sure they are more general and not just temporary idiosyncrasies.
I also plan to look through other language wikis (i.e., Spanish and French to start) to see if there are cross-linguistic patterns like these.

I think we have to somehow come to terms with the fact that some queries don't deserve results, and maybe figure out the source of such "illegitimate" queries and filter them. (I'd really like to be able to track down the referrer, if there is one, for a lot of the weirder queries.)

Top query:
- 248 Dounload feer game
- all via web... and Google can't find it. That's just weird.

Some other categories of queries are below. The numbers are "<total queries> / <unique queries>". Since this is a 100K sample of zero-result queries, and zero-result queries are about 25% of all queries, each 1,000 of total queries here represents about 0.25% of all search queries.

253 / 171 string of numbers

3610 / 2505 no Latin letters
- I see Korean, Thai, Japanese, Cyrillic, doi #s (see below), Arabic, Hebrew, Greek, Armenian, Georgian, Devanagari, Burmese, Chinese, and some emoji (e.g., 11 searches for 😜💗🎨❤️💋😞☀️💦).
- I also saw instances of mixed Latin / non-Latin queries
- Includes gibberish, which is hard to grep for, but easy to spot by eye
- Lots of the non-gibberish ones are clearly in other languages, and I saw queries in other Latin-alphabet languages go by, too.

2630 / 2627 DOIs, all in quotes

3015 / 1017 have quot in them (which gets auto-corrected to "quote", obviously)
- 327 are one word: quot ... quot
- I don't know where these are coming from, but they are weird. If we strip "quot" we would get many of these. This must be coming from some source that is adding quotes, then escaping them as "&quot;" and then stripping & and ;. Weird.

7155 / 6337 #:Name
- almost all are 14###########:Text
- e.g., 1436755654740:Sherlock Holmes
- These all look like Wikipedia titles!
- Two each of 0:... and 6000:...

114 / 85 actual http(s):// URLs

488 / 244 URL-like things starting with www... and ending with .com, .ru, etc.

211 / 132 other searches starting with “www.”

1085 / 1083 article searches in this format: ('"<TITLE>"', '<AUTHOR(S)>')

2457 / 2060 TV episodes (based on the presence of "S#E#", that is, season #, episode #)

8419 / 7523 AND boolean searches

703 / 701 OR boolean searches
- Many of these look auto-generated, esp in the aggregate.
- For example: there are 498 / 249 "House_of_Gurieli" AND ... queries

6310 / 5742 queries with _ in them
- only 934 / 790 if we skip the 14###########:Text and boolean AND queries

Other things I noticed:
- lots of queries for books, articles, movies, tv, mp3s, and porn (in multiple languages)
- lots of "building up" searches (and these are all marked full_text), for example:
achevm
achevme
achevmen
achevment
achevments
achevments o
achevments of
achevments of
achevments of h
achevments of he
achevments of hell
achevments of helle
achevments of hellen
achevments of hellen k
achevments of hellen k
achevments of hellen kell
achevments of hellen kelle
achevments of hellen keller
- reasonable-looking ~ queries don't work:
intitle:George~ intitle:Washin~ gives 0 results
intitle:Washington intitle:George gives 279 results

Finally, I did see a bunch of typos, but I didn't try to quantify them because I was digging into all of these other interesting patterns.

Have a good weekend.

-Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
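
For anyone who wants to try the cleanup ideas from the TL;DR on their own sample, here is a rough Python sketch, not anything that exists in the search code. The function names are made up for illustration. The prefix pattern assumes "14" plus 11 more digits and a colon (as in 1436755654740:Sherlock Holmes), "quot" is only dropped when it is a standalone word, and "Latin letters" is approximated as ASCII a-z only. The last function just restates the 100K-sample / 25% arithmetic.

```python
import re

# Assumption: "14" followed by 11 digits and a colon at the very start.
PREFIX_RE = re.compile(r"^14\d{11}:")
# Assumption: only a standalone "quot" is noise; "quote" is left alone.
QUOT_RE = re.compile(r"\bquot\b")
# Assumption: ASCII letters stand in for "Latin letters" (no accents).
LATIN_RE = re.compile(r"[A-Za-z]")


def normalize_query(query):
    """Apply the three simple filters: prefix, underscores, 'quot'."""
    query = PREFIX_RE.sub("", query)
    # Underscores go first so that "quot_foo" exposes "quot" as a word.
    query = query.replace("_", " ")
    query = QUOT_RE.sub(" ", query)
    # Collapse the whitespace left behind by the removals.
    return " ".join(query.split())


def has_latin_letters(query):
    """False for queries that might be redirected to another wiki."""
    return bool(LATIN_RE.search(query))


def share_of_all_queries(total, sample_size=100_000, zero_result_rate=0.25):
    """Scale a count from the zero-result sample to all search queries."""
    return total / sample_size * zero_result_rate


if __name__ == "__main__":
    for q in ["1436755654740:Sherlock Holmes",
              "quot Sherlock Holmes quot",
              "House_of_Gurieli"]:
        print(repr(q), "->", repr(normalize_query(q)))
    print(share_of_all_queries(1000))
```

Note that has_latin_letters also returns False for pure strings of numbers, so those would need to be separated out before any language or character set detection.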
Wikimedia-search mailing list
Wikimedia-search@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
--Oliver Keyes
Research Analyst
Wikimedia Foundation