Another team at the foundation has published a list of code repositories
they manage and/or monitor, along with norms for reviewing code[1]. Should
Discovery create something similar?
[1]
https://www.mediawiki.org/wiki/Wikimedia_Language_engineering/Code_review_s…
Kevin Smith
Agile Coach, Wikimedia Foundation
I'm wondering if MS will pull citations from Wikipedia and/or make use of
Wikidata.
I'm also wondering if this will decrease Wikimedia site traffic in a
similar way to how search engine knowledge panels may have, particularly
in this case from users who have access to MS Office.
https://blogs.office.com/2016/07/26/the-evolution-of-office-apps-new-intell…
Pine
Forwarding
Pine
---------- Forwarded message ----------
From: "James Heilman" <jmh649(a)gmail.com>
Date: Jul 15, 2016 06:33
Subject: [Wikimedia-l] Improving search (sort of)
To: "Wikimedia Mailing List" <wikimedia-l(a)lists.wikimedia.org>
Cc:
A while ago I requested a list of the "most frequently searched-for terms
for which no Wikipedia articles are returned". This would allow the
community to then create redirects or new pages as appropriate and help
address the "zero results rate" of about 30%.
While we are still waiting for this data, I have recently come across a list
of the most frequently clicked-on redlinks on En WP, produced by Andrew West:
https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_redlinks
Many of these can be reasonably addressed with a redirect, as the issue is
often capitalization.
Does anyone know where things stand with respect to producing the list of
most searched-for terms that return nothing?
--
James Heilman
MD, CCFP-EM, Wikipedian
The Wikipedia Open Textbook of Medicine
www.opentextbookofmedicine.com
_______________________________________________
Wikimedia-l mailing list, guidelines at:
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
New messages to: Wikimedia-l(a)lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
<mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe>
That's probably a question that Discovery could answer. Forwarding to
that list...
On Fri, Jul 22, 2016 at 11:01 AM, Ad Strack van Schijndel
<ad.strackvanschijndel(a)gmail.com> wrote:
> A configuration setting would be nice, but I couldn't find one. Same with hooks.
>
> So any way would do I suppose.
>
> We created an extension with a new search page by making a subclass of SpecialSearch, changing some functions, and adding some hooks. That seems like a good idea, but when searching there is always a redirect to Special:Search and I can't figure out how to change that.
>
>
>> On 21 Jul 2016, at 16:29, John <phoenixoverride(a)gmail.com> wrote:
>>
>> like what?
>>
>> On Thu, Jul 21, 2016 at 10:27 AM, Ad Strack van Schijndel <
>> ad.strackvanschijndel(a)gmail.com> wrote:
>>
>>> Hi,
>>>
>>> It seems that a search always leads to Special:Search, no matter what.
>>> That has to do with the search parameter in the url.
>>> Is there an elegant way to make a search request go to another search page?
>>>
>>> Thanks!
>>> Ad
>>> _______________________________________________
>>> MediaWiki-l mailing list
>>> To unsubscribe, go to:
>>> https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
>>>
--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
We're happy to announce that after numerous tests and analyses[1] and a
fully operational demo[2], the Discovery Team is ready to release
TextCat[3] into production on wiki.
What is TextCat? It detects the language that a search query was written
in, which allows us to look for results on a different wiki. TextCat is a
language detection library based on n-grams[4]. During a search, TextCat
will only kick in when all three of the following things occur:
1. fewer than 3 results are returned from the query on the current wiki
2. language detection is successful (meaning that TextCat is reasonably
certain what language the query is in, and that it is different from the
language of the current wiki)
3. the other wiki (in the detected language) has results
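For the curious, the approach can be sketched roughly as follows. This is an
illustrative Python sketch of the classic Cavnar & Trenkle "out-of-place"
n-gram method plus the three-condition gate above, not TextCat's actual code;
all function names, the confidence margin, and the `search()` callback are
made up for the example:

```python
from collections import Counter

def ngram_profile(text, max_n=3, top_k=400):
    """Character n-grams (1..max_n) ranked by frequency, most common first."""
    counts = Counter()
    padded = " %s " % text.lower()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top_k)]

def out_of_place(query_profile, lang_profile):
    """Cavnar & Trenkle out-of-place measure: sum of rank differences,
    with a maximum penalty for n-grams absent from the language profile."""
    ranks = {g: r for r, g in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(r - ranks[g]) if g in ranks else penalty
               for r, g in enumerate(query_profile))

def detect_language(query, profiles, margin=1.05):
    """Best-matching language, or None when the runner-up is too close
    (a stand-in for TextCat's 'reasonably certain' condition)."""
    scored = sorted((out_of_place(ngram_profile(query), p), lang)
                    for lang, p in profiles.items())
    if len(scored) > 1 and scored[0][0] * margin >= scored[1][0]:
        return None
    return scored[0][1]

def search_with_fallback(query, current_wiki, profiles, search):
    """Apply the three conditions above; search(wiki, query) stands in for
    querying a wiki's index (wiki names double as language codes here)."""
    results = search(current_wiki, query)
    if len(results) >= 3:                       # 1. enough local results
        return results
    lang = detect_language(query, profiles)
    if lang is None or lang == current_wiki:    # 2. unsure, or same language
        return results
    other = search(lang, query)
    if not other:                               # 3. other wiki has nothing
        return results
    return results + other
```

The margin check is what keeps the fallback quiet when the query is too short
or too ambiguous to classify with any confidence.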
Our analysis of the A/B test[5] (for the English, French, Spanish, Italian
and German Wikipedias) showed that:
"...The test groups not only had a substantially lower zero results rate
(57% in control group vs 46% in the two test groups), but they had a higher
clickthrough rate (44% in the control group vs 49-50% in the two test
groups), indicating that we may be providing users with relevant results
that they would not have gotten otherwise."
This update is scheduled for production release during the week of
July 25, 2016 on the following Wikipedias:
- English [6]
- German [7]
- Spanish [8]
- Italian [9]
- French [10]
TextCat will then be added to this next group of Wikipedias at a later
date:
- Portuguese[11]
- Russian[12]
- Japanese[13]
This is a huge step forward in creating a search mechanism that is able to
detect - with a high level of accuracy - the language that was used and
produce results in that language. Another forward-looking aspect of this
work is the investigation of a confidence-measuring algorithm[14], to ensure
that the language detection results are the best they can be.
We will also be doing more[15] A/B tests using TextCat on non-Wikipedia
sites, such as Wikibooks and Wikivoyage. These new tests will give us
insight into whether applying the same language detection configuration
across projects would be helpful.
Please let us know if you have any questions or concerns, on the TextCat
discussion page[16]. Also, for screenshots of what this update will look
like, please see this one[17], showing an existing search typed in on enwiki
in Russian ("первым экспериментом"), and this one[18], showing what it will
look like once TextCat is in production on enwiki.
Thanks!
[1] https://phabricator.wikimedia.org/T118278
[2] https://tools.wmflabs.org/textcatdemo/
[3] https://www.mediawiki.org/wiki/TextCat
[4] https://en.wikipedia.org/wiki/N-gram
[5]
https://commons.wikimedia.org/wiki/File:Report_on_Cirrus_Search_TextCat_AB_…
[6] https://en.wikipedia.org/
[7] https://de.wikipedia.org/
[8] https://es.wikipedia.org/
[9] https://it.wikipedia.org/
[10] https://fr.wikipedia.org/
[11] https://pt.wikipedia.org/
[12] https://ru.wikipedia.org/
[13] https://ja.wikipedia.org/
[14] https://phabricator.wikimedia.org/T140289
[15] https://phabricator.wikimedia.org/T140292
[16] https://www.mediawiki.org/wiki/Talk:TextCat
[17] https://commons.wikimedia.org/wiki/File:Existing-search_no-textcat.png
[18] https://commons.wikimedia.org/wiki/File:New-search_with-textcat.png
--
Deb Tankersley
Product Manager, Discovery
IRC: debt
Wikimedia Foundation
Hey James,
When we first started looking at zero results rate (ZRR), it was an easy
metric to calculate, and it was surprisingly high. We still look at ZRR
<https://searchdata.wmflabs.org/metrics/#failure_rate> because it is so
easy to measure, and anything that improves it is probably a net positive
(note the big dip when the new completion suggester was deployed!!), but we
have more complex metrics that we prefer. There's user engagement
<https://searchdata.wmflabs.org/metrics/#kpi_augmented_clickthroughs>/augmented
clickthroughs, which combines clicks, dwell time, and other user
activity. We also use historical click data in a metric that improves when
we move clicked-on results higher in the results list, which we use with
the Relevance Forge
<https://gerrit.wikimedia.org/r/#/admin/projects/wikimedia/discovery/relevan…>
.
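As a toy illustration of these two kinds of metrics (the log format, field
names, and the 10-second dwell threshold below are invented for the example,
not Discovery's actual definitions):

```python
def zero_results_rate(log):
    """Share of searches that returned nothing (ZRR)."""
    return sum(1 for e in log if e["num_results"] == 0) / len(log)

def augmented_clickthrough(log, min_dwell=10.0):
    """Share of searches where some clicked result was read for at least
    min_dwell seconds (threshold is illustrative only)."""
    return sum(1 for e in log
               if any(d >= min_dwell for d in e["dwell_times"])) / len(log)

log = [
    {"num_results": 0, "dwell_times": []},      # zero-results search
    {"num_results": 7, "dwell_times": [42.0]},  # engaged click
    {"num_results": 3, "dwell_times": [2.5]},   # quick bounce back
    {"num_results": 5, "dwell_times": []},      # results, but no click
]
```

ZRR only sees the first entry; the augmented metric also distinguishes the
engaged click from the bounce, which is why it is the preferred measure.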
And I didn't mean to give the impression that *most* zero-results queries
are gibberish, though many, many are. And that was something we didn't
really know a year ago. There are also non-gibberish queries that correctly
get zero results, like most DOI
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Resul…>
and many media player
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Resul…>
queries. We also see a lot of non-notable (not-yet-notable?) public figures
(local bands, online artists, YouTube musicians), and sometimes just random
names.
The discussion in response to Dan's original comment in Phab mentions some
approaches to reduce the risk of automatically releasing private info, but
I still take an absolute stand against unreviewed release. If I can get a
few hundred people to click on a link like this
<https://en.wikipedia.org/w/index.php?title=Special:Search&profile=default&f…"James+is+a+nice+guy">,
I can get any message I want on that list. (Curious? Did you click?) The
message could be less anonymous and much more obnoxious, obviously.
A 50-character limit won't stop emails and phone numbers from making the
list (which invites spam and cranks). Those can be filtered, but not
perfectly.
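To illustrate why such filtering is imperfect, a crude pattern-based filter
might look like this (the patterns are purely illustrative, not anything
Discovery actually runs):

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
# Seven or more digits, allowing common phone separators.
PHONE = re.compile(r"\+?(?:[\s().-]*\d){7,15}")

def looks_sensitive(query):
    """Crude check for emails/phone numbers in a query. Misses obfuscated
    forms ('bob at example dot com') and flags some innocent digit runs."""
    return bool(EMAIL.search(query) or PHONE.search(query))
```

Anything the patterns miss still reaches the published list, and anything
they over-match gets dropped silently, which is exactly the trade-off the
paragraph above describes.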
I've only looked at these top lists by day in the past, but on that time
scale the top results are usually under 1000 count (and that includes IP
duplicates), so the list of queries with 100 IPs might also be very small.
As I said, I'm happy to do the data slogging to try this in a better
fashion if this task is prioritized, and I'd be happy to be wrong about the
quality of the results, but I'm still not hopeful.
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
On Fri, Jul 15, 2016 at 11:44 AM, James Heilman <jmh649(a)gmail.com> wrote:
> Hey Trey
>
> Thanks for the in-depth discussion. So if the terms people are using that
> result in "zero search results" are typically gibberish why do we care if
> 30% of our searches result in "zero search results"? A big deal was made
> about this a while ago.
>
> If one was just to look at those search terms that more than 100 IPs
> searched for, would that not remove the concerns about anonymity? One could
> also limit the length of the searches displayed to 50 characters. And just
> provide the first 100 with an initial human review to make sure we are not
> missing anything.
>
> James
>
> On Fri, Jul 15, 2016 at 9:31 AM, Trey Jones <tjones(a)wikimedia.org> wrote:
>
>> Pine, thanks for the forward. Regulars on the Discovery list may know me,
>> but James probably does not. I've manually reviewed tens of thousands of
>> generally poorly performing queries (fewer than 3 results) and skimmed
>> hundreds of thousands more from many of the top 20 Wikipedias—and to a
>> lesser extent other projects—over the year I've been at the WMF and in
>> Discovery. You can see my list of write ups here
>> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes>.
>>
>> So I want to say that this is an awesome idea—which is why many people
>> have thought of it. It was apparently one of the first ideas the Discovery
>> department had when they formed (see Dan's notes linked below). It was also
>> one of the first ideas I had when I joined Discovery a few months later.
>>
>> Dan Garry's notes on T8373
>> <https://phabricator.wikimedia.org/T8373#1856036> and the following
>> discussion pretty much quash the idea of automated extraction and
>> publication from a privacy perspective. People not only divulge their own
>> personal information, they also divulge other people's personal
>> information. One example: some guy outside the U.S. was methodically
>> searching long lists of real addresses in Las Vegas. I will second Dan's
>> comments in the T8373 discussion; all kinds of personal data end up in
>> search queries. A dump of search queries *was* provided in September 2012
>> <https://blog.wikimedia.org/2012/09/19/what-are-readers-looking-for-wikipedi…>,
>> but had to be withdrawn over privacy concerns.
>>
>> Another concern for auto-published data: never underestimate the power of
>> random groups of bored people on the internet. 4chan decided to arrange
>> Time Magazine poll results
>> <https://techcrunch.com/2009/04/27/time-magazine-throws-up-its-hands-as-it-g…> so
>> the first letters spelled out a weird message. It would be easy for 4chan,
>> Reddit, and other communities to get any message they want on that list if
>> they happened to notice that it existed. See also Boaty McBoatface
>> <https://en.wikipedia.org/wiki/RRS_Sir_David_Attenborough#Name> and Mountain
>> Dew "Diabeetus"
>> <https://storify.com/cbccommunity/hitler-did-nothing-wrong-wins-crowdsourced…>
>> (which is not at all the worst thing on *that* list). We don't want to
>> have to try to defend against that.
>>
>> In my experience, the quality of what's actually there isn't that great.
>> One of my first tasks when I joined Discovery was to look at daily lists of
>> top 100 zero-results queries that had been gathered automatically. I was
>> excited by this same idea. The top 100 zero-results query list was a
>> wasteland. (Minimal notes on some of what I found are here
>> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Resul…>.)
>> We could make it better by focusing on human-ish searchers, using basic
>> bot-exclusion techniques
>> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Optimization…>,
>> ignoring duplicates from the same IP, and such, but I don't think it would
>> help. And while Wikipedia is not for children, there could be an annoying
>> amount of explicit adult material on the list, too. We would probably find
>> some interesting spellings of Facebook and WhatsApp, though.
>>
>> If we're really excited about this, I could imagine using better
>> techniques to pull zero-results queries and see if anything good is in
>> there, but we'd have to commit to some sort of review before we publish it.
>> For example, Discernatron <https://discernatron.wmflabs.org/> data,
>> after consulting with legal, is reviewed independently by two people, who
>> then have to reconcile any discrepancies, before being made public. So I
>> think we'd need an ongoing commitment to have at least two people under NDA
>> who would review any list before publication. Reviewing 500-600 queries
>> takes a couple of hours per person (we’ve done that for the Discernatron),
>> so the top 100 would probably take less than an hour. I'd even be willing to help with
>> the review (as I am for Discernatron) if we found there was something
>> useful in there—but I'm not terribly hopeful. We'd also need more people to
>> efficiently and effectively review queries for other languages if we wanted
>> to extend this beyond English Wikipedia.
>>
>> Finally, if this is important enough and the task gets prioritized, I'd
>> be willing to dive back in and go through the process once and pull out the
>> top zero-results queries, this time with basic bot exclusion and IP
>> deduplication—which we didn't do early on because we didn't realize what a
>> mess the data was. We could process a week or a month of data and
>> categorize the top 100 to 500 results in terms of personal info, junk,
>> porn, and whatever other categories we want or that bubble up from the
>> data, and perhaps publish the non-personal-info part of the list as an
>> example, either to persuade ourselves that this is worth pursuing, or as a
>> clearer counter to future calls to do so.
>> —Trey
>>
>> Trey Jones
>> Software Engineer, Discovery
>> Wikimedia Foundation
>>
>> On Fri, Jul 15, 2016 at 10:09 AM, Pine W <wiki.pine(a)gmail.com> wrote:
>>
>>> Forwarding
>>>
>>> Pine
>>> ---------- Forwarded message ----------
>>> From: "James Heilman" <jmh649(a)gmail.com>
>>> Date: Jul 15, 2016 06:33
>>> Subject: [Wikimedia-l] Improving search (sort of)
>>> To: "Wikimedia Mailing List" <wikimedia-l(a)lists.wikimedia.org>
>>> Cc:
>>>
>>> A while ago I requested a list of the "most frequently searched-for terms
>>> for which no Wikipedia articles are returned". This would allow the
>>> community to then create redirects or new pages as appropriate and help
>>> address the "zero results rate" of about 30%.
>>>
>>> While we are still waiting for this data, I have recently come across a
>>> list of the most frequently clicked-on redlinks on En WP, produced by
>>> Andrew West:
>>> https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_redlinks
>>> Many of these can be reasonably addressed with a redirect, as the issue
>>> is often capitalization.
>>>
>>> Does anyone know where things stand with respect to producing the list of
>>> most searched-for terms that return nothing?
>>>
>>> --
>>> James Heilman
>>> MD, CCFP-EM, Wikipedian
>>>
>>> The Wikipedia Open Textbook of Medicine
>>> www.opentextbookofmedicine.com
>>>
>>> _______________________________________________
>>> discovery mailing list
>>> discovery(a)lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/discovery
>>>
>>>
>>
>
>
> --
> James Heilman
> MD, CCFP-EM, Wikipedian
>
> The Wikipedia Open Textbook of Medicine
> www.opentextbookofmedicine.com
>
On 15 July 2016 at 08:44, James Heilman <jmh649(a)gmail.com> wrote:
>
> Thanks for the in-depth discussion. So if the terms people are using that
> result in "zero search results" are typically gibberish why do we care if
> 30% of our searches result in "zero search results"? A big deal was made
> about this a while ago.
>
Good question! I used to say that it was my aspiration that
users should never get zero results when searching Wikipedia. As a result
of Trey's analysis, I don't say that any more. ;-) There are many
legitimate cases where users should get zero results. However, there are
still tons of examples of where giving users zero results is incorrect;
"jurrasic world" was a prominent example of that.
It's still not quite right to say that *all* the terms that people use to
get zero results are gibberish. There is an extremely long tail
<https://en.wikipedia.org/wiki/Long_tail> of zero-results queries that
aren't gibberish; it's just that the top 100 are dominated by gibberish.
This would mean we'd have to release many, many more than the top 100,
which significantly increases the risk of releasing personal information.
> If one was just to look at those search terms that more than 100 IPs
> searched for would that not remove the concerns about anonymity? One could
> also limit the length of the searches displaced to 50 characters. And just
> provide the first 100 with an initial human review to make sure we are not
> miss anything.
>
The problem with this is that there are still no guarantees. What if you
saw the search query "DF198671E"? You might not think anything of it, but I
would recognise it as an example of a national insurance number
<https://en.wikipedia.org/wiki/National_Insurance_number>, the British
equivalent of a social security number [1]. There's always going to be the
potential that we accidentally release something sensitive when we release
arbitrary user input, even if it's manually examined by humans.
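To make that concrete, a shape check for NI-number-like strings is easy to
write, but shape is not validity. This is a hypothetical sketch; the real
validity rules are stricter than the pattern below:

```python
import re

# Shape only: two letters, six digits, one trailing letter. Real NI
# validity rules are stricter (some prefix letters are never issued and
# only a few suffix letters are valid), so this over-matches by design.
NINO_SHAPE = re.compile(r"\b[A-Z]{2}\d{6}[A-Z]\b")

def resembles_nino(query):
    """True if the query contains something shaped like an NI number."""
    return bool(NINO_SHAPE.search(query.upper()))
```

A human reviewer without such a checker could easily wave a string like this
through, and a checker alone can't tell a real number from an invalid one,
which is the point: no review process makes releasing arbitrary user input
risk-free.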
So, in summary:
- The top 100 zero-results queries are dominated by gibberish.
- There's a long tail of zero-results queries, meaning we'd have to
release many more than the top 100.
- Manually examining the top zero-results queries is not a foolproof way
of eliminating personal data, since it's arbitrary user input.
I'm happy to answer any questions. :-)
Thanks,
Dan
[1]: Don't panic, this example national insurance number is actually
invalid. ;-)
--
Dan Garry
Lead Product Manager, Discovery
Wikimedia Foundation