Another team at the foundation has published a list of code repositories
they manage and/or monitor, along with norms for reviewing code[1]. Should
Discovery create something similar?
[1]
https://www.mediawiki.org/wiki/Wikimedia_Language_engineering/Code_review_s…
Kevin Smith
Agile Coach, Wikimedia Foundation
I'm wondering if MS will pull citations from Wikipedia and/or make use of
Wikidata.
I'm also wondering if this will decrease Wikimedia site traffic in a
similar way to how search engine knowledge panels may have, particularly
in this case from users who have access to MS Office.
https://blogs.office.com/2016/07/26/the-evolution-of-office-apps-new-intell…
Pine
Forwarding
Pine
---------- Forwarded message ----------
From: "James Heilman" <jmh649(a)gmail.com>
Date: Jul 15, 2016 06:33
Subject: [Wikimedia-l] Improving search (sort of)
To: "Wikimedia Mailing List" <wikimedia-l(a)lists.wikimedia.org>
Cc:
A while ago I requested a list of the "most frequently searched-for terms
for which no Wikipedia articles are returned". This would allow the
community to then create redirects or new pages as appropriate and help
address the "zero results rate" of about 30%.
While we are still waiting for this data, I have recently come across a list
of the most frequently clicked-on redlinks on En WP, produced by Andrew West:
https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_redlinks
Many of these can be reasonably addressed with a redirect, as the issue is
often capitalization.
Does anyone know where things stand with respect to producing the list of
most searched-for terms that return nothing?
--
James Heilman
MD, CCFP-EM, Wikipedian
The Wikipedia Open Textbook of Medicine
www.opentextbookofmedicine.com
_______________________________________________
Wikimedia-l mailing list, guidelines at:
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
New messages to: Wikimedia-l(a)lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
<mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe>
That's probably a question that Discovery could answer. Forwarding to
that list...
On Fri, Jul 22, 2016 at 11:01 AM, Ad Strack van Schijndel
<ad.strackvanschijndel(a)gmail.com> wrote:
> A configuration setting would be nice, but I couldn't find one. Same with hooks.
>
> So any way would do I suppose.
>
> We created an extension with a new search page by making a subclass of SpecialSearch, changing some functions, and adding some hooks. That seems like a good idea, but when searching there is always a redirect to Special:Search and I can't figure out how to change that.
>
>
>> On 21 Jul 2016, at 16:29, John <phoenixoverride(a)gmail.com> wrote:
>>
>> like what?
>>
>> On Thu, Jul 21, 2016 at 10:27 AM, Ad Strack van Schijndel <
>> ad.strackvanschijndel(a)gmail.com> wrote:
>>
>>> Hi,
>>>
>>> It seems that a search always leads to Special:Search, no matter what.
>>> That has to do with the search parameter in the url.
>>> Is there an elegant way to make a search request go to another search page?
>>>
>>> Thanks!
>>> Ad
>>> _______________________________________________
>>> MediaWiki-l mailing list
>>> To unsubscribe, go to:
>>> https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
>>>
--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
We're happy to announce that after numerous tests and analyses[1] and a
fully operational demo[2], the Discovery Team is ready to release
TextCat[3] into production on wiki.
What is TextCat? It detects the language that a search query was written
in, which allows us to look for results on a different wiki. TextCat is a
language detection library based on n-grams[4]. During a search, TextCat
will only kick in when all three of the following things occur:
1. fewer than 3 results are returned from the query on the current wiki
2. language detection is successful (meaning that TextCat is reasonably
certain what language the query is in, and that it is different from the
language of the current wiki)
3. the other wiki (in the detected language) has results
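For the curious, the approach can be sketched roughly as follows. This is an
illustrative Python sketch of the classic Cavnar & Trenkle "out-of-place"
n-gram method plus the three-condition gate above, not TextCat's actual code;
all function names, the confidence margin, and the `search()` callback are
made up for the example:

```python
from collections import Counter

def ngram_profile(text, max_n=3, top_k=400):
    """Character n-grams (1..max_n) ranked by frequency, most common first."""
    counts = Counter()
    padded = " %s " % text.lower()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top_k)]

def out_of_place(query_profile, lang_profile):
    """Cavnar & Trenkle out-of-place measure: sum of rank differences,
    with a maximum penalty for n-grams absent from the language profile."""
    ranks = {g: r for r, g in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(r - ranks[g]) if g in ranks else penalty
               for r, g in enumerate(query_profile))

def detect_language(query, profiles, margin=1.05):
    """Best-matching language, or None when the runner-up is too close
    (a stand-in for TextCat's 'reasonably certain' condition)."""
    scored = sorted((out_of_place(ngram_profile(query), p), lang)
                    for lang, p in profiles.items())
    if len(scored) > 1 and scored[0][0] * margin >= scored[1][0]:
        return None
    return scored[0][1]

def search_with_fallback(query, current_wiki, profiles, search):
    """Apply the three conditions above; search(wiki, query) stands in for
    querying a wiki's index (wiki names double as language codes here)."""
    results = search(current_wiki, query)
    if len(results) >= 3:                       # 1. enough local results
        return results
    lang = detect_language(query, profiles)
    if lang is None or lang == current_wiki:    # 2. unsure, or same language
        return results
    other = search(lang, query)
    if not other:                               # 3. other wiki has nothing
        return results
    return results + other
```

The margin check is what keeps the fallback quiet when the query is too short
or too ambiguous to classify with any confidence.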
Our analysis of the A/B test[5] (for the English, French, Spanish, Italian
and German Wikipedias) showed that:
"...The test groups not only had a substantially lower zero results rate
(57% in control group vs 46% in the two test groups), but they had a higher
clickthrough rate (44% in the control group vs 49-50% in the two test
groups), indicating that we may be providing users with relevant results
that they would not have gotten otherwise."
This update is scheduled for production release during the week of
July 25, 2016 on the following Wikipedias:
- English [6]
- German [7]
- Spanish [8]
- Italian [9]
- French [10]
TextCat will then be added to this next group of Wikipedias at a later
date:
- Portuguese[11]
- Russian[12]
- Japanese[13]
This is a huge step forward in creating a search mechanism that is able to
detect - with a high level of accuracy - the language that was used and
produce results in that language. Another forward-looking aspect of this
work is the investigation of a confidence-measuring algorithm[14], to ensure
that the language detection results are the best they can be.
We will also be doing more[15] A/B tests using TextCat on non-Wikipedia
sites, such as Wikibooks and Wikivoyage. These new tests will give us
insight into whether applying the same language detection configuration
across projects would be helpful.
Please let us know if you have any questions or concerns, on the TextCat
discussion page[16]. Also, for screenshots of what this update will look
like, please see this one[17], showing an existing search typed in on enwiki
in Russian ("первым экспериментом"), and this one[18], showing what it will
look like once TextCat is in production on enwiki.
Thanks!
[1] https://phabricator.wikimedia.org/T118278
[2] https://tools.wmflabs.org/textcatdemo/
[3] https://www.mediawiki.org/wiki/TextCat
[4] https://en.wikipedia.org/wiki/N-gram
[5]
https://commons.wikimedia.org/wiki/File:Report_on_Cirrus_Search_TextCat_AB_…
[6] https://en.wikipedia.org/
[7] https://de.wikipedia.org/
[8] https://es.wikipedia.org/
[9] https://it.wikipedia.org/
[10] https://fr.wikipedia.org/
[11] https://pt.wikipedia.org/
[12] https://ru.wikipedia.org/
[13] https://ja.wikipedia.org/
[14] https://phabricator.wikimedia.org/T140289
[15] https://phabricator.wikimedia.org/T140292
[16] https://www.mediawiki.org/wiki/Talk:TextCat
[17] https://commons.wikimedia.org/wiki/File:Existing-search_no-textcat.png
[18] https://commons.wikimedia.org/wiki/File:New-search_with-textcat.png
--
Deb Tankersley
Product Manager, Discovery
IRC: debt
Wikimedia Foundation
Hey James,
When we first started looking at zero results rate (ZRR), it was an easy
metric to calculate, and it was surprisingly high. We still look at ZRR
<https://searchdata.wmflabs.org/metrics/#failure_rate> because it is so
easy to measure, and anything that improves it is probably a net positive
(note the big dip when the new completion suggester was deployed!!), but we
have more complex metrics that we prefer. There's user engagement
<https://searchdata.wmflabs.org/metrics/#kpi_augmented_clickthroughs>/augmented
clickthroughs, which combines clicks, dwell time, and other user
activity. We also use historical click data in a metric that improves when
we move clicked-on results higher in the results list, which we use with
the Relevance Forge
<https://gerrit.wikimedia.org/r/#/admin/projects/wikimedia/discovery/relevan…>
.
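As a toy illustration of these two kinds of metrics (the log format, field
names, and the 10-second dwell threshold below are invented for the example,
not Discovery's actual definitions):

```python
def zero_results_rate(log):
    """Share of searches that returned nothing (ZRR)."""
    return sum(1 for e in log if e["num_results"] == 0) / len(log)

def augmented_clickthrough(log, min_dwell=10.0):
    """Share of searches where some clicked result was read for at least
    min_dwell seconds (threshold is illustrative only)."""
    return sum(1 for e in log
               if any(d >= min_dwell for d in e["dwell_times"])) / len(log)

log = [
    {"num_results": 0, "dwell_times": []},      # zero-results search
    {"num_results": 7, "dwell_times": [42.0]},  # engaged click
    {"num_results": 3, "dwell_times": [2.5]},   # quick bounce back
    {"num_results": 5, "dwell_times": []},      # results, but no click
]
```

ZRR only sees the first entry; the augmented metric also distinguishes the
engaged click from the bounce, which is why it is the preferred measure.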
And I didn't mean to give the impression that *most* zero-results queries
are gibberish, though many, many are. And that was something we didn't
really know a year ago. There are also non-gibberish queries that correctly
get zero results, like most DOI
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Resul…>
and many media player
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Resul…>
queries. We also see a lot of non-notable (not-yet-notable?) public figures
(local bands, online artists, YouTube musicians), and sometimes just random
names.
The discussion in response to Dan's original comment in Phab mentions some
approaches to reduce the risk of automatically releasing private info, but
I still take an absolute stand against unreviewed release. If I can get a
few hundred people to click on a link like this
<https://en.wikipedia.org/w/index.php?title=Special:Search&profile=default&f…"James+is+a+nice+guy">,
I can get any message I want on that list. (Curious? Did you click?) The
message could be less anonymous and much more obnoxious, obviously.
A 50-character limit won't stop emails and phone numbers from making the
list (which invites spam and cranks). Those can be filtered, but not
perfectly.
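To illustrate why such filtering is imperfect, a crude pattern-based filter
might look like this (the patterns are purely illustrative, not anything
Discovery actually runs):

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
# Seven or more digits, allowing common phone separators.
PHONE = re.compile(r"\+?(?:[\s().-]*\d){7,15}")

def looks_sensitive(query):
    """Crude check for emails/phone numbers in a query. Misses obfuscated
    forms ('bob at example dot com') and flags some innocent digit runs."""
    return bool(EMAIL.search(query) or PHONE.search(query))
```

Anything the patterns miss still reaches the published list, and anything
they over-match gets dropped silently, which is exactly the trade-off the
paragraph above describes.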
I've only looked at these top lists by day in the past, but on that time
scale the top results are usually under 1000 count (and that includes IP
duplicates), so the list of queries with 100 IPs might also be very small.
As I said, I'm happy to do the data slogging to try this in a better
fashion if this task is prioritized, and I'd be happy to be wrong about the
quality of the results, but I'm still not hopeful.
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
On Fri, Jul 15, 2016 at 11:44 AM, James Heilman <jmh649(a)gmail.com> wrote:
> Hey Trey
>
> Thanks for the in-depth discussion. So if the terms people are using that
> result in "zero search results" are typically gibberish why do we care if
> 30% of our searches result in "zero search results"? A big deal was made
> about this a while ago.
>
> If one was just to look at those search terms that more than 100 IPs
> searched for, would that not remove the concerns about anonymity? One could
> also limit the length of the searches displayed to 50 characters. And just
> provide the first 100 with an initial human review to make sure we are not
> missing anything.
>
> James
>
> On Fri, Jul 15, 2016 at 9:31 AM, Trey Jones <tjones(a)wikimedia.org> wrote:
>
>> Pine, thanks for the forward. Regulars on the Discovery list may know me,
>> but James probably does not. I've manually reviewed tens of thousands of
>> generally poorly performing queries (fewer than 3 results) and skimmed
>> hundreds of thousands more from many of the top 20 Wikipedias—and to a
>> lesser extent other projects—over the year I've been at the WMF and in
>> Discovery. You can see my list of write ups here
>> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes>.
>>
>> So I want to say that this is an awesome idea—which is why many people
>> have thought of it. It was apparently one of the first ideas the Discovery
>> department had when they formed (see Dan's notes linked below). It was also
>> one of the first ideas I had when I joined Discovery a few months later.
>>
>> Dan Garry's notes on T8373
>> <https://phabricator.wikimedia.org/T8373#1856036> and the following
>> discussion pretty much quash the idea of automated extraction and
>> publication from a privacy perspective. People not only divulge their own
>> personal information, they also divulge other people's personal
>> information. One example: some guy outside the U.S. was methodically
>> searching long lists of real addresses in Las Vegas. I will second Dan's
>> comments in the T8373 discussion; all kinds of personal data end up in
>> search queries. A dump of search queries *was* provided in September 2012
>> <https://blog.wikimedia.org/2012/09/19/what-are-readers-looking-for-wikipedi…>,
>> but had to be withdrawn over privacy concerns.
>>
>> Another concern for auto-published data: never underestimate the power of
>> random groups of bored people on the internet. 4chan decided to arrange
>> Time Magazine poll results
>> <https://techcrunch.com/2009/04/27/time-magazine-throws-up-its-hands-as-it-g…> so
>> the first letters spelled out a weird message. It would be easy for 4chan,
>> Reddit, and other communities to get any message they want on that list if
>> they happened to notice that it existed. See also Boaty McBoatface
>> <https://en.wikipedia.org/wiki/RRS_Sir_David_Attenborough#Name> and Mountain
>> Dew "Diabeetus"
>> <https://storify.com/cbccommunity/hitler-did-nothing-wrong-wins-crowdsourced…>
>> (which is not at all the worst thing on *that* list). We don't want to
>> have to try to defend against that.
>>
>> In my experience, the quality of what's actually there isn't that great.
>> One of my first tasks when I joined Discovery was to look at daily lists of
>> top 100 zero-results queries that had been gathered automatically. I was
>> excited by this same idea. The top 100 zero-results query list was a
>> wasteland. (Minimal notes on some of what I found are here
>> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Resul…>.)
>> We could make it better by focusing on human-ish searchers, using basic
>> bot-exclusion techniques
>> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Optimization…>,
>> ignoring duplicates from the same IP, and such, but I don't think it would
>> help. And while Wikipedia is not for children, there could be an annoying
>> amount of explicit adult material on the list, too. We would probably find
>> some interesting spellings of Facebook and WhatsApp, though.
>>
>> If we're really excited about this, I could imagine using better
>> techniques to pull zero-results queries and see if anything good is in
>> there, but we'd have to commit to some sort of review before we publish it.
>> For example, Discernatron <https://discernatron.wmflabs.org/> data,
>> after consulting with legal, is reviewed independently by two people, who
>> then have to reconcile any discrepancies, before being made public. So I
>> think we'd need an ongoing commitment to have at least two people under NDA
>> who would review any list before publication. Reviewing 500-600 queries
>> takes a couple of hours per person (we’ve done that for the Discernatron),
>> so the top 100 would probably take less than an hour. I'd even be willing to help with
>> the review (as I am for Discernatron) if we found there was something
>> useful in there—but I'm not terribly hopeful. We'd also need more people to
>> efficiently and effectively review queries for other languages if we wanted
>> to extend this beyond English Wikipedia.
>>
>> Finally, if this is important enough and the task gets prioritized, I'd
>> be willing to dive back in and go through the process once and pull out the
>> top zero-results queries, this time with basic bot exclusion and IP
>> deduplication—which we didn't do early on because we didn't realize what a
>> mess the data was. We could process a week or a month of data and
>> categorize the top 100 to 500 results in terms of personal info, junk,
>> porn, and whatever other categories we want or that bubble up from the
>> data, and perhaps publish the non-personal-info part of the list as an
>> example, either to persuade ourselves that this is worth pursuing, or as a
>> clearer counter to future calls to do so.
>> —Trey
>>
>> Trey Jones
>> Software Engineer, Discovery
>> Wikimedia Foundation
>>
>> On Fri, Jul 15, 2016 at 10:09 AM, Pine W <wiki.pine(a)gmail.com> wrote:
>>
>>> Forwarding
>>>
>>> Pine
>>> ---------- Forwarded message ----------
>>> From: "James Heilman" <jmh649(a)gmail.com>
>>> Date: Jul 15, 2016 06:33
>>> Subject: [Wikimedia-l] Improving search (sort of)
>>> To: "Wikimedia Mailing List" <wikimedia-l(a)lists.wikimedia.org>
>>> Cc:
>>>
>>> A while ago I requested a list of the "most frequently searched-for terms
>>> for which no Wikipedia articles are returned". This would allow the
>>> community to then create redirects or new pages as appropriate and help
>>> address the "zero results rate" of about 30%.
>>>
>>> While we are still waiting for this data, I have recently come across a
>>> list of the most frequently clicked-on redlinks on En WP, produced by
>>> Andrew West:
>>> https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_redlinks
>>> Many of these can be reasonably addressed with a redirect, as the issue
>>> is often capitalization.
>>>
>>> Does anyone know where things stand with respect to producing the list of
>>> most searched-for terms that return nothing?
>>>
>>> --
>>> James Heilman
>>> MD, CCFP-EM, Wikipedian
>>>
>>> The Wikipedia Open Textbook of Medicine
>>> www.opentextbookofmedicine.com
>>>
>>> _______________________________________________
>>> discovery mailing list
>>> discovery(a)lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/discovery
>>>
>>>
>>
>
>
> --
> James Heilman
> MD, CCFP-EM, Wikipedian
>
> The Wikipedia Open Textbook of Medicine
> www.opentextbookofmedicine.com
>
On 15 July 2016 at 08:44, James Heilman <jmh649(a)gmail.com> wrote:
>
> Thanks for the in-depth discussion. So if the terms people are using that
> result in "zero search results" are typically gibberish why do we care if
> 30% of our searches result in "zero search results"? A big deal was made
> about this a while ago.
>
Good question! I used to say that it was my aspiration that
users should never get zero results when searching Wikipedia. As a result
of Trey's analysis, I don't say that any more. ;-) There are many
legitimate cases where users should get zero results. However, there are
still tons of examples of where giving users zero results is incorrect;
"jurrasic world" was a prominent example of that.
It's still not quite right to say that *all* the terms that people use to
get zero results are gibberish. There is an extremely long tail
<https://en.wikipedia.org/wiki/Long_tail> of zero-results queries that
aren't gibberish; it's just that the top 100 are dominated by gibberish.
This would mean we'd have to release many, many more than the top 100,
which significantly increases the risk of releasing personal information.
> If one was just to look at those search terms that more than 100 IPs
> searched for would that not remove the concerns about anonymity? One could
> also limit the length of the searches displaced to 50 characters. And just
> provide the first 100 with an initial human review to make sure we are not
> miss anything.
>
The problem with this is that there are still no guarantees. What if you
saw the search query "DF198671E"? You might not think anything of it, but I
would recognise it as an example of a national insurance number
<https://en.wikipedia.org/wiki/National_Insurance_number>, the British
equivalent of a social security number [1]. There's always going to be the
potential that we accidentally release something sensitive when we release
arbitrary user input, even if it's manually examined by humans.
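To make that concrete, a shape check for NI-number-like strings is easy to
write, but shape is not validity. This is a hypothetical sketch; the real
validity rules are stricter than the pattern below:

```python
import re

# Shape only: two letters, six digits, one trailing letter. Real NI
# validity rules are stricter (some prefix letters are never issued and
# only a few suffix letters are valid), so this over-matches by design.
NINO_SHAPE = re.compile(r"\b[A-Z]{2}\d{6}[A-Z]\b")

def resembles_nino(query):
    """True if the query contains something shaped like an NI number."""
    return bool(NINO_SHAPE.search(query.upper()))
```

A human reviewer without such a checker could easily wave a string like this
through, and a checker alone can't tell a real number from an invalid one,
which is the point: no review process makes releasing arbitrary user input
risk-free.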
So, in summary:
- The top 100 zero-results queries are dominated by gibberish.
- There's a long tail of zero-results queries, meaning we'd have to
release many more than the top 100.
- Manually examining the top zero-results queries is not a foolproof way
of eliminating personal data, since it's arbitrary user input.
I'm happy to answer any questions. :-)
Thanks,
Dan
[1]: Don't panic, this example national insurance number is actually
invalid. ;-)
--
Dan Garry
Lead Product Manager, Discovery
Wikimedia Foundation