On Wed, Jan 25, 2017 at 10:15 AM, Brad Jorsch (Anomie) <
> On Wed, Jan 25, 2017 at 2:09 AM, <byeh(a)yahoo-inc.com> wrote:
>> While I was developing some services based on API:Opensearch, I found
>> that the response to the same URL request can be either Simplified Chinese
>> or Traditional Chinese. To be more specific, I would love to know how I
>> can determine the response language variant from the API layer (or what
>> other factors may have an impact), since the documentation of
>> API:Opensearch doesn't seem to take language into consideration.
> The OpenSearch Suggestions extension specification does not allow for
> returning additional metadata, such as language, with the response. You may
> want to look at the prefixsearch query module instead, which returns the
> same results in a different format, although I don't know the details of
> how language variants are handled in the search output.
> : http://www.opensearch.org/Specifications/OpenSearch/
> : https://www.mediawiki.org/wiki/API:Prefixsearch
> Brad Jorsch (Anomie)
> Senior Software Engineer
> Wikimedia Foundation
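As an illustration of the difference between the two modules, here is a
minimal sketch against zh.wikipedia.org (the endpoint and query are just
examples; variant handling itself is wiki-dependent, as noted above):

```python
# Minimal sketch: the same prefix query via both API modules.
import requests

API = "https://zh.wikipedia.org/w/api.php"

# OpenSearch: a flat suggestion list with no per-result metadata
# such as the language variant used.
r = requests.get(API, params={
    "action": "opensearch",
    "search": "维基",
    "format": "json",
})
print(r.json()[1])  # response is [query, [titles], [descriptions], [urls]]

# prefixsearch: the same matches in the richer query-module envelope.
r = requests.get(API, params={
    "action": "query",
    "list": "prefixsearch",
    "pssearch": "维基",
    "format": "json",
})
for hit in r.json()["query"]["prefixsearch"]:
    print(hit["pageid"], hit["title"])
```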
The Interactive Team in Discovery is in the process of putting its work on
pause. The team's aim during this period is to get its work to a stable and
maintainable state. Currently, work on new features is on hold. It is not
yet known what the
timeline is for this transition to a paused state, or whether there will be
further deployments of features that have already been completed. I will
update this list when there is more information.
Lead Product Manager, Discovery
What started out as an attempt to derive useful confidence measures for
language identification (with TextCat
<https://www.mediawiki.org/wiki/TextCat>) turned into a generalized
improvement effort. We still don't have useful external confidence
measures—though there's a little work yet to be done there (T149323
<https://phabricator.wikimedia.org/T155670>). However, I did get a sizable
improvement to the F0.5 <https://en.wikipedia.org/wiki/F1_score> accuracy
scores by improving TextCat internals that don't really generalize to
externally useful measures. The result was a mean improvement of just under
5% across the corpora from nine Wikipedias. The two worst performing
corpora, enwiki and nlwiki, each went up around 10%! All nine are now above
90% F0.5 score.
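For reference, F0.5 is the F-beta score with beta = 0.5, which weights
precision more heavily than recall. A quick sketch (the example numbers are
illustrative, not from our corpora):

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# e.g. a language identifier with precision 0.95 and recall 0.80:
print(round(f_beta(0.95, 0.80), 3))  # 0.916
```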
You can read the final summary and recommendations
or read the rest of the page, too, if you want to know more about the whole
odyssey, or if you have trouble sleeping. ;)
Next steps for language identification are to get these changes deployed,
and then to look at other measures of confidence, and/or extend language
identification to more wikis, though the latter two may take a backseat to
working on new and improved language analyzers
<https://phabricator.wikimedia.org/T154511> for the rest of this quarter.
Software Engineer, Discovery
As we keep coming up with more ways to try to rescue unsuccessful
queries—"Did you mean" suggestions, language detection, quote stripping,
wrong keyboard detection, etc.—we need a plan for how they interact with
each other.
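To make the kind of interaction in question concrete, here is a minimal
sketch (the names are hypothetical, not the actual CirrusSearch code) of one
possible policy: an ordered fallback chain that stops at the first rewrite
producing results:

```python
# Hypothetical ordered fallback chain: each rescue method runs only if
# the original query and all earlier rewrites produced no results.
def search_with_fallbacks(query, search, rewrites):
    results = search(query)
    if results:
        return query, results
    # e.g. rewrites = [strip_quotes, fix_keyboard, detect_language]
    for rewrite in rewrites:
        new_query = rewrite(query)
        if new_query and new_query != query:
            results = search(new_query)
            if results:
                return new_query, results
    return query, []
```

Whether the chain should stop at the first hit, or try several rewrites and
merge, is exactly the sort of question a coordinated plan needs to settle.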
I've put together a straw man proposal for how to deal with all of this, so
we can have a more co-ordinated conversation:
Comments and questions here or on the talk page are welcome!
Software Engineer, Discovery
tl;dr: Can feature vectors describing the relevance of (query, page_id)
pairs be released to the public if the final dataset only represents queries
with clicks from at least 50 unique sessions?
Over the past 2 months I've been spending free time working on
investigating machine learning for ranking. One of the earlier things I
tried, to get some semblance of proof it had the ability to improve our
search results, was to port a set of features for text ranking from an open
source Kaggle competitor to a dataset I could create from our own data. For
relevance targets I took queries that had clicks from at least 50 unique
sessions over a 60 day period and ran them through a click model (DBN).
Perhaps not as useful as human judgements, but I'm working with what I have.
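As a rough sketch of that query-selection step (the column and file names
here are hypothetical; the real pipeline differs):

```python
# Keep only queries clicked in >= 50 unique sessions over the window,
# assuming one row per (session_id, query, page_id) click event.
import pandas as pd

clicks = pd.read_parquet("click_log_60d.parquet")  # hypothetical source
sessions_per_query = clicks.groupby("query")["session_id"].nunique()
popular = sessions_per_query[sessions_per_query >= 50].index
training_log = clicks[clicks["query"].isin(popular)]
# training_log then feeds a click model such as a DBN, which produces
# the relevance label for each (query, page_id) pair.
```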
This actually showed some promise, and I've been moving further along. An
idea was suggested to me, though, about releasing the feature vectors from
my initial investigation in an open format that might be useful to others.
Each feature vector is for a (query, hit_page_id) pair that was displayed to
at least 50 users.
I don't have my original data, but I have all the code, and I just ran
through it with 100 normalized queries to get a count: there are 4852
features. Lots of them are probably useless, but choosing which ones is
probably half the battle. These are ~230MB in pickle format, which stores
the floats in binary. This can then be compressed to ~20MB with gzip, so the
data size isn't particularly insane. In a released dataset I would probably
use 10k normalized queries, meaning about 100x this size. We could plausibly
release CSVs instead of pickled numpy arrays. That will probably increase
the data size further, but since we are only talking ~2GB after compression
it could go either way.
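For a feel for the format tradeoff, a minimal sketch (the placeholder matrix
is random, so the sizes it prints won't match the ~230MB/~20MB figures
above, which come from the real, more compressible data):

```python
# Compare on-disk formats for a feature matrix of 4852 columns.
import gzip
import pickle
import numpy as np

vectors = np.random.rand(1000, 4852)  # placeholder feature matrix

raw = pickle.dumps(vectors, protocol=pickle.HIGHEST_PROTOCOL)
print(len(raw) / 1e6, "MB pickled")

compressed = gzip.compress(raw)
print(len(compressed) / 1e6, "MB pickled+gzipped")

# CSV alternative: larger, but readable outside Python.
np.savetxt("vectors.csv.gz", vectors, delimiter=",")
```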
The list of feature names is in https://phabricator.wikimedia.org/P4677.
A few example features and their meanings, which hopefully are enough to
understand the rest of the feature names:
- Dice distance of bigrams in the normalized (stemmed) query string versus
outgoing links. Outgoing links are an array field, so the Dice distance is
calculated per item and this feature takes the max value (see the sketch
after this list).
- Number of digits in the raw user query
- Cosine similarity of the top 50 terms, as reported by the elasticsearch
termvectors API, of the normalized query vs the category.plain field of the
matching document. More terms would perhaps have been nice, but doing this
all offline in python made that a bit of a time+space tradeoff.
- Log base 10 of the score from the elasticsearch termvectors API on the
raw user query applied to the opening_text field analysis chain.
- Mean longest match, in number of characters, of the query vs the list of
headings for the page.
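A minimal sketch of the bigram Dice distance from the first example above
(word bigrams here; the real feature may use character bigrams):

```python
# Dice distance = 1 - Dice coefficient over the sets of bigrams.
def bigrams(tokens):
    return set(zip(tokens, tokens[1:]))

def dice_distance(a_tokens, b_tokens):
    a, b = bigrams(a_tokens), bigrams(b_tokens)
    if not a and not b:
        return 0.0
    return 1.0 - 2.0 * len(a & b) / (len(a) + len(b))

# Per the feature description: computed per outgoing link, max value kept.
query = "world cup 2014".split()
links = ["2014 fifa world cup".split(), "football".split()]
feature = max(dice_distance(query, link) for link in links)
```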
The main question here, I think, is: is this still PII? The exact queries
would be normalized into IDs and not released. We could leave the page_id
in or out of the dataset. With it left in, people using the dataset could
plausibly come up with their own query-independent features to add. With a
large enough feature vector for (query_id, page_id) the query could
theoretically be reverse engineered, but from a more practical side I'm not
sure that's really a valid concern.
Thoughts? Concerns? Questions?
As part of our goals for Q3 FY 2016-17
(Jan - Mar 2017), the Search Team will be researching, testing, and
deploying new language analysers.
Language analysers are features in Elasticsearch that analyse and alter
queries to give users better results. Language analysers perform important
functions such as tokenisation
<https://en.wikipedia.org/wiki/Tokenization_(lexical_analysis)>, and can
also alter queries with language-specific features, such as:
- The English analyser would make the query "john's" also search for
"john".
- The German analyser would make the query "äußerst" also search for
"ausserst".
These alterations to users' queries improve the relevance of the results
compared to not analysing the queries, because they can add extra documents
that may be relevant into the results. Elastic has a bunch of documentation
if you want to read more about what the language analysers do.
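A minimal sketch of inspecting this behaviour with Elasticsearch's _analyze
API (assuming a local 5.x cluster; the exact token output depends on the
analyser version):

```python
import requests

def analyze(analyzer, text, host="http://localhost:9200"):
    """Return the tokens an Elasticsearch analyser produces for `text`."""
    resp = requests.get(host + "/_analyze",
                        json={"analyzer": analyzer, "text": text})
    return [t["token"] for t in resp.json()["tokens"]]

print(analyze("english", "john's"))  # expected: ['john']
print(analyze("german", "äußerst"))  # expected: ['ausserst']
```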
Some of the criteria we'll be using to evaluate the new analysers are:
- how much better we expect the analyser to be than the one we have
- the maturity and maintainability of the code of the analyser
- flexibility of customisation of the plugin
We'll be testing using our standard search metrics, such as the zero
results rate.
We'll be starting with Polish, since we already have some ideas for
possible new plugins, and that'll allow us to more precisely figure out
what criteria we want to use when evaluating the plugin.
As always, if there are any questions, please let me know!
Lead Product Manager, Discovery