Hey Purodha,
On 21 May 2016 at 05:13, Purodha Blissenbach <purodha(a)blissenbach.org>
wrote:
> On the long run, I think, these portals and their texts should
> be translatable. Browser settings determining the target language.
> Looking forward to have them on translatewiki.net !
>
I agree that localising these strings would be helpful and in line with our
practices. That's definitely something that we're interested in doing, and
we're going to be doing an investigation on that soon. We're hoping it'll
be fairly straightforward to get this done... but if it's not, we may need
to deprioritise the work. We'll see.
Thanks!
Dan
--
Dan Garry
Lead Product Manager, Discovery
Wikimedia Foundation
Hi everyone,
Mikhail, Data Analyst Extraordinaire, recently published his report, "From
Zero to Hero"[1] on the relationship between various features of queries as
strings (rather than the content of the query) and those queries getting no
results.
Today for my 10% project I took a quick look at the two most impactful
features, quotes and question marks. These two features stood out in
Mikhail's report as having both relatively high volume and a relatively
higher chance of getting no results.
I'm not planning on doing a more formal report right now, though I will
probably copy this email to my Notes page.
Quotes make sense, as we try to get an exact match for strings inside
quotes, which limits our options for making a match. Question marks are
actually a little-known, little-used, poorly documented, and poorly
understood wildcard: they stand for any single character. Most users use
them to ask questions.
I took a random sample of 50,000 English Wikipedia queries (using my
now-favorite criteria at [2]—basically, full text queries from normal
humans (as best as we can tell) with fewer than 3 results). I extracted all
the queries with quotes (170) and all the queries that ended in question
marks, that is, looked like questions (274). There were 4 queries that were
all question marks and spaces (e.g., ???? ???????? ????); they caused problems
because they are very expensive queries that repeatedly failed on the test
cluster, so I discarded them. I also took a random sub-sample of 1K queries from the
larger sample of 50K.
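The sample splitting described above can be sketched roughly as follows. This is a minimal illustration, not the code actually used: the function names are hypothetical, "quotes" is assumed to mean ASCII double quotes, and it is an assumption that the all-question-mark queries are left out of the question sample only.

```python
import random
from typing import Dict, List


def has_quotes(query: str) -> bool:
    """True if the query contains an ASCII double quote (assumed definition)."""
    return '"' in query


def looks_like_question(query: str) -> bool:
    """True if the query ends in a question mark, ignoring trailing spaces."""
    return query.rstrip(" ").endswith("?")


def is_only_question_marks(query: str) -> bool:
    """True for queries made up of nothing but question marks and spaces."""
    chars = set(query)
    return "?" in chars and chars <= {"?", " "}


def split_sample(queries: List[str], subsample_size: int, seed: int = 0) -> Dict[str, List[str]]:
    """Split a query sample into the subsets described in the email."""
    discarded = [q for q in queries if is_only_question_marks(q)]
    quoted = [q for q in queries if has_quotes(q)]
    questions = [
        q for q in queries
        if looks_like_question(q) and not is_only_question_marks(q)
    ]
    # Random sub-sample of the whole sample (1K out of 50K in the email).
    rng = random.Random(seed)
    subsample = rng.sample(queries, min(subsample_size, len(queries)))
    return {
        "quoted": quoted,
        "questions": questions,
        "discarded": discarded,
        "subsample": subsample,
    }
```

Fixing the seed keeps the sub-sample reproducible between runs, which helps when the same queries are re-run under several configurations.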
All samples had plenty of gibberish queries (e.g.,
"fhdsfhsdjkfgdsjklgsdl"?), queries in other languages, and the other usual
cruft.
*For the sample with quotes,* I used Relevance Forge to compare the results
of running queries as is vs replacing quotes with spaces. The summary stats
are below. The zero results rate for queries with quotes went down by
almost half, and more than half of the queries had changes in their top 5
results. The TotalHits stats are wildly skewed by one query that increased
its results by over 300,000. (There always seems to be an outlier!)
*Metrics:*
*Query Count:* 170
Num TotalHits Changed: μ: 3049.99; σ: 26435.14; median: 1.00
*Zero Results:* 38.2% (-37.1%)
*Top 5 Sorted Results Differ:* 51.8%
*Top 5 Unsorted Results Differ:* 51.2%
Num Top 5 Results Changed: μ: 2.14; σ: 2.30; median: 1.00
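For reference, the "quotes replaced with spaces" variant amounts to a one-line transform. A minimal sketch, assuming only ASCII double quotes count as quotes (the email does not say which characters were treated as quotes); `replace_quotes_with_spaces` is a hypothetical helper name, not part of Relevance Forge:

```python
def replace_quotes_with_spaces(query: str) -> str:
    """Replace each ASCII double quote with a space; leave the rest untouched."""
    return query.replace('"', " ")


print(replace_quotes_with_spaces('"to be or not" hamlet'))
```

Replacing with a space rather than deleting the quote keeps adjacent words apart when a quote sits directly between them (e.g., foo"bar" becomes two separate words).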
*For the sample with question marks,* I used Relevance Forge to compare the
results of running queries as is vs dropping all trailing question marks
and spaces. Some queries ended in multiple question marks (removed), and
some queries had other question marks in the middle of the query (kept).
The summary stats are below. The picture is similar to the one for quotes:
almost half of the zero results queries got results, more than half of
all queries had changes to their top 5 results, and the mean number of
total hits is blown out by one query that got more than 300K additional
results.
*Metrics:*
*Query Count:* 274
Num TotalHits Changed: μ: 1875.48; σ: 19885.60; median: 1.00
*Zero Results:* 43.1% (-39.1%)
*Top 5 Sorted Results Differ:* 53.3%
*Top 5 Unsorted Results Differ:* 53.3%
Num Top 5 Results Changed: μ: 2.22; σ: 2.33; median: 1.00
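The "drop trailing question marks and spaces" variant can be sketched the same way. This is a minimal illustration under the assumption that only ASCII question marks and plain spaces are stripped; `drop_trailing_question_marks` is a hypothetical name:

```python
def drop_trailing_question_marks(query: str) -> str:
    """Strip any run of question marks and spaces from the end of the query.

    Question marks in the middle of the query are kept.
    """
    return query.rstrip("? ")


print(drop_trailing_question_marks("who? what is a dog??  "))
```

Note that a query consisting only of question marks and spaces becomes empty under this transform, so such queries need separate handling.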
*For the 1K query sample,* I used Relevance Forge to compare the results of
running queries as is vs (a) replacing quotes with spaces, (b) dropping all
trailing question marks and spaces, and (c) doing both (there are even a
very few queries with both quotes and trailing question marks!).
Keep in mind that these are all poorly performing queries (fewer than 3
results). Summary results:
(a) quotes
*Metrics:*
*Query Count:* 1000
Num TotalHits Changed: μ: 0.31; σ: 9.70; median: 0.00
*Zero Results:* 79.5% (-0.1%)
*Top 5 Sorted Results Differ:* 0.1%
*Top 5 Unsorted Results Differ:* 0.1%
Num Top 5 Results Changed: μ: 0.01; σ: 0.16; median: 0.00
(b) question marks
*Metrics:*
*Query Count:* 1000
Num TotalHits Changed: μ: 0.16; σ: 3.45; median: 0.00
*Zero Results:* 79.4% (-0.2%)
*Top 5 Sorted Results Differ:* 0.4%
*Top 5 Unsorted Results Differ:* 0.4%
Num Top 5 Results Changed: μ: 0.02; σ: 0.32; median: 0.00
(c) quotes and question marks (pretty much the sum of the previous two!)
*Metrics:*
*Query Count:* 1000
Num TotalHits Changed: μ: 0.47; σ: 10.30; median: 0.00
*Zero Results:* 79.3% (-0.3%)
*Top 5 Sorted Results Differ:* 0.5%
*Top 5 Unsorted Results Differ:* 0.5%
Num Top 5 Results Changed: μ: 0.03; σ: 0.35; median: 0.00
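The "pretty much the sum of the previous two" observation can be checked directly against the figures above. A small sketch, assuming the parenthesised zero results figures are changes in percentage points relative to the baseline run:

```python
import math

# Figures copied from the (a), (b), and (c) summaries above.
STATS = {
    "mean TotalHits change": {"quotes": 0.31, "question_marks": 0.16, "both": 0.47},
    "zero results change (points)": {"quotes": -0.1, "question_marks": -0.2, "both": -0.3},
    "top 5 sorted results differ (%)": {"quotes": 0.1, "question_marks": 0.4, "both": 0.5},
    "mean top 5 results changed": {"quotes": 0.01, "question_marks": 0.02, "both": 0.03},
}


def sum_of_parts(row):
    """Add the quotes-only and question-marks-only figures."""
    return round(row["quotes"] + row["question_marks"], 2)


def is_additive(row):
    """True if the combined run equals the sum of the two separate runs."""
    return sum_of_parts(row) == row["both"]


for name, row in STATS.items():
    print(f"{name}: (a) + (b) = {sum_of_parts(row)}, (c) = {row['both']}")

# The standard deviations do not add directly, but their squares nearly do:
# combining 9.70 and 3.45 in quadrature gives a value close to the reported 10.30.
print(round(math.hypot(9.70, 3.45), 2))
```

The near-exact additivity is what one would expect if the two changes affect almost entirely different queries, which fits the remark that very few queries have both quotes and trailing question marks.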
Overall, it's a pretty small effect, and many of the results are not
great when quotes are dropped, but it's a very small effort to make
the change.
A quick look at the queries with question marks didn't show any that were
obviously intended to be used as wildcards (except maybe
all-question-marks, like ????—but who knows what that is supposed to be?).
It has been suggested before and I would also now recommend disabling ? as
a wildcard—it causes many more problems than it solves.
Re-running poor-performing queries that have quotes without the quotes is
an easy win. We should do that too!
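One way the re-run could look, as a rough sketch only: `search` is a hypothetical callable standing in for whatever actually runs the query, "poor-performing" is assumed to mean fewer than 3 results (the threshold used for the samples above), and only ASCII double quotes are handled.

```python
from typing import Callable, List


def search_with_quote_fallback(
    query: str,
    search: Callable[[str], List[str]],
    min_results: int = 3,
) -> List[str]:
    """Run the query as is; if it does poorly and has quotes, retry without them."""
    results = search(query)
    if len(results) >= min_results or '"' not in query:
        return results
    # Retry with quotes replaced by spaces, and keep whichever run found more.
    retry = search(query.replace('"', " "))
    return retry if len(retry) > len(results) else results
```

Keeping the original results when the retry finds nothing more means the fallback can only add results, never take them away.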
Thoughts, comments, and suggestions welcome!
—Trey
[1] https://github.com/wikimedia-research/Discovery-Search-Adhoc-QueryFeatures/…
[2] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Optimization…
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
Hello,
Just a quick note: the Discovery Portal team updated the articles by
language statistics on wikipedia.org this morning. I've uploaded a small
screenshot to Commons
<https://commons.wikimedia.org/wiki/File:Wikipedia_portal_stats_update_as_of…>
that displays the new numbers.
Cheers,
Deb
--
Deb Tankersley
Product Manager, Discovery
Wikimedia Foundation
Forwarding to the public Discovery list.
In summary, we're upgrading from Elasticsearch 1.7 to Elasticsearch 2.3. If
everything goes according to plan, users should not notice us doing this.
Stay tuned for more info. :-)
Thanks,
Dan
On 25 May 2016 at 05:22, Guillaume Lederrey <glederrey(a)wikimedia.org> wrote:
> Hello all!
>
> There was some communication and multiple phab tickets [1], but I want
> to make sure the message is going through correctly. Let me know if
> there are more steps that need to be taken in terms of
> communication. I will update the deployment page [2] with key actions
> this evening (I'm waiting for a more detailed timing).
>
> We are starting to upgrade Elasticsearch to version 2.3 this Thursday
> (May 26). If all goes as planned, this should be entirely transparent
> (but we all know what happens to the best laid plans).
>
> Rough timeline:
>
> May 26-27: upgrade Elasticsearch and MediaWiki on beta
> May 30: upgrade Elasticsearch in codfw, search traffic sent to eqiad
> May 31: upgrade MediaWiki (as part of the standard deploy train)
> June 3-6: upgrade Elasticsearch in eqiad, search traffic sent to
> codfw, send traffic back to normal routes once upgrade is completed
>
> A more detailed timeline is available in phab [1].
>
> Things to note:
> * This change affects all extensions talking to Elasticsearch,
> including Translate, ApiFeatures and GeoData
> * New MediaWiki code is compatible with Elasticsearch 2.3, but NOT
> with 1.7. Rollback of that deployment means we also have to reroute
> traffic to a 1.7 Elasticsearch cluster.
> * Logstash upgrade will be done in a different change (tracked in phab [3])
> * We tested this as much as we could, but this is a major upgrade. The
> Discovery search team will be available for fast patching of any issue
> we find.
>
> @Christopher: while we do not expect user impact, it might make sense
> to send a more general warning about those operations to our
> community. If only to let them know we are working hard to keep our
> systems up to date. Let me know if you want to have a chat about that.
>
> Thanks all for your patience!
>
> Guillaume
>
>
> [1] https://phabricator.wikimedia.org/T133124 (and related)
> [2] https://wikitech.wikimedia.org/wiki/Deployments
> [3] https://phabricator.wikimedia.org/T136001
>
> --
> Guillaume Lederrey
> Operations Engineer, Discovery
> Wikimedia Foundation
--
Dan Garry
Lead Product Manager, Discovery
Wikimedia Foundation
Purodha Blissenbach, 21/05/2016 14:13:
> On the long run, I think, these portals and their texts should
> be translatable. Browser settings determining the target language.
> Looking forward to have them on translatewiki.net !
Adding English-only text to the Wikipedia portal is unacceptable.
Special powers on a Wikimedia domain must not be used to contradict and
impoverish the Wikimedia mission. The seizure of the portal by a small WMF
clique has shown its failure and should be reversed immediately, as the
Meta-Wiki administrators have proven to be more competent.
Nemo
Hey DJ,
Thanks for the feedback! Responses in-line.
On 21 May 2016 at 04:52, Derk-Jan Hartman <d.j.hartman+wmf_ml(a)gmail.com>
wrote:
> I like the addition of the descriptive subtitles. But I would suggest
> taking them to meta to settle on what they should be exactly, and also then
> documenting them (including arguments) to make sure that they can be used
> consistently throughout the projects.
>
I agree that some standardisation of the phrases used here would be useful.
I'll pass that feedback on to Communications, who I believe handles most of
these kinds of situations. In the meantime, I think Discovery can take a
quick pass on the phrases that are being used on the portal to make them a
bit more consistent.
> I was wondering about the colors. Have we considered the MediaWiki/OOjs UI
> color theme already? In my opinion the portal feels more cologneblue than
> Vector right now.
>
I can see what you mean. A lot of the styles of the new elements have been
made to fit the old style of the page. I'm not a designer, so I don't know
specifically what to recommend to the team here, but I think they can keep
this in mind for the future.
> Also, I do wonder a bit about the consistency of the portals and the lack
> of options for reuse of these improvements by other portals, and I
> personally think it would be great to start expanding parts of the
> development to other portals now.
> I think it would be wonderful if we could create a pipeline of reusable
> elements among the portals, that allows for some consistency, but trying to
> avoid blandness and uniformity. Simple things like a library of Less
> variables usable by all portal pages can mean a lot for these kinds of
> efforts and I'd love to see some attention devoted to that, so that other
> portal pages can benefit.
>
As explained in T110070#1653320
<https://phabricator.wikimedia.org/T110070#1653320>, Discovery is not
actively maintaining the other portals. That said, I agree that trying to
get our code to a state where it's easily reusable for other portals so
that interested people can migrate it over to the other portals would be
good to do. I filed T136151 <https://phabricator.wikimedia.org/T136151> to
track that work.
> For community participation, I also have some ideas:
> 1: There is no README.md
>
Good point. I saw you filed T135902
<https://phabricator.wikimedia.org/T135902> and T135903
<https://phabricator.wikimedia.org/T135903> for this. Thanks! The licensing
question is somewhat complicated given the history of the portals; we will
need to consult with Legal on this to make sure we get it right.
> 2: Make sure that it's easy to test the master version.
> The great thing about GitHub for instance is that you can do tricks like:
>
> https://cdn.rawgit.com/wikimedia/portals/master/prod/wikipedia.org/index.ht…
> That's powerful to be able to preview straight from a git repo. If you
> have links like that to the readme/meta page.
>
I'll pass this feedback on to the engineers working on the project.
> 3: Update https://meta.wikimedia.org/wiki/Project_portals
>
I wasn't aware of this page. I'll pass it on to Chris Koerner, Discovery's
community liaison, so he can look at updating it.
Thanks!
Dan
--
Dan Garry
Lead Product Manager, Discovery
Wikimedia Foundation
Just a reminder: on Thursday/Friday we will start upgrading the beta cluster
to Elasticsearch 2.3, followed by the codfw cluster on Monday and the eqiad
cluster the following Friday/Monday.