We're happy to announce that after numerous tests and analyses[1] and a
fully operational demo[2], the Discovery Team is ready to release
TextCat[3] into production on wiki.
What is TextCat? It is a language detection library based on n-grams[4].
It detects the language that a search query was written in, which allows
us to look for results on a different wiki. During a search, TextCat will
only kick in when all three of the following conditions are met:
1. fewer than 3 results are returned from the query on the current wiki
2. language detection is successful (meaning that TextCat is reasonably
certain what language the query is in, and that it is different from the
language of the current wiki)
3. the other wiki (in the detected language) has results
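The three conditions above can be sketched as a single check. This is only an illustration of the logic described in this email, not the actual CirrusSearch code: the function name, its parameters, and the use of None for "detection failed" are assumptions made for the example.

```python
from typing import Optional

# From the announcement: TextCat kicks in on "fewer than 3 results".
MIN_RESULTS = 3


def should_show_cross_wiki_results(
    current_result_count: int,
    detected_language: Optional[str],
    current_wiki_language: str,
    other_wiki_result_count: int,
) -> bool:
    """Return True only when all three conditions from the list hold.

    detected_language is None when detection was not confident enough
    (a hypothetical convention chosen for this sketch).
    """
    # 1. Fewer than 3 results on the current wiki.
    too_few_results = current_result_count < MIN_RESULTS
    # 2. Detection succeeded and found a language other than the wiki's own.
    detection_succeeded = (
        detected_language is not None
        and detected_language != current_wiki_language
    )
    # 3. The wiki in the detected language has results.
    other_wiki_has_results = other_wiki_result_count > 0
    return too_few_results and detection_succeeded and other_wiki_has_results
```

For example, a Russian query on enwiki with no local results and some results on ruwiki would pass the check, while the same query with three or more local results would not.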
Our analysis of the A/B test[5] (for the English, French, Spanish, Italian
and German Wikipedias) showed that:
"...The test groups not only had a substantially lower zero results rate
(57% in control group vs 46% in the two test groups), but they had a higher
clickthrough rate (44% in the control group vs 49-50% in the two test
groups), indicating that we may be providing users with relevant results
that they would not have gotten otherwise."
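To put the quoted figures in perspective, the differences can be expressed both in percentage points and as relative changes. The snippet below is a small illustrative calculation that uses only the rates quoted above; the function names are made up for this example and do not come from the report.

```python
def percentage_point_change(control: float, test: float) -> float:
    """Absolute difference between two rates given in percent."""
    return test - control


def relative_change(control: float, test: float) -> float:
    """Change relative to the control rate, as a fraction of the control."""
    return (test - control) / control


# Zero results rate: 57% in control vs 46% in the test groups.
zero_results_pp = percentage_point_change(57, 46)
zero_results_rel = relative_change(57, 46)

# Clickthrough rate: 44% in control vs 49-50% in the test groups.
clickthrough_pp = (percentage_point_change(44, 49), percentage_point_change(44, 50))
clickthrough_rel = (relative_change(44, 49), relative_change(44, 50))

print(f"zero results: {zero_results_pp} points, {zero_results_rel:.1%} relative")
print(
    f"clickthrough: +{clickthrough_pp[0]} to +{clickthrough_pp[1]} points, "
    f"{clickthrough_rel[0]:.1%} to {clickthrough_rel[1]:.1%} relative"
)
```

In other words, the zero results rate dropped by 11 percentage points, and the clickthrough rate rose by 5 to 6 percentage points.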
This update is scheduled for production release during the week of
July 25, 2016 on the following Wikipedias:
- English [6]
- German [7]
- Spanish [8]
- Italian [9]
- French [10]
TextCat will then be added to this next group of Wikipedias at a later
date:
- Portuguese[11]
- Russian[12]
- Japanese[13]
This is a huge step forward in creating a search mechanism that is able to
detect - with a high level of accuracy - the language a query was written
in and produce results in that language. Looking ahead, we are also
investigating a confidence measuring algorithm[14], to ensure that the
language detection results are the best they can be.
We will also be doing more[15] A/B tests using TextCat on non-Wikipedia
sites, such as Wikibooks and Wikivoyage. These new tests will give us
insight into whether applying the same language detection configuration
across projects would be helpful.
Please let us know on the TextCat discussion page[16] if you have any
questions or concerns. For screenshots of what this update will look like,
please see this one[17], showing an existing search for the Russian query
"первым экспериментом" typed in on enwiki, and this one[18], showing what
it will look like once TextCat is in production on enwiki.
Thanks!
[1] https://phabricator.wikimedia.org/T118278
[2] https://tools.wmflabs.org/textcatdemo/
[3] https://www.mediawiki.org/wiki/TextCat
[4] https://en.wikipedia.org/wiki/N-gram
[5] https://commons.wikimedia.org/wiki/File:Report_on_Cirrus_Search_TextCat_AB_…
[6] https://en.wikipedia.org/
[7] https://de.wikipedia.org/
[8] https://es.wikipedia.org/
[9] https://it.wikipedia.org/
[10] https://fr.wikipedia.org/
[11] https://pt.wikipedia.org/
[12] https://ru.wikipedia.org/
[13] https://ja.wikipedia.org/
[14] https://phabricator.wikimedia.org/T140289
[15] https://phabricator.wikimedia.org/T140292
[16] https://www.mediawiki.org/wiki/Talk:TextCat
[17] https://commons.wikimedia.org/wiki/File:Existing-search_no-textcat.png
[18] https://commons.wikimedia.org/wiki/File:New-search_with-textcat.png
--
Deb Tankersley
Product Manager, Discovery
IRC: debt
Wikimedia Foundation
https://www.mediawiki.org/wiki/Scrum_of_scrums/2016-07-20
= 2016-07-20 =
== Product ==
=== Reading ===
==== Reading Web ====
* No update, working on language switcher on mobile web
==== iOS native app ====
* 5.0.5 heading to regression today, expected release to Apple store later
this week or early next week
* Development of 5.1 is in progress
* Planning of 5.2 is in progress
==== Android native app ====
* Feed is released in beta! (Follow-up bugfix beta release is cooking as
we speak.)
* We are starting work on the navigation overhaul.
* Heads-up to RelEng: we are going to talk this week about whether we have
bandwidth this Q to transition to Differential code reviews.
==== Mobile Content Service ====
* First public feed endpoints are deployed: aggregated + smart random
==== Reading Infrastructure ====
=== Community Tech ===
* Patch for numeric sorting is ready for review (
https://gerrit.wikimedia.org/r/#/c/299108/)
** Will be rolling out on test wiki first. Need another test wiki before
English Wikipedia (preferably already using UCA collation)
* Fixed security bug in Pageviews Analysis
* Architecture Committee RFC meeting about Cross-wiki watchlist back-end
today 2pm
=== Editing ===
==== Collaboration ====
* Blocked - None
* Blocking - Krinkle would like us to stop using buildCssLinks to pave the
way for a refactoring. Otherwise, no change.
* Updates
** Turned off Echo transition flags, now that the maintenance scripts are
done. This should improve performance and avoid unexpected side effects.
** Echo features (such as animation when notifications move in list) and
bug fixes.
** Flow security fixes merged to master; they were already on the cluster.
==== Parsing ====
* In collaboration with Services (Marko) & Ops (Giuseppe), we transitioned
Parsoid to be based on service-runner. Parsoid deploys will resume tomorrow
/ Monday.
* Tim working on addressing HHVM segfault in preprocessor which was
reported by Giuseppe in a security bug.
* Scott & Tim working on a PHP only Tidy replacement which is close to
being done.
* OCG (Offline Content Generator) outage this week due to unrelated proxy
misconfiguration: T140789
==== VisualEditor ====
* Blocked: None.
* Blocking: None known.
* Update: Quiet week. Mostly working on bugs and the new wikitext editor.
CustomData extension dependency removed from all three remaining Wikivoyage
extensions that used it in master; will be able to de-deploy it in the next
few weeks.
=== Discovery ===
* '''Blocking''': none
* '''Blocked''': none
* logstash.wikimedia.org upgraded to latest Kibana version
* TextCat A/B test results are in:
https://commons.wikimedia.org/wiki/File:Report_on_Cirrus_Search_TextCat_AB_…
** TLDR: Success
* TextCat demo has new design: https://tools.wmflabs.org/textcatdemo/
* GeoSearch launched:
https://www.mediawiki.org/wiki/Help:CirrusSearch#Geo_Search
=== Interactive ===
* Launched maps on meta, cawiki, hewiki, mkwiki
* This Friday (July 22) in Seattle - data visualization hackathon
https://www.mediawiki.org/wiki/DataViz_Seattle_hackathon
* This Weekend - the whole team is in Seattle for State of the Map US
conference
== Technology ==
=== Analytics ===
* issues with eventbus deployment and new schemas, service was rejecting
events, had to be restarted.
* reconstructing edit history from mw database, pretty sure it is possible
but will know better after this week
* scaling pageview API, our new cluster has issues with being able to load
data and compact (we needed to change compaction from old scheme)
=== Research ===
* Memory issues on scb1001/1002 related to ORES have been partially
addressed.
** Lower number of uwsgi processes
** Periodic restart of celery workers to address memory leak
https://phabricator.wikimedia.org/T140020
** Explore changing model from Random Forest to Gradient Boosting
https://phabricator.wikimedia.org/T139963
* We'll be seeking dedicated hardware. (It would be great if anyone in ops
could reach out to us to help with that process.)
=== Services ===
* Feed endpoints deployed, but need to revisit using `/feed/featured` as it
takes a long time for MCS to compute it
** https://en.wikipedia.org/api/rest_v1/?doc#!/Feed/get_feed_featured_yyyy_mm_…
* Parsoid move to service-runner and service::node completed
* service-template-node v0.4.0 is out - please update soon.
** security issue addressed
** new feature - automatic metrics collection
* Marko out next week
=== Security ===
* Verifying T140366
* Reviewing 296699
* Drafting/editing security team job descriptions
* Request security reviews:
https://www.mediawiki.org/wiki/Wikimedia_Security_Team/Security_reviews
* MediaWiki 1.27.1 security release planned for early August
=== RelEng ===
* Blocking
** Android to differential
* Blocked
** None
* Updates
** Zuul upgraded this week, should address a bunch of issues
** New SWAT deploy process going ok, reminder to install
https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug if you're putting
things up for SWAT
=== Fundraising Tech ===
* Civi post-upgrade bugfixes
** more work on batch contact de-duplication
* CentralNotice deployed (last week), watching closely for glitches
** No longer serving CN modules on special pages and action=edit (
https://phabricator.wikimedia.org/T139439)
* Upgraded payments to MW 1.27 (LTS!)
** Still hoping to get closer to master, but this buys us a lot of time
* Killed ancient homegrown form template engine (-9,000 loc !)
* Experimenting with scrutinizer-ci
* Pivoting to ActiveMQ replacement work
* Building out new servers
* No blockers
=== TechOps ===
* '''Blocking'''
** None
* '''Blocked'''
** https://phabricator.wikimedia.org/T135483 - HHVM crashes - raised to
UBN! after issue recurrence. Currently no one owns the ticket.
looks like there's already a patch at
https://gerrit.wikimedia.org/r/299710 -- [cscott, for parsing (and tim
starling, who wrote the patch)]
* Updates:
** Insecure (non-HTTPS) POST traffic blocked completely as of yesterday,
may see reports of broken bots/tools -
https://phabricator.wikimedia.org/T105794
== Wikidata ==
* No blockers.
* Back into regular 2 weeks Scrum sprint. Connecting loose ends to get
stuff done.
* Reworking jQuery based UI code (minimizing the code base).
* Still working on structured data for Commons.
Hello folks,
service-template-node v0.4.0 has just been released~[1]. The new version
represents an important security and feature upgrade from v0.3.2 and you
are urged to update as soon as possible~[2].
On the feature side, this release brings out-of-the-box support for sending
metrics for all requests made against a service, which means that after
upgrading you will be able to set up your own grafana dashboard with
relevant metrics~[3] very easily.
Security-wise, there were some possible RegEx exploits in one of the node
module dependencies. This has been mitigated by updating the relevant
modules to a version that does not have the deficiency. Additionally, from
now on a node-module-security scan is being run every time the service is
tested to ensure our infrastructure is kept safe.
Please update as soon as possible if you have a service based on the
service template running in WMF production. And, as always, should you have
any questions or concerns, feel free to reach out to me.
Cheers,
Marko
[1] https://github.com/wikimedia/service-template-node/tree/v0.4.0
[2] you can follow the guide on
https://www.mediawiki.org/wiki/ServiceTemplateNode/Updating
[3] dashboards a la
https://grafana.wikimedia.org/dashboard/db/restbase?panelId=16&fullscreen
--
Marko Obrovac, PhD
Senior Services Engineer
Wikimedia Foundation
If you have a client that connects to RCStream
<https://wikitech.wikimedia.org/wiki/RCStream>, please take a moment to
make sure that you are using a secure connection. Are you connecting to
'http://stream.wikimedia.org/rc' or '//stream.wikimedia.org/rc'? If so, the
only thing you need to change is the URL scheme, replacing any http: and
protocol-relative URLs with https: (that is,
'https://stream.wikimedia.org/rc').
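If your client builds the stream URL from configuration, the change described above can be sketched as a small helper. This is a minimal illustration in Python, not part of any RCStream client library; the function name to_https is hypothetical.

```python
def to_https(url: str) -> str:
    """Rewrite http: and protocol-relative URLs to use the https: scheme.

    URLs that already use https: (or any other scheme) are returned unchanged.
    """
    if url.startswith("http://"):
        return "https://" + url[len("http://"):]
    if url.startswith("//"):
        # Protocol-relative URL: prepend the scheme.
        return "https:" + url
    return url


print(to_https("http://stream.wikimedia.org/rc"))
print(to_https("//stream.wikimedia.org/rc"))
```

Both calls print 'https://stream.wikimedia.org/rc', which is the URL to pass to your client.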
Some of the mobile apps teams have been wanting some way to store data "in
the cloud", so they can sync things like app preferences and reading lists
between devices. They asked us (Reading Infrastructure) to help draft an
RFC, which is now posted in https://phabricator.wikimedia.org/T128602.
Please comment there; there are several open questions. Thanks!
--
Brad Jorsch (Anomie)
Senior Software Engineer
Wikimedia Foundation
Kibana, the software running logstash.wikimedia.org, as well as
elasticsearch, the server behind the scenes storing all the logs, have both
been upgraded today to their latest versions. Several historical indices
could not be directly loaded into the new servers and instead have been
dumped and are being reloaded. This reloading process may take some time,
so please be patient. The data will be there sooner or later :)
Play around with them, and create tickets in the wikimedia-logstash project
for any issues you find.
Hey,
This is the (12 + 1)th [1] weekly update from the revision scoring team
that we have sent to this mailing list.
New developments:
- The ORES review tool is deployed as a beta feature on Turkish Wikipedia.
Six wikis now have this tool. [2]
- The CopyPatrol tool will soon show ORES scores if they pass a certain
threshold. [3]
- We are talking about integration with Detox, feel free to chime in. [4]
Maintenance and robustness:
- Currently we are dealing with increasing memory pressure on scb nodes.
Actions we took to reduce this pressure:
- We migrated most of the RandomForest models to GradientBoosting, which
will greatly reduce memory pressure without noticeably affecting the
accuracy of the models. [5]
- There seems to be a memory leak issue with celery. We worked around it by
setting a periodic restart of workers. [6]
- We reduced the maximum number of precaching requests in order to prevent
spikes that might cause memory pressure on other services. [7]
- We reduced the number of web processes to 2/3. It is still fine. [8]
- We finished up the refactor and it will soon go to the production
cluster. [9]
- The damaging and goodfaith models had an issue on Dutch Wikipedia, which
has been fixed. [10]
- Our metrics collector now sends timed requests. We will have a
dashboard in grafana for that soon. [11]
- There will be a link to the "ORES review tool" page [12] in the legend of
RecentChanges and Watchlist. [13]
- When revscoring fails for any unknown reason, ORES now returns a proper
message. [14]
- We fixed a puppet issue that caused trouble while creating new web
instances. [15]
1. It's 13. We are not superstitious, just kidding ;)
2. https://phabricator.wikimedia.org/T139992
3. https://phabricator.wikimedia.org/T139009
4. https://phabricator.wikimedia.org/T139007
5. https://phabricator.wikimedia.org/T139963
6. https://phabricator.wikimedia.org/T140020
7. https://gerrit.wikimedia.org/r/299559
8. https://gerrit.wikimedia.org/r/298739
9. https://github.com/wiki-ai/ores/pull/155
10. https://phabricator.wikimedia.org/T140038
11. https://phabricator.wikimedia.org/T137442
12. https://www.mediawiki.org/wiki/ORES_review_tool
13. https://phabricator.wikimedia.org/T140361
14. https://phabricator.wikimedia.org/T140301
15. https://phabricator.wikimedia.org/T140265
Sincerely,
Amir from the Revision Scoring team