Hey,
We are fixing a rather urgent bug [1] introduced in the refactor, so we will
have an unscheduled deployment today. We also need to clear the ORES redis
cache, which means ORES will be a little slower than usual for a while. Since
the ORES review tool has its own caching and this bug didn't affect that tool,
users of the extension won't be affected.
1. https://phabricator.wikimedia.org/T142857
Sorry for any inconvenience.
Best
Hey folks,
This is the 16th weekly update from the Revision Scoring team that we have
sent to this mailing list.
New developments:
- We created dashboards for the ORES service in the Beta cluster[1] and
created panes for tracking failed jobs[2].
- We extended the documentation for the ORES review tool[3,4]
Maintenance:
- We did some work to make the Beta cluster look more like production so
that we can do better testing before the next deployment
- We set up a password on the Beta redis server[5]
- We configured the Beta ORES extension to actually use the Beta ORES
service[6]
- We also prepared a set of puppet changes for the deployment of a
refactored version of ORES to production[7]
Issues in WMFLabs:
- We investigated a series of timeout errors that were appearing in the
logs[8]
- We investigated a periodic redis-related error that showed up when
scoring edits[9]
- We fixed our "05" web node that was periodically running out of
memory[10]
Estimating future resource needs:
- In preparation for buying new hardware, we measured our past memory
usage and extrapolated forward two years to estimate what hardware
requirements we'll have[11]
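
As a rough illustration of that kind of estimate, here is a minimal sketch that fits a straight line to past memory measurements and projects it 24 months ahead. The sample numbers, the function names, and the choice of a linear trend are all assumptions made for illustration; the actual analysis is in the linked task and may have used a different method.

```python
from typing import List, Tuple


def fit_line(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolate(points: List[Tuple[float, float]], months_ahead: float) -> float:
    """Project the fitted line months_ahead past the last observed month."""
    slope, intercept = fit_line(points)
    last_month = max(x for x, _ in points)
    return slope * (last_month + months_ahead) + intercept


# Hypothetical samples: (month index, memory per worker in MB).
# These are made-up numbers, not real ORES measurements.
samples = [(0, 500.0), (1, 520.0), (2, 540.0), (3, 560.0)]
projected = extrapolate(samples, months_ahead=24)
print(f"Projected memory per worker in two years: {projected:.0f} MB")
```

A straight line is the simplest possible model here; if usage grows in steps (for example, each time new models are deployed), a linear projection will under- or over-estimate, so it is worth sanity-checking against the raw data.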
1. https://phabricator.wikimedia.org/T142294 - Dashboard or pane for
ORES service in beta
2. https://phabricator.wikimedia.org/T142119 - Dashboard or pane for
ORES failed jobs on beta
3. https://phabricator.wikimedia.org/T140150 - Make user-centered
documentation for review tool
4. https://www.mediawiki.org/wiki/ORES_review_tool
5. https://phabricator.wikimedia.org/T141823 - Set up password on ORES
Beta redis server
6. https://phabricator.wikimedia.org/T141825 - Config beta ORES
extension to use the beta ORES service
7. https://phabricator.wikimedia.org/T141575 - Puppet config changes for
ORES refactor
8. https://phabricator.wikimedia.org/T141368 - [Investigate] ORES time
out errors in logs
9. https://phabricator.wikimedia.org/T141946 - [Investigate] Periodic
redis related errors in wmflabs
10. https://phabricator.wikimedia.org/T141523 - [Investigate] web-05
downtime
11. https://phabricator.wikimedia.org/T142046 - Extrapolate memory usage
per worker forward 2 years
Sincerely,
Aaron from the Revision Scoring team
This means ORES in Labs will be down for a period of time, since our redis
server will be restarted.
Best
---------- Forwarded message ---------
From: Andrew Bogott <abogott(a)wikimedia.org>
Date: Sat, Aug 6, 2016 at 8:58 PM
Subject: [Labs-l] [Labs-announce] Some Labs instances rebooting TODAY,
19:00 UTC
To: <labs-announce(a)lists.wikimedia.org>
Hello!
We have discovered a surprisingly terrible bug in the kernel
that's running on two of the nova-compute hosts. To remedy this, we
will be downgrading and rebooting both hosts in a few hours, at high
noon San Francisco time, 19:00 UTC.
We will shuffle things around so that there is no impact on Tools.
VMs on these hosts only (listed below) will experience a single reboot
and accompanying downtime.
Sorry for the short notice... it's worth it, believe me. We'll
work on getting a post-mortem incident report written, but that may not
happen until Monday. In the meantime, here is a complete list of
instances that will be affected:
| 330f16d6-374d-44a8-bf96-53e763d5dd3a | captcha-apiproxy-02              | privpol-captcha |
| 00422b7c-9711-4a92-bba4-73db0efb5889 | deployment-db01                  | deployment-prep |
| 1597e37a-99e6-431e-8117-3901e4ac9858 | encoding02                       | video           |
| 01518800-406c-4689-a289-9e3e33fd387b | kafka501                         | analytics       |
| 6a5289e9-1b69-48a3-a6fd-5de63c2ee285 | labstore-test-05                 | testlabs        |
| 37fc9d21-6614-4442-9b45-a7ae349b6d09 | secgroup-server-labvirt1012      | testlabs        |
| d35fedde-9f0f-42b2-a22d-780cb1477a17 | spice-test-102                   | admin           |
| 21706518-ccd4-45b1-9853-37f463d02393 | striker-uwsgi01                  | striker         |
| c4bc63f8-cbd7-4384-b349-54b115e91a5c | util-abogott                     | testlabs        |
| 3d3288cd-3523-449c-bf48-1bd48cac3e5d | captcha-proxypostgres-01         | privpol-captcha |
| 7ebb2447-7c74-4fd1-aa5e-19c48011e39d | deployment-depurate01            | deployment-prep |
| 9336405f-cf57-4234-9c73-0a539d97580e | deployment-kafka05               | deployment-prep |
| 60ece1f0-327b-49b2-aa1e-a3df241f70fb | encoding03                       | video           |
| 8b90e688-aba7-4b6c-9b6b-bbb13ec84da5 | gerrit-test3                     | git             |
| 833bb2a0-a442-445e-8210-ac77e950ead7 | integration-slave-jessie-1003    | integration     |
| a64e731e-c987-4b84-85b9-cf450275d26d | integration-slave-jessie-android | integration     |
| a18beafb-2279-49fa-8b88-049b3f55a2f5 | kafka601                         | analytics       |
| fe12b525-5a64-48ff-a3a3-e69c65043e26 | labstore-test-01                 | testlabs        |
| 9b45a5c1-c6a5-4a59-8305-9d4542a4c27a | labstore-test-02                 | testlabs        |
| f5bcb8c1-51b2-4820-bda3-a7a3155b0a6f | mwdiffstuff                      | catgraph        |
| 59664505-be72-4294-b639-b1ea2218a44b | ores-redis-02                    | ores            |
| d5b434ce-2cae-4a56-9b95-4459913516e9 | pole                             | wikidata-query  |
| fdc47372-b60c-4c61-ab27-9231a5fd2d4e | secgroup-server-labvirt1013      | testlabs        |
| c8e73c70-9d13-480f-a6a9-01c46328a83d | striker-build                    | striker         |
| 55e6d403-c916-44f9-881f-23dec8968111 | striker-deploy03                 | striker         |
| d54ea2da-f3d5-4d8b-aa47-41837accb285 | utrs-secondary                   | utrs            |
-Andrew
Hey,
This is the 15th weekly update from the Revision Scoring team that we have
sent to this mailing list.
*New developments:*
- We'll no longer unnecessarily load the models into memory on the web
workers[2].
- We can now score multiple models against the same revision ID for
(essentially) free[1].
- Our precaching system will take advantage of this to drop load by
about 3X[3].
- We updated the wmflabs deploy repo for the new version of ORES[4].
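
To illustrate why scoring a second model against the same revision can be nearly free, here is a minimal sketch of sharing cached dependencies between models. Everything in it (the `DependencyCache` class, the feature names, the toy extractors and predictors) is hypothetical and is not the ORES or revscoring API; only the model names "damaging" and "goodfaith" come from our actual models.

```python
from typing import Callable, Dict, List, Tuple


class DependencyCache:
    """Computes each named dependency at most once for a single revision."""

    def __init__(self, extractors: Dict[str, Callable[[int], float]]):
        self.extractors = extractors
        self.values: Dict[str, float] = {}
        self.extractions = 0  # how many times an extractor actually ran

    def get(self, name: str, rev_id: int) -> float:
        if name not in self.values:
            self.values[name] = self.extractors[name](rev_id)
            self.extractions += 1
        return self.values[name]


# A model here is (list of feature names, prediction function).
Model = Tuple[List[str], Callable[[List[float]], bool]]


def score_revision(rev_id: int, models: Dict[str, Model],
                   extractors: Dict[str, Callable[[int], float]]):
    """Score every model against one revision, sharing extracted features."""
    cache = DependencyCache(extractors)  # one cache per revision
    scores = {}
    for model_name, (features, predict) in models.items():
        inputs = [cache.get(feature, rev_id) for feature in features]
        scores[model_name] = predict(inputs)
    return scores, cache.extractions


# Toy stand-ins for expensive feature extraction (hypothetical).
extractors = {
    "chars_added": lambda rev_id: float(rev_id % 100),
    "badwords": lambda rev_id: float(rev_id % 7),
}
models = {
    "damaging": (["chars_added", "badwords"], lambda xs: sum(xs) > 50),
    "goodfaith": (["chars_added", "badwords"], lambda xs: xs[1] < 5),
}
scores, extractions = score_revision(123456, models, extractors)
print(scores, "extractions:", extractions)
```

In this sketch both models need the same two features, so only two extractions run instead of four. Grouping precaching requests so that all models for a revision are scored together lets the precacher benefit from the same sharing.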
*Documentation & maintenance:*
- We completed deployment and maintenance docs for Wiki labels[5], which
means we've now got complete docs for our systems[6].
- We implemented basic continuous integration tests for the ORES
extension[7].
*Downtime:*
- We had a one-hour downtime while trying to deploy new code to
ores.wikimedia.org[8]. We've filed two critical tasks to make sure
we don't make the same mistake again[9,10].
1. https://phabricator.wikimedia.org/T134606 - Score multiple models with
the same cached dependencies
2. https://phabricator.wikimedia.org/T139407 - Don't load models into
memory of web workers
3. https://phabricator.wikimedia.org/T141376 - Update precached to group
requests by model
4. https://phabricator.wikimedia.org/T141377 - Update wmflabs deploy repo
for new version of ORES
5. https://phabricator.wikimedia.org/T131768 - Wikilabels deployment docs
6. https://phabricator.wikimedia.org/T106271 - Document maintenance tasks
7. https://phabricator.wikimedia.org/T140455 - CI test for ORES extension
8. https://wikitech.wikimedia.org/wiki/Incident_documentation/20160801-ORES
9. https://phabricator.wikimedia.org/T141823 - Set up password on ORES Beta
redis server
10. https://phabricator.wikimedia.org/T141825 - Config beta ORES extension
to use the beta ORES service
Sincerely,
Aaron from the Revision Scoring team
Hey,
This is the (12 + 1)th [1] weekly update from the Revision Scoring team that
we have sent to this mailing list.
New developments:
- The ORES review tool is deployed as a beta feature on Turkish Wikipedia.
Six wikis now have this tool. [2]
- The CopyPatrol tool will soon show ORES scores if they pass a certain
threshold. [3]
- We are talking about integration with Detox. Feel free to chime in. [4]
Maintenance and robustness:
- We are currently dealing with increasing memory pressure on the scb nodes.
Actions we took to reduce this pressure:
  - We migrated most of the RandomForest models to GradientBoosting, which
  will greatly reduce memory pressure without noticeably affecting the
  accuracy of the models. [5]
  - There seems to be a memory leak in celery. We worked around it by
  setting up a periodic restart of the workers. [6]
  - We reduced the maximum number of precaching requests in order to prevent
  spikes that might cause memory pressure on other services. [7]
  - We reduced the number of web processes to 2/3. It is still fine. [8]
- We finished the refactor and it will soon go to the production
cluster. [9]
- The damaging and goodfaith models had an issue on Dutch Wikipedia, which
is now fixed. [10]
- Our metrics collector now sends timed requests. We will have a
dashboard in grafana for that soon. [11]
- There will be a link to the "ORES review tool" page [12] in the legend of
RecentChanges and Watchlist. [13]
- When revscoring fails for an unknown reason, ORES now returns a proper
message. [14]
- We fixed a puppet issue that caused trouble while creating new web
instances. [15]
1. It's 13. We are not superstitious, just kidding ;)
2. https://phabricator.wikimedia.org/T139992
3. https://phabricator.wikimedia.org/T139009
4. https://phabricator.wikimedia.org/T139007
5. https://phabricator.wikimedia.org/T139963
6. https://phabricator.wikimedia.org/T140020
7. https://gerrit.wikimedia.org/r/299559
8. https://gerrit.wikimedia.org/r/298739
9. https://github.com/wiki-ai/ores/pull/155
10. https://phabricator.wikimedia.org/T140038
11. https://phabricator.wikimedia.org/T137442
12. https://www.mediawiki.org/wiki/ORES_review_tool
13. https://phabricator.wikimedia.org/T140361
14. https://phabricator.wikimedia.org/T140301
15. https://phabricator.wikimedia.org/T140265
Sincerely,
Amir from the Revision Scoring team
Hey,
This is the 11th weekly update from the Revision Scoring team that we have
sent to this mailing list.
*New developments:*
- The ORES review tool is enabled as a beta feature on Dutch Wikipedia. More
wikis will follow later this week [1].
- We have a basic edit quality model for Czech Wikipedia ready and merged.
It will be deployed this week [2].
- We also have basic models for English Wiktionary. This is the
second non-Wikipedia project we support, after Wikidata [3].
- Thanks to Tar Lócesilion, the Polish edit quality campaign is
completed. We are working on building damaging and goodfaith models at the
moment [4].
*Maintenance and robustness:*
- We decreased our web capacity in order to reduce memory pressure on
scb nodes. You should not get any overload error since our capacity is
still very high but if you do, please contact us immediately and we will
bring it back up [5].
- We improved the documentation on the ores.wikimedia.org page a little.
It will be deployed this week [6].
We are working on a rather big refactor of ORES, which will give us a
performance boost when scoring multiple models at the same time [7] and
reduce memory usage [8]. Feel free to chime in and give us feedback [9].
1. https://phabricator.wikimedia.org/T139432
2. https://phabricator.wikimedia.org/T138885
3. https://phabricator.wikimedia.org/T138630
4. https://phabricator.wikimedia.org/T130269
5. https://phabricator.wikimedia.org/T139177
6. https://phabricator.wikimedia.org/T138089
7. https://phabricator.wikimedia.org/T134606
8. https://phabricator.wikimedia.org/T139407
9. https://phabricator.wikimedia.org/T139408
Sincerely,
Amir from the Revision Scoring team.