Thanks for the summary. :)
Pine ( https://meta.wikimedia.org/wiki/User:Pine )
On Fri, Aug 10, 2018 at 1:43 AM Krinkle krinklemail@gmail.com wrote:
How did we do in striving for operational excellence since last month? Read on to find out!
## The month in numbers
- 2 documented incidents since July 19. [1]
- 55 Wikimedia-log-errors tasks closed after July 19. [2]
- 31 Wikimedia-log-errors tasks created after July 19. [3]
Logstash (type=mediawiki, last 7 days):
- 2,048 fatals. (channel=fatal)
- 117,372 exceptions. (channel=exception)
- 21,043 PHP errors. (channel=error)
- 6,368,647 total error-level events. (channel=*, level=ERROR)
## Highlights
### New database partition
@Josve05a reported that Special:Log was timing out on commons.wikimedia.org for certain queries. Database administrator @Marostegui investigated the underlying query and found that this was caused by one of the backend database servers having an unpartitioned 'logging' table. Manuel took the server out of rotation for re-partitioning, which was completed later that day.
– https://phabricator.wikimedia.org/T199790
### Disappearing audio players, mystery solved
When Étienne Beaulé (@Ebe123) found PHP-Notice errors in the Score extension, they immediately investigated. It began as a fix for a typo that caused inefficient (but working) parsing of audio data. Upon closer inspection, a bigger story was uncovered. The computation of audio lengths was being skipped due to a mismatch in MIME-types between Score and TimedMediaHandler. The player needs this length, and as a result, browsers had to download and parse the audio data entirely client-side, creating a delay of 5-20 seconds or more.
Four months earlier, Andre reported that pressing play on an audio player made the player disappear for a long time. It all makes sense now.
– https://phabricator.wikimedia.org/T192550 / https://phabricator.wikimedia.org/T200835
### Packet loss
After noticing that exception IDs from error pages were not found in Logstash, Tim Starling started an investigation. He created a new Grafana dashboard and the culprit was quickly identified. Over 3000 packets were being dropped, every second. That's over 90% of server logs – missing!
14 deployments, 9 SAL entries, and 6 days later, we finally reached 0% packet loss.
Many thanks to Filippo Giunchedi, @BBlack, @herron, and @Gehel, who got to the bottom of this.
Our weekly error numbers increased 100X since last month, and... that's a good thing!
– https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&from=1530097... / https://phabricator.wikimedia.org/T200960
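For a sense of scale, here is a rough sketch of how a drop rate translates into a loss percentage. The function name and the delivered rate of 300 packets per second are illustrative assumptions, not figures taken from the dashboard; only the 3000 dropped packets per second comes from the text above.

```python
def packet_loss_ratio(dropped_per_sec: float, delivered_per_sec: float) -> float:
    """Return the fraction of sent packets that never arrived."""
    sent = dropped_per_sec + delivered_per_sec
    if sent == 0:
        # Nothing was sent, so nothing can have been lost.
        return 0.0
    return dropped_per_sec / sent


# 3000 dropped per second (from the report) against a hypothetical
# 300 delivered per second (an assumption for illustration only).
ratio = packet_loss_ratio(3000, 300)
print(f"{ratio:.1%}")  # 90.9%
```

With any delivered rate below roughly 333 packets per second, 3000 drops per second works out to more than 90% loss, which matches the order of magnitude described above.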
### Vips or no Vips
We use the VipsScaler extension to create thumbnails of large TIFF and PNG files in some cases. Test requests for it failed with "10.2.1.21 port 80: Connection refused". The error was puzzling because the IP does not belong to MediaWiki or an image-scaling service. Rather, it belongs to Proton, a Chromium PDF service.
An investigation by @MoritzMuehlenhoff, @Reedy, and others revealed that the service IP used by Proton since June 2018 previously belonged to the mediawiki-imagescaler pool (dissolved in April 2018). The configuration for VipsScaler was outdated and had stopped working in April. The issue went unnoticed until the IP address started working again, now with an unrelated service behind it producing errors.
– https://phabricator.wikimedia.org/T199937 / https://phabricator.wikimedia.org/T199938
## Higher impact
These cause users (of the web interface or the API) to see errors.
New:
- [ProofreadPage extension] https://phabricator.wikimedia.org/T201506 - MWContentSerializationException: The serialization is an invalid JSON array.
- [Flow extension] https://phabricator.wikimedia.org/T201654 - InvalidArgumentException "The Title object yields no ID" from Flow\LinksTableUpdater.
- [MediaWiki-Logging] https://phabricator.wikimedia.org/T201411 - Date input on Special:Log can cause fatal error.
Carried over:
- [Page deletion] https://phabricator.wikimedia.org/T195692 - Undelete for certain pages aborted by IncompleteRevisionException.
- [AbuseFilter extension] https://phabricator.wikimedia.org/T187153 - Special:Abuselog throws BadMethodCallException on details/examine.
- [Flow extension] https://phabricator.wikimedia.org/T70526 - InvalidDataException "Flow workflow is for different page".
- [MobileFrontend] https://phabricator.wikimedia.org/T199066 - Special:MobileContributions shows "Special:Badtitle" (Revision::ensureTitle error).
## Noise
These are caused by code behaving unexpectedly, but with limited impact due to graceful recovery by PHP, or other handling. These harm our ability to detect and prevent higher impact issues (through Scap and Fatal-Monitor), and may be masking other issues.
New:
- [FileImporter extension] https://phabricator.wikimedia.org/T200837 - PHP Notice: Undefined index from WikiTextContentCleaner.php.
- [PagedTiffHandler] https://phabricator.wikimedia.org/T200839 - PHP Notice: Undefined index from PagedTiffHandler_body.php.
Carried over: None!
All of last month's noise mentions were fixed! 🎉
## Thank you
Thank you to everyone for helping investigate/resolve #Wikimedia-log-errors.
Including:
- Jdforrester-WMF (James D. Forrester)
- matmarex (Bartosz Dziewoński)
- Marostegui (Manuel Aróstegui)
- zeljkofilipin (Željko Filipin)
- Ebe123 (Étienne Beaulé)
- jcrespo (Jaime Crespo)
- dcausse (David Causse)
- Jdlrobson (Jon Robson)
- Addshore (Adam_WMDE)
- EBjune (Erika Bjune)
- Anomie (Brad Jorsch)
- Aaron (Aaron Schulz)
- Reedy (Sam Reed)
Thanks!
Until next time,
-- Timo Tijhof
[1] https://wikitech.wikimedia.org/w/index.php?title=Category:Incident_documenta...
[2] https://phabricator.wikimedia.org/maniphest/query/h1j5IXlqAUPJ/#R
[3] https://phabricator.wikimedia.org/maniphest/query/MtotJEtlSU5_/#R
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l