I added a link to http://tinyurl.com/n3twd8k to the channel topic of #wikimedia-tech & #wikimedia-operations. It points to a live graph of MediaWiki's error rate over the last 24 hours. I hope to automate monitoring of this data sometime soon, but in the meantime let's keep an eye on it collectively, especially right after deployments.
--- Ori Livneh ori@wikimedia.org
On 6/19/13, Ori Livneh ori@wikimedia.org wrote:
> I added a link to http://tinyurl.com/n3twd8k to the channel topic of #wikimedia-tech & #wikimedia-operations. It points to a live graph of MediaWiki's error rate over the last 24 hours. I hope to automate monitoring of this data sometime soon, but in the meantime let's keep an eye on it collectively, especially right after deployments.
> Ori Livneh ori@wikimedia.org
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Cool.
Is there any *public* list of which exceptions/errors they are? Seeing how many isn't all that helpful unless we know which ones. (Yeah yeah, I know, there are concerns about data leakage with backtraces, but just the exception names w/o backtrace should be safe (?))
--bawolff
On Wed, Jun 19, 2013 at 1:36 PM, Brian Wolff bawolff@gmail.com wrote:
> Is there any *public* list of which exceptions/errors they are? Seeing how many isn't all that helpful unless we know which ones. (Yeah yeah, I know, there are concerns about data leakage with backtraces, but just the exception names w/o backtrace should be safe (?))
Maybe, e.g. the current one I see if I tail -f fluorine:/a/mw-logs/fatals.log is:
[20-Jun-2013 18:54:45] Fatal error: Call to a member function getCode() on a non-object at /usr/local/apache/common-local/php-1.22wmf8/includes/GlobalFunctions.php on line 1288
Seems OK to display, but meanwhile in exceptions.log:
2013-06-20 18:30:45 mw1076 bswiki: [6d110124] /wiki/[redacted] Exception from line 3303 of /usr/local/apache/common-local/php-1.22wmf7/includes/User.php: User::addToDatabase: hit a key conflict attempting to insert user '[redacted], but it was not present in select!
So the exception/error alone can reveal stuff. And I guess it could hint at an exploit (I hope neither of those do :-/ ).
If there's a problem on a WMF site, unless it's reproducible on a stock test wiki, I think it'll need someone with access to the fluorine logs machine. For those that have access, https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Test_and_monitor_your... and https://wikitech.wikimedia.org/wiki/Logs have advice about monitoring logs and graphs.
-- =S Page engineer on Editor Engagement Experiments
On Thu, Jun 20, 2013 at 12:06 PM, S Page spage@wikimedia.org wrote:
> On Wed, Jun 19, 2013 at 1:36 PM, Brian Wolff bawolff@gmail.com wrote:
>> just the exception names w/o backtrace should be safe (?))
> [20-Jun-2013 18:54:45] Fatal error: Call to a member function getCode() on a non-object at /usr/local/apache/common-local/php-1.22wmf8/includes/GlobalFunctions.php on line 1288
And that error alone is useless without the stack trace identifying the caller. (BTW it's bug 49897.)
>> Is there any *public* list of which exceptions/errors they are?
Well, the Ganglia graphs distinguish different types of errors (out-of-memory fatals, time limit fatals, miscellaneous fatals, exceptions, catchable fatals, and query errors). At present there is nothing that is more granular than that, private or public. The error log we consult is an undifferentiated stream of text.
However, it is an area of our code that could easily welcome contributions from the community. Hashar enabled error logging for the beta cluster, so labs is now a viable development environment for a generic error-processing solution.
Relevant code exists in two locations:
https://git.wikimedia.org/blob/operations%2Fpuppet.git/9792c164d10f9f9f20922... (this is the script that is emitting stats to Ganglia)
and
https://git.wikimedia.org/tree/mediawiki%2Ftools%2Ffluoride.git (set of regexps to parse the data even further; not currently used anywhere.)
I've been working on this in my spare time, but I'd be happy to provide mentorship, code review & deployment for interested contributors. If someone competent (a category which explicitly includes you, Brian!) wants to take over and "own" this problem, that's cool with me too.
There's a lot we could do in this area. It should be possible to probabilistically trace an error to the commit(s) that introduced it.
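As a rough illustration of the kind of finer-grained breakdown discussed above, here is a minimal Python sketch that groups log lines by where the error was raised. The two regular expressions are modeled only on the two sample lines quoted earlier in this thread, so they are assumptions about the log format rather than a description of what fluoride or the Ganglia script actually does; the names `signature` and `count_signatures` are made up for this example. The message text is dropped on purpose, since it can contain user names. Whether file and line number alone are safe to publish is still a judgement call.

```python
import re
from collections import Counter

# Patterns modeled on the two sample lines quoted in this thread.
# Real logs very likely contain other shapes that these do not cover.
FATAL_RE = re.compile(
    r"^\[(?P<when>[^\]]+)\] Fatal error: (?P<message>.*) "
    r"at (?P<file>\S+) on line (?P<line>\d+)$"
)
EXCEPTION_RE = re.compile(
    r"^(?P<when>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) (?P<wiki>\S+): "
    r"\[(?P<id>[0-9a-f]+)\] (?P<url>\S+)\s+"
    r"Exception from line (?P<line>\d+) of (?P<file>\S+): (?P<message>.*)$"
)


def signature(log_line):
    """Return (kind, file, line) for a recognized log line, else None.

    The message is deliberately left out of the signature.
    """
    text = log_line.strip()
    for kind, pattern in (("fatal", FATAL_RE), ("exception", EXCEPTION_RE)):
        match = pattern.match(text)
        if match:
            return kind, match.group("file"), int(match.group("line"))
    return None


def count_signatures(lines):
    """Count occurrences of each signature; unrecognized lines share one bucket."""
    counts = Counter()
    for line in lines:
        sig = signature(line)
        counts[sig if sig is not None else ("unparsed", None, None)] += 1
    return counts


SAMPLE_FATAL = (
    "[20-Jun-2013 18:54:45] Fatal error: Call to a member function getCode() "
    "on a non-object at /usr/local/apache/common-local/php-1.22wmf8/includes/"
    "GlobalFunctions.php on line 1288"
)
SAMPLE_EXCEPTION = (
    "2013-06-20 18:30:45 mw1076 bswiki: [6d110124] /wiki/[redacted] "
    "Exception from line 3303 of /usr/local/apache/common-local/php-1.22wmf7/"
    "includes/User.php: User::addToDatabase: hit a key conflict attempting to "
    "insert user '[redacted], but it was not present in select!"
)

if __name__ == "__main__":
    lines = [SAMPLE_FATAL, SAMPLE_FATAL, SAMPLE_EXCEPTION, "something else"]
    for sig, n in count_signatures(lines).most_common():
        print(n, sig)
```

Counting by (kind, file, line) keeps the output small and stable enough to compare before and after a deployment, which is one possible starting point for tying a new error to a recent change.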
On 19/06/13 21:55, Ori Livneh wrote:
> I added a link to http://tinyurl.com/n3twd8k to the channel topic of #wikimedia-tech & #wikimedia-operations. It points to a live graph of MediaWiki's error rate over the last 24 hours. I hope to automate monitoring of this data sometime soon, but in the meantime let's keep an eye on it collectively, especially right after deployments.
If you can get ops to add a Nagios plugin that checks a Ganglia metric, we could get IRC spam whenever something is wrong :)
A fairly recent plugin:
https://github.com/MrMichaelWill/nagios-ganglia-plugin
There is even a python plugin in Ganglia/contrib (5 years old though):
https://github.com/ganglia/monitor-core/blob/master/contrib/check_ganglia.py
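For anyone curious what such a check boils down to, here is a minimal Python sketch of the threshold logic only. It is not taken from either plugin above. It assumes the metric value has already been fetched from Ganglia by some other means, and the metric label `mw_errors_per_min` and the thresholds are hypothetical. The only convention relied on is the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus a one-line status message on stdout.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_error_rate(value, warn, crit, label="mw_errors_per_min"):
    """Map a metric value to a (exit_code, status_line) pair.

    `label` is a hypothetical metric name used only in the message.
    `value` may be None when the metric could not be read.
    """
    if value is None:
        return UNKNOWN, f"UNKNOWN - no value for {label}"
    if value >= crit:
        return CRITICAL, f"CRITICAL - {label}={value} (>= {crit})"
    if value >= warn:
        return WARNING, f"WARNING - {label}={value} (>= {warn})"
    return OK, f"OK - {label}={value}"


def main(argv):
    """Hypothetical command line: VALUE WARN CRIT. Returns the exit code."""
    try:
        value, warn, crit = (float(arg) for arg in argv)
    except ValueError:
        # Wrong number of arguments or a non-numeric argument.
        print("UNKNOWN - usage: VALUE WARN CRIT")
        return UNKNOWN
    code, status = check_error_rate(value, warn, crit)
    print(status)
    return code

# A real plugin would finish with: sys.exit(main(sys.argv[1:]))
```

Nagios reads the exit code to decide the service state, so a bot relaying Nagios alerts to IRC would only speak up when the value crosses the warning or critical threshold.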