On 10/05/13 00:00, Antoine Musso wrote:
Hello,
Jenkins crashed again today. The first time was at 6am UTC, and I got it fixed. It happened again between 9pm and 10pm UTC.
This has been a recurring event since we upgraded our installation. The bug is: https://bugzilla.wikimedia.org/show_bug.cgi?id=48025
Tonight I enabled the Jenkins access log and made Zuul query Jenkins directly instead of going through SSL and an Apache frontend proxy. That will help a little bit.
The root cause is some weird issue in Jenkins where one of its threads uses 100% CPU. I have yet to determine what that thread is doing and what exactly triggers the issue. Whenever I get some useful information I will file a bug upstream and make sure it gets attention.
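One common way to see what a busy JVM thread is doing is to take the decimal thread id shown by `top -H -p <pid>` and look for it, in hexadecimal, in the `nid=0x...` field of a `jstack` thread dump. Below is a minimal sketch of that lookup, assuming the dump has been saved as text. The function name, the sample dump and the class names in it are made up for illustration; they are not taken from our Jenkins.

```python
import re


def find_thread_stack(dump_text, native_tid):
    """Return the thread-dump entry whose nid matches a decimal thread id.

    native_tid is the decimal id as shown by `top -H`; jstack prints the
    same id in lowercase hexadecimal as `nid=0x...` on the entry's first line.
    """
    wanted = "nid=" + hex(native_tid)
    # Thread entries in a jstack dump are separated by blank lines.
    for block in re.split(r"\n\s*\n", dump_text):
        block = block.strip()
        if not block:
            continue
        header = block.splitlines()[0]
        # Compare whole tokens so 0x1a2b does not also match 0x1a2b0.
        if wanted in header.split():
            return block
    return None


# Hypothetical two-thread dump, only here to exercise the function.
SAMPLE_DUMP = '''"Handling GET /job/foo" daemon prio=10 tid=0x00007f0001 nid=0x1a2b runnable [0x00007f00aa]
   java.lang.Thread.State: RUNNABLE
    at example.Hypothetical.parse(Hypothetical.java:42)

"Finalizer" daemon prio=8 tid=0x00007f0002 nid=0x1a2c in Object.wait() [0x00007f00bb]
   java.lang.Thread.State: WAITING (on object monitor)
'''

if __name__ == "__main__":
    # 6699 decimal is 0x1a2b, i.e. the first thread in the sample.
    print(find_thread_stack(SAMPLE_DUMP, 6699))
```

Taking a few dumps some seconds apart and comparing the stack of the same thread usually shows whether it is stuck in one place or looping.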
So I went to bed, and in the morning Jenkins was unsurprisingly stuck again. While I enjoyed coffee and a croissant, my morning newspapers were replaced by obscure web browser windows titled: "how to read a java heap dump", "help reading a 2GB heap dump" (trivia: you need a ton of memory), "java stack trace", "google: enable java debugging symbols", "Garbage Collection in the Java HotSpot Virtual Machine" [1]
All of that while breaking the #1 WMF rule: "do not work in pyjamas".
I found out the Java Heap memory was full.
I also took time to look at a Jenkins notice warning about some mysterious old data format. After some reading, it turns out these are XML elements in the build history files which point to entry points that no longer exist in Jenkins. That can happen when a plugin is removed.
When Jenkins parses the build history, it records an in-memory entry for each occurrence. With the thousands of builds we keep, that turns into a memory killer.
Jenkins offers the possibility to clean the now-invalid elements for us, but it is terribly slow. I thus resurrected my sed skills and altered the XML files directly. That ran from 12:25 UTC till 17:19 UTC.
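For illustration, the same kind of cleanup can be sketched with Python's `xml.etree.ElementTree` instead of sed. This is not the command that was actually run. The element name `hypothetical.removed.PluginAction` and the sample document are assumptions standing in for whatever a removed plugin left behind, and real build files may need more care (encoding, backups, stopping Jenkins first).

```python
import xml.etree.ElementTree as ET


def strip_elements(xml_text, tag):
    """Remove every element named `tag`; return (cleaned_xml, removed_count)."""
    root = ET.fromstring(xml_text)
    removed = 0
    # Snapshot the elements first so removals do not disturb the traversal.
    # Matches nested inside an already removed element are counted as well.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)
                removed += 1
    return ET.tostring(root, encoding="unicode"), removed


# Hypothetical build record with one stale element from a removed plugin.
SAMPLE_BUILD = """<build>
  <actions>
    <hypothetical.removed.PluginAction><value>1</value></hypothetical.removed.PluginAction>
    <hudson.model.CauseAction/>
  </actions>
  <number>42</number>
</build>"""

if __name__ == "__main__":
    cleaned, count = strip_elements(SAMPLE_BUILD, "hypothetical.removed.PluginAction")
    print(count)
    print(cleaned)
```

Parsing the XML is slower than a sed one-liner, but it only removes whole elements and leaves the file well-formed.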
With the invalid data gone, I hope Jenkins is not going to fill its memory again :-] I will monitor that tonight and on Monday, then probably call it done.
I am really sorry for the multiple inconveniences since the upgrade on May 2nd and for the long time it took me to figure out the issue :(
Thanks Chad for the helpful tips regarding Java Heap memory size and thank you Timo for the JavaMelody monitoring system.
The bug report: https://bugzilla.wikimedia.org/show_bug.cgi?id=48025#c19
[1] http://www.devx.com/Java/Article/21977 recommended reading