
BL460c G6 CPU Spikes?


Jonathan was looking for advice on a customer issue:




My customer has seen a significant performance issue on the BL460c G6, with CPU spikes every 30/120 seconds.

Can anyone advise whether they have seen this issue with BL460c G6 servers, and whether there is a resolution? Having searched the advisories, I can find no reference to it.




Mark replied:




I can only help a little, and only if this is Windows based. The following assumes this is a one-off blade, and not lots of them doing the same thing.


Just getting %CPU from the process isn’t particularly helpful, given that a process has both a user-mode element and a kernel-mode element. Obviously the SmartArray SAS/SATA component is a driver, so that’s pure kernel, but the management agents have a user-mode element too.


BTW, I wouldn’t suggest running without the SmartArray Event Notification, just in case you get an error such as a drive failure. Without it you wouldn’t see the error until a reboot (and even then it would be at POST, so you may still miss it). The Event Notification tool passes the error to the System Log, and the management agents see it and send it out to the management station. If you ditch this, the system could find itself with a non-redundant array for quite some time, and you wouldn’t know about it.


So that one is going to be difficult to troubleshoot. The way it’s normally done is via a utility called XPerf, but ideally you need the public symbols, which ISS L3 will not entertain supplying to you. So the question is whether you can reproduce it; if so, get an ISS L2 (GCC) call logged so they can do the XPerf session.


Regarding the agents, I’d try to look at what the process is doing. Is it spending more time in kernel or in user space? What is the Interrupt Service Routine (ISR) percentage, and the Deferred Procedure Call (DPC) percentage? If these are high compared with the process’s CPU percentage, then the time is going in kernel and it’s not the fault of the process. If it’s pure user space, you can use things like Process Monitor (from SysInternals) to try to work out what it’s doing, or worst case take a user-mode dump of the process when it’s at the CPU percentage you think is bad (a SysInternals tool called ProcDump is really useful for that from a timing perspective). The user-mode dump can be analysed by a Microsoft engineer in L2, but to get to the nitty-gritty of the functions used you’d need the public symbols, which is another chat with ISS L2/L3.


These would help, as the procedures to do the above (particularly XPerf) are quite complicated.
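
As a rough illustration of the kernel-versus-user triage described above, here is a minimal Python sketch. It assumes you have already sampled the Windows Processor counters (% User Time, % Privileged Time, % Interrupt Time, % DPC Time) during a spike by whatever means you prefer; the `CpuSample` type, the `classify` function and the 50% threshold are hypothetical names and values chosen for illustration, not part of any HP or Microsoft tool.

```python
from dataclasses import dataclass


@dataclass
class CpuSample:
    """One sample of Windows Processor counters, each in percent (0-100)."""
    user: float        # % User Time
    privileged: float  # % Privileged Time (kernel mode)
    interrupt: float   # % Interrupt Time (ISR)
    dpc: float         # % DPC Time


def classify(sample: CpuSample, threshold: float = 0.5) -> str:
    """Guess where the CPU time of a spike is going.

    The threshold is an illustrative assumption: a share of busy time
    at or above it is treated as the dominant contributor.
    """
    busy = sample.user + sample.privileged
    if busy <= 0:
        return "idle"
    # High ISR + DPC share points at a driver or hardware, not the process.
    if (sample.interrupt + sample.dpc) / busy >= threshold:
        return "kernel (ISR/DPC)"
    # Kernel-heavy without much ISR/DPC: system calls or other kernel work.
    if sample.privileged / busy >= threshold:
        return "kernel"
    # Otherwise the process itself is doing the work in user mode.
    return "user"


if __name__ == "__main__":
    # Hypothetical readings taken during a spike.
    spike = CpuSample(user=5.0, privileged=60.0, interrupt=10.0, dpc=35.0)
    print(classify(spike))
```

A "kernel (ISR/DPC)" result would suggest following the XPerf route, while a "user" result would suggest Process Monitor or a user-mode dump of the process.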




And Richard had some input:




If the "spikes" were short (much less than a second), I would suggest increasing the NIC's RX queue. This assumes the drops are at the NIC and not, say, up at a UDP socket buffer.


I'd probably also be inclined to make sure that the cores taking interrupts from the NICs were distinct from those on which these agents run.
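
To make the core-separation idea concrete, here is a small Python sketch that builds a CPU affinity bitmask covering every core except those handling NIC interrupts. The core count and the interrupt core numbers are made-up values, and `affinity_mask` is a hypothetical helper; how the mask is then applied to the agents depends on the operating system and is not shown.

```python
def affinity_mask(total_cores: int, irq_cores: set[int]) -> int:
    """Return a bitmask with one bit set per core NOT in irq_cores.

    Bit n corresponds to logical core n, which is the usual convention
    for CPU affinity masks.
    """
    mask = 0
    for core in range(total_cores):
        if core not in irq_cores:
            mask |= 1 << core
    return mask


if __name__ == "__main__":
    # Assumed layout: 8 logical cores, NIC interrupts landing on cores 0 and 1.
    mask = affinity_mask(8, {0, 1})
    print(hex(mask))  # 0xfc: cores 2-7 are left for the agents
```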




Anyone else have some input?