ProLiant Servers (ML,DL,SL)
cancel
Showing results for 
Search instead for 
Did you mean: 

CPU Drift on DL585 G1

MT19
Valued Contributor

CPU Drift on DL585 G1

We're having an issue with CPU drifting on a DL585 G1. We have applied the following already which was described on an HP Customer Advisory but it did not resolve the issue. I've also included the report from Microsoft analysis. Anyone know if this something Microsoft needs to continue to troubleshoot or is there an HP solution to it?
Thanks!
MT

_____________________
For AMD Opteron 200-series, Opteron 800-series, Opteron 1200-series, Opteron 2200-series and Opteron 8200-series processors , to ensure proper operation of tasks sensitive to system time, perform either of the following actions, based on the operating system environment:

Microsoft Windows Server 2003 (any edition)
Edit the BOOT.ini file and add the parameter "/usepmtimer," then reboot the server. Adding the "/usepmtimer" parameter to the BOOT.INI file configures the Windows operating system to use the PM_TIMER, rather than the Time Stamp Counter.

Note 1: When installing the AMD Opteron Processor with AMD PowerNow! Technology driver Version 1.3.2.16 (or later) from AMD, the BOOT.INI file will automatically be updated with the "/usepmtimer" parameter. While the driver itself does not resolve this issue, the installation process will make the necessary changes to the BOOT.INI file to prevent the issue from occurring.
___________________________

Below is the analysis report from Microsoft

Hi Everyone,



Pertaining to the PSSDIAG research that I have done, please find the analysis that I have done and the issues that we are facing related to the CPU drift.



ANALYSIS

============



PSSDIAG

===========



PSSDIAG.LOG



2008/09/23 18:25:58.56 PSSDIAG Collection started. Press Ctrl+C to stop.

2008/09/23 18:39:21.27 PSSDIAG Ctrl+C pressed. Shutting down the collector



Pssdiag ran for 14 mins







SQLDIAG

============



2008-09-23 14:24:57.400 Server Microsoft SQL Server 2005 - 9.00.3239.00 (X64)

Apr 9 2008 22:41:28

Copyright (c) 1988-2005 Microsoft Corporation

Enterprise Edition (64-bit) on Windows NT 5.2 (Build 3790: Service Pack 2)



2008-09-23 14:24:57.480 Server SQL Server is starting at normal priority base (=7). This is an informational message only. No user action is required.

2008-09-23 14:24:57.480 Server Detected 4 CPUs. This is an informational message; no user action is required.



2008-09-23 14:24:58.620 Server The time stamp counter of CPU on scheduler id 1 is not synchronized with other CPUs.

2008-09-23 14:24:58.860 spid1s A significant part of sql server process memory has been paged out. This may result in a performance degradation. Duration: 0 seconds. Working set (KB): 37560, committed (KB): 88504, memory utilization: 42%.



2008-09-23 14:25:01.790 spid5s Server name is 'CRPSCSMSQ69V1\PUB'

2008-09-23 14:25:01.790 spid5s The NETBIOS name of the local node that is running the server is 'CRPSCSMCS69N1'.



2008-09-23 14:54:02.410 spid61 Using 'xpSLS.dll' version '0005.00.00' to execute extended stored procedure 'xp_backup_database'. This is an informational message only; no user action is required.



2008-09-23 17:25:05.960 Server The time stamp counter of CPU on scheduler id 1 is not synchronized with other CPUs.

2008-09-23 17:25:05.960 Server CPU time stamp frequency has changed from 2399922 to 1816584 ticks per millisecond. The new frequency will be used.

2008-09-23 17:29:05.960 Server The time stamp counter of CPU on scheduler id 2 is not synchronized with other CPUs.

2008-09-23 17:52:56.150 spid4s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\MSSQL\Log\Results_Log.ldf] in database [Results] (5). The OS file handle is 0x00000000000008A8. The offset of the latest long I/O is: 0x0000005dbfa400

2008-09-23 17:53:07.620 Server The time stamp counter of CPU on scheduler id 3 is not synchronized with other CPUs



2008-09-23 18:13:13.040 spid4s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\MSSQL\Log\Results_Log.ldf] in database [Results] (5). The OS file handle is 0x00000000000008A8. The offset of the latest long I/O is: 0x0000005fdbe000

2008-09-23 18:15:04.570 spid4s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [F:\MSSQL\Data\Results_Data_02.ndf] in database [Results] (5). The OS file handle is 0x000000000000088C. The offset of the latest long I/O is: 0x000002bbbb0000

2008-09-23 18:18:54.760 spid4s SQL Server has encountered 4 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\MSSQL\Log\Results_Log.ldf] in database [Results] (5). The OS file handle is 0x00000000000008A8. The offset of the latest long I/O is: 0x00000003661e00

2008-09-23 18:19:04.770 spid4s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [F:\MSSQL\Data\Results_Data_01.mdf] in database [Results] (5). The OS file handle is 0x0000000000000870. The offset of the latest long I/O is: 0x000000b070e000

2008-09-23 18:22:55.040 spid4s SQL Server has encountered 3 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [F:\MSSQL\Data\Results_Data_02.ndf] in database [Results] (5). The OS file handle is 0x000000000000088C. The offset of the latest long I/O is: 0x0000020cf78000

2008-09-23 18:24:15.060 spid4s SQL Server has encountered 46 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\MSSQL\Log\Results_Log.ldf] in database [Results] (5). The OS file handle is 0x00000000000008A8. The offset of the latest long I/O is: 0x0000002849e800

2008-09-23 18:25:08.210 Server The time stamp counter of CPU on scheduler id 1 is not synchronized with other CPUs.

2008-09-23 18:25:39.430 spid188 Configuration option 'blocked process threshold' changed from 5 to 5. Run the RECONFIGURE statement to install.

2008-09-23 18:25:45.030 spid188 SQL Trace ID 2 was started by login "CORP\NBEQ5YK".

2008-09-23 18:25:50.090 spid191 DBCC TRACEON 3004, server process ID (SPID) 191. This is an informational message only; no user action is required.

2008-09-23 18:25:50.100 spid191 DBCC TRACEON 3605, server process ID (SPID) 191. This is an informational message only; no user action is required.

2008-09-23 18:29:08.870 Server The time stamp counter of CPU on scheduler id 2 is not synchronized with other CPUs.

2008-09-23 18:29:46.110 spid4s SQL Server has encountered 4 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [F:\MSSQL\Data\Results_Data_02.ndf] in database [Results] (5). The OS file handle is 0x000000000000088C. The offset of the latest long I/O is: 0x00000076a50000

2008-09-23 18:29:46.150 spid4s SQL Server has encountered 6 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\MSSQL\Log\Results_Log.ldf] in database [Results] (5). The OS file handle is 0x00000000000008A8. The offset of the latest long I/O is: 0x00000044d4ca00

2008-09-23 18:35:56.260 spid4s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [F:\MSSQL\Data\Results_Data_02.ndf] in database [Results] (5). The OS file handle is 0x000000000000088C. The offset of the latest long I/O is: 0x000002ce534000



Inference from logs:



1. The server has got working set size issues, but got the error message only once.



2. CPU drift issues still persists. This may impact the performance tuning on the box. As specified by Sharmila, we have to first ensure that the CPU drift is resolved.



3. There are tons of IO messages in the SQL Server error logs. The disk\IO\Hw\drivers looks to be issue for contention.

Not sure if these messages are bogus due to CPU drift.



sp_configure

==================



name minimum maximum config_value run_value

----------------------------------- ----------- ----------- ------------ -----------

max degree of parallelism 0 64 0 0

max server memory (MB) 16 2147483647 12288 12288

max worker threads 128 32767 0 0

min memory per query (KB) 512 2147483647 1024 1024

min server memory (MB) 0 2147483647 0 0

network packet size (B) 512 32767 32767 32767





Inference:



1. Max server memery-> 12GB

2. Netwrok packet sixe-> 32767. And this is very large packet size.

By default, the Microsoft .NET Framework Data Provider for SQL Server sets the network packet size to 8,192 bytes. Microsoft recommends to use not more that 8192 as network packet sixe, otherwise Cx might see performance issues.

http://support.microsoft.com/kb/903002/en-us





-> xp_msver

===============

8 ProcessorType 8664 NULL

19 PhysicalMemory 32599 32599



<\SQL-DIAG>





SQLNEXUS REPORT

===================



In the SQLNexus, I see lot of blocking and the top wait type as PageLatch_EX (Page Latch (non IO)). The max waititme is 30000secs.

But the all these blocking does not last long. The max duration of blocking remain is 24 secs Another thing is that we are not sure if the amount of time is correct since we are seeing the CPU drift issues.

The top waittypes are

PageLatch_Ex, the average waittime is 15123s LazyWriter_Sleep, the average waittime is 14929s SQLTrace_Buffer_Flush, the avg waittime is 14919s



In the SQL Nexus report the head blocker is waiting on Page latch in all blocking chains. I am seeing the blocking chain having Duration as 24 sec and Blocking 38 spids..



Head of Blocker: 79

Blocking start: 9/23/2008 6:36:58PM

Blocking Stop: 9/23/2008 6:37:22PM



Blocking duration: 24secs

Max Blocking Chain Size: 38



SPID 79 was running following command:

INSERT INTO CDSens_20080922 (RevalDate, JobRunUID, TradeLegUID, BucketUID, CreditCurveUID, ValuationAttribUID, ValueType...



and all other SPIDs are majorly waiting for the PageLatch_EX



This means, the SQL Server is doing lot of inserts\deletes\updates on the data pages and hence taking the latch on it. This may be due to the bad disk IO and bad query plans. Need to investigate it further.



Checked the perfmon counters and it looked good to me..

Physical Disk Minimum Maximum Average

============================================================================================================

Avg. Disk sec/Transfer : 0.001 | 0.002 | 0.001

Disk: 0 C: : 0 | 0.005 | 0

Disk: 1 D: : 0 | 0.006 | 0

Disk: 10 V: : 0 | 0 | 0

Disk: 11 U: : 0 | 0 | 0

Disk: 12 K: : 0 | 0 | 0

Disk: 13 F: : 0.001 | 0.067 | 0.008

Disk: 14 Y: : 0 | 0.01 | 0.001

Disk: 15 J: : 0 | 0 | 0

Disk: 16 M: : 0 | 0 | 0

Disk: 2 G: : 0.001 | 0.002 | 0.001

Disk: 3 : 0 | 0 | 0

Disk: 4 H: : 0 | 0.009 | 0.001

Disk: 5 E: : 0 | 0.003 | 0.001

Disk: 6 : 0 | 0.004 | 0

Disk: 7 : 0 | 0.008 | 0

Disk: 8 S: : 0 | 0 | 0

Disk: 9 T: : 0 | 0 | 0

Avg. Disk Queue Length : 0.446 | 1.649 | 0.868

Disk: 0 C: : 0 | 0.296 | 0.01

Disk: 1 D: : 0 | 0.275 | 0.003

Disk: 10 V: : 0 | 0 | 0

Disk: 11 U: : 0 | 0 | 0

Disk: 12 K: : 0 | 0 | 0

Disk: 13 F: : 0.011 | 0.826 | 0.167

Disk: 14 Y: : 0 | 0.017 | 0.003

Disk: 15 J: : 0 | 0 | 0

Disk: 16 M: : 0 | 0 | 0

Disk: 2 G: : 0.231 | 0.775 | 0.477

Disk: 3 : 0 | 0 | 0

Disk: 4 H: : 0 | 0.031 | 0.003

Disk: 5 E: : 0.09 | 0.756 | 0.203

Disk: 6 : 0 | 0.006 | 0.001

Disk: 7 : 0 | 0.012 | 0.001

Disk: 8 S: : 0 | 0 | 0

Disk: 9 T: : 0 | 0 | 0



So, the disk counters are perfect, we need to look for the plan of the query..

So, have to look in the query plan..



Bur before doing this, I would request you to resolve the CPU drift issue which can cause all sorts of problems for SQL, including incorrect query durations reported in Profiler or SET STATISTICS TIME ON, incorrect wait times in sysprocesses or SQLPERF(WAITSTATS)/sys.dm_os_wait_stats, bogus "I/O requests taking longer than 15 seconds to complete" warnings in the SQL errorlog, and even scheduler hangs.



**So, not looking at the plan at this point in time.



==To resolve the CPU drft issue we have already followed:

1. Set the server's power scheme to "Always 2. Install AMD's "Dual-Core Optimizer" patch.

I wanted you to set the processor affinity as well, but you cannot use procesor affinity since this is a cluster and on node 1 you have 4 dual core procs and on node 2 you have 4 single core procs.



==I requested you to run the RDTSCTEST.exe which would tell us the CPU drift on the Server, and here are the results:



G:\PSSDIAG>rdtsctest.exe



======================= Processor =======================

Processor Architecture: x64

Page Size: 4096

Processor Type: AMD64

Processor Type: Family 15, Model 5, Stepping 10

CPU Speed: 2400Mhz (2400000 ticks per ms)

Cache Line Size: 64

Cache Size: 1024KB

Active Processor Mask: 15 (0x000000000000000F)

Warning: CPUID Leaf 4 is not supported or disabled in BIOS. May be unable to det ermine core/processor mapping.



Hyper-Threading: Not capable

Multi-core: No

Physical Package/Socket Count: 4

Processor Core Count: 4 (1 per package)

Logical Processor Count: 4 (1 per core)



Logical Processor/Core/Socket Mapping

CPU APIC ID Physical ID Core ID HT/SMT ID

------ ----------- ----------- ----------- -----------

0 0 0 0 0

1 1 1 0 0

2 2 2 0 0

3 3 3 0 0



======================= NUMA ============================

Max NUMA Node: 3

Node 0 Processor Mask: 1 (0x0000000000000001)

Node 1 Processor Mask: 2 (0x0000000000000002)

Node 2 Processor Mask: 4 (0x0000000000000004)

Node 3 Processor Mask: 8 (0x0000000000000008)



======================= Memory ==========================

Total Physical Memory: 32599 MB

Avail Physical Memory: 17747 MB

% Memory In Use: 45

Total Page File: 64669 MB

Avail Page File: 50207 MB

NUMA Node 0 Avail Memory: 0 MB

NUMA Node 1 Avail Memory: 3101 MB

NUMA Node 2 Avail Memory: 0 MB

NUMA Node 3 Avail Memory: 3477 MB



======================= Actual CPU Speed ================

CPU 00 Current Speed: 2397Mhz

CPU 01 Current Speed: 2398Mhz

CPU 02 Current Speed: 2391Mhz

CPU 03 Current Speed: 2377Mhz



======================= RDTSC Deltas ==================== CPU 00 Ticks: 0x0000B00905A7D976, Drift: 2449197.2ms (0x55898781649 ticks) *Excessive Drift =============> IMP to NOTE CPU 01 Ticks: 0x0000AAB06D2FC32D, Drift: 0.0ms (0x00000000 ticks) CPU 02 Ticks: 0x0000ABC7CB7E1F7A, Drift: 499949.2ms (0x1175E4E5C4D ticks) *Excessive Drift ==============>IMP to NOTE

CPU 03 Ticks: 0x0000AB59C6B7137A, Drift: 303063.1ms (0xA95987504D ticks) *Excessive Drift ================IMP to NOTE



So, this clearly explains that the CPU drift is excessive on the server node.



==Recommended you that we need to contact the hardware vendor and windows paltform team (if they can help un reducing the CPU drift).



==At this point we will not do further research on the case (related to query plan etc).



ACTION PLAN

================



1. You will contact Hw vendor to reduce the CPU drift on the server



2. You will engage Platforms Engg (Dedicated to BOA) to check if he can help.



3. Once this is done, we will run PSSDIAG again on the server to get the data and do analysis.



4. Have advised you to reduce the Network packet size to atmost 8192KB (4096 would be the best)





3 REPLIES
KarloChacon
Honored Contributor

Re: CPU Drift on DL585 G1

hi

have you read this website?

http://digital.ni.com/public.nsf/allkb/D0FC3BE7D9F331398625731A005D76B8

and this
http://developer.amd.com/assets/TSC_Dual-Core_Utility.pdf

regards
Didn't your momma teach you to say thanks!
MT19
Valued Contributor

Re: CPU Drift on DL585 G1

Hi Karlo -

I'm currently presenting this to my team. Will let you know how it goes after implementing it.
KarloChacon
Honored Contributor

Re: CPU Drift on DL585 G1

oh thanks for the update

OK let us know the results
Didn't your momma teach you to say thanks!