Operating System - Tru64 Unix

Re: What are Idle Ticks

 
Victor Semaska_3
Esteemed Contributor

What are Idle Ticks

I noticed that one of our Alphas running V5.1B PK5 at times has no idle time at all. This goes on for many minutes. I used collect to see what the CPU is doing and got this:

# CPU SUMMARY
#  USER   SYS  IDLE  WAIT  INTR  SYSC    CS  RUNQ  AVG5 AVG30 AVG60  FORK VFORK
     11     3     0    85   486  2027  2237     0  0.14  0.21  0.23  0.00  0.00
# SINGLE CPU STATISTICS
#   CPU  USER   SYS  IDLE  WAIT
      0    11     3     0    85

85% in WAIT state. The manpage for 'collect' says 'WAIT Idle ticks while waiting for I/O to happen.' What are Idle ticks?

I've never seen this type of behavior before. It's an Oracle server but we have other Oracle servers, V5.1B PK5 as well, that don't do this.

Any ideas?

Vic
There are 10 kinds of people, one that understands binary and one that doesn't.
7 REPLIES
Ivan Ferreira
Honored Contributor

Re: What are Idle Ticks

IOWAIT is the percentage of time the CPU is "not processing" because it is waiting for I/O to complete. The I/O can be a disk or network operation.

A high IOWAIT (as in your case) indicates a problem. Either the database or application is doing something wrong, or your disks are too slow for your requirements.

Ideally WAIT should be 0.
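
To make the columns a bit more concrete: on most Unix kernels every clock tick is charged to whatever state the CPU is in at that moment, and WAIT counts the ticks where the CPU was idle while I/O was still outstanding. The sketch below uses a hypothetical helper, `tick_percentages`, and made-up tick counts to turn such counters into whole-number percentages. It only illustrates the accounting idea and is not how collect itself is implemented.

```shell
#!/bin/sh
# Hypothetical helper: convert per-state tick counts for one sampling
# interval into whole-number percentages, printed as "USER SYS IDLE WAIT".
# Arguments: ticks charged to user, system, idle and I/O-wait states.
# The tick counts used in the example call are invented for illustration.
tick_percentages() {
    awk -v u="$1" -v s="$2" -v i="$3" -v w="$4" 'BEGIN {
        total = u + s + i + w
        # Avoid dividing by zero when no ticks were recorded.
        if (total == 0) { print "0 0 0 0"; exit }
        # %d truncates, so the four values may add up to slightly less than 100.
        printf "%d %d %d %d\n", 100*u/total, 100*s/total, 100*i/total, 100*w/total
    }'
}

# Example: 990 ticks in the interval, 850 of them idle with I/O pending.
tick_percentages 110 30 0 850
```

With those example counts the call prints `11 3 0 85`, the same shape as the summary line in the original post: almost no CPU work, no "true" idle time, and most ticks spent idle waiting on I/O.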
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Harmanjit_1
Frequent Advisor

Re: What are Idle Ticks

Hi,

This shows that one of your disks is spending more time in I/O wait.

You can use "collect -sd" to check which disk shows the highest average wait time and which is busiest.

Once determined, check the domain on that disk and see how fragmented it is.

Once both of the above check out, and if it is a database domain, look at what your application database is doing.

Hope this helps.

Manish PATHAK_2
Regular Advisor

Re: What are Idle Ticks

Idle ticks are the unused processor time.

Which applications are you running, and how many CPUs do you have?

Can you post the current process details?

Br
Manish
Victor Semaska_3
Esteemed Contributor

Re: What are Idle Ticks

The server is a DS25 with 2 GB of memory and 1 EV6.8CB (21264C) processor. It has 11 Oracle databases on it for reporting reasons. It seems when a report is run, that's when the high CPU wait time occurs.

Disk I/O seems low, about 300 I/Os per second, whereas another DS25 with similar hardware gets upwards of 1,000 I/Os per second.

The RAID controller is a Smart Array 5304A with an external shelf. It's at a remote location so I haven't seen how they have the disks set up in the two RAID sets. The 5304A has 4 channels but the disks may all be on one or two channels. Could that cause this problem?

Vic
There are 10 kinds of people, one that understands binary and one that doesn't.
Hein van den Heuvel
Honored Contributor

Re: What are Idle Ticks

You may have some slow IO component.
That could explain the wait time.
What do the Oracle Statspack reports show as read time in milliseconds? Similar for all (3 - 8 ms perhaps) or some exceptional values ( > 20 ms ? ).

Maybe you have a failed RAID-5 drive and the system has to reconstruct each block by reading the other drives? Maybe a write-back cache (battery) failure on a controller that is slowing down the (redo) writes?
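
To illustrate why a failed RAID-5 member hurts reads: a block on the missing drive has to be rebuilt by XOR-ing the corresponding blocks from every surviving drive, so on a 5-disk set one logical read turns into four physical reads. The sketch below shows the XOR step on single byte values. The helper name `reconstruct_byte` and the byte values are invented for illustration, and a real controller does this in firmware on whole stripes.

```shell
#!/bin/sh
# Hypothetical helper: rebuild the missing member of a RAID-5 stripe by
# XOR-ing the surviving members (remaining data blocks plus parity).
# Each argument is one byte value (0-255) standing in for a whole block.
reconstruct_byte() {
    result=0
    for b in "$@"; do
        result=$(( result ^ b ))
    done
    echo "$result"
}

# Example stripe on a 5-disk set: data bytes 18, 52, 86, 120.
# Parity is the XOR of all four data bytes.
parity=$(reconstruct_byte 18 52 86 120)

# Pretend the disk holding 86 has failed: read the other four and XOR them.
reconstruct_byte 18 52 120 "$parity"
```

The last call prints `86`, the byte that was on the "failed" disk, at the cost of reading all four surviving members.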

How about doing some disk baseline testing?
Simple stuff to get started:
# time dd count=10000 bs=8192 of=/dev/null if=/dev/r...

Run on each box, compare times.
Or google for more serious IO benchmarks like 'iobench'.
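
To compare the boxes on a number rather than on raw times, the elapsed time can be turned into a transfer rate. A minimal sketch, assuming the dd above read its full 10000 blocks of 8192 bytes (81,920,000 bytes); the helper name `mb_per_sec` and the 4-second elapsed time are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: convert a byte count and an elapsed time in seconds
# into MB/sec (1 MB = 1048576 bytes), printed with one decimal place.
mb_per_sec() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN {
        # A zero or negative time means the run was too short to measure.
        if (secs <= 0) { print "n/a"; exit }
        printf "%.1f\n", bytes / 1048576 / secs
    }'
}

# 10000 blocks of 8192 bytes, with an invented elapsed time of 4 seconds.
mb_per_sec $(( 10000 * 8192 )) 4
```

With those example inputs the call prints `19.5`. Feed it the "real" time reported by `time` on each system and compare the two rates.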

Double check the startup log (/var/adm/messages)... maybe a controller mentions reduced availability?

Call in support / consultants as appropriate.

Good luck!
Hein van den Heuvel
HvdH Performance Consulting.
Victor Semaska_3
Esteemed Contributor

Re: What are Idle Ticks

Thanks everyone for your suggestions. Again, this server is at a remote location so it's not easy for me to check everything. I am scheduled to go to the site on Friday so I can check things further.

This is what I've determined so far:
* No disks with fault lights on so no failed disks.
* Nothing unusual in /var/adm/messages or any other system logs including binary.errlog.
* The 4th diagnostic LED on the DS25 OCP is on but if I'm reading it correctly in the manual, it means 'RMC (Remote Management Console) power-up done' so I assume that's OK. I'm trying to verify this in the 'Servers' forum.
* It looks like the RAID set with all the Oracle databases is a 5 disk RAID-5 set and they're all located on the external shelf.
* I checked the domain in question and it's not heavily fragmented. Also ran '/sbin/advfs/verify -a' on it and no errors were reported.

Hein, I wasn't able to try your 'time dd' example since that requires an unused partition. Instead I tried 'cp FILE /dev/null', where FILE is a multi-gigabyte file, and was able to reproduce the problem: very high CPU wait time. So it doesn't look like an Oracle problem.

If I'm right, the disks in the RAID-5 set with the databases are all on the same controller channel. The shelf is not split-bus. Could this be the problem? I know that when you create a RAID set it's best to have the disks spread across multiple channels but could this cause such a problem?

There is the internal 6-disk cage (only 2 slots used for system disks) in the server so I could redo the RAID-5 set splitting the disks between the shelf and the internal disk cage. Would it be worth it?

Hein, you mentioned checking the cache battery. Any idea how to do this? I've looked at the Smart Array 5300 User Guide and don't see anything with ORCA.

Thanks,
Vic
There are 10 kinds of people, one that understands binary and one that doesn't.
Hein van den Heuvel
Honored Contributor

Re: What are Idle Ticks

>> It has 11 Oracle databases on it for reporting reasons. It seems when a report is run, that's when the high CPU wait time occurs.

I know you mentioned this earlier, but 11 Oracle databases is a lot for 'just' 2GB of memory. That suggests that either the SGAs are set conservatively at around 100MB each, or the memory is overcommitted and the system is paging, or you carefully make sure that not all databases are started at the same time.

Check the Oracle Statspack report (level 7) for a typical report window and look specifically for the "Buffer Pool Advisory". Does it indicate that more buffers would save a lot of IO? Of course, also look at the "Tablespace IO Stats" to get an impression of the Rd(ms), the Rd and Wrt rates, and Blks/Rd (large IOs or single-page IOs?).

I agree that it is suspect to see one system able to do 1000 IO/sec and another only 300 on similar hardware, but it could all be perfectly normal for the specific loads.

That's why I asked for the 'dd', but a cp will do fine too. Run it on both systems to read a large Oracle dbf file (at least 2GB) and time it. How many MB/sec on both systems?
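
The arithmetic behind that remark, as a rough sketch. The helper name `sga_fit` is hypothetical, and the check deliberately ignores the kernel, Oracle process memory and the file cache, so a real system needs a good deal of headroom beyond what it shows:

```shell
#!/bin/sh
# Hypothetical check: does the combined SGA of N databases fit in
# physical memory? Arguments: physical memory in MB, number of
# databases, SGA size per database in MB (all illustrative inputs).
# It ignores every other memory consumer, so "fits" is only a lower bound.
sga_fit() {
    phys_mb=$1
    n_db=$2
    sga_mb=$3
    total=$(( n_db * sga_mb ))
    if [ "$total" -le "$phys_mb" ]; then
        echo "fits: ${total} MB of ${phys_mb} MB"
    else
        echo "overcommitted: ${total} MB of ${phys_mb} MB"
    fi
}

sga_fit 2048 11 100   # 11 databases at 100MB each
sga_fit 2048 11 200   # the same 11 databases at 200MB each
```

The first call prints `fits: 1100 MB of 2048 MB` and the second prints `overcommitted: 2200 MB of 2048 MB`, which is why SGAs much above 100MB each would push a 2GB box into paging if all 11 instances are up.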


>> Hein, I wasn't able to try your 'time dd' example since that requires an unused partition.

Nah, just pick an rdsk partition where the corresponding dsk partition holds a good few Oracle datafiles. Try a few IO sizes, notably 8KB (or whatever your DB page size is), 64KB and 1024KB. Read from the partition and send the output to /dev/null.

>> Instead I tried 'cp FILE /dev/null', where FILE is a multi-gigabyte file, and was able to reproduce the problem: very high CPU wait time. So it doesn't look like an Oracle problem.

That's fine also, just less predictable as the filesystem might 'help'. You would expect (or at least I would :-) IO-wait from a simple copy, as it should do little but wait for IO to come in. It's the rate it manages to accomplish which will give an indication of the IO subsystem. That's why I suggested timing it. That + size gives MB/sec. Try it on the other system as well.
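
A minimal sketch of such a sweep, reading the same amount of data at each of the three IO sizes. The helper name `sweep_io_sizes` is hypothetical, the source path is whatever partition or file you pick, and `date +%s` is assumed to be available; it only has one-second resolution, so read enough data for each pass to take several seconds.

```shell
#!/bin/sh
# Hypothetical helper: read TOTAL bytes from SRC at several IO sizes and
# print "<io-size-bytes> <elapsed-seconds>" for each pass.
# SRC is an assumption: a raw partition or a large data file of your choice.
# Reads only; everything goes to /dev/null.
sweep_io_sizes() {
    src=$1
    total=$2
    for bs in 8192 65536 1048576; do
        # Number of blocks needed to cover TOTAL bytes at this IO size.
        count=$(( total / bs ))
        [ "$count" -gt 0 ] || count=1
        start=$(date +%s)
        dd if="$src" of=/dev/null bs="$bs" count="$count" 2>/dev/null || return 1
        end=$(date +%s)
        echo "$bs $(( end - start ))"
    done
}

# Usage (path and size are placeholders, not taken from this thread):
#   sweep_io_sizes /path/to/source 2147483648
```

When the source is a regular file rather than a raw partition, the second and third passes may be served partly from the filesystem cache, so read the elapsed times with that in mind.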


>> best to have the disks spread across multiple channels but could this cause such a problem?

Too early to tell, but I am with you that it is not likely a 100% effect but more like a 10% effect.

>> Hein, you mentioned checking the cache battery. Any idea how to do this? I've looked at the Smart Array 5300 User Guide and don't see anything with ORCA.

I would assume the management software can tell you, but I admit to never having used it. Actually, now that I realize the main usage is reporting, this is less likely to be a problem. The cache batteries are needed to protect writes, not reads.

At some point you may just need to get a performance consultant involved who knows a bit about Tru64 and Oracle and such. I would know just the right person for that ! :-)

Regards,
Hein van den Heuvel
HvdH Performance Consulting.