Oracle 10g and turning off/on hyperthreading while db is up

 
SOLVED
Hein van den Heuvel
Honored Contributor

Re: Oracle 10g and turning off/on hyperthreading while db is up

Alzhy ->>Let us know how HT works out and if it is a GOOD thing for Oracle Databases (considering most oracle processes are not highly threaded processes).

Yes, concrete application based feedback is much welcomed, even if no two applications are alike.

Please note that the thread process comment IMHO is misleading/confusing. HT co-thread processors are full, independent, freely schedulable entities.
They are available for any runnable thread, whether it is the only thread of a fresh process or one of many in a multi-threaded process.
You do NOT need multi-threaded applications to benefit from HT.
You do need lots of concurrently running threads, whichever way they come.
You also need a good number of main-memory stalls (= cache misses) to create the 'micro-idle-time' in which the threads can flip.

TwoProc>> change the HT on and off every 60 seconds during a heavy load (40% cpu load)

Hmmm, I don't think that is a well constructed test for performance evaluation.
HT works best to create more total CPU throughput under high load... well over 50%, more like 80% (such as TPC benchmarks :-)
Let's face it. If there is less than 50% CPU load, then the OS does best by not scheduling anything on a CPU whose co-CPU is already busy. For HT to pay off, the workload would frequently have to want many more than 24 CPUs: 40+ or 60+.

Turn up the volume to 11 (out of 10 :-).
For example, let's say that your suggested 250-user Mercury test creates 70% load on 24 non-HT CPUs. I expect those to use 40% or more CPU when switched to 48 HT CPUs... the price of more concurrency.
But now go to 400 users, approaching 100% CPU on the 24 non-HT CPUs. Typically response times will tank. Switch on HT and you'll find the system using 60 - 70% CPU... out of 48, with deteriorated but manageable response times.
And you may find you can add another 50 users or so before approaching 90% CPU (out of 48) with acceptable response times.

The total throughput gain... in what would be an overload situation for 24 CPUs is not unlikely to be 10% - 20%.
But it will not help diddly-squat when the load needs just 24 CPUs or fewer, and it may even hurt some, notably when there is lots of latch contention (CPUs actively spinning while waiting for a memory flag to clear).
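To make the percentages in the example easier to compare, here is a minimal Python sketch of the arithmetic. It only re-expresses "percent busy" on one CPU count as "percent busy" on another; the function name is made up for illustration, and the figures are the ones quoted in the post, not measurements.

```python
def equivalent_utilization(util_pct: float, cpus_before: int, cpus_after: int) -> float:
    """Re-express the same number of busy CPUs as a percentage of a different CPU count."""
    busy_cpus = util_pct / 100.0 * cpus_before
    return busy_cpus / cpus_after * 100.0


# 70% of 24 non-HT CPUs is 16.8 busy CPUs. Read against 48 logical CPUs,
# the same busy count would show up as 35%.
naive = equivalent_utilization(70, 24, 48)
print(f"70% of 24 CPUs read against 48 logical CPUs: {naive:.1f}%")

# The post expects 40% or more after the switch, so the gap between the
# naive figure and 40% is the "price of more concurrency" it mentions.
print(f"gap to the expected 40%: {40 - naive:.1f} points")
```

Keep in mind that a busy logical CPU under HT does not deliver the same work as a busy non-HT CPU, so the two percentages are not directly comparable as throughput.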

Hope this helps some
Hein van den Heuvel
HvdH Performance Consulting







Twoproc>> I could easily believe that it's possible that Oracle knows which procs are real and which ones are HT, and I could just as well believe that it has no idea. And, I think that the truth is that it's probably the latter.

The latter. Each thread in an HT environment is as real or as unreal as the next.
They cannot be distinguished operationally other than by number.
Alzhy
Honored Contributor

Re: Oracle 10g and turning off/on hyperthreading while db is up

Alzhy ->>Let us know how HT works out and if it is a GOOD thing for Oracle Databases (considering most oracle processes are not highly threaded processes).

Please note that the thread process comment IMHO is misleading/confusing. HT co-thread processors are full, independent, freely schedule-able entities.

Herr Hein, yours truly did not mean to "mislead" (as you allege) that highly threaded processes are a fit for HT processors. What I meant was that since Oracle processes are not highly threaded, each Oracle fore/background process will have more access to "full, independent, freely schedulable" processing entities with HT threading "ON". Now whether Oracle (and some other) processes get more "oomph" on REAL cores or on THREAD processing entities is what I really am after, as based on our experience there are certain processes that crunch logic much faster on REAL CPU cores.

But I do agree - there will be processing scenarios wherein the workload will have more processing capacity on HT processors -- HT being a scheme of duping the OS into thinking there are more real CPUs than there are physical cores.

I have friends who work for both INTEL (HT-leaning) and AMD (historically insistent on real CORES). It is usually fun to have them both over for a couple of b33rs. So many b33rs flowed that I could no longer remember what HT in Intel's approach or AMD's core persistence is all about.

But in the end -- it really all "depends", Migz.
Hakuna Matata.
TwoProc
Honored Contributor

Re: Oracle 10g and turning off/on hyperthreading while db is up

Hein, it's not a performance test to switch every 60 seconds (in fact the load is just there to generate some form of Oracle work, most any work would do). It's a durability test. I want to know if switching HT on and off is safe. I figure if I flip it off and on an insane number of times over an extended period, and I've got no corruption issues, then it's good to go.

Re: 250 users, not a magical number - it's all I've got for load testing licenses. It's the max I can push with the tool without spending lots more $$$, and I believe I can do a good job of load testing with that level of sim users. I can buy temporary user days, but I don't feel I need it for the goals of this particular test scenario.

As for getting the load higher than 40-50%, we do that by reducing and/or removing the timing waits in the scripts. No problem with that. The reason is that we simply cannot come up with a good Mercury test that simulates a room full of folks doing their jobs. Even when we calculate the waits from sample users, from real user interactions, the load we then simulate in Mercury LoadRunner WAY overloads the servers. The code being run is correct, but the timing, even though averaged, is almost meaningless, whether it comes from empirical evidence or from samples. When used, these loads are much higher than what actual users generate. I believe part of the blame is the Hawthorne effect, and the other part is that screen event logs and averages still give averages that are practically meaningless, because the std deviation of the number of items handled, the number of xxxx and xxxx and xxxx would have to be calculated, evaluated for lead/lag influences, stripped of coincident data with Durbin-Watson tests, degrees of freedom for the above exploded... etc.

And, what I'd end up with is something like the average number of children per family in the US is 1.5. 1.5 is almost useless to simulate, you really need to simulate 0,1, and children, and use gaming to see if your distributions match real #'s.
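The point about averages can be sketched in a few lines of Python. This is only an illustration of sampling from a distribution instead of replaying its mean; the values and weights below are invented so that the mean comes out at 1.5, and they are not data from this thread or from any census.

```python
import random

# Hypothetical distribution of "children per family" (illustrative weights only).
VALUES = [0, 1, 2, 3]
WEIGHTS = [0.2, 0.3, 0.3, 0.2]


def weighted_mean(values, weights):
    """Mean of a discrete distribution given its values and weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def sample_families(rng, n):
    """Draw n whole-number family sizes from the distribution."""
    return rng.choices(VALUES, weights=WEIGHTS, k=n)


rng = random.Random(42)  # fixed seed so the run is repeatable
draws = sample_families(rng, 10_000)

print("distribution mean:", weighted_mean(VALUES, WEIGHTS))
print("sample mean:      ", sum(draws) / len(draws))
# Every simulated family has a whole number of children; none of them has 1.5.
print("families with exactly 1.5 children:", sum(1 for d in draws if d == 1.5))
```

Comparing the sampled frequencies against observed production counts is one way to check whether a simulation's distribution, and not just its average, matches reality.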

We don't have that much time or $$$ resources to do that, when I can see the "real scenario" running right now from my monitors, and I can just tweak timing wait percentages until I feel I'm close enough to make value judgements - and I don't feel that's difficult to do after looking at this process flow, in the various mutations of its basic form, for 12 years to judge its ability to satisfice a problem-solving challenge... in other words, experience helps.

Anyways, thank you all for your valuable and esteemed input. Very gracious of all of you, thanks very much!
We are the people our parents warned us about --Jimmy Buffett
Hein van den Heuvel
Honored Contributor

Re: Oracle 10g and turning off/on hyperthreading while db is up

TwoProc, Thanks for the follow up reply. Excellent.

>> it's not a performance test to switch every 60 seconds ... It's a durability test

We agree. Good work.

>> I believe part of the blame is the Hawthorne effect

Hmm, I don't think too many Oracle slaves will work harder because they sense they are being watched ;-)

>> Re: 250 users, not a magical number - it's all I've got for load testing licenses.

Been there, done that. Understood.
You can reduce think times to increase load, but it does not 'feel' the same as actually having more concurrent sessions, unless the sessions are managed/funneled through a transaction monitor of some sort anyway (Tuxedo and the like).


>> when I can see the "real scenario" running right now from my monitors,

And that's what we ended up doing. The load graphs for a given day in the week were relatively predictable/comparable. The effect of HT could be judged from there.

In the case studied, where interactively used systems were configured NOT to run anywhere near max capacity, HT was deemed to hurt more than help. Predictability and a crisp understanding of the performance graphs played a large role. It seems a shame to leave the potential power on the table. But oh well. I will not hesitate to turn it on for 'throughput' / high-load batch applications, much as you concluded.

Alzhy>> Herr Hein, yours truly did not meant to "mislead" (as you allege)

Ouch, that sounds intense.
Make that 'Mijnheer Hein' if you must :-).
'van' = Dutch/Belgian, 'von' = German.
The way I read it, which may have been the way others read it also, it seemed to say that you needed a multithreaded application to really exploit HT. The way I understand it, you just need lots of threads ready to run from however many jobs they come, as long as there are often more than the original config can offer. Many Oracle usages have plenty of foreground and background jobs to keep those CPUs occupied.

>> HT being a scheme of duping the OS into thinking there are more real CPUs

No duping. Real CPUs with their own contexts. They just do not get to run all the time. They only get a chance to run when their co-thread has to wait for main memory. For some applications that is 'all the time'. For others it is infrequent. "It depends!"
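A deliberately crude way to put numbers on "it depends" is the toy model below, written in Python. It assumes each thread stalls on memory for a fixed fraction of the time and that the co-thread can use every stalled cycle perfectly. Both the model and the function names are illustrative assumptions, not how any real processor schedules its threads, and the model ignores shared caches and the latch spinning mentioned earlier in the thread, which can make HT hurt.

```python
def core_busy_fraction(stall_fraction: float, threads_per_core: int) -> float:
    """Fraction of time the core does useful work in this idealized model."""
    if not 0.0 <= stall_fraction < 1.0:
        raise ValueError("stall_fraction must be in [0, 1)")
    run_fraction = 1.0 - stall_fraction
    # The core cannot be more than 100% busy, however many threads share it.
    return min(1.0, threads_per_core * run_fraction)


def ht_throughput_gain(stall_fraction: float) -> float:
    """Best-case throughput ratio of two threads per core versus one."""
    return core_busy_fraction(stall_fraction, 2) / core_busy_fraction(stall_fraction, 1)


for stall in (0.05, 0.15, 0.50):
    gain_pct = (ht_throughput_gain(stall) - 1.0) * 100.0
    print(f"stalled {stall:.0%} of the time -> best-case gain {gain_pct:.1f}%")
```

In this model a workload that rarely misses cache gains almost nothing, while one that stalls half the time could in principle double its throughput. Real gains sit below these best-case figures.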

Peace,
Hein.
TwoProc
Honored Contributor

Re: Oracle 10g and turning off/on hyperthreading while db is up

I said that I'd update this thread regarding our findings, and so pardon the length of the posting while I follow through.

We're going to start with the system with hyperthreading turned off during the day, and on at night for big, ugly batch processing. We've done a lot of testing and verifying of databases, and switching hyperthreading off and on doesn't hurt the running database a bit.

Currently, we're going to leave the parallel_max_servers setting in the init.ora at the actual CPU count, and don't plan to double it just because we now "virtually" have twice that. We don't do many parallel queries anyway, much less without specifying the degree of parallelism explicitly. So, this will only have an impact when we kick off something large with unspecified parallelism, like a gather stats on a whole schema for example, wherein we're happy to let the parallelism value float on some of the medium-sized schemas.

FWIW - for our purposes, it is safe to say the conversion ratio for processing on Tukwila coming from PA-8800 chips is double for a nice, fully cached query coming from memory blocks in Oracle. I've seen cases where it's much faster than this, and I've seen some cases where it's not quite up to double. Overall, throughout our simulations, double (well, half, depending on how you look at it) is a good rule of thumb for what we're doing.

Another setting that we've found very important in tuning: setting the SCHED_NOAGE policy (Oracle's hpux_sched_noage parameter) to a value of 178 (and allowing it by setting privileges). This has dramatically lowered latch waits for buffers, enabling processes to really start running/responding well. Also, we've found that turning off NUMA in Oracle 10g was really necessary (due to bugs AND inefficiencies, as per Oracle). We also settled on setting the machine to 37.5% interleaved memory (ILM) and leaving the rest for socket local memory (SLM). We may reduce the allocation to ILM after watching it run live on production data for a week or so.

We've found that setting the multiblock read count (db_file_multiblock_read_count) to 0 or to 16 ends up being almost the same for us. Setting it to zero and letting it float loads up the CPU just a bit more than hard-setting it to 16. Keep in mind that Oracle usually recommends 8. We plan to set it to zero at night and let it float to adjust on its own for whatever types of jobs it's running, and set it to 16 during the day to stabilize CPU consumption and make it a bit lower overall, and a bit more predictable.

We noticed that file I/O wasn't quite matched up, it improved when we set the file system block size to be the same as the tablespace block size on data directories. This was one of those, "well duh" moments, obviously - but sure enough, we missed it on initial setup.

Many of you have seen me, over the years, go on about the "lotsa lun" theory vs the "big lun" theory of using storage arrays. You'll be happy to note that I've agreed to compromises on this, in which I'm not using just a few humongous luns, but, I am using luns that are many, many times the sizes that I used to. We've checked the scsi queue depth and made sure we're OK for this, and it looks great. I've not come up with a name for my compromise on lun sizing, so I don't know what to call it yet. :-)

What I didn't compromise on: RAID 5 and others. Nope, sorry, sticking with RAID 0/1 for performance.


From a lessons-learned perspective, here is what I would do differently if I could: I would have included an Oracle 11g database upgrade in the scope of work. From what I've heard/read, we can try to re-embrace NUMA performance optimizations for Oracle 11g, which is planned for 2011. We thought it was a given that using NUMA was a best practice for Oracle 10g (as advertised), but this is not so. Had we fully known beforehand that it wouldn't work for Oracle 10g, we would have included an 11g upgrade in the scope of work for the move to the new platform, to try and get more out of the machine from the start. Of course, we don't know how much performance we're missing out on by not taking advantage of that technology. By the time we realized that it was a missed opportunity, it would have wasted too much project time to backpedal acceptance testing to work in an 11g upgrade, especially given that the potential gains (if any) are unknown in size.

Naturally, depending on what you're doing, your decisions and crux points will be different, these are just the things we've come up with after running test scenarios, and reviewing our current systems behavior. Naturally, there are lots of others that I'm not discussing here, I'm choosing to bring out the ones that stood out in my mind in this posting over what I was doing before, or what I missed. I hope that it gives some folks some things to think about (in a helpful way) in their conversion efforts.
We are the people our parents warned us about --Jimmy Buffett
TwoProc
Honored Contributor

Re: Oracle 10g and turning off/on hyperthreading while db is up

Hein, once again, thanks for the reply above my big summary posting. Your input is always valued/appreciated especially when it comes to tuning! :-)

Anyways, just to clarify my comment on the Hawthorne effect. The numbers we were using for waits were ones that we came up with by observing the users while they performed their jobs. The Hawthorne effect would suggest that they are doing their job "harder" because they are being watched. Therefore, when we put in "realistic" timings from a bit of time-and-motion type study, and then used them across our "sim" users, the load was much higher than we actually see from the same number of users. Even adding in timing waits between recorded events still left us with a simulation that was much "larger" than what we expected. Much, much larger! Even though we are running the exact same code on the exact same hardware, on the same size configured databases and disk systems!

And, I've got one more that I bet you've seen in trying to model these things. Instead of nice load builds / decreases in these tests, we see huge swings up and down during the tests. What we've begun to learn is that our sims aren't random enough, and the sim users start to have harmonics, in which they start to hit lulls together, and start to hit high processing requests together, and the loads look more like rough seas than gentle waves of change over time. It reminds me of a resonance test on a piece of material, wherein certain frequencies of force can move materials a great deal, even though the total forces applied are really the same. The only difference I've seen is that the resonant frequencies seem easy to hit at multiple points, instead of only happening at very cleanly defined points as in a materials test. Maybe it's more like when you look at a graph of human voice harmonies: even though they are not super clean like machine-created ones, they are still highly evident.
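The lockstep effect described here can be reproduced with a small, self-contained Python sketch. It is not LoadRunner code and makes simplifying assumptions: all simulated users start at the same instant, each request is instantaneous, and the "jittered" case draws think times from an exponential distribution with the same mean. The function name and the parameter values (250 users, 10-second think time, 10-minute run) are chosen for illustration only.

```python
import random
from collections import Counter


def arrivals_per_second(n_users, duration_s, think_time_s, rng, jitter=False):
    """Count simulated requests that land in each one-second bucket."""
    buckets = Counter()
    for _ in range(n_users):
        t = 0.0
        while True:
            # A fixed think time keeps every user in lockstep; exponential
            # draws with the same mean spread the requests out over time.
            t += rng.expovariate(1.0 / think_time_s) if jitter else think_time_s
            if t >= duration_s:
                break
            buckets[int(t)] += 1
    return buckets


rng = random.Random(7)  # fixed seed so the run is repeatable
fixed = arrivals_per_second(250, 600, 10.0, rng, jitter=False)
jittered = arrivals_per_second(250, 600, 10.0, rng, jitter=True)

print("peak requests/s with fixed think time:   ", max(fixed.values()))
print("peak requests/s with jittered think time:", max(jittered.values()))
```

With a fixed think time, every user fires in the same second and the load arrives as spikes separated by lulls. With randomized think times of the same average, the same overall request rate is spread far more evenly.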
We are the people our parents warned us about --Jimmy Buffett
Alzhy
Honored Contributor

Re: Oracle 10g and turning off/on hyperthreading while db is up

TwoProc/Herr Hein,

Have either of you two gents used SwingBench, and do you "believe" in its credibility as a load/benchmark test suite?

These days -- that's how I usually do benchmarks and stress tests. I am not a DBA, but I have managed to make use of it by following recipes out there for DB tuning and by referring to configs found on TPC.

TPC suite is expensive and Mercury too...
Hakuna Matata.
TwoProc
Honored Contributor

Re: Oracle 10g and turning off/on hyperthreading while db is up

I'd use it if it was all I had. But you can't beat a copy of production and production code.
We are the people our parents warned us about --Jimmy Buffett

Re: Oracle 10g and turning off/on hyperthreading while db is up

TP,

Lots of interesting results there. It seems you have Oracle data on filesystems rather than raw. Out of interest, did you try any tests with/without the new Concurrent I/O mount option for VxFS?

Cheers,

D

TwoProc
Honored Contributor

Re: Oracle 10g and turning off/on hyperthreading while db is up

You know Duncan, I'm glad you brought that up! We sure did!

Mount point options for all data areas, plus redo logs and archive logs:

delaylog,nodatainlog,cio,mincache=direct,convosync=direct,largefiles

Thank you!
We are the people our parents warned us about --Jimmy Buffett