Kevin Raven (UK)
Frequent Advisor

OpenVMS System stats collection

What tools do you use?

We are looking for a tool that will allow the following:

Data collection on a running OpenVMS system, like the CA data collector provides, but one that will allow later investigation of perceived application hangs as short as 0.5 seconds.
From what I remember of the DEC/Unicenter/CA performance suite tools, you could not get granularity that fine. I think you could only go down to about 10 seconds, and even then you would end up with huge CPD files.

Does anyone use any tools that would give me what I'm looking for?

Regards
Kevin
labadie_1
Honored Contributor
Solution

Re: OpenVMS System stats collection

HP Perfdat may not be too far from what you want. The architecture document at
http://hpperfdat.compinia.com/DOCUMENTATION/PerfDat_Arch_Tech_v33.pdf
says this about the data collector:

sample interval is freely definable (minimum 1 second)
Kevin Raven (UK)
Frequent Advisor

Re: OpenVMS System stats collection

I will check out the HP Perfdat docs.

Thanks
Ian Miller.
Honored Contributor

Re: OpenVMS System stats collection

Look at T4, which has various collectors that support small intervals, although I don't know whether they go that low.

You may have to write something based on TDC or pay someone to write something.
____________________
Purely Personal Opinion
Robert Gezelter
Honored Contributor

Re: OpenVMS System stats collection

Kevin,

Perfdat and T4 are worth solid investigation.

- Bob Gezelter, http://www.rlgsc.com
Jon Pinkley
Honored Contributor

Re: OpenVMS System stats collection

Kevin,

Although this doesn't answer your question, it does provide some guidelines on the type of tools that you choose. You don't state if there is a specific application you are attempting to measure, or if you are trying to determine the cause of 0.5-second deviation in response time for any arbitrary application.

Is this system dedicated to a specific application, or is it a general-purpose time-sharing system? In other words, why is 0.5 seconds considered unacceptable? Are there times when slow response is acceptable?

Several things you need to consider.

1. How do your users connect to the system? If they connect via a network, some of the delay may be network related, and nothing you measure at the VMS system will include that delay.

2. For measuring short-duration events, you should consider event-based measurements. Google "Nyquist" for the sampling periods needed to detect events with polling. Also, when you try to measure short-duration events with polling, the sampling itself is likely to affect the results (Google "observer effect"). Event-based measurement will probably require modifying the application to insert instrumentation hooks, but it will give you the most accurate measurement. Remember point #1, though: these times will not include any time spent waiting for the network to deliver packets. What they can show is that the delay was not due to VMS.

3. If you want to measure the delays seen by the user, you will need to instrument the application from the PC's perspective. If the user interface is telnet, that is not going to be straightforward (at least I can't think of an easy way to do it). You could have something at the PC that pings the host and measures the response time or, if possible, measures the time for acks to arrive for packets sent by the PC's TCP/IP stack (again, I don't know how that would be done, other than with wireshark/ethereal).
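
As a rough sketch of the client-side idea in point 3, the PC could time a TCP connect to the host at regular intervals. This is only an illustration: the host name and port are placeholders you would supply, and connect time is a proxy for network round trip, not for application response time.

```python
import socket
import time

def tcp_connect_time(host, port, timeout=2.0):
    """Return the seconds taken to complete a TCP connect, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        # Refused, unreachable, or timed out: record the sample as missing.
        return None

def probe(host, port, count=5, pause=0.1):
    """Take `count` connect-time samples, pausing `pause` seconds between them."""
    samples = []
    for _ in range(count):
        samples.append(tcp_connect_time(host, port))
        time.sleep(pause)
    return samples
```

Logging these samples with timestamps on the PC would let you compare them later against whatever is collected on the VMS side.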

You should download the T4 & Friends package, as it is free and will let you see the overall conditions at the time you are investigating. I don't think it is going to give the fine-granularity "micro" picture you are looking for.

At the other end of the spectrum from T4 are the SDA extensions like PCS and PRF (and on Alpha, DCPI), which give a much higher-resolution view into what is using the CPU, although they won't provide much information about things like I/O and locking delays.

Have fun,

Jon
it depends
Ian Miller.
Honored Contributor

Re: OpenVMS System stats collection

There is some event-based sampling in TDC:
http://h71000.www7.hp.com/openvms/products/tdc/

but you will have to develop some software to use TDC.
____________________
Purely Personal Opinion
Robert Gezelter
Honored Contributor

Re: OpenVMS System stats collection

Kevin,

Jon makes some good points.

Retrospectively resolving a 0.5 second perceived gap in response is a challenge. Sufficient data will need to be collected at many points.

Collecting data at resolutions finer than 0.5 second requires some thought in advance. It is likely that the actual sampling resolution will need to be between 0.1 second and 0.25 second in order to accurately identify anomalies occurring over a period of 0.5 second. Depending on the scope of information gathered, there are several issues, including:

- distortion caused by the sampling and recording process
- gathering sufficient information to examine the situation retrospectively in a constructive manner
- ensuring that the data collection process and the analysis process do not produce misleading artifacts
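
To make the sampling-resolution point concrete, here is a small sketch that counts how many polling instants fall inside a 0.5-second stall for several polling intervals. The stall start time is an arbitrary illustrative value, and times are in whole milliseconds to avoid floating-point rounding.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)

def samples_inside(stall_start_ms, stall_len_ms, interval_ms):
    """Count polling instants (multiples of interval_ms) that fall in
    the half-open window [stall_start, stall_start + stall_len)."""
    first = ceil_div(stall_start_ms, interval_ms)
    end = ceil_div(stall_start_ms + stall_len_ms, interval_ms)
    return max(0, end - first)

# A 0.5 s stall that happens to start 10.3 s into the run (illustrative).
for interval_ms in (1000, 500, 250, 100):
    hits = samples_inside(10_300, 500, interval_ms)
    print(f"{interval_ms:>5} ms polling: {hits} sample(s) inside the stall")
```

With 1-second polling this stall is missed entirely, and with 0.5-second polling only one sample lands inside it, which is not enough to bracket its start and end. At 0.25 or 0.1 second there are two to five samples, which fits the range suggested above.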

The correct solution may not be a simple package. As an example, T4 can be used for the analysis of data that is collected outside of the normal tools. Network performance and traffic may need to be collected using a LAN monitor (e.g. WireShark) for later correlation with performance data.

Having data collection tools developed specifically for this situation is certainly a valid option. Deeper planning of the situation is also likely called for. (Disclosure: our firm has done these types of projects.)

Looking backward after an event is an interesting problem in a variety of spheres. The challenge is that the information must be captured going forward; once the event has occurred, there is often no potential to capture useful information.

- Bob Gezelter, http://www.rlgsc.com
Ian Miller.
Honored Contributor

Re: OpenVMS System stats collection

Agreeing with Bob, I think you need event based sampling.
____________________
Purely Personal Opinion
Wim Van den Wyngaert
Honored Contributor

Re: OpenVMS System stats collection

Very good question again.

You would need to trace the network too (to find out what the other party was saying).

And trace all the IO (to find out what the controller was saying).

And when you have all that, how to find what really happened ? E.g. someone unplugged a cable for a second, cpu taken by a real time process (or a system manager) on another node, ...

Wim
Wim
Kevin Raven (UK)
Frequent Advisor

Re: OpenVMS System stats collection

Thanks for all the replies and advice so far.
This looks like more of a task than it did at first.
I will collate the responses and feed back to management.
It does look like a custom tool will need to be developed if we are serious about gathering good data for future analysis.

Thanks
Kevin
Wim Van den Wyngaert
Honored Contributor

Re: OpenVMS System stats collection

clarklk
Advisor

Re: OpenVMS System stats collection

Regarding a previous comment that TDC might be used to gather data at 0.5-second intervals...

Yes, that can be done. You would need to write a simple C/C++ application that could be easily derived from various examples provided with the SDK in the kit at http://www.hp.com/products/openvms/tdc.

Your application would be responsible for the timing of data collections (the built-in timer has a 1-second resolution); it would call TDC_COLLECT_SNAPSHOT() at expiration of each timed interval.
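
The timing loop itself might look roughly like the sketch below. It is written in Python rather than C purely for illustration and does not use the TDC API: `collect` is a hypothetical stand-in for whatever your application does at each interval (for example, the call to TDC_COLLECT_SNAPSHOT()). The loop schedules against absolute deadlines so that the time spent collecting does not accumulate as drift.

```python
import time

def run_collector(collect, interval=0.5, iterations=4,
                  clock=time.monotonic, sleep=time.sleep):
    """Call collect() every `interval` seconds, `iterations` times.

    Returns how late each collection started relative to its deadline.
    """
    start = clock()
    lateness = []
    for n in range(iterations):
        deadline = start + n * interval      # absolute, so errors do not add up
        delay = deadline - clock()
        if delay > 0:
            sleep(delay)
        lateness.append(clock() - deadline)  # > 0 means this sample fired late
        collect()
    return lateness
```

Recording the lateness values is worthwhile in its own right: if the collector cannot keep to its schedule, that is a hint that the sampling is too heavy for the chosen interval.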

You would want to be careful in your selection of data records to be collected because 1) TDC collects lots of metrics and can create VERY large files (particularly if collecting at .5-sec intervals over an extended period), and 2) collecting all TDC metrics at the frequency you cite could certainly have system performance implications. Selection of data records to be collected can be easily controlled by your application (again, code examples and documentation are provided in the SDK).
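
A back-of-envelope calculation shows how quickly the file size grows as the interval shrinks. The record count and record size below are made-up figures for illustration only, not TDC numbers; substitute values measured from your own collection.

```python
def bytes_per_day(interval_s, records_per_sample, bytes_per_record):
    """Estimate the raw data volume written in 24 hours of collection."""
    samples_per_day = 86400 / interval_s
    return samples_per_day * records_per_sample * bytes_per_record

# Assumed for illustration: 200 records of 100 bytes each per sample.
for interval in (10, 1, 0.5):
    mb = bytes_per_day(interval, 200, 100) / 1e6
    print(f"{interval:>4} s interval: {mb:,.0f} MB/day")
```

Whatever the real record sizes are, going from a 10-second to a 0.5-second interval multiplies the volume by 20, which is why selecting only the records you need matters.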

You could further elaborate on your application to pull the data of interest at each collection point via TDC's API and store only that data in your own file without creating a TDC data file (I know of one commercial application that does exactly that). CSV might be a suitable format for such a file, as might T4's TLC format.
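
As a sketch of the "store only the data of interest" approach, the fragment below writes a few selected values per sample to CSV. The metric names are hypothetical, and the dictionary stands in for whatever your application pulls from the collector at each collection point.

```python
import csv
import io

# Hypothetical metric names; substitute whatever your collector exposes.
FIELDS = ["cpu_busy_pct", "direct_io_rate", "free_pages"]

def write_header(writer, fields=FIELDS):
    """Write the column names, with the sample time first."""
    writer.writerow(["time_s"] + list(fields))

def write_sample(writer, time_s, metrics, fields=FIELDS):
    """Write one row holding only the selected fields; extras are dropped."""
    writer.writerow([f"{time_s:.3f}"] + [metrics[name] for name in fields])

buf = io.StringIO()            # stands in for an open output file
out = csv.writer(buf)
write_header(out)
write_sample(out, 0.5, {"cpu_busy_pct": 12.5, "direct_io_rate": 340,
                        "free_pages": 81234, "ignored_metric": 1})
```

Keeping the timestamp to millisecond precision costs little and makes later correlation with network traces easier.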

Lee Clark
OpenVMS engineering
Jon Pinkley
Honored Contributor

Re: OpenVMS System stats collection

Just a couple more comments on user perceived delays.

If humans are noticing the delays, then it is probably related to how long it takes to echo characters to the screen.

If the characters are being echoed by the VMS terminal driver, the delays the user sees are almost guaranteed to be due to the network, since the character processing is done by code at elevated IPL. Some applications do their own character echoing: they do single-character I/O without echo, then process the character and send it back. There can be noticeable delays in those applications, especially if resources (memory, processor, I/O) are overutilized.

Also, for readers that may not be familiar with all the terminology, here are some definitions.

Polling - measurements made periodically, usually reading a set of event counters, or in the case of CPU, determining the process that was active at the time of the interrupt triggering the taking of a sample. In general with polling, no changes are needed to the application.

Event-based - something in the code that records each occurrence of an event. E.g. each time an I/O is completed, a counter can be incremented.

Transaction timing can be accomplished by taking a snapshot of the time, and perhaps some other items, right before starting an operation, and another snapshot when the operation is complete. The elapsed time and the resources used during the operation can then be determined by taking the difference between the starting and ending values. A relatively simple example is the set of routines lib$init_timer, lib$stat_timer, and lib$show_timer.
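
The before/after snapshot pattern can be sketched as follows. This is a Python analogue for illustration, not the LIB$ routines themselves: it snapshots wall-clock and process CPU time around an operation and records the differences. The label and the sleep are placeholders for a real transaction.

```python
import time
from contextlib import contextmanager

@contextmanager
def transaction_timer(results, label):
    """Snapshot wall and CPU time around a block and append the deltas."""
    wall0 = time.perf_counter()
    cpu0 = time.process_time()
    try:
        yield
    finally:
        results.append({
            "label": label,
            "elapsed_s": time.perf_counter() - wall0,
            "cpu_s": time.process_time() - cpu0,
        })

results = []
with transaction_timer(results, "lookup"):
    time.sleep(0.05)           # placeholder for the operation being timed
```

A large gap between elapsed time and CPU time, as in this sleeping example, points at waiting (I/O, locks, network) rather than computation.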

Note that, unlike polling, event-based sampling requires modifications to the code at the locations where you want to record things. The VMS operating system already has many event-based counters, which can be read by timer-based polling routines; in fact, these are the main source of the information that MONITOR and TDC report.
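
The relationship between the two approaches can be sketched in a few lines: instrumented code increments counters as events happen, and a poller turns successive snapshots into per-interval deltas. The class and event names here are invented for illustration and do not correspond to any VMS data structure.

```python
class EventCounters:
    """Counters incremented at the point where each event occurs."""

    def __init__(self):
        self.counts = {}

    def record(self, name):
        self.counts[name] = self.counts.get(name, 0) + 1

    def snapshot(self):
        """Return a copy, as a poller would read the counters."""
        return dict(self.counts)

def delta(previous, current):
    """Events that occurred between two snapshots."""
    return {k: current[k] - previous.get(k, 0) for k in current}

counters = EventCounters()
before = counters.snapshot()
for _ in range(3):
    counters.record("io_done")       # event-based: one call per I/O completion
counters.record("lock_wait")
after = counters.snapshot()
print(delta(before, after))          # polling: what changed in this interval
```

The counters never miss an event, but the poller only sees totals per interval, so the timing of an individual event is still no finer than the polling period.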

Jon
it depends