
Task to task

 
Wim Van den Wyngaert
Honored Contributor

Task to task

I am using remote file access (node::file) between 150 nodes and 1 central node. All of them use the SYSTEM username.

When the node or the network is slow, the clients receive "network partner exited" and the server side logs "file not accessed on channel".

I think the problem is related to the fact that a number of clients share the same server process. The clients are doing f$search("x::*.com") directed at the server and thus have a search context in the server process.

Is there a way to get 1 process per client instead of re-using processes? Or is there another workaround?
Wim
37 REPLIES
Wim Van den Wyngaert
Honored Contributor

Re: Task to task

Sorry for the bad title. Remote access would have been better.
Wim
Uwe Zessin
Honored Contributor

Re: Task to task

It sounds like you are getting timeouts when the 'node or network is slow'. I don't think you should stop re-using the FAL server process - that would make matters worse, because you incur the additional delay of process creation whenever you open a new file.

Perhaps you can rewrite your procedures to run the F$SEARCH locally on the server and transfer the results via $WRITE/READ. Then you can put better error handling around it.
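A minimal sketch of what such a rewrite could look like, assuming a DECnet task-to-task link between two command procedures. The procedure name SEARCHSRV.COM, the "*END*" marker and the node name x are hypothetical; the sketch also assumes the procedure sits in the default directory of the account that receives the connection.

Server side (SEARCHSRV.COM):

```dcl
$ ! Runs on the server when a client connects to the task.
$ ! The wildcard search is done locally; one file name is sent per line.
$ open/read/write link sys$net
$ next:
$ file = f$search("*.com")
$ if file .eqs. "" then goto done
$ write link file
$ goto next
$ done:
$ write link "*END*"        ! hypothetical end-of-list marker
$ close link
$ exit
```

Client side:

```dcl
$ ! Connect to the task on node x and read names until the end marker.
$ open/read/write/error=no_link link x::"TASK=SEARCHSRV"
$ next:
$ read/error=lost/end_of_file=lost link file
$ if file .eqs. "*END*" then goto done
$ write sys$output file
$ goto next
$ lost:
$ write sys$output "Link to server lost before the list was complete"
$ done:
$ close link
$ exit
$ no_link:
$ write sys$output "Could not connect to the server task"
$ exit
```

With this layout the client sees a lost link as an ordinary /ERROR branch and can decide to retry, instead of aborting in the middle of an F$SEARCH loop.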

You might be able to tweak some DECnet counters, but that is a system-wide setup. Sorry, can't tell you which knobs to turn - somebody else?
.
Wim Van den Wyngaert
Honored Contributor

Re: Task to task

Uwe,

Indeed I can rewrite the code. But I prefer not to.

I did some testing and found that the processes were not reused while the f$search was in progress. It must be something else. Some timeout? Some locking problem?
Btw: I'm using DECnet-Plus.
Wim
Uwe Zessin
Honored Contributor

Re: Task to task

OK, if the processes don't get reused then I assume that the F$SEARCH loop is 'too fast'.
There is some processing involved when the I/O channel is closed - see SYS$SYSTEM:NETSERVER.COM. During that time the server process obviously cannot take over a connection request :-(
.
Wim Van den Wyngaert
Honored Contributor

Re: Task to task

Uwe,

Further investigation (via accounting) indicates that the same problem arose for another (just 1!) process doing /out=xx:file.
Wim
Uwe Zessin
Honored Contributor

Re: Task to task

Well, what is the process doing exactly? If it is opening 2 files within a short amount of time, you will get 2 server processes. Put something like these commands in a command procedure and run it:

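$ ! Open and close a remote file twice in quick succession (node 0 is the local node)
$ open /write LL 0::tmp.tmp
$ close LL
$ ! The second open follows immediately after the first close
$ open /write LL 0::tmp.tmp
$ close LL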

On the other hand: what happens on the server when you create one file this way? Do you get a process named "FAL_xxxx" which changes after some time to "SERVER_xxxx"? Or do these processes quit right after you close the I/O channel?
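One way to watch for this, as a rough sketch: list the processes whose names match the two patterns mentioned above and run the procedure a few times while the test is active. It assumes the process names really start with FAL_ and SERVER_ on your system and that you have WORLD privilege to see processes owned by other users.

```dcl
$ ! List network server processes on this node with a time stamp.
$ set noon                  ! keep going if a process exits while we look
$ ctx = ""
$ temp = f$context("PROCESS", ctx, "PRCNAM", "FAL_*,SERVER_*", "EQL")
$ next:
$ pid = f$pid(ctx)
$ if pid .eqs. "" then goto done
$ write sys$output f$time(), "  ", pid, "  ", f$getjpi(pid, "PRCNAM")
$ goto next
$ done:
$ exit
```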
.
Antoniov.
Honored Contributor

Re: Task to task

Hi Wim,
I don't know whether you are using DECnet Phase IV or Phase V, so I am answering from what I know about DECnet Phase IV.
Be careful when you change any parameter. If I remember correctly, you can change the executor values; these are the relevant parameters:
NCP>SET EXEC INCOMING TIMER nn
NCP>SET EXEC INACTIVITY TIMER nn
NCP>SET EXEC MAX LINK nn
Before changing any of them, check the current values by typing
NCP>SHOW EXEC CHAR
You could increase MAX LINK (the number of processes linked to external clients), increase INCOMING TIMER (the time-out after which an incoming connection is rejected) and decrease INACTIVITY TIMER (how long an inactive process stays alive).
The SET command changes the running system, so you can check the result at once, but you need to issue the matching DEFINE command to make the change permanent.
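As an illustration only, a session for one of these parameters on a Phase IV node might look like the following. The value 60 is an arbitrary placeholder, not a recommendation.

```dcl
$ run sys$system:ncp
NCP> SHOW EXECUTOR CHARACTERISTICS
NCP> SET EXECUTOR INCOMING TIMER 60
NCP> DEFINE EXECUTOR INCOMING TIMER 60
NCP> LIST EXECUTOR CHARACTERISTICS
NCP> EXIT
```

SHOW displays the running (volatile) values and LIST the permanent ones, so the last LIST confirms that the DEFINE was stored.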

H.T.H.
@Antoniov
Antonio Maria Vigliotti
Antoniov.
Honored Contributor

Re: Task to task

Sorry,
I started writing when there was only 1 answer and finished after you posted that you use DECnet-Plus.
I think you can find the equivalents of the commands I posted in the DECnet-Plus documentation.

Regards
@Antoniov
Antonio Maria Vigliotti
Wim Van den Wyngaert
Honored Contributor

Re: Task to task

Antoniov,

I simulated the problem by starting a process at priority 8 while running 10 jobs that do remote access in a loop to the node itself.

All jobs terminated with "network partner exited" and all net$server.log said "file not accessed on channel".
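For anyone who wants to reproduce this, a loop job of that kind could be sketched as follows. This is a guess at the shape of such a test, not the procedure actually used; the name REMLOOP.COM is hypothetical, and the search goes to the default directory of the receiving account on node 0.

```dcl
$ ! REMLOOP.COM: repeat a remote wildcard search against the local node.
$ pass = 0
$ again:
$ pass = pass + 1
$ found = 0
$ next:
$ file = f$search("0::*.com")
$ if file .eqs. "" then goto report
$ found = found + 1
$ goto next
$ report:
$ write sys$output f$time(), " pass ", pass, " found ", found
$ goto again
```

The procedure has no error handling on purpose, so a network error ends the job and the last lines of its log show when the link was lost.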

It could indeed be the timeout values you mention. But they are already at 45 and 60 seconds (mc ncl show ses con all).

Uwe : the task is created and stays until the f$search is done.
Wim
Lokesh_2
Esteemed Contributor

Re: Task to task

Hi,

and here is what I found:

If your system is too busy to get FAL started soon enough to
successfully complete the logical link, or if there is too much "stuff"
going on in SYSLOGIN.COM or LOGIN.COM that FAL doesn't start before the
delay timers run out, you may experience this error message.

The time-stamps in the NETSERVER.LOG files may indicate the delays between
the first stamp and the startup time for FAL. A LOGIN.COM which includes
a WAIT for 1 minute can result in a SYSTEM-F-FILNOTACC error in the
NETSERVER.LOG file.
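A common way to keep login procedures out of the picture is to leave them early for network processes. A minimal sketch, to be placed near the top of LOGIN.COM (and, if needed, SYLOGIN.COM):

```dcl
$ ! Skip interactive setup when the process was created for a network connection.
$ if f$mode() .eqs. "NETWORK" then exit
```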


HTH,
Best regards,
Lokesh
What would you do with your life if you knew you could not fail?
Wim Van den Wyngaert
Honored Contributor

Re: Task to task

Antoniov,

Nope. The creation of the process goes fine. After serving a few operations it gives the channel error.
The timeout abort comes about 2 to 5 minutes after the last FAL request message in net$server.log.
After all batch jobs had aborted, there were about 10 server processes active.
There is still something wrong.
Wim
Antoniov.
Honored Contributor

Re: Task to task

Hi Wim,
I don't know DECnet-Plus, so please verify my information, but I think task to task works as in DECnet Phase IV: the server process starts, answers the client request and stays alive for a short time (1-5 min), waiting for a new client request.
In your environment, where you have 150 clients, you could perhaps set 100-150 links and check the quotas of the SYSTEM account in AUTHORIZE to make sure they can support all those processes.

@Antoniov
Antonio Maria Vigliotti
Stanley F Quayle
Valued Contributor

Re: Task to task

Strange question: Why don't you cluster those 150 nodes (VOTES=0) with the central node (VOTES>0)? Then you won't have to do any DECnet operations at all -- all files will be "local".
http://www.stanq.com/charon-vax.html
Wim Van den Wyngaert
Honored Contributor

Re: Task to task

SFQ,

Question : How many nodes can be put in a cluster ?
Wim
Lokesh_2
Esteemed Contributor

Re: Task to task

An OpenVMS Cluster system cannot contain more than 96 Alpha and VAX (combined total) nodes.
What would you do with your life if you knew you could not fail?
Stanley F Quayle
Valued Contributor

Re: Task to task

The official maximum is 96. My understanding is that there are that many VMS machines at the test center in Nashua.

However, there are several installations with many more nodes. Check the Ask the Wizard site and past postings in comp.os.vms for some potential pointers.

Since the node table has 255 entries, that's the hard upper limit.

For a supported system, you could cluster 95 satellites and then use DECnet for the remainder...
http://www.stanq.com/charon-vax.html
Uwe Zessin
Honored Contributor

Re: Task to task

Hm, I thought 96 nodes per VMScluster is the total supported limit due to the size of the lock value block which is used by the DECnet cluster alias.

An interesting bit of arithmetic: if I recall correctly, the LVB is max. 128 bytes in size.
- 128 bytes * 8 bits/byte = 1024 bits
- 1024 bits / 96 nodes = 10 bits/node (rounded down)
- the node part of a DECnet address (area.1-1023) is 10 bits

I remember those discussions from the DECUS days when customers were complaining that they had to buy a DECnet router license for one of their cluster members. (political requirement to fulfill what is offered in the software product description - I was told).
.
Stanley F Quayle
Valued Contributor

Re: Task to task

If you search comp.os.vms for postings by Steve Hoffman (who also does Ask the Wizard), you'll find mention of at least 115 node clusters, at undisclosed government installations. :-)

The difference is between supported and not-supported installations.

Keith Parris has some good stuff on this topic as well.

Building Large Local Area VAXcluster (LAVC) Configurations
http://www.geocities.com/keithparris/decus_presentations/biglavc_article.ps

His overall web page has lots more useful stuff, too:

http://www.geocities.com/keithparris/

http://www.stanq.com/charon-vax.html
Uwe Zessin
Honored Contributor

Re: Task to task

Stanley, did you respond to me? I thought I had made it clear that I was talking about the _supported_ limit.
.
Stanley F Quayle
Valued Contributor

Re: Task to task

Sorry to not respond more directly to your posting.

I'm not sure there's a one-to-one mapping between DECnet addresses and lock information. I'd have to dig out my internals books, and that sounds like work. ;-)

The big picture is that the original poster COULD solve this problem with clustering, not that he SHOULD do it this way.
http://www.stanq.com/charon-vax.html
Uwe Zessin
Honored Contributor

Re: Task to task

I see. I apologize if I came across as harsh.
.

Re: Task to task

Hi Wim

Another idea: have a look at the NETSERVER$TIMEOUT logical on the server side. If it's not defined, the timeout value is about 5 minutes. You can assign a VMS delta time to it and keep the netserver up for more than 5 minutes. This prevents a slow system from constantly creating and deleting processes.
E.g.

$ DEFINE/SYS/EXEC NETSERVER$TIMEOUT -
"0 00:01:00.00"

would keep the FAL process up for 1 hour.

Regards
Juerg

Re: Task to task


correction:

$ DEFINE/SYS/EXEC NETSERVER$TIMEOUT -
"0 01:00:00.00"

would keep the FAL process up for 1 hour.
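To see what is currently in effect on the server, a small check along these lines should do. It only looks in the system logical name table, which is where the DEFINE/SYS above puts the name.

```dcl
$ ! Show the current NETSERVER$TIMEOUT setting, if any.
$ value = f$trnlnm("NETSERVER$TIMEOUT", "LNM$SYSTEM_TABLE")
$ if value .eqs. "" then write sys$output "NETSERVER$TIMEOUT is not defined"
$ if value .nes. "" then write sys$output "NETSERVER$TIMEOUT = ", value
```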

Regards
Juerg
Wim Van den Wyngaert
Honored Contributor

Re: Task to task

Juerg,

The timeout is at 00:05 and not that many processes reach the timeout.
The 150 nodes do their accesses at random times within a 60-minute interval.

But to be complete :
1) At 5:30 a power cut is done. After that ALL 10 cluster stations start booting. Because of the power cut, the cluster stations are stopped without a proper dismount. The server starts a shadow merge (1 on each node) for two 9 GB disks. Between 5:30 and 6:00 a defragmentation was also active, as well as some Sybase database dumping.

2) The 10 cluster stations start booting. Because the system disk is in shadow merge, they all have to do their reads twice. I checked and found that each of them transfers 350 MByte over its 10 Mbit network interface (normally 200). The last boot finished at 8:00 !!!

3) At 6:00 the first remote accesses arrive. Each node requires about 20 remote accesses. The system is slow and I think we get timeouts.

BUT MY QUESTIONS ARE :
1) Why don't we get timeout messages instead of the channel message?
2) How is it possible that these processes don't get any CPU (the shadow server is running at priority 1)?
3) Why don't I get the message after less than 1 minute (the DECnet timeout)?
4) Why is the I/O done on behalf of the cluster stations not visible (e.g. as a VMS process)? Only VPA reports "virtual IO".
Wim