Server Clustering

Problem with HPC when running big programs

 
ArAgOnHaMz
Occasional Contributor


Hi,

 

I'm not an expert on this matter, but I would really appreciate it if somebody could help me find a solution.

I have an HPC with the master showing as below:

 

[root@masterserver ~]# df -H
Filesystem                Size   Used  Avail Use% Mounted on
/dev/cciss/c0d0p6          21G    17G   3.8G  82% /
/dev/cciss/c0d0p9          70G   8.2G    58G  13% /opt
/dev/cciss/c0d0p8          11G   159M   9.8G   2% /tmp
/dev/cciss/c0d0p5          32G   7.1G    23G  24% /usr
/dev/cciss/c0d0p3          32G   2.2G    28G   8% /var
/dev/cciss/c0d0p2         100G    11G    85G  11% /export
/dev/cciss/c0d0p1         5.3G   181M   4.8G   4% /boot
tmpfs                      17G      0    17G   0% /dev/shm
tmpfs                     8.3G    19M   8.3G   1% /var/lib/ganglia/rrds
192.168.1.2:/global/home  9.6T    31G   9.1T   1% /global/home
192.168.2.2:/global/apps  8.3T    24G   7.9T   1% /global/apps

*****************************

 

On the storage server it shows:

[root@storageserver ~]# df -H
Filesystem             Size   Used  Avail Use% Mounted on
/dev/cciss/c0d0p6       21G    20G      0 100% /
/dev/cciss/c0d0p7       11G   158M   9.8G   2% /tmp
/dev/cciss/c0d0p5       21G   2.2G    18G  11% /usr
/dev/cciss/c0d0p3       21G   7.0G    13G  36% /var
/dev/cciss/c0d0p2       21G   3.7G    17G  19% /opt
/dev/cciss/c0d0p1      2.1G    43M   2.0G   3% /boot
tmpfs                  4.2G      0   4.2G   0% /dev/shm
/dev/mapper/vg0-lv0    9.6T    31G   9.1T   1% /global/home
/dev/mapper/vg1-lv1    8.3T    24G   7.9T   1% /global/apps

******************************

 

Now, when I run any big program, the program halts, and when I run df -H again on the master, the lines for the /global/home and /global/apps mounts are missing:

[root@paramsheersh ~]# df -H
Filesystem             Size   Used  Avail Use% Mounted on
/dev/cciss/c0d0p6       21G    17G   3.7G  82% /
/dev/cciss/c0d0p9       70G   8.2G    58G  13% /opt
/dev/cciss/c0d0p8       11G   159M   9.8G   2% /tmp
/dev/cciss/c0d0p5       32G   7.1G    23G  24% /usr
/dev/cciss/c0d0p3       32G   2.2G    28G   8% /var
/dev/cciss/c0d0p2      100G    11G    85G  11% /export
/dev/cciss/c0d0p1      5.3G   181M   4.8G   4% /boot
tmpfs                   17G      0    17G   0% /dev/shm
tmpfs                  8.3G    19M   8.3G   1% /var/lib/ganglia/rrds

*********************************

 

I am not able to log in to the storageserver at all. The login hangs and stays that way until I forcibly shut the server down.

My guess is that the connection to the storageserver is lost when running big programs. The cause might be a space issue, possibly the root filesystem /dev/cciss/c0d0p6 on the storageserver being at 100% usage.
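To make the space check repeatable, something like the following could be used. It is only a rough sketch, not a confirmed diagnosis: the function name full_filesystems and the 95% threshold are made up for illustration, and it assumes df prints one filesystem per line with Use% in the fifth column (as df -P does).

```shell
# Hypothetical helper: read df-style output on stdin and print
# "<mount point> <Use%>" for every filesystem at or above a threshold.
# Assumes 6 whitespace-separated columns with Use% in column 5.
full_filesystems() {
    threshold="${1:-90}"   # default threshold is an arbitrary assumption
    awk -v t="$threshold" '
        NR > 1 && NF >= 6 {          # skip the header line
            use = $5
            sub(/%/, "", use)        # "100%" -> "100"
            if (use + 0 >= t) print $6, $5
        }'
}

# Example usage on a live system (run on the storageserver):
#   df -P | full_filesystems 95
```

Fed the storageserver output above, this would flag only the root filesystem, which matches the 100% line in the listing.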

Please help me with this.

 

Thanks in advance and Regards,

Hamar 

 

P.S. This thread has been moved from Servers>General to Server Clustering. -HP Forum Moderator