Server Clustering

Problem with HPC when running big programs




I'm not an expert on this matter, but I would really appreciate it if somebody could help me find a solution.

I have an HPC cluster. On the master, df -H shows the following:


[root@masterserver ~]# df -H
Filesystem Size Used Avail Use% Mounted on
/dev/cciss/c0d0p6 21G 17G 3.8G 82% /
/dev/cciss/c0d0p9 70G 8.2G 58G 13% /opt
/dev/cciss/c0d0p8 11G 159M 9.8G 2% /tmp
/dev/cciss/c0d0p5 32G 7.1G 23G 24% /usr
/dev/cciss/c0d0p3 32G 2.2G 28G 8% /var
/dev/cciss/c0d0p2 100G 11G 85G 11% /export
/dev/cciss/c0d0p1 5.3G 181M 4.8G 4% /boot
tmpfs 17G 0 17G 0% /dev/shm
tmpfs 8.3G 19M 8.3G 1% /var/lib/ganglia/rrds
 9.6T 31G 9.1T 1% /global/home
 8.3T 24G 7.9T 1% /global/apps



On the storage server it shows:

[root@storageserver ~]# df -H
Filesystem             Size   Used  Avail Use% Mounted on
/dev/cciss/c0d0p6       21G    20G      0 100% /
/dev/cciss/c0d0p7       11G   158M   9.8G   2% /tmp
/dev/cciss/c0d0p5       21G   2.2G    18G  11% /usr
/dev/cciss/c0d0p3       21G   7.0G    13G  36% /var
/dev/cciss/c0d0p2       21G   3.7G    17G  19% /opt
/dev/cciss/c0d0p1      2.1G    43M   2.0G   3% /boot
tmpfs                  4.2G      0   4.2G   0% /dev/shm
/dev/mapper/vg0-lv0    9.6T    31G   9.1T   1% /global/home
/dev/mapper/vg1-lv1    8.3T    24G   7.9T   1% /global/apps



Now, when I run any big program, the program halts, and when I run df -H again on the master, the lines for the /global/home and /global/apps mounts are missing:

[root@paramsheersh ~]# df -H
Filesystem             Size   Used  Avail Use% Mounted on
/dev/cciss/c0d0p6       21G    17G   3.7G  82% /
/dev/cciss/c0d0p9       70G   8.2G    58G  13% /opt
/dev/cciss/c0d0p8       11G   159M   9.8G   2% /tmp
/dev/cciss/c0d0p5       32G   7.1G    23G  24% /usr
/dev/cciss/c0d0p3       32G   2.2G    28G   8% /var
/dev/cciss/c0d0p2      100G    11G    85G  11% /export
/dev/cciss/c0d0p1      5.3G   181M   4.8G   4% /boot
tmpfs                   17G      0    17G   0% /dev/shm
tmpfs                  8.3G    19M   8.3G   1% /var/lib/ganglia/rrds
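
Comparing two df listings by eye is error-prone, so the check for the vanished mounts could be scripted. The following is only a minimal sketch, assuming a Linux master where /proc/mounts lists the mount point in the second field; the function name missing_mounts is made up for illustration.

```shell
#!/bin/sh
# Print every expected mount point that is absent from a mount table.
# $1 is the mount table file to read (e.g. /proc/mounts on Linux);
# the remaining arguments are the mount points to look for.
missing_mounts() {
    table=$1
    shift
    for mp in "$@"; do
        # Field 2 of each /proc/mounts line is the mount point.
        if ! awk -v mp="$mp" '$2 == mp { found = 1 } END { exit !found }' "$table"; then
            echo "$mp"
        fi
    done
}

# Report which of the two shared filesystems are currently not mounted.
missing_mounts /proc/mounts /global/home /global/apps
```

Empty output would mean both mounts are present; any path printed is one that has dropped off the master.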



I am not able to log in to the storageserver at all. The login hangs and stays that way until I forcibly shut the machine down.

My guess is that the connection to the storageserver is lost when running big programs. The cause might be a space issue, possibly the root filesystem on the storageserver, /dev/cciss/c0d0p6, which is at 100% usage.
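
To test the space theory, the full filesystems can be picked out of the df output automatically. This is a rough sketch rather than a definitive diagnostic: it assumes the POSIX `df -P` column layout (capacity in field 5, mount point in field 6) and mount points without spaces. The function name full_filesystems and the 90% threshold are arbitrary choices for illustration.

```shell
#!/bin/sh
# Read `df -P` style output on stdin and print the mount point and usage
# of every filesystem at or above the given percentage (default 90).
full_filesystems() {
    threshold=${1:-90}
    awk -v t="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # strip the percent sign from the copy
        if (use + 0 >= t) print $6, $5
    }'
}

# Check the local machine against a 90% threshold.
df -P | full_filesystems 90
```

Run on the storageserver while it is still reachable, this should list / at 100% if the figures above still hold. With GNU du, something like `du -xk --max-depth=1 / | sort -n` could then show which top-level directories on the root filesystem are using the space.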

Please help me on this.


Thanks in advance and Regards,



P.S. This thread has been moved from Servers>General to Server Clustering. -HP Forum Moderator