DOCUMENT ID: OZBEKBRC00000611

USING Q4 TO ANALYZE SYSTEM DUMP FILES (For 10.10-11.00)
==============================================================

When HP-UX crashes, it saves a snapshot of RAM in disk-based swap
space or dedicated dump space, reboots the system, and copies the
resulting "dump" into /var/adm/crash. A utility called q4, normally
loaded on the system, is available to make text files for fast
analysis. A patched version of q4 must be loaded to interpret dumps
resulting from a "hanging" operating system.

To preprocess the dump, follow these steps and email the resulting
files to the HP Response Center for analysis. Steps vary depending
on the version of the O/S and the version of q4.

==============================================================
STEP 1 ===== WHERE IS THE DUMP? ===========================
==============================================================

Verify a current dump exists in the dump directory:

   # ll /var/adm/crash/c*

A recent core.N (10.X) or crash.N (11.X) directory should be listed.
(NOTE: N is the next available dump index, which increments with
each successive dump.)

The INDEX file and /etc/shutdownlog contain the "panic" statement.

   # touch /etc/shutdownlog     (if it does not exist)

If a current dump is not in /var/adm/crash, run:

   # grep _DIR /etc/rc.config.d/save*

The value assigned to SAVECORE_DIR (10.X) or SAVECRASH_DIR (11.X)
is where the system places dump files.

If the system dump is not in the expected location, try to re-save
the dump with:

   10.X : # savecore -vr
   11.00: # savecrash -vr

A return message "invalid dump header" means the dump does not exist.

NOTE: If a dump save fills the current dump directory, set the
directory variable to a directory with more space, and create that
directory so it can capture future dumps.

==============================================================
STEP 2 ===== IS A VERSION OF Q4 LOADED? ===================
==============================================================

Determine whether q4 is loaded, and which version:

   # swlist -l fileset | grep -i Q4

The following two are unpatched versions supplied with the OS:

   OS-Core.Q4   B.10.20   HP-UX Crash Dump Debugger for PA-RISC systems
or
   OS-Core.Q4   B.11.00   HP-UX Crash Dump Debugger for PA-RISC systems

If one of the following patched versions is listed, proceed to STEP 3:

   (10.20) PHCO_20261.PHCO_20261  B.10.00.00.AA  q4 patch version A.11.10c
or
   (11.0)  PHCO_20262.Q4  1.0  OS-Core.Q4

If the system does not have q4, or the dump was the result of a hang,
load the patched version. Loading the patched version will not cause
a system reboot. Installation instructions accompany the patch.

Download the appropriate version from this site:

   ftp://i3107ffs.external.hp.com/hp-ux_patches/s700_800/
   (select 10.X or 11.X)

Retrieve patch:PHCO_20262 for 11.0, patch:PHCO_20261 for 10.10 or 10.20.
NOTE: the patch number may be superseded over time.

If web access is unavailable, no version of q4 is on the system, and
the install CD is available, load the standard version of q4:

Mount the INSTALL media and verify a matching version of Q4 is
available:

   # swlist -l fileset -s / | grep Q4
   OS-Core.Q4   B.10.10   HP-UX Crash Dump Debugger for PA-RISC systems
                  ^^^^^ -matches the O/S

Use swinstall to install it:

   # swinstall -vs / OS-Core.Q4

==============================================================
STEP 3 ===== CD TO THE DUMPS DIRECTORY ====================
==============================================================

NOTE: csh (c-shell) will cause errors with q4. Use ksh.

   # cd dump_directory     (IMPORTANT!)
   eg: cd /var/adm/crash/core.0 OR /var/adm/crash/crash.0

==============================================================
STEP 4 ===== IF USING UNPATCHED Q4 ========================
==============================================================

4.1) Perform this command:

   # /usr/contrib/bin/gunzip vmunix.gz
     (uncompresses the kernel file)

For 10.20 and later, type this command and then skip to 4.2:

   # /usr/contrib/bin/q4prep -p

If at 10.10, type the following commands:

   # uncompress /usr/contrib/lib/Q4Lib.tar.Z
     (ignore the error if this was done previously)
   # tar -xf /usr/contrib/lib/Q4Lib.tar
     (output goes into the current directory)
   # cp q4lib/sample.q4rc.pl ~/.q4rc.pl
     (Note the use of a tilde. Also, the extension .pl ends in the
     letter "el", not the digit 1.)
   # /usr/contrib/bin/q4pxdb vmunix
     (this may complain if vmunix is already preprocessed)

4.2) If the next command causes "/var: file system full", move the
core.N directory to a file system with adequate space (approximately
2x the sum of the core.x.y.gz files) and continue at this point.

Type:

   # q4 -p .               (note the "dot" at the end)

Then:

   q4> trace event 0 > trace.txt
   q4> include analyze.pl  (NOTE: .pl ends in the letter "el")
   q4> run Analyze AU >> ana.txt
       NOTE: ctrl-c will interrupt q4
   q4> exit

Skip to STEP 6

==============================================================
STEP 5 ===== IF USING THE PATCHED VERSION OF Q4 ===========
==============================================================

If MC/ServiceGuard is not loaded, skip to step 5.2

5.1) Type:

   # nm -xv /usr/lbin/cmcld | grep cl_log_cache

Example return:

   cl_log_cache        |0x4007a6b8|extern|data   |$BSS$
                        ^^^^^^^^^^
Record the hex number returned for use in step 5.5

5.2) Type:

   # . /usr/contrib/Q4/bin/set_env

Note the 'dot' at the beginning of the line.

5.3) If the next steps cause "/var: file system full", move the core.N
or crash.N directory to a file system with adequate space
(approximately 2x the sum of the core.x.y.gz files) and continue at
this point.

Type:

   # /usr/contrib/Q4/bin/q4pxdb vmunix   (Disregard "unnecessary" message)
   # /usr/contrib/Q4/bin/q4 -p .         (note the "dot" at the end)

5.4) At the q4> prompt, type:

   q4> run Analyze AU > ana.txt
   q4> run WhatHappened -HANG > what.txt

NOTE: ctrl-c can interrupt these two commands, which may take
several minutes to process.

5.5) If ServiceGuard is loaded, recall the address recorded in 5.1
and type:

   q4> include cmcld.pl

For ServiceGuard versions A.10.10 and above type:

   q4> run PrintCmcldLog > cmcld.log

For ServiceGuard versions earlier than A.10.10 type:

   q4> run PrintCmcldLogOLD > cmcld.log

5.6) Type:

   q4> exit

==============================================================
STEP 6 ===== COLLECT AND SEND DATA ========================
==============================================================

Check to see if a hardware problem induced the crash. Type:

   # grep HPMC ana.txt

Do you see one of these?

   "crash event was an HPMC"
or
   "Crash Event 0 (HPMC, struct crash_event_table_struct..."

If so, the processor detected a hardware failure and hardware repair
is needed. Recent /var/tombstones/ts9* files (if they exist) may have
hardware fault codes that can aid in isolating the cause.

If ServiceGuard is loaded, check q4.txt to see if it induced the
reboot:

   "MC/ServiceGuard: Unable to maintain contact with cmcld daemon.
   Performing TOC to ensure data integrity."

If so, type:

   # grep E_T /etc/cmcluster

If NODE_TIMEOUT is set to 2 seconds (2000000 microseconds), increase
the value to 5-8 seconds in the file and perform a cmapplyconf with
the cluster down.
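NODE_TIMEOUT is expressed in microseconds, so the raw number is easy to misread. The following is a minimal sketch of the unit conversion, not part of the HP procedure. The function names usec_to_sec and node_timeout_usec are hypothetical, and the sample line assumes the "NODE_TIMEOUT value" layout of a cluster configuration file; substitute the output of the grep above for the sample.

```shell
#!/bin/sh
# Sketch only: helper names and the sample line are illustrative assumptions.

# Convert a NODE_TIMEOUT value in microseconds to whole seconds.
usec_to_sec() {
    echo $(( $1 / 1000000 ))
}

# Read configuration text on stdin and print the value from the first
# line whose first field is exactly NODE_TIMEOUT (comment lines that
# start with "#" do not match and are skipped).
node_timeout_usec() {
    awk '$1 == "NODE_TIMEOUT" { print $2; exit }'
}

# Hypothetical sample line; a real one comes from the cluster
# configuration file under /etc/cmcluster.
sample='NODE_TIMEOUT 2000000'

usec=$(printf '%s\n' "$sample" | node_timeout_usec)
echo "NODE_TIMEOUT = $usec microseconds = $(usec_to_sec "$usec") seconds"
```

By the same arithmetic, the 5-8 second range recommended above corresponds to values from 5000000 to 8000000.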
Also, read this article in the Electronic Support Center technical
database for more details on dealing with causes:

   UXSGLVKBAN00000010

Generate a patch list:

   # /usr/sbin/swlist -l product | grep PH > patches.txt

Send the following files to hpcu@atl.hp.com using the SOFTWARE CASE ID
as the subject:

   patches.txt, ana.txt, trace.txt, /etc/shutdownlog
   what.txt              (if created)
   cmcld.log             (if created)
   /var/tombstones/ts*   (if HPMC was detected)

NOTES:
- The hpcu E-Mail box has a 3MB maximum mail size!
- Perform this procedure prior to opening future dump cases. Get a
  case ID and email these files to hpcu@atl.hp.com with the newly
  assigned SOFTWARE CASE ID as the subject, and inform the engineer
  that the email is en route.

*** END ***