Diagnosing Checkstops


Contents

About this document
    Related documentation
How to determine if a checkstop has occurred
What is a "machine check"?
What is a checkstop?
How to proceed
Possible software resolutions for checkstop conditions

About this document

This document discusses checkstops. A checkstop is a machine check that occurs during another machine check. This document applies to AIX Versions 3.2, 4.1, 4.2, and 4.3.

Related documentation

For more in-depth coverage of this subject, the following IBM publications are recommended:

The product documentation library is also available:
http://www.rs6000.ibm.com/resource/aix_resource/Pubs/index.html


How to determine if a checkstop has occurred

A checkstop is indicated by an LED value of 185, 186, or 187 on the LED display of the main unit. If the machine does not have an LED display or the machine has been rebooted, then evidence of a checkstop should exist in the system error report. Look for an entry labeled CHECKSTOP in the error report to determine if a checkstop occurred.
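The error report check can be scripted. The following is a minimal sketch, assuming the standard AIX errpt command is available and that checkstop entries contain the text CHECKSTOP in the summary listing. The function name has_checkstop is hypothetical and is not part of AIX.

```shell
#!/bin/sh
# has_checkstop: read an error report summary on standard input and
# succeed (exit status 0) if any line mentions CHECKSTOP.
# The function name is illustrative only.
has_checkstop() {
    grep -i 'CHECKSTOP' >/dev/null
}

# Typical use on an AIX system (errpt prints the error report summary):
#   errpt | has_checkstop && echo "A checkstop has been logged."
```

Reading from standard input keeps the function independent of how the error report is obtained, so it can also be run against a saved copy of the report.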


What is a "machine check"?

A machine check is an error logged by the machine check handler. Causes of a machine check could be:

A non-maskable interrupt (NMI) is generated. The operating system logs the machine check, including various error logging registers reporting the cause of the machine check, and a system dump initiates.


What is a checkstop?

A checkstop is a machine check that occurs during another machine check. A checkstop also occurs when the machine (usually a processor, but sometimes a cache, memory, or I/O bus controller) determines that something is in an "impossible" state. For example, an error occurs that cannot be isolated to a particular bus transfer in progress, or a processor detects that no progress is being made because it has been unable to complete any instructions for some period of time.

When a system checkstops, the clocks in the machine are frozen within a few cycles after the error, and the service processor saves part of the state of the CPUs in NVRAM. The service processor then attempts a full hardware reset and restart of the system, repeating the attempt a number of times if necessary.

When the system reboots, the saved data is copied to a file in the /usr/lib/ras directory (ras stands for reliability, availability, and serviceability). Two file names, checkstop.A and checkstop.B, are used in rotation. The error log entry records the file name along with the total number of checkstops that occurred during the reboot attempts before the system came up successfully.
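Because the two file names rotate, the most recent checkstop data can be in either file. The following sketch picks the newer file by modification time. The helper name newest_checkstop is hypothetical, and the optional directory argument exists only so that the function can be tried against a copy of the files.

```shell
#!/bin/sh
# newest_checkstop: print the most recently modified checkstop file in a
# directory (default /usr/lib/ras). Hypothetical helper, not an AIX command.
newest_checkstop() {
    dir=${1:-/usr/lib/ras}
    # ls -t sorts by modification time, newest first. Errors are discarded
    # so that a missing file simply produces no output for that name.
    ls -t "$dir"/checkstop.A "$dir"/checkstop.B 2>/dev/null | head -n 1
}

# Typical use:
#   newest_checkstop
```

If neither file exists, the function prints nothing.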

If a second machine check occurs before the operating system completes logging the error to NVRAM and initiates a complete hardware reset or halts, the processor will checkstop.


How to proceed

Checkstops are inherently hardware phenomena. They do not necessarily indicate a solid failure of a component, so diagnostics will rarely determine that a problem exists. The generated checkstop file is required to determine the cause of the checkstop and the corrective actions needed to resolve the situation. Your hardware service organization examines this file. For further information, contact one of the following:

Use the following instructions to package these files for hardware service examination.

Gather system information by performing the following steps:

  1. Run the command snap -g. This will put general information about the machine in the directory /tmp/ibmsupt.

  2. Copy the checkstop files in /usr/lib/ras to the directory /tmp/ibmsupt/testcase. Enter:
       cp /usr/lib/ras/checkstop* /tmp/ibmsupt/testcase 
    
  3. Make a file called customer_info in the directory /tmp/ibmsupt/other. In this file, include the following information:

    main contact
    telephone number of main contact
    machine type having the problem (examples: 7011, 7012, 7015)
    serial number of machine
    location of machine (physical location of machine including address)
    description of activity of machine prior to event

  4. Put the testcase on diskette:
       tar -cvf /dev/fd0 /tmp/ibmsupt

    /dev/fd0 is the diskette drive. To use tape instead, substitute the name of the tape device.

  5. Label the tape or diskette with the following information:

    customer name
    customer number
    incident#
    the command used to copy the information to the tape or diskette

    Very important: If the person sending in this testcase is not the person who reported the problem, be sure to include the name of the person who reported it. If the proper information is not on the package, processing takes longer and the resolution of your problem is delayed. The incident# is the reference number that your hardware service organization assigns to this problem.
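Steps 2 through 4 above can be sketched as a single shell function. This is an illustration under stated assumptions, not a supported tool: the function name package_checkstop and its three optional parameters are hypothetical, and step 1 (snap -g) is expected to have been run by hand first. The customer_info file is created only as an empty template and must still be filled in.

```shell
#!/bin/sh
# Step 1 is run by hand on the AIX system before calling the function:
#   snap -g
#
# package_checkstop: sketch of steps 2 through 4.
#   $1  directory holding the checkstop files  (default /usr/lib/ras)
#   $2  directory populated by snap -g         (default /tmp/ibmsupt)
#   $3  tar target, a device or a file         (default /dev/fd0)
# The function name and its parameters are hypothetical.
package_checkstop() {
    ras_dir=${1:-/usr/lib/ras}
    supt_dir=${2:-/tmp/ibmsupt}
    target=${3:-/dev/fd0}

    # Step 2: copy the checkstop files into the testcase directory.
    mkdir -p "$supt_dir/testcase" "$supt_dir/other" || return 1
    cp "$ras_dir"/checkstop* "$supt_dir/testcase" || return 1

    # Step 3: create a customer_info template, unless one already exists.
    # The values must still be filled in by hand.
    if [ ! -f "$supt_dir/other/customer_info" ]; then
        cat > "$supt_dir/other/customer_info" <<EOF
main contact:
telephone number of main contact:
machine type having the problem:
serial number of machine:
location of machine:
description of activity of machine prior to event:
EOF
    fi

    # Step 4: write the collected files to the diskette (or other target).
    tar -cvf "$target" "$supt_dir"
}

# Typical use, with the default paths and the diskette drive:
#   package_checkstop
```

The function stops before writing the diskette if no checkstop files are found, so an incomplete package is not sent by mistake.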


Possible software resolutions for checkstop conditions

Listed below are some possible software resolutions for checkstop conditions as of this document's last update. To check for the latest checkstop-related software fixes, go to the TechSupport Online databases; select the APAR databases and search on the keyword CHECKSTOP.
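To see whether an APAR from the tables in this section is already installed, the instfix command can be queried for each APAR number. The sketch below assumes instfix is available on the system (it may not be on older releases); the function name check_apars and the INSTFIX override variable are hypothetical and are included only so that the function can be tried without AIX.

```shell
#!/bin/sh
# check_apars: for each APAR number given, print a header line and then
# whatever instfix reports for that APAR.
# INSTFIX may be set to substitute another command; this variable is an
# assumption of this sketch, not an AIX convention.
check_apars() {
    for apar in "$@"; do
        echo "== $apar"
        ${INSTFIX:-instfix} -ik "$apar" 2>&1
    done
}

# Typical use, with two APAR numbers from the AIX Version 4.3 table:
#   check_apars IX83586 IY04927
```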

AIX Version 3.2

APARs are no longer being written for 3.2.x.  Upgrading to the latest level of
the OS will resolve any problems that can be fixed via APAR.

APAR      DESCRIPTION                            HARDWARE
IX53114   3.2.5.101 Upgrade from 3.2.5.1
IX60081   3.2.5.102 Upgrade from 3.2.5.1

AIX Version 4.1

APARs are no longer being written for 4.1.x.  Upgrading to the latest level of
the OS will resolve any problems that can be fixed via APAR.

APAR      DESCRIPTION                               HARDWARE
IX88586   Latest AIX 4.1.5 Fixes as of March 1999

AIX Version 4.2

APAR      DESCRIPTION                               HARDWARE
IX69143   CHECKSTOP 185/186 ON GXT500D/GXT500
           WITH X -BS OPTION                        GXT500
IX70175   NEED SW WORKAROUND FOR PEGASUS 6XX        7012-G30, 7012-G40, 7012-G50
           BUS LIVELOCK                             7013-J30, 7013-J40, 7013-J50,
                                                    7015-R30, 7015-R40, 7015-R50
IX62156   UNALIGNED TRANSFERS ON 825A CAN CAUSE
           MACHINE CHECK                            PCI F/W SCSI Adap.
IX66931   PCI SCSI ADAPTER CAUSES MACHINE CHECK
           IN WILDCAT
IX61252   GXT500 CHECKSTOPS DOING SOLID MODEL
           ROTATION IN CATIA                        GXT500
IX83745   CHECKSTOP ON SPHINX                       43P-260
IX74688   ROBUST RISC CHECKSTOP ANALYSIS            7012-G30/G40, 7013-J30/J40/J50,
                                                    7015-R30/R40/R50
IX75066   THE CHECKSTOP ERROR ISN'T ACCURATE
           FOR PAL MACHINES
IX89142   NEED TO RENAME 'CHECKSTOP' FILE TO
           SOMETHING BENIGN ON CHRP BOX
IX75637   ADD FUNCTIONALITY TO SNAP TO COLLECT
           CHECKSTOP FILES

AIX Version 4.3

APAR      DESCRIPTION                               HARDWARE
IX72262   APACHE DEADLOCK AVOIDANCE WORKAROUNDS     7017-S70
IX83586   CHECKSTOP ON SPHINX                       43P-260
IX89790   699:SPHINX2 CHECKSTOP W/MTN&2MIR WHEN     GXT2000P,GXT3000P
           CATIA CHAINSAW GRAPER FCN
IY04927   CHECKSTOP WITH GIGABIT ETHERNET WITH      604e+GigaBit ETHERNET
           604e PROCESSOR
IX74019   ROBUST RISC CHECKSTOP ANALYSIS            7012-G30/G40, 7013-J30/J40/J50,
                                                    7015-R30/R40/R50
IX84430   NEED TO RENAME 'CHECKSTOP' FILE TO
           SOMETHING BENIGN ON CHRP BOX
IX74302   ADD FUNCTIONALITY TO SNAP TO COLLECT
           CHECKSTOP FILES



[ Doc Ref: 90605198514836     Publish Date: Oct. 17, 2000     4FAX Ref: 6523 ]