This document discusses checkstops. A checkstop is a machine check that occurs during another machine check. This document applies to AIX Versions 3.2, 4.1, 4.2, and 4.3.
For more in-depth coverage of this subject, the following IBM publications are recommended:
The product documentation library is also available:
http://www.rs6000.ibm.com/resource/aix_resource/Pubs/index.html
A checkstop is indicated by an LED value of 185, 186, or 187 on the LED display of the main unit. If the machine does not have an LED display or the machine has been rebooted, then evidence of a checkstop should exist in the system error report. Look for an entry labeled CHECKSTOP in the error report to determine if a checkstop occurred.
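As a rough sketch of that error-report check, the shell function below counts entries labeled CHECKSTOP in error-report text read from standard input. It assumes the text comes from the detailed report (errpt -a) and that each entry carries its label on a line beginning with "LABEL:"; the function name count_checkstops is hypothetical and not part of AIX.

```shell
#!/bin/sh
# count_checkstops: hypothetical helper, not an AIX command.
# Reads detailed error-report text on stdin and prints the number of
# entries whose label line names CHECKSTOP.
# Assumption: each entry has a line of the form "LABEL:   <name>".
count_checkstops() {
    # grep -c prints the count; it exits non-zero when the count is 0,
    # so "|| true" keeps the function's exit status at success.
    grep -c '^LABEL:[[:space:]]*CHECKSTOP' || true
}

# Example use on a live system (assumes errpt is on the PATH):
#   errpt -a | count_checkstops
```

A result greater than zero suggests that at least one checkstop was logged. Reading from standard input means the same function can be run against a saved copy of the report.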
A machine check is an error logged by the machine check handler. Causes of a machine check could be:
When a machine check occurs, a non-maskable interrupt (NMI) is generated. The operating system logs the machine check, including various error logging registers reporting the cause of the machine check, and a system dump is initiated.
A checkstop is a machine check that occurs during another machine check. A checkstop also occurs when the machine--usually a processor but sometimes a cache, memory, or I/O bus controller--determines that something is in an "impossible" state: for example, an error occurs that cannot be isolated to a particular bus transfer in progress, or a processor detects that no progress is being made because it has been unable to complete any instructions for some period of time.
When a system checkstops, the clocks in the machine are frozen within a few cycles after the error, and the service processor saves part of the state of the CPUs in NVRAM. The service processor then attempts a full hardware reset and restart of the system, repeating the attempt a number of times if necessary.
When the system reboots, the saved data is copied from NVRAM to a file in the /usr/lib/ras directory (ras stands for reliability, availability, and serviceability). Two file names, checkstop.A and checkstop.B, are used in rotation. The error log entry records the file name along with the total number of checkstops that occurred during the reboot attempts before the system came up successfully.
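Because the two file names are reused in turn, it is not obvious from the name alone which file belongs to the most recent checkstop. The sketch below prints whichever checkstop file was modified last. The function name latest_checkstop is hypothetical, and using modification time as the indicator is an assumption; the file name recorded in the error log entry remains the authoritative reference.

```shell
#!/bin/sh
# latest_checkstop: hypothetical helper, not an AIX command.
# Prints the most recently modified checkstop file in a directory.
#   $1 = directory to search (default: /usr/lib/ras)
latest_checkstop() {
    dir=${1:-/usr/lib/ras}
    # ls -t sorts by modification time, newest first. Errors are hidden
    # so that a system with no checkstop files simply prints nothing.
    ls -t "$dir"/checkstop.* 2>/dev/null | head -n 1
}

# Example use:
#   latest_checkstop
```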
If a second machine check occurs before the operating system completes logging the error to NVRAM and initiates a complete hardware reset or halts, the processor will checkstop.
Checkstops are inherently hardware phenomena. They do not necessarily indicate a solid failure of a component, so diagnostics will rarely determine that a problem exists. The checkstop file that is generated is required to determine the cause of the checkstop and the corrective actions needed to resolve the situation. This file would be examined by your hardware service organization. For further information, contact one of the following:
Use the following instructions to package these files for hardware service examination.
Gather system information by performing the following steps:
cp /usr/lib/ras/checkstop* /tmp/ibmsupt/testcase
tar -cvf /dev/fd0 /tmp/ibmsupt
fd0 is the floppy device.
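The copy and tar commands above can be combined into one small function, sketched here. The function name package_checkstops and its three parameters are hypothetical, and the mkdir step is an assumption added so that the copy does not fail when the testcase directory is missing. The defaults match the paths used in the commands above.

```shell
#!/bin/sh
# package_checkstops: hypothetical helper mirroring the cp and tar steps.
#   $1 = directory holding the checkstop files (default: /usr/lib/ras)
#   $2 = staging directory                      (default: /tmp/ibmsupt)
#   $3 = tar target, a device or a file         (default: /dev/fd0)
package_checkstops() {
    src=${1:-/usr/lib/ras}
    stage=${2:-/tmp/ibmsupt}
    target=${3:-/dev/fd0}

    # Assumption: create the testcase subdirectory if it is missing.
    mkdir -p "$stage/testcase" || return 1

    # Copy every checkstop file; stop if none can be copied.
    cp "$src"/checkstop* "$stage/testcase" || return 1

    # Archive the whole staging directory to the target.
    tar -cvf "$target" "$stage"
}

# Example use with the defaults (writes to the floppy device):
#   package_checkstops
```

Passing a file name as the third argument writes the archive to disk instead of to the floppy device, which may be convenient on machines without a diskette drive.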
Very important: If the person sending in this testcase is not the person who reported the problem, be sure to include the name of the person who reported it. If the proper information is not on the package, processing it takes valuable time and delays solving your problem. The incident# is the reference number that your hardware service organization assigns to this problem.
Listed below are some possible software resolutions for checkstop conditions as of this document's last update. To check for the latest checkstop-related software fixes, go to the TechSupport Online databases; select the APAR databases and search on the keyword CHECKSTOP.
APARs are no longer being written for 3.2.x. Upgrading to the latest level of the OS will resolve any problems that can be fixed via APAR.

APAR     DESCRIPTION                                                      HARDWARE
IX53114  3.2.5.101 Upgrade from 3.2.5.1
IX60081  3.2.5.102 Upgrade from 3.2.5.1
APARs are no longer being written for 4.1.x. Upgrading to the latest level of the OS will resolve any problems that can be fixed via APAR.

APAR     DESCRIPTION                                                      HARDWARE
IX88586  Latest AIX 4.1.5 Fixes as of March 1999
APAR     DESCRIPTION                                                      HARDWARE
IX69143  CHECKSTOP 185/186 ON GXT500D/GXT500 WITH X -BS OPTION            GXT500
IX70175  NEED SW WORKAROUND FOR PEGASUS 6XX BUS LIVELOCK                  7012-G30, 7012-G40, 7012-G50, 7013-J30, 7013-J40, 7013-J50, 7015-R30, 7015-R40, 7015-R50
IX62156  UNALIGNED TRANSFERS ON 825A CAN CAUSE MACHINE CHECK              PCI F/W SCSI Adap.
IX66931  PCI SCSI ADAPTER CAUSES MACHINE CHECK IN WILDCAT
IX61252  GXT500 CHECKSTOPS DOING SOLID MODEL ROTATION IN CATIA            GXT500
IX83745  CHECKSTOP ON SPHINX                                              43P-260
IX74688  ROBUST RISC CHECKSTOP ANALYSIS                                   7012-G30/G40, 7013-J30/J40/J50, 7015-R30/R40/R50
IX75066  THE CHECKSTOP ERROR ISN'T ACCURATE FOR PAL MACHINES
IX89142  NEED TO RENAME 'CHECKSTOP' FILE TO SOMETHING BENIGN ON CHRP BOX
IX75637  ADD FUNCTIONALITY TO SNAP TO COLLECT CHECKSTOP FILES
APAR     DESCRIPTION                                                      HARDWARE
IX72262  APACHE DEADLOCK AVOIDANCE WORKAROUNDS                            7017-S70
IX83586  CHECKSTOP ON SPHINX                                              43P-260
IX89790  699:SPHINX2 CHECKSTOP W/MTN&2MIR WHEN CATIA CHAINSAW GRAPER FCN  GXT2000P,GXT3000P
IY04927  CHECKSTOP WITH GIGABIT ETHERNET WITH 604e PROCESSOR              604e+GigaBit ETHERNET
IX74019  ROBUST RISC CHECKSTOP ANALYSIS                                   7012-G30/G40, 7013-J30/J40/J50, 7015-R30/R40/R50
IX84430  NEED TO RENAME 'CHECKSTOP' FILE TO SOMETHING BENIGN ON CHRP BOX
IX74302  ADD FUNCTIONALITY TO SNAP TO COLLECT CHECKSTOP FILES
[ Doc Ref: 90605198514836 Publish Date: Oct. 17, 2000 4FAX Ref: 6523 ]