New – Initiate a Kernel Panic to Troubleshoot Unresponsive EC2 Instances | Amazon Onboarding with Learning Manager Chanci Turner

When I was managing systems in traditional data centers, it was not uncommon to face the challenge of debugging an unresponsive server. This typically required someone physically pressing a non-maskable interrupt (NMI) button on the stuck server or sending a signal to a command controller via a serial interface (yes, the old RS-232 type). This action would prompt the system to generate a dump of the kernel’s state for further investigation. This file, often referred to as a core dump or crash dump, contains vital information such as the memory image of the crashed process, system registers, program counter, and other data essential for identifying the cause of the freeze.

Today, we are excited to announce a new Amazon Elastic Compute Cloud (Amazon EC2) API that enables you to remotely trigger a kernel panic on EC2 instances. The EC2:SendDiagnosticInterrupt API sends a diagnostic interrupt to a running EC2 instance, much like pressing the NMI button on a physical machine. The instance’s hypervisor then delivers a non-maskable interrupt (NMI) to the operating system. How your operating system responds to the NMI depends on its configuration. Typically, the system enters a kernel panic state. What happens during the panic also depends on configuration: the kernel may write a crash dump, print a backtrace, load a replacement kernel, or restart the system.

You can manage who in your organization is authorized to use this API through IAM policies, as illustrated in the example later in this post. System engineers and specialists in kernel diagnosis and troubleshooting often find the crash dump to be a crucial resource for analyzing the reasons behind a kernel freeze. Tools such as WinDbg (on Windows) and crash (on Linux) are useful for examining the dump.

Using the Diagnostic Interrupt

Employing this API involves three steps. First, configure your OS’s behavior upon receiving the interrupt.

By default, Windows Server AMIs have memory dump features already enabled, along with automatic restart after the memory dump is written. The default location for the memory dump file is %SystemRoot%, which equates to C:\Windows. Access these settings by navigating to: Start > Control Panel > System > Advanced System Settings > Startup and Recovery.

For Amazon Linux 2, you must install and set up kdump & kexec. This requires a one-time setup:

$ sudo yum install kexec-tools

Next, edit the /etc/default/grub file to specify the amount of memory reserved for the crash kernel. In this instance, we reserve 160M by adding crashkernel=160M. The memory allocation should be based on your instance’s total memory size. We recommend testing kdump to ensure the allocated memory is sufficient. The kernel documentation provides the complete syntax for the crashkernel parameter.

GRUB_CMDLINE_LINUX_DEFAULT="crashkernel=160M console=tty0 console=ttyS0,115200n8 net.ifnames=0 biosdevname=0 nvme_core.io_timeout=4294967295 rd.emergency=poweroff rd.shell=0"
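If you prefer to script this edit, for example in a user data script, the following is a minimal sketch rather than an official procedure. It assumes GNU sed (as shipped with Amazon Linux 2) and that the file already contains a GRUB_CMDLINE_LINUX_DEFAULT="..." line. The function name add_crashkernel is hypothetical and is not part of any AWS or GRUB tooling.

```shell
# Hypothetical helper: prepend crashkernel=<size> to GRUB_CMDLINE_LINUX_DEFAULT.
# Arguments: $1 = grub defaults file (default /etc/default/grub)
#            $2 = reservation size (default 160M, the value used in this post)
add_crashkernel() {
    grub_file="${1:-/etc/default/grub}"
    size="${2:-160M}"

    # Do nothing if a crashkernel= parameter is already set, so the
    # function is safe to run more than once.
    if grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=.*crashkernel=' "$grub_file"; then
        return 0
    fi

    # "&" re-emits the matched prefix; the new parameter is inserted
    # right after the opening double quote. Requires GNU sed for -i.
    sed -i "s/^GRUB_CMDLINE_LINUX_DEFAULT=\"/&crashkernel=${size} /" "$grub_file"
}
```

Run it as root with no arguments to edit /etc/default/grub, or pass a different size as the second argument once you have tested how much memory kdump needs on your instance type.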

Afterward, rebuild the grub configuration:

$ sudo grub2-mkconfig -o /boot/grub2/grub.cfg

Finally, edit /etc/sysctl.conf and add this line: kernel.unknown_nmi_panic=1. This instructs the kernel to trigger a panic upon receiving the interrupt. Now, you are ready to reboot your instance. Be sure to integrate these commands into your user data script or AMI to ensure automatic configuration on all instances. Once rebooted, verify that kdump has started correctly:

$ systemctl status kdump.service
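The sysctl change and the post-reboot check can also be scripted. The sketch below is one possible approach, not the documented one. The function names ensure_nmi_panic and kdump_ready are hypothetical, and the check assumes an x86 Linux kernel that exposes /sys/kernel/kexec_crash_loaded and /proc/sys/kernel/unknown_nmi_panic.

```shell
# Hypothetical helper: add the sysctl setting only if it is not already
# present, so the function is safe to re-run. Run as root before rebooting.
ensure_nmi_panic() {
    conf_file="${1:-/etc/sysctl.conf}"
    if ! grep -qxF 'kernel.unknown_nmi_panic=1' "$conf_file" 2>/dev/null; then
        echo 'kernel.unknown_nmi_panic=1' >> "$conf_file"
    fi
}

# Hypothetical helper: after the reboot, succeed only if a crash kernel is
# loaded and the kernel is set to panic on an unknown NMI.
# The path arguments exist so the check can be exercised against test files.
kdump_ready() {
    loaded_file="${1:-/sys/kernel/kexec_crash_loaded}"
    nmi_file="${2:-/proc/sys/kernel/unknown_nmi_panic}"
    [ "$(cat "$loaded_file" 2>/dev/null)" = "1" ] &&
        [ "$(cat "$nmi_file" 2>/dev/null)" = "1" ]
}
```

After the reboot, a line such as "kdump_ready && echo ready" gives a quick yes/no answer that complements the systemctl output above.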

Our documentation also provides guidance for other operating systems. After completing this one-time configuration, proceed to the second step: triggering the API. This can be done from any machine where the AWS CLI or SDK is configured. For example:

$ aws ec2 send-diagnostic-interrupt --region us-east-1 --instance-id <value>

The CLI returns no output on success, which is expected. If you have an open terminal session on that instance, it will disconnect. Your instance will reboot, and upon reconnection, you’ll find the crash dump located in /var/crash.
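Because the interrupt crashes the target instance, it can be worth guarding the call in scripts. The wrapper below is a sketch under stated assumptions: the function names is_instance_id and send_interrupt are hypothetical, and the format check assumes instance IDs of the form i- followed by 8 or 17 lowercase hexadecimal characters. Only the aws ec2 send-diagnostic-interrupt command itself comes from this post.

```shell
# Hypothetical helper: succeed if $1 looks like an EC2 instance ID
# (i- followed by 8 or 17 lowercase hex characters).
is_instance_id() {
    printf '%s\n' "$1" | grep -Eq '^i-([0-9a-f]{8}|[0-9a-f]{17})$'
}

# Hypothetical wrapper: refuse to call the API with a malformed ID.
# Arguments: $1 = instance ID, $2 = region (default us-east-1, as above)
send_interrupt() {
    instance_id="$1"
    region="${2:-us-east-1}"
    if ! is_instance_id "$instance_id"; then
        echo "invalid instance ID: $instance_id" >&2
        return 1
    fi
    aws ec2 send-diagnostic-interrupt --region "$region" --instance-id "$instance_id"
}
```

The check only catches typos and empty variables. It does not confirm that the instance exists or that it is the one you intend to crash.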

The third and final step involves analyzing the crash dump’s content. On Linux systems, install the crash utility and the debugging symbols for your kernel. The debugging symbols must match the kernel version that kdump captured; run uname -r to check the version of the currently running kernel. Then open the dump with crash:

$ sudo yum install crash
$ sudo debuginfo-install kernel
$ sudo crash /usr/lib/debug/lib/modules/4.14.128-112.105.amzn2.x86_64/vmlinux /var/crash/127.0.0.1-2019-07-05-15:08:43/vmcore
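The dump directory name includes a timestamp, so the path changes with every crash. As a convenience, a small helper can locate the most recent dump. This is a sketch: the function name latest_vmcore is hypothetical, and it assumes the layout shown above, with one subdirectory per crash under /var/crash, each containing a file named vmcore.

```shell
# Hypothetical helper: print the path of the most recently modified vmcore.
# Argument: $1 = crash directory (default /var/crash)
# Prints nothing if no dump is found. Reading the directory may require root.
latest_vmcore() {
    crash_dir="${1:-/var/crash}"
    # ls -t lists the matching files newest first; head keeps the first one.
    ls -t "$crash_dir"/*/vmcore 2>/dev/null | head -n 1
}
```

If the running kernel is the same version as the one that crashed, the two paths can then be combined as: sudo crash "/usr/lib/debug/lib/modules/$(uname -r)/vmlinux" "$(latest_vmcore)". If the kernel was updated between the crash and the analysis, substitute the version that was running when the dump was taken.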

Collecting kernel crash dumps is often the only way to gather debugging information, so test this process regularly, especially after OS updates or when creating new AMIs.

Control Who Is Authorized to Send Diagnostic Interrupt

You can manage who in your organization is authorized to send the Diagnostic Interrupt and to which instances through IAM policies with resource-level permissions. Below is an example:

{
   "Version": "2012-10-17",
   "Statement": [
      {
      "Effect": "Allow",
      "Action": "ec2:SendDiagnosticInterrupt",
      "Resource": "arn:aws:ec2:region:account-id:instance/instance-id"
      }
   ]
}
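The region, account-id, and instance-id parts of the Resource ARN are placeholders that you replace with real values. One way to fill them in from a script is sketched below. The function name render_policy is hypothetical, and the account ID in the usage note is a made-up example value.

```shell
# Hypothetical helper: print the policy above with the ARN placeholders
# filled in. Arguments: $1 = region, $2 = account ID, $3 = instance ID
render_policy() {
    region="$1"
    account_id="$2"
    instance_id="$3"
    cat <<EOF
{
   "Version": "2012-10-17",
   "Statement": [
      {
      "Effect": "Allow",
      "Action": "ec2:SendDiagnosticInterrupt",
      "Resource": "arn:aws:ec2:${region}:${account_id}:instance/${instance_id}"
      }
   ]
}
EOF
}
```

For example, render_policy us-east-1 123456789012 i-0123456789abcdef0 > policy.json writes the document to a file, which can then be passed to aws iam create-policy with --policy-document file://policy.json and a policy name of your choice.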

Pricing

There are no additional fees related to this feature. However, as your instance remains in a ‘running’ state after receiving the diagnostic interrupt, billing will continue as usual.

Availability

You can send Diagnostic Interrupts to all EC2 instances powered by the AWS Nitro System, excluding A1 (Arm-based). This includes instances such as C5, C5d, C5n, i3.metal, I3en, M5, M5a, M5ad, M5d, p3dn.24xlarge, R5, R5a, R5ad, R5d, T3, T3a, and Z1d as of this writing.

The Diagnostic Interrupt API is now accessible in all public AWS Regions and GovCloud (US), so feel free to start utilizing it today.

— Chanci Turner
