Update: Challenges with HashiCorp Terraform Resource Deletions Following AWS Lambda VPC Enhancements

On September 3, 2019, we announced an update that improves the performance, scalability, and efficiency of AWS Lambda functions when they connect to Amazon VPC networks. For details, see the original blog post. These improvements change how the elastic network interfaces (ENIs) that connect your functions to your VPCs are configured. As a side effect of the new architecture, HashiCorp Terraform may fail to delete VPC resources such as subnets, security groups, and the VPCs themselves. This post helps you determine whether you are affected and explains how to resolve the issue.

How Can I Identify If I’m Affected?

This issue affects you only if you use HashiCorp Terraform to destroy environments. Terraform AWS Provider v2.30.0 and earlier are affected. With these versions, you may see errors when destroying environments that contain AWS Lambda functions, VPC subnets, security groups, and Amazon VPCs. Typical error messages include:

Error deleting subnet: timeout while waiting for state to become ‘destroyed’ (last state: ‘pending’, timeout: 20m0s)
Error deleting security group: DependencyViolation: resource sg- has a dependent object status code: 400, request id:

Whether you see these errors depends on which AWS Regions have received the VPC improvements, so the same configuration may fail in one Region and succeed in another.

How Do I Fix This Issue If I Am Affected?

You have two options. The preferred option is to upgrade your Terraform AWS Provider to v2.31.0 or later, which includes fixes for this issue as well as improvements to the reliability of environment destruction. For guidance on upgrading, see the Terraform AWS Provider Version 2 Upgrade Guide. Release notes and source code for the latest versions are published with the AWS Provider releases.
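
As a minimal sketch of the upgrade, assuming a configuration written in the same style as the example later in this post, you can add a version constraint to the existing provider block and then run terraform init -upgrade so that Terraform downloads a matching release. The region is carried over from that example, and the constraint shown is one reasonable choice rather than a required value. Newer Terraform releases prefer declaring this constraint in a required_providers block instead.

```hcl
provider "aws" {
  # Assumed constraint: allow any 2.x release from v2.31.0 onward,
  # since this post recommends v2.31.0 or later.
  version = "~> 2.31"
  region  = "eu-central-1"
}
```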

If upgrading the Provider is not feasible, you can implement adjustments to your Terraform configuration to mitigate the problem. You will need to make the following changes:

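Add an explicit dependency, using a depends_on argument, to the aws_security_group and aws_subnet resources that your Lambda functions use. The dependency should target the IAM policy resource associated with the IAM role configured on the Lambda function; in the example below, that is the aws_iam_role_policy_attachment resource. Because Terraform destroys resources in reverse dependency order, this keeps the policy attached until the security group and subnet have been deleted.
Increase the delete timeout for all aws_security_group and aws_subnet resources to 40 minutes.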

Here’s a configuration file example demonstrating these changes:

provider "aws" {
  region = "eu-central-1"
}

resource "aws_iam_role" "lambda_exec_role" {
  name = "lambda_exec_role"
  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}

data "aws_iam_policy" "LambdaVPCAccess" {
  arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole"
}

resource "aws_iam_role_policy_attachment" "sto-lambda-vpc-role-policy-attach" {
  role       = "${aws_iam_role.lambda_exec_role.name}"
  policy_arn = "${data.aws_iam_policy.LambdaVPCAccess.arn}"
}

resource "aws_security_group" "allow_tls" {
  name        = "allow_tls"
  description = "Allow TLS inbound traffic"
  vpc_id      = "vpc-<id>"

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    # "-1" means all protocols; with ports 0-0 this allows all outbound traffic.
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  timeouts {
    delete = "40m"
  }
  depends_on = ["aws_iam_role_policy_attachment.sto-lambda-vpc-role-policy-attach"]
}

resource "aws_subnet" "main" {
  vpc_id     = "vpc-<id>"
  cidr_block = "172.31.68.0/24"

  timeouts {
    delete = "40m"
  }
  depends_on = ["aws_iam_role_policy_attachment.sto-lambda-vpc-role-policy-attach"]
}

resource "aws_lambda_function" "demo_lambda" {
  function_name    = "demo_lambda"
  handler          = "index.handler"
  runtime          = "nodejs10.x"
  filename         = "function.zip"
  source_code_hash = "${filebase64sha256("function.zip")}"
  role             = "${aws_iam_role.lambda_exec_role.arn}"

  vpc_config {
    subnet_ids         = ["${aws_subnet.main.id}"]
    security_group_ids = ["${aws_security_group.allow_tls.id}"]
  }
}

It’s crucial to note the following blocks in both the allow_tls security group and the main subnet resources:

timeouts {
  delete = "40m"
}
depends_on = ["aws_iam_role_policy_attachment.sto-lambda-vpc-role-policy-attach"]

Make these adjustments to your Terraform configuration files before attempting to destroy your environments for the first time.

Can I Remove Resources Left Over After a Failed Destroy Operation?

If you destroy an environment without first upgrading the Provider or making the configuration changes described above, the destroy operation may fail and leave ENIs behind in your account. You can delete these ENIs manually a few minutes after the Lambda functions that use them have been deleted (typically within 40 minutes). Once the ENIs are deleted, rerun terraform destroy.
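
To find the leftover ENIs, one option is a small, separate Terraform configuration that lists the network interfaces still associated with a security group. This is only a sketch: it assumes your provider version includes the aws_network_interfaces data source, the security group ID is a placeholder you must fill in, and the names leftover and leftover_eni_ids are illustrative.

```hcl
provider "aws" {
  region = "eu-central-1"
}

# Look up every network interface that still references the security group.
# Assumption: replace "sg-<id>" with the group that failed to delete.
data "aws_network_interfaces" "leftover" {
  filter {
    name   = "group-id"
    values = ["sg-<id>"]
  }
}

# Print the ENI IDs so they can be reviewed before manual deletion.
output "leftover_eni_ids" {
  value = "${data.aws_network_interfaces.leftover.ids}"
}
```

Keeping this lookup in its own working directory avoids adding new resources to the environment you are trying to destroy.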
