Network Diagnosis Demo

IP exhaustion and ALB 502 error diagnosis walkthrough with diagnostic command execution and resolution process.

Scenario Overview

This demo covers two common network issues:

IP Exhaustion - Pods stuck in Pending due to insufficient VPC IPs
ALB 502 Errors - Application Load Balancer returning 502 Bad Gateway

Scenario 1: IP Exhaustion

Problem Report

User reports:

New pods are stuck in Pending state. kubectl describe shows "failed to assign IP address" errors.

Diagnosis Workflow

Step 1: Identify the Problem

# Check pending pods
kubectl get pods -A --field-selector=status.phase=Pending

Output:

NAMESPACE   NAME                        READY   STATUS    RESTARTS   AGE
backend     api-server-7f8b9-abc        0/1     Pending   0          15m
backend     api-server-7f8b9-def        0/1     Pending   0          15m
backend     worker-5c6d7-ghi            0/1     Pending   0          10m
frontend    web-app-8e9f0-jkl           0/1     Pending   0          8m

# Check pod events
kubectl describe pod api-server-7f8b9-abc -n backend | grep -A 10 "Events:"

Output:

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  14m   default-scheduler  0/3 nodes are available: 3 Insufficient pods.
  Warning  FailedCreatePodSandBox  13m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox: plugin type="aws-cni" name="aws-cni" failed: add cmd: failed to assign an IP address to container

Identified: IP assignment failure from VPC CNI.

Step 2: Check Subnet IP Availability

# Get cluster subnets and available IPs
aws ec2 describe-subnets --filters "Name=tag:kubernetes.io/cluster/prod-cluster,Values=*" \
  --query 'Subnets[].{SubnetId:SubnetId,AZ:AvailabilityZone,CIDR:CidrBlock,Available:AvailableIpAddressCount}'

Output:

[
    {
        "SubnetId": "subnet-0a1b2c3d4e5f",
        "AZ": "us-west-2a",
        "CIDR": "10.0.1.0/24",
        "Available": 3
    },
    {
        "SubnetId": "subnet-1b2c3d4e5f6g",
        "AZ": "us-west-2b",
        "CIDR": "10.0.2.0/24",
        "Available": 5
    },
    {
        "SubnetId": "subnet-2c3d4e5f6g7h",
        "AZ": "us-west-2c",
        "CIDR": "10.0.3.0/24",
        "Available": 2
    }
]

Finding: All subnets critically low on IPs (total 10 available across 3 AZs).

Step 3: Check ENI Allocation

# Check per-node IP usage via IPAMD
kubectl exec -n kube-system ds/aws-node -c aws-node -- curl -s http://localhost:61678/v1/enis 2>/dev/null | jq '.ENIs | length'

Output:

# Check node allocatable pods
kubectl get nodes -o json | jq '.items[] | {name:.metadata.name, allocatable_pods:.status.allocatable.pods, capacity_pods:.status.capacity.pods}'

Output:

{"name":"ip-10-0-1-100.ec2.internal","allocatable_pods":"17","capacity_pods":"17"}
{"name":"ip-10-0-2-150.ec2.internal","allocatable_pods":"17","capacity_pods":"17"}
{"name":"ip-10-0-3-200.ec2.internal","allocatable_pods":"17","capacity_pods":"17"}

Finding: t3.medium instances with max 17 pods each. Total capacity: 51 pods.

Step 4: Check IPAMD Logs

# Check for IP allocation errors
kubectl logs -n kube-system -l k8s-app=aws-node -c aws-node --tail=50 | grep -i "insufficient\|error\|failed"

Output:

{"level":"error","ts":"2026-03-22T10:15:23.456Z","msg":"InsufficientFreeAddressesInSubnet: The subnet 'subnet-0a1b2c3d4e5f' has insufficient free addresses to satisfy the request"}
{"level":"error","ts":"2026-03-22T10:15:45.789Z","msg":"Failed to allocate IP address for pod backend/api-server-7f8b9-abc"}

Root Cause Analysis

## Root Cause Analysis

### Summary
IP exhaustion in all EKS subnets preventing new pod scheduling.

### Details
- VPC CIDR: 10.0.0.0/16
- Subnet CIDRs: 3 x /24 (254 usable IPs each, 762 total)
- Current allocation: ~752 IPs in use
- Available: 10 IPs total (critical)

### Contributing Factors
1. Subnet sizing too small for workload growth
2. No prefix delegation enabled
3. WARM_IP_TARGET not tuned (over-allocation)

Resolution

Immediate Fix: Enable Prefix Delegation

# Enable prefix delegation for 16x more IPs per slot
kubectl set env daemonset aws-node -n kube-system \
  ENABLE_PREFIX_DELEGATION=true \
  WARM_PREFIX_TARGET=1

# Restart aws-node to apply
kubectl rollout restart daemonset/aws-node -n kube-system

# Verify rollout
kubectl rollout status daemonset/aws-node -n kube-system

Output:

daemonset "aws-node" successfully rolled out

Verify Fix

# Check pods are now scheduling
kubectl get pods -A --field-selector=status.phase=Pending

Output:

No resources found

# Verify all pods running
kubectl get pods -n backend

Output:

NAME                        READY   STATUS    RESTARTS   AGE
api-server-7f8b9-abc        1/1     Running   0          20m
api-server-7f8b9-def        1/1     Running   0          20m
worker-5c6d7-ghi            1/1     Running   0          15m

Long-term Fix: Add Secondary CIDR

# Add secondary CIDR for dedicated pod subnets
aws ec2 associate-vpc-cidr-block --vpc-id vpc-0123456789abcdef0 --cidr-block 100.64.0.0/16

# Create new subnets in secondary CIDR
aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 --cidr-block 100.64.0.0/19 --availability-zone us-west-2a
aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 --cidr-block 100.64.32.0/19 --availability-zone us-west-2b
aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 --cidr-block 100.64.64.0/19 --availability-zone us-west-2c

Scenario 2: ALB 502 Errors

Problem Report

User reports:

Application returns 502 Bad Gateway intermittently. Started after recent deployment.

Diagnosis Workflow

Step 1: Check Target Group Health

# Get target group ARN from Ingress
kubectl get ingress api-ingress -n backend -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'

Output:

k8s-backend-apiingr-abc123-456789.us-west-2.elb.amazonaws.com

# Find target group ARN
TG_ARN=$(aws elbv2 describe-target-groups --query "TargetGroups[?contains(TargetGroupName, 'backend')].TargetGroupArn" --output text)

# Check target health
aws elbv2 describe-target-health --target-group-arn $TG_ARN

Output:

{
    "TargetHealthDescriptions": [
        {
            "Target": {"Id": "10.0.1.45", "Port": 8080},
            "HealthCheckPort": "8080",
            "TargetHealth": {"State": "unhealthy", "Reason": "Target.FailedHealthChecks", "Description": "Health checks failed with these codes: [503]"}
        },
        {
            "Target": {"Id": "10.0.2.78", "Port": 8080},
            "HealthCheckPort": "8080",
            "TargetHealth": {"State": "unhealthy", "Reason": "Target.FailedHealthChecks", "Description": "Health checks failed with these codes: [503]"}
        },
        {
            "Target": {"Id": "10.0.3.112", "Port": 8080},
            "HealthCheckPort": "8080",
            "TargetHealth": {"State": "healthy"}
        }
    ]
}

Finding: 2 of 3 targets unhealthy, health checks returning 503.

Step 2: Check Pod Status

# Check backend pods
kubectl get pods -n backend -l app=api-server -o wide

Output:

NAME                        READY   STATUS    RESTARTS   AGE   IP           NODE
api-server-7f8b9-abc        1/1     Running   0          30m   10.0.1.45    ip-10-0-1-100.ec2.internal
api-server-7f8b9-def        1/1     Running   0          30m   10.0.2.78    ip-10-0-2-150.ec2.internal
api-server-7f8b9-ghi        1/1     Running   0          30m   10.0.3.112   ip-10-0-3-200.ec2.internal

# Test health endpoint from within pod
kubectl exec -it api-server-7f8b9-abc -n backend -- curl -s localhost:8080/health

Output:

{"status": "healthy", "version": "2.1.0"}

Finding: Pods are running and health endpoint works locally.

Step 3: Check Security Groups

# Get node security group
NODE_SG=$(aws ec2 describe-instances --filters "Name=private-ip-address,Values=10.0.1.100" --query 'Reservations[].Instances[].SecurityGroups[].GroupId' --output text)

# Check inbound rules
aws ec2 describe-security-group-rules --filter Name=group-id,Values=$NODE_SG --query 'SecurityGroupRules[?!IsEgress].{FromPort:FromPort,ToPort:ToPort,Source:CidrIpv4,SourceSG:ReferencedGroupInfo.GroupId}'

Output:

[
    {"FromPort": 443, "ToPort": 443, "Source": null, "SourceSG": "sg-alb12345"},
    {"FromPort": 10250, "ToPort": 10250, "Source": "10.0.0.0/16", "SourceSG": null}
]

Finding: Port 8080 not allowed from ALB security group!

Step 4: Check Health Check Configuration

# Check Ingress annotations
kubectl get ingress api-ingress -n backend -o yaml | grep -A 10 "annotations:"

Output:

annotations:
  alb.ingress.kubernetes.io/scheme: internet-facing
  alb.ingress.kubernetes.io/target-type: ip
  alb.ingress.kubernetes.io/healthcheck-path: /health
  alb.ingress.kubernetes.io/healthcheck-port: "8080"

Finding: Health check configured for port 8080, but SG blocks it.

Root Cause Analysis

## Root Cause Analysis

### Summary
ALB health checks failing due to missing Security Group rule for port 8080.

### Timeline
- T-1h: Deployment updated to use target-type: ip (previously instance)
- T-45m: First 502 errors reported
- T-30m: Target health degraded to 1/3 healthy

### Root Cause
Security Group migration incomplete when switching from instance to ip target type.
- Instance mode: ALB routes to NodePort, SG allows NodePort range
- IP mode: ALB routes directly to pod IP:port, SG must allow pod port

### Impact
- 66% of requests hitting unhealthy targets
- Intermittent 502 errors for end users

Resolution

Fix Security Group

# Get ALB security group
ALB_SG=$(aws ec2 describe-security-groups --filters "Name=tag:ingress.k8s.aws/stack,Values=backend/api-ingress" --query 'SecurityGroups[].GroupId' --output text)

# Add rule allowing ALB to reach pod port 8080
aws ec2 authorize-security-group-ingress \
  --group-id $NODE_SG \
  --protocol tcp \
  --port 8080 \
  --source-group $ALB_SG \
  --description "ALB to backend pods"

Output:

{
    "Return": true,
    "SecurityGroupRules": [
        {
            "SecurityGroupRuleId": "sgr-0abc123def456",
            "GroupId": "sg-node12345",
            "IpProtocol": "tcp",
            "FromPort": 8080,
            "ToPort": 8080,
            "ReferencedGroupInfo": {"GroupId": "sg-alb12345"}
        }
    ]
}

Verify Resolution

# Wait for health checks to pass (30-60 seconds)
sleep 60

# Check target health
aws elbv2 describe-target-health --target-group-arn $TG_ARN --query 'TargetHealthDescriptions[].TargetHealth.State'

Output:

["healthy", "healthy", "healthy"]

# Test ALB endpoint
curl -s -o /dev/null -w "%{http_code}" https://k8s-backend-apiingr-abc123-456789.us-west-2.elb.amazonaws.com/api/health

Output:

Summary Report

## Network Diagnosis Report

### Scenario 1: IP Exhaustion
- **Issue**: Pods stuck in Pending due to IP exhaustion
- **Layer**: L3 (IP)
- **Severity**: P2 - High
- **Root Cause**: Subnet CIDR too small, no prefix delegation
- **Resolution**: Enabled prefix delegation, planned secondary CIDR
- **Prevention**: Monitor subnet IP availability, alert at < 20%

### Scenario 2: ALB 502 Errors
- **Issue**: Intermittent 502 Bad Gateway
- **Layer**: L7 (Application)
- **Severity**: P2 - High
- **Root Cause**: Security Group missing rule for pod port
- **Resolution**: Added SG rule for ALB to pod communication
- **Prevention**: Include SG updates in deployment checklist for target-type changes

Key Points

IP Planning

Plan subnet CIDRs for 3-5x expected pod count. Enable prefix delegation early - it's non-disruptive and provides significant capacity increase.

Target Type Migration

When switching from instance to ip target type, Security Groups must be updated to allow ALB direct access to pod ports. This is a common oversight.

Health Check Testing

Always test health check paths from both inside the pod (localhost) and from external sources (another pod, ALB) to identify network/SG issues.

Scenario Overview​

Scenario 1: IP Exhaustion​

Problem Report​

Diagnosis Workflow​

Step 1: Identify the Problem​

Step 2: Check Subnet IP Availability​

Step 3: Check ENI Allocation​

Step 4: Check IPAMD Logs​

Root Cause Analysis​

Resolution​

Immediate Fix: Enable Prefix Delegation​

Verify Fix​

Long-term Fix: Add Secondary CIDR​

Scenario 2: ALB 502 Errors​

Problem Report​

Diagnosis Workflow​

Step 1: Check Target Group Health​

Step 2: Check Pod Status​

Step 3: Check Security Groups​

Step 4: Check Health Check Configuration​

Root Cause Analysis​

Resolution​

Fix Security Group​

Verify Resolution​

Summary Report​

Key Points​

Scenario Overview

Scenario 1: IP Exhaustion

Problem Report

Diagnosis Workflow

Step 1: Identify the Problem

Step 2: Check Subnet IP Availability

Step 3: Check ENI Allocation

Step 4: Check IPAMD Logs

Root Cause Analysis

Resolution

Immediate Fix: Enable Prefix Delegation

Verify Fix

Long-term Fix: Add Secondary CIDR

Scenario 2: ALB 502 Errors

Problem Report

Diagnosis Workflow

Step 1: Check Target Group Health

Step 2: Check Pod Status

Step 3: Check Security Groups

Step 4: Check Health Check Configuration

Root Cause Analysis

Resolution

Fix Security Group

Verify Resolution

Summary Report

Key Points