Ops Network Diagnosis

AWS/EKS deep network diagnosis skill.

Description

Provides deep network diagnostics for VPC CNI, load balancers, and DNS.

Trigger Keywords

  • "network issue"
  • "network error"
  • "connectivity"
  • "DNS failure"

Diagnosis Workflow

Step 1: Layer Identification

  • L3 (IP): IP exhaustion, subnet, routing, VPC peering
  • L4 (Transport): Security Groups, NACLs, port connectivity
  • L7 (Application): Load Balancer, Ingress, target health
  • DNS: CoreDNS, Route 53, external-dns

Step 2: Layer-specific Diagnosis

Use detailed commands and decision trees for each layer.

Step 3: Resolution Verification

Test end-to-end connectivity after applying fixes.

Quick Connectivity Tests

# Pod-to-pod
kubectl exec -it <pod1> -- curl -s <pod2-ip>:<port>

# Pod-to-service
kubectl exec -it <pod> -- curl -s <service>.<namespace>.svc.cluster.local:<port>

# DNS resolution
kubectl exec -it <pod> -- nslookup <service>.<namespace>.svc.cluster.local

# External connectivity
kubectl exec -it <pod> -- curl -s -o /dev/null -w "%{http_code}" https://aws.amazon.com

VPC CNI Deep Guide

Architecture Overview

VPC CNI assigns real VPC IP addresses to each Pod via two components:

  • IPAMD (L-IPAM Daemon): Pre-allocates and manages ENIs and IPs on each node
  • CNI Binary: Called by kubelet to assign IPs and configure pod network namespaces

IP Allocation Modes

| Mode | Allocation Unit | Pod Density | Recommended For |
|---|---|---|---|
| Secondary IP | Individual IPs | Limited by IPs per ENI | Small clusters |
| Prefix Delegation | /28 prefix (16 IPs) | Much higher | Large clusters |

Instance Type Limits

| Instance Type | Max ENIs | IPv4 per ENI | Max Pods (secondary IP) |
|---|---|---|---|
| t3.medium | 3 | 6 | 17 |
| t3.large | 3 | 12 | 35 |
| m5.large | 3 | 10 | 29 |
| m5.xlarge | 4 | 15 | 58 |
| c5.4xlarge | 8 | 30 | 234 |

Max Pods = ENIs x (IPs per ENI - 1) + 2 (each ENI's primary IP is unusable by pods; +2 accounts for host-network pods)
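The formula can be checked against the table with a small shell helper (function names are illustrative). The prefix-delegation variant gives raw IP capacity; note that EKS additionally caps max-pods, typically at 110 on smaller instance types.

```shell
# Secondary-IP mode: one IP per ENI is the primary (not assignable to pods),
# plus 2 for host-network pods.
max_pods_secondary() {
  enis=$1; ips_per_eni=$2
  echo $(( enis * (ips_per_eni - 1) + 2 ))
}

# Prefix-delegation mode: each secondary-IP slot holds a /28 (16 IPs).
# This is raw capacity; the kubelet max-pods setting may cap it lower.
max_pods_prefix() {
  enis=$1; ips_per_eni=$2
  echo $(( enis * (ips_per_eni - 1) * 16 + 2 ))
}

max_pods_secondary 3 12   # t3.large -> 35
max_pods_prefix 3 12      # t3.large with prefix delegation -> 530
```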

Key Environment Variables

| Variable | Description | Default |
|---|---|---|
| WARM_IP_TARGET | Spare IPs to pre-allocate | Not set |
| MINIMUM_IP_TARGET | Minimum IPs on node | Not set |
| WARM_ENI_TARGET | Spare ENIs to pre-allocate | 1 |
| ENABLE_PREFIX_DELEGATION | Enable prefix delegation | false |
| AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG | Enable custom networking | false |

VPC CNI Diagnostic Commands

# IPAMD logs
kubectl logs -n kube-system -l k8s-app=aws-node -c aws-node | grep -i "insufficient\|error\|failed"

# Per-node allocatable pod capacity
kubectl get nodes -o json | jq '.items[] | {name:.metadata.name, allocatable_pods:.status.allocatable.pods}'

# Subnet available IPs
aws ec2 describe-subnets --subnet-ids <subnet-id> --query 'Subnets[].{ID:SubnetId,CIDR:CidrBlock,Available:AvailableIpAddressCount}'

# ENI details per node
aws ec2 describe-network-interfaces --filters Name=attachment.instance-id,Values=<instance-id> --query 'NetworkInterfaces[].{ID:NetworkInterfaceId,PrivateIPs:PrivateIpAddresses|length(@)}'

# IPAMD metrics endpoint
kubectl exec -n kube-system ds/aws-node -c aws-node -- curl -s http://localhost:61678/v1/enis 2>/dev/null | jq .

IP Exhaustion Solutions

Problem: Pods stuck in Pending with IP allocation failure

Solutions:

  1. Enable Prefix Delegation:
     kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
  2. Add Secondary CIDR:
     aws ec2 associate-vpc-cidr-block --vpc-id <vpc-id> --cidr-block 100.64.0.0/16
  3. Tune WARM_IP_TARGET:
     kubectl set env daemonset aws-node -n kube-system WARM_IP_TARGET=2 MINIMUM_IP_TARGET=4
  4. Recommended Prefix Delegation settings:
     kubectl set env daemonset aws-node -n kube-system \
       ENABLE_PREFIX_DELEGATION=true \
       WARM_PREFIX_TARGET=1 \
       WARM_IP_TARGET=5 \
       MINIMUM_IP_TARGET=2

Subnet CIDR Planning Best Practice

VPC CIDR: 10.0.0.0/16
├── 10.0.0.0/19  - Node subnet (AZ-a)
├── 10.0.32.0/19 - Node subnet (AZ-b)
├── 10.0.64.0/19 - Node subnet (AZ-c)
└── Secondary CIDR: 100.64.0.0/16
    ├── 100.64.0.0/19  - Pod subnet (AZ-a)
    ├── 100.64.32.0/19 - Pod subnet (AZ-b)
    └── 100.64.64.0/19 - Pod subnet (AZ-c)
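When sizing these subnets, remember that AWS reserves 5 addresses in every subnet (network, VPC router, DNS, one reserved for future use, and broadcast). A quick sketch of the usable-address arithmetic (helper name is illustrative):

```shell
# Usable addresses in a subnet of the given prefix length,
# after subtracting the 5 AWS-reserved addresses.
usable_ips() {
  prefix=$1
  echo $(( (1 << (32 - prefix)) - 5 ))
}

usable_ips 19   # each /19 above -> 8187 usable addresses
usable_ips 24   # a /24 -> 251
```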

VPC CNI Error Solutions

| Error | Solution |
|---|---|
| InsufficientFreeAddressesInSubnet | Add secondary CIDR or enable prefix delegation |
| SecurityGroupLimitExceeded | Clean up unused SGs or consolidate |
| ENI limit reached | Larger instance type or prefix delegation |
| Failed to create ENI | Add ENI creation permissions to node role |
| Timeout waiting for pod IP | Restart IPAMD: kubectl rollout restart ds/aws-node -n kube-system |

Load Balancer Troubleshooting

Prerequisites Checklist

  1. LB Controller installed: kubectl get deployment -n kube-system aws-load-balancer-controller
  2. IRSA configured: Service account has correct IAM role annotation
  3. Subnet tags: Public subnets tagged kubernetes.io/role/elb=1, private subnets tagged kubernetes.io/role/internal-elb=1
  4. IngressClass: kubectl get ingressclass shows alb class

Key Annotations Reference

ALB (Ingress)

| Annotation | Description | Default |
|---|---|---|
| alb.ingress.kubernetes.io/scheme | internet-facing or internal | internal |
| alb.ingress.kubernetes.io/target-type | ip or instance | instance |
| alb.ingress.kubernetes.io/subnets | Subnet IDs | Auto-detect |
| alb.ingress.kubernetes.io/certificate-arn | ACM cert ARN | - |
| alb.ingress.kubernetes.io/healthcheck-path | Health check path | / |
| alb.ingress.kubernetes.io/group.name | Share ALB across Ingresses | - |
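A minimal Ingress putting these annotations together might look like the following sketch (the `demo-app` name, health check path, and backend port are hypothetical placeholders):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo-app                    # hypothetical name
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/healthcheck-path: /healthz
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: demo-app      # hypothetical backend Service
                port:
                  number: 80
```

target-type: ip sends traffic directly to pod IPs (required for Fargate), while instance routes via NodePort.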

NLB (Service)

| Annotation | Description | Default |
|---|---|---|
| service.beta.kubernetes.io/aws-load-balancer-type | external (NLB) | - |
| service.beta.kubernetes.io/aws-load-balancer-nlb-target-type | ip or instance | instance |
| service.beta.kubernetes.io/aws-load-balancer-scheme | internet-facing or internal | internal |
| service.beta.kubernetes.io/aws-load-balancer-ssl-cert | ACM cert ARN | - |
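Correspondingly, a minimal NLB Service might be sketched as follows (the `demo-app` name, selector, and ports are hypothetical placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: demo-app                    # hypothetical name
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
spec:
  type: LoadBalancer
  selector:
    app: demo-app                   # hypothetical selector
  ports:
    - port: 80
      targetPort: 8080
```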

ALB Not Created - Debugging

# Check controller logs
kubectl logs -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller --tail=50

# Check Ingress events
kubectl describe ingress <name> -n <namespace>

# Common causes:
# 1. Missing IngressClass: add spec.ingressClassName: alb
# 2. Missing subnet tags
# 3. IAM permission insufficient
# 4. Invalid annotation values

Targets Unhealthy - Debugging

# Check target health
aws elbv2 describe-target-health --target-group-arn <arn>

# Test health check from pod
kubectl exec -it <pod> -- curl -s localhost:<port><health-path>

# Common causes:
# 1. Health check path returns non-200
# 2. Security group blocks ALB to Pod traffic
# 3. Pod not ready/running
# 4. Wrong targetPort

502 Bad Gateway Resolution

Root causes:

  1. Pod not ready - Check pod status and readiness probe
  2. Target deregistering - Target group draining in progress
  3. Health check failing - Verify health check path and timeout
  4. Security group - The node/pod security group must allow traffic from the ALB's security group
# Debugging steps
kubectl get pods -l app=<app>
aws elbv2 describe-target-health --target-group-arn <arn>
kubectl describe ingress <name>

Subnet Tagging

# Public subnets (internet-facing LB)
aws ec2 create-tags --resources <subnet-id> --tags Key=kubernetes.io/role/elb,Value=1

# Private subnets (internal LB)
aws ec2 create-tags --resources <subnet-id> --tags Key=kubernetes.io/role/internal-elb,Value=1

# Cluster tag (optional)
aws ec2 create-tags --resources <subnet-id> --tags Key=kubernetes.io/cluster/$CLUSTER_NAME,Value=shared

Cost Optimization - ALB Sharing

Share a single ALB across multiple services using Ingress groups:

# In each Ingress's metadata.annotations:
alb.ingress.kubernetes.io/group.name: shared-alb
alb.ingress.kubernetes.io/group.order: "1"   # priority within the group

DNS Deep Diagnosis

CoreDNS Architecture in EKS

CoreDNS runs as a Deployment in kube-system namespace and provides:

  • Service discovery (.svc.cluster.local)
  • Pod DNS (.pod.cluster.local)
  • External DNS forwarding (to VPC DNS resolver at 169.254.169.253)

DNS Diagnostic Commands

# CoreDNS status
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
kubectl get svc -n kube-system kube-dns

# CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=30

# CoreDNS config
kubectl get configmap coredns -n kube-system -o yaml

# DNS resolution test
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default.svc.cluster.local

# DNS latency test
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- sh -c 'for i in $(seq 1 10); do time nslookup kubernetes.default 2>&1 | grep real; done'

DNS Resolution Timeout Solutions

Symptoms: nslookup: can't resolve, dial tcp: lookup: no such host

| Cause | Solution |
|---|---|
| CoreDNS not running | kubectl rollout restart deployment/coredns -n kube-system |
| CoreDNS overloaded | Scale replicas or enable autoscaling |
| ndots too high | Set ndots=2 in pod spec (see below) |
| VPC DNS throttling | VPC resolver has a 1024 packets/sec/ENI limit |

ndots Optimization (the default ndots=5 makes the resolver try every search-domain suffix before the absolute name, multiplying lookups for external names):

# Pod spec optimization
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
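The effect of ndots can be illustrated with a simplified sketch of resolver search-list expansion (the function is illustrative, and it ignores the real resolver's fallback to the search list after the absolute query fails). The search list below matches a typical EKS pod in the default namespace:

```shell
# If the name has fewer dots than ndots, every search domain is tried
# before the absolute name; otherwise the absolute query goes out first.
expand_queries() {
  name=$1; ndots=$2; shift 2
  dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
  if [ "$dots" -lt "$ndots" ]; then
    for d in "$@"; do echo "$name.$d."; done
  fi
  echo "$name."
}

# ndots=5: an external name triggers 3 wasted cluster lookups first (4 total)
expand_queries api.example.com 5 default.svc.cluster.local svc.cluster.local cluster.local
# ndots=2: the same name (2 dots) goes straight to the absolute query (1 total)
expand_queries api.example.com 2 default.svc.cluster.local svc.cluster.local cluster.local
```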

CoreDNS Scaling

# Check current replicas
kubectl get deployment coredns -n kube-system

# Manual scale
kubectl scale deployment coredns -n kube-system --replicas=3

# Enable autoscaling (proportional-autoscaler)
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/cluster-proportional-autoscaler/master/examples/dns-autoscaler.yaml
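The proportional autoscaler's linear mode computes replicas as max(ceil(cores/coresPerReplica), ceil(nodes/nodesPerReplica)), clamped to a minimum. A sketch of that arithmetic (the function name and the parameter values in the example call are illustrative, not the autoscaler's shipped defaults):

```shell
# replicas = max(ceil(cores/cores_per), ceil(nodes/nodes_per)), at least min
coredns_replicas() {
  cores=$1; nodes=$2; cores_per=$3; nodes_per=$4; min=$5
  by_cores=$(( (cores + cores_per - 1) / cores_per ))   # ceiling division
  by_nodes=$(( (nodes + nodes_per - 1) / nodes_per ))
  r=$(( by_cores > by_nodes ? by_cores : by_nodes ))
  [ "$r" -lt "$min" ] && r=$min
  echo "$r"
}

coredns_replicas 64 20 256 16 2   # -> 2 (driven by 20 nodes / 16 per replica)
```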

CoreDNS ConfigMap Customization

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
    # Custom zone forwarding
    example.com:53 {
        forward . 10.0.0.2
        cache 30
    }

External DNS Not Resolving from Pods

# Check pod's resolv.conf
kubectl exec -it <pod> -- cat /etc/resolv.conf

# Expected:
# nameserver 172.20.0.10 (kube-dns service IP)
# search <namespace>.svc.cluster.local svc.cluster.local cluster.local
# options ndots:5

# If nameserver is wrong, check kubelet --cluster-dns setting

Route 53 Integration - external-dns

# Check external-dns status
kubectl get pods -n kube-system -l app=external-dns
kubectl logs -n kube-system -l app=external-dns --tail=20

# Verify DNS record created
aws route53 list-resource-record-sets --hosted-zone-id <zone-id> --query "ResourceRecordSets[?Name=='<domain>.']"

DNS Error Solutions

| Symptom | Likely Cause | Solution |
|---|---|---|
| All DNS fails | CoreDNS down | Restart CoreDNS |
| External DNS slow | High ndots | Set ndots=2 in pod spec |
| Service discovery fails | Wrong namespace | Use FQDN: <service>.<namespace>.svc.cluster.local |
| Route 53 record not created | external-dns IAM | Fix IRSA for external-dns |
| Intermittent failures | DNS throttling | Scale CoreDNS, use NodeLocal DNSCache |

Layer-specific Diagnosis Guide

L3 - IP/Routing

# Subnet available IPs
aws ec2 describe-subnets --subnet-ids <subnet-id> --query 'Subnets[].{CIDR:CidrBlock,Available:AvailableIpAddressCount}'

# VPC CNI IP usage
kubectl exec -n kube-system ds/aws-node -c aws-node -- curl -s http://localhost:61678/v1/enis | jq .

# Route table
aws ec2 describe-route-tables --filters Name=vpc-id,Values=<vpc-id>

L4 - Security Groups

# Node SG
aws ec2 describe-instances --instance-ids <id> --query 'Reservations[].Instances[].SecurityGroups'

# SG rules check
aws ec2 describe-security-group-rules --filter Name=group-id,Values=<sg-id>

# Pod Security Groups
kubectl get securitygrouppolicies -A

L7 - Load Balancer

# LB Controller status
kubectl get deployment -n kube-system aws-load-balancer-controller
kubectl logs -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller --tail=30

# Target health
aws elbv2 describe-target-health --target-group-arn <tg-arn>

Usage Examples

IP Exhaustion Issue

Pod is Pending and IP allocation fails. Please diagnose the network.

Network Diagnosis skill runs automatically:

  1. Identify as L3 layer
  2. Check subnet available IPs
  3. Verify VPC CNI IPAMD status
  4. Recommend Prefix Delegation or Secondary CIDR

ALB 502 Error

ALB returns 502 errors.

Network Diagnosis performs:

  1. Identify as L7 layer
  2. Check target group health
  3. Verify Security Group rules
  4. Check pod readiness status

Reference Files

  • references/vpc-cni-troubleshooting.md - IP management, ENI, Prefix Delegation
  • references/load-balancer-troubleshooting.md - ALB/NLB setup, target health
  • references/dns-troubleshooting.md - CoreDNS, Route 53, resolution issues