Ops Troubleshoot

Systematic AWS/EKS troubleshooting workflow skill.

Description

Provides a systematic workflow: 5-minute triage → investigation → resolution → postmortem.

Trigger Keywords

  • "troubleshoot"
  • "debug"
  • "incident"
  • "problem solving"

Workflow Overview

Phase 1: Triage (5 minutes)

  1. Cluster Status - kubectl cluster-info, kubectl get nodes -o wide
  2. Failed Workloads - kubectl get pods -A --field-selector=status.phase!=Running
  3. Recent Events - kubectl get events -A --sort-by='.lastTimestamp' | tail -50
  4. System Pods - kubectl get pods -n kube-system
  5. Resource Usage - kubectl top nodes, kubectl top pods -A --sort-by=memory | head -20
  6. AWS Status - aws eks describe-cluster --name $CLUSTER_NAME --query 'cluster.status'
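
The six checks above can be wrapped in one script so the triage always runs in the same order (a sketch: it assumes kubectl and the AWS CLI are already configured, and that metrics-server is installed so kubectl top works):

```shell
#!/usr/bin/env bash
# triage.sh -- run the 5-minute triage checks in a fixed order (sketch).
set -u

triage() {
  local cluster="$1"
  echo "== 1. Cluster status =="
  kubectl cluster-info
  kubectl get nodes -o wide
  echo "== 2. Failed workloads =="
  kubectl get pods -A --field-selector=status.phase!=Running
  echo "== 3. Recent events =="
  kubectl get events -A --sort-by='.lastTimestamp' | tail -50
  echo "== 4. System pods =="
  kubectl get pods -n kube-system
  echo "== 5. Resource usage =="
  kubectl top nodes
  kubectl top pods -A --sort-by=memory | head -20
  echo "== 6. AWS control-plane status =="
  aws eks describe-cluster --name "$cluster" --query 'cluster.status'
}

# Usage: triage "$CLUSTER_NAME"
```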

Phase 2: Investigation

  1. Identify symptom domain (network, auth, storage, compute, observability)
  2. Route to appropriate specialist agent
  3. Collect diagnostic data using domain-specific commands
  4. Cross-reference with known error patterns
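
Step 2 (routing to a specialist) can be approximated with a keyword heuristic; the keyword-to-domain mapping below is an illustrative assumption, not part of the skill itself:

```shell
# route_domain: map an error message to a symptom domain.
# Illustrative keyword heuristic -- the keyword list is an assumption,
# not an exhaustive mapping.
route_domain() {
  local msg="$1"
  case "$msg" in
    *"no such host"*|*"connection refused"*|*ENI*|*Subnet*) echo network ;;
    *Unauthorized*|*Forbidden*|*AccessDenied*|*token*)      echo auth ;;
    *Volume*|*Mount*|*PVC*)                                 echo storage ;;
    *OOMKilled*|*"Insufficient cpu"*|*Evicted*)             echo compute ;;
    *)                                                      echo unknown ;;
  esac
}

# route_domain "dial tcp: lookup db: no such host"  -> network
```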

Phase 3: Resolution

  1. Apply fix (configuration change, scaling, restart, etc.)
  2. Verify fix resolves the symptom
  3. Monitor for regression (5-15 minutes)

Phase 4: Postmortem

  1. Document incident (timeline, impact, root cause)
  2. Identify preventive measures
  3. Update runbooks if new pattern discovered
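
Step 1 can be seeded from a template so no field gets forgotten; the section names below are an illustrative assumption, not a fixed standard:

```shell
# postmortem_template: print a minimal incident write-up skeleton.
# The section names are an illustrative assumption.
postmortem_template() {
  local incident_id="$1"
  cat <<EOF
# Postmortem: ${incident_id}

## Timeline
- detected:
- mitigated:
- resolved:

## Impact

## Root Cause

## Preventive Measures
- [ ] runbook updated?
EOF
}

# Usage: postmortem_template INC-42 > postmortems/INC-42.md
```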

Severity Classification

| Level | Response | Criteria |
| --- | --- | --- |
| P1 Critical | < 5 min | Service outage, data loss risk |
| P2 High | < 30 min | Major degradation, high error rate |
| P3 Medium | < 4 hr | Minor impact, single component |
| P4 Low | Next business day | Warning, optimization |
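
For automation (paging, ticket SLAs), the response targets in the table above can be encoded directly:

```shell
# severity_response: return the response-time target for a severity level,
# matching the severity classification table.
severity_response() {
  case "$1" in
    P1) echo "< 5 min" ;;
    P2) echo "< 30 min" ;;
    P3) echo "< 4 hr" ;;
    P4) echo "Next business day" ;;
    *)  return 1 ;;   # unknown level
  esac
}
```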

Decision Trees (Extended)

Pod Not Starting Decision Tree

Node Not Ready Decision Tree

Network Connectivity Decision Tree

Storage Issue Decision Tree


Error to Solution Mapping Table

Cluster Errors

| Error Message | Root Cause | Solution |
| --- | --- | --- |
| Unable to connect to the server | API server unreachable | Check VPC endpoint, SG, kubeconfig |
| error: the server doesn't have a resource type | API version mismatch | Update kubectl version |
| Unauthorized | Invalid/expired token | aws eks update-kubeconfig --name <cluster> |
| certificate signed by unknown authority | Wrong CA | Update kubeconfig with correct cluster |

Node Errors

| Error Message | Root Cause | Solution |
| --- | --- | --- |
| NodeNotReady | kubelet stopped, network issue | Check kubelet: journalctl -u kubelet -n 100 |
| MemoryPressure | Node memory exhaustion | Evict pods, increase node size |
| DiskPressure | Disk full | Clean images: crictl rmi --prune, expand disk |
| PIDPressure | Too many processes | Increase PID limit, check for fork bombs |
| NetworkUnavailable | CNI plugin failure | Restart aws-node DaemonSet |
| Taint node.kubernetes.io/not-ready | Node not ready | Fix underlying condition |
| node has insufficient CPU/memory | Resource exhaustion | Scale out node group |

Pod Errors

| Error Message | Root Cause | Solution |
| --- | --- | --- |
| CrashLoopBackOff | App crash, OOM, config error | kubectl logs <pod> --previous |
| ImagePullBackOff | Wrong image/tag, auth failure | Check image name, imagePullSecrets |
| ErrImagePull | ECR login expired, network | Refresh ECR token, check SG |
| OOMKilled | Container exceeded memory limit | Increase memory limit |
| Evicted | Node resource pressure | Set resource limits, increase node capacity |
| CreateContainerConfigError | Missing ConfigMap/Secret | Verify ConfigMap/Secret exists |
| FailedScheduling: Insufficient cpu | No node has enough CPU | Scale out or reduce resource requests |
| FailedScheduling: pod has unbound PVCs | PVC not bound | Check StorageClass, CSI driver |

Network Errors

| Error Message | Root Cause | Solution |
| --- | --- | --- |
| InsufficientFreeAddressesInSubnet | IP exhaustion | Add secondary CIDR, enable prefix delegation |
| ENI limit reached | Instance type ENI limit | Use larger instance or prefix delegation |
| dial tcp: lookup <service>: no such host | DNS failure | Check CoreDNS, ndots setting |
| connection refused | Service not listening | Check pod port, targetPort, service selector |
| context deadline exceeded | Timeout | Check SG rules, network policy, routing |
| 502 Bad Gateway (ALB) | Target unhealthy | Check pod readiness, health check path |
| 503 Service Temporarily Unavailable | No healthy targets | Check target group registration |
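
The DNS rows often come down to search-list expansion driven by the ndots setting: a short name is tried against every search domain before the literal name. Using the fully qualified Service name sidesteps that. A tiny helper, assuming the default cluster.local cluster domain:

```shell
# svc_fqdn: print the fully qualified in-cluster DNS name for a Service.
# Assumes the default cluster domain "cluster.local" -- adjust if your
# cluster was built with a different domain.
svc_fqdn() {
  local service="$1" namespace="$2"
  echo "${service}.${namespace}.svc.cluster.local"
}

# svc_fqdn my-api prod  ->  my-api.prod.svc.cluster.local
```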

Storage Errors

| Error Message | Root Cause | Solution |
| --- | --- | --- |
| FailedAttachVolume | AZ mismatch, volume busy | Match PV AZ, force detach old |
| FailedMount | Mount point error, SG | Check mount target SG (EFS), device path |
| MountVolume.SetUp failed for volume | Filesystem error | Check fsType, securityContext |
| volume already attached to another node | Stale attachment | Delete VolumeAttachment, force detach |

IAM/Auth Errors

| Error Message | Root Cause | Solution |
| --- | --- | --- |
| AccessDenied | Missing IAM permission | Add policy to role |
| Forbidden: User "system:anonymous" | Authentication failed | Fix aws-auth ConfigMap or access entries |
| could not get token (IRSA) | OIDC provider issue | Verify OIDC provider, trust policy |
| WebIdentityErr | Trust policy mismatch | Fix condition in IAM trust policy |
| AccessDenied when calling AssumeRoleWithWebIdentity | IRSA misconfigured | Check SA annotation, OIDC, trust policy |

Real-world Scenarios

Scenario 1: CrashLoopBackOff

Symptom: Pod keeps restarting with CrashLoopBackOff status.

Diagnosis Steps:

# 1. Check pod status and restart count
kubectl get pod <pod-name> -n <namespace>

# 2. Check pod events
kubectl describe pod <pod-name> -n <namespace> | grep -A 20 "Events:"

# 3. Check current container logs
kubectl logs <pod-name> -n <namespace>

# 4. Check previous container logs (crashed container)
kubectl logs <pod-name> -n <namespace> --previous

# 5. Check for OOMKilled
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'

Common Causes and Resolutions:

| Cause | Log Pattern | Resolution |
| --- | --- | --- |
| OOMKilled | Exit code 137, "OOMKilled" in terminated reason | Increase resources.limits.memory |
| Config error | "file not found", "env variable not set" | Fix ConfigMap/Secret mounting |
| Dependency failure | "connection refused", "ECONNREFUSED" | Check dependent service availability |
| App bug | Application-specific stack trace | Fix application code |
| Health check fail | "Liveness probe failed" in events | Adjust probe timing or fix health endpoint |
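
The OOMKilled row is recognizable from the exit code alone: codes above 128 mean the process was killed by a signal (code minus 128), so 137 is SIGKILL, the signature of the kernel OOM killer. A decoder sketch:

```shell
# exit_reason: decode a container exit code.
# Codes > 128 mean death by signal (code - 128); 137 = 128 + 9 (SIGKILL),
# which is what an OOMKilled container reports.
exit_reason() {
  local code="$1"
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  elif [ "$code" -eq 0 ]; then
    echo "clean exit"
  else
    echo "application error (exit $code)"
  fi
}

# exit_reason 137  ->  killed by signal 9
```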

Resolution Example (OOMKilled):

# Before: Insufficient memory
resources:
  limits:
    memory: "128Mi"

# After: Increased memory with buffer
resources:
  limits:
    memory: "512Mi"
  requests:
    memory: "256Mi"

Scenario 2: ImagePullBackOff

Symptom: Pod stuck in ImagePullBackOff or ErrImagePull status.

Diagnosis Steps:

# 1. Check pod events for specific error
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Events:"

# 2. Verify image name and tag
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].image}'

# 3. For ECR, check node IAM permissions
aws sts get-caller-identity # Run from node via SSM

# 4. Check imagePullSecrets
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.imagePullSecrets}'

# 5. Test ECR login manually
aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <account>.dkr.ecr.<region>.amazonaws.com

Common Causes and Resolutions:

| Cause | Resolution |
| --- | --- |
| Image doesn't exist | Verify image:tag in registry |
| ECR permissions | Add ecr:GetDownloadUrlForLayer, ecr:BatchGetImage to node role |
| Private registry no secret | Create imagePullSecret and reference in pod spec |
| Network blocked | Check security groups allow outbound 443 to registry |
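
A quick first check is whether the image reference even points at the registry you think it does. A simplified parser (a sketch: it assumes a registry host is present and ignores @sha256 digest references; the account ID in the example is made up):

```shell
# image_parts: split an image reference into registry, repository and tag.
# Simplified -- assumes "<registry>/<repo>:<tag>" form, no @digest handling.
image_parts() {
  local ref="$1"
  local registry="${ref%%/*}"   # text before the first slash
  local rest="${ref#*/}"
  local tag="${rest##*:}"       # text after the last colon
  local repo="${rest%:*}"
  echo "registry=$registry repo=$repo tag=$tag"
}

# image_parts 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:v1.2
```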

Scenario 3: OOMKilled Investigation

Symptom: Container terminates with OOMKilled reason.

Full Diagnosis Walkthrough:

# 1. Confirm OOMKilled
kubectl get pod <pod> -n <ns> -o json | jq '.status.containerStatuses[] | {name:.name, restartCount:.restartCount, lastState:.lastState}'

# 2. Check current memory usage
kubectl top pod <pod> -n <ns> --containers

# 3. Check memory limits
kubectl get pod <pod> -n <ns> -o json | jq '.spec.containers[] | {name:.name, limits:.resources.limits}'

# 4. Check node memory pressure
kubectl describe node <node> | grep -A 5 "Conditions:"

# 5. Historical memory usage (if metrics-server available)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/<ns>/pods/<pod>" | jq '.containers[].usage'

Resolution Strategy:

  1. Short-term: Increase memory limit to 1.5-2x current peak usage
  2. Medium-term: Profile application to find memory leaks
  3. Long-term: Implement proper memory management in application
# Recommended resource configuration
resources:
  requests:
    memory: "256Mi" # Average usage
    cpu: "100m"
  limits:
    memory: "512Mi" # Peak usage + 25% buffer
    cpu: "500m" # Allow burst
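
The "peak usage + 25% buffer" rule for the limit can be computed directly (a sketch; it takes the observed peak in Mi and rounds up to a whole Mi):

```shell
# mem_limit_mi: suggest resources.limits.memory from an observed peak (in Mi),
# applying the "peak + 25% buffer" rule of thumb, rounded up.
mem_limit_mi() {
  local peak="$1"
  # integer ceiling of peak * 1.25
  echo "$(( (peak * 125 + 99) / 100 ))Mi"
}

# mem_limit_mi 400  ->  500Mi
```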

Usage Example

Pods keep going into CrashLoopBackOff. Please troubleshoot.

Workflow starts automatically:

  1. 5-minute triage for overall situation assessment
  2. Route to eks-agent for pod log analysis
  3. Identify root cause and propose solution
  4. Verify and recommend recurrence prevention

Reference Files

  • references/troubleshooting-framework.md - Systematic approach and commands
  • references/incident-response.md - 5-minute checklist, severity matrix
  • references/decision-trees.md - Mermaid decision trees for common scenarios
  • references/common-errors.md - Error message to solution mapping