Health Check Walkthrough

A complete cluster health check walkthrough, covering six domain checks and a final report template.

Scenario

Perform a comprehensive health check on a production EKS cluster to identify potential issues before they cause incidents.

Health Check Workflow

Step 1: Initiate Health Check

User request:

Please perform a full cluster health check.

The ops-health-check skill activates and works through each domain check in order.

Step 2: Cluster Health Check

# API server responsiveness
time kubectl get --raw /healthz

Output:

ok
real    0m0.089s

# Cluster version and status
aws eks describe-cluster --name prod-cluster --query 'cluster.{status:status,version:version,platformVersion:platformVersion}'

Output:

{
    "status": "ACTIVE",
    "version": "1.29",
    "platformVersion": "eks.8"
}

# Add-on status
aws eks list-addons --cluster-name prod-cluster --output table

Output:

----------------------------------
|          ListAddons            |
+--------------------------------+
|  amazon-cloudwatch-observability|
|  coredns                        |
|  kube-proxy                     |
|  vpc-cni                        |
+--------------------------------+

Result: Cluster OK - API server responsive (89ms), cluster ACTIVE on v1.29, 4 add-ons installed.
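Note that `aws eks list-addons` returns add-on names only, not their status. A minimal sketch of a status check follows; the helper name `flag_inactive_addons` is hypothetical, and it assumes its input is one `<addon-name> <status>` pair per line.

```shell
# Hypothetical helper: reads "<addon-name> <status>" pairs on stdin and
# reports any add-on whose status is not ACTIVE.
# Returns non-zero if at least one add-on is not ACTIVE.
flag_inactive_addons() {
  awk '$2 != "ACTIVE" { print "WARNING: add-on " $1 " is " $2; bad = 1 }
       END { exit bad }'
}

# Usage against a live cluster (cluster name taken from this walkthrough):
#   for addon in $(aws eks list-addons --cluster-name prod-cluster \
#       --query 'addons[]' --output text); do
#     printf '%s %s\n' "$addon" "$(aws eks describe-addon \
#       --cluster-name prod-cluster --addon-name "$addon" \
#       --query 'addon.status' --output text)"
#   done | flag_inactive_addons
```

Keeping the parsing separate from the AWS calls makes the check easy to try against captured output before pointing it at production.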

Step 3: Node Health Check

# Node status
kubectl get nodes -o wide

Output:

NAME                            STATUS   ROLES    AGE   VERSION   INTERNAL-IP    OS-IMAGE         KERNEL-VERSION
ip-10-0-1-100.ec2.internal     Ready    <none>   45d   v1.29.0   10.0.1.100     Amazon Linux 2   5.10.199-190.747.amzn2.x86_64
ip-10-0-2-150.ec2.internal     Ready    <none>   45d   v1.29.0   10.0.2.150     Amazon Linux 2   5.10.199-190.747.amzn2.x86_64
ip-10-0-3-200.ec2.internal     Ready    <none>   45d   v1.29.0   10.0.3.200     Amazon Linux 2   5.10.199-190.747.amzn2.x86_64

# Resource utilization
kubectl top nodes

Output:

NAME                            CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-1-100.ec2.internal     450m         22%    2100Mi          54%
ip-10-0-2-150.ec2.internal     380m         19%    1850Mi          47%
ip-10-0-3-200.ec2.internal     520m         26%    2400Mi          62%

# Node conditions that are not False (a healthy node reports only Ready)
kubectl get nodes -o json | jq -c '.items[] | {name:.metadata.name, conditions:[.status.conditions[] | select(.status!="False") | .type]}'

Output:

{"name":"ip-10-0-1-100.ec2.internal","conditions":["Ready"]}
{"name":"ip-10-0-2-150.ec2.internal","conditions":["Ready"]}
{"name":"ip-10-0-3-200.ec2.internal","conditions":["Ready"]}

Result: Nodes OK - 3/3 Ready, CPU < 30%, Memory < 65%.
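The utilization check above can be turned into a threshold test. The sketch below is illustrative only: the helper name `flag_hot_nodes` and the 80% default limits are assumptions, not values from this walkthrough.

```shell
# Hypothetical helper: reads `kubectl top nodes --no-headers` rows on stdin
# and prints any node whose CPU% or MEMORY% exceeds the given limits.
# The 80% defaults are illustrative assumptions.
flag_hot_nodes() {
  cpu_limit="${1:-80}"
  mem_limit="${2:-80}"
  awk -v cpu="$cpu_limit" -v mem="$mem_limit" '
    {
      c = $3; m = $5
      sub(/%/, "", c); sub(/%/, "", m)   # strip the trailing percent sign
      if (c + 0 > cpu || m + 0 > mem)
        printf "WARNING: %s cpu=%s%% mem=%s%%\n", $1, c, m
    }'
}

# Usage against a live cluster (requires metrics-server):
#   kubectl top nodes --no-headers | flag_hot_nodes 80 80
```

With the sample output from this step and the default limits, nothing is printed, which matches the OK result.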

Step 4: Workload Health Check

# Unhealthy pods (filter on the STATUS column: a CrashLoopBackOff pod still reports phase Running)
kubectl get pods -A | grep -Ev 'Running|Completed' | head -20

Output:

NAMESPACE     NAME                      READY   STATUS             RESTARTS   AGE
backend       api-worker-7b9f4-x2k9l    0/1     CrashLoopBackOff   15         2h
monitoring    prometheus-node-exp-abc   0/1     Pending            0          30m

# Deployment health
kubectl get deployments -A -o json | jq -c '.items[] | select(.status.unavailableReplicas > 0) | {name:.metadata.name, ns:.metadata.namespace, unavailable:.status.unavailableReplicas}'

Output:

{"name":"api-worker","ns":"backend","unavailable":1}

# DaemonSet health
kubectl get daemonsets -A -o json | jq -c '.items[] | select(.status.desiredNumberScheduled != .status.numberReady) | {name:.metadata.name, ns:.metadata.namespace, desired:.status.desiredNumberScheduled, ready:.status.numberReady}'

Output:

{"name":"prometheus-node-exporter","ns":"monitoring","desired":3,"ready":2}

Result: Workloads WARNING - 2 unhealthy pods detected.
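Restart counts are another useful workload signal, because a pod can sit in `Running` between crashes. A minimal sketch, assuming input in the default `kubectl get pods -A --no-headers` column order; the helper name `flag_restarting_pods` and the default limit of 5 are hypothetical.

```shell
# Hypothetical helper: reads `kubectl get pods -A --no-headers` rows on stdin
# (NAMESPACE NAME READY STATUS RESTARTS AGE) and prints pods whose restart
# count is above the limit. The default of 5 is an illustrative assumption.
flag_restarting_pods() {
  limit="${1:-5}"
  awk -v limit="$limit" '
    $5 + 0 > limit { printf "%s/%s restarts=%d status=%s\n", $1, $2, $5, $4 }'
}

# Usage against a live cluster:
#   kubectl get pods -A --no-headers | flag_restarting_pods 5
```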

Issue Details

CrashLoopBackOff Pod Analysis:

kubectl logs api-worker-7b9f4-x2k9l -n backend --previous | tail -20

Output:

Error: FATAL: password authentication failed for user "api_user"
Connection to database failed, exiting...

Root Cause: Database credential mismatch in Secret.

Pending Pod Analysis:

kubectl describe pod prometheus-node-exp-abc -n monitoring | grep -A 5 "Events:"

Output:

Events:
  Warning  FailedScheduling  30m  default-scheduler  0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector.

Root Cause: The pod's node affinity/selector does not match the labels on any of the 3 nodes.

Step 5: Network Health Check

# CoreDNS status
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide

Output:

NAME                      READY   STATUS    RESTARTS   AGE   IP           NODE
coredns-5d78c9869d-abc    1/1     Running   0          15d   10.0.1.45    ip-10-0-1-100.ec2.internal
coredns-5d78c9869d-def    1/1     Running   0          15d   10.0.2.78    ip-10-0-2-150.ec2.internal

# VPC CNI status
kubectl get pods -n kube-system -l k8s-app=aws-node -o wide

Output:

NAME             READY   STATUS    RESTARTS   AGE   IP           NODE
aws-node-abc     2/2     Running   0          45d   10.0.1.100   ip-10-0-1-100.ec2.internal
aws-node-def     2/2     Running   0          45d   10.0.2.150   ip-10-0-2-150.ec2.internal
aws-node-ghi     2/2     Running   0          45d   10.0.3.200   ip-10-0-3-200.ec2.internal

# Subnet IP availability
aws ec2 describe-subnets --subnet-ids subnet-abc subnet-def subnet-ghi \
  --query 'Subnets[].{ID:SubnetId,AZ:AvailabilityZone,Available:AvailableIpAddressCount}'

Output:

[
    {"ID": "subnet-abc", "AZ": "us-west-2a", "Available": 245},
    {"ID": "subnet-def", "AZ": "us-west-2b", "Available": 198},
    {"ID": "subnet-ghi", "AZ": "us-west-2c", "Available": 312}
]

Result: Network OK - CoreDNS running, VPC CNI healthy, sufficient IPs.
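"Sufficient IPs" can be made explicit with a per-subnet floor. The sketch below assumes input of one `<subnet-id> <available-count>` pair per line; the helper name `check_subnet_ips` and the default floor of 50 free addresses are assumptions for illustration.

```shell
# Hypothetical helper: reads "<subnet-id> <available-ip-count>" rows on stdin,
# warns about subnets below the floor, and prints the total at the end.
# The default floor of 50 is an illustrative assumption.
check_subnet_ips() {
  min_free="${1:-50}"
  awk -v min="$min_free" '
    { total += $2 }
    $2 + 0 < min { printf "WARNING: %s has only %d free IPs\n", $1, $2 }
    END { printf "Total available IPs: %d\n", total }'
}

# Usage against a live account (subnet IDs taken from this walkthrough):
#   aws ec2 describe-subnets --subnet-ids subnet-abc subnet-def subnet-ghi \
#     --query 'Subnets[].[SubnetId,AvailableIpAddressCount]' --output text \
#     | check_subnet_ips 50
```

For the three subnets shown above, the total is 245 + 198 + 312 = 755 available addresses.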

Step 6: Storage Health Check

# PVC status (PVCs do not support a status.phase field selector, so filter on the STATUS column)
kubectl get pvc -A | awk 'NR==1 || $3 != "Bound"'

Output:

NAMESPACE   NAME              STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS
analytics   data-volume-pvc   Pending                                      gp3

# CSI driver status
kubectl get pods -n kube-system -l app=ebs-csi-controller

Output:

NAME                                  READY   STATUS    RESTARTS   AGE
ebs-csi-controller-5f4b7c8d9-abc     6/6     Running   0          30d
ebs-csi-controller-5f4b7c8d9-def     6/6     Running   0          30d

# Check pending PVC events
kubectl describe pvc data-volume-pvc -n analytics | grep -A 5 "Events:"

Output:

Events:
  Warning  ProvisioningFailed  5m  ebs.csi.aws.com  failed to provision volume: could not create volume in EC2: UnauthorizedOperation

Result: Storage WARNING - 1 PVC pending due to IAM permissions.

Step 7: Security Health Check

# Privileged containers
kubectl get pods -A -o json | jq '[.items[] | select(any(.spec.containers[]; .securityContext.privileged == true)) | {name:.metadata.name, ns:.metadata.namespace}]'

Output:

[
  {"name":"aws-node-abc","ns":"kube-system"},
  {"name":"aws-node-def","ns":"kube-system"},
  {"name":"aws-node-ghi","ns":"kube-system"}
]

# Network policies
kubectl get networkpolicies -A

Output:

NAMESPACE   NAME              POD-SELECTOR   AGE
backend     backend-policy    app=api        60d
frontend    frontend-policy   app=web        60d

# Namespaces without network policies (kube-* system namespaces are skipped)
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  case "$ns" in kube-*) continue ;; esac
  policies=$(kubectl get networkpolicies -n "$ns" --no-headers 2>/dev/null | wc -l)
  if [ "$policies" -eq 0 ]; then
    echo "WARNING: $ns has no network policies"
  fi
done

Output:

WARNING: analytics has no network policies
WARNING: monitoring has no network policies
WARNING: default has no network policies

Result: Security WARNING - 3 namespaces lack network policies.
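A common remedy for namespaces without policies is a default-deny ingress policy. The sketch below prints such a manifest for one namespace; the helper name `default_deny_manifest` and the policy name `default-deny-ingress` are illustrative choices, not part of the original check.

```shell
# Hypothetical helper: prints a default-deny ingress NetworkPolicy manifest
# for the namespace given as the first argument.
# The policy name "default-deny-ingress" is an illustrative choice.
default_deny_manifest() {
  ns="$1"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: ${ns}
spec:
  podSelector: {}
  policyTypes:
    - Ingress
EOF
}

# Usage: review the output first, then apply it per namespace, for example:
#   default_deny_manifest analytics | kubectl apply -f -
```

A default-deny policy blocks all inbound pod traffic in the namespace until allow rules are added, so it is usually rolled out together with the allow policies each workload needs. NetworkPolicy objects also only take effect when the cluster's network plugin enforces them.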

Final Health Report

# Infrastructure Health Report

## Summary
- **Overall**: WARNING
- **Checked**: 2026-03-22 14:30:00 UTC
- **Cluster**: prod-cluster (v1.29)

## Results

| Domain | Status | Details |
|--------|--------|---------|
| Cluster | OK | API server responsive (89ms), EKS v1.29, 4 add-ons installed |
| Nodes | OK | 3/3 Ready, CPU max 26%, Memory max 62%, all conditions healthy |
| Workloads | WARNING | 2 unhealthy pods: 1 CrashLoopBackOff, 1 Pending |
| Network | OK | CoreDNS 2/2, VPC CNI 3/3, IPs available (755 total) |
| Storage | WARNING | 1 PVC pending (IAM permission issue) |
| Security | WARNING | 3 namespaces without network policies |

## Issues Found

### Issue 1: CrashLoopBackOff Pod (P2 - High)
- **Pod**: backend/api-worker-7b9f4-x2k9l
- **Cause**: Database authentication failure - credential mismatch
- **Fix**: Update Secret `backend/db-credentials` with correct password

### Issue 2: Pending Pod (P3 - Medium)
- **Pod**: monitoring/prometheus-node-exp-abc
- **Cause**: Node affinity mismatch
- **Fix**: Update DaemonSet node selector or add labels to nodes

### Issue 3: PVC Provisioning Failed (P3 - Medium)
- **PVC**: analytics/data-volume-pvc
- **Cause**: EBS CSI driver IAM permission issue
- **Fix**: Add `ec2:CreateVolume` permission to EBS CSI IRSA role

### Issue 4: Missing Network Policies (P4 - Low)
- **Namespaces**: analytics, monitoring, default
- **Risk**: No network segmentation in these namespaces
- **Fix**: Deploy default-deny NetworkPolicy to affected namespaces

## Recommendations

1. **Immediate** (P2): Fix database credentials for api-worker deployment
2. **Today** (P3): Update node affinity for prometheus-node-exporter
3. **Today** (P3): Fix IAM permissions for EBS CSI driver
4. **This Week** (P4): Deploy network policies to all namespaces
5. **Maintenance**: Consider enabling Pod Security Standards (Restricted)

Key Points

Systematic Approach

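Health checks should follow a consistent order: Cluster → Nodes → Workloads → Network → Storage → Security. Checking in this order means no domain is skipped, and a finding in a lower layer (for example, an unhealthy node) can explain symptoms in a higher one (for example, Pending pods).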
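One way to enforce the fixed order is a small runner that walks the domains in sequence. This is a sketch under stated assumptions: the `check_<domain>` function names and the `SKIPPED` status are hypothetical, and each check function is expected to print a single status word.

```shell
# Domains in the order used by this walkthrough.
DOMAINS="cluster nodes workloads network storage security"

# Hypothetical runner: calls check_<domain> for each domain, in order, and
# prints "<domain> <status>". Domains without a check function are reported
# as SKIPPED (an assumption of this sketch) rather than silently ignored.
run_health_check() {
  for domain in $DOMAINS; do
    if command -v "check_$domain" >/dev/null 2>&1; then
      status=$("check_$domain")
    else
      status="SKIPPED"
    fi
    printf '%-10s %s\n' "$domain" "$status"
  done
}

# Usage: define functions such as check_cluster that print OK or WARNING,
# then call run_health_check.
```

Reporting a missing check explicitly keeps a gap in coverage visible in the output.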

Priority Assessment

Not all warnings are equal. A CrashLoopBackOff pod affects service availability (P2), while missing network policies are a hardening gap with no immediate outage (P4). Prioritize fixes by impact.
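The priorities used in the report can be captured in a small lookup so that findings sort consistently. In the sketch below, the finding-type keys (`crashloop`, `pending-pod`, and so on) and both function names are hypothetical labels; the P2 to P4 values mirror the report above.

```shell
# Hypothetical lookup: maps a finding type to the priority used in the report.
priority_for() {
  case "$1" in
    crashloop)        echo "P2" ;;  # service availability affected
    pending-pod)      echo "P3" ;;
    pvc-provisioning) echo "P3" ;;
    no-netpol)        echo "P4" ;;  # hardening gap, no immediate outage
    *)                echo "UNRATED" ;;
  esac
}

# Prints "<priority> <finding>" for each argument, highest priority first.
rank_findings() {
  for f in "$@"; do
    printf '%s %s\n' "$(priority_for "$f")" "$f"
  done | sort
}

# Usage:
#   rank_findings no-netpol pvc-provisioning crashloop pending-pod
```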

Automation

Consider running health checks on a schedule (daily/weekly) and alerting on status changes. This enables proactive issue detection before user impact.
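For scheduled runs, the per-domain results need to collapse into a single status and exit code that a scheduler can act on. A minimal sketch, assuming each domain reports either `OK` or `WARNING` as in this walkthrough; the function name `overall_status` and the paths in the crontab comment are placeholders.

```shell
# Prints WARNING if any argument is not OK, otherwise prints OK.
# Returns non-zero on WARNING so a scheduler or CI job can alert on the exit code.
overall_status() {
  for s in "$@"; do
    if [ "$s" != "OK" ]; then
      echo "WARNING"
      return 1
    fi
  done
  echo "OK"
}

# Hypothetical crontab entry (script path and alert command are placeholders):
#   0 6 * * * /opt/scripts/health-check.sh || /opt/scripts/notify-oncall.sh
```

Fed the six domain results from this walkthrough (OK, OK, WARNING, OK, WARNING, WARNING), the function prints `WARNING`, matching the report's overall status.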
