Ops Health Check

Comprehensive AWS/EKS infrastructure health check skill.

Description

Checks overall infrastructure health including cluster, nodes, workloads, network, storage, and security.

Trigger Keywords

"health check"
"cluster health"
"infrastructure check"

Health Check Domains

1. Cluster Health

kubectl cluster-info
aws eks describe-cluster --name $CLUSTER_NAME --query 'cluster.{status:status,version:version,platformVersion:platformVersion}'
kubectl get --raw '/readyz?verbose'

2. Node Health

kubectl get nodes -o wide
kubectl top nodes
kubectl get nodes -o json | jq '.items[] | {name:.metadata.name, ready:[.status.conditions[] | select(.type=="Ready") | .status][0], cpu:.status.allocatable.cpu, memory:.status.allocatable.memory}'

3. Workload Health

kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded | head -20
kubectl get deployments -A -o json | jq '.items[] | select(.status.unavailableReplicas > 0) | {name:.metadata.name, ns:.metadata.namespace, unavailable:.status.unavailableReplicas}'
kubectl get daemonsets -A -o json | jq '.items[] | select(.status.desiredNumberScheduled != .status.numberReady) | {name:.metadata.name, ns:.metadata.namespace, desired:.status.desiredNumberScheduled, ready:.status.numberReady}'

4. Network Health

kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
kubectl get pods -n kube-system -l k8s-app=aws-node -o wide
kubectl get svc -A --field-selector spec.type=LoadBalancer

5. Storage Health

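kubectl get pvc -A -o json | jq '.items[] | select(.status.phase != "Bound") | {name:.metadata.name, ns:.metadata.namespace, phase:.status.phase}'
kubectl get pv -o json | jq '.items[] | select(.status.phase != "Bound" and .status.phase != "Released") | {name:.metadata.name, phase:.status.phase}'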
kubectl get csidrivers

6. Security Health

kubectl get pods -A -o json | jq '[.items[] | select(any(.spec.containers[]; .securityContext.privileged==true)) | {name:.metadata.name, ns:.metadata.namespace}]'
kubectl get networkpolicies -A
kubectl get namespaces -L pod-security.kubernetes.io/enforce

Threshold Tables

Cluster Level Thresholds

Metric	Warning	Critical	Action
`cluster_cpu_utilization`	> 70%	> 85%	Scale out nodes
`cluster_memory_utilization`	> 75%	> 90%	Scale out nodes
`cluster_failed_node_count`	> 0	>= 2	Investigate node issues
EKS version age	> 6 months	> 12 months	Plan upgrade
Add-on version lag	> 1 minor	> 2 minor	Update add-ons
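
The percentage rows in these tables all follow the same pattern: a value above the Warning threshold is WARNING, and a value above the Critical threshold is CRITICAL. A minimal sketch of that mapping follows. The function name `classify_high` is hypothetical, and it assumes integer percentages and strict "greater than" comparisons, as the tables are written.

```shell
#!/usr/bin/env bash
# classify_high VALUE WARN CRIT   (hypothetical helper, not part of the skill)
# Prints OK, WARNING, or CRITICAL for metrics where a higher value is worse.
# Inputs are integers; thresholds are strict, so "> 70%" starts at 71.
classify_high() {
  local value=$1 warn=$2 crit=$3
  if (( value > crit )); then
    echo CRITICAL
  elif (( value > warn )); then
    echo WARNING
  else
    echo OK
  fi
}
```

For example, `classify_high 72 70 85` applies the `cluster_cpu_utilization` thresholds to a reading of 72%. Metrics where a lower value is worse (subnet IP availability, burst balance) need the comparisons inverted.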

Node Level Thresholds

Metric	Warning	Critical	Action
`node_cpu_utilization`	> 80%	> 95%	Scale or optimize
`node_memory_utilization`	> 80%	> 95%	Scale or optimize
`node_filesystem_utilization`	> 80%	> 90%	Expand disk, clean images
Node age	> 30 days	> 90 days	Rotate nodes
kubelet restarts	> 2/day	> 5/day	Investigate kubelet

Pod Level Thresholds

Metric	Warning	Critical	Action
`pod_cpu_utilization`	> 80% of limit	> 95% of limit	Increase limit or optimize
`pod_memory_utilization`	> 85% of limit	> 95% of limit	Increase limit or optimize
Restart count	> 3/hour	> 10/hour	Investigate crashes
Container uptime since last restart	< 1 hour	< 10 min	CrashLoop investigation

Network Level Thresholds

Metric	Warning	Critical	Action
Subnet IP availability	< 30%	< 10%	Add CIDR, prefix delegation
CoreDNS CPU	> 70%	> 90%	Scale CoreDNS
DNS latency p99	> 100ms	> 500ms	Optimize CoreDNS, ndots
LB target unhealthy	> 0	> 50%	Investigate targets

Storage Level Thresholds

Metric	Warning	Critical	Action
PVC utilization	> 80%	> 90%	Expand volume
Unbound PVCs	> 0	> 3	Fix binding
EBS burst balance	< 30%	< 10%	Upgrade to gp3 or io2
EFS burst credits	< 30%	< 10%	Switch to provisioned throughput

Security Level Thresholds

Check	Warning	Critical	Action
Privileged containers	> 0 (non-system)	Any in workloads	Remove privileged flag
Containers as root	> 0 (non-system)	Any in workloads	Set runAsNonRoot
No network policies	Some namespaces	All namespaces	Create network policies
Secrets not rotated	> 90 days	> 180 days	Rotate secrets

Domain-specific Detailed Procedures

1. Cluster Health Check Procedures

EKS Control Plane

# API server responsiveness
time kubectl get --raw /healthz
time kubectl get --raw /readyz

# Cluster version and status
aws eks describe-cluster --name $CLUSTER_NAME --query 'cluster.{status:status,version:version,platformVersion:platformVersion,endpoint:endpoint}'

# Control plane logging status
aws eks describe-cluster --name $CLUSTER_NAME --query 'cluster.logging.clusterLogging'

# Add-on status
aws eks list-addons --cluster-name $CLUSTER_NAME --output table
for addon in $(aws eks list-addons --cluster-name $CLUSTER_NAME --query 'addons[]' --output text); do
  echo "--- $addon ---"
  aws eks describe-addon --cluster-name $CLUSTER_NAME --addon-name $addon --query 'addon.{version:addonVersion,status:status,health:health.issues}'
done

Interpretation:

API response time > 500ms: Control plane may be overloaded
Add-on status DEGRADED: Check add-on pod logs
Logging not enabled: Enable audit + api + authenticator at minimum
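
To compare API response time against the 500ms guideline, the wall-clock time of a command can be captured in milliseconds. The sketch below assumes GNU `date` (for `%N` nanoseconds); the helper name `elapsed_ms` is hypothetical. The figure includes `kubectl` start-up and the network round trip, so it overstates server-side latency.

```shell
#!/usr/bin/env bash
# elapsed_ms COMMAND [ARGS...]   (hypothetical helper)
# Runs the command, discards its output, and prints the elapsed
# wall-clock time in whole milliseconds. Assumes GNU date (%N).
# The command's exit status is ignored; only the timing is reported.
elapsed_ms() {
  local start end
  start=$(date +%s%N)
  "$@" >/dev/null 2>&1
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}
```

A typical call would be `elapsed_ms kubectl get --raw /healthz`. Taking several samples and looking at the slowest is more informative than a single reading.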

2. Node Health Check Procedures

# Node conditions check
kubectl get nodes -o json | jq '.items[] | {
  name: .metadata.name,
  ready: [.status.conditions[] | select(.type=="Ready") | .status][0],
  memory_pressure: [.status.conditions[] | select(.type=="MemoryPressure") | .status][0],
  disk_pressure: [.status.conditions[] | select(.type=="DiskPressure") | .status][0],
  pid_pressure: [.status.conditions[] | select(.type=="PIDPressure") | .status][0]
}'

# Node resource utilization
kubectl top nodes --sort-by=cpu
kubectl top nodes --sort-by=memory

# Node ages (find stale nodes)
kubectl get nodes -o json | jq '.items[] | {name:.metadata.name, age:.metadata.creationTimestamp, instance_type:.metadata.labels["node.kubernetes.io/instance-type"]}'

Interpretation:

Ready=False or Unknown: Node is not accepting new pods, investigate kubelet
MemoryPressure=True: Node evicting pods, add capacity
DiskPressure=True: Clean container images, expand disk
Node age > 90 days: Plan node rotation for security patches
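
The node age query above prints creation timestamps, not ages. A small helper can turn a timestamp into whole days for comparison with the 30-day and 90-day thresholds. This is a sketch assuming GNU `date` (for `-d`); the name `age_days` and the optional second argument are hypothetical.

```shell
#!/usr/bin/env bash
# age_days ISO_TIMESTAMP [NOW_EPOCH]   (hypothetical helper)
# Prints the age in whole days of an ISO 8601 UTC timestamp such as
# .metadata.creationTimestamp. NOW_EPOCH defaults to the current time
# and exists mainly to make the function testable. Assumes GNU date.
age_days() {
  local created=$1 now=${2:-$(date -u +%s)}
  local created_epoch
  created_epoch=$(date -u -d "$created" +%s) || return 1
  echo $(( (now - created_epoch) / 86400 ))
}
```

For example, `age_days 2024-01-01T00:00:00Z` prints the number of days elapsed since that date.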

3. Workload Health Check Procedures

# Deployment health
kubectl get deployments -A -o json | jq '.items[] | select(.status.replicas != .status.readyReplicas) | {name:.metadata.name, ns:.metadata.namespace, replicas:.status.replicas, ready:.status.readyReplicas}'

# StatefulSet health
kubectl get statefulsets -A -o json | jq '.items[] | select(.status.replicas != .status.readyReplicas) | {name:.metadata.name, ns:.metadata.namespace, replicas:.status.replicas, ready:.status.readyReplicas}'

# DaemonSet health
kubectl get daemonsets -A -o json | jq '.items[] | select(.status.desiredNumberScheduled != .status.numberReady) | {name:.metadata.name, ns:.metadata.namespace, desired:.status.desiredNumberScheduled, ready:.status.numberReady}'

# Jobs stuck for more than 1 hour
kubectl get jobs -A -o json | jq '.items[] | select(.status.active > 0) | select(now - (.status.startTime | fromdateiso8601) > 3600) | {name:.metadata.name, ns:.metadata.namespace, active:.status.active}'

Interpretation:

Deployment replicas mismatch: Check pod scheduling or crash issues
DaemonSet not fully scheduled: Check node taints, resource availability
Stuck jobs: Check pod logs, may need manual cleanup

4. Network Health Check Procedures

# CoreDNS health
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=10

# VPC CNI health
kubectl get pods -n kube-system -l k8s-app=aws-node -o wide
kubectl logs -n kube-system -l k8s-app=aws-node -c aws-node --tail=10

# Subnet IP availability
for subnet in $(aws ec2 describe-subnets --filters Name=tag:kubernetes.io/cluster/$CLUSTER_NAME,Values=owned,shared --query 'Subnets[].SubnetId' --output text 2>/dev/null); do
  aws ec2 describe-subnets --subnet-ids $subnet --query 'Subnets[].{ID:SubnetId,AZ:AvailabilityZone,Available:AvailableIpAddressCount,CIDR:CidrBlock}'
done

Interpretation:

CoreDNS pods not Running: DNS resolution will fail cluster-wide
aws-node errors: IP allocation will fail, pods stuck in Pending
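Subnet available IPs < 10% of the subnet's usable addresses (AvailableIpAddressCount is an absolute count, so compare it against the CIDR size): Enable prefix delegation or add subnets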
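
Because `describe-subnets` reports an absolute count, the percentage has to be derived from the CIDR prefix length. The sketch below does that for IPv4 subnets, using the fact that AWS reserves five addresses in every subnet. The function name `subnet_free_pct` is hypothetical.

```shell
#!/usr/bin/env bash
# subnet_free_pct CIDR AVAILABLE   (hypothetical helper, IPv4 only)
# Prints the percentage (rounded down) of usable addresses still free.
# Usable addresses = 2^(32 - prefix) - 5, since AWS reserves the first
# four addresses and the last address of each subnet.
subnet_free_pct() {
  local cidr=$1 available=$2
  local prefix=${cidr#*/}
  local usable=$(( (1 << (32 - prefix)) - 5 ))
  echo $(( available * 100 / usable ))
}
```

For example, `subnet_free_pct 10.0.0.0/24 25` evaluates a /24 subnet with 25 free addresses; the result can then be checked against the < 30% and < 10% thresholds.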

5. Storage Health Check Procedures

# PVC status
kubectl get pvc -A -o json | jq '.items[] | {ns:.metadata.namespace, name:.metadata.name, status:.status.phase, size:.spec.resources.requests.storage, storageClass:.spec.storageClassName}'

# CSI driver health
kubectl get pods -n kube-system -l app=ebs-csi-controller -o wide
kubectl get pods -n kube-system -l app=efs-csi-controller -o wide

# Unused PVs
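kubectl get pv -o json | jq '.items[] | select(.status.phase == "Released") | {name:.metadata.name, claim:.spec.claimRef.name, reclaimPolicy:.spec.persistentVolumeReclaimPolicy}'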

Interpretation:

PVC Pending: Check StorageClass exists and CSI driver running
CSI controller not Running: Volume provisioning will fail
Released PVs: Consider cleanup or reclaim policy change

6. Security Health Check Procedures

# Privileged containers
kubectl get pods -A -o json | jq '[.items[] | select(any(.spec.containers[]; .securityContext.privileged==true)) | {name:.metadata.name, ns:.metadata.namespace}]'

# Pods explicitly set to run as UID 0 (pods with no runAsUser may still run as root via the image default)
kubectl get pods -A -o json | jq '[.items[] | select(.spec.securityContext.runAsUser==0 or any(.spec.containers[]; .securityContext.runAsUser==0)) | {name:.metadata.name, ns:.metadata.namespace}]'

# Network policies coverage
kubectl get networkpolicies -A

# Secrets count (for encryption audit)
kubectl get secrets -A -o json | jq '.items | length'

Interpretation:

Privileged containers in non-system namespaces: Security risk, review necessity
Root containers in workloads: Implement Pod Security Standards
No network policies: Implement default-deny policies

Usage Example

Please check the health of the entire cluster.

The Health Check skill then runs automatically:

Check all domains sequentially
Evaluate status per domain (OK/WARNING/CRITICAL)
Provide recommended actions when issues are found
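
The per-domain statuses can be rolled up into a single overall status. The source does not state the roll-up rule, so the sketch below assumes "worst status wins": any CRITICAL domain makes the report CRITICAL, otherwise any WARNING makes it WARNING, otherwise it is HEALTHY. The function name `overall_status` is hypothetical.

```shell
#!/usr/bin/env bash
# overall_status STATUS...   (hypothetical helper; worst-of rule is an assumption)
# Takes per-domain statuses (OK, WARNING, CRITICAL) and prints the
# overall status: HEALTHY, WARNING, or CRITICAL.
overall_status() {
  local overall=HEALTHY status
  for status in "$@"; do
    case $status in
      CRITICAL) overall=CRITICAL ;;
      WARNING)  [[ $overall == CRITICAL ]] || overall=WARNING ;;
    esac
  done
  echo "$overall"
}
```

For example, `overall_status OK WARNING OK OK OK OK` summarizes six domains where only one raised a warning.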

Output Format

# Infrastructure Health Report

## Summary
- Overall: HEALTHY / WARNING / CRITICAL
- Checked: [timestamp]
- Cluster: [name] (v[version])

## Results

| Domain | Status | Details |
|--------|--------|---------|
| Cluster | OK/WARN/CRIT | [summary] |
| Nodes (N/N ready) | OK/WARN/CRIT | [summary] |
| Workloads | OK/WARN/CRIT | [N unhealthy pods] |
| Network | OK/WARN/CRIT | [summary] |
| Storage | OK/WARN/CRIT | [N unbound PVCs] |
| Security | OK/WARN/CRIT | [summary] |

## Recommendations
1. [Action item]
2. [Action item]

Team Mode

"health check" requests may trigger team-based parallel checks:

Trigger	Team Name	Composition
Full check request	`ops-health-check`	eks + network + iam + storage + cloudwatch parallel
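
Team Mode is orchestrated by the skill itself, and the source does not describe how. Purely as an illustration of the fan-out and fan-in pattern that parallel checks imply, the sketch below runs several check functions as background jobs and collects their results. The name `run_parallel_checks` and the check function names passed to it are hypothetical.

```shell
#!/usr/bin/env bash
# run_parallel_checks FUNC...   (hypothetical helper)
# Starts each named shell function as a background job, waits for all
# of them, then prints "<name>: OK" or "<name>: FAILED" followed by
# whatever that check wrote. Returns 1 if any check failed.
run_parallel_checks() {
  local outdir name i failed=0
  local -a names=("$@") pids=()
  outdir=$(mktemp -d) || return 2
  for name in "${names[@]}"; do
    "$name" >"$outdir/$name.log" 2>&1 &
    pids+=("$!")
  done
  for i in "${!pids[@]}"; do
    if wait "${pids[$i]}"; then
      echo "${names[$i]}: OK"
    else
      echo "${names[$i]}: FAILED"
      failed=1
    fi
    cat "$outdir/${names[$i]}.log"
  done
  rm -rf "$outdir"
  return "$failed"
}
```

Each check would be a function wrapping the commands of one domain, for example a `check_network` function that runs the CoreDNS and VPC CNI queries and returns non-zero when something is unhealthy.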

Reference Files

references/health-check-procedures.md - Detailed procedures per domain
references/metrics-thresholds.md - Warning/Critical thresholds
