AIOps on AWS

Intelligent Cloud Operations for AnyCompany

Junseok Oh

Sr. Solutions Architect

AWS

Today's Agenda

90-minute session

1

AIOps Foundations & AWS Observability 30 min

2

ML-Powered Operations & Anomaly Detection 30 min

☕

Break 5 min

3

Implementation Strategies & Best Practices 25 min

Level 300: hands-on practice over concepts | Live Q&A welcome | Break after each block

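What Is AIOps?

Definition

AI for IT Operations: an approach that uses ML and big-data analytics to automate and improve IT operations

Core Elements

Observe: gain visibility across the full stack
Detect: automatically detect anomalies
Diagnose: root cause analysis (RCA)
Respond: automated response and recovery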

Gartner Definition (2024)

"AIOps platforms combine big data and ML to automate IT operations processes, including event correlation, anomaly detection, and causality determination."

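Why AIOps Now?

Microservices → complexity grows exponentially
Millions of events per minute → more than humans can process
Pressure to reduce MTTR → manual analysis hits its limits
Advances in Gen AI → natural-language-driven operations become possible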

Observability vs Monitoring

Monitoring

Known-unknowns: alarms based on predefined thresholds
Dashboard-centric, after-the-fact analysis
"Alarm if CPU > 80%" → detects symptoms only

Observability

Unknown-unknowns: the ability to discover problems nobody anticipated
Three signals (Three Pillars): Metrics, Logs, Traces
Lets you trace back "why did this API get slow?"

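AIOps

ML-based anomaly detection: learns seasonal and time-of-day patterns, then flags deviations
Event correlation: groups thousands of alarms into a handful of incidents
Prediction: early warning of resource exhaustion and failure precursors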
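
The event-correlation idea above (many alarms collapsing into a few incidents) can be sketched in a few lines. This is a deliberately simple heuristic, assuming alarms are grouped by service and by time proximity; the `Alarm` and `Incident` types, the `correlate` function, and the 5-minute window are all hypothetical and are not how any particular AWS service implements correlation.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    timestamp: float  # seconds since epoch
    service: str
    name: str


@dataclass
class Incident:
    service: str
    alarms: list = field(default_factory=list)


def correlate(alarms, window_seconds=300.0):
    """Group alarms of the same service that fire within window_seconds
    of the previous alarm in that group (illustrative heuristic only)."""
    incidents = []
    latest_by_service = {}  # service -> most recent Incident for it
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        current = latest_by_service.get(alarm.service)
        if current and alarm.timestamp - current.alarms[-1].timestamp <= window_seconds:
            current.alarms.append(alarm)
        else:
            current = Incident(service=alarm.service, alarms=[alarm])
            incidents.append(current)
            latest_by_service[alarm.service] = current
    return incidents


alarms = [
    Alarm(0, "checkout", "HighLatency"),
    Alarm(60, "checkout", "5xxErrors"),
    Alarm(90, "search", "HighCPU"),
    Alarm(120, "checkout", "QueueDepth"),
    Alarm(1000, "checkout", "HighLatency"),
]
incidents = correlate(alarms)
for incident in incidents:
    print(incident.service, [a.name for a in incident.alarms])
```

Five alarms become three incidents here. Real correlation engines also use topology and causal signals, which this sketch leaves out.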

AWS Observability Stack: Overall Architecture

CloudWatch: The Data Hub for AIOps

Metrics

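High-resolution metrics: can be collected at 1-second intervals
Embedded Metric Format (EMF): metrics extracted automatically from structured logs
Metric Math: real-time calculations (error rate = Errors / Invocations × 100)
Contributor Insights: real-time Top-N contributor analysis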
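
To make EMF concrete, here is a minimal sketch of one EMF log record built with the standard library only. The `emf_record` helper and the `Latency`/`Service` names are hypothetical; the `_aws` / `CloudWatchMetrics` layout follows the published EMF structure as best I recall it, so check it against the specification before relying on it.

```python
import json
import time


def emf_record(namespace, metric_name, value, unit, dimensions):
    """Build one Embedded Metric Format record as a JSON string.

    dimensions is a dict such as {"Service": "checkout"}; its keys become
    the dimension set and its values are written as top-level fields.
    """
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [list(dimensions.keys())],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        metric_name: value,  # the metric value lives at the top level
    }
    record.update(dimensions)
    return json.dumps(record)


# Hypothetical metric for illustration; namespace taken from the ADOT slide.
line = emf_record("AnyCompany/App", "Latency", 123.0, "Milliseconds",
                  {"Service": "checkout"})
print(line)
```

Writing such a line to a CloudWatch Logs stream (for example from Lambda stdout) is what lets CloudWatch extract the metric without a separate PutMetricData call.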

Logs Insights

fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as errorCount by bin(5m)
| sort errorCount desc
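
For readers new to Logs Insights, the sketch below reproduces what the query above computes, using plain Python on in-memory events. It is only an illustration of the semantics (filter, 5-minute binning, count, sort); the `error_counts_by_bin` function and the sample events are made up.

```python
from collections import Counter

BIN_SECONDS = 5 * 60


def error_counts_by_bin(events):
    """events: iterable of (epoch_seconds, message) pairs."""
    counts = Counter()
    for timestamp, message in events:
        if "ERROR" in message:  # filter @message like /ERROR/
            bin_start = int(timestamp) // BIN_SECONDS * BIN_SECONDS  # bin(5m)
            counts[bin_start] += 1  # stats count(*) as errorCount by bin(5m)
    return counts.most_common()  # sort errorCount desc


events = [
    (0, "ERROR timeout"),
    (10, "INFO ok"),
    (299, "ERROR timeout"),
    (300, "ERROR db"),
    (610, "ERROR db"),
    (700, "ERROR db"),
    (899, "ERROR db"),
]
result = error_counts_by_bin(events)
print(result)
```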

Anomaly Detection

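Learns metric patterns from up to 2 weeks of data (time of day, day of week, seasonality)
Band model: expresses the normal range as a band
Alarm when the metric leaves the band → dynamic detection without static thresholds
ANOMALY_DETECTION_BAND(metric, stddev) function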
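
As a sketch of how the band function is wired into an alarm, the helper below builds the parameter dict for a CloudWatch anomaly-detection alarm. The `anomaly_alarm_params` helper, the alarm name, and the metric are hypothetical; the field names follow the PutMetricAlarm API as I understand it, and nothing here calls AWS.

```python
def anomaly_alarm_params(alarm_name, namespace, metric_name, stat="Average",
                         period=300, band_width=2, evaluation_periods=3):
    """Return kwargs for an anomaly-detection alarm (no AWS call is made)."""
    return {
        "AlarmName": alarm_name,
        # Fire when the metric leaves the band in either direction.
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": evaluation_periods,
        # The band expression replaces a static Threshold value.
        "ThresholdMetricId": "ad1",
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {"Namespace": namespace, "MetricName": metric_name},
                    "Period": period,
                    "Stat": stat,
                },
            },
            {
                "Id": "ad1",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": f"{metric_name} (expected)",
                "ReturnData": True,
            },
        ],
    }


# Hypothetical alarm on a hypothetical custom metric.
params = anomaly_alarm_params("checkout-latency-anomaly", "AnyCompany/App", "Latency")
print(params["Metrics"][1]["Expression"])
```

With boto3 installed and credentials configured, the dict could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`. A larger `band_width` widens the band and makes the alarm less sensitive.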

Cross-Account Observability

Observe every account from a single Monitoring Account
Unified logs, metrics, and traces across the entire Organization
Essential for AnyCompany's multi-account environment
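
One way to picture the setup: the monitoring account owns a "sink", and a sink policy decides which source accounts may link to it. The sketch below builds such a policy for a whole Organization. The `sink_policy` helper and the `o-exampleorgid` value are placeholders, and the policy shape is my best recollection of the documented Observability Access Manager format, so verify it against current AWS documentation.

```python
import json

DEFAULT_RESOURCE_TYPES = (
    "AWS::CloudWatch::Metric",
    "AWS::Logs::LogGroup",
    "AWS::XRay::Trace",
)


def sink_policy(org_id, resource_types=DEFAULT_RESOURCE_TYPES):
    """Policy letting any account in the Organization link to the sink."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": "*",
                "Action": ["oam:CreateLink", "oam:UpdateLink"],
                "Resource": "*",
                "Condition": {
                    # Restrict callers to the given AWS Organization.
                    "StringEquals": {"aws:PrincipalOrgID": org_id},
                    # Restrict which telemetry types may be shared.
                    "ForAllValues:StringEquals": {
                        "oam:ResourceTypes": list(resource_types)
                    },
                },
            }
        ],
    }


policy = sink_policy("o-exampleorgid")  # placeholder Organization ID
print(json.dumps(policy, indent=2))
```

In practice the JSON string would be attached to the sink in the monitoring account, and each source account would then create a link to that sink.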

ADOT: Unified Telemetry Collection

AWS Distro for OpenTelemetry

Vendor-neutral: built on the CNCF standard
A single agent collects Metrics + Traces + Logs
Deployed on EKS as a DaemonSet or a Sidecar
Automatic Kubernetes metadata tagging (pod, namespace, node)

Collection Pipeline

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  prometheus:
    config:
      scrape_configs:
        - job_name: 'k8s-pods'
          kubernetes_sd_configs:
            - role: pod

Exporters Configuration

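extensions:
  # Declares the SigV4 authenticator that prometheusremotewrite references below
  sigv4auth:
    region: ap-northeast-2
    service: aps

exporters:
  awsxray:
    region: ap-northeast-2
  awsemf:
    namespace: AnyCompany/App
    region: ap-northeast-2
  prometheusremotewrite:
    endpoint: https://aps-workspaces...
    auth:
      authenticator: sigv4auth

service:
  # Extensions must be enabled here as well as declared above
  extensions: [sigv4auth]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [awsxray]
    metrics:
      receivers: [otlp, prometheus]
      exporters: [awsemf, prometheusremotewrite]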

Key Benefits

Sends to X-Ray + CloudWatch + AMP at the same time
Minimal application code changes
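
A common collector misconfiguration is a pipeline that names a receiver or exporter that was never declared. The sketch below checks that wiring on a dict mirroring the slide's config. The `undeclared_components` helper is hypothetical and is not part of ADOT; the collector performs its own validation at startup.

```python
def undeclared_components(config):
    """List pipeline entries that reference undeclared receivers/exporters."""
    problems = []
    for pipeline, spec in config["service"]["pipelines"].items():
        for kind in ("receivers", "exporters"):
            declared = config.get(kind, {})
            for name in spec.get(kind, []):
                if name not in declared:
                    problems.append(
                        f"{pipeline}: {kind[:-1]} '{name}' is not declared"
                    )
    return problems


# Dict form of the slide's configuration (component settings omitted).
config = {
    "receivers": {"otlp": {}, "prometheus": {}},
    "exporters": {"awsxray": {}, "awsemf": {}, "prometheusremotewrite": {}},
    "service": {
        "pipelines": {
            "traces": {"receivers": ["otlp"], "exporters": ["awsxray"]},
            "metrics": {
                "receivers": ["otlp", "prometheus"],
                "exporters": ["awsemf", "prometheusremotewrite"],
            },
        }
    },
}
print(undeclared_components(config))
```

An empty list means every pipeline entry resolves. The metrics pipeline fanning out to both `awsemf` and `prometheusremotewrite` is what delivers the same data to CloudWatch and AMP at once.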

AMP & AMG: Visualization and Analysis

AMP Core Features

Fully managed: PromQL-compatible, with automatic scaling and patching
150-day retention: enables long-term trend analysis
Multi-AZ replication within a Region; cross-Region DR means remote-writing to a workspace in a second Region
IAM authentication: stronger security than Prometheus-native authentication

Cost Considerations

Item	Price
Sample ingestion	$0.90/10M samples (first 2 billion samples)
Query processing	$0.10/1 billion samples
Storage	$0.03/GB/month

Tip: cardinality management is the key to cost. Once label combinations exceed 1 million, costs climb sharply
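
A quick back-of-the-envelope calculation shows how fast label combinations reach the 1 million mark. The label names, value counts, and the 30-second scrape interval below are made-up assumptions, and the product is a worst-case upper bound since not every combination occurs in practice.

```python
from math import prod

SECONDS_PER_30_DAYS = 30 * 24 * 3600


def series_count(label_value_counts):
    """Upper bound on time series: product of distinct values per label."""
    return prod(label_value_counts.values())


def samples_per_month(series, scrape_interval_seconds=30):
    """Samples ingested over 30 days at a fixed scrape interval."""
    return series * (SECONDS_PER_30_DAYS // scrape_interval_seconds)


# Hypothetical label cardinalities for a single metric.
labels = {"service": 50, "endpoint": 20, "status_code": 5, "pod": 200}
series = series_count(labels)
monthly = samples_per_month(series)
print(f"{series:,} series -> {monthly:,} samples per 30 days")
```

Dropping the `pod` label alone cuts this example from 1,000,000 series to 5,000, which is why relabeling and aggregation before remote write tend to be the first cost lever.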

Building AIOps Dashboards with AMG

Multiple data sources: CloudWatch, AMP, X-Ray, Athena
Alerting: Grafana Alerting → SNS → PagerDuty/Slack
ML Plugin: Prophet-based metric forecast visualization

Recommended Dashboard Layout

Overview: service health map, key SLIs/SLOs
Deep Dive: detailed per-service metrics (p99, error rate)
AIOps: Anomaly Detection bands, DevOps Guru insights
Cost: resource usage vs. cost trends

Anomaly-Detection PromQL Examples

# SLO burn rate (1-hour window); 0.001 is the error budget, i.e. a 99.9% SLO
(
  sum(rate(http_requests_total{code=~"5.."}[1h]))
  / sum(rate(http_requests_total[1h]))
) / 0.001 > 1

# P99 latency deviation: 5-minute P99 more than 2x the 1-hour P99
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m]))
  by (le, service)
) > 2 * histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[1h]))
  by (le, service)
)

# CPU throttling ratio
sum(rate(container_cpu_cfs_throttled_periods_total[5m]))
by (pod) /
sum(rate(container_cpu_cfs_periods_total[5m]))
by (pod) > 0.25
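
The burn-rate expression in the first query is just a ratio of ratios, which the sketch below spells out. The `burn_rate` function and the request counts are illustrative; the 0.001 error budget matches the constant used in the PromQL above.

```python
def burn_rate(error_count, request_count, error_budget=0.001):
    """Observed error ratio divided by the allowed error ratio.

    1.0 means the budget is being spent exactly at the sustainable pace;
    values above 1.0 mean it will run out before the SLO window ends.
    """
    if request_count == 0:
        return 0.0  # no traffic, nothing burned
    return (error_count / request_count) / error_budget


# Hypothetical hour: 30 failed requests out of 10,000.
rate = burn_rate(30, 10_000)
print(f"burn rate: {rate:.2f}, alert: {rate > 1}")
```

A single 1-hour window with a threshold of 1 is sensitive. Pairing a short and a long window with higher thresholds is a common way to reduce noise.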

AIOps Data Pipeline: Real-Time Flow

Block 1 — Key Takeaways

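Key Points

AIOps = automation across four stages: Observe + Detect + Diagnose + Respond
Three Pillars: unified collection of Metrics, Logs, and Traces is a prerequisite
ADOT: vendor-neutral unified collector, the standard for EKS environments
CloudWatch Anomaly Detection: an alternative to static thresholds, using an ML-based band model
AMP + AMG: an open-source-compatible managed stack for analysis and visualization

AnyCompany Checkpoints

[ ] Is the CloudWatch Agent deployed to all workloads?
[ ] Is there an ADOT or existing Prometheus collection pipeline?
[ ] Is Cross-Account Observability configured?
[ ] Are SLIs/SLOs defined quantitatively and tracked on dashboards?
[ ] Are a log retention policy and an archiving strategy in place?

Coming Up Next

Block 2 covers the ML services that put this data to work: DevOps Guru, CodeGuru, Lookout for Metrics, and Gen AI-driven operations.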

Thank You

Thanks for your hard work!