Observability Platform
Full-stack observability for cloud-native workloads: metrics, logs and alerts configured before the first deployment reaches production.
Overview
This architecture combines Prometheus for metrics collection, Grafana for visualization, Fluent Bit for log forwarding and Alertmanager for notifications. CloudWatch Container Insights provides node-level metrics from EKS with no instrumentation beyond the CloudWatch agent.
The rule: observability is configured before the first deployment, not after the first incident. Prometheus is deployed via the kube-prometheus-stack Helm chart. Alert rules cover CPU, memory and disk usage, HTTP error rates, pod restarts and node readiness. On-call channels receive structured notifications with runbook links.
Architecture Diagram
Observability Platform — Data Flow
Components
Prometheus + Alertmanager
Deployed via the kube-prometheus-stack Helm chart. Scrapes every pod that exposes a /metrics endpoint at 30s intervals and retains 15 days of time-series data. Alertmanager routes firing alerts to Slack and email, each with a runbook link.
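The settings above could be captured as a values override for the chart. The following is a minimal sketch, not the configuration actually used: the receiver name `on-call`, the webhook URL, the channel, the email address and the `runbook_url` annotation are all placeholders or assumptions, and the SMTP settings that email delivery needs are left out. The output is JSON, which is a subset of YAML, so it should be usable as a Helm values file.

```python
import json


def build_values(slack_webhook_url: str, slack_channel: str, email_to: str) -> dict:
    """Return a values override for the kube-prometheus-stack chart (sketch)."""
    # Alertmanager notification template. Assumes every alert rule sets
    # `summary` and `runbook_url` annotations (an assumption, not from the source).
    slack_text = (
        "{{ range .Alerts }}{{ .Annotations.summary }}\n"
        "Runbook: {{ .Annotations.runbook_url }}\n{{ end }}"
    )
    return {
        "prometheus": {
            "prometheusSpec": {
                "scrapeInterval": "30s",  # scrape every 30 seconds
                "retention": "15d",       # keep 15 days of time-series data
            }
        },
        "alertmanager": {
            "config": {
                # "on-call" is a hypothetical receiver name.
                "route": {"receiver": "on-call", "group_by": ["alertname", "namespace"]},
                "receivers": [
                    # Kept because the chart's default Watchdog route refers to
                    # a receiver called "null"; check this against your chart version.
                    {"name": "null"},
                    {
                        "name": "on-call",
                        "slack_configs": [
                            {
                                "api_url": slack_webhook_url,
                                "channel": slack_channel,
                                "text": slack_text,
                                "send_resolved": True,
                            }
                        ],
                        # Email also needs global SMTP settings, omitted here.
                        "email_configs": [{"to": email_to}],
                    },
                ],
            }
        },
    }


if __name__ == "__main__":
    values = build_values(
        "https://hooks.slack.example/REPLACE_ME",  # placeholder URL
        "#on-call",
        "oncall@example.com",
    )
    print(json.dumps(values, indent=2))
```

Redirecting the printed output to a file and passing it to `helm upgrade --install` with `-f` is one way to apply it. In practice the webhook URL would come from a secret rather than a values file.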
Grafana
Persistent dashboards for node, pod and application metrics. Pre-built dashboards from grafana.com cover common stacks (nginx, postgres, kafka). Grafana Alerting serves as a backup notification path.
Fluent Bit DaemonSet
Runs on every EKS node. Reads container stdout/stderr, parses structured JSON logs and enriches each record with pod metadata (namespace, pod name, container name). Forwards to CloudWatch Logs grouped by service.
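To make the parse-and-enrich step concrete, here is a small Python model of what happens to one log line. It is an illustration of the data flow only, not Fluent Bit configuration. The `kubernetes` field names follow the ones Fluent Bit's Kubernetes filter adds; the log group naming scheme, the cluster name `demo-cluster` and the choice to treat the container name as the service name are assumptions.

```python
import json


def enrich(raw_line: str, namespace: str, pod: str, container: str) -> dict:
    """Parse one container log line and attach pod metadata."""
    try:
        record = json.loads(raw_line)
        # A bare JSON scalar or list is not a structured record; keep it as text.
        if not isinstance(record, dict):
            record = {"log": raw_line}
    except json.JSONDecodeError:
        # Unstructured output is kept verbatim under the "log" key.
        record = {"log": raw_line}
    record["kubernetes"] = {
        "namespace_name": namespace,
        "pod_name": pod,
        "container_name": container,
    }
    return record


def log_group(record: dict, cluster: str = "demo-cluster") -> str:
    """Pick a CloudWatch log group per service (hypothetical naming scheme)."""
    meta = record["kubernetes"]
    # Assumption: the container name stands in for the service name.
    return f"/eks/{cluster}/{meta['namespace_name']}/{meta['container_name']}"


if __name__ == "__main__":
    rec = enrich('{"level": "error", "msg": "payment failed"}',
                 "shop", "checkout-7d9f", "checkout")
    print(log_group(rec), json.dumps(rec))
```

Keeping unparseable lines under a `log` key means plain-text output from third-party containers still reaches CloudWatch with the same metadata attached.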
CloudWatch Container Insights
Node-level CPU, memory, disk and network metrics with no agent beyond the CloudWatch agent. Integrated with CloudWatch Alarms: alarms fire at CPU >80% and memory >85%, each over a 5-minute evaluation period.
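The two alarms could be expressed as arguments to CloudWatch's `put_metric_alarm` call. This is a sketch under assumptions: the alarm names and the cluster name are placeholders, the alarms use the cluster-wide average rather than per-node values, and "5-minute evaluation period" is read as one 300-second period. `boto3` is a third-party dependency and is only imported when the alarms are actually created.

```python
def alarm_kwargs(cluster: str, metric: str, threshold: float) -> dict:
    """Build put_metric_alarm arguments for one Container Insights metric."""
    return {
        "AlarmName": f"{cluster}-{metric}-high",  # hypothetical naming scheme
        "Namespace": "ContainerInsights",
        "MetricName": metric,
        # Cluster-level dimension: averages the metric across all nodes.
        "Dimensions": [{"Name": "ClusterName", "Value": cluster}],
        "Statistic": "Average",
        "Period": 300,            # one 5-minute period (assumed reading)
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def build_alarms(cluster: str) -> list:
    """CPU >80% and memory >85%, as listed above."""
    return [
        alarm_kwargs(cluster, "node_cpu_utilization", 80.0),
        alarm_kwargs(cluster, "node_memory_utilization", 85.0),
    ]


def create_alarms(cluster: str) -> None:
    """Create the alarms in AWS; needs boto3 and valid credentials."""
    import boto3  # imported lazily so the builders work without it

    cloudwatch = boto3.client("cloudwatch")
    for kwargs in build_alarms(cluster):
        cloudwatch.put_metric_alarm(**kwargs)


if __name__ == "__main__":
    for alarm in build_alarms("demo-cluster"):  # placeholder cluster name
        print(alarm["AlarmName"], alarm["Threshold"])
```

Adding an `AlarmActions` entry with an SNS topic ARN would route these alarms to the same on-call channels; that wiring is omitted because the source does not describe it.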
Alert Rules
| Alert | Threshold | Duration | Severity |
|---|---|---|---|
| High CPU Usage | >80% | 5 min | warning |
| High Memory Usage | >85% | 5 min | warning |
| Pod CrashLoop | restarts > 3 | 15 min | critical |
| High Error Rate | >1% 5xx | 5 min | critical |
| Node Not Ready | any node | 2 min | critical |
| Disk Usage High | >80% | 10 min | warning |
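As an illustration, the table could be translated into a `PrometheusRule` manifest for the Prometheus Operator. Everything beyond the thresholds, durations and severities is an assumption: the PromQL expressions are one plausible reading using node-exporter and kube-state-metrics metrics, `http_requests_total` with a `status` label stands in for whatever the applications actually export, the runbook base URL is a placeholder, and the `release` label must match your Helm release. For the crash-loop rule, the 15 minutes is read as the lookback window rather than a hold time.

```python
import json

RUNBOOK_BASE = "https://runbooks.example.com/"  # placeholder URL

# (alert name, PromQL expression, hold duration, severity)
RULES = [
    ("HighCPUUsage",
     '100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80',
     "5m", "warning"),
    ("HighMemoryUsage",
     "100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 85",
     "5m", "warning"),
    # Assumed reading: more than 3 restarts within a 15-minute window, fire at once.
    ("PodCrashLoop",
     "increase(kube_pod_container_status_restarts_total[15m]) > 3",
     "0m", "critical"),
    # Assumed metric and label names; adjust to the application's own metrics.
    ("HighErrorRate",
     'sum(rate(http_requests_total{status=~"5.."}[5m])) '
     "/ sum(rate(http_requests_total[5m])) > 0.01",
     "5m", "critical"),
    ("NodeNotReady",
     'kube_node_status_condition{condition="Ready",status="true"} == 0',
     "2m", "critical"),
    ("DiskUsageHigh",
     '100 * (1 - node_filesystem_avail_bytes{fstype!="tmpfs"} '
     '/ node_filesystem_size_bytes{fstype!="tmpfs"}) > 80',
     "10m", "warning"),
]


def build_prometheus_rule(release: str = "kube-prometheus-stack") -> dict:
    """Return a PrometheusRule manifest as a dict (sketch)."""
    rules = [
        {
            "alert": name,
            "expr": expr,
            "for": hold,
            "labels": {"severity": severity},
            "annotations": {
                "summary": f"{name} threshold exceeded",
                "runbook_url": f"{RUNBOOK_BASE}{name}",
            },
        }
        for name, expr, hold, severity in RULES
    ]
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        # The release label is what the chart's rule selector usually matches on.
        "metadata": {"name": "platform-alerts", "labels": {"release": release}},
        "spec": {"groups": [{"name": "platform.rules", "rules": rules}]},
    }


if __name__ == "__main__":
    print(json.dumps(build_prometheus_rule(), indent=2))
```

Keeping the rules in one data structure makes it easy to review thresholds alongside the table and to check that every alert carries a runbook link.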