Observability Platform

Full-stack observability for cloud-native workloads: metrics, logs and alerts configured before the first deployment reaches production.

Overview

This architecture combines Prometheus for metrics collection, Grafana for visualization, Fluent Bit for log forwarding and AlertManager for notifications. CloudWatch Container Insights provides node-level metrics from EKS without additional instrumentation.

The guiding rule: observability is configured before the first deployment, not after the first incident. Prometheus is deployed via the kube-prometheus-stack Helm chart. Alert rules cover CPU, memory, error rates, pod restarts and network latency. On-call channels receive structured notifications with runbook links.
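
As a rough illustration of what a "structured notification with a runbook link" could look like, the sketch below builds one as a plain dictionary. The field names follow the common labels/annotations split used for Prometheus alerts, but the exact payload, the build_notification helper and the runbooks.example.com URL are assumptions made for this example, not the platform's actual format.

```python
import json


def build_notification(alert_name: str, severity: str, summary: str, runbook_url: str) -> dict:
    """Assemble a structured alert notification (hypothetical shape)."""
    return {
        # Labels identify the alert and can drive routing.
        "labels": {"alertname": alert_name, "severity": severity},
        # Annotations carry human-readable context for whoever is on call.
        "annotations": {"summary": summary, "runbook_url": runbook_url},
    }


payload = build_notification(
    alert_name="High CPU Usage",
    severity="warning",
    summary="CPU above 80% for 5 minutes",
    runbook_url="https://runbooks.example.com/high-cpu",  # placeholder URL
)
print(json.dumps(payload, indent=2))
```

Keeping the runbook link in a dedicated field, rather than embedding it in free text, lets each channel (Slack, email) render it in its own way.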

Architecture Diagram

Observability Platform — Data Flow

How metrics, logs and alerts flow from application containers to dashboards and on-call channels.

Components

Prometheus + AlertManager

Deployed via the kube-prometheus-stack Helm chart. Scrapes every pod that exposes a /metrics endpoint at 30-second intervals. Stores 15 days of time-series data. AlertManager routes firing alerts to Slack and email with runbook links.
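
To make the scrape interval and retention figures concrete, the short calculation below works out how many samples a single time series accumulates. It uses only the two numbers stated above; the samples_per_series helper is introduced here for illustration.

```python
# Figures from the text: 30-second scrape interval, 15 days of retention.
SCRAPE_INTERVAL_SECONDS = 30
RETENTION_DAYS = 15
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_series(interval_seconds: int, retention_days: int) -> int:
    """Number of samples one time series holds over the retention window."""
    return retention_days * SECONDS_PER_DAY // interval_seconds


per_day = samples_per_series(SCRAPE_INTERVAL_SECONDS, 1)
total = samples_per_series(SCRAPE_INTERVAL_SECONDS, RETENTION_DAYS)
print(per_day, total)  # 2880 43200
```

Each series therefore holds about 43,200 samples at full retention. Multiplying by the number of active series gives a first-order feel for storage growth.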

Grafana

Persistent dashboards for node, pod and application metrics. Pre-built dashboards from grafana.com for common stacks (nginx, postgres, kafka). Grafana Alerting serves as a backup notification path.

Fluent Bit DaemonSet

Runs on every EKS node. Reads container stdout/stderr, parses structured JSON logs and enriches them with pod metadata (namespace, pod name, container name). Forwards to CloudWatch Logs, grouped by service.
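
The parse-and-enrich step can be sketched in a few lines of Python. This is not Fluent Bit's implementation or configuration, only a model of the transformation described above. The enrich and log_group_for helpers, the "kubernetes" field layout and the /eks/services/ log group prefix are all assumptions made for the example.

```python
import json


def enrich(raw_line: str, namespace: str, pod_name: str, container_name: str) -> dict:
    """Parse one container log line and attach pod metadata."""
    try:
        record = json.loads(raw_line)
        if not isinstance(record, dict):
            # Valid JSON but not an object (e.g. a bare number): keep it as text.
            record = {"log": raw_line}
    except json.JSONDecodeError:
        # Unstructured line: keep the raw text instead of dropping it.
        record = {"log": raw_line}
    record["kubernetes"] = {
        "namespace": namespace,
        "pod_name": pod_name,
        "container_name": container_name,
    }
    return record


def log_group_for(service: str) -> str:
    """Map a service name to a log group (hypothetical naming scheme)."""
    return f"/eks/services/{service}"


record = enrich('{"level": "error", "msg": "timeout"}', "payments", "api-7d9f", "api")
print(log_group_for("payments"), record)
```

Falling back to a raw "log" field for non-JSON lines means that services which have not yet adopted structured logging still show up in the same log group.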

CloudWatch Container Insights

Node-level CPU, memory, disk and network metrics with no agent beyond the CloudWatch agent. Integrated with CloudWatch Alarms, which fire at CPU >80% and memory >85% over a 5-minute evaluation period.

Alert Rules

Alert	Threshold	Duration	Severity
High CPU Usage	>80%	5 min	warning
High Memory Usage	>85%	5 min	warning
Pod CrashLoop	>3 restarts	15 min	critical
High Error Rate	>1% 5xx	5 min	critical
Node Not Ready	any node	2 min	critical
Disk Usage High	>80%	10 min	warning
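
The Duration column means a threshold must be breached continuously for that long before the alert fires. A brief spike does not page anyone. The sketch below models that behaviour for the High CPU Usage row. It is a simplified illustration, not Prometheus's actual evaluation logic, and the Rule class and is_firing function are names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    threshold: float
    duration_seconds: int
    severity: str


def is_firing(rule: Rule, samples: list[tuple[int, float]], now: int) -> bool:
    """Return True if the value has stayed above the threshold for the full duration.

    samples is a list of (timestamp_seconds, value) pairs sorted by timestamp.
    """
    breach_start = None
    for timestamp, value in samples:
        if timestamp > now:
            break
        if value > rule.threshold:
            # Remember when the current unbroken breach began.
            if breach_start is None:
                breach_start = timestamp
        else:
            # Any sample at or below the threshold resets the clock.
            breach_start = None
    return breach_start is not None and now - breach_start >= rule.duration_seconds


high_cpu = Rule("High CPU Usage", threshold=80.0, duration_seconds=300, severity="warning")

# One sample every 30 seconds for 5 minutes, all at 90% CPU.
sustained = [(t, 90.0) for t in range(0, 301, 30)]
# Same series, but CPU dips to 70% at t=150, which resets the breach.
dipped = [(t, 70.0 if t == 150 else 90.0) for t in range(0, 301, 30)]

print(is_firing(high_cpu, sustained, now=300))  # True
print(is_firing(high_cpu, dipped, now=300))     # False
```

The same function applies to the other rows by changing the threshold and duration, for example 85 and 300 seconds for High Memory Usage.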

Stack

Prometheus, Grafana, AlertManager, Fluent Bit, CloudWatch, CloudWatch Alarms, SNS, kube-prometheus-stack
