Observability Platform

Full-stack observability for cloud-native workloads: metrics, logs and alerts configured before the first deployment reaches production.

Overview

This architecture combines Prometheus for metrics collection, Grafana for visualization, Fluent Bit for log forwarding and Alertmanager for notifications. CloudWatch Container Insights supplies node-level metrics from EKS without additional application instrumentation.

The rule: observability is configured before the first deployment, not after the first incident. Prometheus is deployed via the kube-prometheus-stack Helm chart. Alert rules cover CPU, memory, error rates, pod restarts and network latency. On-call channels receive structured notifications that include runbook links.
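As a rough illustration of an alert rule that carries a runbook link, the sketch below builds a PrometheusRule manifest (the custom resource that kube-prometheus-stack watches) as a Python dictionary and prints it as JSON, which kubectl also accepts. The rule name, the PromQL expression, the `release` label value and the runbook URL are assumptions made for this example, not values taken from this architecture.

```python
import json


def alert_rule(name, expr, for_duration, severity, runbook_url):
    """Build one alerting rule entry in the shape Prometheus expects."""
    return {
        "alert": name,
        "expr": expr,
        "for": for_duration,
        "labels": {"severity": severity},
        "annotations": {
            "summary": f"{name} on {{{{ $labels.instance }}}}",
            "runbook_url": runbook_url,
        },
    }


# Assumed node-exporter query: CPU busy percentage per instance above 80.
CPU_EXPR = (
    '100 - (avg by (instance) '
    '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80'
)

manifest = {
    "apiVersion": "monitoring.coreos.com/v1",
    "kind": "PrometheusRule",
    "metadata": {
        "name": "platform-alerts",  # hypothetical name
        # Assumed: the chart selects rules by its Helm release label.
        "labels": {"release": "kube-prometheus-stack"},
    },
    "spec": {
        "groups": [
            {
                "name": "platform.rules",  # hypothetical group name
                "rules": [
                    alert_rule(
                        "HighCpuUsage",
                        CPU_EXPR,
                        "5m",
                        "warning",
                        "https://runbooks.example.com/high-cpu",  # placeholder
                    )
                ],
            }
        ]
    },
}

print(json.dumps(manifest, indent=2))
```

Keeping rules in code or version-controlled manifests makes it practical to ship them with the cluster bootstrap, which is what allows alerting to exist before the first workload does.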

Architecture Diagram

Observability Platform — Data Flow

[Diagram: Applications (EKS pods) expose /metrics, stdout logs and health probes. Prometheus (kube-prometheus-stack, 30s scrape interval, 15-day retention) scrapes the pods, fires alerts through AlertManager (CPU >80%, Memory >85%, Error rate >1%) and acts as the datasource for Grafana dashboards (Node, Pod and App views). Fluent Bit (DaemonSet on every node, structured parsing) forwards logs to CloudWatch Container Insights (30-day log retention). CloudWatch Alarms (CPU >80%, Memory >85%) notify through SNS to Email/Slack. Notifications reach Slack #on-call and Email (severity HIGH), with a runbook link in each alert.]
How metrics, logs and alerts flow from application containers to dashboards and on-call channels.

Components

Prometheus + Alertmanager

Deployed via the kube-prometheus-stack Helm chart. Prometheus scrapes every pod that exposes a /metrics endpoint at 30-second intervals and stores 15 days of time-series data. Alertmanager routes firing alerts to Slack and email, each with a runbook link.
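For context on what a scrape actually reads, the sketch below renders a counter in the Prometheus text exposition format, which is what a /metrics endpoint returns. The metric name, help text and label values are made up for the example, and label values are not escaped here; a real service would normally rely on an official client library rather than hand-rolled formatting.

```python
def render_counter(name, help_text, samples):
    """Render a counter in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    Label values are assumed not to need escaping in this sketch.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(
                f'{key}="{val}"' for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical request counter split by status code.
body = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [({"code": "200"}, 1027), ({"code": "500"}, 3)],
)
print(body, end="")
```

A counter split by status code like this one is the kind of series an error-rate alert can be computed from, since the 5xx share is the ratio of two request rates.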

Grafana

Persistent dashboards for node, pod and application metrics, backed by the Prometheus datasource. Pre-built dashboards from grafana.com cover common stacks (nginx, postgres, kafka). Grafana Alerting serves as a backup alerting path.

Fluent Bit DaemonSet

Runs on every EKS node. Reads container stdout/stderr, parses structured JSON logs and enriches each record with pod metadata (namespace, pod name, container name). Forwards the records to CloudWatch Logs, grouped by service.
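The parse, enrich and group steps can be pictured with a small Python sketch. It is not Fluent Bit configuration; it only mimics the record shape. The `kubernetes` key with `namespace_name`, `pod_name` and `container_name` follows the naming used by Fluent Bit's Kubernetes filter, while the `service` field, the fallback to the container name, the cluster name and the log group pattern are assumptions for illustration.

```python
import json


def enrich(raw_line, namespace, pod_name, container_name):
    """Parse one container log line and attach pod metadata."""
    try:
        record = json.loads(raw_line)
        if not isinstance(record, dict):
            record = {"log": raw_line}
    except json.JSONDecodeError:
        # Non-JSON output is kept verbatim under a single key.
        record = {"log": raw_line}
    record["kubernetes"] = {
        "namespace_name": namespace,
        "pod_name": pod_name,
        "container_name": container_name,
    }
    return record


def log_group_for(record, cluster="demo-cluster"):
    """Pick a log group per service (hypothetical naming scheme)."""
    service = record.get("service") or record["kubernetes"]["container_name"]
    return f"/eks/{cluster}/{service}"


structured = enrich(
    '{"level": "error", "msg": "boom", "service": "checkout"}',
    "shop", "checkout-abc", "app",
)
plain = enrich("not json", "shop", "checkout-abc", "app")
print(log_group_for(structured))
print(log_group_for(plain))
```

Falling back to the raw line when parsing fails means unstructured output still reaches CloudWatch Logs instead of being dropped.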

CloudWatch Container Insights

Node-level CPU, memory, disk and network metrics, with no agent required beyond the CloudWatch agent. Integrated with CloudWatch Alarms: alarms fire at CPU >80% and memory >85% over a 5-minute evaluation period.
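One possible shape for these alarms is sketched below as the keyword arguments for CloudWatch's `put_metric_alarm` call. The function only builds the arguments, so it runs without AWS credentials. Reading "5-minute evaluation period" as one 300-second period is an interpretation, and the alarm naming, cluster name and SNS topic ARN are placeholders.

```python
def container_insights_alarm(cluster_name, metric_name, threshold, sns_topic_arn):
    """Build put_metric_alarm arguments for a Container Insights metric."""
    return {
        "AlarmName": f"{cluster_name}-{metric_name}-high",  # hypothetical naming
        "Namespace": "ContainerInsights",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
        "Statistic": "Average",
        # Assumed reading of the 5-minute window: one 300-second period.
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


TOPIC = "arn:aws:sns:us-east-1:123456789012:on-call"  # placeholder ARN

alarms = [
    container_insights_alarm("demo-cluster", "node_cpu_utilization", 80, TOPIC),
    container_insights_alarm("demo-cluster", "node_memory_utilization", 85, TOPIC),
]

for alarm in alarms:
    print(alarm["AlarmName"], alarm["Threshold"])
```

With boto3 installed and credentials configured, each dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**alarm)`.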

Alert Rules

| Alert | Threshold | Duration | Severity |
|---|---|---|---|
| High CPU Usage | >80% | 5 min | warning |
| High Memory Usage | >85% | 5 min | warning |
| Pod CrashLoop | >3 restarts | 15 min | critical |
| High Error Rate | >1% 5xx | 5 min | critical |
| Node Not Ready | any node | 2 min | critical |
| Disk Usage High | >80% | 10 min | warning |
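The Duration column matters as much as the threshold: a condition has to hold continuously for that long before the alert fires, which filters out short spikes. The sketch below models that behaviour in the spirit of the Prometheus `for` clause (inactive, then pending, then firing). The sample data is invented, and the function is a simplification rather than a copy of how Prometheus evaluates rules.

```python
def alert_state(samples, threshold, for_seconds):
    """Return 'inactive', 'pending' or 'firing' after the last sample.

    samples is a time-ordered list of (timestamp_seconds, value) pairs.
    """
    breach_started = None
    state = "inactive"
    for timestamp, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = timestamp
            held_for = timestamp - breach_started
            state = "firing" if held_for >= for_seconds else "pending"
        else:
            # Any sample back under the threshold resets the timer.
            breach_started = None
            state = "inactive"
    return state


# CPU at 85% sampled every 30 seconds for five minutes.
steady = [(t, 85.0) for t in range(0, 301, 30)]
# Same load with a dip at t=60, so the breach restarts at t=90.
dipped = [(0, 85.0), (30, 85.0), (60, 70.0), (90, 85.0), (300, 85.0)]

print(alert_state(steady, 80.0, 300))
print(alert_state(dipped, 80.0, 300))
```

The dip case shows why a brief recovery delays a page: the breach clock restarts, so the alert stays pending until the condition has held for the full duration again.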

Stack

Prometheus, Grafana, Alertmanager, Fluent Bit, CloudWatch, CloudWatch Alarms, SNS, kube-prometheus-stack
