

Requirements: Logging, Monitoring and Alarming

Logging

Logging is essential for understanding application behavior, troubleshooting issues, and monitoring cluster activity. However, relying solely on node-level logging mechanisms is insufficient because logs may be lost if a container crashes or a node becomes unreachable. Hence, a robust cluster-level logging solution is necessary to store, analyze, and query logs independently of nodes, pods, or containers.

Node-Level Logging

Docker daemon logging configuration example (e.g. in /etc/docker/daemon.json):

{ "log-driver": "json-file", "log-opts": 
    { "max-size": "10m", "max-file": "3", "labels": "production_status", "env": "os,customer" } 
}

Features to Consider:

  • Log Rotation: Rotates log files once they reach a configured size or count limit and removes the oldest files, so container logs cannot fill up the node's disk.
  • Log Format: JSON format is recommended for log messages to facilitate easier processing and querying.

Cluster-Level Logging

Approaches:

  1. Logging Applications: Use DaemonSets to deploy a logging agent on each node; the agent reads the container log directories on the host.
  2. Sidecar Containers: Add a sidecar container to each pod to handle logging.
  3. Direct Logging: Applications send logs directly to a backend, though this is not recommended as it is outside the scope of Kubernetes.

Logging Through Applications (DaemonSets)

  • Easiest approach to implement; deployed as a DaemonSet (a minimal sketch follows this list).
  • No changes needed in the applications being logged.
  • Use PodSecurityPolicies, SecurityContexts, and NetworkPolicies to secure the logging implementation.
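
As a minimal sketch of this approach, the following DaemonSet runs a Fluentd agent on every node and mounts the node's log directory read-only; the image tag, the logging namespace, and the host path are assumptions to adapt to the actual cluster.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-logging-agent
  namespace: logging                       # assumes a dedicated logging namespace
spec:
  selector:
    matchLabels:
      app: fluentd-logging-agent
  template:
    metadata:
      labels:
        app: fluentd-logging-agent
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd:v1.16-1      # illustrative image tag
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log                   # node directory containing container logs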

Logging Through Sidecars

  • Adds an additional container to each pod.
  • More resource-intensive (one extra container per pod), but scales naturally with the workload and gives per-application control over log handling.
  • Two variants (a sketch of the streaming variant follows this list):
    • A streaming sidecar tails the application's log files and writes them to its own stdout and stderr, where the node-level logging machinery picks them up.
    • A logging-agent sidecar reads the application's logs and ships them to the logging backend itself.
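
A minimal sketch of the streaming-sidecar variant, assuming the application writes its log to a file in a shared emptyDir volume; the container names, images, and log path are illustrative.

apiVersion: v1
kind: Pod
metadata:
  name: app-with-logging-sidecar
spec:
  containers:
  - name: app                              # illustrative application writing to a log file
    image: busybox
    command: ["sh", "-c", "while true; do date >> /var/log/app/app.log; sleep 5; done"]
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
  - name: log-streamer                     # sidecar: tails the file and streams it to stdout
    image: busybox
    command: ["sh", "-c", "tail -n +1 -F /var/log/app/app.log"]
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
      readOnly: true
  volumes:
  - name: app-logs
    emptyDir: {}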

Direct Logging to Backend

  • Each application sends logs directly to a logging backend.
  • Not recommended as it falls outside the Kubernetes ecosystem.

Recommended Logging Solution

  • Use the EFK stack (Elasticsearch, Fluentd, Kibana) for logging.
  • Implement log rotation and filtering.
  • Allocate specific nodes for logging and monitoring using taints, tolerations, and node affinities (see the sketch after this list).
  • Restrict the kubectl logs command to administrators, e.g. by granting the pods/log subresource only to admin roles via RBAC.
  • Use JSON format for log messages.
  • Ensure that logs from the VMs hosting Kubernetes components (kubelet, API server, etcd) are collected as well.
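
A sketch of the dedicated-node recommendation, assuming a node named logging-node-1 and a taint/label key of dedicated=logging (both assumptions):

# Taint and label a node so that only logging/monitoring workloads run on it
#   kubectl taint nodes logging-node-1 dedicated=logging:NoSchedule
#   kubectl label nodes logging-node-1 dedicated=logging

# Fragment of a logging pod/DaemonSet spec that tolerates the taint and requires the label
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "logging"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: dedicated
            operator: In
            values: ["logging"]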

Monitoring

Unlike logging, monitoring in Kubernetes has a single recommended approach: deploy the monitoring agent as a DaemonSet so that every node runs a monitoring pod.

Recommended Monitoring Solution

  • Use Prometheus for monitoring and Grafana for dashboards.
  • Separate the monitoring and logging applications into different namespaces (see the sketch after this list).
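
For example, the two stacks can live in dedicated namespaces; the names logging and monitoring are assumptions.

# Sketch: dedicated namespaces for the logging and monitoring stacks
apiVersion: v1
kind: Namespace
metadata:
  name: logging
---
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring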

Monitoring Focus Areas

  • Utilization of the entire cluster.
  • Load on each node.
  • Load on each namespace.
  • Workload of pods, especially control plane pods.
  • Number of pods and containers.
  • Number of running pods and containers.

Prometheus and Grafana Configuration

  • Use Prometheus to collect metrics and Grafana to visualize them.
  • Deploy the Prometheus node exporter as a DaemonSet so that every node exposes metrics (manifest below); Prometheus itself then scrapes these endpoints (a scrape configuration sketch follows the manifest).
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prometheus-node-exporter
spec:
  selector:
    matchLabels:
      app: prometheus-node-exporter
  template:
    metadata:
      labels:
        app: prometheus-node-exporter
    spec:
      containers:
      - name: prometheus-node-exporter
        image: quay.io/prometheus/node-exporter
        ports:
        - containerPort: 9100        # default metrics port of the node exporter
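
A sketch of the matching Prometheus scrape configuration, using Kubernetes pod discovery to find the node-exporter pods; the job name and relabelling rules are assumptions to adapt to the actual deployment.

# Sketch of a prometheus.yml scrape job that discovers node-exporter pods via the Kubernetes API
scrape_configs:
- job_name: node-exporter
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # keep only pods labelled app=prometheus-node-exporter
  - source_labels: [__meta_kubernetes_pod_label_app]
    regex: prometheus-node-exporter
    action: keep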

Alarming

Set up alarming to notify administrators of critical events and anomalies detected through logging and monitoring.

Logging Alarms

  • Trigger alarms on stderr messages initially (see the example rule after this list).
  • Refine alarms over time based on specific error messages and patterns.
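
One way to implement the initial stderr alarm on top of the EFK stack is an ElastAlert rule such as the sketch below; ElastAlert itself, the index pattern, the stream field produced by the log collector, and the recipient address are all assumptions about the concrete setup.

# Sketch: ElastAlert rule that alerts on any stderr log record (all names are assumptions)
name: stderr-messages
type: frequency
index: logstash-*
num_events: 1
timeframe:
  minutes: 5
filter:
- term:
    stream: "stderr"
alert:
- "email"
email:
- "k8s-admins@example.com"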

Monitoring Alarms

  • High cluster utilization.
  • High node utilization.
  • High namespace utilization.
  • High pod utilization, especially control plane pods.
  • Unexpected number of pods or containers (higher or lower than expected).

Prometheus Alerting Rule Example (routed to Alertmanager)

groups:
- name: KubernetesMonitoring
  rules:
  - alert: HighClusterUtilization
    expr: sum(rate(container_cpu_usage_seconds_total[5m])) / sum(machine_cpu_cores) * 100 > 80
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High Cluster CPU Utilization"
      description: "Cluster CPU utilization is above 80% for more than 5 minutes."

Conclusion

A comprehensive logging, monitoring, and alarming strategy is crucial for maintaining the health and security of a Kubernetes cluster. Implementing cluster-level logging with solutions like the EFK stack, monitoring with Prometheus and Grafana, and setting up robust alarms ensures that administrators can quickly detect and respond to issues. Regularly review and update logging and monitoring policies to adapt to evolving requirements and best practices.


Follow these measures to satisfy the logging, monitoring, and alarming requirements described above.